devel / comp.arch / VVM question

VVM question

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: VVM question
Date: Sun, 22 Aug 2021 16:34:18 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sftuaa$but$1@newsreader4.netcologne.de>
Injection-Date: Sun, 22 Aug 2021 16:34:18 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="12253"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 22 Aug 2021 16:34 UTC

Hi,

a question regarding VVM.

Take the following simplified version of Fortran's MAXLOC intrinsic,
which returns the position of the array element with the maximum
value (the first if there are many).

#include <limits.h>   /* for INT_MIN */

int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;
    nm = -1;
    for (i=0; i<n; i++)
    {
        if (a[i] > m)
        {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}
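
For concreteness, a small usage sketch (the values and the main() wrapper
are only illustrative):

#include <stdio.h>

int m2(int * const restrict a, int n);  /* as defined above */

int main(void)
{
    int a[] = { 3, 9, 2, 9, 5 };

    /* the maximum 9 first occurs at index 1, so this prints 1 */
    printf("%d\n", m2(a, 5));
    return 0;
}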

An SIMD version with m lanes would probably determine the maximum
value for each lane separately and, at the end of the loop, return
the smallest index of the largest value, so something like

for (i=0; i<n; i+=n_lanes)
{
    if (a[i] > m[0])
    {
        m[0] = a[i];
        nm[0] = i;
    }
    if (a[i+1] > m[1])
    {
        m[1] = a[i+1];
        nm[1] = i + 1;
    }
    ...
}
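
At the end of the loop, the cross-lane step could then look something like
this sketch (names follow the fragment above; the tail when n is not a
multiple of n_lanes is ignored, and every lane is assumed to have seen at
least one element):

/* Given per-lane maxima m[] and the per-lane index of their first
   occurrence nm[], return the smallest index that holds the overall
   maximum. */
static int reduce_lanes(const int *m, const int *nm, int n_lanes)
{
    int best = m[0], best_i = nm[0];
    int k;

    for (k = 1; k < n_lanes; k++)
    {
        if (m[k] > best || (m[k] == best && nm[k] < best_i))
        {
            best = m[k];
            best_i = nm[k];
        }
    }
    return best_i;
}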

How would VVM handle that? Could it also use a similar parallel
approach, from just translating the scalar code?

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Sun, 22 Aug 2021 17:36:05 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 22
Distribution: world
Message-ID: <2021Aug22.193605@mips.complang.tuwien.ac.at>
References: <sftuaa$but$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="7f8452105a1c218d39a4f1a5c73923fb";
logging-data="5319"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Qf3BoayTsqx0/v4DicBY/"
Cancel-Lock: sha1:o+4c60SGq/pnMIi2+nHQb2aAatc=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 22 Aug 2021 17:36 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Hi,
>
>a question regarding VVM.
>
>Take the following simplified version of Fortran's MAXLOC intrinsic,
>which returns the position of the array element with the maximum
>value (the first if there are many).

This is actually a significant part of the inner loop of Jon Bentley's
Traveling Salesman example which we looked at in the thread that
begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
already5chosen@yahoo.com presented a vectorized version of the loop
over the whole array (rather than a loop that ends as soon as it finds
something closer than before) in
<b2aed821-2b7e-456d-9a6d-c2ea1fdedd55@googlegroups.com>, and I
discussed it in <2016Nov16.155150@mips.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: VVM question

Path: i2pn2.org!i2pn.org!aioe.org!sa6kcu6mSvh5VOr71AVWvw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Sun, 22 Aug 2021 21:41:26 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sfu996$lnn$1@gioia.aioe.org>
References: <sftuaa$but$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="22263"; posting-host="sa6kcu6mSvh5VOr71AVWvw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.8.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Sun, 22 Aug 2021 19:41 UTC

Thomas Koenig wrote:
> Hi,
>
> a question regarding VVM.
>
> Take the following simplified version of Fortran's MAXLOC intrinsic,
> which returns the position of the array element with the maximum
> value (the first if there are many).
>
> int m2(int * const restrict a, int n)
> {
> int m, nm;
> int i;
>
> m = INT_MIN;
> nm = -1;
> for (i=0; i<n; i++)
> {
> if (a[i] > m)
> {
> m = a[i];
> nm = i;
> }
> }
> return nm;
> }
>
> An SIMD version with m lanes would probably determine the maximum
> value for each lane separately and, at the end of the loop, return
> the smallest index of the largest value, so something like
>
> for (i=0; i<n; i+=n_lanes)
> {
> if (a[i] > m[0])
> {
> m[0] = a[i];
> nm[0] = i;
> }
> if (a[i+1] > m[1])
> {
> m[1] = a[i+1];
> nm[1] = i + 1;
> }
> ...
> }
>
> How would VVM handle that? Could it also use a similar parallel
> approach, from just translating the scalar code?

A vector version of that, VVM or SIMD, would probably run better with
predicates/conditional moves so as to remove all internal branching from
the core iteration.

One issue is of course that predicates or CMOVs require some setup
time and have latency limitations, so you might actually need to unroll the
code to use twice as many accumulators. If you do so then you can afford
the parallel compare and the pairs of conditional moves of a new max value.
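
In scalar C the branch-free update could look like the following sketch;
the hope is that a compiler maps the ?: selects onto CMOVs or predicated
moves:

#include <limits.h>

/* Branch-free variant of the loop body: the two ?: selects are the kind
   of thing a compiler can map onto conditional moves or predication. */
int m2_branchless(const int *a, int n)
{
    int m = INT_MIN, nm = -1;
    int i;

    for (i = 0; i < n; i++)
    {
        int gt = a[i] > m;      /* 0 or 1, no branch required */
        m  = gt ? a[i] : m;
        nm = gt ? i : nm;
    }
    return nm;
}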

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: VVM question

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Sun, 22 Aug 2021 20:19:51 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfubh7$lro$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<sfu996$lnn$1@gioia.aioe.org>
Injection-Date: Sun, 22 Aug 2021 20:19:51 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="22392"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 22 Aug 2021 20:19 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:

>> Take the following simplified version of Fortran's MAXLOC intrinsic,
>> which returns the position of the array element with the maximum
>> value (the first if there are many).

[...]

> A vector version of that, VVM or SIMD, would probably run better with
> predicates/conditional moves so as to remove all internal branching from
> the core iteration.
>
> One issue is of course that predicates or CMOVs require some setup
> time and have latency limitations, so you might actually need to unroll the
> code to use twice as many accumulators. If you do so then you can afford
> the parallel compare and the pairs of conditional moves of a new max value.

FYI, there is some AVX2 code (which I did not write) at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121

Re: VVM question

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Sun, 22 Aug 2021 20:22:28 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfubm4$lro$2@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<2021Aug22.193605@mips.complang.tuwien.ac.at>
Injection-Date: Sun, 22 Aug 2021 20:22:28 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="22392"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 22 Aug 2021 20:22 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.

Is there a way to access those?

Google Groups always asks me for a userid, which I do not have.

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Sun, 22 Aug 2021 21:38:15 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 16
Distribution: world
Message-ID: <2021Aug22.233815@mips.complang.tuwien.ac.at>
References: <sftuaa$but$1@newsreader4.netcologne.de> <2021Aug22.193605@mips.complang.tuwien.ac.at> <sfubm4$lro$2@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="7f8452105a1c218d39a4f1a5c73923fb";
logging-data="17452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ignfN5ckfUfqPonrXqinM"
Cancel-Lock: sha1:IA9GRtCq9KVXkoSVLPeW4adQG5Y=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 22 Aug 2021 21:38 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
>
>Is there a way to access those?

http://al.howardknight.net/

Bookmark it immediately!

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: VVM question

X-Received: by 2002:a37:2f47:: with SMTP id v68mr19235928qkh.190.1629681065175;
Sun, 22 Aug 2021 18:11:05 -0700 (PDT)
X-Received: by 2002:a9d:7396:: with SMTP id j22mr26430933otk.206.1629681064955;
Sun, 22 Aug 2021 18:11:04 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 22 Aug 2021 18:11:04 -0700 (PDT)
In-Reply-To: <sftuaa$but$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 23 Aug 2021 01:11:05 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 23 Aug 2021 01:11 UTC

On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
> Hi,
>
> a question regarding VVM.
>
> Take the following simplified version of Fortran's MAXLOC intrinsic,
> which returns the position of the array element with the maximum
> value (the first if there are many).
>
> int m2(int * const restrict a, int n)
> {
> int m, nm;
> int i;
>
> m = INT_MIN;
> nm = -1;
> for (i=0; i<n; i++)
> {
> if (a[i] > m)
> {
> m = a[i];
> nm = i;
> }
> }
> return nm;
> }
<
GLOBAL m2
ENTRY m2
m2:
MOV R3,#0x7FFFFFFFFFFFFFFF
MOV R4,#-1
MOV R5,#0
top:
VEC R8,{R3,R4}
LDW R6,[R1+R5<<2]
CMP R7,R6,R3
PGT R7,{2,TT}
MOV R3,R6 // Be careful on this assignment
MOV R4,R5 // Be careful on this assignment
LOOP LT,R5,#1,R2
MOV R1,R3
RET
>
> An SIMD version with m lanes would probably determine the maximum
> value for each lane separately and, at the end of the loop, return
> the smallest index of the largest value, so something like
>
> for (i=0; i<n; i+=n_lanes)
> {
> if (a[i] > m[0])
> {
> m[0] = a[i];
> nm[0] = i;
> }
> if (a[i+1] > m[1])
> {
> m[1] = a[i+1];
> nm[1] = i + 1;
> }
> ...
> }
>
> How would VVM handle that? Could it also use a similar parallel
> approach, from just translating the scalar code?
<
The VEC instruction tells each loop to watch for modifications to R3 and R4
and obey loop carried dependencies.

Re: VVM question

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 05:42:59 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfvch3$bok$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<2021Aug22.193605@mips.complang.tuwien.ac.at>
<sfubm4$lro$2@newsreader4.netcologne.de>
<2021Aug22.233815@mips.complang.tuwien.ac.at>
Injection-Date: Mon, 23 Aug 2021 05:42:59 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="12052"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 05:42 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>> This is actually a significant part of the inner loop of Jon Bentley's
>>> Traveling Salesman example which we looked at in the thread that
>>> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
>>
>>Is there a way to access those?
>
> http://al.howardknight.net/

Thanks, very good link!

It does not do threading, unfortunately.

> Bookmark it immediately!

Done.

Re: VVM question

Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 05:44:43 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfvckb$bok$2@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
Injection-Date: Mon, 23 Aug 2021 05:44:43 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="12052"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 05:44 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
[...]

>> int m2(int * const restrict a, int n)
>> {
>> int m, nm;
>> int i;
>>
>> m = INT_MIN;
>> nm = -1;
>> for (i=0; i<n; i++)
>> {
>> if (a[i] > m)
>> {
>> m = a[i];
>> nm = i;
>> }
>> }
>> return nm;
>> }
><
> GLOBAL m2
> ENTRY m2
> m2:
> MOV R3,#0x7FFFFFFFFFFFFFFF
> MOV R4,#-1
> MOV R5,#0
> top:
> VEC R8,{R3,R4}
> LDW R6,[R1+R5<<2]
> CMP R7,R6,R3
> PGT R7,{2,TT}
> MOV R3,R6 // Be careful on this assignment
> MOV R4,R5 // Be careful on this assignment
> LOOP LT,R5,#1,R2
> MOV R1,R3
> RET
>>
>> An SIMD version with m lanes would probably determine the maximum
>> value for each lane separately and, at the end of the loop, return
>> the smallest index of the largest value, so something like
>>
>> for (i=0; i<n; i+=n_lanes)
>> {
>> if (a[i] > m[0])
>> {
>> m[0] = a[i];
>> nm[0] = i;
>> }
>> if (a[i+1] > m[1])
>> {
>> m[1] = a[i+1];
>> nm[1] = i + 1;
>> }
>> ...
>> }
>>
>> How would VVM handle that? Could it also use a similar parallel
>> approach, from just translating the scalar code?
><
> The VEC instruction tells each loop to watch for modifications to R3 and R4
> and obey loop carried dependencies.

I'm afraid that does not answer my question, at least I do not
understand it this way.

Will it run several iterations in parallel without source code
modification, or not?

Re: VVM question

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 05:54:57 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfvd7h$btt$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<2021Aug22.193605@mips.complang.tuwien.ac.at>
Injection-Date: Mon, 23 Aug 2021 05:54:57 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="12221"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 05:54 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>>Hi,
>>
>>a question regarding VVM.
>>
>>Take the following simplified version of Fortran's MAXLOC intrinsic,
>>which returns the position of the array element with the maximum
>>value (the first if there are many).
>
> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
> already5chosen@yahoo.com presented a vectorized version of the loop
> over the whole array (rather than a loop that ends as soon as it finds
> something closer than before) in
><b2aed821-2b7e-456d-9a6d-c2ea1fdedd55@googlegroups.com>, and I
> discussed it in <2016Nov16.155150@mips.complang.tuwien.ac.at>.

From what I read in your article, the effect of the code seems
so-so and depends on the architecture.

Would it be possible to give the code at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740) a spin
to see if it does better?  It worked well on Zen 1 despite that
architecture only "faking" AVX2 with 128-bit registers.

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 10:33:34 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 295
Distribution: world
Message-ID: <2021Aug23.123334@mips.complang.tuwien.ac.at>
References: <sftuaa$but$1@newsreader4.netcologne.de> <2021Aug22.193605@mips.complang.tuwien.ac.at> <sfvd7h$btt$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="9650f806d0ce2f552fb764fd1a77c29f";
logging-data="22478"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19TCpjKBRsDpe6v2pEDnnPa"
Cancel-Lock: sha1:aHzPsQVSiioyn/IanUVgroqgxBQ=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Mon, 23 Aug 2021 10:33 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
>> already5chosen@yahoo.com presented a vectorized version of the loop
>> over the whole array (rather than a loop that ends as soon as it finds
>> something closer than before) in
>><b2aed821-2b7e-456d-9a6d-c2ea1fdedd55@googlegroups.com>, and I
>> discussed it in <2016Nov16.155150@mips.complang.tuwien.ac.at>.
>
>From what I read in your article, the effect of the code seems
>so-so and depend on the architecture.

It also depends on the array size (between 1 and 10000 in my case).
Assuming the crossover point is, say, 2000, it's probably best to use
AVX "branchless" for the first 2000 elements, and then continue with
AVX hard. I wanted to look into "branchless" some more, but, as
usual, other things needed my attention and so I did not pursue it
further.

>Would it be possible to give the code at
>https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
>part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 a spin
>to see if it does better? It worked well on Zen 1 despite that
>architecture only "faking" AVX2 with 128-bit registers.

I could not compile it on Debian 11 ("relocation R_X86_64_32S against
`.data' can not be used when making a PIE object; recompile with
-fPIE"; this means that the assembly code contains an absolute address
and should be replaced with a rip-relative address), so I compiled it
on Debian 8 (gcc-4.9.2).

Below is what I see. What does it mean?

On Skylake:

# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.250980 0.198142 0.328205 0.227758
256 0.378698 0.351648 1.000000 0.479401
512 0.498054 0.453901 0.486692 0.609524
1024 0.533889 0.509453 0.499025 0.727273
2048 0.549946 0.558952 0.515869 0.768769
4096 0.560022 0.562174 0.465243 0.821830
8192 0.562560 0.563179 0.616496 0.836260
16384 0.568376 0.566568 0.840464 1.221957
32768 0.569482 0.568612 0.998598 1.522960
65536 0.569640 0.569839 1.227496 2.413316
131072 0.569141 0.570295 1.334039 1.866857
262144 0.570032 0.568262 1.389593 1.929232
524288 0.569357 0.566879 1.508152 1.673972
1048576 0.561443 0.555999 1.533845 1.503037
2097152 0.560509 0.560691 1.458560 1.509459
4194304 0.559187 0.560557 1.456157 1.503564
8388608 0.561024 0.560462 1.494831 1.514211
16777216 0.560297 0.559024 1.496209 1.510765
33554432 0.559756 0.560659 1.501258 1.512948
67108864 0.559765 0.560249 1.507910 1.512386
134217728 0.560098 0.560409 1.506587 1.515123
268435456 0.560284 0.560472 1.509522 1.516031
536870912 0.559883 0.560436 1.508366 1.516430

536870912 0.560183 0.560181 1.509494 1.516893
268435456 0.560113 0.560441 1.507528 1.516041
134217728 0.559948 0.560224 1.509935 1.519144
67108864 0.561124 0.561204 1.505807 1.519437
33554432 0.560492 0.559996 1.518871 1.521890
16777216 0.561216 0.560984 1.501925 1.512587
8388608 0.560717 0.560970 1.481175 1.511185
4194304 0.559531 0.560643 1.456585 1.486993
2097152 0.558511 0.561203 1.401787 1.453253
1048576 0.562318 0.558435 1.330910 1.345867
524288 0.570140 0.567899 1.630715 1.883340
262144 0.570461 0.569180 2.328472 2.177891
131072 0.570325 0.570285 2.357071 2.205857
65536 0.570335 0.569650 1.842244 1.968639
32768 0.569601 0.569621 1.501879 1.559045
16384 0.567431 0.566254 1.169951 1.247259
8192 0.563566 0.562097 0.786936 0.833876
4096 0.564187 0.553214 0.518219 0.807253
2048 0.552617 0.553813 0.516911 0.791957
1024 0.540084 0.522983 0.510469 0.725212
512 0.509960 0.454707 0.490421 0.621359
256 0.477612 0.391437 0.468864 0.481203
128 0.450704 0.421053 0.400000 0.278261

On Zen 3:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.198142 0.177285 0.673684 0.374269
256 0.374269 0.336842 1.122807 0.962406
512 0.561404 0.396285 1.924812 1.347368
1024 0.585812 0.402200 3.368421 2.694737
2048 0.612440 0.411410 2.836565 3.592982
4096 0.612440 0.417789 2.629012 4.145749
8192 0.626683 0.419414 2.661468 5.013464
16384 0.629428 0.420232 2.728847 5.258023
32768 0.629887 0.420437 2.703184 5.226156
65536 0.630809 0.420642 3.193762 5.372684
131072 0.631040 0.422911 3.101855 5.331164
262144 0.629772 0.420719 3.108845 5.360160
524288 0.630751 0.420783 2.422235 5.402135
1048576 0.631184 0.420764 2.097454 5.460935
2097152 0.631162 0.422665 1.937176 5.322937
4194304 0.630690 0.421456 1.999682 4.177444
8388608 0.627358 0.420154 2.007738 3.061207
16777216 0.618820 0.418670 2.549933 2.558949
33554432 0.621342 0.418237 2.360456 2.400368
67108864 0.623856 0.418145 2.394890 2.419667
134217728 0.625304 0.417954 2.421189 2.449932
268435456 0.626265 0.417947 2.452580 2.475416
536870912 0.626237 0.417929 2.441393 2.459276

536870912 0.626340 0.417924 2.446799 2.465893
268435456 0.626351 0.417885 2.438281 2.455598
134217728 0.626210 0.417981 2.430257 2.454633
67108864 0.621872 0.418221 2.435699 2.456569
33554432 0.615973 0.418177 2.423352 2.464991
16777216 0.615432 0.418174 2.377893 2.401301
8388608 0.616635 0.418836 2.178726 2.195453
4194304 0.630528 0.420659 2.144731 2.754934
2097152 0.631170 0.420805 4.347582 5.448535
1048576 0.629830 0.420802 4.644690 5.443698
524288 0.631126 0.420757 4.547479 5.419109
262144 0.630809 0.420796 4.442065 5.410609
131072 0.634289 0.420539 4.382799 5.314735
65536 0.630578 0.420744 4.422132 5.475021
32768 0.630809 0.420027 4.311579 5.322937
16384 0.632196 0.419823 4.311579 5.194673
8192 0.628510 0.419414 4.145749 4.899522
4096 0.623061 0.417789 3.992203 4.145749
2048 0.612440 0.411410 3.849624 3.170279
1024 0.585812 0.408293 2.994152 2.245614
512 0.561404 0.374269 2.245614 1.347368
256 0.481203 0.354571 1.347368 0.748538
128 0.421053 0.336842 0.842105 0.421053

On Zen 2:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.224561 0.210526 0.561404 0.481203
256 0.449123 0.320802 1.347368 1.122807
512 0.561404 0.384962 2.245614 1.347368
1024 0.573348 0.390542 2.994152 3.368421
2048 0.579513 0.396285 2.245614 0.769925
4096 0.602176 0.399220 1.996101 3.368421
8192 0.597172 0.399961 2.092999 4.311579
16384 0.557052 0.407522 1.774312 4.311579
32768 0.602597 0.402575 2.103209 4.311579
65536 0.611571 0.401077 1.783487 4.175863
131072 0.601651 0.407859 1.825007 4.508841
262144 0.608228 0.405629 1.760277 4.453535
524288 0.608362 0.405533 1.707133 4.460735
1048576 0.603678 0.402904 1.525548 4.465066
2097152 0.601913 0.401237 1.524494 4.422487
4194304 0.552595 0.388290 1.598199 1.880156
8388608 0.502226 0.367968 1.548654 1.485521
16777216 0.518826 0.359816 1.446895 1.508812
33554432 0.533121 0.365163 1.498006 1.497000
67108864 0.497297 0.364485 1.489156 1.489815
134217728 0.503081 0.363182 1.485826 1.497298
268435456 0.502331 0.362487 1.476280 1.483783
536870912 0.497604 0.363058 1.471117 1.486186


[...]
Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 07:55:41 -0700
Organization: A noiseless patient Spider
Lines: 82
Message-ID: <sg0ctf$qoj$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 23 Aug 2021 14:55:43 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="607e866bd76e72d2fba5e11a8a40d10b";
logging-data="27411"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18zWniHAu0eBoEx8gAuTSuUxMlLwXkVs8Y="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:SWSSFjsqqj0U0AXnvYnp4AnqFkw=
In-Reply-To: <sfvckb$bok$2@newsreader4.netcologne.de>
Content-Language: en-US
 by: Stephen Fuld - Mon, 23 Aug 2021 14:55 UTC

On 8/22/2021 10:44 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>> On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
> [...]
>
>>> int m2(int * const restrict a, int n)
>>> {
>>> int m, nm;
>>> int i;
>>>
>>> m = INT_MIN;
>>> nm = -1;
>>> for (i=0; i<n; i++)
>>> {
>>> if (a[i] > m)
>>> {
>>> m = a[i];
>>> nm = i;
>>> }
>>> }
>>> return nm;
>>> }
>> <
>> GLOBAL m2
>> ENTRY m2
>> m2:
>> MOV R3,#0x7FFFFFFFFFFFFFFF
>> MOV R4,#-1
>> MOV R5,#0
>> top:
>> VEC R8,{R3,R4}
>> LDW R6,[R1+R5<<2]
>> CMP R7,R6,R3
>> PGT R7,{2,TT}
>> MOV R3,R6 // Be careful on this assignment
>> MOV R4,R5 // Be careful on this assignment
>> LOOP LT,R5,#1,R2
>> MOV R1,R3
>> RET
>>>
>>> An SIMD version with m lanes would probably determine the maximum
>>> value for each lane separately and, at the end of the loop, return
>>> the smallest index of the largest value, so something like
>>>
>>> for (i=0; i<n; i+=n_lanes)
>>> {
>>> if (a[i] > m[0])
>>> {
>>> m[0] = a[i];
>>> nm[0] = i;
>>> }
>>> if (a[i+1] > m[1])
>>> {
>>> m[1] = a[i+1];
>>> nm[1] = i + 1;
>>> }
>>> ...
>>> }
>>>
>>> How would VVM handle that? Could it also use a similar parallel
>>> approach, from just translating the scalar code?
>> <
>> The VEC instruction tells each loop to watch for modifications to R3 and R4
>> and obey loop carried dependencies.
>
> I'm afraid that does not answer my question, at least I do not
> understand it this way.
>
> Will it run several iterations in parallel without source code
> modification, or not?

This hearkens back to the thread we had some months ago on reductions in
VVM. I think the answer is "mostly not". I say this because the full
cache line load streaming capability is sort of doing multiple loads in
parallel, but the compare part of the loop will not use multiple
ALUs in parallel, even if they are available.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

X-Received: by 2002:a37:9244:: with SMTP id u65mr22130985qkd.46.1629733492326;
Mon, 23 Aug 2021 08:44:52 -0700 (PDT)
X-Received: by 2002:a05:6830:3482:: with SMTP id c2mr6982962otu.16.1629733492080;
Mon, 23 Aug 2021 08:44:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 23 Aug 2021 08:44:51 -0700 (PDT)
In-Reply-To: <sfvckb$bok$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 23 Aug 2021 15:44:52 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 23 Aug 2021 15:44 UTC

On Monday, August 23, 2021 at 12:44:45 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
> [...]
> >> int m2(int * const restrict a, int n)
> >> {
> >> int m, nm;
> >> int i;
> >>
> >> m = INT_MIN;
> >> nm = -1;
> >> for (i=0; i<n; i++)
> >> {
> >> if (a[i] > m)
> >> {
> >> m = a[i];
> >> nm = i;
> >> }
> >> }
> >> return nm;
> >> }
> ><
> > GLOBAL m2
> > ENTRY m2
> > m2:
> > MOV R3,#0x7FFFFFFFFFFFFFFF
> > MOV R4,#-1
> > MOV R5,#0
> > top:
> > VEC R8,{R3,R4}
> > LDW R6,[R1+R5<<2]
> > CMP R7,R6,R3
> > PGT R7,{2,TT}
> > MOV R3,R6 // Be careful on this assignment
> > MOV R4,R5 // Be careful on this assignment
> > LOOP LT,R5,#1,R2
> > MOV R1,R3
> > RET
> >>
> >> An SIMD version with m lanes would probably determine the maximum
> >> value for each lane separately and, at the end of the loop, return
> >> the smallest index of the largest value, so something like
> >>
> >> for (i=0; i<n; i+=n_lanes)
> >> {
> >> if (a[i] > m[0])
> >> {
> >> m[0] = a[i];
> >> nm[0] = i;
> >> }
> >> if (a[i+1] > m[1])
> >> {
> >> m[1] = a[i+1];
> >> nm[1] = i + 1;
> >> }
> >> ...
> >> }
> >>
> >> How would VVM handle that? Could it also use a similar parallel
> >> approach, from just translating the scalar code?
> ><
> > The VEC instruction tells each loop to watch for modifications to R3 and R4
> > and obey loop carried dependencies.
> I'm afraid that does not answer my question, at least I do not
> understand it this way.
>
> Will it run several iterations in parallel without source code
> modification, or not?
<
Yes, iterations will run in parallel on multiple lanes.
However, any lane that writes to R3 or R4 will cause a serial dependency
at LOOP and will be backed up, much like branch repair, and played out again.
<
So, let us postulate that we have 4 lanes, and the loop is zipping through
iterations lickety-split, and Lane[k] in iteration[j] performs a write to R3 and R4.
Lanes[k+1] and any future iterations[j+1] are cancelled, and the next iteration
begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
run as expected.
<
In effect, the loop runs as expected, but this kind of dependency causes
a "blip" in execution width.

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 15:24:06 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 62
Message-ID: <2021Aug23.172406@mips.complang.tuwien.ac.at>
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com> <sfvckb$bok$2@newsreader4.netcologne.de> <sg0ctf$qoj$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="9650f806d0ce2f552fb764fd1a77c29f";
logging-data="19568"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18SS0ojqn8MQA0dVQEu3fLQ"
Cancel-Lock: sha1:l9vdTGLeBDC2nW/bVCNE3mmX2Ng=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Mon, 23 Aug 2021 15:24 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 8/22/2021 10:44 PM, Thomas Koenig wrote:
|int m2(int * const restrict a, int n)
|{
|  int m, nm;
|  int i;
|
|  m = INT_MIN;
|  nm = -1;
|  for (i=0; i<n; i++)
|    {
|      if (a[i] > m)
|        {
|          m = a[i];
|          nm = i;
|        }
|    }
|  return nm;
|}
[...]
>> Will it run several iterations in parallel without source code
>> modification, or not?
>
>This hearkens back to the thread we had some months ago on reductions in
>VVM. I think the answer is "mostly not". I say this because the full
>cache line load streaming capability is sort of doing multiple loads in
>parallel, but the the compare part of the loop will not use multiple
>ALUs in parallel, even if they are available.

Why not?  Consider the following equivalent code:

int m2(int * const restrict a, int n)
{
  int m, nm;
  int i;

  m = INT_MIN;
  nm = -1;
  i=0;
  while (i<n) {
    while (i<n && a[i]<=m)
      i++;
    if (i<n && a[i] > m) {  /* the inner loop may have stopped at i==n */
      m = a[i];
      nm = i;
    }
    i++;
  }
  return nm;
}

Now look at the inner loop.  It is so easy to vectorize that even VVM
may be able to do it (maybe even auto-vectorizing compilers).  Of
course, at first it will have very short trip counts, but they
increase as you work further through the array, because the probability
of finding an element larger than the largest one seen so far decreases
(unless the array is sorted in ascending order).
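
(For uniformly random, distinct values, element i is a new maximum with
probability 1/(i+1), so the expected number of times the update fires over
n elements is the harmonic number H_n ~ ln n; the average trip count of the
inner loop is then on the order of n/ln n.)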

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 09:02:42 -0700
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <sg0gr4$pet$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 23 Aug 2021 16:02:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="607e866bd76e72d2fba5e11a8a40d10b";
logging-data="26077"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX194TQmPO1k5VvxLEaDBxx06ZDxj+6xF28E="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:G6Ob3Cu7q6jLJyr+R3v3mYVrIE8=
In-Reply-To: <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Mon, 23 Aug 2021 16:02 UTC

On 8/23/2021 8:44 AM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 12:44:45 AM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
>> [...]
>>>> int m2(int * const restrict a, int n)
>>>> {
>>>> int m, nm;
>>>> int i;
>>>>
>>>> m = INT_MIN;
>>>> nm = -1;
>>>> for (i=0; i<n; i++)
>>>> {
>>>> if (a[i] > m)
>>>> {
>>>> m = a[i];
>>>> nm = i;
>>>> }
>>>> }
>>>> return nm;
>>>> }
>>> <
>>> GLOBAL m2
>>> ENTRY m2
>>> m2:
>>> MOV R3,#0x7FFFFFFFFFFFFFFF
>>> MOV R4,#-1
>>> MOV R5,#0
>>> top:
>>> VEC R8,{R3,R4}
>>> LDW R6,[R1+R5<<2]
>>> CMP R7,R6,R3
>>> PGT R7,{2,TT}
>>> MOV R3,R6 // Be careful on this assignment
>>> MOV R4,R5 // Be careful on this assignment
>>> LOOP LT,R5,#1,R2
>>> MOV R1,R3
>>> RET
>>>>
>>>> An SIMD version with m lanes would probably determine the maximum
>>>> value for each lane separately and, at the end of the loop, return
>>>> the smallest index of the largest value, so something like
>>>>
>>>> for (i=0; i<n; i+=n_lanes)
>>>> {
>>>> if (a[i] > m[0])
>>>> {
>>>> m[0] = a[i];
>>>> nm[0] = i;
>>>> }
>>>> if (a[i+1] > m[1])
>>>> {
>>>> m[1] = a[i+1];
>>>> nm[1] = i + 1;
>>>> }
>>>> ...
>>>> }
>>>>
>>>> How would VVM handle that? Could it also use a similar parallel
>>>> approach, from just translating the scalar code?
>>> <
>>> The VEC instruction tells each loop to watch for modifications to R3 and R4
>>> and obey loop carried dependencies.
>> I'm afraid that does not answer my question, at least I do not
>> understand it this way.
>>
>> Will it run several iterations in parallel without source code
>> modification, or not?
> <
> Yes iterations will run in parallel on multiple lanes.
> However, any lane that writes to R3 or R4 will cause a serial dependency
> at LOOP and will be backed up, much like branch repair, and played out again.
> <
> So, let us postulate that we have a 4-lanes, and the loop is zipping through
> iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
> Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
> run like expected.
> <
> In effect, the loop runs as expected, but this kind of dependency causes
> a "blip" in execution width.

Ahhh! I didn't understand that. So in the case of summing the elements
of an unsigned integer vector, it is the writes to the "running sum"
register that cause the serial dependency and thus prevent parallel
additions. That makes sense.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 17:40:30 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg0mie$74p$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<2021Aug22.193605@mips.complang.tuwien.ac.at>
<sfvd7h$btt$1@newsreader4.netcologne.de>
<2021Aug23.123334@mips.complang.tuwien.ac.at>
Injection-Date: Mon, 23 Aug 2021 17:40:30 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="7321"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 17:40 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>> This is actually a significant part of the inner loop of Jon Bentley's
>>> Traveling Salesman example which we looked at in the thread that
>>> begins with <2016Nov14.164726@mips.complang.tuwien.ac.at>.
>>> already5chosen@yahoo.com presented a vectorized version of the loop
>>> over the whole array (rather than a loop that ends as soon as it finds
>>> something closer than before) in
>>><b2aed821-2b7e-456d-9a6d-c2ea1fdedd55@googlegroups.com>, and I
>>> discussed it in <2016Nov16.155150@mips.complang.tuwien.ac.at>.
>>
>>From what I read in your article, the effect of the code seems
>>so-so and depend on the architecture.
>
> It also depends on the array size (between 1 and 10000 in my case).
> Assuming the crossover point is, say, 2000, it's probably best to use
> AVX "branchless" for the first 2000 elements, and then continue with
> AVX hard. I wanted to look into "branchless" some more, but, as
> usual, other things needed my attention and so I did not pursue it
> further.
>
>>Would it be possible to give the code at
>>https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
>>part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 a spin
>>to see if it does better? It worked well on Zen 1 despite that
>>architecture only "faking" AVX2 with 128-bit registers.
>
> I could not compile it on Debian 11 ("relocation R_X86_64_32S against
> `.data' can not be used when making a PIE object;

A strangeness that some distributors have put into compilers
recently. Luckily enough, the base version of gcc does not do this.

> recompile with
> -fPIE"; this means that the assembly code contains an absolute address
> and should be replaced with a rip-relative address), so I compiled it
> on Debian 8 (gcc-4.9.2).
>
> Below is what I see. What does it mean?

The numbers mean average iterations per cycle. "n" is the vector
length. The "normal" version is the code as I posted it. The
"expect" version uses __builtin_expect to tell the compiler that
finding a new maximum seems unlikely. AVX2 is an AVX2 version of
the code, and AVX2_unroll is an unrolled version of AVX2.

The values of n go up and then down again, to get some idea of reproducibility.
I suspect the "going down" numbers are more reliable, so I'll
look at those.

> On Skylake:

> 536870912 0.560183 0.560181 1.509494 1.516893
> 268435456 0.560113 0.560441 1.507528 1.516041
> 134217728 0.559948 0.560224 1.509935 1.519144

So, for a very long vector: 0.56 iterations per cycle for normal
code, 1.52 iterations for the AVX2 code. Almost a factor of
three, not bad.

[...]

> 1024 0.540084 0.522983 0.510469 0.725212
> 512 0.509960 0.454707 0.490421 0.621359
> 256 0.477612 0.391437 0.468864 0.481203
> 128 0.450704 0.421053 0.400000 0.278261

Don't use the unrolled AVX2 stuff on short vectors, I suppose,
but at least the slowdown for AVX2 against normal code is slight.

> On Zen 3:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.626340 0.417924 2.446799 2.465893
> 268435456 0.626351 0.417885 2.438281 2.455598

Scalar code is about on par with Skylake; the AVX2 code is better.
Strange that the __builtin_expect code is slower, but that may
just be the rather old compiler.

> 256 0.481203 0.354571 1.347368 0.748538
> 128 0.421053 0.336842 0.842105 0.421053

The clear winner for Zen3: The AVX2 stuff without unrolling.

> On Zen 2:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.501185 0.362092 1.474724 1.484411
> 268435456 0.505392 0.362086 1.480385 1.487473

> 256 0.481203 0.374269 1.122807 2.245614
> 128 0.481203 0.336842 1.684211 0.421053

Again, slower than Zen3, but still AVX2 wins hands-down.

> On Zen:
># Ints per cycle

>
> 536870912 0.484308 0.334349 1.383726 1.346343

> 512 0.374269 0.296296 1.292929 1.422222
> 256 0.323232 0.263374 1.015873 1.185185
> 128 0.296296 0.222222 0.888889 0.507937

Again, a clear win for the non-unrolled AVX2 code.
>
> On Tiger Lake:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 1.292826 1.297217 1.724536 1.724241
> 268435456 1.299574 1.298605 1.726204 1.708710

The scalar variant is _very_ good, AVX2 does gain some, but not
as much as the other architectures, especially when

> 512 1.221957 1.201878 0.695652 1.089362
> 256 1.000000 0.583144 0.744186 0.992248
> 128 0.761905 0.512000 0.677249 0.882759

it seems to get slower towards the end (but the numbers still are a
bit erratic).

AVX2 without unrolling seems to be the clear winner for all
architectures you checked, especially the AMD ones, except for Tiger
Lake, which combines excellent performance of the scalar loop
with lackluster performance on AVX2.  Maybe they figured that,
while they do support the instructions, performance was not
so important for them after all.  For a processor intended for
the mobile market, that makes sense.

Re: VVM question

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 17:50:54 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg0n5u$74p$2@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me>
Injection-Date: Mon, 23 Aug 2021 17:50:54 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="7321"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 17:50 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
> On 8/23/2021 8:44 AM, MitchAlsup wrote:

>> Yes iterations will run in parallel on multiple lanes.
>> However, any lane that writes to R3 or R4 will cause a serial dependency
>> at LOOP and will be backed up, much like branch repair, and played out again.
>> <
>> So, let us postulate that we have a 4-lanes, and the loop is zipping through
>> iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
>> Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
>> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
>> run like expected.
>> <
>> In effect, the loop runs as expected, but this kind of dependency causes
>> a "blip" in execution width.

Good explanation, thanks.

>
> Ahhh! I didn't understand that. So in the case of summing the elements
> of an unsigned integer vector, it is the writes to the "running sum"
> register that causes the serial dependency and thus prevents parallel
> additions. That makes sense.

So (moving the goalposts towards summation here), VVM-optimized code
could look like

for (i=0; i<n; i+=m) {
    for (j=0; j<m; j++)
        s[i+j] += a[i+j];
}

with suitable postprocessing (and pre-processing if n
is not divisible by m).

Hm. This doesn't really make it more elegant than doing the same kind
of thing in SIMD.

Or how should a reduction be written?

Re: VVM question

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 11:16:43 -0700
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <sg0omd$itq$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 23 Aug 2021 18:16:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="607e866bd76e72d2fba5e11a8a40d10b";
logging-data="19386"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+viuEmkrJKezrnhtaOnbYGgG7bXub+kTc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:45SOL424Vh7BPmbHtQ0gT0rfqrQ=
In-Reply-To: <sg0n5u$74p$2@newsreader4.netcologne.de>
Content-Language: en-US
 by: Stephen Fuld - Mon, 23 Aug 2021 18:16 UTC

On 8/23/2021 10:50 AM, Thomas Koenig wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
>> On 8/23/2021 8:44 AM, MitchAlsup wrote:
>
>>> Yes iterations will run in parallel on multiple lanes.
>>> However, any lane that writes to R3 or R4 will cause a serial dependency
>>> at LOOP and will be backed up, much like branch repair, and played out again.
>>> <
>>> So, let us postulate that we have a 4-lanes, and the loop is zipping through
>>> iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
>>> Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
>>> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
>>> run like expected.
>>> <
>>> In effect, the loop runs as expected, but this kind of dependency causes
>>> a "blip" in execution width.
>
> Good explanation, thanks.
>
>>
>> Ahhh! I didn't understand that. So in the case of summing the elements
>> of an unsigned integer vector, it is the writes to the "running sum"
>> register that causes the serial dependency and thus prevents parallel
>> additions. That makes sense.
>
> So (moving the goalposts towards summation here), VVM-optimized code
> could look like
>
> for (i=0; i<n; i+=m) {
> for (j=0; j<m; j++)
> s[i+j] += a[i+j];
> }
>
> with suitable postprocessing (and pre-processing if n
> is not divisible by m).
>
> Hm. This doesn't really make it more elegant than doing the same kind
> of thing in SIMD.
>
> Or how should a reduction be written?

I think that depends upon whether the order of the operations is
potentially significant. For example, if the values are signed, an
intermediate step may overflow or underflow in one order but not in
another, e.g. when a single running sum is replaced by multiple
intermediate sums plus a final "sum of intermediates" step. Many
people have pointed out the issues with doing the multiply/adds needed
for an inner product in parallel. That is why I specified unsigned
integers in the vector to be summed.

So for full generality, you have to do one element at a time. You code
it that way, and VVM executes it that way, except with the benefit of
the full cache width loads. If you know that order is not significant
(e.g. summing unsigned integers), you would have to unroll the loop in
the source code, which would allow parallel additions and thus buy you
improved performance. A final add outside the loop gets the grand total.
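
For concreteness, the kind of source-level unrolling I mean might look like
this (just a sketch; the function name and the remainder loop are mine, and
the four independent accumulators are what lets four adders run in parallel):

#include <stddef.h>

unsigned vsum_unrolled4(const unsigned *a, size_t n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0];      /* four independent running sums, */
        s1 += a[i + 1];      /* no serial dependency between them */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* leftover elements when n % 4 != 0 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);   /* the final add outside the loop */
}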

There is the question of how much to unroll the loop. I think you
probably want to unroll it four times in the source code. That way, you
get maximum performance on any system with up to four integer units
without source code changes. I don't think you can do eight, as you
might exceed the VVM instruction limit. You could certainly do two, and
that would work, but you would be giving away performance on a CPU with
4 integer units.

A question for Mitch. Suppose you unroll the loop for summing a vector
of unsigned integers. In VVM, the load for the first element causes a
cache line to be loaded into a streaming buffer. The next load, for the
second element, wants the next entry in the buffer. So, does VVM
recognize this all as a "dense reference" even though the references are
not from the same instruction?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20066&group=comp.arch#20066

X-Received: by 2002:a37:a20f:: with SMTP id l15mr22103962qke.24.1629744043748;
Mon, 23 Aug 2021 11:40:43 -0700 (PDT)
X-Received: by 2002:a4a:3944:: with SMTP id x4mr15196945oog.69.1629744043513;
Mon, 23 Aug 2021 11:40:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 23 Aug 2021 11:40:43 -0700 (PDT)
In-Reply-To: <sg0omd$itq$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de> <sg0omd$itq$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 23 Aug 2021 18:40:43 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 23 Aug 2021 18:40 UTC

On Monday, August 23, 2021 at 1:16:47 PM UTC-5, Stephen Fuld wrote:
> On 8/23/2021 10:50 AM, Thomas Koenig wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> >> On 8/23/2021 8:44 AM, MitchAlsup wrote:
> >
> >>> Yes iterations will run in parallel on multiple lanes.
> >>> However, any lane that writes to R3 or R4 will cause a serial dependency
> >>> at LOOP and will be backed up, much like branch repair, and played out again.
> >>> <
> >>> So, let us postulate that we have a 4-lanes, and the loop is zipping through
> >>> iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
> >>> Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
> >>> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
> >>> run like expected.
> >>> <
> >>> In effect, the loop runs as expected, but this kind of dependency causes
> >>> a "blip" in execution width.
> >
> > Good explanation, thanks.
> >
> >>
> >> Ahhh! I didn't understand that. So in the case of summing the elements
> >> of an unsigned integer vector, it is the writes to the "running sum"
> >> register that causes the serial dependency and thus prevents parallel
> >> additions. That makes sense.
> >
> > So (moving the goalposts towards summation here), VVM-optimized code
> > could look like
> >
> > for (i=0; i<n; i+=m) {
> > for (j=0; j<m; j++)
> > s[i+j] += a[i+j];
> > }
> >
> > with suitable postprocessing (and pre-processing if n
> > is not divisible by m).
> >
> > Hm. This doesn't really make it more elegant than doing the same kind
> > of thing in SIMD.
> >
> > Or how should a reduction be written?
> I think that depends upon whether the order of the operations is
> potentially significant. For example, if the values are signed, you may
> hit an underflow/overflow at an intermediate step that gets "cancelled
> out" by doing multiple intermediate sums then a final "sum of
> intermediates" step. Many people have pointed out the issues with doing
> the multiply/adds needed for an inner product in parallel. That is why
> I specified unsigned integers in the vector to be summed.
>
> So for full generality, you have to do one element at a time. You code
> it that way, and VVM executes it that way, except with the benefit of
> the full cache width loads. If you know that order is not significant
> (i.e. summing unsigned integers), you would have to unroll the loop in
> the source code, which would allow parallel additions) and thus buy you
> improved performance. A final add outside the loop gets the grand total.
>
> There is the question of how much to unroll the loop. I think you
> probably want to unroll it four times in the source code. That way, you
> get maximum performance on any system with up to four integer units
> without source code changes. I don't think you can do eight, as you
> might exceed the VVM instruction limit. You could certainly do two, and
> that would work, but you would be giving away performance on a CPU with
> 4 integer units.
>
>
> A question for Mitch. Suppose you unroll the loop for summing a vector
> of unsigned integers. In VVM, the load for the first element causes a
> cache line to be loaded into a streaming buffer. The next load, for the
> second element, wants the next entry in the buffer. So, does VVM
> recognize this all as a "dense reference" even though the references are
> not from the same instruction?
<
As the Loop is installed in the stations, memory reference address patterns are
examined. If the address patterns are based on indexing off of the register used
in the LOOP instruction, then certain inferences can be made. The determination
of "dense" is one of these.
<
On the other hand, it is easy to code gather/scatter in which the pointers/indexes
are dense and the indirectly accessed data are not.
<
Dense, in the VVM sense, means that several iterations of the loop can all access one
streaming buffer (avoiding the cache and TLB) so that the other stuff (gather/scatter
memory refs) has access through the normal cache paths.
<
Back to the posed question:
<
If the programmer unrolled the loop by hand (like DGEMM without transposes):
The LDs would need to be coded using offsets from the index register to be
recognized as dense::

MOV Ri,#0
VEC R8,{}
LDD R4,[R2,Ri<<3]
LDD R5,[R2,Ri<<3+8]
LDD R6,[R2,Ri<<3+16]
LDD R7,[R2,Ri<<3+24]
....
LOOP LT,Ri,#4,Rmax
<
The above code would be recognized as dense.
<
MOV Ri,#0
ADD R9,R2,#8
ADD R9,R2,#16
ADD R10,R2,#24
VEC R8,{}
LDD R4,[R2,Ri<<3]
LDD R5,[R8,Ri<<3+8]
LDD R6,[R9,Ri<<3+16]
LDD R7,[R10,Ri<<3+24]
....
LOOP LT,Ri,#4,Rmax
<
This loop is harder to recognize as dense--even though the number of words
in the loop is less.
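<
In C terms, both forms express the same access pattern: every load is ultimately
a[i + small constant], and the difference is only whether the constant sits in the
instruction or was folded into a base register beforehand. Roughly (a sketch only,
not generated code; the function and the summing body stand in for the elided parts):

#include <stddef.h>

double use4(const double *a, size_t max)
{
    double t = 0.0;
    for (size_t i = 0; i + 4 <= max; i += 4) {
        double x0 = a[i + 0];    /* cf. LDD R4,[R2,Ri<<3]    */
        double x1 = a[i + 1];    /* cf. LDD R5,[R2,Ri<<3+8]  */
        double x2 = a[i + 2];    /* cf. LDD R6,[R2,Ri<<3+16] */
        double x3 = a[i + 3];    /* cf. LDD R7,[R2,Ri<<3+24] */
        t += x0 + x1 + x2 + x3;  /* placeholder for the elided '....' body */
    }
    return t;
}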
<
As the loop is being installed in the stations, the first iteration is performed,
so many of the address patterns can be detected using actual AGEN addresses,
not just instruction patterns--so the second case might or might not be recognized.
<
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: VVM question

<sg13vd$hom$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20072&group=comp.arch#20072

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 21:29:17 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg13vd$hom$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
Injection-Date: Mon, 23 Aug 2021 21:29:17 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="18198"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 23 Aug 2021 21:29 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> Back to the posed question:
><
> If the programmer unrolled the loop by hand (like DGEMM without transposes):
> The LDs would need to be coded using offsets from the index register to be
> recognized as dense::
>
> MOV Ri,#0
> VEC R8,{}
> LDD R4,[R2,Ri<<3]
> LDD R5,[R2,Ri<<3+8]
> LDD R6,[R2,Ri<<3+16]
> LDD R7,[R2,Ri<<3+24]
> ...
> LOOP LT,Ri,#4,Rmax
><
> The above code would be recognized as dense.
><
> MOV Ri,#0
> ADD R9,R2,#8
> ADD R9,R2,#16
> ADD R10,R2,#24
> VEC R8,{}
> LDD R4,[R2,Ri<<3]
> LDD R5,[R8,Ri<<3+8]
> LDD R6,[R9,Ri<<3+16]
> LDD R7,[R10,Ri<<3+24]
> ...
> LOOP LT,Ri,#4,Rmax
><
> This loop is harder to recognize as dense--even though the number of words
> in the loop is less.

Hm... all possible, but less elegant than it could be. All the
manual unrolling and autovectorization and... rears its ugly
head again.

With all the mechanisms that VVM already offers, a way for the
programmer or a programming language to specify that operations
such as summation can be done in any order would be a very useful
addition.

Suggestion:

A variant of the VEC instruction, which does not specify a special
register to keep the address in (which can be hardwired if there
is no space in the thread header). This leaves five bits for
"reduction" registers, which specify that operations on those
registers can be done in any order in the loop.

This would be a perfect match for OpenMP's reduction clause or
for the planned REDUCTION addition to Fortran's DO CONCURRENT.

It would not have a 1:1 match for C semantics, sure, but this
should not pose a problem, I hope :-)

Re: VVM question

<64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20073&group=comp.arch#20073

X-Received: by 2002:a37:a603:: with SMTP id p3mr22985348qke.441.1629755725395; Mon, 23 Aug 2021 14:55:25 -0700 (PDT)
X-Received: by 2002:a9d:4786:: with SMTP id b6mr29025092otf.329.1629755725078; Mon, 23 Aug 2021 14:55:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 23 Aug 2021 14:55:24 -0700 (PDT)
In-Reply-To: <sg13vd$hom$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com> <sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com> <sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de> <sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com> <sg13vd$hom$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 23 Aug 2021 21:55:25 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 88
 by: MitchAlsup - Mon, 23 Aug 2021 21:55 UTC

On Monday, August 23, 2021 at 4:29:19 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Back to the posed question:
> ><
> > If the programmer unrolled the loop by hand (like DGEMM without transposes):
> > The LDs would need to be coded using offsets from the index register to be
> > recognized as dense::
> >
> > MOV Ri,#0
> > VEC R8,{}
> > LDD R4,[R2,Ri<<3]
> > LDD R5,[R2,Ri<<3+8]
> > LDD R6,[R2,Ri<<3+16]
> > LDD R7,[R2,Ri<<3+24]
> > ...
> > LOOP LT,Ri,#4,Rmax
> ><
> > The above code would be recognized as dense.
> ><
> > MOV Ri,#0
> > ADD R9,R2,#8
> > ADD R9,R2,#16
> > ADD R10,R2,#24
> > VEC R8,{}
> > LDD R4,[R2,Ri<<3]
> > LDD R5,[R8,Ri<<3+8]
> > LDD R6,[R9,Ri<<3+16]
> > LDD R7,[R10,Ri<<3+24]
> > ...
> > LOOP LT,Ri,#4,Rmax
> ><
> > This loop is harder to recognize as dense--even though the number of words
> > in the loop is less.
<
> Hm... all possible, but less elegant that it could be. All the
> manual unrolling and autovectorization and... rears its ugly
> head again.
>
> With all the mechanisms that VVM already offers, a way for the
> programmer or a programming language to specify that operations
> such as summation can be done in any order would be a very useful
> addition.
>
> Suggestion:
>
> A variant of the VEC instruction, which does not specify a special
> register to keep the address in (which can be hardwired if there
> is no space in the thread header). This leaves five bits for
> "reduction" registters, which specify that operations on that
> register can be done in any order in the loop.
<
That might be one way...........
<
My preferred means is to make a way to specify that a function unit is
performing a reduction, and that it should not deliver its value at the
end of its calculation, but hold on to it and use it in the next calculation.
<
So, a FMAC reduction would take the form of::

FMAC --,--,--,Rsum
VEC Rx,{}
LDD RA,[Ra+Ri<<3]
FMAC --,RA,RB,--
LOOP LT,Ri,#1,Rmax
FMAC Rsum,--,--,--
<
or something like that, where "--" means there is no operand or result being
specified: use the last operand that showed up, and make the destination
into an operand for the next cycle.
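
In C terms, the loop being expressed is just an ordinary multiply-accumulate
reduction; a sketch (the second operand stream and the names are illustrative,
the point is that the running sum stays inside the function unit until the
final FMAC delivers it):

#include <stddef.h>

double fmac_reduce(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* FMAC --,RA,RB,--   each iteration  */
    return sum;               /* FMAC Rsum,--,--,-- after the loop  */
}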
<
The multiplier ALREADY has the ability to perform accumulates
every cycle at the wide adder (3×52+52-bit incrementer); all we need
is an ISA way to specify feeding back the last intermediate result as
an operand to the next calculation.
<
The major problem is where does one store the state on an interrupt
taken inside of the loop. I am letting my subconscious dwell on it right
now.
>
> This would be a perfect match for OpenMP's reduction clause or
> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>
> It would not have a 1:1 match for C semantics, sure, but this
> should not pose a problem, I hope :-)

Re: VVM question

<sg1mru$h6d$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20079&group=comp.arch#20079

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Mon, 23 Aug 2021 19:51:41 -0700
Organization: A noiseless patient Spider
Lines: 80
Message-ID: <sg1mru$h6d$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 24 Aug 2021 02:51:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cad2cf2880430e998ee1f0ce2607f70";
logging-data="17613"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+coIt8tzPvNBhxsRh0ouLu7CKbjvslQV0="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:4OvRzw378Jdqq2DhMPQ0HzTvRHY=
In-Reply-To: <sg13vd$hom$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Stephen Fuld - Tue, 24 Aug 2021 02:51 UTC

On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>> Back to the posed question:
>> <
>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
>> The LDs would need to be coded using offsets from the index register to be
>> recognized as dense::
>>
>> MOV Ri,#0
>> VEC R8,{}
>> LDD R4,[R2,Ri<<3]
>> LDD R5,[R2,Ri<<3+8]
>> LDD R6,[R2,Ri<<3+16]
>> LDD R7,[R2,Ri<<3+24]
>> ...
>> LOOP LT,Ri,#4,Rmax
>> <
>> The above code would be recognized as dense.
>> <
>> MOV Ri,#0
>> ADD R9,R2,#8
>> ADD R9,R2,#16
>> ADD R10,R2,#24
>> VEC R8,{}
>> LDD R4,[R2,Ri<<3]
>> LDD R5,[R8,Ri<<3+8]
>> LDD R6,[R9,Ri<<3+16]
>> LDD R7,[R10,Ri<<3+24]
>> ...
>> LOOP LT,Ri,#4,Rmax
>> <
>> This loop is harder to recognize as dense--even though the number of words
>> in the loop is less.
>
> Hm... all possible, but less elegant that it could be. All the
> manual unrolling and autovectorization and... rears its ugly
> head again.
>
> With all the mechanisms that VVM already offers, a way for the
> programmer or a programming language to specify that operations
> such as summation can be done in any order would be a very useful
> addition.

I am probably missing something here. To me, the main advantage of
allowing out-of-order summations (using summations here as shorthand
for other similar operations) was to allow the hardware to make use of
multiple functional units. That is, a core with two adders could, if
allowed, complete the summation in about half the time. Without that, I
don't see any advantage of out-of-order summations on VVM. If I am
wrong, please explain. If I am right, see below.

> Suggestion:
>
> A variant of the VEC instruction, which does not specify a special
> register to keep the address in (which can be hardwired if there
> is no space in the thread header). This leaves five bits for
> "reduction" registters, which specify that operations on that
> register can be done in any order in the loop.

Doing the operations in a different order isn't the problem. You need a
way to allow/specify the two partial sums to be added together in the
end. I don't see your proposal as doing that. And, of course, it is
limited to five registers which must be specified in the hardware design.

>
> This would be a perfect match for OpenMP's reduction clause or
> for the planned REDUCTION addition to Fortran's DO CONCURRENT.

I am not an OpenMP person, and my knowledge of Fortran is old, so could
you please give a brief explanation of what these two things do? Thanks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

<sg23h5$5dj$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20080&group=comp.arch#20080

Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 06:27:49 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg23h5$5dj$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
Injection-Date: Tue, 24 Aug 2021 06:27:49 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="5555"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 24 Aug 2021 06:27 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>> Back to the posed question:
>>> <
>>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
>>> The LDs would need to be coded using offsets from the index register to be
>>> recognized as dense::
>>>
>>> MOV Ri,#0
>>> VEC R8,{}
>>> LDD R4,[R2,Ri<<3]
>>> LDD R5,[R2,Ri<<3+8]
>>> LDD R6,[R2,Ri<<3+16]
>>> LDD R7,[R2,Ri<<3+24]
>>> ...
>>> LOOP LT,Ri,#4,Rmax
>>> <
>>> The above code would be recognized as dense.
>>> <
>>> MOV Ri,#0
>>> ADD R9,R2,#8
>>> ADD R9,R2,#16
>>> ADD R10,R2,#24
>>> VEC R8,{}
>>> LDD R4,[R2,Ri<<3]
>>> LDD R5,[R8,Ri<<3+8]
>>> LDD R6,[R9,Ri<<3+16]
>>> LDD R7,[R10,Ri<<3+24]
>>> ...
>>> LOOP LT,Ri,#4,Rmax
>>> <
>>> This loop is harder to recognize as dense--even though the number of words
>>> in the loop is less.
>>
>> Hm... all possible, but less elegant that it could be. All the
>> manual unrolling and autovectorization and... rears its ugly
>> head again.
>>
>> With all the mechanisms that VVM already offers, a way for the
>> programmer or a programming language to specify that operations
>> such as summation can be done in any order would be a very useful
>> addition.
>
> I am probably missing something here.

Or, equivalently, I have been explaining things badly :-)

> To me the main advantage of
> allowing out of order summations (using summations here as shorthand for
> other similar type operations), was to allow the hardware to make use of
> multiple functional units.

Yes.

> That is, a core with two adders could, if
> allowed, complete the summation in about half the time.

Yes.

>Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.

Seeing below.

>
>
>
>> Suggestion:
>>
>> A variant of the VEC instruction, which does not specify a special
>> register to keep the address in (which can be hardwired if there
>> is no space in the thread header). This leaves five bits for
>> "reduction" registters, which specify that operations on that
>> register can be done in any order in the loop.
>
> Doing the operations in a different order isn't the problem.

It's one half of the problem.

The way VVM is currently specified, it has strictly in-order semantics:
you write down a C loop, and the hardware delivers the results
exactly in the order you wrote them down. This would have to be
changed.

> You need a
> way to allow/specify the two partial sums to be added together in the
> end.

That as well.

>I don't see your proposal as doing that.

I thought I had implied it, but it was obviously not clear enough.

> And, of course, it is
> limited to five registers which must be specified in the hardware design.

Five reductions in a loop would be plenty; it is usually one, or more
rarely two.

>> This would be a perfect match for OpenMP's reduction clause or
>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>
> I am not an OpenMP person, and my knowledge of Fortran is old, so could
> you please give a brief explanation of what these two things do? Thanks.

#pragma omp simd reduction(+:var)

before a loop tells the compiler that it can go wild with the order
of the loop iterations, but that "var" will be used in a summation
reduction.

DO CONCURRENT also runs loop iterations in an unspecified order;
the REDUCTION clause would then allow one to, for example,
sum up all the elements.
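
For example, a minimal sketch of the OpenMP form (a DO CONCURRENT
REDUCTION version would carry the same information to the compiler;
the function name is only for illustration):

#include <stddef.h>

double sum_reduce(const double *a, size_t n)
{
    double var = 0.0;
    /* iterations may be reordered/vectorized; var is combined as a '+' reduction */
    #pragma omp simd reduction(+:var)
    for (size_t i = 0; i < n; i++)
        var += a[i];
    return var;
}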

One problem with C and similar languages is that you have
to specify an ordering of the loop explicitly, which shapes
programmers' thinking and also shapes intermediate languages
for compilers...

Re: VVM question

<2021Aug24.071115@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=20081&group=comp.arch#20081

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 05:11:15 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 181
Distribution: world
Message-ID: <2021Aug24.071115@mips.complang.tuwien.ac.at>
References: <sftuaa$but$1@newsreader4.netcologne.de> <2021Aug22.193605@mips.complang.tuwien.ac.at> <sfvd7h$btt$1@newsreader4.netcologne.de> <2021Aug23.123334@mips.complang.tuwien.ac.at> <sg0mie$74p$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="7be28b2749dab779aafc11b0b1055033";
logging-data="327"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18lBLQslFUOsFPJy0+ZPvYU"
Cancel-Lock: sha1:Qjpu7XEeACPBQGvucehcjB0KIcQ=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 24 Aug 2021 05:11 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>The numbers mean average iterations per cycle.

Actually per rdtsc unit (which has not been a CPU cycle for over a
decade). And you time only a single run through the loop, so any
disturbance (rdtsc wobble, interrupt, etc.) will be very visible. You
also see one rdtsc (on average) in the result.

>"n" is the vector
>length. The "normal" version is the code as I posted it. The
>"expect" version uses __builtin_expect to tell the compiler that
>finding a new maximum seems unlikely.

I compiled with gcc-4.9 -O -std=c99, and this gives the following loops:

ml                       m2
mov (%rdi,%rdx,4),%ecx   mov (%rdi,%rdx,4),%ecx
cmp %r8d,%ecx            cmp %r8d,%ecx
jle 40077e <ml+0x21>     jle 4007a9 <m2+0x21>
mov %edx,%eax            mov %edx,%eax
mov %ecx,%r8d            mov %ecx,%r8d
add $0x1,%rdx            add $0x1,%rdx
cmp %edx,%esi            cmp %edx,%esi
jg 400771 <ml+0x14>      jg 40079c <m2+0x14>

So the same loop. Differences in performance may be from code
alignment (does that play a role with uCode caches?) or from
disturbances.

>AVX2 is an AVX2 version of
>the code, and AVX2_unroll is an unrolled version of AVX2.
>
>Numbers go up and then down again to have some reproducibility.
>I suspect the "going down" numbers are more reliable, so I'll
>look at those.
>
>
>> On Skylake:
>
>> 536870912 0.560183 0.560181 1.509494 1.516893
>> 268435456 0.560113 0.560441 1.507528 1.516041
>> 134217728 0.559948 0.560224 1.509935 1.519144
>
>So, for a very long vector: 0.56 iterations per cycle for normal
>code, 1.52 iterations for the AVX2 code. Almost a factor of
>three, not bad.

Given that all CPUs show higher AVX values in between, the loop seems
to run into some limit at some point. My first guess would be the L2
or L3 cache bandwidth, but the edge is not close to either limit:

edge before   L2      L3
1MB           0.25MB  6MB    Skylake
8MB           0.5MB   32MB   Zen3
4MB/8MB       0.5MB   16MB   Zen2 (Zen2 allocates L3 only from a 16MB slice)
2MB/4MB       0.5MB   8MB    Zen (Zen allocates L3 only from an 8MB slice).
2MB           1.5MB   8MB    Tiger Lake

Given that the AVX code is branchless apart from the loop-back edge,
the limit cannot be branch predictor capacity.

Whatever the limit is, the results for the huge arrays show that limit,
not the capabilities of the SIMD units. For those, better look at the
best SIMD results:

n        normal    expect    AVX2      AVX2_unroll
131072   0.570325  0.570285  2.357071  2.205857     Skylake
1048576  0.629830  0.420802  4.644690  5.443698     Zen 3
262144   0.608389  0.407787  3.234190  4.476656     Zen 2
262144   0.513307  0.342414  2.043148  1.760585     Zen
262144   1.729501  1.727769  2.557078  2.947558     Tiger Lake

So Zen with its 128-bit SIMD units is worst, and Tiger Lake is better
than Skylake. It's surprising that Zen2 and Zen3 are so much better
than Tiger Lake. We see a speedup by a factor of 8.6 of AVX2_unroll over
normal on Zen3, which is very impressive.

Let's also compare the n=512M values across CPUs:

n          normal    expect    AVX2      AVX2_unroll
536870912  0.560183  0.560181  1.509494  1.516893   Skylake
536870912  0.626340  0.417924  2.446799  2.465893   Zen 3
536870912  0.501185  0.362092  1.474724  1.484411   Zen 2
536870912  0.484308  0.334349  1.383726  1.346343   Zen
536870912  1.292826  1.297217  1.724536  1.724241   Tiger Lake

The limit seems to be similar on Skylake, Zen 2, and Zen, slightly
higher on Tiger Lake, and much higher on Zen 3. Given that main
memory bandwidth is an issue at that size, it would be good to know
how long an rdtsc tick is.

>> 1024 0.540084 0.522983 0.510469 0.725212
>> 512 0.509960 0.454707 0.490421 0.621359
>> 256 0.477612 0.391437 0.468864 0.481203
>> 128 0.450704 0.421053 0.400000 0.278261
>
>Don't use the unrolled AVX2 stuff on short vectors, I suppose,
>but at least the slowdown for AVX2 against normal code is slight.

I find it surprising that your branchless AVX2 code does not show a
speedup compared to the scalar code, which I expect to enter the if
code ~4.85 times for n=128, and have a branch misprediction every
time. And n=128 is not so small that SIMD should not provide a nice
benefit (only 16 iterations through the AVX2 inner loop).

Let's compare the n=128 cases for all CPUs:

n     normal    expect    AVX2      AVX2_unroll
128   0.450704  0.421053  0.400000  0.278261   Skylake
128   0.421053  0.336842  0.842105  0.421053   Zen 3
128   0.481203  0.336842  1.684211  0.421053   Zen 2
128   0.296296  0.222222  0.888889  0.507937   Zen
128   0.761905  0.512000  0.677249  0.882759   Tiger Lake

So for the Zens AVX2 provides a nice speedup for n=128 (especially Zen
2). Skylake is not so great here. Maybe it's the 256-bit wakeup
slowdown. Tiger Lake already shows a speedup from unrolling at n=128.

>> On Zen 3:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 0.626340 0.417924 2.446799 2.465893
>> 268435456 0.626351 0.417885 2.438281 2.455598
>
>Scalar code is about par with Skylake, the AVX2 code is better.
>Strange that the __builtin_expect code is slower, but that may
>just be the rather old compiler.

Apparently the different code alignment rubs Zen3 very much the wrong
way.

>The clear winner for Zen3: The AVX2 stuff without unrolling.

Not if n>=4k.

>> On Tiger Lake:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 1.292826 1.297217 1.724536 1.724241
>> 268435456 1.299574 1.298605 1.726204 1.708710
>
>The scalar variant is _very_ good

More than 1.7 iterations/rdtsc unit at the high point. Either the
rdtsc unit is very far from the cycle time, or this processor can do
more than one back-to-back add in one cycle. My guess is that it's the
former. The base clock of this CPU (Core i5-1135G7) seems to be
2.4GHz, the turbo 4.2GHz. If rdtsc uses the base clock and the
benchmark runs at max turbo, that would result in 1.75 cycles/rdtsc
unit, and the results fit that nicely. Still, 1 cycle/iteration of
the loop above is a very nice result; it means that Tiger Lake can
perform 2 taken branches per cycle (while, e.g., on Skylake, each
taken branch costs a cycle, resulting in at most 0.5 iterations/cycle
if we assume that a non-taken if branch results in a branch
misprediction).

Getting the result in cycles and in ns would be useful.

>AVX2 without unrolling seems to be the clear winner for all
>architectures you checked, especially the AMD ones, except for Tiger
>Lake, which combines excellent performance of the scalar loop
>with lackluster performance on AVX2. Maybe they figured that,
>while they do support the instructions, performance was not
>so important for them after all. For a processor intended for
>the mobile market, that makes sense.

Tiger Lake was not designed for the mobile market; it ended up being
sold only there because of the fab difficulties that Intel had. They
put in AVX-512 because they thought that SIMD performance is important
for the intended markets of this core. My guess is that the latency
per iteration of the vpcmpgtd-vpblendvb recurrence is relatively long;
but according to https://www.agner.org/optimize/instruction_tables.pdf,
this recurrence has 3 cycles of latency, so it should not be so bad. Hmm.
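
For reference, a rough reconstruction of the kind of branchless AVX2 max+index
loop built on that vpcmpgtd/vpblendvb recurrence (my sketch, not necessarily
the code that was benchmarked; the names and the lane-combining step are mine):

#include <immintrin.h>
#include <limits.h>

int m2_avx2(const int *a, int n)
{
    __m256i vmax = _mm256_set1_epi32(INT_MIN);             /* running maxima */
    __m256i vidx = _mm256_set1_epi32(-1);                   /* their indices  */
    __m256i icur = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i step = _mm256_set1_epi32(8);
    int i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i gt = _mm256_cmpgt_epi32(v, vmax);           /* vpcmpgtd  */
        vmax = _mm256_blendv_epi8(vmax, v, gt);             /* vpblendvb */
        vidx = _mm256_blendv_epi8(vidx, icur, gt);          /* vpblendvb */
        icur = _mm256_add_epi32(icur, step);
    }

    int maxs[8], idxs[8], m = INT_MIN, nm = -1;
    _mm256_storeu_si256((__m256i *)maxs, vmax);
    _mm256_storeu_si256((__m256i *)idxs, vidx);
    for (int k = 0; k < 8; k++) {      /* combine lanes; earliest index wins ties */
        if (maxs[k] > m) { m = maxs[k]; nm = idxs[k]; }
        else if (maxs[k] == m && idxs[k] >= 0 && (nm < 0 || idxs[k] < nm)) nm = idxs[k];
    }
    for (; i < n; i++)                 /* scalar tail for n % 8 != 0 */
        if (a[i] > m) { m = a[i]; nm = i; }
    return nm;
}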

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: VVM question

<sg27qb$2lv$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=20082&group=comp.arch#20082

Path: i2pn2.org!i2pn.org!aioe.org!T3F9KNSTSM9ffyC31YXeHw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 09:40:58 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sg27qb$2lv$1@gioia.aioe.org>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <sg0ctf$qoj$1@dont-email.me>
<2021Aug23.172406@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="2751"; posting-host="T3F9KNSTSM9ffyC31YXeHw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.8.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 24 Aug 2021 07:40 UTC

Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 8/22/2021 10:44 PM, Thomas Koenig wrote:
> |int m2(int * const restrict a, int n)
> |{
> | int m, nm;
> | int i;
> |
> | m = INT_MIN;
> | nm = -1;
> | for (i=0; i<n; i++)
> | {
> | if (a[i] > m)
> | {
> | m = a[i];
> | nm = i;
> | }
> | }
> | return nm;
> |}
> ...
>>> Will it run several iterations in parallel without source code
>>> modification, or not?
>>
>> This hearkens back to the thread we had some months ago on reductions in
>> VVM. I think the answer is "mostly not". I say this because the full
>> cache line load streaming capability is sort of doing multiple loads in
>> parallel, but the the compare part of the loop will not use multiple
>> ALUs in parallel, even if they are available.
>
> Why not? Consider this as the following equivalent code:
>
> int m2(int * const restrict a, int n)
> {
> int m, nm;
> int i;
>
> m = INT_MIN;
> nm = -1;
> i=0;
> while (i<n) {
> while (i<n && a[i]<=m)
> i++;
> if (a[i] > m) {
> m = a[i];
> nm = i;
> }
> i++;
> }
> return nm;
> }
>
> Now look at the inner loop. It is so easy to vectorize that even VVM
> may be able to do it (maybe even auto-vectorizing compilers). Of

The main issue is that the loop is buggy! The inner loop can exit because
(i<n) becomes false, at which point the next line's "a[i] > m" reads past
the end of the array and is UB.

Modifying it to while (i+1<n && a[i] <= m) would work, I think, but it is
easier to check the index below:

while (i<n) {
    while (i<n && a[i]<=m)
        i++;
    if (i < n) {
        m = a[i];
        nm = i;
    }
    i++;
}

> course, at first it will have very short trip counts, but they
> increase the further through the array you work, as the probability to
> find an element larger than the largest one up to now decreases
> (unless the array is sorted).

Searching for the max value in N random elements expects about log(N) hits,
so yes, it is usually OK.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
