Rocksolid Light



devel / comp.arch / Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

Subject - Author
* Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
+- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
+* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Branimir Maksimovic
|`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
| `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Branimir Maksimovic
|  `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|   +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Terje Mathisen
|   |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|   | +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|   | +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Marcus
|   | |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|   | | `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Terje Mathisen
|   | |  `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|   | `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|   `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|    `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|     `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|      `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|       +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|       |`- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|       `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Thomas Koenig
|        +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|        |+* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|        ||+- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|        ||`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - JimBrakefield
|        || +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - MitchAlsup
|        || `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|        |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Thomas Koenig
|        | `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|        +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com
|        `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - Terje Mathisen
`- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm - luke.l...@gmail.com

Pages: 1 2
Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<debd883d-6391-48dc-9628-ac6519809ec5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18950&group=comp.arch#18950

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Tue, 20 Jul 2021 11:09:36 -0700 (PDT)
 by: luke.l...@gmail.com - Tue, 20 Jul 2021 18:09 UTC

i thought people on here might appreciate knowing that i've
just managed to design and implement an in-place RADIX-2
general-purpose iterative DCT algorithm.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/fastdctlee.py;hb=HEAD

based on Nayuki's excellent work, which implements a recursive
Lee algorithm, i first removed the recursion and created an iterative
version.

the second step was to pre-load the data in bit-reversed order
(0 1 2 3 4 5 6 7 loads the data in order 0 4 2 6 1 5 3 7)
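as a throwaway sketch (plain Python, not the fastdctlee.py code), the bit-reversed load order quoted above can be generated like this, where `bitrev` is a hypothetical helper that mirrors the low bits of an index:

```python
def bitrev(i, nbits):
    # reverse the low nbits of index i, e.g. 0b001 -> 0b100
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

n = 8
order = [bitrev(i, 3) for i in range(n)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

which reproduces the 0 4 2 6 1 5 3 7 sequence for the 8-wide case.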

the third step was to note that when the top half of the inner
butterfly is order-inverted, a post-coefficient series of in-place
swaps can be carried out:

j = list(range(i, i + halfsize))
jr = list(range(i + halfsize, i + size))
jr.reverse()  # here is the order reversal of the top half
for ci, (jl, jh) in enumerate(zip(j, jr)):
    t1, t2 = vec[jl], vec[jh]
    coeff = math.cos((ci + 0.5) * math.pi / size) * 2.0
    vec[jl] = t1 + t2
    vec[jh] = (t1 - t2) * (1 / coeff)
# now do the in-place swaps of the top half
hz2 = halfsize // 2
for ci, (jl, jh) in enumerate(zip(j[:hz2], jr[:hz2])):
    jlh = jl + halfsize
    vec[jlh], vec[jh] = vec[jh], vec[jlh]

step four was to realise that with some indirection - a list-of-indices -
*you don't need to move the data*, you swap the indices that *refer*
to that data.
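a minimal illustration of the indirection idea (hypothetical names, not the actual fastdctlee.py code): instead of swapping vec[a] and vec[b], swap two entries of an index table and read the data through it:

```python
vec = [10.0, 20.0, 30.0, 40.0]
ri = list(range(len(vec)))   # ri[i] says where logical element i really lives

# "swap" elements 1 and 3 without touching the data itself
ri[1], ri[3] = ri[3], ri[1]

# all later accesses go through the indirection table
logical = [vec[ri[i]] for i in range(len(vec))]
print(logical)  # [10.0, 40.0, 30.0, 20.0]
```

the data array never moves; only the (much cheaper) index table is permuted.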

step five was to realise that, just as with the pre-loading of the data
in index-bit-reversed order, the *same thing can be done with the
swap-indices*.

the only trick is that where bit-reverse is its own inverse, recursive
index-swapping is not. however i managed to work out the algorithm
for doing so, and it's actually really simple:

def halfrev(l, pre_rev=True):
    n = len(l)
    if n == 1:
        return l
    ll, lh = l[:n//2], l[n//2:]
    lh.reverse()
    return halfrev(ll, pre_rev) + halfrev(lh, pre_rev)

this is the *inverse* of the order-swapping that occurs during the
inner butterfly, such that by the time the swapping is performed
the data *ends up* in the correct order. without actually ever doing
any swaps, simply in-place twin FMACs.
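to make the behaviour concrete, here is the halfrev definition again (repeated so the snippet is self-contained) with a worked 8-element example:

```python
def halfrev(l, pre_rev=True):
    # recursively reverse the top half of each sub-block
    n = len(l)
    if n == 1:
        return l
    ll, lh = l[:n // 2], l[n // 2:]
    lh.reverse()
    return halfrev(ll, pre_rev) + halfrev(lh, pre_rev)

print(halfrev(list(range(8))))  # [0, 1, 3, 2, 7, 6, 4, 5]
```

note that applying it twice does not give back the identity, which is exactly the "not its own inverse" point above.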

all other algorithms that i have seen are either recursive or fully
loop-unrolled inline assembler. it took a week to find even
one iterative DCT algorithm: it wasn't in-place and it didn't work (several
bugs).

i would be interested to hear people's thoughts on how this could be
efficiently and effectively implemented in a processor core: what sorts
of instructions and architecture would be needed. i have some ideas
for SVP64, it would just be fascinating to have a comparative discussion
particularly given that this algorithm is ridiculously popular and of such
importance in so many areas.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<727fba1c-8031-4349-bf32-2f40919757edn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18956&group=comp.arch#18956

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Tue, 20 Jul 2021 13:55:57 -0700 (PDT)
In-Reply-To: <debd883d-6391-48dc-9628-ac6519809ec5n@googlegroups.com>
 by: luke.l...@gmail.com - Tue, 20 Jul 2021 20:55 UTC

fascinating.

creating the indices for the reversal order that would normally
have to be done recursively is as simple as this:

res = []
for i in range(len(vec)):
    res.append(i ^ (i >> 1))

which is *really* surprising.
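one way to see why this is surprising: i ^ (i >> 1) is the binary-reflected Gray code of the index, so consecutive generated indices differ in exactly one bit. a quick self-contained check (plain Python, hypothetical helper name):

```python
def gray(i):
    # binary-reflected Gray code of index i
    return i ^ (i >> 1)

codes = [gray(i) for i in range(8)]
print(codes)  # [0, 1, 3, 2, 6, 7, 5, 4]

# adjacent codes differ in exactly one bit
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:]))
```

(the connection between this sequence and the recursive index-swapping is as described in the post above.)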

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<n8HJI.28032$qk6.1643@fx36.iad>

https://www.novabbs.com/devel/article-flat.php?id=18957&group=comp.arch#18957

Newsgroups: comp.arch
From: branimir...@gmail.com (Branimir Maksimovic)
Date: Tue, 20 Jul 2021 21:22:27 GMT
 by: Branimir Maksimovic - Tue, 20 Jul 2021 21:22 UTC

On 2021-07-20, luke.l...@gmail.com <luke.leighton@gmail.com> wrote:
> i thought people on here might appreciate knowing that i've
> just managed to design and implement an in-place RADIX-2
> general-purpose iterative DCT algorithm.
>
> [snip]
>
> l.
Well, implement it in Verilog, rather than Python :P

--
bmaxa now listens Diy by Robots in disguise from Robots in disguise

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<fa1f3c1f-96bc-425b-9663-addaa4fef487n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18958&group=comp.arch#18958

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Tue, 20 Jul 2021 14:45:55 -0700 (PDT)
In-Reply-To: <n8HJI.28032$qk6.1643@fx36.iad>
 by: luke.l...@gmail.com - Tue, 20 Jul 2021 21:45 UTC

On Tuesday, July 20, 2021 at 10:22:32 PM UTC+1, Branimir Maksimovic wrote:

> Well, implement it in Verilog, rather than Python :P

nmigen HDL... which is in python :) outputs to Verilog (which i consider
to be machine-code) via yosys. need about 6-8 weeks to do it; am
currently focussing on implementing it in the SVP64 Simulator, which
(surpriiise) is in python, too :)

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<hNHJI.25335$rr3.14798@fx34.iad>

https://www.novabbs.com/devel/article-flat.php?id=18959&group=comp.arch#18959

Newsgroups: comp.arch
From: branimir...@gmail.com (Branimir Maksimovic)
Date: Tue, 20 Jul 2021 22:06:05 GMT
 by: Branimir Maksimovic - Tue, 20 Jul 2021 22:06 UTC

On 2021-07-20, luke.l...@gmail.com <luke.leighton@gmail.com> wrote:
> On Tuesday, July 20, 2021 at 10:22:32 PM UTC+1, Branimir Maksimovic wrote:
>
>> Well, implement it in Verilog, rather than Python :P
>
> nmigen HDL... which is in python :) outputs to Verilog (which i consider
> to be machine-code) from yosys. need about 6-8 weeks to do it, am
> currently focussing on implementing it in the SVP64 Simulator. which
> (surpriiise) is in python, too :)

Well, why do you ask then? ;)
>
> l.

--
bmaxa now listens Woman In Disguise by Angelic Upstarts from The Independent Punk Singles Collection

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<6789ffa8-6f88-4f97-810c-91b3ed8f8defn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18976&group=comp.arch#18976

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Wed, 21 Jul 2021 02:21:35 -0700 (PDT)
In-Reply-To: <hNHJI.25335$rr3.14798@fx34.iad>
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 09:21 UTC

On Tuesday, July 20, 2021 at 11:06:08 PM UTC+1, Branimir Maksimovic wrote:
> Well, why do you ask then? ;)

1) to start a comparative architectural discussion (answering the blindingly-obvious
question "why the hell hasn't anyone come up with this before now??")
2) to see if it can be done better or faster
3) to find out if there are any other algorithms that can be done similarly

regarding (1) i am actually really shocked that, out of 120 references
on wikipedia and hundreds of academic papers on one of the world's
most heavily-used algorithms, none of them are in-place general-purpose.
the research falls into two categories:

a) hardware versions of DCT
b) versions which minimise the number of sequentially-run but loop-unrolled operations

in software, the perceived expense of bit-reversal has been a dead end.
there haven't been any recent Vector Processor innovations, so everyone
has been trying desperately to use SIMD, and of course the only way to use
that in any kind of optimal way is to use recursive *assembly-level* macros
to create a batch of loop-unrolled instructions (thousands of them for large
DCTs).

etc. etc.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sd8r52$q48$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=18979&group=comp.arch#18979

Newsgroups: comp.arch
From: terje.ma...@tmsw.no (Terje Mathisen)
Date: Wed, 21 Jul 2021 11:58:57 +0200
 by: Terje Mathisen - Wed, 21 Jul 2021 09:58 UTC

luke.l...@gmail.com wrote:
> On Tuesday, July 20, 2021 at 11:06:08 PM UTC+1, Branimir Maksimovic wrote:
>> Well, why do you ask then? ;)
> [snip]
>
> in software, the perceived expense of bit-reversal has been a dead end.

Huh?

I have written numerous *FT implementations; bit reversal has _never_
been a bottleneck.

The most obvious solution is to embed the bit reversal in the loop
construct, in which case the overhead becomes very close to zero, or you
unroll by say 16 and get to handle the nybbles that way.
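[a sketch of what Terje describes might look like as follows (plain Python, not his actual code): the classic carry-propagating bit-reversed counter, which keeps a second index j in bit-reversed lockstep with the loop counter at a cost of O(1) amortised work per iteration:]

```python
def bitrev_indices(n):
    # walk 0..n-1 while maintaining j as the bit-reversed counterpart
    # of the loop counter; n must be a power of two
    out = [0]
    j = 0
    for _ in range(1, n):
        bit = n >> 1
        while j & bit:        # propagate the "carry" downwards
            j ^= bit
            bit >>= 1
        j |= bit
        out.append(j)
    return out

print(bitrev_indices(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```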

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<f4f50a1a-fade-466d-9941-6e5871c001afn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18980&group=comp.arch#18980

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Wed, 21 Jul 2021 03:38:57 -0700 (PDT)
In-Reply-To: <sd8r52$q48$1@gioia.aioe.org>
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 10:38 UTC

On Wednesday, July 21, 2021 at 10:59:00 AM UTC+1, Terje Mathisen wrote:
> luke.l...@gmail.com wrote:

> I have written numerous *FT implementations,

(caveat: i haven't. this is an advantage, new eyes, and a disadvantage, not knowing all the "tricks".)

> bit reversal has _never_ been a bottleneck.

interesting.

> The most obvious solution is to embed the bit reversal in the loop
> construct, in which case the overhead becomes very close to zero,

compared to say VLIW DSPs like the TMS320 and the Qualcomm Hexagon, which
can do FFT in a single (admittedly VLIW) instruction and therefore keep FMAC
pipelines 100% occupied *without multi-issue*, does that not result in a slowdown?

the other thing about the algorithm (and associated ISA augmentations) is that
even the simplest embedded implementations would be able to spam backend
FMAC pipelines 100% full per clock; multi-issue engines would be able to do N IPC
for N way issue.

> or you
> unroll by say 16 and get to handle the nybbles that way.

avoiding loop unrolling is one of the design goals (for in-place execution
up to the maximum number of registers that can be held without spill).

for an 8-wide DCT that does mean 12 coefficients in registers *and* 8 input values;
however, it's not 12 + 8 + another 8 for a copy of the row.

with the right Vector ISA augmentations i believe it can be done in about 9
instructions, assuming all coefficients are already in regs. if they have to be
computed then assuming a hardware COS instruction it's more like... 20-25
including a DIV, +0.5, fcvt etc etc.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<2ed274e2-727d-41fe-a7ea-a20d9ab2e7abn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18981&group=comp.arch#18981

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Wed, 21 Jul 2021 04:14:05 -0700 (PDT)
In-Reply-To: <f4f50a1a-fade-466d-9941-6e5871c001afn@googlegroups.com>
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 11:14 UTC

Terje: one of the tricks that I will put into SVP64 to deal with this and get down
to 9 instructions is a bit-reversed LD instruction, combined with bit-reversed
Vector Element Indexing. the reason for that requires a little explanation.

in all the diagrams for DCT the butterfly layout is extremely pretty and obvious.
however if you look closer you find that the input is loaded sequentially but
the *output* ends up in bit-reversed order, which is a pain in the ass to fix inplace.

if however you load the data in bitreversed order but then *also access it
in bit-reversed order* then by the time the butterfly is completed the data
*ends up in sequential order*.

this trick inherently relies on bitreverse being its own inverse.
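the self-inverse property is easy to check with a throwaway sketch (plain Python, not SVP64 code; `bitrev` is a hypothetical helper):

```python
def bitrev(i, nbits):
    # reverse the low nbits of index i
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

# applying the permutation twice gives back the identity
assert all(bitrev(bitrev(i, 3), 3) == i for i in range(8))
```

so loading through the permutation and then indexing through it again cancels out, which is what leaves the butterfly output in sequential order.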

more later, because DCT does half swapping as well (0123 7654).

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sd8vvg$h5a$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18982&group=comp.arch#18982

Newsgroups: comp.arch
From: m.del...@this.bitsnbites.eu (Marcus)
Date: Wed, 21 Jul 2021 13:21:19 +0200
In-Reply-To: <f4f50a1a-fade-466d-9941-6e5871c001afn@googlegroups.com>
 by: Marcus - Wed, 21 Jul 2021 11:21 UTC

On 2021-07-21 12:38, luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 10:59:00 AM UTC+1, Terje Mathisen wrote:
>> luke.l...@gmail.com wrote:
>
> [snip]
>
> with the right Vector ISA augmentations i believe it can be done in about 9
> instructions, assuming all coefficients are already in regs. if they have to be
> computed then assuming a hardware COS instruction it's more like... 20-25
> including a DIV, +0.5, fcvt etc etc.

My gut feeling is that the most important FFT use cases to optimize for
are ones for fixed sizes. That is, anything related to video/audio
codecs, DSP or similar, where things are usually processed in batches of
fixed size blocks.

Free-size FFTs may be useful for scientific workloads, but I fail to
see that having a generic FFT() routine that would do different-sized
transformations on each call would be an important use case.

Just my 2 c.

/Marcus

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<f6b7bf90-f2b1-4159-894f-8a1edb789e93n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18983&group=comp.arch#18983

Newsgroups: comp.arch
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Date: Wed, 21 Jul 2021 04:51:40 -0700 (PDT)
In-Reply-To: <sd8vvg$h5a$1@dont-email.me>
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 11:51 UTC

On Wednesday, July 21, 2021 at 12:21:22 PM UTC+1, Marcus wrote:

> My gut feeling is that the most important FFT use cases to optimize for
> are ones for fixed sizes.

yes, there is an extremely common 8x8 2D DCT used for video processing
in pretty much all modern CODECs.
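As a reminder of why the 8x8 case is so regular: the 2D DCT is separable, so it decomposes into 1D transforms over the rows and then over the columns (8 + 8 = sixteen 8-point transforms). A minimal, unoptimized Python reference using the unscaled DCT-II; the names are illustrative only:

```python
import math

def dct_1d(x):
    # unscaled DCT-II of one row/column
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def dct_2d(block):
    # rows first, then columns -- this separability is what lets an
    # 8x8 2D DCT be built from sixteen 8-point 1D transforms.
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```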

> That is, anything related to video/audio
> codecs, DSP or similar, where things are usually processed in batches of
> fixed size blocks.

yes, this is my priority goal.

> Free-size FFTs may be useful for scientific workloads, but I fail to
> see that having a generic FFT() routine that does different-sized
> transforms on each call would be an important use case.

loop unrolling, the "usual" technique, takes up L1 cache. the algorithms i'm
doing fit into 2-3 cache lines, up to the limit of the number of registers available
(128 for the hybrid CPU-VPU-GPU we are developing).

Real-only FFT is only 3 instructions (assuming coefficients preloaded),
Complex is 11 (likewise). generating FFT coefficients is a Vectorised SIN/COS
(about another 10 instructions).
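A sketch of what that vectorised SIN/COS pass computes (the twiddle-factor table the butterflies consume); the function name is illustrative, not part of any ISA:

```python
import cmath, math

def fft_twiddles(n):
    # W_n^k = cos(theta) - j*sin(theta), theta = 2*pi*k/n.
    # one vectorised COS and one vectorised SIN over k = 0..n/2-1
    # produce the whole table an n-point FFT needs.
    return [complex(math.cos(2 * math.pi * k / n),
                    -math.sin(2 * math.pi * k / n))
            for k in range(n // 2)]
```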

free-size FFT i have left for another time.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sd9ahb$h7g$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=18990&group=comp.arch#18990

 by: Terje Mathisen - Wed, 21 Jul 2021 14:21 UTC

luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 12:21:22 PM UTC+1, Marcus wrote:
>
>> My gut feeling is that the most important FFT use cases to optimize for
>> are ones for fixed sizes.
>
> yes, there is an extremely common 8x8 2D DCT used for video processing
> in pretty much all modern CODECs.
>
>> That is, anything related to video/audio
>> codecs, DSP or similar, where things are usually processed in batches of
>> fixed size blocks.
>
> yes, this is my priority goal.

In Ogg-Vorbis, the (by far) most time-consuming operation is the IMDCT,
which after optimization runs at more or less the maximum SIMD vector
speed. The relevant sizes are 256/512/1024/2048 and all of them fit
easily in $L1 making the actual data processing the limiter.

I admire your goal of making this pretty much a single op, but it still
only matters if the actual throughput is significantly better than what
we get with current SIMD code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<5d829291-9361-42e9-b198-97bced29701en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18995&group=comp.arch#18995

 by: MitchAlsup - Wed, 21 Jul 2021 15:31 UTC

On Wednesday, July 21, 2021 at 4:21:37 AM UTC-5, luke.l...@gmail.com wrote:
> snip>
> there haven't been any recent Vector Processor innovations,
<
short vector SIMD does not count ??
<
VVM does not count ??
>
> etc. etc.
>
> l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<e3e503de-8a25-458f-830a-244ee1105da3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18996&group=comp.arch#18996

 by: MitchAlsup - Wed, 21 Jul 2021 15:38 UTC

On Wednesday, July 21, 2021 at 5:38:59 AM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 10:59:00 AM UTC+1, Terje Mathisen wrote:
> > luke.l...@gmail.com wrote:
>
> > I have written numerous *FT implementations,
> (caveat: i haven't. this is an advantage, new eyes, and a disadvantage, not knowing all the "tricks")
> > bit reversal have _never_ been a bottleneck.
> interesting.
> > The most obvious solution is to embed the bit reversal in the loop
> > construct, in which case the overhead becomes very close to zero,
<
> compared to say VLIW DSPs like the TMS320 and the qualcom hexagon which
> can do FFT in a single (admittedly VLIW) instruction and therefore keep FMAC
> pipelines 100% occupied *without multi-issue*, is that not resulting in slowdown?
<
Large FFTs have bad memory performance and keeping the FMACs 100% busy
and then waiting on memory is not optimal.
<
>
> the other thing about the algorithm (and associated ISA augmentations) is that
> even the simplest embedded implementations would be able to spam backend
> FMAC pipelines 100% full per clock; multi-issue engines would be able to do N IPC
> for N way issue.
> > or you
> > unroll by say 16 and get to handle the nybbles that way.
<
> loop unrolling is one of the design goals to entirely avoid (for inplace execution
> up to the maximum number of registers that can be held without spill).
>
> for an 8 wide DCT that does mean 12 coefficients in registers *and* 8 input values
> however it's not 12 + 8 + another 8 for a copy of the row.
>
> with the right Vector ISA augmentations i believe it can be done in about 9
> instructions, assuming all coefficients are already in regs. if they have to be
> computed then assuming a hardware COS instruction it's more like... 20-25
> including a DIV, +0.5, fcvt etc etc.
>
> l.
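For reference, the iterative in-place radix-2 structure under discussion, with the bit-reversal permutation folded into a loop counter rather than done via a lookup, looks roughly like this in Python. This is an illustrative textbook sketch, not the proposed ISA sequence:

```python
import cmath, math

def fft_inplace(a):
    n = len(a)                          # n must be a power of two
    # bit-reversal permutation: j tracks the bit-reversed counterpart
    # of i incrementally, i.e. the reversal is embedded in the loop.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) butterfly passes, each one FMAC-friendly
    m = 2
    while m <= n:
        wm = cmath.exp(-2j * math.pi / m)
        for k in range(0, n, m):
            w = 1.0 + 0j
            for t in range(m // 2):
                u = a[k + t]
                v = w * a[k + t + m // 2]
                a[k + t] = u + v
                a[k + t + m // 2] = u - v
                w *= wm
        m *= 2
    return a
```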

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<9013cb16-d34b-4655-823d-7f2745a87940n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19001&group=comp.arch#19001

 by: luke.l...@gmail.com - Wed, 21 Jul 2021 16:01 UTC

On Wednesday, July 21, 2021 at 4:31:56 PM UTC+1, MitchAlsup wrote:
> On Wednesday, July 21, 2021 at 4:21:37 AM UTC-5, luke.l...@gmail.com wrote:
> > snip>
> > there haven't been any recent Vector Processor innovations,
> <
> short vector SIMD does not count ??

i've not heard of that one. where can i find out about it?

> <
> VVM does not count ??

okok i meant, "no innovations in Cray-style Vector Processors",
the recent revival through RVV still uses the same paradigm (Horizontal-First),
NEC SX-Aurora was the last commercial HF (Cray style) ISA before
RVV.

VVM is Vertical-First, it's a completely different paradigm. allow me to re-state:

there haven't been any recent Cray-style (Horizontal-First) Vector Processor innovations,

sorry about that :)

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<4a8f6770-b1f2-454f-9523-6c5ce1048068n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19004&group=comp.arch#19004

 by: MitchAlsup - Wed, 21 Jul 2021 18:20 UTC

On Wednesday, July 21, 2021 at 11:01:05 AM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 4:31:56 PM UTC+1, MitchAlsup wrote:
> > On Wednesday, July 21, 2021 at 4:21:37 AM UTC-5, luke.l...@gmail.com wrote:
> > > snip>
> > > there haven't been any recent Vector Processor innovations,
> > <
> > short vector SIMD does not count ??
> i've not heard of that one. where can i find out about it?
MMX
SSE
SSE2
AVX
........
> > <
> > VVM does not count ??
> okok i meant, "no innovations in Cray-style Vector Processors",
> the recent revival through RVV still uses the same paradigm (Horizontal-First),
> NEC SX-Aurora was the last commercial HF (Cray style) ISA before
> RVV.
>
> VVM is Vertical-First, it's a completely different paradigm.
<
Agreed that it is a completely different paradigm.
But it is vectorization, just not in the CRAY way.
<
> allow me to re-state:
>
> there haven't been any recent Cray-style (Horizontal-First) Vector Processor innovations,
>
> sorry about that :)
<
I agree, but there is a reason for this. Cray vectors were a means to absorb
memory* latency, and the kinds of latencies associated with memory got
to the point where this style could not absorb as much memory latency as
existed, so it "ran out of steam" {Maybe except by using 128-256 registers
in a vector, but, still, the writing was on the wall.}
<
When the 80 MHz CRAY 1 came out mainframes were using 6 cycle memory
latency and super computers were using 10 cycle memory latency. When the
Y-MP came out memory latency was ½ in the routing to and from the memory
boards and ½ from the DRAM latencies.
<Warning hand waving accuracy follows>
This ended up in the 30 cycle range (½ a vector length), but now they were
running 4 lanes in parallel, so instead of the latency being ½ a vector
length it was 2× the vector length.
<
(*)You could add function unit latency, but it was really memory latency that
was the problem child (being unpredictable and all that.)
>
> l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<8b79d9c4-bd99-433b-86e7-0615e6b3a53cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19011&group=comp.arch#19011

 by: luke.l...@gmail.com - Wed, 21 Jul 2021 19:10 UTC

On Wednesday, July 21, 2021 at 3:21:33 PM UTC+1, Terje Mathisen wrote:

> In Ogg-Vorbis, the (by far) most time-consuming operation is the IMDCT,

(inverse dct is next on my list)

> which after optimization runs at more or less the maximum SIMD vector
> speed. The relevant sizes are 256/512/1024/2048 and all of them fit
> easily in $L1 making the actual data processing the limiter.

(btw i assume you've seen this?
https://ffmpeg.org/doxygen/1.0/dct32_8c-source.html)

if those runs are using this code (general-purpose)
https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/dct.c

that seems to do some pre-processing and post-processing then
chucks the data at a "Real" (non-imaginary) DFT
https://ffmpeg.org/doxygen/2.7/rdft_8c_source.html
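In outline, that pre/post-processing trick is Makhoul's even-index reordering, which reduces an N-point DCT-II to one same-length DFT plus a complex rotation. A Python sketch (a plain complex DFT stands in for FFmpeg's real-only transform; this illustrates the method, not FFmpeg's exact code):

```python
import cmath, math

def dct2_via_fft(x):
    n = len(x)
    # pre-process: even-index samples ascending, then odd-index
    # samples descending -- v = [x0, x2, ..., x5, x3, x1]
    v = x[0::2] + x[1::2][::-1]
    # naive N-point DFT stands in for the real-only DFT step
    V = [sum(v[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
         for k in range(n)]
    # post-process: rotate by e^{-j*pi*k/2N} and keep the real part
    return [(cmath.exp(-1j * math.pi * k / (2 * n)) * V[k]).real
            for k in range(n)]
```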

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<1bc2aad6-aeb5-4e38-8a79-af4806a89dfcn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19012&group=comp.arch#19012

 by: luke.l...@gmail.com - Wed, 21 Jul 2021 19:24 UTC

On Wednesday, July 21, 2021 at 7:20:29 PM UTC+1, MitchAlsup wrote:
> On Wednesday, July 21, 2021 at 11:01:05 AM UTC-5, luke.l...@gmail.com wrote:
> > > short vector SIMD does not count ??
> > i've not heard of that one. where can i find out about it?
> MMX
> SSE
> SSE2
> AVX

oh right, those. yes. god-awful messes, all of them. no, they don't count:
they're distinctly part of the problem, not the solution. as *back-ends*
to a decent Vector ISA (with a powerful permutation engine), no problem.

> I agree, but there is a reason for this. Cray vectors were a means to absorb
> memory* latency, and the kinds of latencies associated with memory got
> to the point where this style could not absorb as much memory latency as
> existed, so it "ran out of steam" {Maybe except by using 128-256 registers
> in a vector, but, still, the writing was on the walls.}

Peter Hsu (developer of MIPS R8000) recently told me he got to speak with
Seymour Cray in his later years. Seymour said that he predicted Vector ISAs
would hit a wall, not because of the processing capability but because of the
*system* level memory bandwidth (thinking 1000s+ machines) and inter-
processor and I/O communication bandwidth.

if the inter-processor and overall I/O communication bandwidth is the
bottleneck, and the power consumption of the same completely dwarfs
processing power consumption, what, then, is the point of having even a
*remotely* decent Vector ISA for a multi-cluster Supercomputer?

one key counter-answer to that is: if the "simpler" processor is so rubbish
that it consumes more power to do a decent job, the power consumption
of the entire cluster is badly compromised.

the other aspect is: for smaller systems (single-chip GPU and HPC scenarios)
the I/O and memory power consumption is a smaller percentage, and the
main CPU power consumption becomes significant for "broken" (SIMD)
designs.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<48100f40-9521-488e-afbb-a9d35f399c63n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19014&group=comp.arch#19014

 by: MitchAlsup - Wed, 21 Jul 2021 19:54 UTC

On Wednesday, July 21, 2021 at 2:24:15 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 7:20:29 PM UTC+1, MitchAlsup wrote:
> > On Wednesday, July 21, 2021 at 11:01:05 AM UTC-5, luke.l...@gmail.com wrote:
> > > > short vector SIMD does not count ??
> > > i've not heard of that one. where can i find out about it?
> > MMX
> > SSE
> > SSE2
> > AVX
> oh right, those. yes. god-awful messes, all of them. no, they don't count:
> they're distinctly part of the problem, not the solution. as *back-ends*
> to a decent Vector ISA (with a powerful permutation engine), no problem.
<
While I am in complete agreement with you on that (the mess part)
They are perceived as vectors and are newer than CRAY. As an architect
you need to develop the sensitivity to see these things outside of your
own point of view.
<
> > I agree, but there is a reason for this. Cray vectors were a means to absorb
> > memory* latency, and the kinds of latencies associated with memory got
> > to the point where this style could not absorb as much memory latency as
> > existed, so it "ran out of steam" {Maybe except by using 128-256 registers
> > in a vector, but, still, the writing was on the walls.}
<
> Peter Hsu (developer of MIPS R8000) recently told me he got to speak with
> Seymour Cray in his later years. Seymour said that he predicted Vector ISAs
> would hit a wall, not because of the processing capability but because of the
> *system* level memory bandwidth (thinking 1000s+ machines) and inter-
> processor and I/O communication bandwidth.
<
More or less what I mentioned above.
>
> if the inter-processor and overall I/O communication bandwidth is the
> bottleneck, and the power consumption of the same completely dwarfs
> processing power consumption, what, then, is the point of having even a
> *remotely* decent Vector ISA for a multi-cluster Supercomputer?
<
The economics have changed. When the CRAY-1 was being designed and constructed,
there were people with really large problems and lots of money (i.e., the
military) willing to spend it on big iron, such that making their problems
4× faster paid for the company!
<
AND the reason CRAY-1 was designed at 300KVA (300K Watts) was that this
is the biggest load one may have on the electrical grid that can be turned on
and off without calling up the power company first ! If you bought a CRAY-1
with a little memory system and wanted to upgrade to the bigger one later,
CRAY inserted resistor boards in the place of memory boards so that when
you bought the upgrade the power supply would see the same load before
and after.
<
This is no longer the case.
>
> one key counter-answer to that is: if the "simpler" processor is so rubbish
> that it consumes more power to do a decent job, the power consumption
> of the entire cluster is badly compromised.
>
> the other aspect is: for smaller systems (single-chip GPU and HPC scenarios)
> the I/O and memory power consumption is a smaller percentage, and the
> main CPU power consumption becomes significant for "broken" (SIMD)
> designs.
>
> l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sd9vi3$mq9$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=19018&group=comp.arch#19018

 by: Thomas Koenig - Wed, 21 Jul 2021 20:20 UTC

luke.l...@gmail.com <luke.leighton@gmail.com> schrieb:
> On Wednesday, July 21, 2021 at 7:20:29 PM UTC+1, MitchAlsup wrote:
>> On Wednesday, July 21, 2021 at 11:01:05 AM UTC-5, luke.l...@gmail.com wrote:
>> > > short vector SIMD does not count ??
>> > i've not heard of that one. where can i find out about it?
>> MMX
>> SSE
>> SSE2
>> AVX
>
> oh right, those. yes. god-awful messes, all of them. no, they don't count:
> they're distinctly part of the problem, not the solution.

I agree that these instruction sets are badly designed and a major
headache for compiler writers (anybody who doubts that can read
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and weep.)

For POWER, which you work on, there is also VSX. Again, far from
the best design, but it also does something.

All these instructions work well enough that, for example, you get
to 60-70% of the theoretical maximum floating-point performance
for matrix multiplication if you take a 30-year-old Fortran code
from Netlib, mechanically translate it to C via f2c, tune it for more
recent cache sizes, sprinkle a few restrict qualifiers here and
there, translate it into several versions for AVX 128, AVX 256 and
AVX 512, and select among them at runtime.
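The cache-size tuning being described is classic loop tiling: each tile of the operands is reused while it is still hot in cache. A minimal sketch of the idea (in Python for brevity; the code referred to above is C, and the function name here is illustrative):

```python
def matmul_blocked(a, b, bs=4):
    # C = A @ B with bs x bs loop tiling; the inner three loops
    # touch only one tile of A and B at a time, which is what the
    # "tune for more recent cache sizes" step adjusts.
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            c[i][j] += aik * b[k][j]
    return c
```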

The original code is at
http://www.netlib.org/blas/gemm_based/ssgemmbased.tgz
and what I am referring to is at
https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r8.c;hb=refs/heads/trunk

Now, that is something to beat.

(It would also be interesting to see how VVM fares on that code,
or if this optimization which originated for PowerPC is a
pessimization there).

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<5e7db9d6-a7c8-48b5-9bfd-42ba180b40a7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19022&group=comp.arch#19022

 by: luke.l...@gmail.com - Wed, 21 Jul 2021 21:48 UTC

On Wednesday, July 21, 2021 at 9:20:21 PM UTC+1, Thomas Koenig wrote:

> I agree that these instruction sets are badly designed and a major
> headache for compiler writers (anybody who doubts that can read
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and weep.)

dang - 200+ dependent bugs.

> For POWER, which you work on, there is also VSX.

in case you are not aware of the background, this video helps explain it:
https://youtu.be/gexy0z1YqFY?t=673

i have made it very bluntly plain and clear that i am flatly refusing
to implement any part of Power ISA SIMD in Libre-SOC. i consider it to
be so harmful a paradigm that even though EABIv2.0 makes it mandatory
i've still said "no f***** way". it's LITERALLY 700 extra instructions.
Power ISA's scalar instruction count is just over 200 (214 or so).
[which part of "RISC" was not clear, when VSX was added??]

this will leave our processor "behind" for a number of years whilst the
EABIv2.0 is sorted out and SIMD being mandatory is removed, replaced
with something that makes it optional again (but at the same time makes
it possible to run SIMD binaries). despite this, *i still said absolutely no
way* - SIMD is that bad - and we will instead use illegal instruction
trap-and-emulate.

to give you some idea, Paul Mackerras has added only some of the VSX
instructions to Microwatt, and it's increased the LUT4 usage in an FPGA
from 20,000 to *50,000*.

> Again, by far not
> the best of design, but they also do something.

yes - they cause no end of problems.

> All these instructions work well enough that, for example, you get
> to 60-70% of the theoretical maximum floating point performance
> for matrix multiplication

SVP64's Matrix REMAP Schedule will get 100%.

> if you take a 30-year old Fortran code
> from Netlib, mechanically translate it to C via f2c, tune for more
> recent cache sizes, sprinkle a few restrict qualifiers here and
> there and translate for several versions for AVX 128, AVX 256 and
> AVX 512 and select at runtime.

no assembly optimisations are needed for SVP64 Matrix-Multiply:
the code is too short to need them (and specifically designed for the job).

> The original code is at
> http://www.netlib.org/blas/gemm_based/ssgemmbased.tgz
> and what I am referring to is at
> https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r8.c;hb=refs/heads/trunk
> .
>
> Now, that is something to beat.

done, last week. replaced by three instructions.

yes, that wasn't a typo: three (3) instructions.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD

there are some caveats:

1) the size of the register file (Libre-SOC will be going for 128 FPs and 128 INTs)
is a limiting factor.
2) the total number of FMACs needed must not exceed 127.
3) if you need the result to be zero'd out before beginning, it's 5 instructions not 3.
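to make caveats 1) and 2) concrete, here's a quick feasibility check
(my own arithmetic, not Libre-SOC code; it assumes all three operands
must be register-resident at once, which is one plausible reading of
the scheme):

```python
def fits_svp64(n, m, p, fp_regs=128, max_fmacs=127):
    """Check whether an (n x m) by (m x p) multiply fits the caveats:
    A, B and the C accumulator all register-resident (caveat 1),
    and no more than 127 FMACs in total (caveat 2)."""
    regs = n * m + m * p + n * p
    return regs <= fp_regs and n * m * p <= max_fmacs

# largest square matmul that fits under both limits
largest = max(n for n in range(1, 12) if fits_svp64(n, n, n))
```

under those assumptions the FMAC limit bites first for square matrices:
5x5 needs 125 FMACs and only 75 registers, but 6x6 needs 216 FMACs.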

if you want to experiment and see what's going on, here's some stand-alone
executable code:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD

here's a unit test for it:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remapmatrixprint.py;hb=HEAD

what that does is yield any of the three indices x, y and z
needed for performing matrix element lookups (assuming the data has
been flattened from 2D to 1D).

it should also be possible to specify that the data has been transposed
(hmm, must add those as modes in the setup instruction, svshape),
which is simply a matter of changing which of the three loops
shall be inner and which outer.

it should also be possible to set "inversion" (mirroring) mode, such that
the input matrices may be considered mirrored (in any given dimension)
but *without* actually needing to copy it.
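the yield idea is easy to sketch. here's a simplified stand-in (*not*
the actual remapyield.py, just an illustration of the principle): a
generator walks the three nested loops, transposition is just reordering
the loops, and mirroring is just walking a range in reverse - no data
is ever copied:

```python
def matrix_indices(xdim, ydim, zdim, order=(0, 1, 2), mirror=(False,) * 3):
    """Yield (x, y, z) index triples for a triple-loop matrix schedule.
    `order` picks which dims nest outer/inner (transposition = reorder);
    `mirror` reverses the walk over a dim without copying any data."""
    dims = (xdim, ydim, zdim)
    ranges = [range(d - 1, -1, -1) if m else range(d)
              for d, m in zip(dims, mirror)]
    idx = [0, 0, 0]
    for i in ranges[order[0]]:
        idx[order[0]] = i
        for j in ranges[order[1]]:
            idx[order[1]] = j
            for k in ranges[order[2]]:
                idx[order[2]] = k
                yield tuple(idx)
```

each yielded triple can then be flattened to a 1-D element offset
(e.g. x*ydim + y) for the actual register/memory lookup.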

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<8e5a8ef6-3b9e-409b-b077-22f549b4fdc4n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=19023&group=comp.arch#19023
 by: MitchAlsup - Wed, 21 Jul 2021 22:44 UTC

On Wednesday, July 21, 2021 at 4:48:32 PM UTC-5, luke.l...@gmail.com wrote:
> [snip]
> > Now, that is something to beat.
> done, last week. replaced by three instructions.
>
> yes, that wasn't a typo: three (3) instructions.
>
> there are some caveats:
>
> 1) the size of the register file (Libre-SOC will be going for 128 FPs and 128 INTs)
> is a limiting factor.
> 2) the total number of FMACs needed must not exceed 127.
> 3) if you need the result to be zero'd out before beginning, it's 5 instructions not 3.
<
How many instructions get executed in a 10,000×10,000 matrix multiplication ?
In double precision. How many memory references are performed ?

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<6c8564c8-3bbb-4892-afdb-bfd7596b45d8n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=19025&group=comp.arch#19025
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 22:52 UTC

On Wednesday, July 21, 2021 at 8:54:37 PM UTC+1, MitchAlsup wrote:

> While I am in complete agreement with you on that (the mess part)
> They are perceived as vectors and are newer than CRAY.

yyeah, i see that.

> As an architect
> you need to develop the sensitivity to see these things outside of your
> own point of view.

it was an unintended use of a general expression, with more meanings
than the clear picture in my head i wished to convey at the time, apologies.

these are exactly the kinds of distinctions i am looking to document on wikipedia.
is there anything online that could be considered "notable" about VVM yet?
conference papers would do the trick.

> AND the reason CRAY-1 was designed at 300KVA (300K Watts) was that this
> is the biggest load one may have on the electrical grid that can be turned on
> and off without calling up the power company first !

that would have been fun: destabilising a country's electrical grid by switching
on one computer :)

> If you bought a CRAY-1
> with a little memory system and wanted to upgrade to the bigger one later,
> CRAY inserted resistor boards in the place of memory boards so that when
> you bought the upgrade the power supply would see the same load before
> and after.

dang. otherwise you get issues with the cooling system as well.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<7c40b9f9-4944-4b40-8eb8-13577cc5cdbfn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=19026&group=comp.arch#19026
 by: luke.l...@gmail.com - Wed, 21 Jul 2021 23:04 UTC

On Wednesday, July 21, 2021 at 11:44:54 PM UTC+1, MitchAlsup wrote:

> How many instructions get executed in a 10,000×10,000 matrix multiplication ?
> In double precision. How many memory references are performed ?

i concentrated on the smaller matrices for now, given the 3D focus for LibreSOC.
in 1990 this was the focus of one of our first year exercises at Imperial.
we had to assume that the matrices were far larger than memory could
handle.

it should be 10000^3 FMACs.

memory load/stores: at the time (assuming real memory) i kinda cheated
(not really) and simply went through all possible permutations of partial
loads, assuming at the end of a row that the partially-loaded data could be
left in memory and the summing be done in *reverse* order.

when you get to the end of a row, leave *that* data in place, change
direction and start loading columns in forward order again.

saved a tiny bit of LD/STs.

partial row loads were a bit of a pain, and the pre-analysis algorithm
could cope, but it was brute force.

consequently, i didn't do any actual analysis, so have no idea what
the answer would be :) of course, today, caches are involved, which
would need extra work.
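as a sanity check on the 10000^3 figure, counting operations and element
accesses for the naive schedule (my own back-of-envelope arithmetic,
ignoring caches as noted above - not simulator output):

```python
def naive_counts(n):
    """Operation and element-access counts for a naive n x n x n matmul,
    assuming each C[i][j] accumulates in a register across the k loop."""
    fmacs = n ** 3            # one FMAC per (i, j, k) triple
    loads = 2 * n ** 3        # one A and one B element per FMAC
    stores = n ** 2           # write each C element out once
    return fmacs, loads, stores

fmacs, loads, stores = naive_counts(10_000)
```

so a trillion FMACs and, without any reuse in registers or cache,
two trillion element loads.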

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<f2db11bc-e750-4008-9921-7a77fa0bef4cn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=19027&group=comp.arch#19027
 by: JimBrakefield - Wed, 21 Jul 2021 23:34 UTC

On Wednesday, July 21, 2021 at 5:44:54 PM UTC-5, MitchAlsup wrote:
> [snip]
This is a trick question: it's really a question of how to make best use
of the caches - or, how do you manage cache churn?
|> How many instructions get executed in a 10,000×10,000 matrix multiplication ?
|> In double precision. How many memory references are performed ?
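One rough way to quantify the trick is the standard blocked-matmul
traffic model (back-of-envelope only; the 32 KiB L1 used below is just
an example cache size, not any particular machine):

```python
import math

def traffic_words(n, cache_words):
    """Approximate main-memory traffic (in matrix elements) for an
    n x n matmul: naive (every A/B element re-fetched per use) versus
    cache-blocked, where three b x b tiles fit in the cache at once."""
    naive = 2 * n ** 3 + n ** 2
    b = int(math.sqrt(cache_words / 3))     # tile edge: 3*b*b <= cache
    blocked = 2 * n ** 3 // b + n ** 2      # one A/B tile fetch per block-multiply
    return naive, blocked

# 32 KiB of doubles = 4096 words -> tile edge b = 36
naive, blocked = traffic_words(10_000, 4096)
```

Blocking cuts the memory traffic by roughly a factor of b, which is why
the cache-tuned libgfortran code gets so close to peak.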
