Rocksolid Light



devel / comp.arch / Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

Subject  Author
* Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
+- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
+* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Branimir Maksimovic
|`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
| `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Branimir Maksimovic
|  `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|   +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Terje Mathisen
|   |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|   | +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|   | +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Marcus
|   | |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|   | | `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Terje Mathisen
|   | |  `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|   | `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|   `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|    `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|     `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|      `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|       +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|       |`- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|       `* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Thomas Koenig
|        +* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|        |+* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|        ||+- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|        ||`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  JimBrakefield
|        || +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  MitchAlsup
|        || `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|        |`* Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Thomas Koenig
|        | `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|        +- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com
|        `- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  Terje Mathisen
`- Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm  luke.l...@gmail.com

Pages: 1 2
Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<a8579534-0ab6-43cc-8b30-94f4582c4aa6n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19028&group=comp.arch#19028

Newsgroups: comp.arch
X-Received: by 2002:ad4:5d62:: with SMTP id fn2mr38842425qvb.61.1626911845897; Wed, 21 Jul 2021 16:57:25 -0700 (PDT)
X-Received: by 2002:a05:6808:1313:: with SMTP id y19mr26317074oiv.37.1626911845634; Wed, 21 Jul 2021 16:57:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!tr2.eu1.usenetexpress.com!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 21 Jul 2021 16:57:25 -0700 (PDT)
In-Reply-To: <f2db11bc-e750-4008-9921-7a77fa0bef4cn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:401c:aaae:86e:918; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:401c:aaae:86e:918
References: <debd883d-6391-48dc-9628-ac6519809ec5n@googlegroups.com> <n8HJI.28032$qk6.1643@fx36.iad> <fa1f3c1f-96bc-425b-9663-addaa4fef487n@googlegroups.com> <hNHJI.25335$rr3.14798@fx34.iad> <6789ffa8-6f88-4f97-810c-91b3ed8f8defn@googlegroups.com> <5d829291-9361-42e9-b198-97bced29701en@googlegroups.com> <9013cb16-d34b-4655-823d-7f2745a87940n@googlegroups.com> <4a8f6770-b1f2-454f-9523-6c5ce1048068n@googlegroups.com> <1bc2aad6-aeb5-4e38-8a79-af4806a89dfcn@googlegroups.com> <sd9vi3$mq9$1@newsreader4.netcologne.de> <5e7db9d6-a7c8-48b5-9bfd-42ba180b40a7n@googlegroups.com> <8e5a8ef6-3b9e-409b-b077-22f549b4fdc4n@googlegroups.com> <f2db11bc-e750-4008-9921-7a77fa0bef4cn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a8579534-0ab6-43cc-8b30-94f4582c4aa6n@googlegroups.com>
Subject: Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 21 Jul 2021 23:57:25 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 133
 by: MitchAlsup - Wed, 21 Jul 2021 23:57 UTC

On Wednesday, July 21, 2021 at 6:34:03 PM UTC-5, JimBrakefield wrote:
> On Wednesday, July 21, 2021 at 5:44:54 PM UTC-5, MitchAlsup wrote:
> > On Wednesday, July 21, 2021 at 4:48:32 PM UTC-5, luke.l...@gmail.com wrote:
> > > On Wednesday, July 21, 2021 at 9:20:21 PM UTC+1, Thomas Koenig wrote:
> > >
> > > > I agree that these instruction sets are badly designed and a major
> > > > headache for compiler writers (anybody who doubts that can read
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and weep.)
> > > dang - 200+ dependent bugs.
> > > > For POWER, which you work on, there is also VFX.
> > > you may not be aware, this video helps explain it,
> > > https://youtu.be/gexy0z1YqFY?t=673
> > >
> > > i have made it very bluntly plain and clear that i am flatly refusing
> > > to implement any part of Power ISA SIMD in Libre-SOC. i consider it to
> > > be so harmful a paradigm that even though EABIv2.0 makes it mandatory
> > > i've still said "no f***** way". it's LITERALLY 700 extra instructions.
> > > Power ISA's scalar instruction count is just over 200 (214 or so).
> > > [which part of "RISC" was not clear, when VSX was added??]
> > >
> > > this will leave our processor "behind" for a number of years whilst the
> > > EABIv2.0 is sorted out and SIMD being mandatory is removed, replaced
> > > with something that makes it optional again (but at the same time makes
> > > it possible to run SIMD binaries). despite this, *i still said absolutely no
> > > way* - SIMD is that bad - and we will instead use illegal instruction
> > > trap-and-emulate.
> > >
> > > to give you some idea, Paul Mackerras has added only some of the VSX
> > > instructions to Microwatt, and it's increased the LUT4 usage in an FPGA
> > > from 20,000 to *50,000*.
> > > > Again, by far not
> > > > the best of design, but they also do something.
> > > yes - cause no end of problems.
> > > > All these instructions work well enough that, for example, you get
> > > > to 60-70% of the theoretical maximum floating point performance
> > > > for matrix multiplication
> > > SVP64's Matrix REMAP Schedule will get 100%.
> > > > if you take a 30-year old Fortran code
> > > > from Netlib, mechanically translate it to C via f2c, tune for more
> > > > recent cache sizes, sprinkle a few restrict qualifiers here and
> > > > there and translate for several versions for AVX 128, AVX 256 and
> > > > AVX 512 and select at runtime.
> > > no assembly optimisations are needed for SVP64 Matrix-Multiply.
> > > it's too short (and specifically designed for the job).
> > > > The original code is at
> > > > http://www.netlib.org/blas/gemm_based/ssgemmbased.tgz
> > > > and what I am referring to is at
> > > > https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r8.c;hb=refs/heads/trunk
> > > > .
> > > >
> > > > Now, that is something to beat.
> > > done, last week. replaced by three instructions.
> > >
> > > yes, that wasn't a typo: three (3) instructions.
> > >
> > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD
> > >
> > > there are some caveats:
> > >
> > > 1) the size of the register file (Libre-SOC will be going for 128 FPs and 128 INTs)
> > > is a limiting factor.
> > > 2) the total number of FMACs needed must not exceed 127.
> > > 3) if you need the result to be zero'd out before beginning, it's 5 instructions not 3.
> > <
> > How many instructions get executed in a 10,000×10,000 matrix multiplication ?
> > In double precision. How many memory references are performed ?
> > >
> > > if you want to experiment and see what's going on, here's some stand-alone
> > > executable code:
> > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD
> > >
> > > here's a unit test for it:
> > > https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remapmatrixprint.py;hb=HEAD
> > >
> > > what that does is, it can be used to yield any of the three indices x y and z
> > > needed for performing matrix element lookups (assuming the data has
> > > been flattened from 2D to 1D).
> > >
> > > it should also be possible to specify that the data has been transposed
> > > (hmm, must add those as modes in the setup instruction, svshape)
> > > which is simply a matter of turning the sequence-order round of which
> > > of the three loops shall be inner and which outer.
> > >
> > > it should also be possible to set "inversion" (mirroring) mode, such that
> > > the input matrices may be considered mirrored (in any given dimension)
> > > but *without* actually needing to copy it.
> > >
> > > l.
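[The index-yielding generator described above can be sketched in stand-alone Python. This is a hypothetical simplification for illustration only, not the actual remapyield.py linked in the post; all names here are made up:]

```python
# Sketch of a REMAP-style schedule generator for matrix multiply:
# yields flattened 1D offsets for C[y][x] += A[y][z] * B[z][x].
# Changing `order` is loop interchange (e.g. a transposed operand):
# no data moves, only the sequence of indices changes.

def matmul_remap(xdim, ydim, zdim, order=("y", "x", "z")):
    """Yield (a_off, b_off, c_off) flat offsets, one triple per FMAC."""
    ranges = {"x": range(xdim), "y": range(ydim), "z": range(zdim)}
    o0, o1, o2 = order
    for i0 in ranges[o0]:
        for i1 in ranges[o1]:
            for i2 in ranges[o2]:
                idx = {o0: i0, o1: i1, o2: i2}
                x, y, z = idx["x"], idx["y"], idx["z"]
                yield (y * zdim + z,   # offset into flattened A
                       z * xdim + x,   # offset into flattened B
                       y * xdim + x)   # offset into flattened C

def matmul(A, B, xdim, ydim, zdim):
    """Drive a flat FMAC loop from the schedule."""
    C = [0.0] * (xdim * ydim)
    for a, b, c in matmul_remap(xdim, ydim, zdim):
        C[c] += A[a] * B[b]
    return C
```

[In the hardware version the triple loop lives in the svshape/svremap state and a single vectorised FMAC consumes the offsets; the Python loop is only a model of that schedule.]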
> This is a trick question: It's a question of how to make best use of the caches.
> Or how do you manage cache churn?
> |> How many instructions get executed in a 10,000×10,000 matrix multiplication ?
> |> In double precision. How many memory references are performed ?
<
I think the real question is "Why not just use *GEMM() from BLAS level 3??"
<
Knowing this is for scale/rotate/projection of 3D graphics is justification
for doing something other than DGEMM, but random matrix multiplication
on real arrays is not (in that case use DGEMM--it will surprise you how
efficient it is.)

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<2319d9b8-7df1-4bae-933f-4e8bc99809a8n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19031&group=comp.arch#19031

 by: luke.l...@gmail.com - Thu, 22 Jul 2021 04:20 UTC

On Wednesday, July 21, 2021 at 9:20:21 PM UTC+1, Thomas Koenig wrote:

> The original code is at
> http://www.netlib.org/blas/gemm_based/ssgemmbased.tgz
> and what I am referring to is at
> https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r8.c;hb=refs/heads/trunk

Thomas: apologies - small screen. I originally looked at the
function header and went, "naah, it'll be some tiny matmul function",
therefore comparable to SVP64 MM REMAP.

naah :)

Mitch later points out that the smaller matrices for 3D work are
quite different workloads. these are primarily lots of repeated
LOAD PROCESS STORE, nothing common between them, and regfiles
have to be made big enough to prevent spill (128, 256).

Much of what I am doing is targeted at that style of workload
(8x8 DCT/IMDCT), no register spill.

I will have to look again at the larger DCT though because if it
is quite common then it is worth investigating. although, if
the better algorithm to use is RDFT i have that covered as well
in a different way.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sdav10$c2s$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=19032&group=comp.arch#19032

 by: Thomas Koenig - Thu, 22 Jul 2021 05:17 UTC

luke.l...@gmail.com <luke.leighton@gmail.com> schrieb:
> On Wednesday, July 21, 2021 at 9:20:21 PM UTC+1, Thomas Koenig wrote:

>> All these instructions work well enough that, for example, you get
>> to 60-70% of the theoretical maximum floating point performance
>> for matrix multiplication
>
> SVP64's Matrix REMAP Schedule will get 100%.
>
>> if you take a 30-year old Fortran code
>> from Netlib, mechanically translate it to C via f2c, tune for more
>> recent cache sizes, sprinkle a few restrict qualifiers here and
>> there and translate for several versions for AVX 128, AVX 256 and
>> AVX 512 and select at runtime.
>
> no assembly optimisations are needed for SVP64 Matrix-Multiply.
> it's too short (and specifically designed for the job).
>
>> The original code is at
>> http://www.netlib.org/blas/gemm_based/ssgemmbased.tgz
>> and what I am referring to is at
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r8.c;hb=refs/heads/trunk
>> .
>>
>> Now, that is something to beat.
>
> done, last week. replaced by three instructions.

Do your three instructions also reorder loops, interchange strides
and copy stuff around into a buffer? :-)

If you take a look at the source, this is where you'll find the
performance mostly comes from.

A simple looping construct, even fully vectorized, will not even
start to come close for large matrices. It's how caching is dealt
with that allows whatever vectorization / SIMD / ... happens in
the central loop(s) to be used efficiently.

Or, if I may quote Terje, "Almost all programming can be viewed
as an exercise in caching."
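[The blocking Thomas describes can be illustrated with a minimal cache-blocked multiply. This is a toy sketch, nothing like the tuned libgfortran code; the block size and structure are illustrative only:]

```python
# Minimal sketch of cache-blocked matrix multiply. The real libgfortran
# matmul_r8.c additionally interchanges strides and copies panels into
# contiguous buffers; here only loop blocking and hoisting are shown.

def matmul_blocked(A, B, n, bs=4):
    """C = A*B for n*n row-major flat lists, walked in bs-sized blocks
    so each block of B is reused while it is still hot in cache."""
    C = [0.0] * (n * n)
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        # hoist A[i][k] out of the innermost loop; a full
                        # implementation would also copy panels of A and B
                        # into small contiguous buffers before this point
                        a = A[i * n + k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C
```

[The arithmetic is identical to the naive triple loop; only the traversal order changes, which is exactly the "exercise in caching" point.]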

> yes, that wasn't a typo: three (3) instructions.
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD
>
> there are some caveats:
>
> 1) the size of the register file (Libre-SOC will be going for 128 FPs and 128 INTs)
> is a limiting factor.

Ah.

I thought that the aim of that project was to stay compatible with
POWER, to make use of the existing infrastructure, but looking at
the web site, that does not appear to be the case.

So, if you want to move away from POWER anyway, what is the reason
to start with it? I can see why you would not want to use RISC-V
(whose drawbacks we recently discussed :-), but what about Mitch's
My 66000? It is certainly a much cleaner design than POWER,
which has grown over x generations.

[...]

> it should also be possible to specify that the data has been transposed
> (hmm, must add those as modes in the setup instruction, svshape)
> which is simply a matter of turning the sequence-order round of which
> of the three loops shall be inner and which outer.
>
> it should also be possible to set "inversion" (mirroring) mode, such that
> the input matrices may be considered mirrored (in any given dimension)
> but *without* actually needing to copy it.

For large matrices, that is of questionable use because you need
to copy into cache-friendly buffers anyway.

For matrices much smaller than relevant for caches (3*3 or 4*4
matrices are also often used) that can of course be a win.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<sdbgfk$in9$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=19040&group=comp.arch#19040

 by: Terje Mathisen - Thu, 22 Jul 2021 10:15 UTC

Thomas Koenig wrote:
> luke.l...@gmail.com <luke.leighton@gmail.com> schrieb:
>> On Wednesday, July 21, 2021 at 7:20:29 PM UTC+1, MitchAlsup wrote:
>>> On Wednesday, July 21, 2021 at 11:01:05 AM UTC-5, luke.l...@gmail.com wrote:
>>>>> short vector SIMD does not count ??
>>>> i've not heard of that one. where can i find out about it?
>>> MMX
>>> SSE
>>> SSE2
>>> AVX
>>
>> oh right, those. yes. god-awful messes, all of them. no, they don't count:
>> they're distinctly part of the problem, not the solution.
>
> I agree that these instruction sets are badly designed and a major
> headache for compiler writers (anybody who doubts that can read
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and weep.)

I like the status ("verified") and the "Depends on:" list. :-)
>
> For POWER, which you work on, there is also VFX. Again, by far not
> the best of design, but they also do something.
>
> All these instructions work well enough that, for example, you get
> to 60-70% of the theoretical maximum floating point performance

My rule of thumb for optimization work has been "is this within a factor
of two of the speed of light?", i.e. the theoretical maximum.

If it is, I'll usually go on to look for other opportunities unless I've
already done so and this is the final iteration of a competition.

> for matrix multiplication if you take a 30-year old Fortran code
> from Netlib, mechanically translate it to C via f2c, tune for more
> recent cache sizes, sprinkle a few restrict qualifiers here and
> there and translate for several versions for AVX 128, AVX 256 and
> AVX 512 and select at runtime.

That's usually the way to the fastest code; I like to use runtime
version selection, benchmarking the various applicable versions and
picking the fastest on the current HW.
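[A minimal sketch of this runtime-selection idea, with trivial stand-in functions rather than real AVX-width kernels (the names and structure are made up for illustration):]

```python
import timeit

# Runtime version selection: benchmark each applicable variant once on
# the current hardware, then dispatch all further calls to the winner.

def select_fastest(variants, sample_arg, repeats=100):
    """Return the variant with the lowest measured runtime on sample_arg."""
    def cost(fn):
        return timeit.timeit(lambda: fn(sample_arg), number=repeats)
    return min(variants, key=cost)

variant_a = sum                                # stand-in for kernel 1
variant_b = lambda xs: sum(x for x in xs)      # stand-in for kernel 2

best = select_fastest([variant_a, variant_b], list(range(1000)))
```

[In a real library the candidates would be the AVX-128/256/512 builds mentioned above, the sample argument a representative problem size, and the choice cached per process.]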

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<0021fce6-a5e2-4057-82a5-e0b20f9d4201n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19042&group=comp.arch#19042

 by: luke.l...@gmail.com - Thu, 22 Jul 2021 10:42 UTC

On Thursday, July 22, 2021 at 6:17:22 AM UTC+1, Thomas Koenig wrote:

> > done, last week. replaced by three instructions.
> Do your three instructions also reorder loops,

yes, that's exactly what they're designed to do. it's termed "REMAP".

> interchange strides

yes.

> and copy stuff around in to a buffer? :-)

that's an "action". so you would apply loop-reordering (REMAP)
*to* a MV or a LD or a ST.

> If you take a look at the source, this is where you'll find the
> performance mostly comes from.
>
> A simple looping construct, even fully vectorized, will not even
> start come close for large matrices.

no, correct: i didn't read the source in detail (small screen),
agreed 100%.

however, we're not _doing_ a "simple" looping construct....
but, i have to focus, unfortunately: i'd love to go down the
additional rabbithole of coping with 10k+ matrix dimensions,
but i'll have to settle for 3D workloads (3, 4, 5, 8 dimensions) at
the moment.

> Or, if I may quote Terje, "Almost all programming can be viewed
> as an exercise in caching."

:)

> > yes, that wasn't a typo: three (3) instructions.
> >
> > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD
> >
> > there are some caveats:
> >
> > 1) the size of the register file (Libre-SOC will be going for 128 FPs and 128 INTs)
> > is a limiting factor.
> Ah.
>
> I thought that the aim of that project was to stay compatible with
> POWER,

with the newly-created "Embedded Floating Point Compliancy Subset",
yes. what we will then do is take advantage of the fact that it is OPTIONAL
to add say an MMU, and anything else.

Henriok and i worked to update the wikipedia page a couple months back:
https://en.wikipedia.org/wiki/Power_ISA#Compliancy

> to make use of the existing infrastructure, but looking at
> the web site, that does not appear to be the case.

at the low level (basic Scalar level), yes. we can execute basic non-accelerated
programs.

that is good enough to get started.

however a SIMD ISA as a front-end paradigm is just hopelessly inadequate
for 3D VPU GPU work.

it would be commercially suicidal to even attempt to use SIMD (designed
around 2003 and great for the PS3), in the year 2021, for today's GPU and
VPU workloads.

> So, if you want to move away from POWER anyway, what is the reason
> to start with it?

because of the existing infrastructure.

> I can see why you would not want to use RISC-V
> (whose drawbacks we recently discussed :-),

this says everything that needs to be said... technically:
https://news.ycombinator.com/item?id=24459314

> but what about Mitch's
> My 66000? It is certainly a much cleaner design than POWER,
> which has grown over x generations.

llvm gcc binutils linux kernel u-boot libc6 OSes packages libraries mindshare
zephyr RTOS openocd

add up the cost in time alone of developing ports of those and you start
to see why OpenRISC 1k took around 12 years to reach maturity.

> For large matrices, that is of questionable use because you need
> to copy into cache-friendly buffers anyway.
>
> For matrices much smaller than relevant for caches (3*3 or 4*4
> matrices are also often used) that can of course be a win.

yes, that's the plan, that's what SVP64 REMAP is primarily designed for.

l.

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<99ea375d-765f-425f-a547-d937613eb21cn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19045&group=comp.arch#19045

 by: luke.l...@gmail.com - Thu, 22 Jul 2021 12:03 UTC

On Thursday, July 22, 2021 at 12:34:03 AM UTC+1, JimBrakefield wrote:

> This is a trick question: It's a question of how to make best use of the caches.
> Or how do you manage cache churn?

fascinatingly, then, the algorithm i developed err... 30 years ago would still be relevant
if adapted to L1/L2 caches rather than the original exercise: real memory plus much
slower external storage.

the last row (middle loop) can be kept in memory before moving on to the next column.

for i in 0 to n
    start, end, direction = end, start, -direction
    for j in start to end by direction
        for k in 0 to p
            madds

except, err, offset the middle access by one so that the values loaded from the middle row
can be left in memory for one inner loop.

efficient large matrix multiply is a whooole rabbithole area of computer science research :)

l.
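[One possible runnable reading of the loop sketch above; the mapping of i/j/k onto matrix indices is a guess, not taken from the post:]

```python
# The j loop walks back and forth ("boustrophedon" order), so the row
# touched last on one pass is the first one reused on the next pass -
# that is what lets it stay resident in the faster level of the memory
# hierarchy, whether that is L1/L2 cache or, originally, real memory
# in front of slow external storage.

def matmul_boustrophedon(A, B, n):
    """C = A*B for n*n row-major flat lists, alternating j direction."""
    C = [0.0] * (n * n)
    js = list(range(n))
    for i in range(n):
        js.reverse()               # start, end, direction swap each pass
        for j in js:
            for k in range(n):
                C[i * n + j] += A[i * n + k] * B[k * n + j]
    return C
```

[The traversal order has no effect on the result, only on which operands are re-fetched from the slower store.]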

Re: Iterative in-place RADIX-2 Discrete Cosine Transform algorithm

<83c9a386-1fca-4536-b912-af5fa30031c5n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19115&group=comp.arch#19115

 by: luke.l...@gmail.com - Sat, 24 Jul 2021 07:57 UTC

now implemented, it's 5 instructions to perform the inner and outer
butterfly schedules:

"svremap 27, 1, 0, 2, 0, 1, 1",
"svshape 8, 1, 1, 2, 0",
"sv.fdmadds 0.v, 0.v, 0.v, 8.v",
"svshape 8, 1, 1, 3, 0",
"sv.fadds 0.v, 0.v, 0.v"

the first instruction should technically speaking come after the
second. svshape 8,1,1,2,0 establishes a DCT of length 8,
where "2" indicates "outer butterfly mode". this sets up
the REMAP schedule but does not say which actual registers
the REMAPing should apply to: that's svremap's job.

in binary, 27 - "0b11011" - applies to each of the 5 arguments,
which represent 1st src, 2nd src, 3rd src, 1st dest and 2nd dest
respectively, saying which register should be REMAPed to which
schedule (0-3).

sv.fdmadds is the core in-place (twin) MUL-ADD-SUB, it does this:

FRS <- FPADD32(FRA, FRB)
tmp <- FPSUB32(FRA, FRB)
FRT <- FPMUL32(FRC, tmp)

so it is *three* input registers and *two* outputs - this is key
to achieving an in-place DCT because the registers which are
going to be overwritten are "in-flight" in the same pipeline
stage.
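[The sv.fdmadds pseudocode above can be modelled directly, with plain Python floats standing in for the FP32 registers (function names are made up; results are returned rather than written to a register file):]

```python
# Model of the sv.fdmadds twin butterfly from the pseudocode above:
#   FRS <- FPADD32(FRA, FRB)
#   tmp <- FPSUB32(FRA, FRB)
#   FRT <- FPMUL32(FRC, tmp)
# three inputs, two outputs.

def fdmadds(fra, frb, frc):
    frs = fra + frb          # sum half of the butterfly
    tmp = fra - frb          # difference, kept "in flight"
    frt = frc * tmp          # scaled by the (cosine) coefficient
    return frs, frt

# Because both results are computed from values already read, the two
# outputs may safely overwrite the two value inputs - which is what
# makes the in-place DCT butterfly possible:
def butterfly_inplace(x, i, j, cos_coef):
    x[i], x[j] = fdmadds(x[i], x[j], cos_coef)
```

[In hardware the "already read" property comes from all three operands being in the same pipeline stage before either destination is written.]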

fourth instruction is the "inner" butterfly of the Lee DCT, and
this performs the usual iterative shifted-by-one adds. nothing
special, here: simple register RaW / WaR hazard avoidance
can ensure that the chain of adds operate correctly:
ADD x[i], x[i-1]

where it will get very interesting is to do in-place DCT without
having the COS tables pre-computed, using Vertical-First
Mode (similar to VVM)

l.
