Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: luke.l...@gmail.com - Fri, 9 Jul 2021 10:14 UTC

hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:

https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1266246-libre-soc-test-asic-going-to-fabrication-using-tsmc-180nm-process
https://openpowerfoundation.org/libre-soc-180nm-power-isa-asic-submitted-to-imec-for-fabrication/

anyway, that as an aside: we're currently exploring what we're calling "Vertical Vectors". fit enough Vs in and you might end up creating a new buzzword nobody's thought of before.

the current Cray-style Vector system that we have designed is near-identical to the old x86 "REP" instruction. REP ADD would do:

for i = 0 to VL-1
    Regfile[RT+i] = ADD(Regfile[RA+i], Regfile[RB+i])

the next phase on top of that is to add a REMAP system:

for i = 0 to VL-1
    Regfile[RT+REMAP1(i)] =
        ADD(Regfile[RA+REMAP2(i)],
            Regfile[RB+REMAP3(i)])

by designing hardware REMAP you can do *in-place* Matrix Multiply, or even a full RADIX-N Butterfly schedule for FFT, DCT, DFT and so on.
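
A minimal Python model of the two loops above may help make the indirection concrete. It is a sketch only: `vector_add`, the flat `regfile` list and the remap callables are illustrative names, not part of SVP64. With no remaps supplied it reduces to the plain REP-style loop.

```python
# Illustrative model only: names and signatures here are hypothetical.

def vector_add(regfile, rt, ra, rb, vl, remap1=None, remap2=None, remap3=None):
    """Element loop of a REP-style vector ADD with optional per-operand REMAP."""
    identity = lambda i: i
    remap1 = remap1 or identity
    remap2 = remap2 or identity
    remap3 = remap3 or identity
    for i in range(vl):
        # Each operand gets its own element offset, as in the pseudocode.
        regfile[rt + remap1(i)] = regfile[ra + remap2(i)] + regfile[rb + remap3(i)]


# Registers 0-3 are the destination, 4-7 hold A, 8-11 hold B.
regs = [0, 0, 0, 0, 1, 2, 3, 4, 10, 20, 30, 40]

plain = list(regs)
vector_add(plain, rt=0, ra=4, rb=8, vl=4)

# Same instruction, but the B operand is walked in reverse order.
reversed_b = list(regs)
vector_add(reversed_b, rt=0, ra=4, rb=8, vl=4, remap3=lambda i: 3 - i)
```

The point of the sketch is that the arithmetic instruction itself never changes; only the element offsets fed to it do.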

then you can do the ENTIRE butterfly schedule by adding a special 3-input 2-output instruction which does this:

TWINFFMADD (RT, RS, RA, RC, RB):
    temp = RA * RC
    RT = RB - temp
    RS = RB + temp

this we have implemented in a simulator: it works really well.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
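
As a rough illustration of the twin multiply-add described above, here is a small Python sketch. It assumes the second result goes to RS as RB + temp, which is how I read the "3-input 2-output" description; `twin_fmadd` and the list-based register file are hypothetical names, not the simulator's API.

```python
# Hypothetical model of the 3-input, 2-output butterfly instruction.

def twin_fmadd(regfile, rt, rs, ra, rc, rb):
    """Compute RB - RA*RC into RT and RB + RA*RC into RS."""
    temp = regfile[ra] * regfile[rc]
    # Read every input before writing, so in-place use (rt == ra, rs == rb)
    # behaves the same as use with distinct destination registers.
    difference = regfile[rb] - temp
    total = regfile[rb] + temp
    regfile[rt] = difference
    regfile[rs] = total


# regs[0] = upper butterfly input, regs[1] = lower input, regs[2] = coefficient.
regs = [3.0, 2.0, 0.5]
twin_fmadd(regs, rt=0, rs=1, ra=0, rc=2, rb=1)
```

Used in place like this, one call replaces both butterfly inputs with their outputs, which is what lets an FFT stage run without scratch registers.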

however, as you no doubt are aware, FFTs involve complex numbers, and if you want to do in-place FFTs with only a single instruction, you have to have Complex numbers as a *FIRST ORDER* primitive in your ISA. this we considered to be too much.

so it got me wondering: "hmmm, what would happen if we allowed the REMAP schedules to be applied to multiple instructions before moving on to the next for-loop in the sequence?" effectively this:

for i = 0 to VL-1
    rt = REMAP1(i)
    ra = REMAP2(i)
    rb = REMAP3(i)
    Regfile[RT+rt] = ADD(Regfile[RA+ra], Regfile[RB+rb])
    Regfile[RS+rt] = MUL(Regfile[RA+ra], Regfile[RB+rb])

in other words you *TURN ROUND* the looping, allowing multiple elements to be executed under the same schedule, then move the "Schedule" on by one (with an explicit instruction), then branch back.
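
The difference in issue order can be sketched in a few lines of Python. This only models the order in which (instruction, element) pairs are executed; the function names are mine, and "horizontal-first" is used here simply as a label for the existing one-instruction-over-all-elements behaviour.

```python
# Illustrative only: models execution order, not the instructions themselves.

def horizontal_first(instructions, vl):
    """Each instruction runs over all VL elements before the next one starts."""
    return [(name, i) for name in instructions for i in range(vl)]


def vertical_first(instructions, vl):
    """All instructions run for element i, then the schedule steps on to i+1."""
    return [(name, i) for i in range(vl) for name in instructions]


h_order = horizontal_first(["ADD", "MUL"], 3)
v_order = vertical_first(["ADD", "MUL"], 3)
```

Both orders cover the same set of (instruction, element) pairs; only the nesting of the two loops is swapped.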

then it hit me: after two years trying to get my head round MyISA 66000 Vector Loops and completely failing, i suspect that Vertical-First Vectors is the exact same thing.

can anyone confirm?

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: Thomas Koenig - Fri, 9 Jul 2021 13:37 UTC

luke.l...@gmail.com <luke.leighton@gmail.com> wrote:

> https://openpowerfoundation.org/libre-soc-180nm-power-isa-asic-submitted-to-imec-for-fabrication/

Interesting. I see that floating point is missing, but OK :-)

> anyway, that as an aside: we're currently exploring what we're
> calling "Vertical Vectors". fit enough Vs in and you might end
> up creating a new buzzword nobody's thought of before.

> the current Cray-style Vector system that we have designed is
> near-identical to the old x86 "REP" instruction. REP ADD would do:

>
> for i = 0 to VL-1
> Regfile[RT+i] = ADD(Regfile[RA+i], Regfile[RB+i])

> the next phase on top of that is to add a REMAP system:
>
> for i = 0 to VL-1
> Regfile[RT+REMAP1(i)] =
> ADD(Regfile[RA+REMAP2(i)],
> Regfile[RB+REMAP3(i)])
>
> by designing hardware REMAP you can do *in-place* Matrix Multiply,
> or even a full RADIX-N Butterfly schedule for FFT, DCT, DFT and
> so on.

How would this be expressed in the ISA? Could you maybe elaborate
a little more? What exactly does REMAP do?

[...]

> however, as you no doubt are aware, FFTs involve complex numbers,
> and if you want to do in-place FFTs with only a single instruction,
> you have to have Complex numbers as a *FIRST ORDER* primitive in
> your ISA. this we considered to be too much.

Yes.

Did you have a look at SVE? That may be interesting for the
complex number case (but ARM may have some sort of intellectual
property protection on what they specified there, I don't know).

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: luke.l...@gmail.com - Fri, 9 Jul 2021 14:41 UTC

On Friday, July 9, 2021 at 2:37:56 PM UTC+1, Thomas Koenig wrote:
> luke.l...@gmail.com <luke.l...@gmail.com> wrote:
>
> > https://openpowerfoundation.org/libre-soc-180nm-power-isa-asic-submitted-to-imec-for-fabrication/
>
> Interesting. I see that floating point is missing, but OK :-)

first test ASIC, we had to keep it below 6 x 6 mm.
we achieved 5.1 x 5.9 mm, 130,000 cells i believe.
if we'd added an FPU it would not have helped
achieve the main goal, "can you, having never done
HDL before, even make a processor *at all*"?

> > by designing hardware REMAP you can do *in-place* Matrix Multiply,
> > or even a full RADIX-N Butterfly schedule for FFT, DCT, DFT and
> > so on.
> How would this be expressed in the ISA? Could you maybe elaborate
> a little more? What exactly does REMAP do?

the summary there (calling a "function" - hardware block) is so brief
it's not clear that that's really all there is to it. hilariously the logic
involved for Matrix Multiply is literally identical to that of ZOLC
(Zero-Overhead Loop Control).

there are two types of schedule. one is for Matrix Multiply,
the other is for FFT Butterfly. they're best expressed in actual
executable code, which you can run:
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remapyield.py;hb=HEAD
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap_fft_yield.py;hb=HEAD

with the Matrix one, instead of straight Vector element operations 0 1 2 3 ... 11
etc. you get, for *three different operands* res X and Y:

order 0 res 0 X 0 Y 0
order 1 res 1 X 0 Y 1
order 2 res 2 X 3 Y 0
order 3 res 3 X 3 Y 1
order 4 res 0 X 1 Y 2
order 5 res 1 X 1 Y 3
order 6 res 2 X 4 Y 2
order 7 res 3 X 4 Y 3
order 8 res 0 X 2 Y 4
order 9 res 1 X 2 Y 5
order 10 res 2 X 5 Y 4
order 11 res 3 X 5 Y 5

so you "pretend" that your Matrices have been "flattened" into
linear 1D arrays, the REMAP schedule "pretends" that actually
they're 2D, and automatically REMAPs access to the Vector
Elements so that a *single FMAC* does the *ENTIRE* matrix
multiply.
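
The table above can be reproduced with a three-deep loop over flat, row-major indices. The sketch below is inferred from the table alone (a 2x3 matrix X times a 3x2 matrix Y), not copied from remapyield.py, and the function names are illustrative.

```python
# Sketch inferred from the schedule table; not the Libre-SOC implementation.

def matrix_schedule(rows, inner, cols):
    """Yield (res, x, y) flat indices for res[rows x cols] += X[rows x inner] * Y[inner x cols]."""
    for k in range(inner):
        for i in range(rows):
            for j in range(cols):
                yield (i * cols + j, i * inner + k, k * cols + j)


def matmul_flat(x, y, rows, inner, cols):
    """Matrix multiply on flattened 1D arrays using one multiply-accumulate per step."""
    res = [0] * (rows * cols)
    for r, xi, yi in matrix_schedule(rows, inner, cols):
        res[r] += x[xi] * y[yi]  # the single FMAC, re-issued with remapped offsets
    return res


schedule = list(matrix_schedule(2, 3, 2))
X = [1, 2, 3, 4, 5, 6]        # 2x3, row-major
Y = [7, 8, 9, 10, 11, 12]     # 3x2, row-major
product = matmul_flat(X, Y, 2, 3, 2)
```

Note that the inner-dimension loop is outermost here, so consecutive steps write to different result elements rather than accumulating into the same one back to back.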

for the butterfly one:
size 2 halfsize 1 tablestep 4
0 i 0 j=0 jh=1 k=0 -> j[jl=0 ] j[jh=1 ] exptable[k=0]
1 i 2 j=2 jh=3 k=0 -> j[jl=2 ] j[jh=3 ] exptable[k=0]
2 i 4 j=4 jh=5 k=0 -> j[jl=4 ] j[jh=5 ] exptable[k=0]
3 i 6 j=6 jh=7 k=0 -> j[jl=6 ] j[jh=7 ] exptable[k=0]
size 4 halfsize 2 tablestep 2
4 i 0 j=0 jh=2 k=0 -> j[jl=0 ] j[jh=2 ] exptable[k=0]
5 i 0 j=1 jh=3 k=2 -> j[jl=1 ] j[jh=3 ] exptable[k=2]
6 i 4 j=4 jh=6 k=0 -> j[jl=4 ] j[jh=6 ] exptable[k=0]
7 i 4 j=5 jh=7 k=2 -> j[jl=5 ] j[jh=7 ] exptable[k=2]
size 8 halfsize 4 tablestep 1
8 i 0 j=0 jh=4 k=0 -> j[jl=0 ] j[jh=4 ] exptable[k=0]
9 i 0 j=1 jh=5 k=1 -> j[jl=1 ] j[jh=5 ] exptable[k=1]
10 i 0 j=2 jh=6 k=2 -> j[jl=2 ] j[jh=6 ] exptable[k=2]
11 i 0 j=3 jh=7 k=3 -> j[jl=3 ] j[jh=7 ] exptable[k=3]
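
The listing above corresponds to the usual triple loop of an iterative radix-2 FFT. The following Python sketch regenerates the same (jl, jh, k) triples for n = 8 and then uses them to drive a complete transform. It uses Python's built-in complex type purely for brevity, which is exactly the first-order complex support the ISA discussion avoids; the function names are illustrative and the bit-reversal step is the standard one for this form of the algorithm.

```python
import cmath


def butterfly_schedule(n):
    """Yield (jl, jh, k) for an n-point radix-2 FFT, n a power of two."""
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                yield (j, j + halfsize, (j - i) * tablestep)
        size *= 2


def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y


def fft(values):
    """Forward FFT driven entirely by the butterfly schedule."""
    n = len(values)
    bits = n.bit_length() - 1
    vec = [values[bit_reverse(i, bits)] for i in range(n)]
    exptable = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for jl, jh, k in butterfly_schedule(n):
        temp = vec[jh] * exptable[k]   # the twin multiply-add butterfly
        vec[jh] = vec[jl] - temp
        vec[jl] = vec[jl] + temp
    return vec


def dft(values):
    """Naive O(n^2) reference transform for checking the result."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
            for f in range(n)]


schedule = list(butterfly_schedule(8))
samples = [1, 2, 3, 4, 0, -1, -2, -3]
fast = fft(samples)
slow = dft(samples)
```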

for DFT it's 2 instructions:

"svremap 8, 1, 1, 1",
"sv.ffmadds 0.v, 0.v, 0.v, 8.v"
the first one sets up an 8-wide butterfly.
the second one is the "base" (twin +/-) FMAC.

the full unit test, which includes some python code from Nayuki
that produces the same result (unit test, go figure) but makes it
"easier to understand what happens", is here:

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD

> Did you have a look at SVE? That may be interesting for the
> complex number case

to be honest, although it might on the face of it sound useful, it's very unlikely.
SVE is basically fixed-width-SIMD-with-predication
where SVP64 is Variable-Length Cray-style Vectors
https://libre-soc.org/openpower/sv/svp64/

the paradigms are... well, if you wanted to turn SVE into Cray-style
Vectors, what you would do is actually very simple:

1) add a setvl instruction which stores VL in a CSR
2) have VL turned into an *automatic* predicate mask (1<<VL)-1
3) AND that automatically with **ALL** SVE operations
[so if there's a predicate argument, it's ANDed with (1<<VL)-1
before being sent to the back-end SIMD ALUs]

that's it.
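
The three steps can be modelled in a few lines of Python. This is a conceptual sketch of the masking idea only, with made-up function names and an 8-lane width chosen for the example; it says nothing about how SVE or AVX-512 hardware is actually built.

```python
# Conceptual sketch: VL as an automatic predicate on a fixed-width SIMD unit.

def effective_predicate(pred, vl):
    """AND the explicit predicate with the automatic (1 << VL) - 1 mask."""
    return pred & ((1 << vl) - 1)


def predicated_simd_add(dst, a, b, pred, vl):
    """Fixed-width lanes; only lanes enabled by the combined mask are written."""
    mask = effective_predicate(pred, vl)
    for lane in range(len(dst)):
        if (mask >> lane) & 1:
            dst[lane] = a[lane] + b[lane]
    return dst


all_lanes = effective_predicate(0b11111111, 5)
odd_lanes = effective_predicate(0b10101010, 5)
result = predicated_simd_add([0] * 8, [1] * 8, list(range(8)), 0xFF, 3)
```

With VL set to 3, an all-ones predicate still only touches the first three lanes, which is the Cray-style "vector length" behaviour layered on top of predication.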

that's all that ARM had to do, and they could call their ISA "True
Cray Vectors". Intel could do the same thing with AVX-512 and they'd
achieve the same thing.

ah well.

but REMAP - and Vertical-First Mode - these are *another* layer even on top
of that, and consequently i am banging my head against a brick wall looking
at other people's "optimised" code, academic paper after academic paper
going "arrrrrgggggh noooo" :)

even Cray-style Vectors, with linear element operations 0 1 2 3 4 ...
will not really help here because they will perform a Vector Indexed LOAD,
computing the indices of the REMAP schedule with an algorithm, into
a Vector of offsets, which is then passed in to the Indexed LOAD.
do one "batch", then STORE, then have *another* loop around that.

REMAP does the *ENTIRE* triple-loop butterfly in *ONE* instruction.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: Quadibloc - Fri, 9 Jul 2021 16:09 UTC

On Friday, July 9, 2021 at 4:14:50 AM UTC-6, luke.l...@gmail.com wrote:
> hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:

Congratulations!

John Savard

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: MitchAlsup - Fri, 9 Jul 2021 16:29 UTC

On Friday, July 9, 2021 at 5:14:50 AM UTC-5, luke.l...@gmail.com wrote:
> hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:
<
Congratulations !! well done !
>
> https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1266246-libre-soc-test-asic-going-to-fabrication-using-tsmc-180nm-process
> https://openpowerfoundation.org/libre-soc-180nm-power-isa-asic-submitted-to-imec-for-fabrication/
>
> anyway, that as an aside: we're currently exploring what we're calling "Vertical Vectors". fit enough Vs in and you might end up creating a new buzzword nobody's thought of before.
>
> the current Cray-style Vector system that we have designed is near-identical to the old x86 "REP" instruction. REP ADD would do:
>
> for i = 0 to VL-1
> Regfile[RT+i] = ADD(Regfile[RA+i], Regfile[RB+i])
>
> the next phase on top of that is to add a REMAP system:
>
> for i = 0 to VL-1
> Regfile[RT+REMAP1(i)] =
> ADD(Regfile[RA+REMAP2(i)],
> Regfile[RB+REMAP3(i)])
>
> by designing hardware REMAP you can do *in-place* Matrix Multiply, or even a full RADIX-N Butterfly schedule for FFT, DCT, DFT and so on.
>
> then you can do the ENTIRE butterfly schedule by adding a special 3-input 2-output instruction which does this:
>
> TWINFFMADD (RT, RS, RA, RC, RB):
>     temp = RA * RC
>     RT = RB - temp
>     RS = RB + temp
>
> this we have implemented in a simulator: it works really well.
>
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
>
> however, as you no doubt are aware, FFTs involve complex numbers, and if you want to do in-place FFTs with only a single instruction, you have to have Complex numbers as a *FIRST ORDER* primitive in your ISA. this we considered to be too much.
>
> so it got me wondering: "hmmm, what would happen if we allowed the REMAP schedules to be applied to multiple instructions before moving on to the next for-loop in the sequence?" effectively this:
>
> for i = 0 to VL-1
>     rt = REMAP1(i)
>     ra = REMAP2(i)
>     rb = REMAP3(i)
>     Regfile[RT+rt] = ADD(Regfile[RA+ra], Regfile[RB+rb])
>     Regfile[RS+rt] = MUL(Regfile[RA+ra], Regfile[RB+rb])
>
> in other words you *TURN ROUND* the looping, allowing multiple elements to be executed under the same schedule, then move the "Schedule" on by one (with an explicit instruction), then branch back.
>
> then it hit me: after two years trying to get my head round MyISA 66000 Vector Loops and completely failing, i suspect that Vertical-First Vectors is the exact same thing.
<
What My 66000 did was as follows::
<
SUBROUTINE FFT2( A, N )
    COMPLEX DOUBLE PRECISION A[N]
    DOUBLE PRECISION RE[N] = &REAL(A[0])
    DOUBLE PRECISION IM[N] = &IMAG(A[0])
    N2 = N
    DO 10 K = 1, M
        DOUBLE PRECISION RO = &RE[N2]
        DOUBLE PRECISION IN = &IM[N2]
        N1 = N2
        N2 = N2/2
        E = 6.28318/N1
        A = 0
        DO 20 J = 1, N2
            C = COS (A)
            S =-SIN (A)
            A = J*E
            DO 30 I = J, N, N1
                XT = RE(I) - RO(I)
                YT = IM(I) - IN(I)
                RE(I) = RE(I) + RO(I)
                IM(I) = IM(I) + IN(I)
                RO(I) = XT*C - YT*S
                IN(I) = XT*S + YT*C
30          CONTINUE
20      CONTINUE
10  CONTINUE
<
By taking the addresses of the real and imaginary parts and incrementing
by the pair, you treat the real and imaginary parts as individual 1D vectors.
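
For readers who do not read Fortran, here is a rough Python rendering of the same loop nest, operating on an interleaved [re, im, re, im, ...] array so that the real and imaginary parts are addressed as two strided 1D vectors. It is my own sketch, not a transcription: indices are 0-based, the partner element is taken at i + n2 after n2 has been halved (my reading of the intent), and the helper names are invented. This form of the algorithm leaves its output in bit-reversed order, so a small helper reorders it for checking.

```python
import cmath
import math


def fft2_interleaved(a, n):
    """In-place radix-2 decimation-in-frequency FFT on interleaved re/im data."""
    n2 = n
    while n2 > 1:                       # the outer "DO 10" loop over stages
        n1 = n2
        n2 //= 2
        e = 2.0 * math.pi / n1
        for j in range(n2):             # "DO 20": one twiddle factor per j
            c = math.cos(j * e)
            s = -math.sin(j * e)
            for i in range(j, n, n1):   # "DO 30": the vectorisable inner loop
                lo, hi = 2 * i, 2 * (i + n2)
                xt = a[lo] - a[hi]
                yt = a[lo + 1] - a[hi + 1]
                a[lo] += a[hi]
                a[lo + 1] += a[hi + 1]
                a[hi] = xt * c - yt * s
                a[hi + 1] = xt * s + yt * c
    return a


def natural_order(a, n):
    """Undo the bit-reversed output order and return a list of complex values."""
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, f"0{bits}b")[::-1], 2)
        out[r] = complex(a[2 * i], a[2 * i + 1])
    return out


def dft_reference(samples):
    """Naive O(n^2) transform used only to check the result."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
            for f in range(n)]


samples = [complex(v, -v / 2) for v in (1, 2, 3, 4, 0, -1, -2, -3)]
interleaved = [part for z in samples for part in (z.real, z.imag)]
result = natural_order(fft2_interleaved(interleaved, len(samples)), len(samples))
expected = dft_reference(samples)
```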
<
FFT2:
    // MOV RRE,RA
    ADD RIM,RA,8
    MOV RN2,RN
    MOV RK,1
loop10:
    LDA RRO,[RRE+RN2<<3]
    LDA RIN,[RIM+RN2<<3]
    MOV RN1,RN2
    SRA RN2,RN2,1
    CVT (double)RN1F,(int)RN1
    FDIV RE,6.283185307179586476925286766559,RN1F
    MOV RA,0
    MOV RJ,1
loop20:
    COS RC,RA
    SIN -RS,RA
    CVT (double)RJF,(int)RJ
    FMUL RA,RJF,RE
    MOV RI,RJ
loop30:
    VEC RI,{RI}
    LDD RXI,[RRE+RI<<3]
    LDD RXL,[RRO+RI<<3]
    LDD RYI,[RIM+RI<<3]
    LDD RYL,[RIN+RI<<3]
    FADD RXT,RXI,-RXL
    FADD RYT,RYI,-RYL
    FADD RXI,RXI,RXL
    FADD RYI,RYI,RYL
    FMUL RXC,RXT,RC
    FMUL RXS,RYT,RS
    FMUL RYS,RXT,RS
    FMUL RYC,RYT,RC
    FADD RXL,RXC,-RXS
    FADD RYL,RYS,RYC
    STD RXI,[RRE+RI<<3]
    STD RYI,[RIM+RI<<3]
    STD RXL,[RRO+RI<<3]
    STD RYL,[RIN+RI<<3]
    // ADD RI,RI,RN1
    // CMP Rc,RI,RN
    // BLE Rc,loop30
    LOOP Ri,RN1,Rn
    ADD RJ,RJ,1
    CMP Rc,RJ,RN2
    BLE Rc,loop20
    ADD RK,RK,1
    CMP Rc,RK,RM
    BLE loop10
    RET
>
> can anyone confirm?
<
I still don't think you have your head around it--VVM vectorizes loops while CRAY
vectorizes instructions (and narrows your vision). Once a loop is identified (loaded)
one can map the available resources to perform the loop more than once per cycle.
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: luke.l...@gmail.com - Fri, 9 Jul 2021 18:33 UTC

On Friday, July 9, 2021 at 5:09:34 PM UTC+1, Quadibloc wrote:
> On Friday, July 9, 2021 at 4:14:50 AM UTC-6, luke.l...@gmail.com wrote:
> > hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:
> Congratulations!

thanks john :)

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: luke.l...@gmail.com - Fri, 9 Jul 2021 21:11 UTC

On Friday, July 9, 2021 at 5:29:42 PM UTC+1, MitchAlsup wrote:
> On Friday, July 9, 2021 at 5:14:50 AM UTC-5, luke.l...@gmail.com wrote:
> > hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:
> <
> Congratulations !! well done !

appreciated.

> > then it hit me: after two years trying to get my head round MyISA 66000 Vector Loops and completely failing, i suspect that Vertical-First Vectors is the exact same thing.
> <
> What My 66000 did was as follows::
> <
> SUBROUTINE FFT2( A, N )

looks reasonable. REMAP does the triple for-loops automatically,
so that all the instructions involving loops aren't needed. so for
up to about... 16-wide FP64, and excluding the LD/STs i can get
away with about 8 instructions *total*.

of course, if you want to do FFTs that are greater than the total
number of registers available *then* you have to do a recursive
divide-and-conquer using smaller SUB-FFTs, and do the "joining"
layer by hand.

> By taking the addresses of the real and imaginary parts and incrementing
> by the pair, you treat the real and imaginary parts as individual 1D vectors.

so everything between here:
loop30:
VEC RI,{RI}
...

and here:
// BLE Rc,loop30
LOOP Ri,RN1,Rn

is repeated, but with different-numbered "in-flight" registers?

> I still don't think you have your head around it--VVM vectorizes loops while CRAY
> vectorizes instructions (and narrows your vision).

i've deviated somewhat radically from the original Cray design. SVP64 is more
like how MMX (SIMD overloaded on x87 FP regs) used to be as far as
registers are concerned, and more like the x86 REP instruction used to be as
far as looping is concerned.

it fits on top of a standard multi-issue superscalar architecture: there *are*
no *actual* vector registers. the "REP-like" looping simply spams as many
"element" operations into ALUs as there are pathways available, at *issue*
time.

> Once a loop is identified (loaded)
> one can map the available resources to perform the loop more than once per cycle.

if i got the bit right about it being "in-flight" registers but otherwise executed
sequentially, then conceptually SVP64 "Vertical-First" is effectively an *explicit*
version of VVM.

the exact same sorts of resource-identification tricks would need to be played,
to say, "oh, you got a bunch of element 0 scalar operations spammed at me by
issue, hmmm, and oh look, now we've got a bunch of element 1 scalar operations:
let me just join those together into matching SIMD operations for you".

which, obviously, gets very painful if those were 8-bit operations.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

 by: MitchAlsup - Fri, 9 Jul 2021 23:56 UTC

On Friday, July 9, 2021 at 4:11:09 PM UTC-5, luke.l...@gmail.com wrote:
> On Friday, July 9, 2021 at 5:29:42 PM UTC+1, MitchAlsup wrote:
> > On Friday, July 9, 2021 at 5:14:50 AM UTC-5, luke.l...@gmail.com wrote:
> > > hi folks, thought you might appreciate knowing, an early version of the Libre-SOC Power ISA core is going to 180nm MPW:
> > <
> > Congratulations !! well done !
> appreciated.
> > > then it hit me: after two years trying to get my head round MyISA 66000 Vector Loops and completely failing, i suspect that Vertical-First Vectors is the exact same thing.
> > <
> > What My 66000 did was as follows::
> > <
> > SUBROUTINE FFT2( A, N )
> looks reasonable. REMAP does the triple for-loops automatically,
> so that all the instructions involving loops aren't needed. so for
> up to about... 16-wide FP64, and excluding the LD/STs i can get
> away with about 8 instructions *total*.
>
> of course, if you want to do FFTs that are greater than the total
> number of registers available *then* you have to do a recursive
> divide-and-conquer using smaller SUB-FFTs, and do the "joining"
> layer by hand.
> > By taking the addresses of the real and imaginary parts and incrementing
> > by the pair, you treat the real and imaginary parts as individual 1D vectors.
> so everything between here:
> loop30:
> VEC RI,{RI}
> ...
>
> and here:
> // BLE Rc,loop30
> LOOP Ri,RN1,Rn
> is repeated, but with different-numbered "in-flight" registers?
<
It is more accurate to understand that everything between one VEC and its corresponding
LOOP can be executed in parallel with other VEC-LOOP iterations given resources to pull
it all off.
<
Also note: HW knows this due to the {RI} in the VEC instruction !!, RI is recirculated
from loop to loop (after being incremented) so each iteration has its own RI; and
since RI is the only value recirculated from loop to loop, the rest of the dataflow is
loop independent !!!
<
Also note: The paucity of the data inside {} of VEC indicates that NONE of the registers
used inside VEC-LOOP is live outside of the loop and thus do not have to be saved !!
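
A rough software model of that point, not My 66000 itself: if the only value carried from one iteration to the next is the index, then any group of iterations can be run together, in any order, and still match the sequential loop. The names `body`, `run_sequential`, `run_in_groups` and `lanes` are made up for this sketch, and the loop body is an arbitrary stand-in.

```python
import random

def body(i, b):
    # Everything "between VEC and LOOP": depends only on the index i.
    return b[i] * 2 + 1

def run_sequential(b):
    return [body(i, b) for i in range(len(b))]

def run_in_groups(b, lanes):
    # Model hardware with `lanes` worth of resources: each group of
    # iterations is executed "at once" (here: in shuffled order).
    a = [None] * len(b)
    for start in range(0, len(b), lanes):
        group = list(range(start, min(start + lanes, len(b))))
        random.shuffle(group)  # order within a group must not matter
        for i in group:
            a[i] = body(i, b)
    return a

b = list(range(10))
same = run_sequential(b) == run_in_groups(b, lanes=4)
```

The model assumes the source and destination arrays do not overlap; memory dependencies between iterations are outside what it shows.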
<
> > I still don't think you have your head around it--VVM vectorizes loops while CRAY
> > vectorizes instructions (and narrows your vision).
<
> i've deviated somewhat radically from the original Cray design. SVP64 is more
> like how MMX (SIMD overloaded on x87 FP regs) used to be as far as
> registers are concerned, and more like the x86 REP instruction used to be as
> far as looping is concerned.
>
> it fits on top of a standard multi-issue superscalar architecture: there *are*
> no *actual* vector registers. the "REP-like" looping simply spams as many
> "element" operations into ALUs as there are pathways available, at *issue*
> time.
<
All subject to the "SIMD is considered harmful" paradigm.........
<
> > Once a loop is identified (loaded)
> > one can map the available resources to perform the loop more than once per cycle.
>
> if i got the bit right about it being "in-flight" registers but otherwise executed
> sequentially, then conceptually SVP64 "Vertical-First" is effectively an *explicit*
> version of VVM.
>
> the exact same sorts of resource-identification tricks would need to be played,
> to say, "oh, you got a bunch of element 0 scalar operations spammed at me by
> issue, hmmm, and oh look, now we've got a bunch of element 1 scalar operations:
> let me just join those together into matching SIMD operations for you".
<
Take the loop::
<
for( i = 0; i < MAX; i++ )
    a[i] = b[i];
<
This "loop" executes ½ as wide as the cache access path per cycle--so if the cache
can access 32 bytes per cycle, the loop executes at 16 bytes per cycle (1R + 1W = 2 cycles).
This is independent of whether sizeof a[i] is {byte, half, word, double, struct,
union}, and for the byte case, this is the equivalent of 60 I/C {5 instructions: LD-ST-ADD-CMP-BLE}
executing 16 iterations per cycle. And it can execute this fast on a 1-wide
in-order machine.
>
> which, obviously, gets very painful if those were 8-bit operations.
<
Not when done as I explained above.
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<64f25153-635a-41eb-a885-aa9f2b0a3badn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18583&group=comp.arch#18583
 by: luke.l...@gmail.com - Sat, 10 Jul 2021 17:14 UTC

On Saturday, July 10, 2021 at 12:56:31 AM UTC+1, MitchAlsup wrote:

> It is more accurate to understand that everything between one VEC and its corresponding
> LOOP can be executed in parallel with other VEC-LOOP iterations given resources to pull
> it all off.

finally after 2 frickin years i get the concept.

> Also note: The paucity of the data inside {} of VEC indicates that NONE of the registers
> used inside VEC-LOOP is live outside of the loop and thus do not have to be saved !!

right.

ok.

so now i know what the hell's going on, i can start asking useful questions, such as "what if you want to do a horizontal add or a vector multiply accumulate". i.e. what happens if you *need* to use a scalar register as an accumulator of vector data in the middle of the loop?

i've run into this in a real-world scenario: FFMPEG's DCT algorithm needs to add up the results of 8 multiplies.

> All subject to the "SIMD is considered harmful" paradigm.........

the number one reason i'm not doing SIMD at the frontend... investigations of the past month show it's far, far worse than even the sigarch article makes out. you should see ffmpeg's "optimised" FFT assembler, and how to do horizontal add with AVX512. a discussion for another thread, that one.

> union} and for the byte case, this is equivalent of 60 I/C {5 instructions LD-ST-ADD-
> CMP-BLE} executing 16 iterations per cycle. And it can execute this fast on a 1-wide
> in-order machine.

mad, isn't it? when there's literally an order of magnitude jump in stats compared to industry-standard accepted "high performance", it stuns and shocks people into total silence, they can't handle it.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<8b8cafbf-ffc9-40f9-b450-76a1a04d0f08n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18585&group=comp.arch#18585
 by: MitchAlsup - Sat, 10 Jul 2021 17:35 UTC

On Saturday, July 10, 2021 at 12:14:17 PM UTC-5, luke.l...@gmail.com wrote:
> On Saturday, July 10, 2021 at 12:56:31 AM UTC+1, MitchAlsup wrote:
>
> > It is more accurate to understand that everything between one VEC and its corresponding
> > LOOP can be executed in parallel with other VEC-LOOP iterations given resources to pull
> > it all off.
> finally after 2 frickin years i get the concept.
> > Also note: The paucity of the data inside {} of VEC indicates that NONE of the registers
> > used inside VEC-LOOP is live outside of the loop and thus do not have to be saved !!
> right.
>
> ok.
>
> so now i know what the hell's going on, i can start asking useful questions, such as "what if you want to do a horizontal add or a vector multiply accumulate". i.e what happens if you *need* to use a scalar register as an accumulator of vector data in the middle of the loop?
<
A vector reduction is written as::
<
for( sum= 0, i = 0; i < MAX; i++ )
sum += a[i];
<
Then you make the compiler realize this is a vector sum reduction (sum is a recurrence)
and then you allow the FADD instruction to hold onto the value as an accumulator so you
can pump a new operand into the sum every cycle (per lane). So FADD does not "have"
to deliver a result, but to recirculate the result to the next operand internally.
<
Depending on your accuracy requirements, you can do this in the 200-odd bit wide accumulator
of the FMAC instruction so you maintain great accuracy {Or do something like quires in posits}
<
And follow the loop termination with a final summation of the large intermediates and give the
result of such a large accumulator rounded exactly once.
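
A minimal sketch of the accumulate-wide, round-once idea, under loud assumptions: `Fraction` stands in for a very wide accumulator (it is exact, which real hardware of finite width is not), and `naive_sum`, `wide_sum` and `lanes` are hypothetical names, not anything from My 66000.

```python
from fractions import Fraction

def naive_sum(a):
    # Round after every add, as a plain chain of FADDs would.
    total = 0.0
    for x in a:
        total += x
    return total

def wide_sum(a, lanes=2):
    # One exact accumulator per lane; element i goes to lane i % lanes.
    acc = [Fraction(0)] * lanes
    for i, x in enumerate(a):
        acc[i % lanes] += Fraction(x)
    # After the loop: combine the lane accumulators, then round exactly once.
    return float(sum(acc))

a = [1e16, 1.0, -1e16, 1.0]
print(naive_sum(a), wide_sum(a))
```

With this input the rounded-every-step chain loses one of the 1.0 terms, while the wide accumulator keeps both.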
>
> i've run into this in a realworld scenario, FFMPEG's DCT algorithm, it needs to add up the sum of 8 multiplies.
<
Estrin's method is useful, here--treeify the summations.
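
Taking "treeify" literally, here is a sketch of the 8-term case as a pairwise reduction: 8 products are summed in 3 levels of independent adds rather than a chain of 7 dependent ones. `tree_sum` is a hypothetical helper, it assumes a non-empty input, and the coefficients are made-up values rather than anything from FFMPEG's DCT.

```python
def tree_sum(terms):
    # Pairwise reduction: add neighbours, halving the list each level.
    terms = list(terms)
    levels = 0
    while len(terms) > 1:
        if len(terms) % 2:          # odd count: pad with the additive identity
            terms.append(0.0)
        terms = [terms[i] + terms[i + 1] for i in range(0, len(terms), 2)]
        levels += 1
    return terms[0], levels

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
c = [0.5] * 8
total, levels = tree_sum(xi * ci for xi, ci in zip(x, c))
```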
<
> > All subject to the "SIMD is considered harmful" paradigm.........
<
> the number one reason i'm not doing SIMD at the frontend... investigations of the past month show it's far, far worse than even the sigarch article makes out. you should see ffmpeg's "optimised" FFT assembler, and how to do horizontal add with AVX512. a discussion for another thread, that one.
<
> > union} and for the byte case, this is equivalent of 60 I/C {5 instructions LD-ST-ADD-
> > CMP-BLE} executing 16 iterations per cycle. And it can execute this fast on a 1-wide
> > in-order machine.
<
> mad, isn't it? when there's literally an order of magnitude jump in stats compared to industry-standard accepted "high performance", it stuns and shocks people into total silence, they can't handle it.
<
Actually 5*16 is 80 not 60.............but I digress........and all done inside another construct
that did not require any SIMD-ed-ness yet delivers significantly higher performance..........
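
The arithmetic, spelled out as a small sketch. `vvm_copy_rate` is a hypothetical helper, and the element sizes assumed for half, word and double (2, 4 and 8 bytes) are assumptions of the sketch.

```python
def vvm_copy_rate(cache_bytes_per_cycle, elem_bytes, insts_per_iter=5):
    # One read plus one write per element share the cache port,
    # so the loop moves half the port width per cycle.
    bytes_per_cycle = cache_bytes_per_cycle // 2
    iters_per_cycle = bytes_per_cycle // elem_bytes
    # Equivalent scalar instructions per cycle (LD, ST, ADD, CMP, BLE).
    return iters_per_cycle, iters_per_cycle * insts_per_iter

for name, size in [("byte", 1), ("half", 2), ("word", 4), ("double", 8)]:
    print(name, vvm_copy_rate(32, size))
```

The bytes moved per cycle stay at 16 for every element size; only the iteration count, and with it the equivalent instruction count, changes.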
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<6bed56b7-76eb-42d0-9400-b30325c93aa1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18587&group=comp.arch#18587
 by: luke.l...@gmail.com - Sat, 10 Jul 2021 18:17 UTC

On Saturday, July 10, 2021 at 6:35:42 PM UTC+1, MitchAlsup wrote:
> On Saturday, July 10, 2021 at 12:14:17 PM UTC-5, luke.l...@gmail.com wrote:

> > so now i know what the hell's going on, i can start asking useful questions, such as "what if you want to do a horizontal add or a vector multiply accumulate". i.e what happens if you *need* to use a scalar register as an accumulator of vector data in the middle of the loop?
> <
> A vector reduction is written as::
> <
> for( sum= 0, i = 0; i < MAX; i++ )
> sum += a[i];
> <
> Then you make the compiler realize this is a vector sum reduction (sum is a recurrence)
> and then you allow the FADD instruction to hold onto the value as an accumulator so you
> can pump a new operand into the sum every cycle (per lane). So FADD does not "have"
> to deliver a result, but to recirculate the result to the next operand internally.

i meant, how would VVM VEC/LOOP handle this, would it cope with one of the operands being a scalar accumulator when all other operands are sort-of-in-flight-allocated?

SVP64 copes fine because the registers are all explicitly defined: they end up being handled by standard hazard dependency tracking

> Depending on your accuracy requirements, you can do this in the 200-odd bit wide accumulator
> of the FMAC instruction so you maintain great accuracy {Or do something like quires in posits}

hilariously, for ffmpeg MP3 this gave us the wrong answer during unit test comparisons against the existing algorithms, because less accurate explicit FP32 mul and FP32 add had been used.
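
a small sketch of how that kind of mismatch can arise, not the actual ffmpeg test: the same dot product is computed once with an FP32 rounding after every mul and add, and once accumulated unrounded and rounded to FP32 at the end. Python's FP64 float stands in for the wider accumulator, and `f32`, `dot_separate` and `dot_wide` are names invented for this example.

```python
import struct

def f32(x):
    # Round a Python float (FP64) to the nearest FP32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_separate(a, b):
    # Explicit FP32 mul then FP32 add: two roundings per term.
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc

def dot_wide(a, b):
    # Accumulate unrounded products in FP64 (standing in for a wider
    # accumulator) and round to FP32 once at the end.
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return f32(acc)

v = f32(1.0 + 2.0 ** -12)          # exactly representable in FP32
a = b = [v, v, v]
diff = dot_wide(a, b) - dot_separate(a, b)
```

for this input the two results differ by one FP32 unit in the last place, so a bit-exact comparison against the separately rounded version fails even though the wide result is the more accurate one.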

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<550ccb9c-be0a-4ad2-8601-0b07c8d85184n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18590&group=comp.arch#18590
 by: MitchAlsup - Sat, 10 Jul 2021 22:36 UTC

On Saturday, July 10, 2021 at 1:17:32 PM UTC-5, luke.l...@gmail.com wrote:
> On Saturday, July 10, 2021 at 6:35:42 PM UTC+1, MitchAlsup wrote:
> > On Saturday, July 10, 2021 at 12:14:17 PM UTC-5, luke.l...@gmail.com wrote:
>
> > > so now i know what the hell's going on, i can start asking useful questions, such as "what if you want to do a horizontal add or a vector multiply accumulate". i.e what happens if you *need* to use a scalar register as an accumulator of vector data in the middle of the loop?
> > <
> > A vector reduction is written as::
> > <
> > for( sum= 0, i = 0; i < MAX; i++ )
> > sum += a[i];
> > <
> > Then you make the compiler realize this is a vector sum reduction (sum is a recurrence)
> > and then you allow the FADD instruction to hold onto the value as an accumulator so you
> > can pump a new operand into the sum every cycle (per lane). So FADD does not "have"
> > to deliver a result, but to recirculate the result to the next operand internally.
<
> i meant, how would VVM VEC/LOOP handle this, would it cope with one of the operands being a scalar accumulator when all other operands are sort-of-in-flight-allocated?
>
> SVP64 copes fine because the registers are all explicitly defined: they end up being handled by standard hazard dependency tracking
> > Depending on your accuracy requirements, you can do this in the 200-odd bit wide accumulator
> > of the FMAC instruction so you maintain great accuracy {Or do something like quires in posits}
<
> hilariously for ffmpeg MP3 this gave us the wrong answer during unit test comparisons against the existing algorithms, because less accurate explicit FP32 mul and FP32 add had been used.
<
Gould (S.E.L.) 32/87 did divide better (about 1/3rd of a bit after rounding) than the 32/50,
and every single <previous> customer complained !
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<scf70f$jfp$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=18608&group=comp.arch#18608
 by: Thomas Koenig - Sun, 11 Jul 2021 16:41 UTC

luke.l...@gmail.com <luke.leighton@gmail.com> wrote:
> On Friday, July 9, 2021 at 2:37:56 PM UTC+1, Thomas Koenig wrote:

>> Did you have a look at SVE? That may be interesting for the
>> complex number case
>
> to be honest, although it might on the face of it sound useful, it's very unlikely.

I should probably have been a bit more specific, I meant the
possibility of rotating complex numbers before doing operations
on them, or even more specific, the FCADD and FCMLA instructions,
Floating-point complex add with rotate and Floating-point complex
multiply-add with rotate.

To quote the manual:

# The FCADD instructions rotate the complex numbers in the second
# source vector by 90 degrees or 270 degrees in the direction from
# the positive real axis towards the positive imaginary axis, when
# considered in polar representation, before adding active pairs of
# elements to the corresponding elements of the first source vector
# in a destructive manner.

# The FCMLA instructions perform a transformation of the operands to
# allow the creation of multiply-add or multiply-subtract operations
# on complex numbers by combining two of the instructions. The
# transformations performed are as follows:

# • The complex numbers in the second source vector, considered
# in polar form, are rotated by 0 degrees or 180 degrees before
# multiplying by the duplicated real components of the first source
# vector.

# • The complex numbers in the second source vector, considered
# in polar form, are rotated by 90 degrees or 270 degrees before
# multiplying by the duplicated imaginary components of the first
# source vector. The resulting products are then added to the
# corresponding components of the destination and addend vector,
# without intermediate rounding.
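
A per-element model of the quoted wording may help, with the caveat that it is a sketch based only on the text above: it handles a single complex pair, ignores predication, and rounds after each Python operation rather than "without intermediate rounding". `ROT`, `fcadd` and `fcmla` here are Python stand-ins, not the instructions themselves.

```python
ROT = {0: 1, 90: 1j, 180: -1, 270: -1j}   # multiply-by-i rotations

def fcadd(acc, b, rot):
    # rot is 90 or 270: rotate b, then add to the accumulator.
    assert rot in (90, 270)
    return acc + b * ROT[rot]

def fcmla(acc, a, b, rot):
    # rot 0/180 pairs the rotated b with the real part of a,
    # rot 90/270 pairs it with the imaginary part of a.
    scale = a.real if rot in (0, 180) else a.imag
    return acc + scale * (b * ROT[rot])

acc, a, b = 1 + 1j, 2 + 3j, 4 - 5j
mla = fcmla(fcmla(acc, a, b, 0), a, b, 90)      # acc + a*b
mls = fcmla(fcmla(acc, a, b, 180), a, b, 270)   # acc - a*b
```

Combining rotations 0 and 90 gives a complex multiply-add, and 180 with 270 gives a multiply-subtract, which is the "combining two of the instructions" part of the description.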

Those instructions can indeed come in handy when dealing with
complex numbers. Again, not sure what IP protection ARM has there,
if any.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18610&group=comp.arch#18610
 by: luke.l...@gmail.com - Sun, 11 Jul 2021 16:49 UTC

On Sunday, July 11, 2021 at 5:41:54 PM UTC+1, Thomas Koenig wrote:

> I should probably have been a bit more specific, I meant the
> possibility of rotating complex numbers before doing operations
> on them, or even more specific, the FCADD and FCMLA instructions,
> Floating-point complex add with rotate and Floating-point complex
> multiply-add with rotate.

ah ok, yes, interesting. yes, we gave some consideration to having
complex numbers as "First-order" types, as tagged registers. however
this is a step too far along for what is already an advanced Vector ISA.

> Those instructions can indeed come in handy when dealing with
> complex numbers. Again, not sure what IP protection ARM has there,
> if any.

none. standards constitute facts. facts are uncopyrightable. the
material *describing* a standard may be copyrighted: the facts
*in* the material may not.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18616&group=comp.arch#18616
 by: MitchAlsup - Sun, 11 Jul 2021 17:51 UTC

On Sunday, July 11, 2021 at 11:49:26 AM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, July 11, 2021 at 5:41:54 PM UTC+1, Thomas Koenig wrote:
>
> > I should probably have been a bit more specific, I meant the
> > possibility of rotating complex numbers before doing operations
> > on them, or even more specific, the FCADD and FCMLA instructions,
> > Floating-point complex add with rotate and Floating-point complex
> > multiply-add with rotate.
<
> ah ok, yes, interesting. yes, we gave some consideration to having
> complex numbers as "First-order" types, as tagged registers. however
> this is a step too far along for what is already an advanced Vector ISA.
<
I, ultimately, came to the same conclusion::
<
Although complex arithmetic is well understood, and a <mostly> straightforward
blending of std floating point arithmetic, and not "that hard" to
integrate into ISA, it did not make the cut to first class citizenship.
<
> > Those instructions can indeed come in handy when dealing with
> > complex numbers. Again, not sure what IP protection ARM has there,
> > if any.
<
> none. standards constitute facts. facts are uncopyrightable. the
> material *describing* a standard may be copyrighted: the facts
> *in* the material may not.
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<scfem1$pf9$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=18621&group=comp.arch#18621
 by: Thomas Koenig - Sun, 11 Jul 2021 18:52 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:

> Although complex arithmetic is well understood, and a <mostly> straight
> forward blending of std floating point arithmetic, and not "that hard" to
> integrate into ISA, it did not make the cut to first class citizenship.

With the possibility of negating all operands, you can do most of
what needs doing for complex arithmetic anyway.

It is also interesting that gcc at least deals with complex
variables basically as a struct of two single variables.
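
As an illustration of both remarks, here is a sketch of a complex multiply built from a pair of plain floats and a multiply-accumulate whose product can be negated. `Cplx`, `fmac` and `cmul` are hypothetical names; `fmac` only imitates such an instruction and rounds twice, where fused hardware would round once.

```python
from typing import NamedTuple

class Cplx(NamedTuple):
    # A complex value as a struct of two plain floats.
    re: float
    im: float

def fmac(a, b, c, neg_product=False):
    # Stand-in for an FMAC with operand negation: c + a*b or c - a*b.
    p = a * b
    return c - p if neg_product else c + p

def cmul(x, y):
    re = fmac(x.im, y.im, x.re * y.re, neg_product=True)   # xr*yr - xi*yi
    im = fmac(x.im, y.re, x.re * y.im)                     # xr*yi + xi*yr
    return Cplx(re, im)

z = cmul(Cplx(2.0, 3.0), Cplx(4.0, -5.0))
```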

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18667&group=comp.arch#18667
 by: luke.l...@gmail.com - Mon, 12 Jul 2021 23:31 UTC

On Sunday, July 11, 2021 at 7:52:51 PM UTC+1, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > Although complex arithmetic is well understood, and a <mostly> straight
> > forward blending of std floating point arithmetic, and not "that hard" to
> > integrate into ISA, it did not make the cut to first class citizenship.
> With the possibility of negating all operands, you can do most of
> what needs doing for complex arithmetic anyway.
>
> It is also interesting that gcc at least deals with complex
> variables basically as a struct of two single variables.

yes, we considered register "tagging" to indicate that two consecutive registers
would be treated as real, imag parts.

instead, i updated SVP64 with a "Vertical-First" mode, effectively
an explicit version of Mitch's VVM, and the pseudocode and
assembler looks like this:

mul1_r = vec_r[jh] * cos_r[k]
mul2_r = vec_i[jh] * sin_i[k]
tpre = mul1_r + mul2_r
mul1_i = vec_r[jh] * sin_i[k]
mul2_i = vec_i[jh] * cos_r[k]
tpim = -mul1_i + mul2_i
vec_r[jh] = vec_r[jl] - tpre
vec_i[jh] = vec_i[jl] - tpim
vec_r[jl] += tpre
vec_i[jl] += tpim

i deliberately spelled that out so that it is closer to the assembler.

# set triple butterfly mode
"svshape 8, 1, 1, 1, 1",
# tpre
"svremap 5, 1, 0, 2, 0, 0",
"sv.fmuls 24, 0.v, 16.v", # mul1_r = r*cos_r
"svremap 5, 1, 0, 2, 0, 0",
"sv.fmuls 25, 8.v, 20.v", # mul2_r = i*sin_i
"fadds 24, 24, 25", # tpre = mul1_r + mul2_r
# tpim
"svremap 5, 1, 0, 2, 0, 0",
"sv.fmuls 26, 0.v, 20.v", # mul1_i = r*sin_i
"svremap 5, 1, 0, 2, 0, 0",
"sv.fmuls 27, 8.v, 16.v", # mul2_i = i*cos_r
"fsubs 26, 27, 26", # tpim = mul2_i - mul1_i
# vec_r jh/jl
"svremap 26, 0, 0, 0, 0, 1",
"sv.ffadds 0.v, 24, 0.v", # vh/vl +/- tpre
# vec_i jh/jl
"svremap 26, 0, 0, 0, 0, 1",
"sv.ffadds 8.v, 26, 8.v", # vh/vl +/- tpim
# svstep loop
"setvl. 0, 0, 0, 1, 0, 0",
"bc 4, 2, -84"

if you prefer the original source it's here
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD

in particular in the original source you can see the triple loop generating
indices jl, jh and k, which is straight out of the excellent and clear nayuki
example code
https://www.nayuki.io/page/free-small-fft-in-multiple-languages
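
for reference, here is a minimal Python sketch of the explicit triple loop wrapped around the ten-line butterfly from the pseudocode above. it is an illustration only, not the code from either linked source: the names fft_inplace, reverse_bits and naive_dft are made up for this sketch, and it assumes a forward transform with cos/sin tables of length n/2.

```python
import cmath
import math


def reverse_bits(x, width):
    """Return x with its lowest `width` bits reversed."""
    y = 0
    for _ in range(width):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y


def fft_inplace(vec_r, vec_i):
    """In-place radix-2 forward FFT on separate real and imaginary lists."""
    n = len(vec_r)
    levels = n.bit_length() - 1
    if n == 0 or (1 << levels) != n or len(vec_i) != n:
        raise ValueError("length must be a power of 2")
    cos_r = [math.cos(2 * math.pi * i / n) for i in range(n // 2)]
    sin_i = [math.sin(2 * math.pi * i / n) for i in range(n // 2)]
    # bit-reversed permutation of the input
    for i in range(n):
        j = reverse_bits(i, levels)
        if j > i:
            vec_r[i], vec_r[j] = vec_r[j], vec_r[i]
            vec_i[i], vec_i[j] = vec_i[j], vec_i[i]
    # the triple loop: it produces the indices jl, jh and k
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for jl in range(i, i + halfsize):
                jh = jl + halfsize
                mul1_r = vec_r[jh] * cos_r[k]
                mul2_r = vec_i[jh] * sin_i[k]
                tpre = mul1_r + mul2_r
                mul1_i = vec_r[jh] * sin_i[k]
                mul2_i = vec_i[jh] * cos_r[k]
                tpim = -mul1_i + mul2_i
                vec_r[jh] = vec_r[jl] - tpre
                vec_i[jh] = vec_i[jl] - tpim
                vec_r[jl] += tpre
                vec_i[jl] += tpim
                k += tablestep
        size *= 2


def naive_dft(xs):
    """O(n^2) reference DFT, used only to check the FFT above."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * t * f / n)
                for t, x in enumerate(xs)) for f in range(n)]
```

the naive_dft helper is there purely as a cross-check; the three nested loops are the part that svshape replaces with hardware schedules.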

the "svshape" instruction establishes an O(N log2 N) Virtual Vector Length, but sets up 3 loop schedules which create the appropriate triple for-loop that would normally be done explicitly.

three 32-bit SPRs are set up behind the scenes:
* SVSHAPE0 for vec[j] (jl)
* SVSHAPE1 for vec[j+halfstep] (jh)
* SVSHAPE2 for cos[k] and sin[k] tables.

the next odd bit is svremap. register operands are given an index
0 thru 4 for FRA, FRB, FRC, FRT and FRS. these make sense to anyone familiar with the Power ISA.

ABC are input operands, T and S are output.

svremap therefore contains a bitmask in binary to enable REMAPPing of FR{ABCTS} and the next 5 arguments indicate which SVSHAPE schedule they should use.

"svremap 5, 1, 0, 2, 0, 0" means:

* 0b101 therefore FRA and FRC are "remapped", B T and S are ignored
* 1st argument, 1, means that FRA uses SVSHAPE1
* 2nd argument ignored
* 3rd argument 2, means that FRC uses SVSHAPE2
* 4th and 5th arguments ignored.
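
the decoding just described can be modelled in a few lines of Python. this is a simplified sketch based only on the description in this post, not on the actual instruction encoding: decode_svremap is a hypothetical helper, and it assumes bit 0 of the mask selects FRA, bit 1 FRB and so on up to bit 4 for FRS (0b101 is symmetric, so that example alone does not settle the bit order).

```python
# operand order as given in the post: index 0 thru 4
OPERANDS = ("FRA", "FRB", "FRC", "FRT", "FRS")


def decode_svremap(mask, *shape_args):
    """Return {operand: SVSHAPE name} for every operand enabled in the mask.

    Assumption: bit i of `mask` enables operand i (bit 0 = FRA).
    Arguments for operands that are not enabled are ignored.
    """
    if len(shape_args) != len(OPERANDS):
        raise ValueError("expected one SVSHAPE selector per operand")
    mapping = {}
    for index, name in enumerate(OPERANDS):
        if mask & (1 << index):
            mapping[name] = "SVSHAPE{}".format(shape_args[index])
    return mapping


# "svremap 5, 1, 0, 2, 0, 0"
print(decode_svremap(5, 1, 0, 2, 0, 0))
```

under that assumption the call gives FRA on SVSHAPE1 and FRC on SVSHAPE2, matching the bullet list above.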

the next instruction is "sv.fmuls 24, 0.v, 16.v" which is FMULS FRT, FRA, FRC

combined with its REMAP schedule it says:

* FRA shall read register REMAP(0, SVSHAPE1)
* FRC shall read register REMAP(16, SVSHAPE2)

and in this way, instead of FRA cycling through 0 1 2 3 4 5
it instead cycles through 0 2 4 6 0 4 according to a standard FFT
butterfly schedule for the lower index.

likewise FRC cycles through the multiply coefficients needed.
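
a sketch of a loop ordering that reproduces the 0 2 4 6 0 4 sequence for an 8-point FFT, written as a Python generator. this is only a software model: butterfly_schedule is a made-up name, and the ordering (offset within the block as the middle loop, block start as the inner loop) is an assumption chosen because it matches the sequence quoted above. the real SVSHAPE schedules may be defined differently.

```python
def butterfly_schedule(n):
    """Yield (jl, jh, k) triples for an n-point radix-2 butterfly.

    Assumed ordering: stage, then offset j within a block, then block start.
    """
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for j in range(halfsize):
            k = j * tablestep          # coefficient index for this offset
            for jl in range(j, n, size):
                yield jl, jl + halfsize, k
        size *= 2


schedule = list(butterfly_schedule(8))
print([jl for jl, jh, k in schedule])   # lower indices
print([jh for jl, jh, k in schedule])   # upper indices
print([k for jl, jh, k in schedule])    # coefficient indices
```

for n=8 this yields 12 triples, i.e. (N/2) * log2(N) butterflies, which is where the O(N log2 N) length comes from.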

the only other "oddity" is the twin ADD/SUB instruction.
although only declared with 3 operands FRT, FRA, FRB it
actually has 2 in 2 out: FRS is the same starting point as
FRT, but with different REMAP schedules, FRT can target jh
and FRS can target jl offsets.

it has to be 2in 2 out so that the operands remain "in-flight" and thus
can do an in-place overwrite. if that takes 2 cycles, so what: it's doing 2
operations (FADD, FSUB) anyway.
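
a small Python model of the in-place overwrite concern, assuming (as in the pseudocode) that the two results are vec[jl] + t and vec[jl] - t. both function names are hypothetical. if the sum were written back by its own single-output instruction first, a following subtract would read the already-updated vec[jl]; producing both results from one read of the inputs does not depend on such ordering.

```python
def twin_add_sub(vec, jl, jh, t):
    """2-in 2-out model: read the inputs once, then write both results."""
    a = vec[jl]
    vec[jl] = a + t
    vec[jh] = a - t


def add_then_sub(vec, jl, jh, t):
    """Two single-output steps, sum first: the second step sees the new vec[jl]."""
    vec[jl] = vec[jl] + t
    vec[jh] = vec[jl] - t   # reads the updated value, so the result is just the old vec[jl]


good = [10.0, 0.0]
twin_add_sub(good, 0, 1, 3.0)

bad = [10.0, 0.0]
add_then_sub(bad, 0, 1, 3.0)

print(good, bad)
```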

it would be very nice if this was more compact: i am thinking
of ways to reduce the number of times that svremap has to be called.

stunningly, if complex numbers were first-class citizens, this reduces down
to THREE instructions.

"svshape 8, 1, 1, 1, 0",
"svremap 31, 1, 0, 2, 0, 1",
"sv.ffmadds 0.v, 0.v, 0.v, 8.v"

that ffmadds is FIVE operands, FRS and FRT as output, FRA FRB
and FRC as input. yes, it does the twin add/sub of the FRC*FRA
coefficient in a butterfly swap, in-place.

the costly bit about not having complex numbers as first class citizens is that scalar operands need to be used to store the intermediate computation of four multiplies *and then* a *pair* of butterfly add/subs have to be done (one for real, the other for imag).
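
to make the cost comparison concrete, here is a Python sketch of the same butterfly written both ways: with separate real and imaginary scalars (four multiplies, two combines, then an add/sub pair each for real and imag) and with a complex type, where it is one multiply and one add/sub pair. the function names are illustrative only, and the coefficient is assumed to be w = cos - i*sin as in a forward FFT.

```python
import math


def butterfly_scalar(ar, ai, br, bi, c, s):
    """Butterfly on separate real/imag parts: a is vec[jl], b is vec[jh]."""
    mul1_r = br * c
    mul2_r = bi * s
    tpre = mul1_r + mul2_r
    mul1_i = br * s
    mul2_i = bi * c
    tpim = mul2_i - mul1_i
    # returns (new vec[jl], new vec[jh]) as (real, imag) pairs
    return (ar + tpre, ai + tpim), (ar - tpre, ai - tpim)


def butterfly_complex(a, b, w):
    """The same butterfly with complex numbers as a first-class type."""
    t = b * w
    return a + t, a - t


angle = 0.3
c, s = math.cos(angle), math.sin(angle)
lo_s, hi_s = butterfly_scalar(1.0, 2.0, 3.0, -4.0, c, s)
lo_c, hi_c = butterfly_complex(complex(1.0, 2.0), complex(3.0, -4.0), complex(c, -s))
print(lo_s, lo_c)
print(hi_s, hi_c)
```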

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18668&group=comp.arch#18668
X-Received: by 2002:a05:6214:a0d:: with SMTP id dw13mr1819392qvb.41.1626135135166;
Mon, 12 Jul 2021 17:12:15 -0700 (PDT)
X-Received: by 2002:a05:6808:14c8:: with SMTP id f8mr1022125oiw.7.1626135134917;
Mon, 12 Jul 2021 17:12:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 17:12:14 -0700 (PDT)
In-Reply-To: <fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jul 2021 00:12:15 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Tue, 13 Jul 2021 00:12 UTC

On Monday, July 12, 2021 at 6:31:29 PM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, July 11, 2021 at 7:52:51 PM UTC+1, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> > > Although complex arithmetic is well understood, and a <mostly> straight
> > > forward blending of std floating point arithmetic, and not "that hard" to
> > > integrate into ISA, it did not make the cut to first class citizenship.
> > With the possibility of negating all operands, you can do most of
> > what needs doing for complex arithmetic anyway.
> >
> > It is also interesting that gcc at least deals with complex
> > variables basically as a struct of two single variables.
> yes, we considered register "tagging" to indicate that two consecutive registers
> would be treated as real, imag parts.
>
> instead, i updated SVP64 with a "Vertical-First" mode, effectively
> an explicit version of Mitch's VVM, and the pseudocode and
> assembler looks like this:
>
> mul1_r = vec_r[jh] * cos_r[k]
> mul2_r = vec_i[jh] * sin_i[k]
> tpre = mul1_r + mul2_r
> mul1_i = vec_r[jh] * sin_i[k]
> mul2_i = vec_i[jh] * cos_r[k]
> tpim = -mul1_i + mul2_i
> vec_r[jh] = vec_r[jl] - tpre
> vec_i[jh] = vec_i[jl] - tpim
> vec_r[jl] += tpre
> vec_i[jl] += tpim
>
> i deliberately spelled that out so that it is closer to the assembler.
>
> # set triple butterfly mode
> "svshape 8, 1, 1, 1, 1",
> # tpre
> "svremap 5, 1, 0, 2, 0, 0",
> "sv.fmuls 24, 0.v, 16.v", # mul1_r = r*cos_r
> "svremap 5, 1, 0, 2, 0, 0",
> "sv.fmuls 25, 8.v, 20.v", # mul2_r = i*sin_i
> "fadds 24, 24, 25", # tpre = mul1_r + mul2_r
> # tpim
> "svremap 5, 1, 0, 2, 0, 0",
> "sv.fmuls 26, 0.v, 20.v", # mul1_i = r*sin_i
> "svremap 5, 1, 0, 2, 0, 0",
> "sv.fmuls 27, 8.v, 16.v", # mul2_i = i*cos_r
> "fsubs 26, 27, 26", # tpim = mul2_i - mul1_i
> # vec_r jh/jl
> "svremap 26, 0, 0, 0, 0, 1",
> "sv.ffadds 0.v, 24, 0.v", # vh/vl +/- tpre
> # vec_i jh/jl
> "svremap 26, 0, 0, 0, 0, 1",
> "sv.ffadds 8.v, 26, 8.v", # vh/vl +/- tpim
> # svstep loop
> "setvl. 0, 0, 0, 1, 0, 0",
> "bc 4, 2, -84"
>
> if you prefer the original source it's here
> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
<
That is pretty close to the code I illustrated a few days ago.
<
FORTRAN
DO 20 J = 1, N2
C = COS (A)
S =-SIN (A)
A = J*E
DO 30 I = J, N, N1
XT = RE(I) - RO(I)
YT = IM(I) - IN(I)
RE(I) = RE(I) + RO(I)
IM(I) = IM(I) + IN(I)
RO(I) = XT*C - YT*S
IN(I) = XT*S + YT*C
30 CONTINUE
20 CONTINUE
<
My 66000 inner loop::
loop30:
VEC RI,{RI}
LDD RXI,[RRE+RI<<3]
LDD RXL,[RRO+RI<<3]
LDD RYI,[RIM+RI<<3]
LDD RYL,[RIN+RI<<3]
FADD RXT,RXI,-RXL
FADD RYT,RYI,-RYL
FADD RXI,RXI,RXL
FADD RYI,RYI,RYL
FMUL RXC,RXT,RC
FMUL RXS,RYT,RS
FMUL RYS,RXT,RS
FMUL RYC,RYT,RC
FADD RXL,RXC,-RXS
FADD RYL,RYS,RYC
STD RXI,[RRE+RI<<3]
STD RYI,[RIM+RI<<3]
STD RXL,[RRO+RI<<3]
STD RYL,[RIN+RI<<3]
LOOP Ri,RN1,Rn
<
Also note SIN and COS are instructions in My 66000--no need for a table
of these things.
<
> in particular in the original source you can see the triple loop generating
> indices jl, jh and k, which is straight out of the excellent and clear nayuki
> example code
> https://www.nayuki.io/page/free-small-fft-in-multiple-languages
>
> the "svshape" instruction establishes an O(N log2 N) Virtual Vector Length, but sets up 3 loop schedules which create the appropriate triple for-loop that would normally be done explicitly.
>
> three 32-bit SPRs are set up behind the scenes:
> * SVSHAPE0 for vec[j] (jl)
> * SVSHAPE1 for vec[j+halfstep] (jh)
> * SVSHAPE2 for cos[k] and sin[k] tables.
>
> the next odd bit is svremap. register operands are given an index
> 0 thru 4 for FRA, FRB, FRC, FRT and FRS. these make sense to anyone familiar with the Power ISA.
>
> ABC are input operands, T and S are output.
>
> svremap therefore contains a bitmask in binary to enable REMAPPing of FR{ABCTS} and the next 5 arguments indicate which SVSHAPE schedule they should use.
>
> "svremap 5, 1, 0, 2, 0, 0" means:
>
> * 0b101 therefore FRA and FRC are "remapped", B T and S are ignored
> * 1st argument, 1, means that FRA uses SVSHAPE1
> * 2nd argument ignored
> * 3rd argument 2, means that FRC uses SVSHAPE2
> * 4th and 5th arguments ignored.
>
> the next instruction is "sv.fmuls 24, 0.v, 16.v" which is FMULS FRT, FRA, FRC
>
> combined with its REMAP schedule it says:
>
> * FRA shall read register REMAP(0, SVSHAPE1)
> * FRC shall read register REMAP(16, SVSHAPE2)
>
> and in this way, instead of FRA cycling through 0 1 2 3 4 5
> it instead cycles through 0 2 4 6 0 4 according to a standard FFT
> butterfly schedule for the lower index.
>
> likewise FRC cycles through the multiply coefficients needed.
>
> the only other "oddity" is the twin ADD/SUB instruction.
> although only declared with 3 operands FRT, FRA, FRB it
> actually has 2 in 2 out: FRS is the same starting point as
> FRT, but with different REMAP schedules, FRT can target jh
> and FRS can target jl offsets.
>
> it has to be 2in 2 out so that the operands remain "in-flight" and thus
> can do an in-place overwrite. if that takes 2 cycles, so what: it's doing 2
> operations (FADD, FSUB) anyway.
>
> it would be very nice if this was more compact: i am thinking
> of ways to reduce the number of times that svremap has to be called.
>
> stunningly, if complex numbers were first-class citizens, this reduces down
> to THREE instructions.
>
> "svshape 8, 1, 1, 1, 0",
> "svremap 31, 1, 0, 2, 0, 1",
> "sv.ffmadds 0.v, 0.v, 0.v, 8.v"
>
> that ffmadds is FIVE operands, FRS and FRT as output, FRA FRB
> and FRC as input. yes, it does the twin add/sub of the FRC*FRA
> coefficient in a butterfly swap, in-place.
>
> the costly bit about not having complex numbers as first class citizens is that scalar operands need to be used to store the intermediate computation of four multiplies *and then* a *pair* of butterfly add/subs have to be done (one for real, the other for imag).
<
I looked into making complex first-class citizens 5-odd years ago, but ran into Quaternions
(complex with 3 imaginaries i, j, k) and Octonions (complex with 7 imaginaries). I had no
basis for why complex (binarion) should be left in and Quaternions left out. So, in effect,
I punted. Complex did not fit into my memory reference AGEN pattern, either.
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18670&group=comp.arch#18670
X-Received: by 2002:a05:620a:1137:: with SMTP id p23mr1554408qkk.490.1626137500979;
Mon, 12 Jul 2021 17:51:40 -0700 (PDT)
X-Received: by 2002:a9d:4e0a:: with SMTP id p10mr1014079otf.329.1626137500715;
Mon, 12 Jul 2021 17:51:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 17:51:40 -0700 (PDT)
In-Reply-To: <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.176.255; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.176.255
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com> <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Tue, 13 Jul 2021 00:51:40 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Tue, 13 Jul 2021 00:51 UTC

On Tuesday, July 13, 2021 at 1:12:16 AM UTC+1, MitchAlsup wrote:

> > if you prefer the original source it's here
> > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
> <
> That is pretty close to the code I illustrated a few days ago.

yehyeh.

> Also note SIN and COS are instructions in My 66000--no need for a table
> of these things.

we'll be adding SIN and COS etc because 3D. although.. aren't these supposed to be expensive latency even in hardware (like divide?) and so best cached in tables anyway?

> I looked into making complex first-class citizens 5-odd years ago, but ran into Quaternions
> (complex with 3 imaginaries i, j, k) and Octonions (complex with 7 imaginaries). I had no
> basis for why complex (binarion) should be left in and Quaternions left out. So, in effect,
> I punted.

makes perfect sense to me.

> Complex did not fit into my memory reference AGEN pattern, either.

if considered an opaque vec2 it's just 128 bits where 64 would normally be done. except quaternions etc. yeah we skip those.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<402482d6-e2f5-46ee-9574-a36d52252f7an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18672&group=comp.arch#18672
X-Received: by 2002:a37:9e07:: with SMTP id h7mr1468570qke.481.1626138093145;
Mon, 12 Jul 2021 18:01:33 -0700 (PDT)
X-Received: by 2002:aca:dbd6:: with SMTP id s205mr12753482oig.155.1626138092908;
Mon, 12 Jul 2021 18:01:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 18:01:32 -0700 (PDT)
In-Reply-To: <f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com> <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <402482d6-e2f5-46ee-9574-a36d52252f7an@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jul 2021 01:01:33 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 13 Jul 2021 01:01 UTC

On Monday, July 12, 2021 at 7:51:42 PM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 1:12:16 AM UTC+1, MitchAlsup wrote:
>
> > > if you prefer the original source it's here
> > > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
> > <
> > That is pretty close to the code I illustrated a few days ago.
> yehyeh.
> > Also note SIN and COS are instructions in My 66000--no need for a table
> > of these things.
> we'll be adding SIN and COS etc because 3D. although.. aren't these supposed to be expensive latency even in hardware (like divide?) and so best cached in tables anyway?
<
While they are FDIV-like in latency the FFTs can be arranged so that these are
initialized outside of the innermost loop, which considerably decreases the costs.
And if the FFT is a big data set, each inner loop will totally trash the L1 and most
of the L2 caches, so having these in tables can be significantly more expensive
than a cache hitting LD; and at this point the instructions are higher performing....
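<
A rough Python sketch of the arrangement described above, with the trig evaluation hoisted out of the innermost loop. It only counts work and computes nothing; count_work is a made-up name, and the loop structure is an assumption modelled on the J-outer, I-inner FORTRAN loops earlier in the thread.

```python
def count_work(n):
    """Count butterflies and SIN/COS evaluations for an n-point radix-2 FFT.

    Assumes one COS and one SIN are evaluated per iteration of the middle
    loop and reused by every butterfly in the innermost loop.
    """
    butterflies = 0
    trig_pairs = 0
    size = 2
    while size <= n:
        halfsize = size // 2
        for j in range(halfsize):
            trig_pairs += 1            # C = COS(A), S = SIN(A), once per j
            for _ in range(j, n, size):
                butterflies += 1       # innermost loop reuses C and S
        size *= 2
    return butterflies, trig_pairs


print(count_work(1024))
```

For 1024 points this counts 5120 butterflies against 1023 SIN/COS pairs, so under these assumptions the FDIV-like latency is paid roughly once per five butterflies rather than on every one.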
<
Also note: USPTO 10,761,806 has been issued which covers how to do these
at FDIV speeds. I am willing to sell licenses for less than it will cost you to
assign an engineer to develop a non-infringing implementation. We should take
this off line.......
<
> > I looked into making complex first-class citizens 5-odd years ago, but ran into Quaternions
> > (complex with 3 imaginaries i, j, k) and Octonions (complex with 7 imaginaries). I had no
> > basis for why complex (binarion) should be left in and Quaternions left out. So, in effect,
> > I punted.
> makes perfect sense to me.
> > Complex did not fit into my memory reference AGEN pattern, either.
> if considered an opaque vec2 it's just 128 bits where 64 would normally be done. except quaternions etc. yeah we skip those.
>
> l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<scip94$ia9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18674&group=comp.arch#18674
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
Date: Mon, 12 Jul 2021 18:12:02 -0700
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <scip94$ia9$1@dont-email.me>
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de>
<1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de>
<0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com>
<scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com>
<6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 13 Jul 2021 01:12:04 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2803600291fdb187722b2d5e72f33d36";
logging-data="18761"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/pM9Eu4cd7/W2lXcTJ3ldVe/WYx9ehJRc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:JFeaxzKEUh4Wp6ROINVoivwRFnQ=
In-Reply-To: <f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 13 Jul 2021 01:12 UTC

On 7/12/2021 5:51 PM, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 1:12:16 AM UTC+1, MitchAlsup wrote:
>
>>> if you prefer the original source it's here
>>> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
>> <
>> That is pretty close to the code I illustrated a few days ago.
>
> yehyeh.
>
>
>> Also note SIN and COS are instructions in My 66000--no need for a table
>> of these things.
>
> we'll be adding SIN and COS etc because 3D. although.. aren't these supposed to be expensive latency even in hardware (like divide?) and so best cached in tables anyway?
>
>> I looked into making complex first-class citizens 5-odd years ago, but ran into Quaternions
>> (complex with 3 imaginaries i, j, k) and Octonions (complex with 7 imaginaries). I had no
>> basis for why complex (binarion) should be left in and Quaternions left out. So, in effect,
>> I punted.
>
> makes perfect sense to me.

I am certainly not an expert in this area, but isn't complex used much
more frequently than Quaternions or Octonions? I am not saying either
of you should include complex, but isn't usage an argument for including
it but not the others? After all, that is what Fortran decided. :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<c26d525c-ff9e-4b2a-8e2c-9882bb0b4cc0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18675&group=comp.arch#18675
X-Received: by 2002:a05:620a:ed8:: with SMTP id x24mr1571553qkm.299.1626139503552;
Mon, 12 Jul 2021 18:25:03 -0700 (PDT)
X-Received: by 2002:aca:c7cb:: with SMTP id x194mr1228409oif.119.1626139503321;
Mon, 12 Jul 2021 18:25:03 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 18:25:03 -0700 (PDT)
In-Reply-To: <scip94$ia9$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.176.255; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.176.255
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com> <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com> <scip94$ia9$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c26d525c-ff9e-4b2a-8e2c-9882bb0b4cc0n@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Tue, 13 Jul 2021 01:25:03 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: luke.l...@gmail.com - Tue, 13 Jul 2021 01:25 UTC

On Tuesday, July 13, 2021 at 2:12:07 AM UTC+1, Stephen Fuld wrote:

> I am certainly not an expert in this area, but isn't complex used much
> more frequently than Quaternions or Octonions? I am not saying either
> of you should include complex, but isn't usage an argument for including
> it but not the others? After all, that is what Fortran decided. :-)

:)

in this particular case, everything being added has to be submitted to the (soon to be formally announced and long awaited) OpenPOWER Foundation ISA WG.

the rule that was explained to me sums up as: the higher the cost, the bigger the payoff has to be.

we have about a hundred instructions that need writing up and formally proposing, already (bitmanip, carryless mul, Galois Field, transcendentals, 3D texture interpolation, in addition to SVP64 itself) which is slightly freaking me out given that the embedded FP subset is 214 or so, we have 50% more to add to that, in Draft form.

purely from a time and resource perspective, complex has to wait. we have already achieved a 5x to 10x reduction in code size due to most ISAs having to macro-unroll FFTs for explicit sizes, or use almost a thousand lines of hand-optimised assembler. you should see the ffmpeg source code for x86, ppc and aarch64, it's an "inspiration" to do a better ISA - waaay better.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<0a8fecd6-b011-4c22-8ebb-abd21e81a72dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18677&group=comp.arch#18677
X-Received: by 2002:a0c:e54e:: with SMTP id n14mr2035287qvm.41.1626140097913;
Mon, 12 Jul 2021 18:34:57 -0700 (PDT)
X-Received: by 2002:a4a:ab07:: with SMTP id i7mr1512719oon.89.1626140097664;
Mon, 12 Jul 2021 18:34:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 18:34:57 -0700 (PDT)
In-Reply-To: <402482d6-e2f5-46ee-9574-a36d52252f7an@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.176.255; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.176.255
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com> <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com> <402482d6-e2f5-46ee-9574-a36d52252f7an@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0a8fecd6-b011-4c22-8ebb-abd21e81a72dn@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Tue, 13 Jul 2021 01:34:57 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: luke.l...@gmail.com - Tue, 13 Jul 2021 01:34 UTC

On Tuesday, July 13, 2021 at 2:01:34 AM UTC+1, MitchAlsup wrote:
> On Monday, July 12, 2021 at 7:51:42 PM UTC-5, luke.l...@gmail.com wrote:
..
> > we'll be adding SIN and COS etc because 3D. although.. aren't these supposed to be expensive latency even in hardware (like divide?) and so best cached in tables anyway?
> <
> While they are FDIV-like in latency the FFTs can be arranged so that these are
> initialized outside of the innermost loop, which considerably decreases the costs.

ohh yehyehyeh, seen examples like that.

> And if the FFT is a big data set, each inner loop will totally trash the L1 and most
> of the L2 caches, so having these in tables can be significantly more expensive
> than a cache hitting LD; and at this point the instructions are higher performing....

interesting. would a 4 way set associative L1 cache not help? i am keenly aware that the FFT hits the same power of 2 point, so if there are say 64 cache lines it is ABSOLUTELY guaranteed that for large FFTs the exact same cache line is going to get hammered.
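
a quick Python sketch of that power-of-2 aliasing. the numbers are assumptions for illustration only (64-byte lines, 64 lines in total, a 4096-byte stride) and cache_set / sets_touched are made-up helper names. it models set selection only, not replacement policy.

```python
def cache_set(addr, line_bytes, num_sets):
    """Set index for a byte address in a simple modulo-indexed cache."""
    return (addr // line_bytes) % num_sets


def sets_touched(base, stride_bytes, count, line_bytes, num_sets):
    """Distinct sets hit by `count` accesses at a fixed byte stride."""
    return {cache_set(base + i * stride_bytes, line_bytes, num_sets)
            for i in range(count)}


LINE = 64  # assumed line size in bytes

# direct-mapped, 64 lines: 64 sets of 1 way
print(sets_touched(0, 4096, 8, LINE, 64))
# 4-way set associative, same 64 lines: 16 sets of 4 ways
print(sets_touched(0, 4096, 8, LINE, 16))
```

in both configurations every access lands in set 0. the 4-way arrangement can keep four of those lines resident at once, so under these assumptions it helps only while no more than four conflicting lines are live.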

> <
> Also note: USPTO 10,761,806 has been issued which covers how to do these
> at FDIV speeds. I am willing to sell licenses for less than it will cost you to
> assign an engineer to develop a non-infringing implementation. We should take
> this off line.......

cando, this will be something the commercial operation would look into when established. Libre-SOC is NLnet Grant funded (charitable foundation) and a Libre/Open R&D group.

l.

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<8acd6131-79e9-4d3d-b9ec-d587924f690cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18678&group=comp.arch#18678
X-Received: by 2002:ac8:46d0:: with SMTP id h16mr1738750qto.362.1626140243472;
Mon, 12 Jul 2021 18:37:23 -0700 (PDT)
X-Received: by 2002:a9d:5603:: with SMTP id e3mr1470189oti.178.1626140243270;
Mon, 12 Jul 2021 18:37:23 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 18:37:23 -0700 (PDT)
In-Reply-To: <scip94$ia9$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <9cda8070-fe83-4c7c-beb2-33a4e4612fc9n@googlegroups.com>
<sc9jfi$q0k$1@newsreader4.netcologne.de> <1fd1e468-645d-4d10-a433-240a18b3a20fn@googlegroups.com>
<scf70f$jfp$1@newsreader4.netcologne.de> <0229daca-7af9-4bc1-8774-82551950dd53n@googlegroups.com>
<90d7fd1c-3176-4a4e-9213-a541a0e43cb9n@googlegroups.com> <scfem1$pf9$1@newsreader4.netcologne.de>
<fb6733b8-64b4-4c19-aad9-3166cb93bfffn@googlegroups.com> <6824a820-f53b-4f93-a849-ce4f05dd8701n@googlegroups.com>
<f5ad32cc-7c33-41a7-9f5b-e5c3cd09daf1n@googlegroups.com> <scip94$ia9$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8acd6131-79e9-4d3d-b9ec-d587924f690cn@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jul 2021 01:37:23 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3454
 by: MitchAlsup - Tue, 13 Jul 2021 01:37 UTC

On Monday, July 12, 2021 at 8:12:07 PM UTC-5, Stephen Fuld wrote:
> On 7/12/2021 5:51 PM, luke.l...@gmail.com wrote:
> > On Tuesday, July 13, 2021 at 1:12:16 AM UTC+1, MitchAlsup wrote:
> >
> >>> if you prefer the original source it's here
> >>> https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD
> >> <
> >> That is pretty close to the code I illustrated a few days ago.
> >
> > yehyeh.
> >
> >
> >> Also note SIN and COS are instructions in My 66000--no need for a table
> >> of these things.
> >
> > we'll be adding SIN and COS etc because 3D. although.. aren't these supposed to be expensive latency even in hardware (like divide?) and so best cached in tables anyway?
> >
> >> I looked into making complex first-class citizens 5-odd years ago, but ran into Quaternions
> >> (complex with 3 imaginaries i, j, k) and Octonions (complex with 7 imaginaries). I had no
> >> basis for why complex (binarion) should be left in and Quaternions left out. So, in effect,
> >> I punted.
> >
> > makes perfect sense to me.
<
> I am certainly not an expert in this area, but isn't complex used much
> more frequently than Quaternions or Octonions? I am not saying either
> of you should include complex, but isn't usage an argument for including
> it but not the others? After all, that is what Fortran decided. :-)
<
You can use FORTRAN as justification.
<
But what bothered me, personally, was trying to decide on including or excluding
something that I had insufficient information about, and the more I looked the
more the scope of "doing it right" was escaping from me.
<
Thus, time to punt.
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
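
For readers unfamiliar with the complex / quaternion / octonion progression discussed above, here is a minimal sketch of the Cayley-Dickson doubling construction that relates them. It is an illustration only and not code from the thread or from either ISA: the tuple representation and all function names (`neg`, `conj`, `add`, `mul`, `zero`, `real_muls`) are assumptions made for this example. It counts real multiplications for the naive product only; smarter formulations may need fewer.

```python
# Illustrative sketch only (hypothetical helper names, not from the thread).
# A real is a plain number; each higher algebra is a pair (a, b) of elements
# of the algebra below it: complex = (re, im), quaternion = (cplx, cplx), ...

MUL_COUNT = [0]  # counts real multiplications performed by mul()

def neg(x):
    return (neg(x[0]), neg(x[1])) if isinstance(x, tuple) else -x

def conj(x):
    # Conjugate: (a, b)* = (a*, -b); reals are their own conjugate.
    return (conj(x[0]), neg(x[1])) if isinstance(x, tuple) else x

def add(x, y):
    if isinstance(x, tuple):
        return (add(x[0], y[0]), add(x[1], y[1]))
    return x + y

def mul(x, y):
    # Cayley-Dickson product: (a, b)(c, d) = (ac - d* b, da + b c*)
    if not isinstance(x, tuple):
        MUL_COUNT[0] += 1
        return x * y
    (a, b), (c, d) = x, y
    return (add(mul(a, c), neg(mul(conj(d), b))),
            add(mul(d, a), mul(b, conj(c))))

def zero(level):
    # Zero element at a doubling level (0 = real, 1 = complex, ...).
    return 0 if level == 0 else (zero(level - 1), zero(level - 1))

def real_muls(level):
    # Real multiplies in one naive product at a given doubling level
    # (1 = complex, 2 = quaternion, 3 = octonion).
    MUL_COUNT[0] = 0
    mul(zero(level), zero(level))
    return MUL_COUNT[0]
```

With this naive formulation each doubling multiplies the work by four, and from quaternions upward the product stops being commutative (`mul(i, j)` and `mul(j, i)` differ in sign), which gives some feel for how the scope grows with each step.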

Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors

<cc762ae9-3ecf-45fb-aab6-4105ac2047f2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18679&group=comp.arch#18679

Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 18:39:28 -0700 (PDT)
Message-ID: <cc762ae9-3ecf-45fb-aab6-4105ac2047f2n@googlegroups.com>
Subject: Re: Libre-SOC going to 180nm silicon, and Virtual Vertical Vectors
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jul 2021 01:39:28 +0000
 by: MitchAlsup - Tue, 13 Jul 2021 01:39 UTC

On Monday, July 12, 2021 at 8:25:04 PM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 2:12:07 AM UTC+1, Stephen Fuld wrote:
>
> > I am certainly not an expert in this area, but isn't complex used much
> > more frequently than Quaternions or Octonions? I am not saying either
> > of you should include complex, but isn't usage an argument for including
> > it but not the others? After all, that is what Fortran decided. :-)
> :)
>
> in this particular case, everything being added has to be submitted to the (soon to be formally announced and long awaited) OpenPOWER Foundation ISA WG.
>
> the rule that was explained to me sums up as: the higher the cost, the bigger the payoff has to be.
>
> we have about a hundred instructions that need writing up and formally proposing, already (bitmanip, carryless mul, Galois Field, transcendentals, 3D texture interpolation, in addition to SVP64 itself) which is slightly freaking me out: given that the embedded FP subset is 214 or so, we have 50% more to add to that, in Draft form.
<
As a point of comparison, My 66000 has exactly 61 instructions total.
<
>
> purely from a time and resource perspective, complex has to wait. we have already achieved a 5x to 10x reduction in code size due to most ISAs having to macro-unroll FFTs for explicit sizes, or use almost a thousand lines of hand-optimised assembler. you should see the ffmpeg source code for x86, ppc and aarch64, it's an "inspiration" to do a better ISA - waaay better.
>
> l.
