Rocksolid Light

devel / comp.arch / Re: Reconsidering Variable-Length Operands

Subject  Author
* Reconsidering Variable-Length Operands  Quadibloc
+* Re: Reconsidering Variable-Length Operands  Quadibloc
|`- Re: Reconsidering Variable-Length Operands  Quadibloc
+- Re: Reconsidering Variable-Length Operands  Ivan Godard
+* Re: Reconsidering Variable-Length Operands  Bill Findlay
|`- Re: Reconsidering Variable-Length Operands  Quadibloc
+* Re: Reconsidering Variable-Length Operands  MitchAlsup
|+- Re: Reconsidering Variable-Length Operands  Quadibloc
|+* Re: Reconsidering Variable-Length Operands  Quadibloc
||`* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| +* Re: Reconsidering Variable-Length Operands  Marcus
|| |+- Re: Reconsidering Variable-Length Operands  BGB
|| |+* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||`* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| || +- Re: Reconsidering Variable-Length Operands  BGB
|| || `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||  `* Re: Reconsidering Variable-Length Operands  Marcus
|| ||   `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    +* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| ||    |+- Re: Reconsidering Variable-Length Operands  Marcus
|| ||    |`* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    | `* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||    |  `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    |   `* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||    |    `* Re: Reconsidering Variable-Length Operands  EricP
|| ||    |     +* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||    |     |`* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| ||    |     | `* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||    |     |  `* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| ||    |     |   +- Re: Reconsidering Variable-Length Operands  George Neuner
|| ||    |     |   +- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    |     |   +* Re: Reconsidering Variable-Length Operands  Andy Valencia
|| ||    |     |   |`- Re: Reconsidering Variable-Length Operands  David Brown
|| ||    |     |   `- Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||    |     `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    |      `* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| ||    |       `- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| ||    `* Re: Reconsidering Variable-Length Operands  Terje Mathisen
|| ||     +- Re: Reconsidering Variable-Length Operands  Ivan Godard
|| ||     +- Re: Reconsidering Variable-Length Operands  EricP
|| ||     +* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| ||     |`- Re: Reconsidering Variable-Length Operands  BGB
|| ||     `* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| ||      `* Re: Reconsidering Variable-Length Operands  Michael S
|| ||       `* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| ||        `- Re: Reconsidering Variable-Length Operands  Michael S
|| |`* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | +* Re: Reconsidering Variable-Length Operands  BGB
|| | |`* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| | | `* Re: Reconsidering Variable-Length Operands  BGB
|| | |  `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |   +- Re: Reconsidering Variable-Length Operands  BGB
|| | |   `* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | |    `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     +- Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| | |     +* Re: Reconsidering Variable-Length Operands  EricP
|| | |     |+- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |`* Re: Reconsidering Variable-Length Operands  BGB
|| | |     | `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |  `- Re: Reconsidering Variable-Length Operands  BGB
|| | |     +* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |+* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | |     ||`- Re: Reconsidering Variable-Length Operands  BGB
|| | |     |`* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | |     | `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |  `- Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | |     +* Re: Reconsidering Variable-Length Operands  Quadibloc
|| | |     |+- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |+* Re: Reconsidering Variable-Length Operands  Thomas Koenig
|| | |     ||+- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     ||+* Re: Reconsidering Variable-Length Operands  BGB
|| | |     |||`* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     ||| `* Re: Reconsidering Variable-Length Operands  BGB
|| | |     |||  +- Re: Reconsidering Variable-Length Operands  BGB
|| | |     |||  `* Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |||   `* Re: Reconsidering Variable-Length Operands  BGB
|| | |     |||    `- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     ||`* Re: Reconsidering Variable-Length Operands  Quadibloc
|| | |     || +- Re: Reconsidering Variable-Length Operands  BGB
|| | |     || `* Re: Reconsidering Variable-Length Operands  Terje Mathisen
|| | |     ||  `* Re: Reconsidering Variable-Length Operands  EricP
|| | |     ||   `* Re: Reconsidering Variable-Length Operands  Ivan Godard
|| | |     ||    +- Re: Reconsidering Variable-Length Operands  BGB
|| | |     ||    `* Re: Reconsidering Variable-Length Operands  EricP
|| | |     ||     `- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |+* Re: Reconsidering Variable-Length Operands  Stephen Fuld
|| | |     ||`- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |     |`* Re: Reconsidering Variable-Length Operands  Stefan Monnier
|| | |     | `- Re: Reconsidering Variable-Length Operands  BGB
|| | |     `* Re: Reconsidering Variable-Length Operands  Paul A. Clayton
|| | |      +- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | |      `- Re: Reconsidering Variable-Length Operands  MitchAlsup
|| | `- Re: Reconsidering Variable-Length Operands  George Neuner
|| `* Re: Reconsidering Variable-Length Operands  MitchAlsup
||  `- Re: Reconsidering Variable-Length Operands  Ivan Godard
|`* Re: Reconsidering Variable-Length Operands  Paul A. Clayton
| +* Re: Reconsidering Variable-Length Operands  Anton Ertl
| |+- Re: Reconsidering Variable-Length Operands  MitchAlsup
| |`* Re: Reconsidering Variable-Length Operands  Paul A. Clayton
| | +* Re: Reconsidering Variable-Length Operands  MitchAlsup
| | |`- Re: Reconsidering Variable-Length Operands  Paul A. Clayton
| | `* Re: Reconsidering Variable-Length Operands  Anton Ertl
| `- Re: Reconsidering Variable-Length Operands  Scott Lurndal
+* Re: Reconsidering Variable-Length Operands  BGB
+* Re: Reconsidering Variable-Length Operands  Stefan Monnier
`* Re: Reconsidering Variable-Length Operands  John Dallman

Re: Reconsidering Variable-Length Operands

<t7us5d$jns$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25702&group=comp.arch#25702

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 09:34:04 +0200
Message-ID: <t7us5d$jns$1@dont-email.me>
In-Reply-To: <95ba8206-5e46-4f6d-b157-a43154249a5fn@googlegroups.com>
 by: Marcus - Fri, 10 Jun 2022 07:34 UTC

On 2022-06-08, MitchAlsup wrote:
> On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> On Tuesday, June 7, 2022 at 1:58:33 AM UTC-5, Marcus wrote:
>>>> On 2022-06-07, Thomas Koenig wrote:
>>>>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>>>>> On Sunday, June 5, 2022 at 7:17:54 PM UTC-6, MitchAlsup wrote:
>>>>>>
>>>>>>> I still don't think you have progressed to the point where you realize
>>>>>>> architecture is just about as much about what to leave out as what
>>>>>>> to put in.
>>>>>>
>>>>>> I'm _aware_ of the principle, even if I don't practice it much.
>>>>>>
>>>>>> Also, my inclination is to leave the "leaving out" to the _implementation_
>>>>>> stage; the ISA is designed to have room for everything one might want
>>>>>> in a computer... and then any given implementation leaves out what is
>>>>>> not useful to the intended customer.
>>>>>
>>>>> What variant of your ISA should software vendors assume?
>>>> Most ISA:s have some mechanism for handling "extensions", which can be
>>>> exposed to software in the form of flags in a system register, for
>>>> instance.
>>> <
>>> For example: the decode of an invalid instruction causes exception transfer
>>> to handler where the instruction can be evaluated in SW. Make these transfers
>>> of control fast enough and you don't need the flag.
>> Looking at the matmul code I just posted in reply to Marcus... it is
>> part of
>>
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/generated/matmul_r4.c;hb=HEAD
>>
>> How, exactly, would you do this in libgfortran? Clearly, trapping
>> on and emulating on every AVX2 instruction in a tightly written
>> matmul would be ridiculous from a speed perspective.
>>
>> Do you have other options in mind?
> <
> It is the totally exceptional instruction that gets emulated. 99.999,85% of instructions
> generated by the compiler are native (maybe even 100%).
> <
> Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
> vectorization. VEC-LOOP eliminate any need for SIMD ISA--that is the implementation
> of the moment decides with width of the SIMD path, and the programmer/compiler
> accesses this via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
> and no wide-registers are needed to provide the capability.
> <
> VEC-LOOP provides vectorized string handling, memory handling, and when I get around
> to it, decimal handling.

Regardless of how well designed an ISA is, 20-30 years after its initial
release there *will* be new instructions in the ISA, and they will
mostly be about improving performance in special cases in certain
domains (e.g. encryption, codecs, security, etc). If the ISA is popular
enough to have several different implementations and to run most of the
world's software, SW developers will start writing different code paths
for different implementations, *if* it gives a 10%+ performance gain.

Even if you make exception handling really fast, you still have:

exception > function call > inlined code > specialized solution

And performance sensitive SW developers are going to want to be on the
right end of that scale.

Sure, you can minimize the problem with a good design, perhaps to a
point where most people do not care, but I don't think that you can
eliminate it completely.

/Marcus

Re: Reconsidering Variable-Length Operands

<940e7246-0cfc-4c83-9306-2294db5f505en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25705&group=comp.arch#25705

From: already5...@yahoo.com (Michael S)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 03:50:39 -0700 (PDT)
Message-ID: <940e7246-0cfc-4c83-9306-2294db5f505en@googlegroups.com>
In-Reply-To: <b3cba7be-b6b7-42d2-9d43-9e9c6db100dbn@googlegroups.com>
 by: Michael S - Fri, 10 Jun 2022 10:50 UTC

On Wednesday, June 8, 2022 at 8:57:17 PM UTC+3, Michael S wrote:
> On Wednesday, June 8, 2022 at 8:39:19 AM UTC+3, Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> > > On Tuesday, June 7, 2022 at 11:48:45 PM UTC+3, Thomas Koenig wrote:
> > >> EricP <ThatWould...@thevillage.com> schrieb:
> > >> > What about double-doubles?
> > >> > Too slow?
> > >> On the same machine which gave 250 MFlops for a matrix
> > >> multiplication for IEEE qp, I also got around 50 MFlops for double
> > >> double (similar to libquadmath on my home box).
> > >
> > > What is your home box and how many cores were used to get 50 MFlops QP?
> > /proc/cpuinfo tells me it's a AMD Ryzen 7 1700X Eight-Core Processor
> > , so Zen 1. And I have no idea what it was clocked at the time.
> >
> > It was a single core, just using gfortran's matmul (which is
> > OK for moderate sizes of double and has not been tuned at all
> > for 16-byte reals - we may well overflow the cache there).
> > > If it's more than one core on modern Intel/AMD, it sounds rather poor.
> > > In fact, on really modern Intel/AMD able to go above 4.5 GHz even on one core
> > > it sounds poor.
> > Like I said, it's my home box (which is about to be replaced anyway).
> I reproduced your measurement, both with your program and in C/C++ with repetitions and
> statistical filtering.
> Your home box is o.k. Faster than 2 out of 4 CPUs that I tested.
> My measurements were as follows:
> 42.8 - i7-3770
> 50.7 - E3-1271 v3
> 55.6 - E-2176G @4.25GHz
> 44.9 - EPYC 7543P
>
> So, gnu quadmath is indeed very slow even for two basic primitives.
> Maybe tomorrow I'll test how it compares with my own quad-precision class.

Yesterday I was too busy with real work, so games had to wait until today.
89/123 - E3-1271 v3
95/132 - E-2176G @4.25GHz
The first number is matmul (M=401, N=401, K=667) implemented with MUL/ADD;
the second is implemented with FMA.

My class has higher precision and exponent range than IEEE binary128
(128 bits vs 112 bits, 32 bits vs 15 bits), a bigger memory footprint
(24 bytes vs 16 bytes), and a more software-oriented internal format.
All in all, I expect that competently implemented IEEE binary128 for big
matmul should produce approximately the same speed as my class.

Now, the question is: do the gnu people want a competent implementation?
If they do, I am willing to help.

Re: Reconsidering Variable-Length Operands

<09a1a9fb-f8f8-4f62-99e5-06bc66a32a06n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25707&group=comp.arch#25707

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 08:54:53 -0700 (PDT)
Message-ID: <09a1a9fb-f8f8-4f62-99e5-06bc66a32a06n@googlegroups.com>
In-Reply-To: <t7us5d$jns$1@dont-email.me>
 by: MitchAlsup - Fri, 10 Jun 2022 15:54 UTC

On Friday, June 10, 2022 at 2:34:08 AM UTC-5, Marcus wrote:
> On 2022-06-08, MitchAlsup wrote:
> > On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
> >> MitchAlsup <Mitch...@aol.com> schrieb:
> > <
> > Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
> > vectorization. VEC-LOOP eliminate any need for SIMD ISA--that is the implementation
> > of the moment decides with width of the SIMD path, and the programmer/compiler
> > accesses this via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
> > and no wide-registers are needed to provide the capability.
> > <
> > VEC-LOOP provides vectorized string handling, memory handling, and when I get around
> > to it, decimal handling.
>
> Regardless of how well designed an ISA is, 20-30 years after its initial
> release there *will* be new instructions in the ISA, and they will
<
I completely agree with this: new requirements happen, which is why more
than 40% of the ISA encoding space is currently unoccupied--leaving plenty
of room to grow.
<
> mostly be about improving performance in special cases in certain
> domains (e.g. encryption, codecs, security, etc). If the ISA is popular
<
As to encryption and several procedures like it:: would it not be
better to provide an attached processor, specialized for <say> SHA256-512
where you hand it a from-address, a to-address, a set of keys, and a length
and just tell it to start processing. It goes off and performs the task, then
sets a memory location or interrupts the thread. The length could be
arbitrarily long, might encode from/into memory spaces this process
cannot see (adding security), with other unstated advantages.
<
This would give a lot better than 10% improvement, because the CPU can
be doing other things while encryption is in process. The HW could be
designed as a data-path that only knows how to do encryption things,
maybe 512-bits wide per cycle,.....
<
> enough to have several different implementations and running most of the
> world's software, SW developers will start making different code paths
> for different implementations, *if* it gives a 10%+ performance gain.
>
> Even if you make exception handling really fast, you still have:
>
> exception > function call > inlined code > specialized solution
>
> And performance sensitive SW developers are going to want to be on the
> right end of that scale.
>
> Sure, you can minimize the problem with a good design, perhaps to a
> point where most people do not care, but I don't think that you can
> eliminate it completely.
>
> /Marcus

Re: Reconsidering Variable-Length Operands

<250a87f8-9c13-40c8-a17e-73c14f0d8e89n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25708&group=comp.arch#25708

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 08:59:37 -0700 (PDT)
Message-ID: <250a87f8-9c13-40c8-a17e-73c14f0d8e89n@googlegroups.com>
In-Reply-To: <355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
 by: MitchAlsup - Fri, 10 Jun 2022 15:59 UTC

On Friday, June 10, 2022 at 12:07:06 AM UTC-5, Quadibloc wrote:
> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>
> > But if VEC-LOOP provides the functionality of SIMD without the overhead
> > (register file, and ISA additions) why implement SIMD in ISA at all ?
<
> In order _for_ it to provide equivalent functionality, one basically has
> to have the same hardware in place; the only gain is that the ISA has
> fewer instructions in it, but the loss is now that the instruction streams
> have to be converted to micro-ops through a more complicated
> process that gets close to what a compiler does.
<
I have an ISA that provides* all of what x86 provides in 59 instructions
versus <what> more than 2000 !?!
<
(*) where "provides" means feature functionality desired by HLLs.
<
But as far as HW is concerned, great big implementations will have
great wide SIMD paths close to the cache, whereas little bitty
implementations will have a little SIMD path close to the cache, and
minimalist implementations may have no extra HW at all !!
ALL running the same instructions and achieving the same results
just in different amounts of time.
>
> Maybe I'm being unfair. For example, it may be that SIMD only
> needs a register file because it's done in the ISA, and a technique
> like VVM can get the same performance just with operand
> forwarding.
>
> John Savard

Re: Reconsidering Variable-Length Operands

<t8026t$3v0$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25713&group=comp.arch#25713

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 18:23:25 -0000 (UTC)
Message-ID: <t8026t$3v0$1@newsreader4.netcologne.de>
 by: Thomas Koenig - Fri, 10 Jun 2022 18:23 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> But overnight I figured out a way to add FP128 and FP16 (and more if desired.)
> along with a way to add SIMD that fits in 64-bit registers.

That's great to hear! I'm looking forward to reading how you fit this
into your ISA.

Re: Reconsidering Variable-Length Operands

<t8030p$3v0$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25714&group=comp.arch#25714

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 18:37:13 -0000 (UTC)
Message-ID: <t8030p$3v0$2@newsreader4.netcologne.de>
 by: Thomas Koenig - Fri, 10 Jun 2022 18:37 UTC

Quadibloc <jsavard@ecn.ab.ca> schrieb:
> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>
>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>> (register file, and ISA additions) why implement SIMD in ISA at all ?
>
> In order _for_ it to provide equivalent functionality, one basically has
> to have the same hardware in place; the only gain is that the ISA has
> fewer instructions in it,

Take a look at

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

and tell me that this advantage is not significant :-)

>but the loss is now that the instruction streams
> have to be converted to micro-ops through a more complicated
> process that gets close to what a compiler does.

Auto-vectorization to SIMD is one of the things that are very hard
for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
programmer writing assembler (or "intrinsics" C code, which IMHO is
sometimes worse to write) is still able to beat a compiler by a
significant factor, 2 or 4 in many cases.

If you doubt that, I invite you to take a quick look at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
PRs that are blocked by it.

Re: Reconsidering Variable-Length Operands

<t803pg$q39$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25715&group=comp.arch#25715

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 11:50:25 -0700
Message-ID: <t803pg$q39$1@dont-email.me>
In-Reply-To: <09a1a9fb-f8f8-4f62-99e5-06bc66a32a06n@googlegroups.com>
 by: Ivan Godard - Fri, 10 Jun 2022 18:50 UTC

On 6/10/2022 8:54 AM, MitchAlsup wrote:
> On Friday, June 10, 2022 at 2:34:08 AM UTC-5, Marcus wrote:
>> On 2022-06-08, MitchAlsup wrote:
>>> On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> <
>>> Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
>>> vectorization. VEC-LOOP eliminate any need for SIMD ISA--that is the implementation
>>> of the moment decides with width of the SIMD path, and the programmer/compiler
>>> accesses this via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
>>> and no wide-registers are needed to provide the capability.
>>> <
>>> VEC-LOOP provides vectorized string handling, memory handling, and when I get around
>>> to it, decimal handling.
>>
>> Regardless of how well designed an ISA is, 20-30 years after its initial
>> release there *will* be new instructions in the ISA, and they will
> <
> I completely agree with this, new requirements happen, which is why more
> than 40% of the ISA encoding space is currently unoccupied--leaving plenty
> of room to grow.
> <
>> mostly be about improving performance in special cases in certain
>> domains (e.g. encryption, codecs, security, etc). If the ISA is popular
> <
> As to encryption and similar procedures:: would it not be better
> to provide an attached processor, specialized for <say> SHA-256/512,
> where you hand it a from-address, a to-address, a set of keys, and a length,
> and just tell it to start processing. It goes off and performs the task, then
> sets a memory location or interrupts the thread. The length could be
> arbitrarily long, and it might encrypt from/into memory spaces this process
> cannot see (adding security), with other unstated advantages.
> <
> This would give a lot better than a 10% improvement, because the CPU can
> be doing other things while encryption is in progress. The HW could be
> designed as a data path that only knows how to do encryption things,
> maybe 512 bits wide per cycle...

It sounds so easy, but...

In a legacy ISA you lose so much synchronizing with the outboard work
that there's a minimum task size below which the co-processor isn't
worth having. Sort of an uncanny valley in efficiency.

To make it useful, and not the bugfest and programming pain of the usual
asynchronous code, you need to have built a continuation model into the
ISA from the beginning. You've paid a lot of attention to Ada
rendezvous, but that's too heavy and thread-oriented to really do the
job IMO.
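The hand-off Mitch describes (from-address, to-address, keys, length, then a completion flag or interrupt) can be sketched as a command descriptor; this is a hypothetical illustration of the CPU-side interface, not any real device's API:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical command descriptor for an attached crypto engine, as
   described above: the CPU fills this in, kicks off the engine, and
   polls (or is interrupted) on completion. All names are illustrative. */
typedef struct {
    uint64_t    src;      /* from-address */
    uint64_t    dst;      /* to-address   */
    uint64_t    len;      /* arbitrarily long byte count */
    uint8_t     key[32];  /* e.g. a 256-bit key */
    atomic_uint done;     /* engine sets this nonzero on completion */
} crypto_cmd;

/* CPU-side hand-off: clear the flag, notify the engine, then go do
   other work while the engine runs. */
static void crypto_submit(crypto_cmd *cmd) {
    atomic_store(&cmd->done, 0);
    /* ... write &cmd to a (hypothetical) device doorbell register ... */
}

/* Check for completion without blocking. */
static int crypto_poll(const crypto_cmd *cmd) {
    return atomic_load(&cmd->done) != 0;
}
```

The point of the descriptor is exactly the synchronization cost Ivan raises: the submit/poll round trip is fixed overhead, so the engine only pays off once `len` is large enough to amortize it.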

Re: Reconsidering Variable-Length Operands

<807772a2-5133-4595-8784-40aef7d3f351n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25717&group=comp.arch#25717

X-Received: by 2002:a05:620a:10a2:b0:6a6:b540:1c61 with SMTP id h2-20020a05620a10a200b006a6b5401c61mr20274465qkk.333.1654888783587;
Fri, 10 Jun 2022 12:19:43 -0700 (PDT)
X-Received: by 2002:a05:6214:c82:b0:46a:b677:e284 with SMTP id
r2-20020a0562140c8200b0046ab677e284mr25611398qvr.28.1654888783430; Fri, 10
Jun 2022 12:19:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 10 Jun 2022 12:19:43 -0700 (PDT)
In-Reply-To: <t8030p$3v0$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:24be:340e:f29:f516;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:24be:340e:f29:f516
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com> <fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me> <9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de> <c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com> <t8030p$3v0$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <807772a2-5133-4595-8784-40aef7d3f351n@googlegroups.com>
Subject: Re: Reconsidering Variable-Length Operands
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 10 Jun 2022 19:19:43 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4488
 by: MitchAlsup - Fri, 10 Jun 2022 19:19 UTC

On Friday, June 10, 2022 at 1:37:19 PM UTC-5, Thomas Koenig wrote:
> Quadibloc <jsa...@ecn.ab.ca> schrieb:
> > On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
> >
> >> But if VEC-LOOP provides the functionality of SIMD without the overhead
> >> (register file, and ISA additions) why implement SIMD in ISA at all ?
> >
> > In order _for_ it to provide equivalent functionality, one basically has
> > to have the same hardware in place; the only gain is that the ISA has
> > fewer instructions in it,
> Take a look at
>
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>
> and tell me that this advantage is not significant :-)
> >but the loss is now that the instruction streams
> > have to be converted to micro-ops through a more complicated
> > process that gets close to what a compiler does.
<
> Auto-vectorization to SIMD is one of the things that are very hard
> for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
> programmer writing assembler (or "intrinsics" C code, which is
> IMHO sometimes worse to write) is still able to beat a compiler
> by a significant factor, 2 or 4 in many cases.
<
Brian did not find it "all that difficult" to translate scalar loop code
to VVM, precisely because VVM does not vectorize instructions,
but vectorizes loops. Basically, you find a looping pattern (such as
ADD-CMP-BC) and, once found, you simply add a VEC instruction at
the top and substitute LOOP for the found pattern (ADD-CMP-BC).
<
In particular, the compiler does not have to disambiguate memory,
or do many of the other things required of compilers targeting
machines that vectorize instructions (rather than loops).
<
And on top of this, you do not lose precise exceptions, either;
nor add state to the machine (vector/SIMD register file).
<
SIMD is just a kind of vectorization applied to instructions and
registers.
<
In the case of VVM, My 66000 needs information as to which registers
remain live after the loop terminates; this is made present by a bit
vector attached to the VEC instruction. This vector is typically very
sparse: often {}, mostly {loop-index}, and seldom {loop-index, register1, register2}.
<
Providing the HW with knowledge of which registers are no longer alive
makes setting up the HW for SIMD vectorization a lot easier--the
goal is to operate at cache access width continuously. Each implementation
can choose how wide is appropriate for that implementation. All code runs
on all implementations without modification.
>
> If you doubt that, I invite you to take a quick look at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
> PRs that are blocked by it.
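The transformation Mitch describes can be sketched roughly as follows (hypothetical My 66000-flavored pseudo-assembly; mnemonics and syntax are illustrative, not authoritative):

```
; scalar loop: dst[i] = src[i] + k, for i = 0..n-1
top:
    LDD   r4,[r1,r3<<3]   ; load src[i]
    ADD   r4,r4,r5        ; + k
    STD   r4,[r2,r3<<3]   ; store dst[i]
    ADD   r3,r3,1         ; i++         \
    CMP   r6,r3,r7        ;  i < n ?     | the ADD-CMP-BC pattern
    BLT   r6,top          ; loop         /

; VVM form: prepend VEC (with the sparse live-out set, here just the
; loop index r3) and replace the ADD-CMP-BC pattern with LOOP; the
; implementation chooses how many iterations to run per cycle.
    VEC   {r3}
top:
    LDD   r4,[r1,r3<<3]
    ADD   r4,r4,r5
    STD   r4,[r2,r3<<3]
    LOOP  r3,1,r7         ; i++, branch to top while i < n
```

Note that the loop body is unchanged scalar code, which is why the compiler needs no SIMD-specific code generation or memory disambiguation.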

Re: Reconsidering Variable-Length Operands

<t807cm$3v4$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25720&group=comp.arch#25720

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 12:51:48 -0700
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <t807cm$3v4$1@dont-email.me>
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com>
<fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me>
<9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de>
<c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 10 Jun 2022 19:51:50 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="31791bebee370517c9b86513b11d92f3";
logging-data="4068"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18O7sC2ej0HT5dL2i5vTqwx4ZqV2Whs7z0="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.10.0
Cancel-Lock: sha1:PX1Wm36zrY2bPpEsvRKzkaEqpVo=
In-Reply-To: <355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Fri, 10 Jun 2022 19:51 UTC

On 6/9/2022 10:07 PM, Quadibloc wrote:
> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>
>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>> (register file, and ISA additions) why implement SIMD in ISA at all ?
>
> In order _for_ it to provide equivalent functionality, one basically has
> to have the same hardware in place;

Except no SIMD register sets.

> the only gain is that the ISA has
> fewer instructions in it,

And that benefit grows over time, as your non-VVM design adds new sets
of instructions to support the wider vectors the next version of your
hardware supports. And, of course, no recompile/source changes needed
to take advantage of new, more capable hardware.

> but the loss is now that the instruction streams
> have to be converted to micro-ops through a more complicated
> process that gets close to what a compiler does.

Huh? VVM doesn't use micro-ops.

> Maybe I'm being unfair. For example, it may be that SIMD only
> needs a register file because it's done in the ISA, and a technique
> like VVM can get the same performance just with operand
> forwarding.

That is Mitch's claim.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Reconsidering Variable-Length Operands

<t80a13$op5$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25721&group=comp.arch#25721

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 15:35:39 -0500
Organization: A noiseless patient Spider
Lines: 101
Message-ID: <t80a13$op5$1@dont-email.me>
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com>
<fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me>
<9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de>
<c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<95991f79-6f34-4777-860e-21599436d229n@googlegroups.com>
<t8026t$3v0$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 10 Jun 2022 20:36:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="75f9cf107992510d5996e2372b435636";
logging-data="25381"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19pYwfKs6WOm5n3JEMQVnux"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:W3xb2UecWI1y/mlu3D7I+/d5nEk=
In-Reply-To: <t8026t$3v0$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Fri, 10 Jun 2022 20:35 UTC

On 6/10/2022 1:23 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>> But overnight I figured out a way to add FP128 and FP16 (and more if desired.)
>> along with a way to add SIMD that fits in 64-bit registers.
>
> That's great to hear! I'm looking forward to read how you fit this
> into your ISA.

FWIW, in my case the SIMD was defined (partially) relative to the
register size:
  64-bit: 4x FP16, 2x FP32, ...
With 128-bit cases being twice this (two registers).
No current plans to go bigger.
Current 128-bit cases:
  4x Single
  4x Int32
  2x Double
  2x Int64
  Int128

I mostly skipped over 8x cases (such as packed byte in 64b, or 8x
Binary16 in 128b), mostly because these add a lot of complexity and
didn't seem particularly worthwhile.

I had also skipped out on saturating arithmetic, mostly for similar
reasons. Supporting saturating operations would significantly increase
the complexity of packed-integer SIMD support in an ISA (and it is
"cheaper" to ask software to fake it if needed, *).

Had also skipped on having operations with built-in type conversions, ...

Mostly, I had wanted to avoid the excessive complexity of something like
NEON or similar (this was also something that displeased me when looking
at some of the RISC-V SIMD extensions; many went "all in" with NEON-like
levels of complexity).

*: Though, this could be easier to do efficiently if I added PADDC or
PADDV instructions, which would update the PQRO bits based on Carry or
Overflow status.

PADDC.W R4, R5 //Packed add with Carry
PSATU.W R5, R5 //Saturation fixup

PADDV.W R4, R5 //Packed add with Overflow
PSATS.W R5, R5 //Saturation fixup

PSATU would, for each element:
  Flag bit is 0: pass through unchanged.
  Flag bit is 1:
    Element sign is 0: set to FFFF
    Element sign is 1: set to 0000

PSATS would, for each element:
  Flag bit is 0: pass through unchanged.
  Flag bit is 1:
    Element sign is 0: set to 8000
    Element sign is 1: set to 7FFF
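The fixup semantics above can be modeled in software; here is a minimal C sketch, assuming 4x 16-bit elements packed in a 64-bit register and a per-element flag mask (the function name and flag layout are hypothetical):

```c
#include <stdint.h>

/* Software model of the PSATU.W / PSATS.W fixups described above:
   for each 16-bit element whose flag bit is set, clamp based on the
   sign bit of the wrapped result; unflagged elements pass through. */
static uint64_t psat_w(uint64_t v, unsigned flags, int is_signed) {
    for (int i = 0; i < 4; i++) {
        if (!(flags & (1u << i)))
            continue;                           /* flag 0: unchanged */
        int neg = (v >> (16 * i + 15)) & 1;     /* element sign bit */
        uint64_t s = is_signed ? (neg ? 0x7FFF : 0x8000)   /* PSATS */
                               : (neg ? 0x0000 : 0xFFFF);  /* PSATU */
        v = (v & ~(0xFFFFull << (16 * i))) | (s << (16 * i));
    }
    return v;
}
```

A PADDC.W/PADDV.W would produce the flag bits; the fixup then makes the pair behave like a saturating packed add without multiplying the number of packed-integer instructions.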

It would still need to be debated whether adding this is worthwhile
(vs, say, structuring the math such that it doesn't overflow).

In any case, would still be cheaper than going the more traditional
route of turning it into an N x M problem, where adding saturation would
effectively add a multiplier on the number of packed-integer SIMD
instructions.

I mostly prefer to avoid this sort of thing, with a few exceptions:
FMOV.S / FMOV.H
Because single-precision loads/stores are common enough (in Quake and
similar) to become a performance issue (never mind that, at the moment, I
am trying to hunt down a bug involving this feature).

The 'LDTEX' instruction, which is kinda expensive, but can have a "semi
obvious" improvement on the performance of the span-drawing functions
(seems to roughly halve the amount of clock-cycles spent in span-drawing
relative to "everything else").

So, for example, using normal Load/Store ops, the span-drawing takes
around 20% of the clock cycles, but with LDTEX this drops to around 12%.

It would probably be more significant if I didn't already have helper ops
for working with Morton shuffle, block-compressed textures, and
similar; LDTEX just sort of shoe-horns these onto a LOAD
operation (or, more specifically, an edge case of the FMOV operation).

I decided to leave out going into much depth on the LDTEX instruction and
the effects of Morton-order vs non-Morton-order textures...

But, yeah, the simple summary of the latter is that if the texture is
sufficiently-non-square, there will be a fairly significant performance
impact in the rasterizer whenever drawing geometry using the texture.

....

Re: Reconsidering Variable-Length Operands

<47359e45-349b-41be-8cfa-67cbfcd04eabn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25723&group=comp.arch#25723

X-Received: by 2002:a05:620a:1a1b:b0:6a7:aa:d474 with SMTP id bk27-20020a05620a1a1b00b006a700aad474mr11023236qkb.680.1654897378514;
Fri, 10 Jun 2022 14:42:58 -0700 (PDT)
X-Received: by 2002:ac8:5a8b:0:b0:304:b7de:7922 with SMTP id
c11-20020ac85a8b000000b00304b7de7922mr37541788qtc.552.1654897378366; Fri, 10
Jun 2022 14:42:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 10 Jun 2022 14:42:58 -0700 (PDT)
In-Reply-To: <t807cm$3v4$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:24be:340e:f29:f516;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:24be:340e:f29:f516
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com> <fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me> <9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de> <c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com> <t807cm$3v4$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <47359e45-349b-41be-8cfa-67cbfcd04eabn@googlegroups.com>
Subject: Re: Reconsidering Variable-Length Operands
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 10 Jun 2022 21:42:58 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3319
 by: MitchAlsup - Fri, 10 Jun 2022 21:42 UTC

On Friday, June 10, 2022 at 2:51:53 PM UTC-5, Stephen Fuld wrote:
> On 6/9/2022 10:07 PM, Quadibloc wrote:
> > On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
> >
> >> But if VEC-LOOP provides the functionality of SIMD without the overhead
> >> (register file, and ISA additions) why implement SIMD in ISA at all ?
> >
> > In order _for_ it to provide equivalent functionality, one basically has
> > to have the same hardware in place;
> Except no SIMD register sets.
> > the only gain is that the ISA has
> > fewer instructions in it,
> And that benefit grows over time, as your non-VVM design adds new sets
> of instructions to support the wider vectors the next version of your
> hardware supports. And, of course, no recompile/source changes needed
> to take advantage of new, more capable hardware.
> > but the loss is now that the instruction streams
> > have to be converted to micro-ops through a more complicated
> > process that gets close to what a compiler does.
> Huh? VVM doesn't use micro-ops.
<
VVM does not NECESSARILY use micro-ops--but nothing prevents
an implementation from doing so--although since it is nearly pure
RISC those micro-ops would be very close to native instructions.
<
> > Maybe I'm being unfair. For example, it may be that SIMD only
> > needs a register file because it's done in the ISA, and a technique
> > like VVM can get the same performance just with operand
> > forwarding.
> That is Mitch's claim.
<
Yes, indeed it is.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Reconsidering Variable-Length Operands

<t80et9$gnq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25724&group=comp.arch#25724

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 16:58:57 -0500
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <t80et9$gnq$1@dont-email.me>
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com>
<fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me>
<9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de>
<c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
<t8030p$3v0$2@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 10 Jun 2022 22:00:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7eea7439487e53e79a43fbcbb32f8621";
logging-data="17146"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/G/lxxCSFjcpyA+ouHCNGs"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:JT7cg0TmKszhchBcrT5Ub6vNjPc=
In-Reply-To: <t8030p$3v0$2@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Fri, 10 Jun 2022 21:58 UTC

On 6/10/2022 1:37 PM, Thomas Koenig wrote:
> Quadibloc <jsavard@ecn.ab.ca> schrieb:
>> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>>
>>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>>> (register file, and ISA additions) why implement SIMD in ISA at all ?
>>
>> In order _for_ it to provide equivalent functionality, one basically has
>> to have the same hardware in place; the only gain is that the ISA has
>> fewer instructions in it,
>
> Take a look at
>
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>
> and tell me that this advantage is not significant :-)
>

This is where you "cut your losses" and "throw in the towel".

Intel has gone far beyond this point...

>> but the loss is now that the instruction streams
>> have to be converted to micro-ops through a more complicated
>> process that gets close to what a compiler does.
>
> Auto-vectorization to SIMD is one of the things that are very hard
> for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
> programmer writing assembler (or "intrinsics" C code, which is
> IMHO sometimes worse to write) is still able to beat a compiler
> by a significant factor, 2 or 4 in many cases.
>
> If you doubt that, I invite you to take a quick look at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
> PRs that are blocked by it.

Given that compilers with far more time and money invested in this than
I could reasonably expect to manage still do a basically crap job at it,
betting on autovectorization does not seem like a worthwhile strategy.

However, I don't expect hardware-scale inference to be any better here,
nor do I want to bet on mechanisms which assume that the hardware is
"actually clever".

So, this mostly leaves the ISA choices as:
  Write scalar code:
    Hope that future magic will come along and make it faster.
  Have SIMD instructions:
    Hope that things don't get too unwieldy;
    Try to avoid the rabbit holes AVX and NEON went down.
  Have vector instructions:
    Hope that hardware magic makes them fast;
    Simple cases devolve into scalar loops or pipelined loops.

From the way many vector ISAs are designed (basically as operating on
memory arrays), the "simple case" implementations are likely to perform
poorly even vs scalar code, since the scalar code can potentially be
organized to better exploit pipelining.

If one has a vector ISA that can manage to pipeline stuff effectively,
it is still "kinda meh", as the scalar ISA could likely also pipeline
stuff in these cases. It isn't hard to imagine an ISA with
fully pipelined FPU operations, just that it would add some cost in other
areas (such as those associated with making the pipeline long enough to
handle Binary64 FPU operations or similar directly; and/or allowing for
fully pipelined Single but stalling on Double).

However, unless one can schedule the initial loads for the vector
operation several cycles ahead of the vector body, the vector operation
is likely to pay a steep penalty for short vector operations (if
compared with SIMD).

This means that likely one would end up with vector registers something
like:
  Vector address (visible);
  Vector length and element type (visible);
  Cached vector data (SIMD style, hidden);
  ...
Then one can potentially side-step the memory loads/stores for short
vectors if the value is already in a vector register, but would then
need a way to make sure that the vector is written back to memory when
the program is done with it, ...

Is this worthwhile? I will express some doubt.

An alternative route is some sort of variable-length SIMD:
  We have vector/SIMD registers, but the length is implicit;
  Some way is given to specify a "maximum vector length", and to compare
  this against however much has been consumed, in a way where the logic
  is agnostic to the actual size of the SIMD registers.

....
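That implicit-length idea amounts to a strip-mined loop whose step is whatever the hardware reports. A C sketch of the control flow, where `hw_vlen` stands in for a hypothetical "how wide is this implementation?" query (names and widths are illustrative):

```c
#include <stddef.h>

/* Hypothetical query: how many elements this implementation processes
   per iteration. A 128-bit machine might return 4 for 32-bit elements;
   the loop below never needs to know the actual register size. */
static size_t hw_vlen(size_t remaining) {
    size_t w = 4;                       /* stand-in for the real query */
    return remaining < w ? remaining : w;
}

static void vec_add(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = hw_vlen(n - i);     /* elements consumed this trip */
        for (size_t j = 0; j < vl; j++) /* stands in for one SIMD op */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;                        /* advance by what was consumed */
    }
}
```

The same binary then runs at whatever width the implementation offers, which is the property both SVE-style ISAs and VVM are after.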

I personally took a different approach, deciding to see SIMD registers,
not as some way to express arbitrary length vectors, but rather as
another layer in the same tree as scalar types.

A 3-element vector will always be a 3-element vector, just,
conceptually, as a new type of entity (distinct from the floating-point
elements which it contains).

Something like a large vector might in-effect contain this vector as a
sub-element within itself (and one could potentially focus more advanced
vector optimizations on "vectors of vectors" or similar).

Though, at present, there isn't really a cost-effective way to pull off
the latter.

The main option I can come up with at present is to consider some
hypothetical version of my ISA where the GPRs are effectively 2x or 4x
larger, handled as multiple instances of the same instruction
operating on 64-bit slices of each register; this would likely also
involve some more advanced predication or "lane masking".

However, if implemented, normal C wouldn't likely be able to use this
effectively (absent a lot of language extensions and compiler hints).

This is likely to be bottlenecked by memory bandwidth even worse than my
current core (where, say, the span-drawing loops are already prone to
spend roughly 70% of their clock cycles mostly dealing with cache
misses; if one could run 4x as wide, but now have 93% of the clock
cycles going into cache misses, this would suck).

....

Re: Reconsidering Variable-Length Operands

<jwvfskcfaug.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=25725&group=comp.arch#25725

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 18:02:11 -0400
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <jwvfskcfaug.fsf-monnier+comp.arch@gnu.org>
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com>
<fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me>
<9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de>
<c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="567f59909c8c01b1a1d203951658c5a6";
logging-data="29472"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GkqMInHuMDPTZhLiQ6Csi"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:y03h9ASwR6yEWVEfjAjgEnR5bmU=
sha1:NTRt77HPyd4Iso2KpQYcwoGSnI0=
 by: Stefan Monnier - Fri, 10 Jun 2022 22:02 UTC

Quadibloc [2022-06-09 22:07:04] wrote:
> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>> (register file, and ISA additions) why implement SIMD in ISA at all ?
> In order _for_ it to provide equivalent functionality, one basically has
> to have the same hardware in place; the only gain is that the ISA has
> fewer instructions in it, but the loss is now that the instruction streams
> have to be converted to micro-ops through a more complicated
> process that gets close to what a compiler does.

The usual SIMD instructions try to provide "the best instructions we
can" in an ad-hoc way. Those instructions use the hardware in an
efficient way (possibly more efficient than VVM).

The problem is that it's damn hard to use those instructions, so while
the instructions themselves are efficient, the overall result is not
necessarily as efficient as you'd want it to be, and it requires a lot
of work to get even this unsatisfactory result.

In a sense, VVM's proposition is to "bridge the semantic gap" ;-) and
make the machine language higher-level so that it's much easier to
generate that code from some normal source code written by someone
other than Terje, and the hardware does all the hard work for you (and it's
not completely magical: the VVM approach is not just looking at "what
would a programmer want" but does take into account what the hardware
can do such that it can hopefully execute the code efficiently without
heroic efforts).

VVM is still just vaporware and maybe it won't deliver what it promises.
But on paper it looks like a very nice solution. I don't think Mitch
claims that it requires significantly less overall hardware than your
typical SIMD solution to get the same performance, but it should
hopefully require much less human labor (and it should also result in
more compact machine code, with a single binary working more-or-less
optimally for a whole range of implementations).

Stefan

Re: Reconsidering Variable-Length Operands

<t80hr0$37t$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25726&group=comp.arch#25726

Path: i2pn2.org!rocksolid2!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Reconsidering Variable-Length Operands
Date: Fri, 10 Jun 2022 17:48:55 -0500
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <t80hr0$37t$1@dont-email.me>
References: <b0b53c16-1229-449c-ade0-be0ed4de01e8n@googlegroups.com>
<3a4d88a0-b36c-4ef7-adf3-0557cd13de0an@googlegroups.com>
<fcfa3766-6c93-4f67-8b91-308842f924den@googlegroups.com>
<t7mo9u$tio$1@newsreader4.netcologne.de> <t7msum$i3c$1@dont-email.me>
<t7o34q$lv6$1@newsreader4.netcologne.de> <t7o8mq$vk1$1@dont-email.me>
<t7oeck$d8t$1@dont-email.me> <t7oh10$74r$1@dont-email.me>
<9328b81d-dfd0-4e25-b2c9-af2c1e6a745en@googlegroups.com>
<t7pdmh$iec$2@newsreader4.netcologne.de>
<c4c3064c-e6a5-4d40-a1b9-9409b7d18ff6n@googlegroups.com>
<355191db-2afe-4c6f-9eaa-c7984f542d95n@googlegroups.com>
<jwvfskcfaug.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 10 Jun 2022 22:50:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7eea7439487e53e79a43fbcbb32f8621";
logging-data="3325"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX199pS3xMj1ew8BTd0vjOL0A"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:jW4RktR2x3EeoqpxPAhyIfEvscg=
In-Reply-To: <jwvfskcfaug.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Fri, 10 Jun 2022 22:48 UTC

On 6/10/2022 5:02 PM, Stefan Monnier wrote:
> Quadibloc [2022-06-09 22:07:04] wrote:
>> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>>> (register file, and ISA additions) why implement SIMD in ISA at all ?
>> In order _for_ it to provide equivalent functionality, one basically has
>> to have the same hardware in place; the only gain is that the ISA has
>> fewer instructions in it, but the loss is now that the instruction streams
>> have to be converted to micro-ops through a more complicated
>> process that gets close to what a compiler does.
>
> The usual SIMD instructions try to provide "the best instructions we
> can" in an ad-hoc way. Those instructions use the hardware in an
> efficient way (possibly more efficient than VVM).
>
> The problem is that it's damn hard to use those instructions, so while
> the instructions themselves are efficient, the overall result is not
> necessarily as efficient as you'd want it to be, and it requires a lot
> of work to get even this unsatisfactory result.
>
> In a sense, VVM's proposition is to "bridge the semantic gap" ;-) and
> make the machine language higher-level so that it's much easier to
> generate that code from some normal source code written by someone else
> than Terje, and the hardware does all the hard work for you (and it's
> not completely magical: the VVM approach is not just looking at "what
> would a programmer want" but does take into account what the hardware
> can do such that it can hopefully execute the code efficiently without
> heroic efforts).
>
> VVM is still just vaporware and maybe it won't deliver what it promises.
> But on paper it looks like a very nice solution. I don't think Mitch
> claims that it requires significantly less overall hardware than your
> typical SIMD solution to get the same performance, but it should
> hopefully require much less human labor (and it should also result in
> more compact machine code, with a single binary working more-or-less
> optimally for a whole range of implementations).
>

Yeah.

I went with the "stuff I can do now" approach.

In my case, I went with SIMD because:
Moderately effective;
Easy (enough) to understand how it works;
Could be implemented more cost-effectively than other options.

Does not address the "what if we need bigger vectors later?" question. I
decided mostly to ignore this for now, and instead assume 64 and 128
bits to be the "end all" sizes. Well, and also, "4 elements should be
enough for anyone", ...

Don't necessarily want to go the AVX route though, we can all see where
that path leads...
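That fixed-size stance ("4 elements should be enough for anyone") can be sketched in portable C; the type and function names here are illustrative stand-ins, not actual BJX2 intrinsics:

```c
#include <assert.h>

/* A fixed 128-bit, 4-lane single-precision vector: the "64 and 128
   bits are the end-all sizes" model, with no provision for anything
   longer showing up later. */
typedef struct { float f[4]; } vec4f;

/* Packed add: conceptually one instruction, four independent lanes. */
static vec4f vec4f_add(vec4f a, vec4f b) {
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}
```

The lane count is baked into the type, which is exactly the trade-off being accepted: simple and cheap now, no path to wider vectors later.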

The current FP-SIMD ops still mostly follow the pattern of feeding
elements one at a time through a shared FPU, but mostly because:
FP-SIMD isn't eating quite enough clock-cycles yet to justify the cost;
Mock-ups of "faster SIMD ops" in my emulator (fully pipelined with a
3-cycle latency) had negligible impact on overall performance vs the
current approach (fixed 10 cycle latency).

Also, pipelining stuff through the existing FPU was the cheapest option
at the time.

Also, even with a fully pipelined FPU:
FLDCF R4, R20
FLDCF R6, R22
FLDCFH R4, R21
FLDCFH R6, R23
FADD R20, R22, R16
FADD R21, R23, R17
FLDCF R5, R20
FLDCF R7, R22
FLDCFH R5, R21
FLDCFH R7, R23
FADD R20, R22, R18
FADD R21, R23, R19
FSTCF R16, R20
FSTCF R17, R21
FSTCF R18, R22
FSTCF R19, R23
MOVHLD R21, R20, R2
MOVHLD R23, R22, R3

Would still require significantly more clock cycles than a 10-cycle:
PADDX.F R4, R6, R2

Though, going from 10-cycle to 3-cycle (pipelined) doesn't save quite as
much, even if it does go from, say, 2% to 0.35% of the total clock-cycle
budget...

....

Re: Reconsidering Variable-Length Operands

<536360d5-0ea2-4d55-b12b-658563cb2a78n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25727&group=comp.arch#25727

 by: MitchAlsup - Sat, 11 Jun 2022 01:03 UTC

On Friday, June 10, 2022 at 5:00:13 PM UTC-5, BGB wrote:
> On 6/10/2022 1:37 PM, Thomas Koenig wrote:
> > Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
> >>
> >>> But if VEC-LOOP provides the functionality of SIMD without the overhead
> >>> (register file, and ISA additions) why implement SIMD in ISA at all ?
> >>
> >> In order _for_ it to provide equivalent functionality, one basically has
> >> to have the same hardware in place; the only gain is that the ISA has
> >> fewer instructions in it,
> >
> > Take a look at
> >
> > https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
> >
> > and tell me that this advantage is not significant :-)
> >
> This is where you "cut your losses" and "throw in the towel".
>
> Intel has gone far beyond this point...
> >> but the loss is now that the instruction streams
> >> have to be converted to micro-ops through a more complicated
> >> process that gets close to what a compiler does.
> >
> > Auto-vectorization to SIMD is one of the things that are very hard
> > for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
> > programmer is still able to write assembler (or "intrinsics" C
> > code, which IMHO sometimes worse to write) is still able to beat
> > a compiler by a significant factor, 2 or 4 for many cases.
> >
> > If you doubt that, I invite you to take a quick look at
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
> > PRs that are blocked by it.
> Given compilers with far more time and money invested in this than I
> could reasonably expect to manage, still basically do crap job at this,
> trying to bet on autovectorization does not seem like a worthwhile strategy.
>
> However, I don't expect hardware scale inference to be any better here,
> nor want to try to bet on mechanisms which assume that the hardware is
> "actually clever".
>
>
>
> So, this mostly leaves ISA choices as:
> Write scalar code -- obviously ues
> Hope that future magic will come along and make it faster; --you bet against this
> Have SIMD instructions, hope that things don't get too unwieldy -- avoid SIMD ISA
> Try to avoid the rabbit holes AVX and NEON went down; --absofriggenluytely
> Have Vector instructions No, no, no -- make a way to vectorize loops.
> Hope that hardware magic makes them fast;
> Simple cases devolve into a scalar loops or pipelined loops.
>
>
> From the ways many Vector ISA's are designed (basically as operating on
> memory arrays), the "simple case" implementations are likely to perform
> poorly even vs scalar code, since the scalar code can potentially be
> organized to better exploit pipelining.
>
> If one has a vector ISA that can manage to pipeline stuff effectively,
> it is still "kinda meh" as the scalar ISA could likely also pipeline
<
Amdahl's law strikes again.
<
> stuff in these cases. It isn't hard to imagine how to have an ISA with
> fully pipelined FPU operations, just it would add some cost in other
> areas (such as those associated with making the pipeline long enough to
> handle Binary64 FPU operations or similar directly; and/or allow for
> fully pipelined Single but stall on Double).
>
> However, unless one can schedule the initial loads for the vector
> operation several cycles ahead of the vector body, the vector operation
> is likely to pay a steep penalty for short vector operations (if
> compared with SIMD).
<
CRAY-like machines had to schedule LDs: 13+implementation-delay
before vector FP instructions.
>
>
> This means that likely one would end up with vector registers something
> like:
> Vector Address (visible);
> Vector length and element type (visible);
> Cached vector data (SIMD style, hidden);
> ...
> Then one can potentially side-step the memory loads/stores for short
> vectors if the value is already in a vector register, but would then
> need a way to make sure that the vector is written back to memory when
> the program is done with it, ...
>
> Is this worthwhile? I will express some doubt.
<
Note: Vector machines seldom had caches.
Worthwhile: TO Livermore and Aberdeen: yes, average application: NO.
>
>
> An alternative route is some sort of variable-length SIMD:
<Cringe>
<
> We have vector/SIMD registers, but the length is implicit;
> Some way to specify "maximum vector length" is given, and to compare
> this against however much has been consumed, in a way where the logic is
> agnostic to the actual size of the SIMD registers.
>
As VVM has shown, you don't need vector registers.
As VVM is yet to show, you don't need SIMD registers, either.
> ...
>
>
> I personally took a different approach, deciding to see SIMD registers,
> not as some way to express arbitrary length vectors, but rather as
> another layer in the same tree as scalar types.
>
> A 3 element vector will always be a 3 element vector, just,
> conceptually, as a new type of entity (distinct from the floating-point
> elements which it contains).
>
> Something like a large vector might in-effect contain this vector as a
> sub-element within itself (and one could potentially focus more advanced
> vector optimizations on "vectors of vectors" or similar).
>
>
> Though, at present, there isn't really a cost-effective way to pull off
> the latter.
>
> Main option I can come up with at present being to consider some
> hypothetical version of my ISA where the GPRs are effectively 2x or 4x
> larger, and handled as multiple instances of the same instruction
> operating on 64-bit slices of each register; this would likely also
> involve some more advanced predication or "lane masking".
>
> However, if implemented, normal C wouldn't likely be able to use this
> effectively (absent a lot of language extensions and compiler hints).
>
>
> This is likely to be bottlenecked by memory bandwidth even worse than my
> current core (where, say, the span-drawing loops are already prone to
> spend roughly 70% of their clock cycles mostly dealing with cache
> misses; if one could run 4x as wide, but now have 93% of the clock
> cycles going into cache misses, this would suck).
>
> ...

Re: Reconsidering Variable-Length Operands

<ae7a6e19-20a6-4efa-9c5e-b9002e3bc28an@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25728&group=comp.arch#25728

 by: Quadibloc - Sat, 11 Jun 2022 04:23 UTC

On Friday, June 10, 2022 at 12:37:19 PM UTC-6, Thomas Koenig wrote:

> Take a look at
>
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>
> and tell me that this advantage is not significant :-)

But that _also_ shows me that the number of instructions can be pruned to
a reasonable level _without_ taking such extraordinary measures as
Mitch Alsup did.

John Savard

Re: Reconsidering Variable-Length Operands

<t8184c$607$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25729&group=comp.arch#25729

 by: BGB - Sat, 11 Jun 2022 05:09 UTC

On 6/10/2022 8:03 PM, MitchAlsup wrote:
> On Friday, June 10, 2022 at 5:00:13 PM UTC-5, BGB wrote:
>> On 6/10/2022 1:37 PM, Thomas Koenig wrote:
>>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>>> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
>>>>
>>>>> But if VEC-LOOP provides the functionality of SIMD without the overhead
>>>>> (register file, and ISA additions) why implement SIMD in ISA at all ?
>>>>
>>>> In order _for_ it to provide equivalent functionality, one basically has
>>>> to have the same hardware in place; the only gain is that the ISA has
>>>> fewer instructions in it,
>>>
>>> Take a look at
>>>
>>> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>>>
>>> and tell me that this advantage is not significant :-)
>>>
>> This is where you "cut your losses" and "throw in the towel".
>>
>> Intel has gone far beyond this point...
>>>> but the loss is now that the instruction streams
>>>> have to be converted to micro-ops through a more complicated
>>>> process that gets close to what a compiler does.
>>>
>>> Auto-vectorization to SIMD is one of the things that are very hard
>>> for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
>>> programmer is still able to write assembler (or "intrinsics" C
>>> code, which IMHO sometimes worse to write) is still able to beat
>>> a compiler by a significant factor, 2 or 4 for many cases.
>>>
>>> If you doubt that, I invite you to take a quick look at
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
>>> PRs that are blocked by it.
>> Given compilers with far more time and money invested in this than I
>> could reasonably expect to manage, still basically do crap job at this,
>> trying to bet on autovectorization does not seem like a worthwhile strategy.
>>
>> However, I don't expect hardware scale inference to be any better here,
>> nor want to try to bet on mechanisms which assume that the hardware is
>> "actually clever".
>>
>>
>>
>> So, this mostly leaves ISA choices as:
>> Write scalar code -- obviously ues
>> Hope that future magic will come along and make it faster; --you bet against this
>> Have SIMD instructions, hope that things don't get too unwieldy -- avoid SIMD ISA
>> Try to avoid the rabbit holes AVX and NEON went down; --absofriggenluytely
>> Have Vector instructions No, no, no -- make a way to vectorize loops.
>> Hope that hardware magic makes them fast;
>> Simple cases devolve into a scalar loops or pipelined loops.
>>
>>
>> From the ways many Vector ISA's are designed (basically as operating on
>> memory arrays), the "simple case" implementations are likely to perform
>> poorly even vs scalar code, since the scalar code can potentially be
>> organized to better exploit pipelining.
>>
>> If one has a vector ISA that can manage to pipeline stuff effectively,
>> it is still "kinda meh" as the scalar ISA could likely also pipeline
> <
> Amdahl's law strikes again.
> <

Yeah, it seems to be partly a matter not just of how efficiently
calculations can be done, but of how efficiently values can be moved
from place to place and turned into the correct form.

As noted elsewhere, one bottleneck with faking SIMD with scalar ops
was mostly just moving the bits around and transforming them, which
ends up costing more than the "actually computing stuff" task.
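A rough sketch of that bit-moving overhead, assuming four 16-bit integer lanes packed into a 64-bit scalar register (illustrative only, not the actual BJX2 instruction sequence):

```c
#include <stdint.h>

/* Add four 16-bit lanes held in 64-bit scalars.  Per lane there is
   one cheap wrapping add, but two shifts, two truncations, and an OR
   just to move the bits into and out of position. */
static uint64_t padd16x4(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t la = (uint16_t)(a >> (16 * i));  /* extract lane i */
        uint16_t lb = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)(uint16_t)(la + lb) << (16 * i);  /* repack */
    }
    return r;
}
```

The data-movement work dominates the arithmetic, which is the bottleneck a real packed-add instruction removes.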

I think it is similar with the LDTEX mechanism: the texture-load
operation was limited more by moving the bits around.

As noted, luckily at least, the task of extracting a texel value from a
compressed-texture block isn't particularly high latency.

And, 5 instructions ended up being mashed into one instruction, as there
wasn't any good way to do it in smaller pieces. Like, seemingly, the
parts were interdependent in a way that mostly limited it to either 5
instructions, or 1 instruction.

>> stuff in these cases. It isn't hard to imagine how to have an ISA with
>> fully pipelined FPU operations, just it would add some cost in other
>> areas (such as those associated with making the pipeline long enough to
>> handle Binary64 FPU operations or similar directly; and/or allow for
>> fully pipelined Single but stall on Double).
>>
>> However, unless one can schedule the initial loads for the vector
>> operation several cycles ahead of the vector body, the vector operation
>> is likely to pay a steep penalty for short vector operations (if
>> compared with SIMD).
> <
> CRAY-like machines had to schedule LDs: 13+implementation-delay
> before vector FP instructions.

Hmm...

Not entirely sure how CRAYs pulled this part off...

It is at least easier to imagine how stuff works when the answer is, say:
We have pipelined instructions;
Data moves from place to place via registers;
...

But, say, if it is something like the RISC-V V extension, which
seemingly defines its vector registers as essentially specialized tagged
array pointers, it is less obvious how it is supposed to work.

>>
>>
>> This means that likely one would end up with vector registers something
>> like:
>> Vector Address (visible);
>> Vector length and element type (visible);
>> Cached vector data (SIMD style, hidden);
>> ...
>> Then one can potentially side-step the memory loads/stores for short
>> vectors if the value is already in a vector register, but would then
>> need a way to make sure that the vector is written back to memory when
>> the program is done with it, ...
>>
>> Is this worthwhile? I will express some doubt.
> <
> Note: Vector machines seldom had caches.
> Worthwhile: TO Livermore and Aberdeen: yes, average application: NO.

OK.

This turns it into more of a mystery, then.

>>
>>
>> An alternative route is some sort of variable-length SIMD:
> <Cringe>

IIRC, ARM has apparently done something like this with SVE.
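For comparison, that length-agnostic style (as in SVE-type designs) can be sketched in C; `simd_width()` here is a hypothetical stand-in for whatever maximum vector length the hardware reports, not a real API:

```c
#include <stddef.h>

/* Stand-in for a "how many lanes do the SIMD registers hold?" query;
   the loop below never hard-codes the answer. */
static size_t simd_width(void) { return 4; }  /* e.g. 4 floats */

/* Length-agnostic add: the same binary strip-mines correctly whether
   an implementation provides 4, 8, or more lanes. */
static void vadd(float *dst, const float *a, const float *b, size_t n) {
    size_t w = simd_width();
    for (size_t i = 0; i < n; i += w) {
        size_t lanes = (n - i < w) ? n - i : w;  /* tail predication */
        for (size_t j = 0; j < lanes; j++)       /* one "vector op"  */
            dst[i + j] = a[i + j] + b[i + j];
    }
}
```

The logic compares how much has been consumed against the total length, staying agnostic to the actual register size, as described above.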

> <
>> We have vector/SIMD registers, but the length is implicit;
>> Some way to specify "maximum vector length" is given, and to compare
>> this against however much has been consumed, in a way where the logic is
>> agnostic to the actual size of the SIMD registers.
>>
> As VVM has shown, you don't need vector registers.
> As VVM is yet to show, you don't need SIMD registers, either.

But, WRT:
How exactly does it work?
How exactly would it be implemented?
...

I am limited partly by my imagination here; I can implement things if
I can imagine how they will work.

I can imagine SIMD.

I can also imagine vector machines, albeit most of what I can imagine
for possible implementations would fall well short of being
performance-competitive.

I don't fault the concept in this case, but "if I implement it how I can
imagine it, it would likely be a boat-anchor" is not a huge motivation.

In my case, at least SIMD on BJX2 uses GPRs, so I avoid this source of
issues.

I guess an open question is "What if I want an exploding set of new SIMD
instructions?"

Could probably put them in Op64 space or something.

Could maybe define:
* FFw0_0ZZZ-F0nm_6GoD
* FFw0_0ZZZ-F0nm_6GoE
* FFw0_0ZZZ-F0nm_6GoF

As a big space for 128-bit 3R SIMD ops or something...
* Well, if I feel a sudden need for a bunch of new 128-bit 3R ops...

Goes looking at commits, recent added instructions (newer to older):
* LDTEX (Texture Load)
* MULU.X / DIVU.X / REMU.X (Reserved encodings for 128-bit MUL/DIV)
* CMPNANTEQ (Compare NaN Tag Equal)
* BCDADC / BCDSBB (BCD ADD / SUB)
* FDIVA / FSQRTA (Fast approximations for FDIV and FSQRT)
* FDIV and FSQRT (FPU Divide and Square Root ops)
* DIVS.Q / DIVU.Q, MODS.Q / MODU.Q (64-bit Divide and Modulo)
** Also implemented 64-bit Multiply.
* Op40x2 Encodings (New encoding space based on hacked XGPR)
* --- 2022 ---
* BITSEL and MOV.C
* Scaled-Index encodings
* DMAC (Integer Multiply-Accumulate Experiment)
* XMOV (Load/Store, 128-bit pointers)
* ...

Looks like around this time last year was when I added XGPR encodings
(expanding to 64 GPRs), ...

Not really looking like a SIMD explosion; more a "misc stuff" explosion...

So, it would appear that, over the past 1.5 years:
* Have filled F0nm_3eo1..F0nm_3eo7 with ALUX ops
** 8..F are reserved for more 2R space (for when 1zz8..1zzF gets full).
** Some space originally used for Imm5 ALU ops was also reused for ALUX.
*** The Imm5 ALU ops were redundant with the Imm9 encodings.
* Have filled F0nm_6eoX (Mostly with MUL/DIV and FPU related ops).
* Have filled F0nm_8eoX (XMOV ops).


Re: Reconsidering Variable-Length Operands

<t81881$607$2@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25730&group=comp.arch#25730

 by: BGB - Sat, 11 Jun 2022 05:11 UTC

On 6/10/2022 11:23 PM, Quadibloc wrote:
> On Friday, June 10, 2022 at 12:37:19 PM UTC-6, Thomas Koenig wrote:
>
>> Take a look at
>>
>> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>>
>> and tell me that this advantage is not significant :-)
>
> But that _also_ shows me that the number of instructions can be pruned to
> a reasonable level _without_ taking such extraordinary measures as
> Mitch Alsup did.
>

Yeah, errm, just don't add every possible combination as an
ever-expanding polynomial explosion...

Re: Reconsidering Variable-Length Operands

<t81esp$n77$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25732&group=comp.arch#25732

 by: BGB - Sat, 11 Jun 2022 07:04 UTC

(ADD/Correction)

On 6/11/2022 12:09 AM, BGB wrote:

....

>
> Remaining spaces within the F0 block:
> * F0nm_7eoX
> * F0nm_9eoX
> * F0nm_AeoX
> * F0nm_BeoX
> Potentially Reclaimable:
> * F0nm_EeoX (Deprecated BT Encoding)
> * F0nm_FeoX (Deprecated BF Encoding)
>
> And, other in-use blocks (within F0):
> * F0nm_0eoX (Filled, Initial ISA design, Ld/St)
> * F0nm_1eoX (Filled, Initial ISA design, ALU and 2R)
> * F0nm_2eoX (Filled, ~ 2019/2020)
> * F0nm_3eoX (~ 2021, 1R and ALUX)
> * F0nm_4eoX (Ld/St, Still mostly free, MOV.X/MOV.C/FMOV.S)
> * F0nm_5eoX (Filled, ~ 2019/2020)
> * F0nm_6eoX (Filled, ~ 2021/2022)
> * F0nm_8eoX (Filled, 2021, XMOV Ld/St)
> * F0nm_CeoX (BRA Disp20)
> * F0nm_DeoX (BSR Disp20)
> * F0nm_EeoX (Deprecated)
> * F0nm_DeoX (Deprecated)
>

* F0dd_Cddd (BRA Disp20)
* F0dd_Dddd (BSR Disp20)
* F0dd_Eddd (Deprecated, BT)
** Replaced with E0dd_Cddd
* F0dd_Fddd (Deprecated, BF)
** Replaced with E4dd_Cddd

Because, well, it is Disp20 and not 3R.

Also, because each 'X' represents a hex-digit, not too hard to figure
out how many possible instructions exist within a block.

As for:
* F0nm_9eoX
Originally, this was the dedicated FPU block (based around dedicated FPU
registers), which followed a similar layout to the SH-4 FPU block (FnmX).

It went away when the original FPU design basically died off (unlike
RISC-V Zfinx/Zdinx, I didn't "reinterpret" the FPU instructions, I
re-added them elsewhere; and then later ended up dropping the original
when it became more obvious that I wasn't going to revive it).

Earlier on, I had imagined:
* F9nm_XeoX

As possibly dedicated mostly to FPU and/or SIMD (some earlier ideas
would have been more akin to SSE or NEON; which would have needed a
little more encoding space for the SIMD ops), but this idea went away
when I decided to instead put SIMD in the GPRs (and then subsequently
ended up nuking the original FPU design).

It is likely at this point that F9 may be used when F0 gets full, though
is slightly less desirable as some encodings are not possible with F9
(neither XGPR nor PrWEX encodings exist for F9).

Though, these are less of an issue if one assumes putting SIMD
instructions in this space. Could consider this if/when I start needing
more 128-bit encodings.

There is also some reserved blocks in XGPR space:
70zz_zzzz
90zz_zzzz
91zz_zzzz
98zz_zzzz
99zz_zzzz
And:
78ww_zzzz (Op40x2 prefix)

78zz_zzzz-F4nm_XeoX-F0nm_XeoX (Op40x2 Bundle)
Unpacks effectively to 2x:
FFw0_00zz-F0nm_XeoX

Partly because, for reasons, Jumbo and Op64 encodings can't be used in
bundles; this at least sorta helps though, and allows encoding a few
cases that could not be encoded otherwise.

....

Re: Reconsidering Variable-Length Operands

<t81mbq$54d$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=25733&group=comp.arch#25733

 by: Thomas Koenig - Sat, 11 Jun 2022 09:13 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> But overnight I figured out a way to add FP128 and FP16 (and more if desired.)
> along with a way to add SIMD that fits in 64-bit registers.

SIMD can also be important together with VVM (or other vectorization
methods), to expose SLP opportunities. (SLP means superword-level
parallelism: exploiting SIMD opportunities in straight-line code.
For example, when somebody writes

a = b + c
c = e + f

for a type that fits twice into a SIMD register, that could
be done as a single instruction.)
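The SLP packing described above can be sketched in a few lines of Python; this is purely illustrative, with no particular ISA assumed, and the second destination is renamed to d to sidestep the reuse of c in the example:

```python
# Superword-level parallelism (SLP), modeled: two independent scalar
# adds found in straight-line code are packed into one 2-lane SIMD add.

def simd_add2(x, y):
    """One 2-lane SIMD add; each operand is a (lane0, lane1) pair."""
    return (x[0] + y[0], x[1] + y[1])

# Scalar form:  a = b + c;  d = e + f   (independent, hence packable)
b, c, e, f = 1, 2, 10, 20
a, d = simd_add2((b, e), (c, f))   # one instruction instead of two
```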

Re: Reconsidering Variable-Length Operands

<3a78f448-818c-411f-8a45-31e175efbe2an@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25734&group=comp.arch#25734

 by: MitchAlsup - Sat, 11 Jun 2022 16:40 UTC

On Saturday, June 11, 2022 at 12:10:40 AM UTC-5, BGB wrote:
> On 6/10/2022 8:03 PM, MitchAlsup wrote:
> > On Friday, June 10, 2022 at 5:00:13 PM UTC-5, BGB wrote:
> >> On 6/10/2022 1:37 PM, Thomas Koenig wrote:
> >>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >>>> On Wednesday, June 8, 2022 at 10:10:30 AM UTC-6, MitchAlsup wrote:
> >>>>
> >>>>> But if VEC-LOOP provides the functionality of SIMD without the overhead
> >>>>> (register file, and ISA additions) why implement SIMD in ISA at all ?
> >>>>
> >>>> In order _for_ it to provide equivalent functionality, one basically has
> >>>> to have the same hardware in place; the only gain is that the ISA has
> >>>> fewer instructions in it,
> >>>
> >>> Take a look at
> >>>
> >>> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
> >>>
> >>> and tell me that this advantage is not significant :-)
> >>>
> >> This is where you "cut your losses" and "throw in the towel".
> >>
> >> Intel has gone far beyond this point...
> >>>> but the loss is now that the instruction streams
> >>>> have to be converted to micro-ops through a more complicated
> >>>> process that gets close to what a compiler does.
> >>>
> >>> Auto-vectorization to SIMD is one of the things that are very hard
> >>> for compilers to do. A sufficiently skilled and maso^H^H^Hotivated
> >>> programmer writing assembler (or "intrinsics" C code, which IMHO
> >>> is sometimes worse to write) is still able to beat a compiler by a
> >>> significant factor, 2 or 4 in many cases.
> >>>
> >>> If you doubt that, I invite you to take a quick look at
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 and all the
> >>> PRs that are blocked by it.
> >> Given that compilers with far more time and money invested in this than I
> >> could reasonably expect to manage still basically do a crap job at this,
> >> trying to bet on autovectorization does not seem like a worthwhile strategy.
> >>
> >> However, I don't expect hardware scale inference to be any better here,
> >> nor want to try to bet on mechanisms which assume that the hardware is
> >> "actually clever".
> >>
> >>
> >>
> >> So, this mostly leaves ISA choices as:
> >> Write scalar code -- obviously yes
> >> Hope that future magic will come along and make it faster; --you bet against this
> >> Have SIMD instructions, hope that things don't get too unwieldy -- avoid SIMD ISA
> >> Try to avoid the rabbit holes AVX and NEON went down; --absofriggenluytely
> >> Have Vector instructions No, no, no -- make a way to vectorize loops.
> >> Hope that hardware magic makes them fast;
> >> Simple cases devolve into a scalar loops or pipelined loops.
> >>
> >>
> >> From the ways many Vector ISA's are designed (basically as operating on
> >> memory arrays), the "simple case" implementations are likely to perform
> >> poorly even vs scalar code, since the scalar code can potentially be
> >> organized to better exploit pipelining.
> >>
> >> If one has a vector ISA that can manage to pipeline stuff effectively,
> >> it is still "kinda meh" as the scalar ISA could likely also pipeline
> > <
> > Amdahl's law strikes again.
> > <
> Yeah, it seems like partly a matter, not just of how efficiently
> calculations can be done, but of how efficiently values can be moved
> from place to place and turned into the correct form.
>
>
> As can be noted from elsewhere, one bottleneck with faking SIMD with scalar
> ops was more just sort of moving the bits around and transforming them,
> which ends up costing more than the "actually computing stuff" task.
>
>
> Similar I think with the LDTEX mechanism:
> The texture-load operation was more being limited by moving the bits around.
>
> As noted, luckily at least, the task of extracting a texel value from a
> compressed-texture block isn't particularly high latency.
>
> And, 5 instructions ended up being mashed into one instruction, as there
> wasn't any good way to do it in smaller pieces. Like, seemingly, the
> parts were interdependent in a way that mostly limited it to either 5
> instructions, or 1 instruction.
> >> stuff in these cases. It isn't hard to imagine how to have an ISA with
> >> fully pipelined FPU operations, just it would add some cost in other
> >> areas (such as those associated with making the pipeline long enough to
> >> handle Binary64 FPU operations or similar directly; and/or allow for
> >> fully pipelined Single but stall on Double).
> >>
> >> However, unless one can schedule the initial loads for the vector
> >> operation several cycles ahead of the vector body, the vector operation
> >> is likely to pay a steep penalty for short vector operations (if
> >> compared with SIMD).
> > <
> > CRAY-like machines had to schedule LDs: 13+implementation-delay
> > before vector FP instructions.
> Hmm...
>
> Not entirely sure how CRAYs pulled this part off...
<
It spit out 1 address per cycle to an unoccupied bank in memory.
There was routing delay to the bank (3+ cycles), routing delay back (3+),
and at least 7+ cycles in the memory bank.
<
CRAY 1 and 1S had 1 memory port per cycle
CRAY XMP and YMP had 2 LDs per cycle and 1 ST per cycle and more interleaving
At this point it becomes simple routing.
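A quick model makes the arithmetic concrete, using the figures from the post (~3 cycles routing out, 7+ in the bank, ~3 back): with one address issued per cycle and enough interleaved banks, element i arrives at cycle i + 13 — long latency, but one-element-per-cycle throughput once the pipe fills.

```python
# Rough latency model of a CRAY-style interleaved-bank vector load.
ROUTE_OUT, BANK, ROUTE_BACK = 3, 7, 3
LOAD_LATENCY = ROUTE_OUT + BANK + ROUTE_BACK   # 13 cycles minimum

def arrival_cycles(n_elems):
    """Cycle at which each vector element comes back from memory."""
    return [i + LOAD_LATENCY for i in range(n_elems)]
```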
>
>
> It is at least easier to imagine how stuff works when the answer is, say:
> We have pipelined instructions;
> Data moves from place to place via registers;
> ...
>
> But, say, if it is something like the RISC-V V extension, which
> seemingly defines its vector registers as essentially specialized tagged
> array pointers, it is less obvious how it is supposed to work.
<
You are being unimaginative........
> >>
> >>
> >> This means that likely one would end up with vector registers something
> >> like:
> >> Vector Address (visible);
> >> Vector length and element type (visible);
> >> Cached vector data (SIMD style, hidden);
> >> ...
> >> Then one can potentially side-step the memory loads/stores for short
> >> vectors if the value is already in a vector register, but would then
> >> need a way to make sure that the vector is written back to memory when
> >> the program is done with it, ...
> >>
> >> Is this worthwhile? I will express some doubt.
> > <
> > Note: Vector machines seldom had caches.
> > Worthwhile: To Livermore and Aberdeen: yes; average application: NO.
> OK.
>
> This turns it into more of a mystery then.
> >>
> >>
> >> An alternative route is some sort of variable-length SIMD:
> > <Cringe>
> IIRC, ARM has apparently done something like this with SVE.
> > <
> >> We have vector/SIMD registers, but the length is implicit;
> >> Some way to specify "maximum vector length" is given, and to compare
> >> this against however much has been consumed, in a way where the logic is
> >> agnostic to the actual size of the SIMD registers.
> >>
> > As VVM has shown, you don't need vector registers.
> > As VVM is yet to show, you don't need SIMD registers, either.
> But, WRT:
> How exactly does it work?
<
The loop defines a unit of work to be performed.
The Decoder examines the work as it is loaded into the execution window.
When the Decoder can determine that it can perform multiple iterations per
cycle, it SIMDs the loop to execute at cache access width.
<
> How exactly would it be implemented?
> ...
>
> I am limited partly by my imagination here, I can implement things if I
> can imagine how they will work.
>
>
> I can imagine SIMD.
<
Imagine a SIMD engine of cache access width (1×, 2×, or 4× to begin with).
Cache accesses data, SIMD engine crunches data, cache absorbs modified
data; repeat as necessary, n ÷ SIMD-width times.
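The scheme above can be modeled in a few lines; `vec_loop` and its arguments are illustrative names, not real ISA. The point is that the width never appears in the code, so the same loop runs on 1-wide and 8-wide implementations alike:

```python
# VVM-style loop execution, modeled: hardware picks a SIMD width equal
# to the cache access width and runs the scalar loop body in chunks.

def vec_loop(dst, src, n, body, width=4):
    for base in range(0, n, width):            # one cache access per chunk
        for i in range(base, min(base + width, n)):
            dst[i] = body(src[i])              # lanes, serialized in the model

src = list(range(10))
dst = [0] * 10
vec_loop(dst, src, 10, lambda x: 2 * x)
```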
>
> I can also imagine vector machines, albeit most of what I can imagine
> for possible implementations would fall well short of being
> performance-competitive.
<
Think of shift-register scheduling. At Decode time everything is dropped into
a shift register per function unit. At this point you just have to obey register
dependencies.
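A toy scoreboard in the spirit of this scheme, as a sketch: one op is decoded per cycle, and an op issues only once its source registers have been written back. A uniform latency `lat` is an assumption here; real hardware keys it per function unit.

```python
def schedule(ops, lat=3):
    """ops: list of (dest, src1, src2).  Returns [(dest, ready_cycle)]."""
    ready = {}          # register -> cycle its value is available
    issue = 0           # next decode slot
    out = []
    for dest, s1, s2 in ops:
        start = max(issue, ready.get(s1, 0), ready.get(s2, 0))
        ready[dest] = start + lat       # writeback falls out lat slots later
        out.append((dest, ready[dest]))
        issue = start + 1               # one decode per cycle
    return out
```

So a = b + c followed by the dependent d = a + c gives a ready at cycle 3 and d at cycle 6 with lat=3.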
>
> I don't fault the concept in this case, but "if I implement it how I can
> imagine it, it would likely be a boat-anchor" is not a huge motivation.
>
>
>
> In my case, at least SIMD on BJX2 uses GPRs, so I avoid this source of
> issues.
>
> I guess an open question is "What if I want an exploding set of new SIMD
> instructions?"
>
> Could probably put them in Op64 space or something.
>
> Could maybe define:
> * FFw0_0ZZZ-F0nm_6GoD
> * FFw0_0ZZZ-F0nm_6GoE
> * FFw0_0ZZZ-F0nm_6GoF
>
> As a big space for 128-bit 3R SIMD ops or something...
> * Well, if I feel a sudden need for a bunch of new 128-bit 3R ops...
>
>
>
> Goes looking at commits, recent added instructions (newer to older):
> * LDTEX (Texture Load)
> * MULU.X / DIVU.X / REMU.X (Reserved encodings for 128-bit MUL/DIV)
> * CMPNANTEQ (Compare NaN Tag Equal)
> * BCDADC / BCDSBB (BCD ADD / SUB)
> * FDIVA / FSQRTA (Fast approximations for FDIV and FSQRT)
> * FDIV and FSQRT (FPU Divide and Square Root ops)
> * DIVS.Q / DIVU.Q, MODS.Q / MODU.Q (64-bit Divide and Modulo)
> ** Also implemented 64-bit Multiply.
> * Op40x2 Encodings (New encoding space based on hacked XGPR)
> * --- 2022 ---
> * BITSEL and MOV.C
> * Scaled-Index encodings
> * DMAC (Integer Multiply-Accumulate Experiment)
> * XMOV (Load/Store, 128-bit pointers)
> * ...
>
> Looks like around this time last year was when I added XGPR encodings
> (expanding to 64 GPRs), ...
>
> Not really looking like a SIMD explosion, more a "misc stuff" explosion....
<
Consider having 1 instruction for each SIMD operation, so you have k SIMD
instructions {signed(+-×/), unsigned(+-×/), logical(&,&~,|,|~,^,^~), FP(+-×/)}
and j SIMD widths {8, 16, 32, 64, 128, 256, 512}, or about k×j; now add in p
predicated operations (lane control) and you end up with k×j×p SIMD
instructions.
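Counting that explosion out, with the per-class counts from the post and an illustrative assumption of two predication variants (plain vs lane-masked):

```python
# k×j×p opcode count: each scalar operation class duplicated per SIMD
# width and per predication variant.
ops    = 4 + 4 + 6 + 4        # signed, unsigned, logical, FP classes
widths = 7                    # {8, 16, 32, 64, 128, 256, 512}-bit
preds  = 2                    # plain vs lane-masked (assumed)
simd_opcodes = ops * widths * preds   # vs just `ops` for a loop-vector ISA
```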
>
> So, it would appear that, over the past 1.5 years:
> * Have filled F0nm_3eo1..F0nm_3eo7 with ALUX ops
> ** 8..F are reserved for more 2R space (for when 1zz8..1zzF gets full).
> ** Some space originally used for Imm5 ALU ops was also reused for ALUX.
> *** The Imm5 ALU ops were redundant with the Imm9 encodings.
> * Have filled F0nm_6eoX (Mostly with MUL/DIV and FPU related ops).
> * Have filled F0nm_8eoX (XMOV ops).
>
>
> Last SIMD ops I could find in this survey were type-conversion ops for
> FP8 and similar.
>
> It looks like the encodings for most of the SIMD ops were added around
> early 2020, and some in late 2019 (along with Jumbo encodings and similar).
>
>
> Most activity from 2018 is related to core parts of the ISA (or, the end
> of the BJX1 project and start of the BJX2 project).
>
> The F1/F2 blocks mostly go back to the 2018/2019 timeframe, and were
> already mostly full as they were originally defined (Imm9/Disp9 space
> didn't go very far).
>
>
> The F8 block is mostly the same as it was in the initial spec, the only
> real addition being the "FLDCH Imm16, Rn" instruction (Load a Binary16
> constant as a Binary64).
>
> The Jumbo encodings don't really change the existing blocks, but in
> their present form did come at the cost of the complete elimination of
> the original Op48 encoding space (new Op48 encodings could be added
> based on Jumbo, but in this case would have little practical advantage over
> Op64 encodings).
>
>
>
> Remaining spaces within the F0 block:
> * F0nm_7eoX
> * F0nm_9eoX
> * F0nm_AeoX
> * F0nm_BeoX
> Potentially Reclaimable:
> * F0nm_EeoX (Deprecated BT Encoding)
> * F0nm_FeoX (Deprecated BF Encoding)
>
> And, other in-use blocks (within F0):
> * F0nm_0eoX (Filled, Initial ISA design, Ld/St)
> * F0nm_1eoX (Filled, Initial ISA design, ALU and 2R)
> * F0nm_2eoX (Filled, ~ 2019/2020)
> * F0nm_3eoX (~ 2021, 1R and ALUX)
> * F0nm_4eoX (Ld/St, Still mostly free, MOV.X/MOV.C/FMOV.S)
> * F0nm_5eoX (Filled, ~ 2019/2020)
> * F0nm_6eoX (Filled, ~ 2021/2022)
> * F0nm_8eoX (Filled, 2021, XMOV Ld/St)
> * F0nm_CeoX (BRA Disp20)
> * F0nm_DeoX (BSR Disp20)
> * F0nm_EeoX (Deprecated)
> * F0nm_FeoX (Deprecated)
>
>
> Not sure which block I would expand into next.
>
> The 7/9/E/F blocks do have the drawback that any encodings within this
> block would not automatically re-align the decoder if the decoder
> becomes misaligned.
>
> For the most part, the ISA has the property that if one tries to start
> decoding from a misaligned instruction word, then the decoding will
> quickly realign itself within a few instructions.
>
>
> Within the 32-bit encoding space, there is still the F3 and F9 blocks,
> currently mostly intended as User Blocks.
> >> ...
> >>
> >>
> >> I personally took a different approach, deciding to see SIMD registers,
> >> not as some way to express arbitrary length vectors, but rather as
> >> another layer in the same tree as scalar types.
> >>
> >> A 3 element vector will always be a 3 element vector, just,
> >> conceptually, as a new type of entity (distinct from the floating-point
> >> elements which it contains).
> >>
> >> Something like a large vector might in-effect contain this vector as a
> >> sub-element within itself (and one could potentially focus more advanced
> >> vector optimizations on "vectors of vectors" or similar).
> >>
> >>
> >> Though, at present, there isn't really a cost-effective way to pull off
> >> the latter.
> >>
> >> Main option I can come up with at present being to consider some
> >> hypothetical version of my ISA where the GPRs are effectively 2x or 4x
> >> larger, and handled as multiple instances of the same instruction
> >> operating on 64-bit slices of each register; this would likely also
> >> involve some more advanced predication or "lane masking".
> >>
> >> However, if implemented, normal C wouldn't likely be able to use this
> >> effectively (absent a lot of language extensions and compiler hints).
> >>
> >>
> >> This is likely to be bottlenecked by memory bandwidth even worse than my
> >> current core (where, say, the span-drawing loops are already prone to
> >> spend roughly 70% of their clock cycles mostly dealing with cache
> >> misses; if one could run 4x as wide, but now have 93% of the clock
> >> cycles going into cache misses, this would suck).
> >>
> >> ...
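An Amdahl-style check of the 70% → ~93% worry above: if only the compute portion gets faster while cache-miss time stays fixed, the miss share climbs. Under these assumed inputs a 4× compute speedup lands around 90%; the exact figure depends on how much of the non-miss 30% actually scales.

```python
def miss_share(miss_frac, compute_speedup):
    """Fraction of cycles in misses after compute alone speeds up."""
    compute = (1.0 - miss_frac) / compute_speedup
    return miss_frac / (miss_frac + compute)

share = miss_share(0.70, 4)   # ~0.90 under these assumptions
```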


Re: Reconsidering Variable-Length Operands

<8271d6b5-4e1c-47c5-8250-4a4a8410cf65n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25735&group=comp.arch#25735

 by: MitchAlsup - Sat, 11 Jun 2022 16:45 UTC

On Saturday, June 11, 2022 at 4:13:33 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > But overnight I figured out a way to add FP128 and FP16 (and more if desired.)
> > along with a way to add SIMD that fits in 64-bit registers.
> SIMD can also be important together with VVM (or other vectorization
> methods), to expose SLP opportunities. (SLP means superword-level
> parallelism: exploiting SIMD opportunities in straight-line code,
> for example when somebody writes
>
> a = b + c
> c = e + f
>
> for a type that fits twice into a SIMD register, that could
> be done as a single instruction.)
<
Except for the loop-carried dependency on c.

Re: Reconsidering Variable-Length Operands

<t82hhf$3o9$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25736&group=comp.arch#25736

 by: Marcus - Sat, 11 Jun 2022 16:57 UTC

On 2022-06-10, Ivan Godard wrote:
> On 6/10/2022 8:54 AM, MitchAlsup wrote:
>> On Friday, June 10, 2022 at 2:34:08 AM UTC-5, Marcus wrote:
>>> On 2022-06-08, MitchAlsup wrote:
>>>> On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
>>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>> <
>>>> Plus I have nothing similar to AVX--using VEC-LOOP and native
>>>> instructions for vectorization. VEC-LOOP eliminates any need for
>>>> SIMD ISA--that is, the implementation of the moment decides the
>>>> width of the SIMD path, and the programmer/compiler accesses this
>>>> via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
>>>> and no wide registers are needed to provide the capability.
>>>> <
>>>> VEC-LOOP provides vectorized string handling, memory handling, and
>>>> when I get around to it, decimal handling.
>>>
>>> Regardless of how well designed an ISA is, 20-30 years after its initial
>>> release there *will* be new instructions in the ISA, and they will
>> <
>> I completely agree with this, new requirements happen, which is why more
>> than 40% of the ISA encoding space is currently unoccupied--leaving
>> plenty
>> of room to grow.
>> <
>>> mostly be about improving performance in special cases in certain
>>> domains (e.g. encryption, codecs, security, etc). If the ISA is popular
>> <
>> As to encryption and several procedures like encryption:: would it not be
>> better to provide an attached processor, specialized for <say> SHA256-512
>> where you hand it a from-address, a to-address, a set of keys, and a
>> length
>> and just tell it to start processing. It goes off and performs the
>> task, then
>> sets a memory location or interrupts the thread. The length could be
>> arbitrarily long, might encode from/into memory spaces this process
>> cannot see (adding security), with other unstated advantages.
>> <
>> This would give a lot better than 10% improvement, because the CPU can
>> be doing other things while encryption is in process. The HW could be
>> designed as a data-path that only knows how to do encryption things,
>> maybe 512-bits wide per cycle,.....
>
> It sounds so easy, but...
>
> you lose so much synchronizing with the outboard work in a legacy ISA
> that there's a minimal task size for the co-processor to be worth
> having. Sort of an uncanny valley in efficiency.
>
> To make it useful and not the bugfest and programming pain of the usual
> asynchronous code you need to have built a continuation model into the
> ISA from the beginning. You've paid a lot of attention to Ada
> rendezvous, but that's too heavy and thread oriented to really do the
> job IMO.
>

There's certainly a place for dedicated co-processors. For encryption it
makes perfect sense to have HW solutions in the memory or I/O data paths
for instance, making RAM or disk encryption "free" and completely
transparent to the SW.

For more "dynamic" SW solutions, I agree that you will typically at
least do the initial algorithm development and deployment in regular SW,
possibly using intrinsics / assembler to make certain parts of the
kernels go faster. A dedicated async encryption machine may be useful in
certain enterprise/server domains, but for mainstream it's probably more
practical and economical to just add a few specialized instructions.

BTW, encryption was just an example. 30 years ago encryption and
security were not as high up on the list of important features as they
are today, and I dare not guess what functionality will be a must-have
30 years from now.

/Marcus

Re: Reconsidering Variable-Length Operands

<81a486ea-85e0-4534-97d3-3b1df74373b5n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25737&group=comp.arch#25737

 by: MitchAlsup - Sat, 11 Jun 2022 18:01 UTC

On Friday, June 10, 2022 at 1:50:28 PM UTC-5, Ivan Godard wrote:
> On 6/10/2022 8:54 AM, MitchAlsup wrote:
> > On Friday, June 10, 2022 at 2:34:08 AM UTC-5, Marcus wrote:
> >> On 2022-06-08, MitchAlsup wrote:
> >>> On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
> >>>> MitchAlsup <Mitch...@aol.com> schrieb:
> >>> <
> >>> Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
> >>> vectorization. VEC-LOOP eliminates any need for SIMD ISA--that is, the implementation
> >>> of the moment decides the width of the SIMD path, and the programmer/compiler
> >>> accesses this via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
> >>> and no wide-registers are needed to provide the capability.
> >>> <
> >>> VEC-LOOP provides vectorized string handling, memory handling, and when I get around
> >>> to it, decimal handling.
> >>
> >> Regardless of how well designed an ISA is, 20-30 years after its initial
> >> release there *will* be new instructions in the ISA, and they will
> > <
> > I completely agree with this, new requirements happen, which is why more
> > than 40% of the ISA encoding space is currently unoccupied--leaving plenty
> > of room to grow.
> > <
> >> mostly be about improving performance in special cases in certain
> >> domains (e.g. encryption, codecs, security, etc). If the ISA is popular
> > <
> > As to encryption and several procedures like encryption:: would it not be
> > better to provide an attached processor, specialized for <say> SHA256-512
> > where you hand it a from-address, a to-address, a set of keys, and a length
> > and just tell it to start processing. It goes off and performs the task, then
> > sets a memory location or interrupts the thread. The length could be
> > arbitrarily long, might encode from/into memory spaces this process
> > cannot see (adding security), with other unstated advantages.
> > <
> > This would give a lot better than 10% improvement, because the CPU can
> > be doing other things while encryption is in process. The HW could be
> > designed as a data-path that only knows how to do encryption things,
> > maybe 512-bits wide per cycle,.....
> It sounds so easy, but...
>
> you lose so much synchronizing with the outboard work in a legacy ISA
> that there's a minimal task size for the co-processor to be worth
> having. Sort of an uncanny valley in efficiency.
<
What if the attached processor could write to a register in the originating
thread? Then the thread could continue running; if the thread used the target
register as an operand, the CPU would wait for the attached processor,
just like any potentially long-running instruction (an uncacheable LD down the
PCIe tree takes hundreds of cycles.)
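This register-writeback idea behaves like a future: the attached processor fills a "register" asynchronously, and the CPU thread blocks only at the first instruction that reads it as an operand. A minimal sketch, with `sha_like` a stand-in placeholder rather than a real crypto engine:

```python
from concurrent.futures import ThreadPoolExecutor

def sha_like(data):
    return sum(data) & 0xFFFF          # placeholder for the real work

pool = ThreadPoolExecutor(max_workers=1)
reg = pool.submit(sha_like, range(1000))   # kick off the attached processor

# ... the CPU keeps executing unrelated instructions here ...

digest = reg.result()   # first use of the target register: stall if not done
pool.shutdown()
```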
>
> To make it useful and not the bugfest and programming pain of the usual
> asynchronous code you need to have built a continuation model into the
> ISA from the beginning. You've paid a lot of attention to Ada
> rendezvous, but that's too heavy and thread oriented to really do the
> job IMO.

Re: Reconsidering Variable-Length Operands

<t82ltm$tim$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25738&group=comp.arch#25738

 by: Stephen Fuld - Sat, 11 Jun 2022 18:12 UTC

On 6/11/2022 11:01 AM, MitchAlsup wrote:
> On Friday, June 10, 2022 at 1:50:28 PM UTC-5, Ivan Godard wrote:
>> On 6/10/2022 8:54 AM, MitchAlsup wrote:
>>> On Friday, June 10, 2022 at 2:34:08 AM UTC-5, Marcus wrote:
>>>> On 2022-06-08, MitchAlsup wrote:
>>>>> On Tuesday, June 7, 2022 at 12:53:22 PM UTC-5, Thomas Koenig wrote:
>>>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>>> <
>>>>> Plus I have nothing similar to AVX--using VEC-LOOP and native instructions for
>>>>> vectorization. VEC-LOOP eliminates any need for SIMD ISA--that is, the implementation
>>>>> of the moment decides the width of the SIMD path, and the programmer/compiler
>>>>> accesses this via VEC-LOOP. 1-wide, 2-wide, 3-wide !!, 4-wide, 8-wide, 96-wide !!
>>>>> and no wide-registers are needed to provide the capability.
>>>>> <
>>>>> VEC-LOOP provides vectorized string handling, memory handling, and when I get around
>>>>> to it, decimal handling.
>>>>
>>>> Regardless of how well designed an ISA is, 20-30 years after its initial
>>>> release there *will* be new instructions in the ISA, and they will
>>> <
>>> I completely agree with this, new requirements happen, which is why more
>>> than 40% of the ISA encoding space is currently unoccupied--leaving plenty
>>> of room to grow.
>>> <
>>>> mostly be about improving performance in special cases in certain
>>>> domains (e.g. encryption, codecs, security, etc). If the ISA is popular
>>> <
>>> As to encryption and similar procedures: would it not be
>>> better to provide an attached processor, specialized for <say> SHA256-512
>>> where you hand it a from-address, a to-address, a set of keys, and a length
>>> and just tell it to start processing. It goes off and performs the task, then
>>> sets a memory location or interrupts the thread. The length could be
>>> arbitrarily long, might encode from/into memory spaces this process
>>> cannot see (adding security), with other unstated advantages.
>>> <
>>> This would give a lot better than 10% improvement, because the CPU can
>>> be doing other things while encryption is in process. The HW could be
>>> designed as a data-path that only knows how to do encryption things,
>>> maybe 512 bits wide per cycle...
>> It sounds so easy, but...
>>
>> you lose so much synchronizing with the outboard work in a legacy ISA
>> that there's a minimal task size for the co-processor to be worth
>> having. Sort of an uncanny valley in efficiency.
> <
> What if the attached processor could write to a register in the originating
> thread? Then if the thread continued running, if the thread used the target
> register as an operand, the CPU would wait for the attached processor
> just like any potentially long running instruction (uncacheable LD down the
> PCIe tree takes hundreds of cycles.)

Does that work if, between the initiation and the termination of the
asynchronous operation, the originating thread gets switched out of
control of the CPU (say another higher priority task gets switched in)?
Does the attached processor know enough to update the copy of the
register in the originating process's register-save area? Or is the
operation of the attached processor for that task stopped when the task
is switched out? Some other solution?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

