novaBBS - comp.arch - Re: Wide issue variable length ISA

Re: Wide issue variable length ISA

<52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16469&group=comp.arch#16469

X-Received: by 2002:ac8:a83:: with SMTP id d3mr26558561qti.91.1620221590483;
Wed, 05 May 2021 06:33:10 -0700 (PDT)
X-Received: by 2002:a4a:952b:: with SMTP id m40mr24427752ooi.69.1620221590162;
Wed, 05 May 2021 06:33:10 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 5 May 2021 06:33:09 -0700 (PDT)
In-Reply-To: <df63ece2-40f8-4954-86ed-2e8a8c72711en@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:f8e3:d700:c1b7:8232:ffb7:d942;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:f8e3:d700:c1b7:8232:ffb7:d942
References: <df63ece2-40f8-4954-86ed-2e8a8c72711en@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com>
Subject: Re: Wide issue variable length ISA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Wed, 05 May 2021 13:33:10 +0000
Content-Type: text/plain; charset="UTF-8"

by: Quadibloc - Wed, 5 May 2021 13:33 UTC

On Monday, February 1, 2021 at 12:03:59 PM UTC-7, MitchAlsup wrote:

> For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> ago, my subconscious barfed up a good idea on the matter--based on
> sound underlying ideas::
> a) since we have precise exceptions--we CAN figure out where the
> constant is in the instruction address space,
> b) Since we have a scoreboard/station based instruction scheduling
> mechanism, the instruction can wait for its constant just as easily
> as any other operand,
> c) since we will be executing packets most of the time, the ICache is
> idle, so we can fetch the constants from the ICache ! and route these
> as "use once" operands ! (or "Scalar Operands" when VVM is active)
> d) therefore, we need allocate no space in the instruction slots for
> constants!
> e) all we need is an execution window "deep" enough to absorb this
> additional latency.

At first I couldn't make heads or tails out of this.

But upon reflection, what I *think* you're talking about is this:

Immediate instructions 'pop' constants from a constant stack, and
the instruction stream also contains constants, which push themselves
(i.e. there's an instruction opcode that means 'push constant immediate')
with the possible additional proviso that an immediate instruction can pop
a constant 'on credit'; i.e., wait until the corresponding push constant
instruction happens, which does not need to precede the immediate
instruction in the instruction stream.
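
The "pop on credit" reading above can be sketched as a small pairing routine. This only illustrates the interpretation given in this post, not the My 66000 mechanism itself. The tuple format, the names `resolve_immediates`, `stack` and `waiting`, and the rule that the oldest waiting instruction is served first are all assumptions made for the example.

```python
from collections import deque


def resolve_immediates(stream):
    """Pair each immediate instruction with a constant.

    `stream` is a list of ("push", value) or ("imm", name) tuples
    (a made-up format for this sketch). Returns {name: value}.
    """
    stack = []         # constants pushed but not yet consumed
    waiting = deque()  # immediate instructions that popped "on credit"
    bound = {}
    for kind, arg in stream:
        if kind == "push":
            if waiting:
                # Assumed policy: settle the oldest outstanding credit first.
                bound[waiting.popleft()] = arg
            else:
                stack.append(arg)
        elif kind == "imm":
            if stack:
                bound[arg] = stack.pop()
            else:
                waiting.append(arg)  # wait for a later push
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
    if waiting:
        raise ValueError(f"no constant ever arrived for: {list(waiting)}")
    return bound
```

Here an immediate instruction that arrives while the stack is empty simply waits, so the matching push may appear later in the stream.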

I would never have thought of this, or at least seriously considered it, for
the following reason: the 'push constant' opcode would have to be
wastefully long, given that constants are of data types that fit well into
the memory... and instructions also fit well into the _same_ memory,
with the same alignment and chunks.

So I prefer schemes which instead make use of a small number of
overhead bits in the immediate instructions. But, even though it covers
possibly multiple constants, I still also need at least one 16-bit block
header in the schemes I've devised, which still leaves me unsatisfied.

The only possible alternative I could come up with, that I've currently rejected
as preposterous, is: devise a secondary instruction set where the instructions
are 29 bits long instead of 32 bits, and have the first instruction in each block
come from this set, thus allowing three bits in the block to say how many
instruction slots contain constants.
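
A minimal sketch of that block layout, under assumptions the post does not state: a block of eight 32-bit slots, the three count bits in the top of the first slot, and constants packed into the last slots of the block. The names `SLOTS_PER_BLOCK` and `split_block` are made up for the example.

```python
SLOTS_PER_BLOCK = 8  # assumed block size; the post does not give one


def split_block(words):
    """Split one block of 32-bit words into (instructions, constants)."""
    if len(words) != SLOTS_PER_BLOCK:
        raise ValueError("a block must contain exactly SLOTS_PER_BLOCK words")
    if any(not 0 <= w < (1 << 32) for w in words):
        raise ValueError("every slot must be a 32-bit value")
    n_const = words[0] >> 29       # assumed: top 3 bits count the constant slots
    first = words[0] & 0x1FFFFFFF  # remaining 29 bits: the short first instruction
    n_instr = SLOTS_PER_BLOCK - n_const
    instructions = [first] + words[1:n_instr]
    constants = words[n_instr:]    # assumed: constants sit at the end of the block
    return instructions, constants
```

With three count bits, between zero and seven of the eight slots can hold constants, which is why the first slot must always be an instruction.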

That minimizes overhead to keep instruction density high... but the price in
complexity is excessive.

Still, _one_ possibility does remain with me:

Organize the instruction set in such a way that the 29-bit instructions are a
natural subset of the 32-bit instructions, so that this scheme _can_ be used
without defining two instruction sets.

That is a natural extension of a principle I've already tried: have the first bit
of a 32-bit instruction slot indicate if it contains two 16-bit instructions or one
32-bit instruction, but use it to indicate prefixes instead of pairs of 16-bit
instructions in the first slot.

John Savard

Re: Wide issue variable length ISA

<s6uds6$sdh$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=16471&group=comp.arch#16471

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-27c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Wide issue variable length ISA
Date: Wed, 5 May 2021 15:33:26 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <s6uds6$sdh$1@newsreader4.netcologne.de>
References: <df63ece2-40f8-4954-86ed-2e8a8c72711en@googlegroups.com>
<52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com>
Injection-Date: Wed, 5 May 2021 15:33:26 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-27c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:27c6:0:7285:c2ff:fe6c:992d";
logging-data="29105"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Wed, 5 May 2021 15:33 UTC

Quadibloc <jsavard@ecn.ab.ca> schrieb:

> That is a natural extension of a principle I've already tried: have the first bit
> of a 32-bit instruction slot indicate if it contains two 16-bit instructions or one
> 32-bit instruction, but use it to indicate prefixes instead of pairs of 16-bit
> instructions in the first slot.

I like another idea whose basic frame was developed here a few
weeks ago much better (and which I have extended a bit).

The instructions come in 64-bit bundles. They are independent of
each other, they just happen to share the same word.

Encoding is in the first bits:

Leading 0 means three instructions of 21 bits each. This is for
all the ra = ra op b instructions (which are rather frequent even
in three-operand machines), plus for stuff like load/store with
reasonably large offsets from the stack pointer (or TOC) which
occur frequently. Frequency analysis of existing RISC programs
could give a good idea which instructions should go in there.

Leading 100 and 101 mean a 40-bit instruction and a 21-bit
instruction. The 40-bit instructions should be enough for all
"normal" stuff including loading 32-bit constants, rather large
branches depending on bits set in registers, and maybe a few
four-operand instructions, like FMA or

ra = rb - rc, plus set all comparison flags in rd

or a direct

(ra, rb) = (mulh(rc, rd), mull(rc,rd))

Leading 110 means a 61-bit instruction, which could do vector
stuff with very many registers or other strange things.

Finally, 111 should be reserved for future expansion.
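
The bit budget works out in every case: 1 + 3*21 = 64, 3 + 40 + 21 = 64, and 3 + 61 = 64. A minimal decoder sketch for the bundle formats listed above follows. The post does not say what distinguishes 100 from 101, so the ordering used here (100 puts the 40-bit instruction first, 101 puts the 21-bit one first) is purely an assumption, as are the names `decode_bundle`, `MASK21` and `MASK40`.

```python
MASK21 = (1 << 21) - 1
MASK40 = (1 << 40) - 1


def decode_bundle(bundle):
    """Return a list of (width_in_bits, payload) for one 64-bit bundle."""
    if not 0 <= bundle < (1 << 64):
        raise ValueError("bundle must be a 64-bit value")
    if bundle >> 63 == 0:
        # Leading 0: three 21-bit instructions, most significant first.
        return [(21, (bundle >> shift) & MASK21) for shift in (42, 21, 0)]
    tag = bundle >> 61
    payload = bundle & ((1 << 61) - 1)
    if tag == 0b100:
        # Assumed meaning: 40-bit instruction first, then the 21-bit one.
        return [(40, payload >> 21), (21, payload & MASK21)]
    if tag == 0b101:
        # Assumed meaning: 21-bit instruction first, then the 40-bit one.
        return [(21, payload >> 40), (40, payload & MASK40)]
    if tag == 0b110:
        return [(61, payload)]
    raise ValueError("tag 111 is reserved for future expansion")
```

Because the format is fully determined by the top one to three bits, each 64-bit bundle can be split into its instructions without looking at any neighbouring bundle.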

There is a certain danger that some 21-bit slots will not be filled
with reasonable instructions.

Jump targets in a bundle could be either the address of the bundle
plus the number of the instruction, or the byte address rounded
up or down to the nearest byte boundary. Either way, if there was no legal
instruction there, an exception will be raised.
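
The first option (bundle address plus instruction number) can be sketched as follows. Since 64-bit bundles are 8-byte aligned, the low three address bits are free; using them for the slot number is an assumption of this sketch, as are the names `IllegalJumpTarget`, `slots_in_bundle` and `resolve_target` and the dictionary standing in for instruction memory.

```python
class IllegalJumpTarget(Exception):
    """Raised when a jump lands where no instruction starts."""


def slots_in_bundle(bundle):
    """Number of instructions in a 64-bit bundle, from its leading bits."""
    if bundle >> 63 == 0:
        return 3          # leading 0: three 21-bit instructions
    tag = bundle >> 61
    if tag in (0b100, 0b101):
        return 2          # one 40-bit and one 21-bit instruction
    if tag == 0b110:
        return 1          # a single 61-bit instruction
    raise IllegalJumpTarget("bundle uses the reserved 111 encoding")


def resolve_target(memory, target):
    """Split a jump target into (bundle_address, slot), checking legality.

    `memory` maps 8-byte-aligned addresses to 64-bit bundles.
    """
    bundle_addr = target & ~0x7
    slot = target & 0x7   # assumed: slot number lives in the low 3 bits
    if slot >= slots_in_bundle(memory[bundle_addr]):
        raise IllegalJumpTarget(f"no instruction {slot} at {bundle_addr:#x}")
    return bundle_addr, slot
```

A jump to slot 1 of a bundle that holds a single 61-bit instruction raises the exception, matching the rule that a target with no legal instruction traps.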

I think there is a good chance that the code density could be a
bit higher than a 32/64 bit encoding scheme.

Re: Wide issue variable length ISA

<17ff6beb-f9aa-4aaf-8580-8514534d2cd7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16475&group=comp.arch#16475

X-Received: by 2002:a05:6214:da7:: with SMTP id h7mr32290395qvh.48.1620234676806;
Wed, 05 May 2021 10:11:16 -0700 (PDT)
X-Received: by 2002:a05:6830:108c:: with SMTP id y12mr25183131oto.276.1620234676557;
Wed, 05 May 2021 10:11:16 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news.muarf.org!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 5 May 2021 10:11:16 -0700 (PDT)
In-Reply-To: <52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <df63ece2-40f8-4954-86ed-2e8a8c72711en@googlegroups.com> <52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <17ff6beb-f9aa-4aaf-8580-8514534d2cd7n@googlegroups.com>
Subject: Re: Wide issue variable length ISA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 05 May 2021 17:11:16 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Wed, 5 May 2021 17:11 UTC

On Wednesday, May 5, 2021 at 8:33:11 AM UTC-5, Quadibloc wrote:
> On Monday, February 1, 2021 at 12:03:59 PM UTC-7, MitchAlsup wrote:
>
> > For the My 66000 ISA I wanted a "better" scheme, and a couple of days
> > ago, my subconscious barfed up a good idea on the matter--based on
> > sound underlying ideas::
> > a) since we have precise exceptions--we CAN figure out where the
> > constant is in the instruction address space,
> > b) Since we have a scoreboard/station based instruction scheduling
> > mechanism, the instruction can wait for its constant just as easily
> > as any other operand,
> > c) since we will be executing packets most of the time, the ICache is
> > idle, so we can fetch the constants from the ICache ! and route these
> > as "use once" operands ! (or "Scalar Operands" when VVM is active)
> > d) therefore, we need allocate no space in the instruction slots for
> > constants!
> > e) all we need is an execution window "deep" enough to absorb this
> > additional latency.
> At first I couldn't make heads or tails out of this.
>
> But upon reflection, what I *think* you're talking about is this:
>
> Immediate instructions 'pop' constants from a constant stack, and
> the instruction stream also contains constants, which push themselves
> (i.e. there's an instruction opcode that means 'push constant immediate')
<
There is no such OpCode. The constants are fetched from execution memory.
Executing instructions from that memory have been placed into what I call a
packet. A packet contains places where 6 (or 8) instructions can be placed.
These instructions are 35-bits long--the register specifiers have expanded
into 6-bits to allow for intra-packet forwarding, and single use encodings.
<
> with the possible additional proviso that an immediate instruction can pop
> a constant 'on credit'; i.e., wait until the corresponding push constant
> instruction happens, which does not need to precede the immediate
> instruction in the instruction stream.
>
> I would never have thought of this, or at least seriously considered it, for
> the following reason: the 'push constant' opcode would have to be
> wastefully long, given that constants are of data types that fit well into
> the memory... and instructions also fit well into the _same_ memory,
> with the same alignment and chunks.

There is no push constant, all we need is something like a 4-bit "address"
because we have the IP of the first instruction and 99% of the time those
6 (or 8) instructions fit in less than 16 words. And note: the common 16-bit
immediates need no such accessing of the ICache.
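
A minimal sketch of that addressing idea: the packet remembers the IP of its first instruction, and an instruction that needs a large constant carries only a 4-bit word offset from that IP. The 32-bit word size, the low-word-first order for multi-word constants, the list standing in for the ICache, and the names `WORD_BYTES` and `fetch_constant` are assumptions for illustration only.

```python
WORD_BYTES = 4  # assumed 32-bit instruction words


def fetch_constant(imem, packet_ip, offset, n_words=1):
    """Fetch a constant located `offset` words past the packet's first IP.

    `imem` is a list of 32-bit words indexed by word address, standing in
    for the ICache. `offset` must fit in the 4-bit field.
    """
    if not 0 <= offset < 16:
        raise ValueError("offset must fit in 4 bits")
    first = packet_ip // WORD_BYTES + offset
    value = 0
    for i in range(n_words):
        # Assumed layout: a multi-word constant is stored low word first.
        value |= imem[first + i] << (32 * i)
    return value
```

The instruction itself then waits on the fetched value like any other operand, which is why the packet slots need no room for the constant.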
>
> So I prefer schemes which instead make use of a small number of
> overhead bits in the immediate instructions. But, even though it covers
> possibly multiple constants, I still also need at least one 16-bit block
> header in the schemes I've devised, which still leaves me unsatisfied.
>
> The only possible alternative I could come up with, that I've currently rejected
> as preposterous, is: devise a secondary instruction set where the instructions
> are 29 bits long instead of 32 bits, and have the first instruction in each block
> come from this set, thus allowing three bits in the block to say how many
> instruction slots contain constants.
<
If you think about it, what I am trying to do is essentially what you are doing;
I am just trying to do it dynamically as I build packets. Doing it dynamically,
I can get those 3 bits (5 in my case) without damaging the natural encoding
of the instructions.
>
> That minimizes overhead to keep instruction density high... but the price in
> complexity is excessive.
>
> Not that _one_ possibility doesn't still remain with me:
>
> Organize the instruction set in such a way that the 29-bit instructions are a
> natural subset of the 32-bit instructions, so that this scheme _can_ be used
> without defining two instruction sets.
>
> That is a natural extension of a principle I've already tried: have the first bit
> of a 32-bit instruction slot indicate if it contains two 16-bit instructions or one
> 32-bit instruction, but use it to indicate prefixes instead of pairs of 16-bit
> instructions in the first slot.
>
> John Savard

Re: Wide issue variable length ISA

<CwTkI.37598$DZ5.16281@fx23.iad>

https://www.novabbs.com/devel/article-flat.php?id=16491&group=comp.arch#16491

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx23.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Wide issue variable length ISA
References: <df63ece2-40f8-4954-86ed-2e8a8c72711en@googlegroups.com> <52b16590-b60a-49f6-9e0e-c3b3c6b6da97n@googlegroups.com> <17ff6beb-f9aa-4aaf-8580-8514534d2cd7n@googlegroups.com>
In-Reply-To: <17ff6beb-f9aa-4aaf-8580-8514534d2cd7n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 95
Message-ID: <CwTkI.37598$DZ5.16281@fx23.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 06 May 2021 15:00:50 UTC
Date: Thu, 06 May 2021 11:00:27 -0400
X-Received-Bytes: 5347

by: EricP - Thu, 6 May 2021 15:00 UTC

MitchAlsup wrote:
> On Wednesday, May 5, 2021 at 8:33:11 AM UTC-5, Quadibloc wrote:
>> On Monday, February 1, 2021 at 12:03:59 PM UTC-7, MitchAlsup wrote:
>>
>>> For the My 66000 ISA I wanted a "better" scheme, and a couple of days
>>> ago, my subconscious barfed up a good idea on the matter--based on
>>> sound underlying ideas::
>>> a) since we have precise exceptions--we CAN figure out where the
>>> constant is in the instruction address space,
>>> b) Since we have a scoreboard/station based instruction scheduling
>>> mechanism, the instruction can wait for its constant just as easily
>>> as any other operand,
>>> c) since we will be executing packets most of the time, the ICache is
>>> idle, so we can fetch the constants from the ICache ! and route these
>>> as "use once" operands ! (or "Scalar Operands" when VVM is active)
>>> d) therefore, we need allocate no space in the instruction slots for
>>> constants!
>>> e) all we need is an execution window "deep" enough to absorb this
>>> additional latency.
>> At first I couldn't make heads or tails out of this.
>>
>> But upon reflection, what I *think* you're talking about is this:
>>
>> Immediate instructions 'pop' constants from a constant stack, and
>> the instruction stream also contains constants, which push themselves
>> (i.e. there's an instruction opcode that means 'push constant immediate')
> <
> There is no such OpCode. The constants are fetched from execution memory.
> Executing instructions from that memory have been placed into what I call a
> packet. A packet contains places where 6 (or 8) instructions can be placed.
> These instructions are 35-bits long--the register specifiers have expanded
> into 6-bits to allow for intra-packet forwarding, and single use encodings.
> <
>> with the possible additional proviso that an immediate instruction can pop
>> a constant 'on credit'; i.e., wait until the corresponding push constant
>> instruction happens, which does not need to precede the immediate
>> instruction in the instruction stream.
>>
>> I would never have thought of this, or at least seriously considered it, for
>> the following reason: the 'push constant' opcode would have to be
>> wastefully long, given that constants are of data types that fit well into
>> the memory... and instructions also fit well into the _same_ memory,
>> with the same alignment and chunks.
>
> There is no push constant, all we need is something like a 4-bit "address"
> because we have the IP of the first instruction and 99% of the time those
> 6 (or 8) instructions fit in less than 16 words. And note: the common 16-bit
> immediates need no such accessing of the ICache.

Some interesting tidbits...

The AMD patent on the operation cache, 2016
https://patents.google.com/patent/US20200225956A1/

refers to some microarchitecture details, presumably for Zen3,
which, IIUC, deal with IP and variable format instructions,
and how they carry around these multiple 64-bit values internally.

On x64 there can be multiple uOps for each macro-op and therefore each IP.
Each macro-op can optionally have a displacement and/or immediate constant.

If an instruction has both displacement and immediate and decodes to
multiple uOps, those constants might be used by one or more different uOps.
But we don't want all uOps to carry about 64-bit quantities everywhere.

It looks like the uArch has the decoder take the parsed
instruction and IP and split it out into 3 piles:
- the IP goes into an IP circular buffer
- the 1/2/4 byte displacement and/or immediate, if present,
into a constants circular buffer
- one or more uOps go into the uOp buffer with the indexes into
the IP and constants buffers.

Essentially they "second normal form" it.
Then presumably each circular buffer is sized according to usage stats.

                    IP
                     v
variable_instruction [displacement] [immediate]
                     v
                  decode
                     v
         -------------------------
         v           v           v
        IP <--idx---uOp---idx-->const
       queue       queue        queue
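
The three-way split can be sketched as below. This is a reading of the description in this post, not AMD's actual design: the `Ring` class, the `UOp` fields, the `decode` function, and the choice to share one set of constant indexes across all uOps of a macro-op are assumptions. The ring also overwrites old entries without any full/stall handling, which real hardware would need.

```python
from dataclasses import dataclass
from typing import Tuple


class Ring:
    """A tiny circular buffer that returns the index of each stored value."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0

    def put(self, value):
        idx = self.head
        self.slots[idx] = value  # simplification: no check for a full buffer
        self.head = (self.head + 1) % len(self.slots)
        return idx


@dataclass
class UOp:
    opcode: str
    ip_idx: int                  # index into the IP ring
    const_idxs: Tuple[int, ...]  # indexes into the constants ring


def decode(ip, uop_names, constants, ip_q, const_q, uop_q):
    """Split one macro-op into the IP, constants and uOp piles."""
    ip_idx = ip_q.put(ip)
    const_idxs = tuple(const_q.put(c) for c in constants)
    for name in uop_names:
        # Each uOp carries only small indexes, never the 64-bit values.
        uop_q.append(UOp(name, ip_idx, const_idxs))
```

Each uOp stays narrow because it refers to the wide values by index, and the two side buffers can be sized independently of the uOp queue.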

IIUC later when they decide to build a uOp cache entry,
they pack that back together in what looks like a heads-n-tails
bundle holding the multiple uOps and optional constants.
The IP can be tossed for the uOp cache because we can recreate it
from the IP used to fetch from the uOp cache.
