devel / comp.arch / a modest proposal (apologies to J. Swift)

Subject / Author
* a modest proposal (apologies to J. Swift)  Ivan Godard
+* Re: a modest proposal (apologies to J. Swift)  BGB
|`* Re: a modest proposal (apologies to J. Swift)  George Neuner
| `- Re: a modest proposal (apologies to J. Swift)  winden
+- Re: a modest proposal (apologies to J. Swift)  MitchAlsup
+* Re: a modest proposal (apologies to J. Swift)  John Levine
|+- Re: a modest proposal (apologies to J. Swift)  BGB
|+- Re: a modest proposal (apologies to J. Swift)  Bernd Linsel
|+* Re: a modest proposal (apologies to J. Swift)  Marcus
||`* Re: a modest proposal (apologies to J. Swift)  John Levine
|| `* Re: a modest proposal (apologies to J. Swift)  BGB
||  `* Re: a modest proposal (apologies to J. Swift)  MitchAlsup
||   `* Re: a modest proposal (apologies to J. Swift)  BGB
||    `* Re: a modest proposal (apologies to J. Swift)  MitchAlsup
||     `* Re: a modest proposal (apologies to J. Swift)  BGB
||      `* Re: a modest proposal (apologies to J. Swift)  MitchAlsup
||       `* Re: a modest proposal (apologies to J. Swift)  BGB
||        `* Re: a modest proposal (apologies to J. Swift)  MitchAlsup
||         `* Re: a modest proposal (apologies to J. Swift)  BGB
||          `* Re: a modest proposal (apologies to J. Swift)  MitchAlsup
||           `* Re: a modest proposal (apologies to J. Swift)  BGB
||            `* Re: a modest proposal (apologies to J. Swift)  Michael S
||             +* Re: a modest proposal (apologies to J. Swift)  BGB
||             |`* Re: a modest proposal (apologies to J. Swift)  Anton Ertl
||             | `* Re: a modest proposal (apologies to J. Swift)  BGB
||             |  `* Re: a modest proposal (apologies to J. Swift)  Anton Ertl
||             |   `- Re: a modest proposal (apologies to J. Swift)  Anton Ertl
||             `* Re: a modest proposal (apologies to J. Swift)  JimBrakefield
||              `* Re: a modest proposal (apologies to J. Swift)  Thomas Koenig
||               `- Re: a modest proposal (apologies to J. Swift)  JimBrakefield
|+* Re: a modest proposal (apologies to J. Swift)  JimBrakefield
||`* Re: a modest proposal (apologies to J. Swift)  BGB
|| `- Re: a modest proposal (apologies to J. Swift)  JimBrakefield
|`- Re: a modest proposal (apologies to J. Swift)  MitchAlsup
+- Re: a modest proposal (apologies to J. Swift)  Quadibloc
+* Re: a modest proposal (apologies to J. Swift)  Anton Ertl
|`- Re: a modest proposal (apologies to J. Swift)  Ivan Godard
`- Re: a modest proposal (apologies to J. Swift)  Paul A. Clayton

a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17967&group=comp.arch#17967

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: a modest proposal (apologies to J. Swift)
Date: Sat, 19 Jun 2021 21:35:56 -0700
Message-ID: <samgjd$cbd$1@dont-email.me>

Mixed in with all the wordsize and encoding tumult here, I have been
taken with Mitch's approach to control flow. If I may speak for him, his
basic approach is to choose a dense encoding that can be interpreted
sequentially, but then pre-translate it into a different encoding that
can be executed efficiently, and cache that.

Which led me to wonder: why restrict this approach to control flow?

It is well known that by far the densest encodings are those for
accumulator and stack virtual machines. Their drawback is that they are
inherently serial. Yet it seems to me that it should not be hard to
translate accumulator encoding (for example) to three-address register
encoding using the same dynamic allocation of rename registers as is
used in ordinary OOO.

The translation need not necessarily be linear, either, so long as the
instructions were fixed size or otherwise locatable without linear
parse. The decoder would parse a convenient block of instructions,
assign a rename for the accumulator usage in each, and identify which
instructions (like a load or a constant) mark a new usage of the
accumulator. The accumulator renames of the instruction sequences
between those marks could then be equated, or normal OOO forwarding
would connect them.
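In software terms, the renaming scheme described above might look something like the following toy sketch (the opcode names and tuple encoding are invented for illustration; real hardware would do this in the decoder, a block at a time):

```python
def rename_accumulator_block(block, next_rename=0):
    """Translate accumulator-style ops into three-address form.

    Ops that begin a new accumulator usage ('ld', 'const') allocate a
    fresh rename; ops that read-modify the accumulator consume the
    current rename and allocate a new one for the result.
    """
    out = []
    acc = None  # rename currently holding the accumulator
    for op, operand in block:
        if op in ("ld", "const"):          # marks a new accumulator usage
            acc = f"r{next_rename}"; next_rename += 1
            out.append((op, operand, acc))
        elif op == "st":                   # store accumulator; no new rename
            out.append((op, acc, operand))
        else:                              # e.g. add/sub: acc = acc OP operand
            dst = f"r{next_rename}"; next_rename += 1
            out.append((op, acc, operand, dst))
            acc = dst
    return out

# acc = x; acc += y; a = acc
print(rename_accumulator_block([("ld", "x"), ("add", "y"), ("st", "a")]))
# -> [('ld', 'x', 'r0'), ('add', 'r0', 'y', 'r1'), ('st', 'r1', 'a')]
```

Within one block the renames form a pure dataflow chain, which is what would let the translated instructions issue to an ordinary OOO backend.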

Exceptions would need to be handled of course, but Mitch's vector scheme
(where he goes back and redoes everything as scalar) would seem to have
a cognate wherein the code is redone as a real accumulator machine. And
you'd want to use a NURA encoding, so stores to a large register set
would deal with values you want to preserve but not put in general
memory, akin to scratchpad; the translator would remove those.

Doing this for a stack encoding strikes me as more complex than an
accumulator encoding, but doable.

I don't follow every discussion here, so this may have been proposed before.

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17968&group=comp.arch#17968

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 02:18:12 -0500
Message-ID: <samq65$nau$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me>

On 6/19/2021 11:35 PM, Ivan Godard wrote:
> [...]

Hmm, seems plausible.

Though, admittedly, I would have assumed using a software-based
translation layer, and potentially an RPN based format for distribution
(with JIT or AOT compilation to the machine-level ISA).

Various compilers already do a similar translation, and stack-based
representations have an advantage in that they are easy to generate from
an AST and can be adaptable to a fairly wide range of language constructs.
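As a toy illustration of that point (not any particular compiler's IR), generating RPN from an expression AST is a single post-order walk:

```python
def emit_rpn(node, out):
    """Post-order walk: emit children first, then the operator itself."""
    if isinstance(node, tuple):            # ('+', lhs, rhs)
        op, lhs, rhs = node
        emit_rpn(lhs, out)
        emit_rpn(rhs, out)
        out.append(op)
    else:                                  # leaf: variable or constant
        out.append(("push", node))
    return out

# AST for (a + b) * c
print(emit_rpn(("*", ("+", "a", "b"), "c"), []))
# -> [('push', 'a'), ('push', 'b'), '+', ('push', 'c'), '*']
```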

Though, there are a few cases which RPN can't address well by itself:
  Operations with short-circuit behavior;
  'switch'/'case';
  ...

These cases tend to need ugly hacks to express effectively.

One may also in some cases want to be able to transform short-circuit
operations into non-short-circuit operations if one can determine that
the operator will not produce side-effects.

Another tradeoff is that stack IRs tend to result in a lot of "noise"
that needs to be eliminated in the translation process (and tend to have
significantly more operations than ideal in the target machine code).
Their win on the code-density front would be more in that they can
usually express many common operations in a single byte.

Typically, operations like function calls exist as high-level abstract
operations.

Such a format might look like, say:
Nothing is left on the stack during branches;
Labels are marked explicitly;
Types are propagated via the stack;
A Harvard execution model is assumed at this level;
Retains an otherwise C like typesystem and memory model;
...

In an abstract sense, this could resemble CLR bytecode, albeit with a
more C like approach to memory management.

Though, I am less sure if this would be viable for a hardware-level
translation stage.

In the few cases I am aware of where something similar was tried, it was
typically abandoned in favor of doing the translation in software.

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17973&group=comp.arch#17973

From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 10:46:01 -0400
Message-ID: <dfkucglm1t49vigcpvt4f881l53n0vclhb@4ax.com>
References: <samgjd$cbd$1@dont-email.me> <samq65$nau$1@dont-email.me>

On Sun, 20 Jun 2021 02:18:12 -0500, BGB <cr88192@gmail.com> wrote:

>On 6/19/2021 11:35 PM, Ivan Godard wrote:
>> [...]

Several years ago John Savard wrote about a theoretical design based
on multiple accumulators. I don't recall much/any discussion of
programming such a beast, though.

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17977&group=comp.arch#17977

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 10:50:37 -0700 (PDT)
Message-ID: <0a7a383a-4803-4667-a740-faa57aa7096cn@googlegroups.com>
References: <samgjd$cbd$1@dont-email.me>

On Saturday, June 19, 2021 at 11:36:00 PM UTC-5, Ivan Godard wrote:
> Mixed in with all the wordsize and encoding tumult here, I have been
> taken with Mitch's approach to control flow. If I may speak for him, his
> basic approach is to choose a dense encoding that can be interpreted
> sequentially, but then pre-translate it into a different encoding that
> can be executed efficiently, and cache that.
>
> Which led me to wonder: why restrict this approach to control flow?
>
> It is well known that by far the densest encodings are those for
> accumulator and stack virtual machines. Their drawback is that they are
> inherently serial. Yet it seems to me that it should not be hard to
> translate accumulator encoding (for example) to three-address register
> encoding using the same dynamic allocation of rename registers as is
> used in ordinary OOO.
>
> The translation need not necessarily be linear, either, so long as the
> instructions were fixed size or otherwise locatable without linear
> parse.
<
In practice, a logarithmic PARSE also works well (this is better than linear
but worse than O(1)).
<
> The decoder would parse a convenient block of instructions, and
> for each assign a rename for the accumulator usage in each and identify
> which instructions (like load or a constant) mark a new usage of the
> accumulator. The accumulator renames of instruction sequences within the
> marks could then be equated, or normal OOO forwarding would connect them.
>
> Exceptions would need to be handled of course, but Mitch's vector scheme
> (where he goes back and redoes everything as scalar) would seem to have
> a cognate wherein the code is redone as a real accumulator machine. And
> you'd want to use a NURA encoding, so stores to a large register set
> would deal with values you want to preserve but not put in general
> memory, akin to scratchpad; the translator would remove those.
<
The search for the exception can be done in whatever microarchitecture finds
reasonable (accumulator or translated into rename space), all the search needs
is that each instruction* is inserted into execution serially so when an exception
is raised a second time, the cause can be uniquely identified.
<
(*) the inserted instruction may have several sub-components (LD-OPs,...) The
important quality is that the exception can identify the instruction raising the
exception (and of course what exception is being raised.)
>
> Doing this for a stack encoding strikes me as more complex than an
> accumulator encoding, but doable.
<
We translated x86-64 into RISC at AMD where a LD-OP was put into a dual-fire
reservation station and LD-OP-ST were put into a triple fire reservation station.
>
> I don't follow every discussion here, so this may have been proposed before.

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17978&group=comp.arch#17978

From: winden...@gmail.com (winden)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 10:50:55 -0700 (PDT)
Message-ID: <7ab33297-f8a7-4fa1-88bc-c2c17fe91c96n@googlegroups.com>
References: <samgjd$cbd$1@dont-email.me> <samq65$nau$1@dont-email.me> <dfkucglm1t49vigcpvt4f881l53n0vclhb@4ax.com>

On Sunday, June 20, 2021 at 4:46:06 PM UTC+2, George Neuner wrote:
> On Sun, 20 Jun 2021 02:18:12 -0500, BGB <cr8...@gmail.com> wrote:
>
> >On 6/19/2021 11:35 PM, Ivan Godard wrote:
> >> [...]
> Several years ago John Savard wrote about a theoretical design based
> on multiple accumulators. I don't recall much/any discussion of
> programming such a beast, though.

RE Applying renaming for a stack arch:

The easy part is, during the transcoding step, to implement a stack where the contents are the
physical register numbers instead of the values.

Original C alike code:

a = x + y;

Naive conversion to stack code:

x @
y @
+
a !

Naive conversion by using a queue of temporary registers

; stack: r0 x
ld x --> r0

; stack: r0 x r1 y
ld y --> r1

; stack: r2 t
add r0 r1 --> r2

; stack: r2 t r3 a
move a --> r3

; stack: empty
st r2 --> (r3)

As usual in stack code, each instruction consumes 0, 1, or 2 operands.

In our naive transcoding, each uop would assume these registers are always single-use and thus mark them as free as they are consumed "from the stack".

In a more intelligent transcoding, I think we'd want at least some way to read from deeper in the stack (e.g. "pick 3"), which allows for keeping values in registers to reuse them more than once.
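The naive transcoding above can be sketched in a few lines (the opcode names and uop tuples here are invented for illustration; the point is just that the decode-time stack holds physical register numbers, never values):

```python
def transcode(stack_code):
    """Naively transcode stack ops into register uops.

    'stack' holds physical register numbers; popped registers are
    implicitly single-use and could be returned to the free pool.
    """
    uops, stack, next_reg = [], [], 0

    def fresh():
        nonlocal next_reg
        r = f"r{next_reg}"; next_reg += 1
        return r

    for op, *arg in stack_code:
        if op == "fetch":                  # 'x @' : load a variable's value
            r = fresh()
            uops.append(("ld", arg[0], r)); stack.append(r)
        elif op == "addr":                 # push a variable's address
            r = fresh()
            uops.append(("lea", arg[0], r)); stack.append(r)
        elif op == "+":                    # consume two, produce one
            b, a = stack.pop(), stack.pop()   # a, b are now free for reuse
            r = fresh()
            uops.append(("add", a, b, r)); stack.append(r)
        elif op == "!":                    # store: consume value and address
            addr, val = stack.pop(), stack.pop()
            uops.append(("st", val, addr))
    return uops

# a = x + y  ==>  x @  y @  +  a  !
print(transcode([("fetch", "x"), ("fetch", "y"), ("+",), ("addr", "a"), ("!",)]))
```

A "pick n" op would simply copy `stack[-n]` to the top without allocating a new register, which is where the multi-use bookkeeping would come in.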

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17989&group=comp.arch#17989

From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 19:35:55 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sao5ar$1cl7$1@gal.iecc.com>
References: <samgjd$cbd$1@dont-email.me>

According to Ivan Godard <ivan@millcomputing.com>:
>Which led me to wonder: why restrict this approach to control flow?
>
>It is well known that by far the densest encodings are those for
>accumulator and stack virtual machines. Their drawback is that they are
>inherently serial. Yet it seems to me that it should not be hard to
>translate accumulator encoding (for example) to three-address register
>encoding using the same dynamic allocation of rename registers as is
>used in ordinary OOO. ...

A major complaint about stack instructions is that it is hard to refer
to operands that aren't at the top of the stack so it's hard to reuse
common subexpressions. You could use an index into the stack, but in
that case why not use those bits as a register number? Even though
transistors are cheap, trying to rediscover loop invariants on the fly
from stack code seems like a stretch.

I'd go in the other direction, with lots of registers and a
compression scheme to make the code smaller. zSeries does gzip deflate
mostly in hardware, perhaps something like that.
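As a rough software illustration of the idea (zSeries offloads the deflate itself to hardware; the "machine code" below is synthetic 4-byte words with only a couple of fields varying):

```python
import zlib

# Build fake fixed-size instruction words: same opcode, cycling
# register fields, zero immediate -- highly redundant, like real code.
words = bytearray()
for i in range(1024):
    opcode, rd, rs = 0x8B, i % 32, (i * 7) % 32
    words += bytes((opcode, rd, rs, 0x00))   # fake 4-byte instruction

packed = zlib.compress(bytes(words), level=9)
print(f"{len(words)} bytes -> {len(packed)} bytes "
      f"({100 * (1 - len(packed) / len(words)):.0f}% smaller)")
```

Real instruction streams are less regular than this, but deflate's dictionary matching still finds plenty of repeated opcode/register patterns in wide fixed-size encodings.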

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17994&group=comp.arch#17994

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 20:40:15 -0500
Message-ID: <saoqqd$9lt$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>

On 6/20/2021 2:35 PM, John Levine wrote:
> According to Ivan Godard <ivan@millcomputing.com>:
>> Which led me to wonder: why restrict this approach to control flow?
>>
>> It is well known that by far the densest encodings are those for
>> accumulator and stack virtual machines. Their drawback is that they are
>> inherently serial. Yet it seems to me that it should not be hard to
>> translate accumulator encoding (for example) to three-address register
>> encoding using the same dynamic allocation of rename registers as is
>> used in ordinary OOO. ...
>
> A major complaint about stack instructions is that it is hard to refer
> to operands that aren't at the top of the stack so it's hard to reuse
> common subexpressions. You could use an index into the stack, but in
> that case why not use those bits as a register number? Even though
> transistors are cheap, trying to rediscover loop invariants on the fly
> from stack code seems like a stretch.
>

Granted, this is one drawback of stack machines...

In theory, in an offline compiler one could do the sub-expression
elimination in the backend, but this asks a bit much for a hardware
implementation.

Another option is to create temporary variables in the frontend, and
then fold the common sub-expressions into these.

In both cases, it would be necessary to verify that any variables which
the expressions depend on did not change between the prior version and
the current version.
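The frontend-temporary idea can be sketched roughly as follows (purely illustrative; a real frontend would also have to verify, as noted, that the operand variables did not change between the two uses):

```python
from collections import Counter

def hoist_common(exprs):
    """Replace subexpressions occurring more than once with temps.

    Expressions are tuples like ('+', 'x', 'y'); returns the rewritten
    expression list plus a mapping of hoisted subexpressions to temps.
    """
    counts = Counter()

    def walk(e):                           # count every subtree occurrence
        if isinstance(e, tuple):
            counts[e] += 1
            for child in e[1:]:
                walk(child)

    for e in exprs:
        walk(e)

    common = [e for e, n in counts.items() if n > 1]
    if not common:
        return exprs, {}
    temps = {e: f"t{i}" for i, e in enumerate(common)}

    def subst(e):                          # rewrite, bottom-up
        if e in temps:
            return temps[e]
        if isinstance(e, tuple):
            return (e[0],) + tuple(subst(c) for c in e[1:])
        return e

    return [subst(e) for e in exprs], temps

# a = (x+y)*2;  b = (x+y)*3   -- '(x+y)' occurs twice, becomes t0
print(hoist_common([("*", ("+", "x", "y"), 2), ("*", ("+", "x", "y"), 3)]))
```

The stack code then computes `t0` once and pushes the temp by name, rather than re-evaluating the subexpression.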

One other drawback of stack machines is, as noted elsewhere, the
tendency to need more operations to accomplish similar logic (and the
typical workarounds turn it half-way into a register machine).

Generally makes more sense to use the stack more as an abstraction for
moving around variable ID's, as opposed to using it to move around
actual values.

Though, it seems partly a tradeoff of where one wants to put the complexity:
  Stack machine:
    Simpler frontend, easier to design as bytecode;
    Most of the logic needs to go in the backend;
    Favors a "high level" approach.
  Register machine:
    More complex frontend, less simple as bytecode;
    Less work is needed in the backend;
    More optimization work can be handled by the frontend;
    Favors a lower-level approach.

My compiler and VM efforts had gone back and forth over this issue over
the years.

In any case, I would probably lean against using a stack machine as a
hardware-level ISA though.

> I'd go in the other direction, with lots of registers and a
> compression scheme to make the code smaller. zSeries does gzip deflate
> mostly in hardware, perhaps something like that.
>

This is also a valid route.

For a few cases where I had considered low-overhead VM possibilities, I
had mostly looked at register machines, generally where one would have
256 or so "registers" which may hold arguments, locals, or temporary
variables.

In my projects, given that the SDcard interface tends to be fairly slow,
I end up frequently compressing things with either LZ4 or my own custom
RP2 format.

LZ4 tends to do better for machine code in my ISA, typically getting
around a 40-60% size reduction. RP2 tends to do better for data compression.

In either case, decompression is significantly faster than data transfer
speeds from the SDcard (running an SDcard at 12MHz in SPI mode is ~
1.5MB/s). If the data achieves 60% compression, one can effectively
read data at ~ 3.5 MB/s.
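For what it's worth, the arithmetic behind that effective-rate estimate (the quoted ~3.5 MB/s figure presumably also absorbs some decompression overhead):

```python
# If the SDcard delivers raw bytes at ~1.5 MB/s and the stored data is
# 60% smaller than the original, each raw byte read expands to 2.5
# bytes of useful data.
raw_mb_s = 1.5          # SPI-mode SDcard transfer rate
size_reduction = 0.60   # compressed size is 40% of the original

effective = raw_mb_s / (1 - size_reduction)
print(f"effective read rate: {effective:.2f} MB/s")   # 3.75 MB/s
```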

In comparison, Deflate gives only slightly better compression, but has a
significantly higher cost to decode than something like LZ4 or similar.

Entropy coded formats tend to add enough overhead to generally not be
worthwhile in a "try to make reading data from the SDcard faster" sense.

Then there are also things like block-based texture compression and
audio compression, which can be used to reduce memory bandwidth
requirements.

OTOH, keeping executable code in a compressed form in-memory would be a
little less straightforward.

....

Re: a modest proposal (apologies to J. Swift)

https://www.novabbs.com/devel/article-flat.php?id=17997&group=comp.arch#17997

From: jsav...@ecn.ab.ca (Quadibloc)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Sun, 20 Jun 2021 21:32:03 -0700 (PDT)
Message-ID: <84a09a0f-0dd5-4080-8edd-11880a537cf5n@googlegroups.com>
References: <samgjd$cbd$1@dont-email.me>

On Saturday, June 19, 2021 at 10:36:00 PM UTC-6, Ivan Godard wrote:

> It is well known that by far the densest encodings are those for
> accumulator and stack virtual machines. Their drawback is that they are
> inherently serial. Yet it seems to me that it should not be hard to
> translate accumulator encoding (for example) to three-address register
> encoding using the same dynamic allocation of rename registers as is
> used in ordinary OOO.

> I don't follow every discussion here, so this may have been proposed before.

I do remember that at one point Mitch Alsup, I believe, pointed out that
OoO execution was applicable to accumulator-memory architectures;
it didn't require a general register architecture.

I found that a bit odd at the time. I had no problems with OoO taking
instructions that referenced a register, and switching them to refer to
a rename register.

But whatever is put in memory by an instruction is part of the _end
product_ of an instruction, and has to be preserved by any implementation
of the architecture. So the benefits of OoO are reduced in an accumulator
architecture, unless there's some way to indicate some locations are
temporaries in the machine language itself.

John Savard

Re: a modest proposal (apologies to J. Swift)

<2021Jun21.100857@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Mon, 21 Jun 2021 08:08 UTC

Ivan Godard <ivan@millcomputing.com> writes:
>Mixed in with all the wordsize and encoding tumult here, I have been
>taken with Mitch's approach to control flow. If I may speak for him, his
>basic approach is to choose a dense encoding that can be interpreted
>sequentially, but then pre-translate it into a different encoding that
>can be executed efficiently, and cache that.
>
>Which led me to wonder: why restrict this approach to control flow?
>
>It is well known that by far the densest encodings are those for
>accumulator and stack virtual machines. Their drawback is that they are
>inherently serial. Yet it seems to me that it should not be hard to
>translate accumulator encoding (for example) to three-address register
>encoding using the same dynamic allocation of rename registers as is
>used in ordinary OOO.

Certainly, that's easy, and was proposed for stack machines in:

@Book{debaere&vancampenhout90,
author = "Eddy H. Debaere and Jan M. {Van Campenhout}",
title = "Interpretation and Instruction Path Coprocessing",
publisher = "The MIT Press",
year = 1990,
annote = "Good discussion about interpretation with big
bibliography. They propose instruction path
coprocessing as a means to speed up interpreters. An
instruction path coprocessor is similar to a
microcode sequencer that has the code to be
interpreted as machine code and the machine code of
the main processor as microcode."
}

The other direction (register machine -> accumulator machine) is a
little harder, but also possible and has been proposed in:

@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems of wide
superscalars with communication across the chip and
the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators
(with instructions only accessing one accumulator
(read and/or write) and one register (read or
write); for the accumulators their death is
specified explicitly in the instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists of
mainly one ALU data path (but also a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}

@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}

This ILDP idea did not succeed; however, given that we have recently
discussed the delay of a pipeline stage, and whether you could put
several ALU ops in a stage, I wonder:

From discussions here I expect that you can process dependent
instructions faster if you only have one functional unit and bypasses
from the result of that functional unit to the inputs of that
functional unit: shorter wires mean less delay, less load on the
output means less delay, and fewer muxes on the input means less
delay.

Of course, you don't want to have just a single FU, because you want
to process independent instructions in parallel, and you need separate
logic for, e.g., a load/store unit and a multiplier anyway. Still,
there is the question whether it would be beneficial to arrange
dependent (say) ALU operations into a strand that is processed by a
single functional unit faster, but in the context of an OoO core
otherwise. This results in the following questions:

* How do we fit the faster strand processing in the slower rest of the
OoO engine? Have two back-to-back ALUs in one pipeline stage? Have
three ALUs in two pipeline stages (can only be driven every other
cycle, or using wave pipelining)? Run the ALUs at a different clock
(twice the clock to avoid needing to cross clock domains when
talking to the LSU, as in the Willamette)?

* You can avoid writing back the intermediate results, but there
are heavy requirements:

1) The decoder or register renamer can see that the result is used
only once. This is trivial for an accumulator or stack machine,
but for a register machine it requires that the following
instructions in a strand read from and write to the same register
that the first wrote.

2) Because of precise exceptions and in-order completion, none of
the instructions between the first in a strand and the one before
the final one (even instructions that do not belong to the
strand) must be able to trap. External interrupts can be
arranged to happen between strands, if there is a guarantee that
strands don't overlap forever.

2a) Alternatively, roll back and replay in a non-stranded way on an
exception.

* How frequent are strands for which kinds of functional units?

Given that (AFAIK) recent microarchitectures have not gone there, it
seems that the benefit of such a scheme is not worth the effort.
Earlier cases somewhat in that direction were:

* The arrangement of functional units in the 21264: The 4-wide core
was split in two halves; one half could talk to itself without
delay, but suffered one cycle delay talking to the other half. The
register file was replicated per half to reduce porting
requirements.

* The Willamette (and Northwood) has a double-clocked ALU that allows,
e.g., back-to-back adds.

But this was in 250nm and 180nm; such things don't seem to pay off now.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: a modest proposal (apologies to J. Swift)

<aaa1f498-b7af-45e6-9204-1b710cee3100n@googlegroups.com>

 by: Bernd Linsel - Mon, 21 Jun 2021 09:27 UTC

John Levine schrieb am Sonntag, 20. Juni 2021 um 21:35:58 UTC+2:
> According to Ivan Godard <iv...@millcomputing.com>:
> >Which led me to wonder: why restrict this approach to control flow?
> >
> >It is well known that by far the densest encodings are those for
> >accumulator and stack virtual machines. Their drawback is that they are
> >inherently serial. Yet it seems to me that it should not be hard to
> >translate accumulator encoding (for example) to three-address register
> >encoding using the same dynamic allocation of rename registers as is
> >used in ordinary OOO. ...
>
> A major complaint about stack instructions is that it is hard to refer
> to operands that aren't at the top of the stack so it's hard to reuse
> common subexpressions. You could use an index into the stack, but in
> that case why not use those bits as a register number? Even though
> transistors are cheap, trying to rediscover loop invariants on the fly
> from stack code seems like a stretch.
>
> I'd go in the other direction, with lots of registers and a
> compression scheme to make the code smaller. zSeries does gzip deflate
> mostly in hardware, perhaps something like that.
>
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

This paper, although a bit outdated, may be helpful in implementing a modern OoO stack machine:
http://web.archive.org/web/20030710020735if_/http://4th.allthingsgo.biz:80/~sye/BOOST.pdf

--
Regards,
Bernd

Re: a modest proposal (apologies to J. Swift)

<sapo7q$c9v$1@dont-email.me>

 by: Marcus - Mon, 21 Jun 2021 10:04 UTC

On 2021-06-20, John Levine wrote:
> According to Ivan Godard <ivan@millcomputing.com>:
>> Which led me to wonder: why restrict this approach to control flow?
>>
>> It is well known that by far the densest encodings are those for
>> accumulator and stack virtual machines. Their drawback is that they are
>> inherently serial. Yet it seems to me that it should not be hard to
>> translate accumulator encoding (for example) to three-address register
>> encoding using the same dynamic allocation of rename registers as is
>> used in ordinary OOO. ...
>
> A major complaint about stack instructions is that it is hard to refer
> to operands that aren't at the top of the stack so it's hard to reuse
> common subexpressions. You could use an index into the stack, but in
> that case why not use those bits as a register number? Even though
> transistors are cheap, trying to rediscover loop invariants on the fly
> from stack code seems like a stretch.
>
> I'd go in the other direction, with lots of registers and a
> compression scheme to make the code smaller. zSeries does gzip deflate
> mostly in hardware, perhaps something like that.
>

OTOH I bet that general purpose compression schemes would compress
better if you used stack indices rather than absolute register numbers,
since lower index numbers would likely be more common than higher index
numbers (i.e. there's less entropy than in "random" register numbers).
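That intuition is easy to test with a general-purpose compressor: deflate
two streams of 5-bit "register fields", one roughly uniform and one skewed
toward small values (both distributions are invented stand-ins, not real
compiler output):

```python
import random
import zlib

random.seed(0)
n = 4096

# Roughly uniform "absolute register number" fields, 0..31.
uniform = bytes(random.randrange(32) for _ in range(n))

# Skewed "stack index" fields: min of three draws, so small values
# dominate and the per-symbol entropy is lower.
skewed = bytes(min(random.getrandbits(5) for _ in range(3)) for _ in range(n))

c_uniform = len(zlib.compress(uniform, 9))
c_skewed = len(zlib.compress(skewed, 9))
# The lower-entropy stream deflates to noticeably fewer bytes.
```

The skewed stream compresses better, as the entropy argument predicts.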

General purpose compression schemes have lots of other problems, though.
For instance you can't do random access of individual instructions
without first decompressing a compressed block, and you can't guarantee
a certain size/alignment of compressed blocks.

A common method used for texture compression in GPUs is lossy
compression that guarantees a size rather than a quality (so, for
instance, a block of 4x4 pixels is compressed to a fixed size), which
makes random access feasible. It should be obvious that lossy
compression is not suitable for code ;-)
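The fixed-rate property is exactly what buys the random access: for a
hypothetical format that packs each 4x4-pixel block into 8 bytes (the
block size and byte count are illustrative, in the spirit of DXT1-style
formats), the compressed offset of any pixel's block is pure arithmetic:

```python
def block_offset(x, y, width, block_dim=4, block_bytes=8):
    """Byte offset of the compressed block containing pixel (x, y).

    Assumes `width` is a multiple of `block_dim`; blocks are stored
    row-major. No decompression of neighbouring blocks is needed.
    """
    blocks_per_row = width // block_dim
    bx, by = x // block_dim, y // block_dim
    return (by * blocks_per_row + bx) * block_bytes
```

No such closed-form address exists for a variable-rate scheme like
deflate, which is the problem noted above for code.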

/Marcus

Re: a modest proposal (apologies to J. Swift)

<saq2cb$jb7$1@dont-email.me>

 by: Ivan Godard - Mon, 21 Jun 2021 12:57 UTC

On 6/21/2021 1:08 AM, Anton Ertl wrote:
> [Anton's post, quoted in full above, snipped]

An excellent exposition of what I was groping toward; thank you.

Re: a modest proposal (apologies to J. Swift)

<aeac05fe-ae12-4f59-b6be-a3a69e8377den@googlegroups.com>

 by: Paul A. Clayton - Mon, 21 Jun 2021 15:08 UTC

On Sunday, June 20, 2021 at 12:36:00 AM UTC-4, Ivan Godard wrote:
[snip]
> his
> basic approach is to choose a dense encoding that can be interpreted
> sequentially, but then pre-translate it into a different encoding that
> can be executed efficiently, and cache that.

The benefits of code density would seem to be primarily reducing use of
bandwidth, fetch energy and storage. If a denser format (not necessarily
the densest format) is cached on-chip, then on-chip costs are reduced.

Larger scope of compression provides greater density but can require
more to be examined before the information is available. The product of
DRAM burst length and channel width would apply pressure for cache-
block-sized (memory access chunk sized) units, but when a code
stream is known to usually exceed such length a larger compression
block might be practical at the cost of slightly slower decompression and
some chance of 'wasted prefetch'.

(Since dictionary ROM and decode/expansion logic are more
area-efficient (when used often enough) than generic, writable storage,
modality/dictionary selection might be useful.)

Strand-oriented encoding might be slightly more dense since names
would be more localized, but such might then require a larger window
to see some performance improving opportunities.

Even when information is recoverable directly from a dense
representation, redundancy may be desirable. There is also generally
persistent information which is not recoverable which may or may
not be worth preserving. (The Mill pushes most such — with perhaps
less significance given to persistence — onto a side band channel, at
least for branches. One could imagine code-correlated metadata for
data prefetching and other execution-oriented optimizations.
Expected resource use might also be useful for 'resource bidding' in
an economic model of resource management.)

The latency and energy cost of going off-chip is so large, that support
for a relatively dense on-chip storage format (for storage-efficient
caching vs. decode-work-efficient caching) seems appropriate. Code
which is executed once for each retrieval from main memory might be
executed in a less efficient format (a little like NVIDIA's Denver that had
two-wide ARM decode for cold [and new] code that had not been
translated to the VLIW format). (Sometimes treating such tasks as
separate threads run on somewhat separate cores might make sense,
particularly if the cold code produces many cache misses for data as
may be likely. A dense compute core would tend to waste resources for
low ILP/DLP code.)

Some optimizations ("decoding") might be based on localized profile
information. (Just as branch prediction metadata can include predictor
preference as well as direction, preserved metadata might accelerate
training toward making better choices and potentially bypassing some
temporary storage use.)

I was hoping to present a more focused examination of the tradeoffs,
but the above at least expresses some of the considerations.

Re: a modest proposal (apologies to J. Swift)

<saqctc$2rso$1@gal.iecc.com>

 by: John Levine - Mon, 21 Jun 2021 15:57 UTC

According to Marcus <m.delete@this.bitsnbites.eu>:
>> I'd go in the other direction, with lots of registers and a
>> compression scheme to make the code smaller. zSeries does gzip deflate
>> mostly in hardware, perhaps something like that.
>
>OTOH I bet that general purpose compression schemes would compress
>better if you used stack indices rather than absolute register numbers,
>since lower index numbers would likely be more common than higher index
>numbers (i.e. there's less entropy than in "random" register numbers).

I think you're assuming your conclusions. Usually in a register allocator, there
is no reason to prefer one register over another. Add some weights to prefer using
low numbered registers or reusing the same register and we could see how much more
compressible the code was.

>General purpose compression schemes have lots of other problems, though.
>For instance you can't do random access of individual instructions
>without first decompressing a compressed block, and you can't guarantee
>a certain size/alignment of compressed blocks.

True, but look at any variable length instruction set and you have basically the same
issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
decompressing code blocks into large cache lines.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: a modest proposal (apologies to J. Swift)

<saqj72$mnk$1@dont-email.me>

 by: BGB - Mon, 21 Jun 2021 17:42 UTC

On 6/21/2021 10:57 AM, John Levine wrote:
> According to Marcus <m.delete@this.bitsnbites.eu>:
>>> I'd go in the other direction, with lots of registers and a
>>> compression scheme to make the code smaller. zSeries does gzip deflate
>>> mostly in hardware, perhaps something like that.
>>
>> OTOH I bet that general purpose compression schemes would compress
>> better if you used stack indices rather than absolute register numbers,
>> since lower index numbers would likely be more common than higher index
>> numbers (i.e. there's less entropy than in "random" register numbers).
>
> I think you're assuming your conclusions. Usually in a register allocator, there
> is no reason to prefer one register over another. Add some weights to prefer using
> low numbered registers or reusing the same register and we could see how much more
> compressible the code was.
>

In my own tests, generally the dictionary part of an LZ compressor works
better than the entropy part when it comes to machine code.

Though, it is more likely to benefit from restrictive register and
instruction encoding choices, which might be worse for code density
in general than a more variable set of choices.

Though, I have noted that the optimal size of the register space (in
terms of code density and performance) for a given function tends to be
variable.
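The dictionary-vs-entropy split can be probed with off-the-shelf zlib: the
Z_HUFFMAN_ONLY strategy disables the LZ77 match search, leaving only the
entropy (Huffman) stage, so comparing it against the default strategy gives a
rough measure of how much each stage contributes. A minimal sketch (the
CODE_PATH value is just a placeholder for any machine-code file):

```python
# Rough experiment: how much of zlib's compression of machine code comes
# from the LZ77 (dictionary) stage vs. the Huffman (entropy) stage?
# Z_HUFFMAN_ONLY disables the match search, leaving entropy coding alone.
import zlib

CODE_PATH = "/bin/ls"  # placeholder: any native-code file

def compressed_size(data: bytes, strategy: int) -> int:
    c = zlib.compressobj(level=9, strategy=strategy)
    return len(c.compress(data) + c.flush())

try:
    with open(CODE_PATH, "rb") as f:
        code = f.read()
except OSError:
    code = bytes(range(256)) * 256  # fallback if the file is absent

full = compressed_size(code, zlib.Z_DEFAULT_STRATEGY)  # LZ + entropy
huff = compressed_size(code, zlib.Z_HUFFMAN_ONLY)      # entropy only

print(f"original: {len(code)}  LZ+entropy: {full}  entropy-only: {huff}")
```

If the entropy-only size is close to the original while the full pipeline
shrinks it substantially, the dictionary stage is doing most of the work.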

>> General purpose compression schemes have lots of other problems, though.
>> For instance you can't do random access of individual instructions
>> without first decompressing a compressed block, and you can't guarantee
>> a certain size/alignment of compressed blocks.
>
> True, but look at any variable length instruction set and you have basically the same
> issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
> decompressing code blocks into large cache lines.
>

There are a few possible routes to code compression:
LZ compression, typically decoded during program loading or similar;
This is a bigger factor for simulations than for realtime though.
Branch-based code reuse:
Some common/repetitive constructs can be called via branch ops (1).
Some form of "instruction dictionary":
Not done, doesn't seem worthwhile (2).

1: Things like function prolog/epilog sequences tend to mostly repeat
the same series of instructions, so it may be cost effective to simply
call or branch to a previous instance. The cost of the extra branches
(particularly with branch prediction) may be offset by the reduction in
I$ misses.

2: In this case, the instruction could be reduced partly to a pattern
and a table-lookup.

In this case, part (or all) of the 16-bit encoding space could be
replaced by a lookup table. Say, yynm, where yy fetches an instruction
pattern from the table, and nm are used as register indices or similar
(substituted into the pattern for a longer form instruction, with a few
bits giving the intended pattern substitution type).
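The yynm decode described above could be sketched as follows. Everything
here is hypothetical: the table contents, the field positions, and the
substitution rule are made up for illustration, not any real encoding.

```python
# Hypothetical sketch of a "yynm" dictionary decode: 'yy' selects a
# 32-bit instruction pattern from a compiler-built table, and 'n'/'m'
# are 4-bit register fields substituted into that pattern. The table
# contents and bit positions are illustrative only.

PATTERN_TABLE = [0x12340000] * 256  # 256 compiler-chosen 32-bit patterns

def expand(op16: int) -> int:
    yy = (op16 >> 8) & 0xFF   # table index
    n  = (op16 >> 4) & 0x0F   # first register field
    m  = op16 & 0x0F          # second register field
    pattern = PATTERN_TABLE[yy]
    # Substitute the register fields into (assumed) positions 8..15.
    return pattern | (n << 12) | (m << 8)

print(hex(expand(0x0123)))  # pattern 0x01 with registers n=2, m=3
```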

The table would likely be built by the compiler and held in internal CPU
registers at runtime. Needing to dynamically reload the table contents to
deal with things like DLLs or syscalls (either manually or with a
dedicated cache) would not be attractive.

Similarly, it could also pose a potential latency issue for the
instruction decoder.

It could also be possible to use some sort of "decompress stuff on
demand" feature, but this likely has enough overhead that it only really
makes sense if the target machine is RAM-constrained. In this case,
indirect calls would generally be used (via a table), with non-resident
functions initially pointing to a "decompress this" thunk.

A similar strategy can also be used by a JIT compiler (the indirect
calls either go to another JIT-compiled trace, or lead back into the
interpreter, which may at its discretion later replace the pointer with
a JIT-compiled trace).
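The self-patching call-table pattern can be modeled in a few lines. This is
an illustrative sketch only: the function bodies stand in for decompressed
machine code, and zlib stands in for whatever compressor would be used.

```python
# Minimal model of indirect calls through a table, where a non-resident
# function starts out as a "decompress this" thunk that replaces its own
# table entry with the real function on first use.
import zlib

compressed_bodies = {}   # name -> compressed source ("code")
call_table = {}          # name -> callable (thunk or resident function)

def make_resident(name):
    # Stand-in for decompressing real code into executable memory.
    src = zlib.decompress(compressed_bodies[name]).decode()
    env = {}
    exec(src, env)
    return env[name]

def install(name, source):
    compressed_bodies[name] = zlib.compress(source.encode())
    def thunk(*args):
        fn = make_resident(name)
        call_table[name] = fn     # patch: later calls skip the thunk
        return fn(*args)
    call_table[name] = thunk

install("add2", "def add2(x, y):\n    return x + y\n")
print(call_table["add2"](3, 4))   # first call decompresses, then patches
```

After the first call, `call_table["add2"]` is the real function; the same
table slot is exactly where a JIT could later drop a compiled trace.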

....

Re: a modest proposal (apologies to J. Swift)

<7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18013&group=comp.arch#18013

X-Received: by 2002:a37:693:: with SMTP id 141mr23377566qkg.453.1624299072659;
Mon, 21 Jun 2021 11:11:12 -0700 (PDT)
X-Received: by 2002:a05:6830:3089:: with SMTP id f9mr22248999ots.276.1624299072384;
Mon, 21 Jun 2021 11:11:12 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 21 Jun 2021 11:11:12 -0700 (PDT)
In-Reply-To: <saqj72$mnk$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:100d:2f2f:88cc:786;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:100d:2f2f:88cc:786
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
<sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com> <saqj72$mnk$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 21 Jun 2021 18:11:12 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 21 Jun 2021 18:11 UTC

On Monday, June 21, 2021 at 12:45:09 PM UTC-5, BGB wrote:
> On 6/21/2021 10:57 AM, John Levine wrote:
> > According to Marcus <m.de...@this.bitsnbites.eu>:
> >>> I'd go in the other direction, with lots of registers and a
> >>> compression scheme to make the code smaller. zSeries does gzip deflate
> >>> mostly in hardware, perhaps something like that.
> >>
> >> OTOH I bet that general purpose compression schemes would compress
> >> better if you used stack indices rather than absolute register numbers,
> >> since lower index numbers would likely be more common than higher index
> >> numbers (i.e. there's less entropy than in "random" register numbers).
> >
> > I think you're assuming your conclusions. Usually in a register allocator, there
> > is no reason to prefer one register over another. Add some weights to prefer using
> > low numbered registers or reusing the same register and we could see how much more
> > compressible the code was.
> >
> In my own tests, generally the dictionary part of an LZ compressor works
> better than the entropy part when it comes to machine code.
>
>
> Though, it is more likely to benefit from restrictive register and
> instruction encoding choices, which might be worse for code-density
> in-general than a more variable set of choices.
>
> Though, I have noted that the optimal size of the register space (in
> terms of code density and performance) for a given function tends to be
> variable.
> >> General purpose compression schemes have lots of other problems, though.
> >> For instance you can't do random access of individual instructions
> >> without first decompressing a compressed block, and you can't guarantee
> >> a certain size/alignment of compressed blocks.
> >
> > True, but look at any variable length instruction set and you have basically the same
> > issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
> > decompressing code blocks into large cache lines.
> >
> There are a few possible routes to code compression:
> LZ compression, typically decoded during program loading or similar;
> This is a bigger factor for simulations than for realtime though.
> Branch-based code reuse:
> Some common/repetitive constructs can be called via branch ops (1).
> Some form of "instruction dictionary":
> Not done, doesn't seem worthwhile (2).
>
>
> 1: Things like function prolog/epilog sequences tend to mostly repeat
> the same series of instructions, so it may be cost effective to simply
> call or branch to a previous instance. The cost of the extra branches
> (particularly with branch prediction) may be offset by the reduction in
> I$ misses.
<
In the C world, prologue becomes ENTER and epilogue becomes EXIT.
This saves a surprising amount of code space. These instructions perform
all of the saving and restoring of the registers, including stack pointer and
frame pointer updates. And since the stores and loads are to consecutive
memory locations, they can be performed several registers wide per cycle
{Wide = 2,4,8}.
<
In the C++ world, I can agree that arranging the epilogue to call a routine
to perform all destructors and then EXIT makes it significantly easier to
incorporate TRY-THROW-CATCH and even longjmp() over destructor
boundaries.
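The ENTER/EXIT semantics can be modeled abstractly: one operation stores a
consecutive run of registers and allocates the local frame, and its inverse
undoes both. A toy sketch, with register count, word size, and frame layout
all illustrative rather than any real encoding:

```python
# Toy model of ENTER/EXIT-style prologue/epilogue instructions: a single
# op saves a contiguous run of registers (consecutive stores, so a wide
# implementation could do several per cycle) and allocates local storage;
# the inverse op restores everything. All sizes here are illustrative.

def enter(regs, stack, sp, first, count, local_size):
    for i in range(first, first + count):   # consecutive 8-byte stores
        sp -= 8
        stack[sp] = regs[i]
    sp -= local_size                        # allocate local storage
    return sp

def leave(regs, stack, sp, first, count, local_size):
    sp += local_size                        # free local storage
    for i in reversed(range(first, first + count)):
        regs[i] = stack[sp]                 # consecutive loads
        sp += 8
    return sp

regs = list(range(32))
stack = {}
sp0 = 4096
sp1 = enter(regs, stack, sp0, 16, 4, 64)   # save r16..r19, 64B of locals
regs[16:20] = [0, 0, 0, 0]                 # body clobbers saved regs
sp2 = leave(regs, stack, sp1, 16, 4, 64)
print(sp2 == sp0, regs[16:20])             # restored to original values
```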
>
>
> 2: In this case, the instruction could be reduced partly to a pattern
> and a table-lookup.
>
> In this case, part (or all) of the 16-bit encoding space could be
> replaced by a lookup table. Say, yynm, where yy fetches an instruction
> pattern from the table, and nm are used as register indices or similar
> (substituted into the pattern for a longer form instruction, with a few
> bits giving the intended pattern substitution type).
>
> The table would likely be built by the compiler and held in some
> internal CPU registers at runtime. The drawback of needing to
> dynamically reload the table contents to deal with things like DLLs or
> syscalls would not be attractive (either manually or with a dedicated
> cache).
>
> Similarly, it could also pose a potential latency issue for the
> instruction decoder.
>
>
> It could also be possible to use some sort of "decompress stuff on
> demand" features, but this is likely to have enough overheads to where
> it only really makes sense if the target machine is RAM-constrained. In
> this case, generally indirect calls would be used (via a table), with
> non-resident functions generally pointing to a "decompress this" thunk.
>
> A similar strategy can also typically be used for a JIT compiler (the
> indirect calls either go to a another JIT compiled trace, or potentially
> lead back into the interpreter, which may at its discretion later
> replace this pointer with another JIT compiled trace).
>
> ...

Re: a modest proposal (apologies to J. Swift)

<saqsq8$r75$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=18014&group=comp.arch#18014

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Mon, 21 Jun 2021 15:26:33 -0500
Organization: A noiseless patient Spider
Lines: 186
Message-ID: <saqsq8$r75$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
<sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com>
<saqj72$mnk$1@dont-email.me>
<7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 21 Jun 2021 20:28:56 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4abe65464924d7cb69d3c5d1800c738e";
logging-data="27877"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18NUgd0ME3LKzdN2XKsgv+/"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:ruBzagApnQtcgMuWUTI4kzJX9G8=
In-Reply-To: <7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 21 Jun 2021 20:26 UTC

On 6/21/2021 1:11 PM, MitchAlsup wrote:
> On Monday, June 21, 2021 at 12:45:09 PM UTC-5, BGB wrote:
>> On 6/21/2021 10:57 AM, John Levine wrote:
>>> According to Marcus <m.de...@this.bitsnbites.eu>:
>>>>> I'd go in the other direction, with lots of registers and a
>>>>> compression scheme to make the code smaller. zSeries does gzip deflate
>>>>> mostly in hardware, perhaps something like that.
>>>>
>>>> OTOH I bet that general purpose compression schemes would compress
>>>> better if you used stack indices rather than absolute register numbers,
>>>> since lower index numbers would likely be more common than higher index
>>>> numbers (i.e. there's less entropy than in "random" register numbers).
>>>
>>> I think you're assuming your conclusions. Usually in a register allocator, there
>>> is no reason to prefer one register over another. Add some weights to prefer using
>>> low numbered registers or reusing the same register and we could see how much more
>>> compressible the code was.
>>>
>> In my own tests, generally the dictionary part of an LZ compressor works
>> better than the entropy part when it comes to machine code.
>>
>>
>> Though, it is more likely to benefit from restrictive register and
>> instruction encoding choices, which might be worse for code-density
>> in-general than a more variable set of choices.
>>
>> Though, I have noted that the optimal size of the register space (in
>> terms of code density and performance) for a given function tends to be
>> variable.
>>>> General purpose compression schemes have lots of other problems, though.
>>>> For instance you can't do random access of individual instructions
>>>> without first decompressing a compressed block, and you can't guarantee
>>>> a certain size/alignment of compressed blocks.
>>>
>>> True, but look at any variable length instruction set and you have basically the same
>>> issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
>>> decompressing code blocks into large cache lines.
>>>
>> There are a few possible routes to code compression:
>> LZ compression, typically decoded during program loading or similar;
>> This is a bigger factor for simulations than for realtime though.
>> Branch-based code reuse:
>> Some common/repetitive constructs can be called via branch ops (1).
>> Some form of "instruction dictionary":
>> Not done, doesn't seem worthwhile (2).
>>
>>
>> 1: Things like function prolog/epilog sequences tend to mostly repeat
>> the same series of instructions, so it may be cost effective to simply
>> call or branch to a previous instance. The cost of the extra branches
>> (particularly with branch prediction) may be offset by the reduction in
>> I$ misses.
> <
> In the C world, prologue becomes ENTER and epilogue becomes EXIT.
> This saves a surprising amount of code space. These instructions perform
> all of the saving and restoring of the registers, including stack pointer and
> frame pointer updates. And since the stores and loads are to consecutive
> memory locations, they can be performed several registers wide per cycle
> {Wide = 2,4,8}.
> <
> In the C++ world, I can agree that arranging the epilogue to call a routine
> to perform all destructors and then EXIT makes it significantly easier to
> incorporate TRY-THROW-CATCH and even longjump() over destructor
> boundaries.

In my case, these tend to turn into a series of 128-bit load/store ops
(register pair).
Since most functions need to save and restore similar sets of registers,
these sequences are fairly reusable.

So, a typical function prolog becomes:
Copy LR to R1 (since BSR stomps LR);
Use 'BSR' to call a previous prolog (to save registers, ...);
Adjust SP (for local data);
Function-specific setup (initializing local storage, *).

And, a typical epilog becomes:
Function-specific teardown;
Adjust SP (up to save area);
Branch to previous epilog.

*: Typically this involves setting up internal pointers for
structures/arrays, saving function arguments to the stack if needed, or
(rarely) calling __alloca or similar for large objects or arrays.

Assuming no cache misses, the 'MOV.X' ops can load or store 128 bits per
clock cycle, and are typically encoded as 16-bits each. Typically, these
are between around 4 and 10 load/store instructions (so, ~ 10-22 bytes,
including a final RTS or RTSU).

The "compressed" sequence is typically 4 or 6 bytes (prolog), or 2 or 4
bytes (epilog).
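The size bookkeeping behind this reuse can be sketched as follows. This
models only the emission accounting (the first instance of a save sequence
is emitted in full with a return, later instances become one short call);
the instruction strings and sizes are illustrative.

```python
# Sketch of branch-based prologue reuse: when a function's register-save
# sequence matches one emitted earlier, emit a short BSR to the earlier
# copy instead of repeating the instructions. "Instructions" are strings
# and addresses are just output indices, for illustration only.

seen_prologs = {}   # tuple of save ops -> address of first instance
out = []            # emitted (addr, op) stream

def emit_prolog(save_ops):
    key = tuple(save_ops)
    if key in seen_prologs:
        out.append((len(out), f"BSR {seen_prologs[key]}"))  # reuse: 1 op
    else:
        seen_prologs[key] = len(out)
        for op in save_ops:
            out.append((len(out), op))
        out.append((len(out), "RTS"))   # callable shared copy

emit_prolog(["MOV.X R8, @-SP", "MOV.X R10, @-SP"])
emit_prolog(["MOV.X R8, @-SP", "MOV.X R10, @-SP"])  # second use: one BSR
print(len(out))  # 3 ops for the shared copy + 1 BSR
```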

As a recent addition (compiler level), leaf functions may now skip stack
frame creation, which saves a little more space and also gives a modest
performance increase.

In this case, the prolog is typically 0 bytes (doesn't emit any
instructions), and the epilog is 2 bytes (a lone RTS instruction).
This case requires all of the variables in the function to be able to
fit in scratch registers (they are all statically assigned to a register).

A TODO feature is to expand the compiler's register allocator to use
R32..R63, as there are a lot of leaf functions which could fit into 28
scratch registers but don't fit into 12.

Another (not yet explored) possibility could be whether full-static
register assignment could be applicable to non-leaf functions.

As noted, most register allocation uses a more naive strategy:
Part of the register space is statically assigned;
The rest uses dynamic assignment.

Where, the dynamic assignment is, essentially:
Demand-load variables into registers (if not already present);
Write any changed registers back to the stack at the end of the basic block;
Flush any non-statically-assigned registers.

The demand loading strategy is, basically:
See if value we want is already in a register;
If so, use it.
Check for an unused register (in this basic block);
Use register if found.
Choose another register to evict (LRU + Round-Robin);
Evict Register:
Write-back contents if dirty;
Set to unused.
Load new value, set to new variable.

Basically, each basic block starts out as a "clean slate" regarding any
non-static registers.

Granted, this isn't a particularly great register allocation strategy,
particularly for a target with a fairly large register space. Could be
better to be able to keep variables alive between basic-blocks without
needing to keep them statically assigned for the entire function.

Static assignment works fairly well, but a certain number of registers
need to be left available to deal effectively with the demand-loaded
case (if too few registers are left available then the register
allocator is prone to "thrashing"). Some amount of this was tuned mostly
via the use of heuristics.

....

>>
>>
>> 2: In this case, the instruction could be reduced partly to a pattern
>> and a table-lookup.
>>
>> In this case, part (or all) of the 16-bit encoding space could be
>> replaced by a lookup table. Say, yynm, where yy fetches an instruction
>> pattern from the table, and nm are used as register indices or similar
>> (substituted into the pattern for a longer form instruction, with a few
>> bits giving the intended pattern substitution type).
>>
>> The table would likely be built by the compiler and held in some
>> internal CPU registers at runtime. The drawback of needing to
>> dynamically reload the table contents to deal with things like DLLs or
>> syscalls would not be attractive (either manually or with a dedicated
>> cache).
>>
>> Similarly, it could also pose a potential latency issue for the
>> instruction decoder.
>>
>>
>> It could also be possible to use some sort of "decompress stuff on
>> demand" features, but this is likely to have enough overheads to where
>> it only really makes sense if the target machine is RAM-constrained. In
>> this case, generally indirect calls would be used (via a table), with
>> non-resident functions generally pointing to a "decompress this" thunk.
>>
>> A similar strategy can also typically be used for a JIT compiler (the
>> indirect calls either go to a another JIT compiled trace, or potentially
>> lead back into the interpreter, which may at its discretion later
>> replace this pointer with another JIT compiled trace).
>>
>> ...

Re: a modest proposal (apologies to J. Swift)

<d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18016&group=comp.arch#18016

X-Received: by 2002:a05:620a:2225:: with SMTP id n5mr630790qkh.38.1624308722248; Mon, 21 Jun 2021 13:52:02 -0700 (PDT)
X-Received: by 2002:a9d:7748:: with SMTP id t8mr4032otl.110.1624308721990; Mon, 21 Jun 2021 13:52:01 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 21 Jun 2021 13:52:01 -0700 (PDT)
In-Reply-To: <saqsq8$r75$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1df4:f8da:a2d6:40c0; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1df4:f8da:a2d6:40c0
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com> <sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com> <saqj72$mnk$1@dont-email.me> <7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com> <saqsq8$r75$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 21 Jun 2021 20:52:02 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 196
 by: MitchAlsup - Mon, 21 Jun 2021 20:52 UTC

On Monday, June 21, 2021 at 3:28:58 PM UTC-5, BGB wrote:
> On 6/21/2021 1:11 PM, MitchAlsup wrote:
> > On Monday, June 21, 2021 at 12:45:09 PM UTC-5, BGB wrote:
> >> On 6/21/2021 10:57 AM, John Levine wrote:
> >>> According to Marcus <m.de...@this.bitsnbites.eu>:
> >>>>> I'd go in the other direction, with lots of registers and a
> >>>>> compression scheme to make the code smaller. zSeries does gzip deflate
> >>>>> mostly in hardware, perhaps something like that.
> >>>>
> >>>> OTOH I bet that general purpose compression schemes would compress
> >>>> better if you used stack indices rather than absolute register numbers,
> >>>> since lower index numbers would likely be more common than higher index
> >>>> numbers (i.e. there's less entropy than in "random" register numbers).
> >>>
> >>> I think you're assuming your conclusions. Usually in a register allocator, there
> >>> is no reason to prefer one register over another. Add some weights to prefer using
> >>> low numbered registers or reusing the same register and we could see how much more
> >>> compressible the code was.
> >>>
> >> In my own tests, generally the dictionary part of an LZ compressor works
> >> better than the entropy part when it comes to machine code.
> >>
> >>
> >> Though, it is more likely to benefit from restrictive register and
> >> instruction encoding choices, which might be worse for code-density
> >> in-general than a more variable set of choices.
> >>
> >> Though, I have noted that the optimal size of the register space (in
> >> terms of code density and performance) for a given function tends to be
> >> variable.
> >>>> General purpose compression schemes have lots of other problems, though.
> >>>> For instance you can't do random access of individual instructions
> >>>> without first decompressing a compressed block, and you can't guarantee
> >>>> a certain size/alignment of compressed blocks.
> >>>
> >>> True, but look at any variable length instruction set and you have basically the same
> >>> issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
> >>> decompressing code blocks into large cache lines.
> >>>
> >> There are a few possible routes to code compression:
> >> LZ compression, typically decoded during program loading or similar;
> >> This is a bigger factor for simulations than for realtime though.
> >> Branch-based code reuse:
> >> Some common/repetitive constructs can be called via branch ops (1).
> >> Some form of "instruction dictionary":
> >> Not done, doesn't seem worthwhile (2).
> >>
> >>
> >> 1: Things like function prolog/epilog sequences tend to mostly repeat
> >> the same series of instructions, so it may be cost effective to simply
> >> call or branch to a previous instance. The cost of the extra branches
> >> (particularly with branch prediction) may be offset by the reduction in
> >> I$ misses.
> > <
> > In the C world, prologue becomes ENTER and epilogue becomes EXIT.
> > This saves a surprising amount of code space. These instructions perform
> > all of the saving and restoring of the registers, including stack pointer and
> > frame pointer updates. And since the stores and loads are to consecutive
> > memory locations, they can be performed several registers wide per cycle
> > {Wide = 2,4,8}.
> > <
> > In the C++ world, I can agree that arranging the epilogue to call a routine
> > to perform all destructors and then EXIT makes it significantly easier to
> > incorporate TRY-THROW-CATCH and even longjump() over destructor
> > boundaries.
> In my case, these tend to turn into a series of 128-bit load/store ops
> (register pair).
> Since most functions need to save and restore similar sets of registers,
> ..., then these sequences are fairly reusable.
>
> So, a typical function prolog becomes:
> Copy LR to R1 (since BSR stomps LR);
> Use 'BSR' to call a previous prolog (to save registers, ...);
> Adjust SP (for local data);
> Function-specific setup (initializing local storage, *).
<
So, you consume 3 instructions and 2 control transfers, where I do it in
1 instruction. {No need for the MOV, the STOREs are in ENTER as is
the modification of the SP to allocate local storage.}
>
> And, a typical epilog becomes:
> Function-specific teardown;
> Adjust SP (up to save area);
> Branch to previous epilog.
<
Except for destructors, mine is 1 instruction.
>
> *: Typically this involves setting up internal pointers for
> structures/arrays, saving function arguments to the stack if needed, or
> (rarely) calling __alloca or similar for large objects or arrays.
<
Their colloquial name is "dope vectors".
>
>
> Assuming no cache misses, the 'MOV.X' ops can load or store 128 bits per
> clock cycle, and are typically encoded as 16-bits each. Typically, these
> are between around 4 and 10 load/store instructions (so, ~ 10-22 bytes,
> including a final RTS or RTSU).
>
> The "compressed" sequence is typically 4 or 6 bytes (prolog), or 2 or 4
> bytes (epilog).
>
>
>
> As a recent addition (compiler level), leaf functions may now skip stack
> frame creation, which saves a little more space and also gives a modest
> performance increase.
<
Yep, subroutine overhead as small as 2 control transfers without additional
overhead. {The CALL to get there, the RET to get back}
>
> In this case, the prolog is typically 0 bytes (doesn't emit any
> instructions), and the epilog is 2 bytes (a lone RTS instruction).
> This case requires all of the variables in the function to be able to
> fit in scratch registers (they are all statically assigned to a register).
>
>
> A TODO feature is to expand the compiler's register allocator to be able
> to use R32..R63, as there are a lot of leaf functions which may be able
> to fit into 28 scratch registers which don't fit into 12.
<
90+% (statically) of leaf functions are adequately managed with 16
available (non-preserved) registers. Doubling this (16-32) will not improve
the situation much.
>
> Another (not yet explored) possibility could be whether full-static
> register assignment could be applicable to non-leaf functions.
>
> As noted, most register allocation uses a more naive strategy:
> Part of the register space is statically assigned;
> The rest uses dynamic assignment.
>
> Where, the dynamic assignment is, essentially:
> Demand-load variables into registers (if not already present);
> Write any changed registers back to the stack at the end of the basic block;
> Flush any non-statically-assigned registers.
>
> The demand loading strategy is, basically:
> See if value we want is already in a register;
> If so, use it.
> Check for an unused register (in this basic block);
> Use register if found.
> Choose another register to evict (LRU + Round-Robin);
> Evict Register:
> Write-back contents if dirty;
> Set to unused.
> Load new value, set to new variable.
>
> Basically, each basic block starts out as a "clean slate" regarding any
> non-static registers.
>
>
> Granted, this isn't a particularly great register allocation strategy,
> particularly for a target with a fairly large register space. Could be
> better to be able to keep variables alive between basic-blocks without
> needing to keep them statically assigned for the entire function.
>
> Static assignment works fairly well, but a certain number of registers
> need to be left available to deal effectively with the demand-loaded
> case (if too few registers are left available then the register
> allocator is prone to "thrashing"). Some amount of this was tuned mostly
> via the use of heuristics.
>
> ...
> >>
> >>
> >> 2: In this case, the instruction could be reduced partly to a pattern
> >> and a table-lookup.
> >>
> >> In this case, part (or all) of the 16-bit encoding space could be
> >> replaced by a lookup table. Say, yynm, where yy fetches an instruction
> >> pattern from the table, and nm are used as register indices or similar
> >> (substituted into the pattern for a longer form instruction, with a few
> >> bits giving the intended pattern substitution type).
> >>
> >> The table would likely be built by the compiler and held in some
> >> internal CPU registers at runtime. The drawback of needing to
> >> dynamically reload the table contents to deal with things like DLLs or
> >> syscalls would not be attractive (either manually or with a dedicated
> >> cache).
> >>
> >> Similarly, it could also pose a potential latency issue for the
> >> instruction decoder.
> >>
> >>
> >> It could also be possible to use some sort of "decompress stuff on
> >> demand" features, but this is likely to have enough overheads to where
> >> it only really makes sense if the target machine is RAM-constrained. In
> >> this case, generally indirect calls would be used (via a table), with
> >> non-resident functions generally pointing to a "decompress this" thunk.
> >>
> >> A similar strategy can also typically be used for a JIT compiler (the
> >> indirect calls either go to a another JIT compiled trace, or potentially
> >> lead back into the interpreter, which may at its discretion later
> >> replace this pointer with another JIT compiled trace).
> >>
> >> ...


Re: a modest proposal (apologies to J. Swift)

<fc108106-c6e1-468a-8924-e9e01cbcc083n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18020&group=comp.arch#18020

X-Received: by 2002:ad4:5bee:: with SMTP id k14mr9322369qvc.11.1624319489408; Mon, 21 Jun 2021 16:51:29 -0700 (PDT)
X-Received: by 2002:a9d:333:: with SMTP id 48mr491131otv.329.1624319489025; Mon, 21 Jun 2021 16:51:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 21 Jun 2021 16:51:28 -0700 (PDT)
In-Reply-To: <sao5ar$1cl7$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:a0d0:9f90:a4d5:4680:2c31:9427; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 2600:1700:a0d0:9f90:a4d5:4680:2c31:9427
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fc108106-c6e1-468a-8924-e9e01cbcc083n@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Mon, 21 Jun 2021 23:51:29 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 60
 by: JimBrakefield - Mon, 21 Jun 2021 23:51 UTC

On Sunday, June 20, 2021 at 2:35:58 PM UTC-5, John Levine wrote:
> According to Ivan Godard <iv...@millcomputing.com>:
> >Which led me to wonder: why restrict this approach to control flow?
> >
> >It is well known that by far the densest encodings are those for
> >accumulator and stack virtual machines. Their drawback is that they are
> >inherently serial. Yet it seems to me that it should not be hard to
> >translate accumulator encoding (for example) to three-address register
> >encoding using the same dynamic allocation of rename registers as is
> >used in ordinary OOO. ...
>
> A major complaint about stack instructions is that it is hard to refer
> to operands that aren't at the top of the stack so it's hard to reuse
> common subexpressions. You could use an index into the stack, but in
> that case why not use those bits as a register number? Even though
> transistors are cheap, trying to rediscover loop invariants on the fly
> from stack code seems like a stretch.
>
> I'd go in the other direction, with lots of registers and a
> compression scheme to make the code smaller. zSeries does gzip deflate
> mostly in hardware, perhaps something like that.
>
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

|> compression scheme to make the code smaller. zSeries does gzip deflate
|> mostly in hardware, perhaps something like that.

Stack machines have a simple compression mechanism built-in:
Turn a sequence of frequently used stack operations into a subroutine/word.
A nearby subroutine can be simply the stack operations followed by return.
A nearby call needs a relatively short displacement.
Can also be considered macro expansion. For code density concerns, a sufficiently
long code sequence that is used just twice can result in compaction.
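As a toy sketch of the idea (opcode names are invented, not from any real stack ISA): factoring a repeated run of stack ops into a called word pays off once the run is long enough to cover the CALL and RET overhead, which is why "sufficiently long ... used just twice" is the break-even point.

```python
def factor_sequence(code, seq):
    """Replace occurrences of `seq` with a 1-op CALL; emit seq+RET once."""
    out, i, used = [], 0, False
    while i < len(code):
        if code[i:i + len(seq)] == seq:
            out.append("CALL")          # short displacement to a nearby word
            used = True
            i += len(seq)
        else:
            out.append(code[i])
            i += 1
    word = seq + ["RET"] if used else []
    return out, word

# A 4-op sequence used just twice: 9 ops shrink to 3 plus a 5-op word.
code = ["DUP", "ADD", "SWAP", "OVER"] * 2 + ["DROP"]
new_code, word = factor_sequence(code, ["DUP", "ADD", "SWAP", "OVER"])
```

With a 3-op sequence used twice there would be no savings (2 CALLs plus a 4-op word equals the original 6 ops plus overhead), which matches the "sufficiently long" caveat.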

|> A major complaint about stack instructions is that it is hard to refer
|> to operands that aren't at the top of the stack so it's hard to reuse
|> common subexpressions. You could use an index into the stack...

One can add a frame pointer mechanism so that frame values (e.g. temporaries)
are located at a fixed offset from the frame pointer.
On entry to a subroutine using a frame, stack space is reserved and on exit the
stack is shortened. For this to work the top of stack items used in the subroutine
need to be part of the frame and preceded by space for the results.
e.g. data stack before call: ... space for results, subroutine parameters
e.g. data stack during call: ... space for results, parameters, remainder of frame, subroutine local data items...
e.g. data stack after call: ... results
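A toy model of that calling convention (my reading of the layout sketched above, nothing normative): the caller reserves result slots and pushes parameters; on exit the stack is shortened so only the results remain.

```python
def call(stack, n_params, n_results, body):
    """Data stack on entry: ..., result slots, parameters (top)."""
    frame_base = len(stack) - n_params - n_results
    params = stack[frame_base + n_results:]   # fixed offsets from frame base
    results = body(params)                    # callee may push/pop freely
    assert len(results) == n_results
    del stack[frame_base:]                    # on exit the stack is shortened
    stack.extend(results)                     # ...leaving only the results

stack = [None, None, 3, 4]   # 2 reserved result slots, then params 3 and 4
call(stack, n_params=2, n_results=2,
     body=lambda p: [p[0] + p[1], p[0] * p[1]])
# data stack after call: ... results
```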

In the above schemes the return stack is only used for return addresses and looping parameters.
The frames are part of the data stack and exist to simplify access to local values.

So there are four areas where stack items can be mapped to a register file:
Global values (at one end of the register file)
Frame values
Local data stack values (referenced relative to "top" of stack)
Return address and frame pointer stack

Stack spilling and refilling mechanism needed when register file becomes large.
For deeply embedded applications the stack usage can be predetermined.
For general purpose computing subroutine nesting depth can exceed, say, 60.

Re: a modest proposal (apologies to J. Swift)

<fa656f05-3fa1-4214-982d-5b15df69edc8n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18022&group=comp.arch#18022

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:b08:: with SMTP id u8mr13328779qvj.52.1624321839902;
Mon, 21 Jun 2021 17:30:39 -0700 (PDT)
X-Received: by 2002:a05:6830:118c:: with SMTP id u12mr660228otq.82.1624321839620;
Mon, 21 Jun 2021 17:30:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 21 Jun 2021 17:30:39 -0700 (PDT)
In-Reply-To: <sao5ar$1cl7$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1df4:f8da:a2d6:40c0;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1df4:f8da:a2d6:40c0
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fa656f05-3fa1-4214-982d-5b15df69edc8n@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Jun 2021 00:30:39 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 22 Jun 2021 00:30 UTC

On Sunday, June 20, 2021 at 2:35:58 PM UTC-5, John Levine wrote:
> According to Ivan Godard <iv...@millcomputing.com>:
> >Which led me to wonder: why restrict this approach to control flow?
> >
> >It is well known that by far the densest encodings are those for
> >accumulator and stack virtual machines. Their drawback is that they are
> >inherently serial. Yet it seems to me that it should not be hard to
> >translate accumulator encoding (for example) to three-address register
> >encoding using the same dynamic allocation of rename registers as is
> >used in ordinary OOO. ...
>
> A major complaint about stack instructions is that it is hard to refer
> to operands that aren't at the top of the stack so it's hard to reuse
> common subexpressions. You could use an index into the stack, but in
> that case why not use those bits as a register number? Even though
> transistors are cheap, trying to rediscover loop invariants on the fly
> from stack code seems like a stretch.
<
It is easy enough to create an instruction that reaches into the stack
and puts that value on the stack so it can be used as a common
sub-expression.
<
Over on the HW side, HW can easily recognize this instruction and know
a priori that this is a common sub-expression; all it needs to do is find
the rename register, so this instruction does not need execution resources
in order to be executed.
>
> I'd go in the other direction, with lots of registers and a
> compression scheme to make the code smaller. zSeries does gzip deflate
> mostly in hardware, perhaps something like that.
>
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

Re: a modest proposal (apologies to J. Swift)

<sat9lt$vvv$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=18036&group=comp.arch#18036

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Tue, 22 Jun 2021 13:18:17 -0500
Organization: A noiseless patient Spider
Lines: 246
Message-ID: <sat9lt$vvv$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
<sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com>
<saqj72$mnk$1@dont-email.me>
<7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>
<saqsq8$r75$1@dont-email.me>
<d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Jun 2021 18:20:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5ab69fa683d54fcff8727aeac76e5604";
logging-data="32767"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19hP6w2Y/pMnkxa9laBwUpA"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:/Dex4tlWk470XmJdZjlFsovLuvk=
In-Reply-To: <d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 22 Jun 2021 18:18 UTC

On 6/21/2021 3:52 PM, MitchAlsup wrote:
> On Monday, June 21, 2021 at 3:28:58 PM UTC-5, BGB wrote:
>> On 6/21/2021 1:11 PM, MitchAlsup wrote:
>>> On Monday, June 21, 2021 at 12:45:09 PM UTC-5, BGB wrote:
>>>> On 6/21/2021 10:57 AM, John Levine wrote:
>>>>> According to Marcus <m.de...@this.bitsnbites.eu>:
>>>>>>> I'd go in the other direction, with lots of registers and a
>>>>>>> compression scheme to make the code smaller. zSeries does gzip deflate
>>>>>>> mostly in hardware, perhaps something like that.
>>>>>>
>>>>>> OTOH I bet that general purpose compression schemes would compress
>>>>>> better if you used stack indices rather than absolute register numbers,
>>>>>> since lower index numbers would likely be more common than higher index
>>>>>> numbers (i.e. there's less entropy than in "random" register numbers).
>>>>>
>>>>> I think you're assuming your conclusions. Usually in a register allocator, there
>>>>> is no reason to prefer one register over another. Add some weights to prefer using
>>>>> low numbered registers or reusing the same register and we could see how much more
>>>>> compressible the code was.
>>>>>
>>>> In my own tests, generally the dictionary part of an LZ compressor works
>>>> better than the entropy part when it comes to machine code.
>>>>
>>>>
>>>> Though, it is more likely to benefit from restrictive register and
>>>> instruction encoding choices, which might be worse for code-density
>>>> in-general than a more variable set of choices.
>>>>
>>>> Though, I have noted that the optimal size of the register space (in
>>>> terms of code density and performance) for a given function tends to be
>>>> variable.
>>>>>> General purpose compression schemes have lots of other problems, though.
>>>>>> For instance you can't do random access of individual instructions
>>>>>> without first decompressing a compressed block, and you can't guarantee
>>>>>> a certain size/alignment of compressed blocks.
>>>>>
>>>>> True, but look at any variable length instruction set and you have basically the same
>>>>> issues, albeit perhaps at different scale. I can wave my hands like crazy and imagine
>>>>> decompressing code blocks into large cache lines.
>>>>>
>>>> There are a few possible routes to code compression:
>>>> LZ compression, typically decoded during program loading or similar;
>>>> This is a bigger factor for simulations than for realtime though.
>>>> Branch-based code reuse:
>>>> Some common/repetitive constructs can be called via branch ops (1).
>>>> Some form of "instruction dictionary":
>>>> Not done, doesn't seem worthwhile (2).
>>>>
>>>>
>>>> 1: Things like function prolog/epilog sequences tend to mostly repeat
>>>> the same series of instructions, so it may be cost effective to simply
>>>> call or branch to a previous instance. The cost of the extra branches
>>>> (particularly with branch prediction) may be offset by the reduction in
>>>> I$ misses.
>>> <
>>> In the C world, prologue becomes ENTER and epilogue becomes EXIT.
>>> This saves a surprising amount of code space. These instructions perform
>>> all of the saving and restoring of the registers, including stack pointer and
>>> frame pointer updates. And since the stores and loads are to consecutive
>>> memory locations, they can be performed several registers wide per cycle
>>> {Wide = 2,4,8}.
>>> <
>>> In the C++ world, I can agree that arranging the epilogue to call a routine
>>> to perform all destructors and then EXIT makes it significantly easier to
>>> incorporate TRY-THROW-CATCH and even longjump() over destructor
>>> boundaries.
>> In my case, these tend to turn into a series of 128-bit load/store ops
>> (register pair).
>> Since most functions need to save and restore similar sets of registers,
>> ..., then these sequences are fairly reusable.
>>
>> So, a typical function prolog becomes:
>> Copy LR to R1 (since BSR stomps LR);
>> Use 'BSR' to call a previous prolog (to save registers, ...);
>> Adjust SP (for local data);
>> Function-specific setup (initializing local storage, *).
> <
> So, you consume 3 instructions and 2 control transfers, where I do it in
> 1 instruction. {No need for the MOV, the STOREs are in ENTER as is
> the modification of the SP to allocate local storage.}

Alternatives would likely require either:
An alternate BSR encoding that supports using R1 rather than LR;
The ability to load/store CR's directly (for LR);
Negative displacements for Load/Store ops (to save a stack adjustment);
....

Or:
Some sort of internal state machine or microcode.

>>
>> And, a typical epilog becomes:
>> Function-specific teardown;
>> Adjust SP (up to save area);
>> Branch to previous epilog.
> <
> Exception for destructors, mine is 1 instruction.

Destructors would fall into "function specific teardown", which also
includes teardown for __alloca (calling a function which then frees
anything allocated by __alloca).

Probably 95% of C functions don't need to do anything here.

>>
>> *: Typically this involves setting up internal pointers for
>> structures/arrays, saving function arguments to the stack if needed, or
>> (rarely) calling __alloca or similar for large objects or arrays.
> <
> Their colloquial name is "dope vectors".

OK.

As can be noted, my compiler tends to represent (larger) structs, and
local arrays, as a reference-variable which points to the data area for
the struct or array. Function initialization sets up the pointer to
point to the data within the stack-frame (if it fits), or calls __alloca
(if it is too big).

The alloca logic creates a variable to hold a linked-list of
"stack-frame" allocations (initialized to NULL), which is used so that
they can later be freed (being essentially a thin wrapper over malloc/free).
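A rough Python stand-in for the compiler-generated logic described here (names like `alloca_head` are mine, not BGB's; the real thing emits malloc/free calls rather than Python objects):

```python
class Frame:
    """Per-function state: linked list of __alloca blocks, initially NULL."""
    def __init__(self):
        self.alloca_head = None

    def alloca(self, size):
        block = {"data": bytearray(size), "next": self.alloca_head}
        self.alloca_head = block          # prepend to the frame's list
        return block["data"]

    def teardown(self):
        """Function-specific teardown: walk the list, freeing each block."""
        freed = 0
        node = self.alloca_head
        while node is not None:
            node = node["next"]           # the real teardown calls free() here
            freed += 1
        self.alloca_head = None
        return freed

f = Frame()
buf_a, buf_b = f.alloca(64), f.alloca(128)
```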

There may also be some initial prolog-level machinery to deal with
things like "va_start()" and similar as well.

>>
>>
>> Assuming no cache misses, the 'MOV.X' ops can load or store 128 bits per
>> clock cycle, and are typically encoded as 16-bits each. Typically, these
>> are between around 4 and 10 load/store instructions (so, ~ 10-22 bytes,
>> including a final RTS or RTSU).
>>
>> The "compressed" sequence is typically 4 or 6 bytes (prolog), or 2 or 4
>> bytes (epilog).
>>
>>
>>
>> As a recent addition (compiler level), leaf functions may now skip stack
>> frame creation, which saves a little more space and also gives a modest
>> performance increase.
> <
> Yep, subroutine overhead as small as 2 control transfers without additional
> overhead. {The CALL to get there, the RET to get back}

Or, BSR + RTS.

>>
>> In this case, the prolog is typically 0 bytes (doesn't emit any
>> instructions), and the epilog is 2 bytes (a lone RTS instruction).
>> This case requires all of the variables in the function to be able to
>> fit in scratch registers (they are all statically assigned to a register).
>>
>>
>> A TODO feature is to expand the compiler's register allocator to be able
>> to use R32..R63, as there are a lot of leaf functions which may be able
>> to fit into 28 scratch registers which don't fit into 12.
> <
> 90+% (statically) of leaf functions are adequately managed with 16
> available (non-preserved) registers. Doubling this (16-32) will not improve
> the situation much.

From what I could see with the current limit (12 registers, *), the
number of leaf functions which could fit was a little lower than this.

*: R0, R1, R16, and R17; can't currently be used for variables. However,
R1, R16, and R17 may be used as secondary scratch registers.

>>
>> Another (not yet explored) possibility could be whether full-static
>> register assignment could be applicable to non-leaf functions.
>>
>> As noted, most register allocation uses a more naive strategy:
>> Part of the register space is statically assigned;
>> The rest uses dynamic assignment.
>>
>> Where, the dynamic assignment is, essentially:
>> Demand-load variables into registers (if not already present);
>> Write any changed registers back to the stack at the end of the basic block;
>> Flush any non-statically-assigned registers.
>>
>> The demand loading strategy is, basically:
>> See if value we want is already in a register;
>> If so, use it.
>> Check for an unused register (in this basic block);
>> Use register if found.
>> Choose another register to evict (LRU + Round-Robin);
>> Evict Register:
>> Write-back contents if dirty;
>> Set to unused.
>> Load new value, set to new variable.
>>
>> Basically, each basic block starts out as a "clean slate" regarding any
>> non-static registers.
>>
>>
>> Granted, this isn't a particularly great register allocation strategy,
>> particularly for a target with a fairly large register space. Could be
>> better to be able to keep variables alive between basic-blocks without
>> needing to keep them statically assigned for the entire function.
>>
>> Static assignment works fairly well, but a certain number of registers
>> need to be left available to deal effectively with the demand-loaded
>> case (if too few registers are left available then the register
>> allocator is prone to "thrashing"). Some amount of this was tuned mostly
>> via the use of heuristics.
>>
>> ...
>>>>
>>>>
>>>> 2: In this case, the instruction could be reduced partly to a pattern
>>>> and a table-lookup.
>>>>
>>>> In this case, part (or all) of the 16-bit encoding space could be
>>>> replaced by a lookup table. Say, yynm, where yy fetches an instruction
>>>> pattern from the table, and nm are used as register indices or similar
>>>> (substituted into the pattern for a longer form instruction, with a few
>>>> bits giving the intended pattern substitution type).
>>>>
>>>> The table would likely be built by the compiler and held in some
>>>> internal CPU registers at runtime. The drawback of needing to
>>>> dynamically reload the table contents to deal with things like DLLs or
>>>> syscalls would not be attractive (either manually or with a dedicated
>>>> cache).
>>>>
>>>> Similarly, it could also pose a potential latency issue for the
>>>> instruction decoder.
>>>>
>>>>
>>>> It could also be possible to use some sort of "decompress stuff on
>>>> demand" features, but this is likely to have enough overheads to where
>>>> it only really makes sense if the target machine is RAM-constrained. In
>>>> this case, generally indirect calls would be used (via a table), with
>>>> non-resident functions generally pointing to a "decompress this" thunk.
>>>>
>>>> A similar strategy can also typically be used for a JIT compiler (the
>>>> indirect calls either go to a another JIT compiled trace, or potentially
>>>> lead back into the interpreter, which may at its discretion later
>>>> replace this pointer with another JIT compiled trace).
>>>>
>>>> ...


Re: a modest proposal (apologies to J. Swift)

<satbfc$sm2$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=18037&group=comp.arch#18037

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Tue, 22 Jun 2021 13:49:00 -0500
Organization: A noiseless patient Spider
Lines: 109
Message-ID: <satbfc$sm2$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
<fc108106-c6e1-468a-8924-e9e01cbcc083n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Jun 2021 18:51:24 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5ab69fa683d54fcff8727aeac76e5604";
logging-data="29378"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18XvxxcAD2mdd1oC3P8zy1k"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:XScbM5n+JJ9Mslj8B/oWMtrqbMs=
In-Reply-To: <fc108106-c6e1-468a-8924-e9e01cbcc083n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 22 Jun 2021 18:49 UTC

On 6/21/2021 6:51 PM, JimBrakefield wrote:
> On Sunday, June 20, 2021 at 2:35:58 PM UTC-5, John Levine wrote:
>> According to Ivan Godard <iv...@millcomputing.com>:
>>> Which led me to wonder: why restrict this approach to control flow?
>>>
>>> It is well known that by far the densest encodings are those for
>>> accumulator and stack virtual machines. Their drawback is that they are
>>> inherently serial. Yet it seems to me that it should not be hard to
>>> translate accumulator encoding (for example) to three-address register
>>> encoding using the same dynamic allocation of rename registers as is
>>> used in ordinary OOO. ...
>>
>> A major complaint about stack instructions is that it is hard to refer
>> to operands that aren't at the top of the stack so it's hard to reuse
>> common subexpressions. You could use an index into the stack, but in
>> that case why not use those bits as a register number? Even though
>> transistors are cheap, trying to rediscover loop invariants on the fly
>> from stack code seems like a stretch.
>>
>> I'd go in the other direction, with lots of registers and a
>> compression scheme to make the code smaller. zSeries does gzip deflate
>> mostly in hardware, perhaps something like that.
>>
>> --
>> Regards,
>> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
>> Please consider the environment before reading this e-mail. https://jl.ly
>
> |> compression scheme to make the code smaller. zSeries does gzip deflate
> |> mostly in hardware, perhaps something like that.
>
> Stack machines have a simple compression mechanism built-in:
> Turn a sequence of frequently used stack operations into a subroutine/word.
> A nearby subroutine can be simply the stack operations followed by return.
> A nearby call needs a relatively short displacement.
> Can also be considered macro expansion. For code density concerns, a sufficiently
> long code sequence that is used just twice can result in compaction.
>

LZ compression is also fairly effective on stack bytecode, in addition
to Forth style subroutines.
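A quick way to see this effect (opcode bytes invented; only the repetition matters): many functions share prolog/epilog byte sequences, so the dictionary part of an LZ coder finds long matches across them.

```python
import zlib

prolog = bytes([0x10, 0x11, 0x12, 0x13])    # hypothetical save sequence
epilog = bytes([0x1C, 0x0F])                # hypothetical restore + return
body   = bytes([0x20, 0x21, 0x30, 0x22])
module = (prolog + body * 3 + epilog) * 50  # 50 structurally similar functions

compressed = zlib.compress(module, 9)
ratio = len(compressed) / len(module)       # far below 1 for code like this
```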

> |> A major complaint about stack instructions is that it is hard to refer
> |> to operands that aren't at the top of the stack so it's hard to reuse
> |> common subexpressions. You could use an index into the stack...
>
> One can add a frame pointer mechanism so that frame values (e.g. temporaries)
> are located at a fixed offset from the frame pointer.
> On entry to a subroutine using a frame, stack space is reserved and on exit the
> stack is shortened. For this to work the top of stack items used in the subroutine
> need to be part of the frame and preceded by space for the results.
> e.g. data stack before call: ... space for results, subroutine parameters
> e.g. data stack during call: ... space for results, parameters, remainder of frame, subroutine local data items...
> e.g. data stack after call: ... results
>

It is possible.

Typically in stack IR formats, there are separate spaces for local
variables and for the value stack. The value stack is generally small
and transient, used mostly for holding the results of intermediate
calculations, or for arguments being passed to a function call, or the
return value from a call or similar.

This is unlike Forth or PostScript, which use a monolithic stack, and
the depth of the stack is allowed to vary along with the control-flow path.

However, a monolithic stack (like in Forth or PostScript) would greatly
limit the utility of the format as an IR.

With some restrictions, Forth-style subroutines could be useful as an
"IR macro" feature though.

> In the above schemes the return stack is only used for return addresses and looping parameters.
> The frames are part of the data stack and exist to simplify access to local values
> So there are four areas where stack items can be mapped to a register file:
> Global values (at one end of the register file)
> Frame values
> Local data stack values (referenced relative to "top" of stack)
> Return address and frame pointer stack
>
> Stack spilling and refilling mechanism needed when register file becomes large.
> For deeply embedded applications the stack usage can be predetermined.
> For general purpose computing subroutine nesting depth can exceed, say, 60.
>

Yeah. This sounds a fair bit more like Forth or similar than it does
like JVM or CLR bytecode.

I guess the question then becomes how much sense something Forth-like
makes as a CPU ISA. I would expect it might make sense to implement the
stack as a sort of specialized cache:
It is addressed in a way more like that of a register file;
Just with relative register addressing.
Its contents are backed to memory.

So, when the stack position moves too far, this cache will essentially
evict and reload cache lines as a sort of modular ring. A cache with
2x16 lines, each 16 bytes, could hold ~ 64 stack items (if each stack
item is 64 bits).

Could be bigger if one wants to support things like local variable
frames, though potentially these could be managed using a separate
general-purpose data cache rather than a stack-cache.
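A behavioral sketch of that modular ring (sizes shrunk for illustration; the post suggests 2x16 lines of 16 bytes): the newest lines stay resident, pushing past capacity evicts the oldest line to memory, and popping past it refills.

```python
LINE_ITEMS = 2      # items per line (e.g. 2 x 64-bit items in a 16B line)
N_LINES = 4         # a tiny ring; the post's 2x16 lines gives ~64 items

class StackCache:
    def __init__(self):
        self.mem = []        # backing memory for evicted lines
        self.ring = []       # resident full lines, oldest first
        self.top = []        # partially filled top-of-stack line

    def push(self, v):
        self.top.append(v)
        if len(self.top) == LINE_ITEMS:
            self.ring.append(self.top)
            self.top = []
            if len(self.ring) > N_LINES:
                self.mem.append(self.ring.pop(0))   # evict oldest line

    def pop(self):
        if not self.top:
            if not self.ring:
                self.ring.append(self.mem.pop())    # refill from memory
            self.top = self.ring.pop()
        return self.top.pop()

sc = StackCache()
for i in range(20):          # push deeper than the ring can hold
    sc.push(i)
```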

I'm less sure about the specifics of implementing a core for such an architecture.

Re: a modest proposal (apologies to J. Swift)

<ca127074-6978-44ae-9b09-8a304b921026n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18038&group=comp.arch#18038

Newsgroups: comp.arch
X-Received: by 2002:a37:79c6:: with SMTP id u189mr6101478qkc.0.1624389988441; Tue, 22 Jun 2021 12:26:28 -0700 (PDT)
X-Received: by 2002:aca:2102:: with SMTP id 2mr280990oiz.122.1624389988221; Tue, 22 Jun 2021 12:26:28 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Jun 2021 12:26:28 -0700 (PDT)
In-Reply-To: <sat9lt$vvv$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:30a7:85a3:57e9:286d; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:30a7:85a3:57e9:286d
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com> <sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com> <saqj72$mnk$1@dont-email.me> <7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com> <saqsq8$r75$1@dont-email.me> <d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com> <sat9lt$vvv$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ca127074-6978-44ae-9b09-8a304b921026n@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Jun 2021 19:26:28 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 139
 by: MitchAlsup - Tue, 22 Jun 2021 19:26 UTC

On Tuesday, June 22, 2021 at 1:20:48 PM UTC-5, BGB wrote:
> On 6/21/2021 3:52 PM, MitchAlsup wrote:
> > On Monday, June 21, 2021 at 3:28:58 PM UTC-5, BGB wrote:
> >> On 6/21/2021 1:11 PM, MitchAlsup wrote:

> >> Since most functions need to save and restore similar sets of registers,
> >> ..., then these sequences are fairly reusable.
> >>
> >> So, a typical function prolog becomes:
> >> Copy LR to R1 (since BSR stomps LR);
> >> Use 'BSR' to call a previous prolog (to save registers, ...);
> >> Adjust SP (for local data);
> >> Function-specific setup (initializing local storage, *).
> > <
> > So, you consume 3 instructions and 2 control transfers, where I do it in
> > 1 instruction. {No need for the MOV, the STOREs are in ENTER as is
> > the modification of the SP to allocate local storage.}
<
> Alternatives would likely require either:
> An alternate BSR encoding that supports using R1 rather than LR;
<
If you get rid of the call, you get rid of blowing R1.
<
> The ability to load/store CR's directly (for LR);
> Negative displacements for Load/Store ops (to save a stack adjustment);
<
This is an argument for doing it inside an instruction: you can effectively
do a bunch of @(-SP) pushes and at the end use the immediate that comes
with the instruction to allocate local stack space. I found it profitable to
use 2 lower-order bits to indicate whether SP gets saved (and restored) and
whether FP gets saved and adjusted. I can do this because the stack must
always remain double-word aligned.
> ...
>
> Or:
> Some sort of internal state machine or microcode.
<
A sequencer of any ilk that is appropriate.
> >>
> >> And, a typical epilog becomes:
> >> Function-specific teardown;
> >> Adjust SP (up to save area);
> >> Branch to previous epilog.
> > <
> > Exception for destructors, mine is 1 instruction.
<
> Destructors would fall into "function specific teardown", which also
> includes teardown for __alloca (calling a function which then frees
> anything allocated by __alloca).
>
> Probably 95% of C functions don't need to do anything here.
<
It is far closer to 99.95% that do not need it.
<
> >>
> >> *: Typically this involves setting up internal pointers for
> >> structures/arrays, saving function arguments to the stack if needed, or
> >> (rarely) calling __alloca or similar for large objects or arrays.
> > <
> > Their colloquial name is "dope vectors".
> OK.
>
> As can be noted, my compiler tends to represent (larger) structs, and
> local arrays, as a reference-variable which points to the data area for
> the struct or array. Function initialization sets up the pointer to
> point to the data within the stack-frame (if it fits), or calls __alloca
> (if it is too big).
<
If the array size is dynamic.
>
> The alloca logic creates a variable to hold a linked-list of
> "stack-frame" allocations (initialized to NULL), which is used so that
> they can later be freed (being essentially a thin wrapper over malloc/free).
>
> There may also be some initial prolog-level machinery to deal with
> things like "va_start()" and similar as well.
<
This, too, gets wrapped up in the ENTER instruction.
<
ENTER R16,R8,sizeof( LocalDataArea)
<
pushes R16,R17...R29,R0,R1..R8 onto the stack.
<
R1-R8 contained the register-bound arguments, and the top of stack prior
to ENTER contained arguments[8..k], so ENTER created a contiguous
array of arguments on the stack, addressed at SP+LocalDataArea+14*8
<
So, va_start() sets a pointer to this storage, and va_arg() gets the next one
and advances the pointer.
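A toy model of the contiguous-argument trick (my reading of the description; register numbers and offsets are illustrative, not a spec): because ENTER stores the register arguments adjacent to the caller-pushed arguments[8..k], varargs reduce to one flat array and a moving pointer.

```python
def enter_arg_array(reg_args, mem_args):
    """After ENTER: register args sit contiguous with memory args."""
    return list(reg_args) + list(mem_args)

def va_start(args, n_fixed):
    return {"args": args, "i": n_fixed}   # pointer just past the fixed args

def va_arg(ap):
    v = ap["args"][ap["i"]]
    ap["i"] += 1                          # advance to the next argument
    return v

# e.g. a call with 1 fixed arg and 3 varargs, the last passed in memory
args = enter_arg_array(reg_args=[100, 1, 2], mem_args=[3])
ap = va_start(args, n_fixed=1)
```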
> >>
> >>
> >> Assuming no cache misses, the 'MOV.X' ops can load or store 128 bits per
> >> clock cycle, and are typically encoded as 16-bits each. Typically, these
> >> are between around 4 and 10 load/store instructions (so, ~ 10-22 bytes,
> >> including a final RTS or RTSU).
> >>
> >> The "compressed" sequence is typically 4 or 6 bytes (prolog), or 2 or 4
> >> bytes (epilog).
> >>
> >>
> >>
> >> As a recent addition (compiler level), leaf functions may now skip stack
> >> frame creation, which saves a little more space and also gives a modest
> >> performance increase.
> > <
> > Yep, subroutine overhead as small as 2 control transfers without additional
> > overhead. {The CALL to get there, the RET to get back}
> Or, BSR + RTS.
> >>
> >> In this case, the prolog is typically 0 bytes (doesn't emit any
> >> instructions), and the epilog is 2 bytes (a lone RTS instruction).
> >> This case requires all of the variables in the function to be able to
> >> fit in scratch registers (they are all statically assigned to a register).
> >>
> >>
> >> A TODO feature is to expand the compiler's register allocator to be able
> >> to use R32..R63, as there are a lot of leaf functions which may be able
> >> to fit into 28 scratch registers which don't fit into 12.
> > <
> > 90+% (statically) of leaf functions are adequately managed with 16
> > available (non-preserved) registers. Doubling this (16-32) will not improve
> > the situation much.
<
> From what I could see with the current limit (12 registers, *), the
> amount of leaf functions which could fit was a little lower than this.
>
> *: R0, R1, R16, and R17; can't currently be used for variables. However,
> R1, R16, and R17 may be used as secondary scratch registers.
<
I hope this is for better reasons than MIPS used in R2000 and R3000.....
<
R0 is the only register with HW defined use--and that is
a) gets IP return address when a CALL is performed
b) is an alias for IP when used as a base register
c) is an alias for zero when used as an index register
<
R31 (SP) has a bit that changes how ENTER and EXIT work
R30 (FP) has a bit that changes how ENTER and EXIT work.
> >>

Re: a modest proposal (apologies to J. Swift)

<d10ccdb5-fe33-44fd-ac0c-6480d14334a3n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18039&group=comp.arch#18039

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:5f0:: with SMTP id z16mr6210574qkg.439.1624395510830; Tue, 22 Jun 2021 13:58:30 -0700 (PDT)
X-Received: by 2002:a4a:bb89:: with SMTP id h9mr4839059oop.69.1624395510468; Tue, 22 Jun 2021 13:58:30 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Jun 2021 13:58:30 -0700 (PDT)
In-Reply-To: <satbfc$sm2$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com> <fc108106-c6e1-468a-8924-e9e01cbcc083n@googlegroups.com> <satbfc$sm2$1@dont-email.me>
Message-ID: <d10ccdb5-fe33-44fd-ac0c-6480d14334a3n@googlegroups.com>
Subject: Re: a modest proposal (apologies to J. Swift)
From: jim.brak...@ieee.org (JimBrakefield)
Lines: 126
 by: JimBrakefield - Tue, 22 Jun 2021 20:58 UTC

On Tuesday, June 22, 2021 at 1:51:26 PM UTC-5, BGB wrote:
> On 6/21/2021 6:51 PM, JimBrakefield wrote:
> > On Sunday, June 20, 2021 at 2:35:58 PM UTC-5, John Levine wrote:
> >> According to Ivan Godard <iv...@millcomputing.com>:
> >>> Which led me to wonder: why restrict this approach to control flow?
> >>>
> >>> It is well known that by far the densest encodings are those for
> >>> accumulator and stack virtual machines. Their drawback is that they are
> >>> inherently serial. Yet it seems to me that it should not be hard to
> >>> translate accumulator encoding (for example) to three-address register
> >>> encoding using the same dynamic allocation of rename registers as is
> >>> used in ordinary OOO. ...
> >>
> >> A major complaint about stack instructions is that it is hard to refer
> >> to operands that aren't at the top of the stack so it's hard to reuse
> >> common subexpressions. You could use an index into the stack, but in
> >> that case why not use those bits as a register number? Even though
> >> transistors are cheap, trying to rediscover loop invariants on the fly
> >> from stack code seems like a stretch.
> >>
> >> I'd go in the other direction, with lots of registers and a
> >> compression scheme to make the code smaller. zSeries does gzip deflate
> >> mostly in hardware, perhaps something like that.
> >>
> >> --
> >> Regards,
> >> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> >> Please consider the environment before reading this e-mail. https://jl.ly
> >
> > |> compression scheme to make the code smaller. zSeries does gzip deflate
> > |> mostly in hardware, perhaps something like that.
> >
> > Stack machines have a simple compression mechanism built-in:
> > Turn a sequence of frequently used stack operations into a subroutine/word.
> > A nearby subroutine can be simply the stack operations followed by return.
> > A nearby call needs a relatively short displacement.
> > Can also be considered macro expansion. For code density concerns, a sufficiently
> > long code sequence that is used just twice can result in compaction.
> >
> LZ compression is also fairly effective on stack bytecode, in addition
> to Forth style subroutines.
> > |> A major complaint about stack instructions is that it is hard to refer
> > |> to operands that aren't at the top of the stack so it's hard to reuse
> > |> common subexpressions. You could use an index into the stack...
> >
> > One can add a frame pointer mechanism so that frame values (e.g. temporaries)
> > are located at a fixed offset from the frame pointer.
> > On entry to a subroutine using a frame, stack space is reserved and on exit the
> > stack is shortened. For this to work the top of stack items used in the subroutine
> > need to be part of the frame and preceded by space for the results.
> > e.g. data stack before call: ... space for results, subroutine parameters
> > e.g. data stack during call: ... space for results, parameters, remainder of frame, subroutine local data items...
> > e.g. data stack after call: ... results
> >
> It is possible.
>
> Typically in stack IR formats, there are separate spaces for local
> variables and for the value stack. The value stack is generally small
> and transient, used mostly for holding the results of intermediate
> calculations, or for arguments being passed to a function call, or the
> return value from a call or similar.
>
>
> This is unlike Forth or PostScript, which use a monolithic stack, and
> the depth of the stack is allowed to vary along with the control-flow path.
>
> However, a monolithic stack (like in Forth or PostScript) would greatly
> limit the utility of the format as an IR.
>
> With some restrictions, Forth-style subroutines could be useful as an
> "IR macro" feature though.
> > In the above schemes the return stack is only used for return addresses and looping parameters.
> > The frames are part of the data stack and exist to simplify access to local values
> > So there are four areas where stack items can be mapped to a register file:
> > Global values (at one end of the register file)
> > Frame values
> > Local data stack values (referenced relative to "top" of stack)
> > Return address and frame pointer stack
> >
> > Stack spilling and refilling mechanism needed when register file becomes large.
> > For deeply embedded applications the stack usage can be predetermined.
> > For general purpose computing subroutine nesting depth can exceed, say, 60.
> >
> Yeah. This sounds a fair bit more like Forth or similar than it does
> like JVM or CLR bytecode.
>
> I guess the question then becomes how much sense something Forth-like
> makes as a CPU ISA. I would expect it might make sense to implement the
> stack as a sort of specialized cache:
> It is addressed in a way more like that of a register file;
> Just with relative register addressing.
> Its contents are backed to memory.
>
> So, when the stack position moves too far, this cache will essentially
> evict and reload cache lines as a sort of modular ring. A cache with
> 2x16 lines, each 16 bytes, could hold ~ 64 stack items (if each stack
> item is 64 bits).
>
> Could be bigger if one wants to support things like local variable
> frames, though potentially these could be managed using a separate
> general-purpose data cache rather than a stack-cache.
>
> Less sure the specifics of implementing a core for such an architecture.
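The stack-as-specialized-cache idea above can be sketched in software. This is a minimal model (Python, purely illustrative; the class name, parameters, and backing-store representation are all invented here, not from any real design): a small ring of cache lines backed by memory, which evicts and reloads lines as accesses move beyond what the ring holds.

```python
# Hypothetical model of the stack cache described above: 2x16 lines of
# 16 bytes each, holding ~64 stack items of 64 bits, placed in the ring
# by line number modulo the ring size, with eviction to a backing store.

ITEMS_PER_LINE = 2          # 16-byte line / 8-byte stack item
NUM_LINES = 32              # 2 x 16 lines

class StackCache:
    def __init__(self):
        self.memory = {}                      # backing store: item index -> value
        self.lines = [None] * NUM_LINES       # tag per slot: cached line number
        self.data = [[0, 0] for _ in range(NUM_LINES)]

    def _touch(self, idx):
        line, off = divmod(idx, ITEMS_PER_LINE)
        s = line % NUM_LINES                  # modular-ring placement
        if self.lines[s] != line:             # miss: evict old line, load new one
            if self.lines[s] is not None:
                old = self.lines[s]
                for i in range(ITEMS_PER_LINE):
                    self.memory[old * ITEMS_PER_LINE + i] = self.data[s][i]
            self.data[s] = [self.memory.get(line * ITEMS_PER_LINE + i, 0)
                            for i in range(ITEMS_PER_LINE)]
            self.lines[s] = line
        return s, off

    def write(self, idx, val):
        s, off = self._touch(idx)
        self.data[s][off] = val

    def read(self, idx):
        s, off = self._touch(idx)
        return self.data[s][off]
```

Pushing well past the ~64-item capacity forces evictions, but earlier items remain reachable via the backing store, which is the behavior the post attributes to moving the stack position "too far".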

|> Less sure the specifics of implementing a core for such an architecture.

The current idea for such an architecture is four instruction sizes:
One byte: your typical stack operators (take operands from the top of the data stack and
return results in their place). Subroutines use the return stack for return addresses
and for looping (loop starting address, loop count, etc.).
Two bytes: a stack operation and a stack/frame reference for the second operand, or an immediate.
The result can either be pushed, replace TOS, or go to the stack reference.
Three bytes: a stack operation, a stack/frame reference, and a predicate.
Four bytes: stack/frame references for additional operands.

The goal is maximum coding density with more capable instructions, so that stack
manipulations (DUP, SWAP, DROP, etc.) are not needed. Forthers will probably say good
Forth code doesn't need much stack manipulation; I want simply accessed locals.
The thinking is that a small displacement will suffice for data stack and frame references:
here a five-bit code is used for frame and stack references, and also provides a short immediate.

The last thought is that if the frame and locals area is dropped (by adjusting the data stack and
frame pointers) on subroutine return, then the instructions can be mapped onto a RISC register
file, with possibilities for out-of-order and/or multiple-issue implementation.

The practicality of this scheme rests on not needing more than ~16 frame values and being able to
keep the working environment (via spill and restore) within a ~64 entry register file.
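A pack/unpack sketch of the two-byte form described above (the field layout, widths, and destination-mode encoding here are invented for illustration; the post only specifies "a stack operation and a 5-bit stack/frame reference", so this is one possible arrangement):

```python
# Hypothetical 2-byte instruction: 6-bit opcode, 2-bit destination mode,
# 5-bit stack/frame reference (or short immediate). 13 bits used of 16.

def encode2(opcode, ref5, dest_mode):
    # dest_mode: 0 = push result, 1 = replace TOS, 2 = write to the reference
    assert 0 <= opcode < 64 and 0 <= ref5 < 32 and 0 <= dest_mode < 4
    word = (opcode << 7) | (dest_mode << 5) | ref5
    return word.to_bytes(2, "little")

def decode2(b):
    word = int.from_bytes(b, "little")
    return (word >> 7) & 0x3F, (word >> 5) & 0x3, word & 0x1F
```

The point of the exercise is that a single two-byte instruction can replace a DUP/PICK plus a one-byte operator, which is where the density win over a pure stack encoding would come from.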

Re: a modest proposal (apologies to J. Swift)

<sb22rk$n03$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=18050&group=comp.arch#18050

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: a modest proposal (apologies to J. Swift)
Date: Thu, 24 Jun 2021 08:52:13 -0500
Organization: A noiseless patient Spider
Lines: 217
Message-ID: <sb22rk$n03$1@dont-email.me>
References: <samgjd$cbd$1@dont-email.me> <sao5ar$1cl7$1@gal.iecc.com>
<sapo7q$c9v$1@dont-email.me> <saqctc$2rso$1@gal.iecc.com>
<saqj72$mnk$1@dont-email.me>
<7a9e6be9-e406-467a-93b7-2b14f7d07deen@googlegroups.com>
<saqsq8$r75$1@dont-email.me>
<d705b13b-5b6f-4943-b454-d057a310fb8en@googlegroups.com>
<sat9lt$vvv$1@dont-email.me>
<ca127074-6978-44ae-9b09-8a304b921026n@googlegroups.com>
In-Reply-To: <ca127074-6978-44ae-9b09-8a304b921026n@googlegroups.com>
 by: BGB - Thu, 24 Jun 2021 13:52 UTC

(Got distracted, posting got delayed...)

On 6/22/2021 2:26 PM, MitchAlsup wrote:
> On Tuesday, June 22, 2021 at 1:20:48 PM UTC-5, BGB wrote:
>> On 6/21/2021 3:52 PM, MitchAlsup wrote:
>>> On Monday, June 21, 2021 at 3:28:58 PM UTC-5, BGB wrote:
>>>> On 6/21/2021 1:11 PM, MitchAlsup wrote:
>
>>>> Since most functions need to save and restore similar sets of registers,
>>>> ..., then these sequences are fairly reusable.
>>>>
>>>> So, a typical function prolog becomes:
>>>> Copy LR to R1 (since BSR stomps LR);
>>>> Use 'BSR' to call a previous prolog (to save registers, ...);
>>>> Adjust SP (for local data);
>>>> Function-specific setup (initializing local storage, *).
>>> <
>>> So, you consume 3 instructions and 2 control transfers, where I do it in
>>> 1 instruction. {No need for the MOV, the STOREs are in ENTER as is
>>> the modification of the SP to allocate local storage.}
> <
>> Alternatives would likely require either:
>> An alternate BSR encoding that supports using R1 rather than LR;
> <
> If you get rid of the call, you get rid of blowing R1.
> <

In this case, a plain branch won't work, since it is necessary to get
back to the function which is invoking the reused prolog.

>> The ability to load/store CR's directly (for LR);
>> Negative displacements for Load/Store ops (to save a stack adjustment);
> <
> This is an argument for doing it inside an instruction, you can effectively
> do a bunch of @(-SP) pushes and at the end use the immediate that comes
> with the instruction to allocate local stack space. I found it profitable to
> use 2 lower order bits to indicate if SP gets saved (and restored) and if
> FP gets saved and adjusted. I can do this because the stack must always
> remain double-word aligned.

Possible.

It is possible 32-bit LD/ST encodings could be added which have negative
displacements, but don't exist yet given how rare they are.

....

Hmm, thinking about it, went and switched the 7wnm_ZeoZ Load/Store
encodings from Disp6u to Disp6s. This will allow negative displacements
to be encoded within a 32-bit instruction format (no code exists yet
which will be affected by this change).

As-is, Disp5u encodings are otherwise partially redundant with the
Disp9u encodings.

>> ...
>>
>> Or:
>> Some sort of internal state machine or microcode.
> <
> A sequencer of any ilk that is appropriate.

I guess it can be noted that there isn't anything like this at present.
It could be turned into an interrupt, but doing something like this via
an interrupt would be far too much overhead to be worthwhile.

>>>>
>>>> And, a typical epilog becomes:
>>>> Function-specific teardown;
>>>> Adjust SP (up to save area);
>>>> Branch to previous epilog.
>>> <
>>> Exception for destructors, mine is 1 instruction.
> <
>> Destructors would fall into "function specific teardown", which also
>> includes teardown for __alloca (calling a function which then frees
>> anything allocated by __alloca).
>>
>> Probably 95% of C functions don't need to do anything here.
> <
> It is far closer to 99.95% that do not need it.
> <

Normally, yes...

>>>>
>>>> *: Typically this involves setting up internal pointers for
>>>> structures/arrays, saving function arguments to the stack if needed, or
>>>> (rarely) calling __alloca or similar for large objects or arrays.
>>> <
>>> Their colloquial name is "dope vectors".
>> OK.
>>
>> As can be noted, my compiler tends to represent (larger) structs, and
>> local arrays, as a reference-variable which points to the data area for
>> the struct or array. Function initialization sets up the pointer to
>> point to the data within the stack-frame (if it fits), or calls __alloca
>> (if it is too big).
> <
> If the array size is dynamic.

It may also be done based on size, if the array or structure is
too large to be "safe" to put on the stack.

In this case, heap allocations are assumed to be preferable to stack
overflows, but some code is written to assume a somewhat larger default
stack size (eg: 1MB rather than 128K).

The main way to avoid this cost, though, is to not put large arrays on
the stack and to avoid passing large structures by value.
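The size-based policy above, together with the quoted description of the __alloca wrapper (a per-frame list of heap allocations freed on exit), can be sketched as follows. This is a model of the policy, not the compiler's actual code; the threshold, class, and method names are made up here:

```python
# Sketch of the described compiler policy: local objects below a size
# threshold live in the stack frame; larger ones are heap-allocated and
# chained onto a per-frame list so the epilog can free them.

STACK_LIMIT = 4096   # hypothetical "safe" per-object size

class Frame:
    def __init__(self):
        self.stack_bytes = 0
        self.heap_allocs = []        # stands in for the alloca linked list

    def alloc_local(self, size):
        if size <= STACK_LIMIT:
            off = self.stack_bytes   # reference-variable points into the frame
            self.stack_bytes += size
            return ("stack", off)
        obj = bytearray(size)        # stands in for a malloc() call
        self.heap_allocs.append(obj)
        return ("heap", obj)

    def epilog(self):
        self.heap_allocs.clear()     # stands in for walking the list and free()ing
```

Either way, the function sees a reference variable; only where it points (frame vs. heap) differs, which matches the reference-variable representation described above.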

>>
>> The alloca logic creates a variable to hold a linked-list of
>> "stack-frame" allocations (initialized to NULL), which is used so that
>> they can later be freed (being essentially a thin wrapper over malloc/free).
>>
>> There may also be some initial prolog-level machinery to deal with
>> things like "va_start()" and similar as well.
> <
> This, too, gets wrapped up in the ENTER instruction.
> <
> ENTER R16,R8,sizeof( LocalDataArea)
> <
> pushes R16,R17...R29,R0,R1..R8 onto the stack.
> <
> R1-R8 contained the register bound arguments, the top of stack prior
> to enter contained arguments[8..k] so ENTER created a contiguous
> array of arguments on the stack, addressed at SP+LocalDataArea+14*8
> <
> So, va_start() sets a pointer to this storage, and va_arg() gets the next one
> and advances the pointer.

OK.

In this case, it captures the register arguments into an initial state
for va_list.
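The quoted ENTER scheme can be modeled in a few lines: with the register-bound arguments saved adjacent to the caller-pushed ones, va_arg degenerates to "load and advance". The constants below mirror the quoted SP+LocalDataArea+14*8 base; everything else (names, word size assumptions) is illustrative, not from either poster's implementation:

```python
# Model of the contiguous-argument layout quoted above: ENTER saves the
# 8 register arguments next to arguments[8..k] already on the stack,
# so all arguments form one array at a fixed offset from SP.

SAVED_REGS = 14          # preserved registers below the argument area (per 14*8)
WORD = 8

def arg_offset(local_area, k):
    # byte offset from SP of argument k, per SP + LocalDataArea + 14*8
    return local_area + SAVED_REGS * WORD + k * WORD

def make_va_list(reg_args, stack_args):
    args = list(reg_args) + list(stack_args)   # contiguous after ENTER
    state = {"i": 0}
    def va_arg():                              # "get the next one and advance"
        v = args[state["i"]]
        state["i"] += 1
        return v
    return va_arg
```

This is the property BGB's reply refers to: capturing the register arguments into an initial state makes the va_list a plain pointer walk over one array.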

>>>>
>>>>
>>>> Assuming no cache misses, the 'MOV.X' ops can load or store 128 bits per
>>>> clock cycle, and are typically encoded as 16-bits each. Typically, these
>>>> are between around 4 and 10 load/store instructions (so, ~ 10-22 bytes,
>>>> including a final RTS or RTSU).
>>>>
>>>> The "compressed" sequence is typically 4 or 6 bytes (prolog), or 2 or 4
>>>> bytes (epilog).
>>>>
>>>>
>>>>
>>>> As a recent addition (compiler level), leaf functions may now skip stack
>>>> frame creation, which saves a little more space and also gives a modest
>>>> performance increase.
>>> <
>>> Yep, subroutine overhead as small as 2 control transfers without additional
>>> overhead. {The CALL to get there, the RET to get back}
>> Or, BSR + RTS.
>>>>
>>>> In this case, the prolog is typically 0 bytes (doesn't emit any
>>>> instructions), and the epilog is 2 bytes (a lone RTS instruction).
>>>> This case requires all of the variables in the function to be able to
>>>> fit in scratch registers (they are all statically assigned to a register).
>>>>
>>>>
>>>> A TODO feature is to expand the compiler's register allocator to be able
>>>> to use R32..R63, as there are a lot of leaf functions which may be able
>>>> to fit into 28 scratch registers which don't fit into 12.
>>> <
>>> 90+% (statically) of leaf functions are adequately managed with 16
>>> available (non-preserved) registers. Doubling this (16-32) will not improve
>>> the situation much.
> <
>> From what I could see with the current limit (12 registers, *), the
>> number of leaf functions which could fit was a little lower than this.
>>
>> *: R0, R1, R16, and R17 can't currently be used for variables. However,
>> R1, R16, and R17 may be used as secondary scratch registers.
> <
> I hope this is for better reasons than MIPS used in R2000 and R3000.....
> <
> R0 is the only register with HW defined use--and that is
> a) gets IP return address when a CALL is performed
> b) is an alias for IP when used as a base register
> c) is an alias for zero when used as an index register
> <
> R31 (SP) has a bit that changes how ENTER and EXIT work
> R30 (FP) has a bit that changes how ENTER and EXIT work.

R0: May be stomped without warning if one tries to encode an operation
with an immediate, and the immediate form can't otherwise be encoded for
whatever reason.

Say, assuming Jumbo isn't allowed:
ADD R4, -9999, R6
Might be encoded as if it were:
MOV -9999, R0
ADD R4, R0, R6

Also R0 and R1 are special cases for Load/Store encodings, ... So, can't
be used as normal registers, since if the operation tries to use them as
a base or index for a load, it invokes magic.

Trying to branch to R0 or R1 may also invoke a few special cases:
Branching to R0 is assumed to be an escaped absolute branch;
Branching to R1 behaves as if it were an RTS using R1 as LR.
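The ADD-with-immediate fallback described above can be sketched as an assembler-side rewrite (the immediate field width and mnemonic strings here are illustrative placeholders, not the actual encoding rules):

```python
# Sketch of the described fallback: if an immediate doesn't fit the
# instruction's immediate field and no jumbo form is allowed, load it
# into R0 first and use the register-register form -- which is why R0
# may be stomped without warning.

IMM_BITS = 9   # hypothetical signed immediate field width

def fits_imm(value, bits=IMM_BITS):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi

def encode_add_imm(rs, imm, rd, allow_jumbo=False):
    if fits_imm(imm) or allow_jumbo:
        return [f"ADD R{rs}, {imm}, R{rd}"]
    # fallback: stomp R0 with the constant, then use the 3-register form
    return [f"MOV {imm}, R0", f"ADD R{rs}, R0, R{rd}"]
```

This reproduces the -9999 example from the post: the constant doesn't fit a short immediate field, so the op expands to a MOV into R0 followed by the register form.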

