Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

You can't cheat the phone company.


devel / comp.arch / Re: Split register files

SubjectAuthor
* Split register filesThomas Koenig
+* Re: Split register filesIvan Godard
|`* Re: Split register filesThomas Koenig
| `* Re: Split register filesBrett
|  `* Re: Split register filesThomas Koenig
|   `* Re: Split register filesBrett
|    `* Re: Split register filesBrett
|     `* Re: Split register filesIvan Godard
|      `* Re: Split register filesBrett
|       +* Re: Split register filesIvan Godard
|       |+* Re: Split register filesStefan Monnier
|       ||`* Re: Split register filesIvan Godard
|       || +- Re: Split register filesStephen Fuld
|       || +- Re: Split register filesStefan Monnier
|       || `* Rescue vs scratchpad (was: Split register files)Stefan Monnier
|       ||  `- Re: Rescue vs scratchpad (was: Split register files)Ivan Godard
|       |`* Re: Split register filesBrett
|       | `* Re: Split register filesIvan Godard
|       |  `* Re: Split register filesBrett
|       |   `* Re: Split register filesIvan Godard
|       |    `* Re: Mill conAsm vs genAsm (was: Split register files)Marcus
|       |     `* Re: Mill conAsm vs genAsm (was: Split register files)Ivan Godard
|       |      `* Re: Mill conAsm vs genAsm (was: Split register files)Quadibloc
|       |       +* Re: Mill conAsm vs genAsm (was: Split register files)Ivan Godard
|       |       |+* Re: Mill conAsm vs genAsm (was: Split register files)MitchAlsup
|       |       ||`* Re: Mill conAsm vs genAsm (was: Split register files)Quadibloc
|       |       || +* Re: Mill conAsm vs genAsm (was: Split register files)MitchAlsup
|       |       || |+* Re: Mill conAsm vs genAsm (was: Split register files)Quadibloc
|       |       || ||`* Re: Mill conAsm vs genAsm (was: Split register files)Marcus
|       |       || || `* Re: Mill conAsm vs genAsm (was: Split register files)Quadibloc
|       |       || ||  `* Re: Mill conAsm vs genAsm (was: Split register files)Marcus
|       |       || ||   `* Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    +* Re: Vector ISA CategorisationStephen Fuld
|       |       || ||    |+- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    |`* Re: Vector ISA CategorisationStefan Monnier
|       |       || ||    | `- Re: Vector ISA CategorisationStephen Fuld
|       |       || ||    +* Re: Vector ISA CategorisationMarcus
|       |       || ||    |+* Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    ||`* Re: Vector ISA Categorisationmbitsnbites
|       |       || ||    || +* Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    || |`- Re: Vector ISA CategorisationMarcus
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +* Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || |`* Re: Vector ISA CategorisationIvan Godard
|       |       || ||    || | `- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    || +* Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || |`* Re: Vector ISA CategorisationMarcus
|       |       || ||    || | `- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || `- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    |+- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    |+- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    |+- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    |+* Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    ||+- Re: Vector ISA CategorisationThomas Koenig
|       |       || ||    ||`* Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    || +- Re: Vector ISA CategorisationIvan Godard
|       |       || ||    || `- Re: Vector ISA CategorisationThomas Koenig
|       |       || ||    |+* Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    ||`* Re: Vector ISA CategorisationEricP
|       |       || ||    || +* Re: Vector ISA CategorisationStefan Monnier
|       |       || ||    || |`- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +* Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || |`* Re: Vector ISA CategorisationEricP
|       |       || ||    || | `- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || +- Re: Vector ISA CategorisationQuadibloc
|       |       || ||    || +* Re: Vector ISA CategorisationThomas Koenig
|       |       || ||    || |`* Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || | `* Re: Vector ISA CategorisationThomas Koenig
|       |       || ||    || |  `- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    || `- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    |+- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    |+- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    |+- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    |+- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    |`* Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    | `* Re: Vector ISA CategorisationTerje Mathisen
|       |       || ||    |  `- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    +- Re: Vector ISA CategorisationMitchAlsup
|       |       || ||    +- Re: Vector ISA Categorisationluke.l...@gmail.com
|       |       || ||    `- Re: Vector ISA CategorisationMitchAlsup
|       |       || |`* Re: Mill conAsm vs genAsm (was: Split register files)Quadibloc
|       |       || `* Re: Mill conAsm vs genAsm (was: Split register files)luke.l...@gmail.com
|       |       |`* Re: Mill conAsm vs genAsm (was: Split register files)Paul A. Clayton
|       |       `* Re: Mill conAsm vs genAsmStefan Monnier
|       +* Re: Split register filesStefan Monnier
|       `* Re: Split register filesThomas Koenig
+* Re: Split register filesJohn Dallman
+* Re: Split register filesAnton Ertl
+- Re: Split register filesStefan Monnier
`* Re: Split register filesMitchAlsup

Pages:12345678
Re: Split register files

<sc1t3k$nfi$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18421&group=comp.arch#18421

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Tue, 6 Jul 2021 08:33:07 -0700
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <sc1t3k$nfi$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <jwvr1gbvbv2.fsf-monnier+comp.arch@gnu.org>
<sc1oip$mer$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 6 Jul 2021 15:33:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e80503ea528ee6c9b9614aea1ca1130b";
logging-data="24050"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/V/mE/TTwBZ8TzLwHnmDc1dHKXijgqCzk="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:iWm/UvUxdqgkDS1fdurtnNjrx70=
In-Reply-To: <sc1oip$mer$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Tue, 6 Jul 2021 15:33 UTC

On 7/6/2021 7:15 AM, Ivan Godard wrote:
> On 7/6/2021 6:06 AM, Stefan Monnier wrote:
>>> There's only one belt at present. Rather than two belts, just configure
>>> a belt twice as big.
>>
>> I think his idea is that with 2 belts (and with control over which
>> results go to which belt) you can arrange to put on belt 1 the
>> results that are only needed very shortly, and on belt 2 the results
>> that are needed in the longer term.  If most results are needed only
>> shortly then belt 1 will move faster and belt 2 will indeed store its
>> result longer.
>>
>> Instead, you use the scratchpad which incurs a higher latency.  IIUC you
>> don't care too much about this latency because it should be reasonably
>> easy to schedule your scratchpad saves&loads in advance, so as long as
>> "save+load" doesn't take more time than the number of cycles results
>> stay on the belt (i.e. the length of the belt measured in cycles), then
>> you can still arrange to have the results "at hand" without delay
>> when you need them.
>>
>>
>>          Stefan
>>
>
> As the belt is just an encoding device, it can be judged on the
> compactness it achieves. Belts don't have a "speed"; things last until
> the more recent set exhausts the capacity. So you could partition the
> belt and decide to put short-life things in one and longer-life things
> in the other, by convention. You'd still need a scratchpad for things
> with still longer lives, but the scratch might be less used (and hence
> smaller) because there would be fewer holes in the slow belt than in a
> double-sized fast belt.
>
> However, this is just a way to selectively preserve longer-life belt
> content. We do preservation now with the rescue() operation, which takes
> a bitmask covering the belt space and renumbers the selected operands to
> the front. It's unclear to me whether a 32-long belt with rescues for
> 33+ drop lives is worse than two 16-long belts with rescues for one of
> them. For space they are the same: choose one of 32 (or 2X16) costs five
> bits. For usage, if the average life is 16-32 drops then the 32-long is
> best because with 2X16 the data falls off the quick belt and overfills
> the slow belt.

I am with Ivan on this, based on general principles. If you have two
classes of things that are similar but different, (in this case, a
"fast" and a "slow" belt) you invariably have cases where you want more
of one and fewer of the other. Similar to the separation of registers
into FP and Int, or data and addresses. The rescue operation is a, IIRC
low cost, fine way of handling that problem.

> Needs some thought.

Sure.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Split register files

<jwvy2ajtpwt.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18423&group=comp.arch#18423

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Tue, 06 Jul 2021 11:42:49 -0400
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <jwvy2ajtpwt.fsf-monnier+comp.arch@gnu.org>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me>
<jwvr1gbvbv2.fsf-monnier+comp.arch@gnu.org>
<sc1oip$mer$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="61f89c2afd5faa4df4ca2092461977cd";
logging-data="22217"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19k8ekLP7xvCN+0agp59wT0"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:0uhXcH6nRNUiktE+m19CGTs1FYQ=
sha1:F7R18l6EauBZJFY89NLuKCfN5ZI=
 by: Stefan Monnier - Tue, 6 Jul 2021 15:42 UTC

> As the belt is just an encoding device, it can be judged on the compactness
> it achieves.

Of course, tho there's also the execution cost of interpreting/executing
the rescue/scratchpad/memory operations when we run out of belt/registers.

> However, this is just a way to selectively preserve longer-life belt content.

At least in theory. In practice if many/most of your values need to be
longer lived, then your "slow belt" will end up moving a fast as (or
faster) than your "fast belt". It's not clear that the concept of
having multiple belts would be a win.

Stefan

Rescue vs scratchpad (was: Split register files)

<jwvsg0rtpms.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18426&group=comp.arch#18426

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Rescue vs scratchpad (was: Split register files)
Date: Tue, 06 Jul 2021 11:46:09 -0400
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <jwvsg0rtpms.fsf-monnier+comp.arch@gnu.org>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me>
<jwvr1gbvbv2.fsf-monnier+comp.arch@gnu.org>
<sc1oip$mer$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="61f89c2afd5faa4df4ca2092461977cd";
logging-data="22217"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+BTT/b9tzY5YdM/y4N+DBr"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:pqCR1kVuoZU2KgYSQuX87wC9CK8=
sha1:qH0vDtNM4+eB71d+YqTWEdIH0/w=
 by: Stefan Monnier - Tue, 6 Jul 2021 15:46 UTC

> However, this is just a way to selectively preserve longer-life belt
> content. We do preservation now with the rescue() operation, which takes
> a bitmask covering the belt space and renumbers the selected operands to the
> front.

Reminds me: how does the compiler choose between using `rescue` to
extend a value's lifetime in the belt vs putting that value in
the scratchpad?

Stefan

Re: Rescue vs scratchpad (was: Split register files)

<sc2i1t$5mn$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18435&group=comp.arch#18435

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Rescue vs scratchpad (was: Split register files)
Date: Tue, 6 Jul 2021 14:30:36 -0700
Organization: A noiseless patient Spider
Lines: 48
Message-ID: <sc2i1t$5mn$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <jwvr1gbvbv2.fsf-monnier+comp.arch@gnu.org>
<sc1oip$mer$1@dont-email.me> <jwvsg0rtpms.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 6 Jul 2021 21:30:37 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c0df05201c996276f1a63add5d1e516f";
logging-data="5847"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18jJi+ETHqeerz79zvvHcMR"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:yC9c2JEKVLI9AkwDi04gx57Ewzg=
In-Reply-To: <jwvsg0rtpms.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Tue, 6 Jul 2021 21:30 UTC

On 7/6/2021 8:46 AM, Stefan Monnier wrote:
>> However, this is just a way to selectively preserve longer-life belt
>> content. We do preservation now with the rescue() operation, which takes
>> a bitmask covering the belt space and renumbers the selected operands to the
>> front.
>
> Reminds me: how does the compiler choose between using `rescue` to
> extend a value's lifetime in the belt vs putting that value in
> the scratchpad?
>
>
> Stefan
>

Some cases are forced. For example, if there is complex control flow
between production and consumption (say produced in a loop prologue and
consumed in the epilogue) then you don't want to be rescuing endlessly
around the loop, and will spill to scratch. Or if the total live exceeds
the fixed belt length then you must spill some, choosing those with the
longest gap before next use.

Otherwise it's just a heuristic: cost of chained rescues, vs. cost of
spill, fill, and scratch space. The specializer looks at the largest gap
(in units of drops, not cycles) between producer and first consumer, or
between pairs of consecutive uses, and if that exceeds a per-member spec
then it spills, otherwise if it's out-of-reach it rescues. But a rescue
rescues everything live at its point of insertion, which alters the gaps
of all of the drops whose life crosses the rescue. Recompute who has the
longest gap, rinse, repeat. It's a natural for a heap-of-ranges data
structure, but it doesn't need to hold gaps that are or become in range
so size tends to be small. I suppose it's nlogn for the out-of-range count.

There's probably a better algorithm for pure rescue insert: given a set
of possibly overlapping ranges, find a set of slices through the set
that reduces all the range fragments to below a fixed length with a
minimal number of slices. Optimal sounds NP to me though - Anton?

Phasing complicates things: "if (A[i+j].p = F(G(a+b, c-d), e^0xf00f00))
{...}" is all one bundle (and one cycle), so you can get intra-bundle
out-of-range or belt overflow too. However, rescue is only possible at
the bundle boundary, while fills are early phase and spills are late
phase, so you may have to bundle bust. (?FILL?, CON, ADD, ADD, SUB,
XOR, CALL, CALL, STORE, BR, ?SPILL?, ?RESCUE?).

Currently the algorithm does not consider scratchpad pressure, though
having to spill to memory instead of scratch is real expensive. Some day
there will be an intern or grad stewey who will have fun with that.

Re: Split register files

<sc4bab$8f2$1@newsreader4.netcologne.de>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18464&group=comp.arch#18464

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 13:47:55 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sc4bab$8f2$1@newsreader4.netcologne.de>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
Injection-Date: Wed, 7 Jul 2021 13:47:55 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:30c8:0:7285:c2ff:fe6c:992d";
logging-data="8674"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 7 Jul 2021 13:47 UTC

Brett <ggtgp@yahoo.com> schrieb:

> I have used architectures with scratchpads or that could configure half the
> cache as scratchpad, yuck.
> And I have the talent to use such, for the average programer the idea is a
> joke.

That should be a question for the compiler writer, not the
application programmer.

Re: Split register files

<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18466&group=comp.arch#18466

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 07 Jul 2021 10:03:24 -0400
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="3b21765f55704cc8ad81c1b07b18a855";
logging-data="30410"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/lAxwuAe+2KaLFMoZjmeXY"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:wtcXmpUdO311H6WD6Hy2mD6nwhw=
sha1:SNiT1keBKNbz4P0IVtH91ENzliA=
 by: Stefan Monnier - Wed, 7 Jul 2021 14:03 UTC

Thomas Koenig [2021-07-07 13:47:55] wrote:
> Brett <ggtgp@yahoo.com> schrieb:
>> I have used architectures with scratchpads or that could configure half the
>> cache as scratchpad, yuck.
>> And I have the talent to use such, for the average programer the idea is a
>> joke.
> That should be a question for the compiler writer, not the
> application programmer.

The kind of scratchpad he's referring to is "out of reach" of most
compilers: they're not up to the task of deciding what to put into it
and what not, so the problem becomes the responsability of the
"application programmer".

The Mill's scratchpad share the same name but are a different concept.

Stefan

Re: Split register files

<8dkFI.8$Pn7.1@fx16.iad>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18469&group=comp.arch#18469

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Split register files
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me> <sc4bab$8f2$1@newsreader4.netcologne.de> <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
In-Reply-To: <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 31
Message-ID: <8dkFI.8$Pn7.1@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 07 Jul 2021 16:01:08 UTC
Date: Wed, 07 Jul 2021 12:00:53 -0400
X-Received-Bytes: 2355
 by: EricP - Wed, 7 Jul 2021 16:00 UTC

Stefan Monnier wrote:
> Thomas Koenig [2021-07-07 13:47:55] wrote:
>> Brett <ggtgp@yahoo.com> schrieb:
>>> I have used architectures with scratchpads or that could configure half the
>>> cache as scratchpad, yuck.
>>> And I have the talent to use such, for the average programer the idea is a
>>> joke.
>> That should be a question for the compiler writer, not the
>> application programmer.
>
> The kind of scratchpad he's referring to is "out of reach" of most
> compilers: they're not up to the task of deciding what to put into it
> and what not, so the problem becomes the responsability of the
> "application programmer".
>
> The Mill's scratchpad share the same name but are a different concept.
>
>
> Stefan

Seems to me that it is similar to the register allocator with an
extra level, in that there is a non-uniform cost between belt rescue,
scratchpad and stack memory. So it should be a cost minimization problem,
also taking into account the lifetime of values (most are single read),
and that temp values across subroutine calls must live in scratchpad
or stack memory (no non-arg belt slots live across calls).

Its reminiscent of VLIW in that belt values are "in flight" forwarded
values, scratchpad is a "large register set" and stack as usual.

Belt "allocation" and Mill calling convention (was: Split register files)

<jwv8s2iqexx.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18471&group=comp.arch#18471

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Belt "allocation" and Mill calling convention (was: Split register files)
Date: Wed, 07 Jul 2021 12:35:03 -0400
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <jwv8s2iqexx.fsf-monnier+comp.arch@gnu.org>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org> <8dkFI.8$Pn7.1@fx16.iad>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="3b21765f55704cc8ad81c1b07b18a855";
logging-data="3063"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18NMV6/3uIJ583tn6A6anzW"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:vZdtGfSOOoKOMNBcIdKPp6356l8=
sha1:8Fib05CiE+Ll1/SAqFZMHUBdl/8=
 by: Stefan Monnier - Wed, 7 Jul 2021 16:35 UTC

EricP [2021-07-07 12:00:53] wrote:
> Stefan Monnier wrote:
>> Thomas Koenig [2021-07-07 13:47:55] wrote:
>>> Brett <ggtgp@yahoo.com> schrieb:
>>>> I have used architectures with scratchpads or that could configure half the
>>>> cache as scratchpad, yuck.
>>>> And I have the talent to use such, for the average programer the idea is a
>>>> joke.
>>> That should be a question for the compiler writer, not the
>>> application programmer.
>> The kind of scratchpad he's referring to is "out of reach" of most
>> compilers: they're not up to the task of deciding what to put into it
>> and what not, so the problem becomes the responsability of the
>> "application programmer".
>> The Mill's scratchpad share the same name but are a different concept.
> Seems to me that it is similar to the register allocator with an
> extra level, in that there is a non-uniform cost between belt rescue,
> scratchpad and stack memory.

Indeed, using `rescue` could be called "belt allocation" in that it's
a similar problem to register allocation, yet it's a bit different, so
it's not immediately obvious what algorithm to use to approximate the
optimal solution.

Ivan's answer suggests that they currently use a bunch of heuristics
that are hoped to work well but without having a clear model of a
global optimum.

> and that temp values across subroutine calls must live in scratchpad
> or stack memory (no non-arg belt slots live across calls).

That's not my understanding. According to the slides of the "belt
talk", the caller's belt is automatically preserved during a function
call, and the function call just "drops" its result(s) onto the caller's
belt exactly like a normal op would do.
So temp values can definitely live on the belt across calls.

Stefan

Re: Split register files

<sc4n1k$h30$1@newsreader4.netcologne.de>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18472&group=comp.arch#18472

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 17:08:04 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sc4n1k$h30$1@newsreader4.netcologne.de>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
Injection-Date: Wed, 7 Jul 2021 17:08:04 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:30c8:0:7285:c2ff:fe6c:992d";
logging-data="17504"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 7 Jul 2021 17:08 UTC

Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
> Thomas Koenig [2021-07-07 13:47:55] wrote:
>> Brett <ggtgp@yahoo.com> schrieb:
>>> I have used architectures with scratchpads or that could configure half the
>>> cache as scratchpad, yuck.
>>> And I have the talent to use such, for the average programer the idea is a
>>> joke.
>> That should be a question for the compiler writer, not the
>> application programmer.
>
> The kind of scratchpad he's referring to is "out of reach" of most
> compilers: they're not up to the task of deciding what to put into it
> and what not, so the problem becomes the responsability of the
> "application programmer".

There are only two options, then: Either compiler technology is
improved so this works, or the concept is dead.

How do we know that compilers are not up to it? Did anybody try to
throw a relevant machine description at one of the major compilers?
I would suspect that putting registers which do nothing, but which
have low-cost register moves to registers which do something,
could already provide something reasonable (iff that scratchpad
is cheaper than using a cache).

Of course, there should be an API to determine when could use what.

> The Mill's scratchpad share the same name but are a different concept.

So I gathered.

Re: Split register files

<jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18473&group=comp.arch#18473

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 07 Jul 2021 13:38:20 -0400
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="3b21765f55704cc8ad81c1b07b18a855";
logging-data="28236"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/BR6dIXG9nGzVs3B+Usiue"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:lNB3DtMlbE68Sdcw4XLwCARr4yY=
sha1:dY7se8spOcOGCaqCsRIO4YlkVTw=
 by: Stefan Monnier - Wed, 7 Jul 2021 17:38 UTC

Thomas Koenig [2021-07-07 17:08:04] wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
>> Thomas Koenig [2021-07-07 13:47:55] wrote:
>>> Brett <ggtgp@yahoo.com> schrieb:
>>>> I have used architectures with scratchpads or that could configure half the
>>>> cache as scratchpad, yuck.
>>>> And I have the talent to use such, for the average programer the idea is a
>>>> joke.
>>> That should be a question for the compiler writer, not the
>>> application programmer.
>> The kind of scratchpad he's referring to is "out of reach" of most
>> compilers: they're not up to the task of deciding what to put into it
>> and what not, so the problem becomes the responsability of the
>> "application programmer".
> There are only two options, then: Either compiler technology is
> improved so this works, or the concept is dead.

And yet, the concept has existed and keeps existing.
It's typically used in two circumstances:
- Boot time, before the external memory can be used because it hasn't
yet been configured.
- Embedded systems, where the "application programmer" does the job of
optimizing placement, and in return gets more predictable memory
access times.

> I would suspect that putting registers which do nothing, but which
> have low-cost register moves to registers which do something,
> could already provide something reasonable (iff that scratchpad
> is cheaper than using a cache).

AFAIK it usually has the same latency when used as a scratchpad as when
used as a cache.

Stefan

Re: Split register files

<sc4sne$l6p$1@newsreader4.netcologne.de>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18474&group=comp.arch#18474

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 18:45:02 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sc4sne$l6p$1@newsreader4.netcologne.de>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de>
<jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
Injection-Date: Wed, 7 Jul 2021 18:45:02 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-30c8-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:30c8:0:7285:c2ff:fe6c:992d";
logging-data="21721"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 7 Jul 2021 18:45 UTC

Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
> Thomas Koenig [2021-07-07 17:08:04] wrote:

>> I would suspect that putting registers which do nothing, but which
>> have low-cost register moves to registers which do something,
>> could already provide something reasonable (iff that scratchpad
>> is cheaper than using a cache).
>
> AFAIK it usually has the same latency when used as a scratchpad as when
> used as a cache.

Then it is not interesting. Values spilled to the stack will end
up in the L1 cache anyway, and if there is no latency advantage,
there is no point. You just lose adressing modes and flexibility.

I could understand some architecture putting in a larger number of
registers that can be accessed at register speeds (single-cycle
latency), but that is apparently not what is being discussed.

Re: Split register files

<sc4t71$vrf$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18475&group=comp.arch#18475

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 11:53:19 -0700
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <sc4t71$vrf$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de>
<jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
<sc4sne$l6p$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 7 Jul 2021 18:53:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9203ba4c790133137110bf012d207374";
logging-data="32623"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18zo6hrBpYE51gST0DXOmFLRSCG9793L0o="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:xfteKQiHEZjgHh7iKYoeuRLnq6w=
In-Reply-To: <sc4sne$l6p$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Stephen Fuld - Wed, 7 Jul 2021 18:53 UTC

On 7/7/2021 11:45 AM, Thomas Koenig wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
>> Thomas Koenig [2021-07-07 17:08:04] wrote:
>
>>> I would suspect that putting registers which do nothing, but which
>>> have low-cost register moves to registers which do something,
>>> could already provide something reasonable (iff that scratchpad
>>> is cheaper than using a cache).
>>
>> AFAIK it usually has the same latency when used as a scratchpad as when
>> used as a cache.
>
> Then it is not interesting. Values spilled to the stack will end
> up in the L1 cache anyway, and if there is no latency advantage,
> there is no point. You just lose adressing modes and flexibility.

There are embedded applications that have a required response time to
some event. That is, when that event occurs, you must respond within a
specified amount of time. By having a fixed amount of storage that you
know will respond within that time, you can guarantee meeting the
timing. If you put it in a cache, you risk getting a cache miss on that
data, and thus missing timing.

> I could understand some architecture putting in a larger number of
> registers that can be accessed at register speeds (single-cycle
> latency), but that is apparently not what is being discussed.

Right.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Split register files

<sc5cg5$a3p$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18482&group=comp.arch#18482

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 23:14:13 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <sc5cg5$a3p$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me>
<sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 7 Jul 2021 23:14:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="1abc7aa5fa33377c55f7b13b2e49edb9";
logging-data="10361"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Qz6zoObswjFHeAaDc2FVr"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:WMHxfHGrrnWUSY/IXsSa30fuYUo=
sha1:E4z+WG6Lz/sRGn7Rgp9qy8GzR/c=
 by: Brett - Wed, 7 Jul 2021 23:14 UTC

Ivan Godard <ivan@millcomputing.com> wrote:
> On 7/6/2021 1:04 AM, Brett wrote:
>> Ivan Godard <ivan@millcomputing.com> wrote:
>>> On 7/5/2021 12:15 AM, Brett wrote:
>>>> Brett <ggtgp@yahoo.com> wrote:
>>>>> Thomas Koenig <tkoenig@netcologne.de> wrote:
>>>>>> Brett <ggtgp@yahoo.com> schrieb:
>>>>>>> Thomas Koenig <tkoenig@netcologne.de> wrote:
>>>>>>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>>>>>>> On 6/26/2021 2:32 AM, Thomas Koenig wrote:
>>>>>>>>
>>>>>>>>> See the Texas Instruments C64
>>>>>>>>
>>>>>>>> That's rather interesting, thanks!
>>>>>>>>
>>>>>>>> Looks rather similar to what I had in mind, except that they allow
>>>>>>>> at least some cross-operation between the different register files,
>>>>>>>> and they had 16 registers (later 32) per register file.
>>>>>>>>
>>>>>>>
>>>>>>> I like it, however I would burn two instruction bits for the two banks.
>>>>>>> So you can say bank 0, bank 1, both banks, sequential load to both, etc.
>>>>>>
>>>>>> What would you appy them to? Assuming a three-register instruction,
>>>>>> (and assuming that r0-15 are in the first bank and r16-r31 in the
>>>>>> second) would you then be able to write
>>>>>>
>>>>>> add r2,r17,r22
>>>>>
>>>>> Add.12 r2,r17,r22 ; both banks run same instruction.
>>>>>
>>>>> A loop unroll of one is just tagging the opcode to run on both banks.
>>>>> Which gives good code density.
>>>>>
>>>>> Add.1s r2,r17,r22 ; add bank 1 but splat result to both banks.
>>>>>
>>>>> Also would do load pair splat, so banks get sequential items for loop
>>>>> unroll use, which saves an address register in the second bank from having
>>>>> to be a clone plus 1 and other grief. The saved register could then be your
>>>>> loop index, etc.
>>>>>
>>>>> You get loop unrolling without doubling your rename width or ports, both of
>>>>> which are critical limits to scaling.
>>>>>
>>>>>> (so the bank 1 ALU presumably does the calculation and sends over
>>>>>> the result to bank 0)?
>>>>>>
>>>>>> Also, this would use one bit only if you have two banks. I was
>>>>>> envisioning more than two, then the advantage in encoding bits
>>>>>> starts to disappear.
>>>>>
>>>>> You can use 2 bank bits and 1 splat or other bit, cheaper than three 6 bit
>>>>> register specifiers. 2+1+(3*4) = 15, verses 3*6 = 18.
>>>>>
>>>>> With four banks the four common choices are; bank 1, all, alternate primary
>>>>> bank, and all but primary bank. Which is better than my 2 bank mask bits in
>>>>> the same opcode space.
>>>>>
>>>>> You can expand this to 3 bits, or the full 4 bit mask each bank which is
>>>>> wasteful. There is also the issue of splat variants if any, which may
>>>>> better be handled by other opcodes. There will be a copy bank register
>>>>> instruction that handles all the tough cases.
>>>>
>>>> I have thought about this some more and how it overlaps vector registers
>>>> but with more flexibility and have decided that 5 banks of 16 registers is
>>>> best so that one can do away with vector registers since you have that
>>>> functionality.
>>>>
>>>> All addressing is done in the first two banks, and the other banks are for
>>>> unrolling and long calculations that need more than 16 registers. Only the
>>>> last four banks support floating point and vector type operations. All
>>>> registers are 64 bits.
>>>>
>>>> Basically a sort of super upgraded RISC 68000, without the crap instruction
>>>> set arch and limits. You can’t do addressing in vector registers, but can
>>>> with this arch.
>>>>
>>>> You could also do a Mill variant of this arch, and with 5 belts you can
>>>> dump the evil kluge visible scratchpad resulting in a nicer cleaner
>>>> architecture.
>>>
>>> Code can have unbounded numbers of concurrently live operands that have
>>> to be put somewhere. If you don't have a scratchpad then you have to be
>>> prepared to stash to memory. Even if you do have a scratchpad you have
>>> to be prepared fo it to overflow and need stashing to memory too.
>>> Scratchpad, like registers, is just an optimization of memory for
>>> certain common special cases.
>>
>> I have used architectures with scratchpads or that could configure half the
>> cache as scratchpad, yuck.
>> And I have the talent to use such, for the average programer the idea is a
>> joke.
>>
>> The age of two way caches that suffer aliasing issues is over, use the
>> cache as God intended.
>>
>> Of course in your case the compiler is using the scratchpad, and it may
>> give you more bandwidth, and that bandwidth costs less power than the
>> cache.
>>
>> But I still think a scratchpad is an over complicated kludge and thus you
>> are living in the past. More importantly it will scare off mediocre
>> programers. The belt is scary enough as is.
>>
>> On an OS call the scratchpad can leak information unless cleared, a small
>> performance liability.
>
> Calls, to the OS or otherwise, get a whole new scratchpad (or so it
> appears), courtesy the hardware spiller.
>
>> The real problem is that you only have one virtual belt, as that makes
>> opcodes smaller and seemingly makes things simpler. But not all problems
>> fit in one belt, fat code blows apart a single belt. Thus a scratchpad to
>> shoehorn more code with good performance. Fat code purely from cache would
>> perform badly.
>
> Yes; hence the scratchpad.
>
>> You also have compiler issues, which makes handling more belts difficult.
>> It is easy to say that one bank should be used for globals and thus rarely
>> rotate, another for addressing and others for compute, but convincing a
>> compiler to do something intelligent is really hard.
>
> There's only one belt at present. Rather than two belts, just configure
> a belt twice as big.
>
>> A loop has a limited number of dependent compute chains, it should be
>> possible to just randomly assign opcode chains to belts. Filling belts
>> roughly evenly over time.
>
> Too much fanout in open code, although FP codes are better; you would
> wind up doing a lot of inter-belt transfers because the dataflow is
> tree-like rather than linear.

Integer trees are so short they are forks. No arch can help such junk.
Addressing is linear and makes up a third of the ops.

>> You have some secrete sauce still I bet to get to 32 ops a cycle, how you
>> handle vector support spreading across units? Etc. Maybe you have lots of
>> belts hiding under the virtual belt, and thus are doing what I am
>> suggesting.
>
> Nope. Remember, the belt is a naming device, not a shift register.

It can be hard to wrap ones brain around doing 32 ops a cycle when you have
16 belt registers.

An 8 way unroll with lots of ALU’s can pull this off.

Which means that after your virtual belt get translated to operations you
have lots of little independent chains which are little belts. Most call
this OoO, but you can order the opcodes sequentially, but doing so is not
belt friendly as each data step is 8 opcodes away on a 8 way unroll.

You need a form of OoO somewhere when the ops hit the hardware, and this is
what is confusing me.

>> It’s been years since I read your docs, I apologize for any
>> mischaracterizations I have made. My mental model of your architecture is
>> undoubtedly wrong.
>>
>> I thank you for your posts, they are enlightening.

Re: Split register files

<sc5db6$ed1$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18483&group=comp.arch#18483

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 16:28:36 -0700
Organization: A noiseless patient Spider
Lines: 46
Message-ID: <sc5db6$ed1$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org> <8dkFI.8$Pn7.1@fx16.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 7 Jul 2021 23:28:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7fa8f77aec8a753dfe786d0cb1966a9f";
logging-data="14753"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/xHmRfIZ6SlKiGEShmnOCi"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:/XVbrosOmGvbWMjkbvhRJnhnsD8=
In-Reply-To: <8dkFI.8$Pn7.1@fx16.iad>
Content-Language: en-US
 by: Ivan Godard - Wed, 7 Jul 2021 23:28 UTC

On 7/7/2021 9:00 AM, EricP wrote:
> Stefan Monnier wrote:
>> Thomas Koenig [2021-07-07 13:47:55] wrote:
>>> Brett <ggtgp@yahoo.com> schrieb:
>>>> I have used architectures with scratchpads or that could configure
>>>> half the
>>>> cache as scratchpad, yuck.
>>>> And I have the talent to use such, for the average programer the
>>>> idea is a
>>>> joke.
>>> That should be a question for the compiler writer, not the
>>> application programmer.
>>
>> The kind of scratchpad he's referring to is "out of reach" of most
>> compilers: they're not up to the task of deciding what to put into it
>> and what not, so the problem becomes the responsability of the
>> "application programmer".
>>
>> The Mill's scratchpad share the same name but are a different concept.
>>
>>
>>         Stefan
>
> Seems to me that it is similar to the register allocator with an
> extra  level, in that there is a non-uniform cost between belt rescue,
> scratchpad and stack memory. So it should be a cost minimization problem,
> also taking into account the lifetime of values (most are single read),
> and that temp values across subroutine calls must live in scratchpad
> or stack memory (no non-arg belt slots live across calls).
>
> Its reminiscent of VLIW in that belt values are "in flight" forwarded
> values, scratchpad is a "large register set" and stack as usual.

Close but no cigar :-)

Belt values *do* persist across calls, they just don't persist *into*
calls except for explicit arguments (which are reordered to signature
order). Scratch is the same, except no called arguments are passed in
scratch.

Yes, the task is very like a register allocator for a non-uniform
register set. For example, do you put a value in one register or a
different one, and decide based on whether the one register will be need
for an upcoming instruction that needs a register pair including that one?

And yes, it's a cost minimization problem.

Re: Belt "allocation" and Mill calling convention (was: Split register files)

<sc5dlk$g5i$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18484&group=comp.arch#18484

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Belt "allocation" and Mill calling convention (was: Split
register files)
Date: Wed, 7 Jul 2021 16:34:11 -0700
Organization: A noiseless patient Spider
Lines: 48
Message-ID: <sc5dlk$g5i$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org> <8dkFI.8$Pn7.1@fx16.iad>
<jwv8s2iqexx.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 7 Jul 2021 23:34:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7fa8f77aec8a753dfe786d0cb1966a9f";
logging-data="16562"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/qSlDlUkg9gaEvddY92alw"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:QQ7zs/ftI64oQAWlWlxctu2bWIo=
In-Reply-To: <jwv8s2iqexx.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Wed, 7 Jul 2021 23:34 UTC

On 7/7/2021 9:35 AM, Stefan Monnier wrote:
> EricP [2021-07-07 12:00:53] wrote:
>> Stefan Monnier wrote:
>>> Thomas Koenig [2021-07-07 13:47:55] wrote:
>>>> Brett <ggtgp@yahoo.com> schrieb:
>>>>> I have used architectures with scratchpads or that could configure half the
>>>>> cache as scratchpad, yuck.
>>>>> And I have the talent to use such, for the average programer the idea is a
>>>>> joke.
>>>> That should be a question for the compiler writer, not the
>>>> application programmer.
>>> The kind of scratchpad he's referring to is "out of reach" of most
>>> compilers: they're not up to the task of deciding what to put into it
>>> and what not, so the problem becomes the responsability of the
>>> "application programmer".
>>> The Mill's scratchpad share the same name but are a different concept.
>> Seems to me that it is similar to the register allocator with an
>> extra level, in that there is a non-uniform cost between belt rescue,
>> scratchpad and stack memory.
>
> Indeed, using `rescue` could be called "belt allocation" in that it's
> a similar problem to register allocation, yet it's a bit different, so
> it's not immediately obvious what algorithm to use to approximate the
> optimal solution.
>
> Ivan's answer suggests that they currently use a bunch of heuristics
> that are hoped to work well but without having a clear model of a
> global optimum.

It's pretty clear that optimum is bin-packing and hence NP. The
specializer is intended to be used as a very fast install-time tool, and
can't afford NP, no more than a JIT (with a similar place in the
process) can afford it. Hence, heuristics for both.

We do not have a compile-time batch version of the specializer intended
for when NP algorithms are desirable to get optimum. If market demands,
we will.

>> and that temp values across subroutine calls must live in scratchpad
>> or stack memory (no non-arg belt slots live across calls).
>
> That's not my understanding. According to the slides of the "belt
> talk", the caller's belt is automatically preserved during a function
> call, and the function call just "drops" its result(s) onto the caller's
> belt exactly like a normal op would do.
> So temp values can definitely live on the belt across calls.

Yes

Re: Split register files

<sc5fh8$p7q$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18485&group=comp.arch#18485

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Wed, 7 Jul 2021 17:05:59 -0700
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <sc5fh8$p7q$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 8 Jul 2021 00:06:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7fa8f77aec8a753dfe786d0cb1966a9f";
logging-data="25850"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19oRaCuXy60b4zq6WdgKx/0"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:6rTPaYLkuRZuGGsSMbYi1EF3G70=
In-Reply-To: <sc5cg5$a3p$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Thu, 8 Jul 2021 00:05 UTC

On 7/7/2021 4:14 PM, Brett wrote:
> Ivan Godard <ivan@millcomputing.com> wrote:
>> On 7/6/2021 1:04 AM, Brett wrote:

<snip>

>>> You also have compiler issues, which makes handling more belts difficult.
>>> It is easy to say that one bank should be used for globals and thus rarely
>>> rotate, another for addressing and others for compute, but convincing a
>>> compiler to do something intelligent is really hard.
>>
>> There's only one belt at present. Rather than two belts, just configure
>> a belt twice as big.
>>
>>> A loop has a limited number of dependent compute chains, it should be
>>> possible to just randomly assign opcode chains to belts. Filling belts
>>> roughly evenly over time.
>>
>> Too much fanout in open code, although FP codes are better; you would
>> wind up doing a lot of inter-belt transfers because the dataflow is
>> tree-like rather than linear.
>
> Integer trees are so short they are forks. No arch can help such junk.
> Addressing is linear and makes up a third of the ops.

To simplify an example:
source: a+b+c+d
breadth-first: (a+b)+(c+d) 2 cycles, max belt live load 2
depth-first: (a+(b+(c+d))) 3 cycles, max belt live load 1

constraints:
belt live load cannot exceed belt length;
belt delta (distance in drops between producer and consumer) cannot
exceed belt length without rescue or spill/fill
no NP

BTW, a graph-coloring register allocator is also NP.

>>> You have some secrete sauce still I bet to get to 32 ops a cycle, how you
>>> handle vector support spreading across units? Etc. Maybe you have lots of
>>> belts hiding under the virtual belt, and thus are doing what I am
>>> suggesting.
>>
>> Nope. Remember, the belt is a naming device, not a shift register.
>
> It can be hard to wrap ones brain around doing 32 ops a cycle when you have
> 16 belt registers.

Not possible, unless the op arguments were shared from 16 available and
all had 16 or fewer drops total (neglecting phasing). As a rule of
thumb, a Mill member is configured such that <slot count> == <belt
length>. The larger members (Gold has 32 belt length) rarely need to
rescue, although they may have occasional obligatory spills, while
smaller ones (Copper has 8 belt length) are replete with rescues on the
same code.

> An 8 way unroll with lots of ALU’s can pull this off.

Nope.

> Which means that after your virtual belt get translated to operations you
> have lots of little independent chains which are little belts. Most call
> this OoO, but you can order the opcodes sequentially, but doing so is not
> belt friendly as each data step is 8 opcodes away on a 8 way unroll.

Software pipelining rather than unroll, but yes: the belt must be long
enough to hold the un-piped placement times the piping.In practical
configs the constraint to piping limiting factor is the number of FUs
you have of whatever is the most constraining kind. If your loop body
has a multiply and you have two multipliers you can't pipe it into a
single 3-pipe bundle.

Re: Scratchpads (was: Split register files)

<sc65fd$197$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18488&group=comp.arch#18488

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Scratchpads (was: Split register files)
Date: Thu, 8 Jul 2021 08:20:28 +0200
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <sc65fd$197$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<jwvwnq3vc90.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 8 Jul 2021 06:20:29 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="aee15b0c6682d2c7fc3bd87d6c28d548";
logging-data="1319"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19wWAzRilJ5BhhpDoK9TmoVzjK2Cf+JVfs="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:FgMBFm6WH2Ejjq69vOXmW3MY6CA=
In-Reply-To: <jwvwnq3vc90.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Marcus - Thu, 8 Jul 2021 06:20 UTC

On 2021-07-06, Stefan Monnier wrote:
>> I have used architectures with scratchpads or that could configure half the
>> cache as scratchpad, yuck.
>
> I don't think these scratchpads are very much like the Mill's scratchpad.
>
> I suspect the "yuck" above refers to the problems you end up having of
> administering this scratchpad, sharing it between unrelated functions.
>
> Mill's scratchpad is more like a CPU-supported notion of frame
> activation record. Every time you enter a function you get a fresh new
> scratchpad and when you return from a function, the scratchpad is thrown
> away and the caller recovers instead the scratchpad it had before the
> call (scrachpads get "pushed on/popped off the stack" behind the scene).
>
> So it's easy for programmers and compilers to use it as a kind of "slow
> register file with register-windows".

I've been toying with the idea of having a small scratchpad at a fixed
memory location in the MRISC32 for the sake of doing things like vector
register permutations (a la POWER vperm): Since the machine has
scatter/gather it is trivial to do permutations via memory in just two
instructions (a linear store + a gather load), and since the idea is
that the vector registers should be long enough to hide any pipeline
latencies the permutation operation basically has the same cost as two
back-to-back ALU operations, /provided/ that there are no cache misses.

The size of a vector register is 64+ bytes (implementation dependent),
so the memory requirement for a single register is typically more than
one cache line.

So, what do you people think? Would a HW scratchpad have a performance
advantage over, say, using the local stack as scratchpad area for the
purpose of doing vector permutations and similar?

One obvious advantage would be that the DCache is not polluted.

/Marcus

Re: Scratchpads (was: Split register files)

<sc677g$aeh$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18489&group=comp.arch#18489

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Scratchpads (was: Split register files)
Date: Wed, 7 Jul 2021 23:50:23 -0700
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <sc677g$aeh$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<jwvwnq3vc90.fsf-monnier+comp.arch@gnu.org> <sc65fd$197$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 8 Jul 2021 06:50:25 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e86c943a1bed7cd0d127ab445dcf1243";
logging-data="10705"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+URVww6wS9+YIs0EQnkUV15meoRzQEpr8="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:lCmgaToTS2k0vqP+Bx7E0Fb3oLk=
In-Reply-To: <sc65fd$197$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Thu, 8 Jul 2021 06:50 UTC

On 7/7/2021 11:20 PM, Marcus wrote:
> On 2021-07-06, Stefan Monnier wrote:
>>> I have used architectures with scratchpads or that could configure
>>> half the
>>> cache as scratchpad, yuck.
>>
>> I don't think these scratchpads are very much like the Mill's scratchpad.
>>
>> I suspect the "yuck" above refers to the problems you end up having of
>> administering this scratchpad, sharing it between unrelated functions.
>>
>> Mill's scratchpad is more like a CPU-supported notion of frame
>> activation record.  Every time you enter a function you get a fresh new
>> scratchpad and when you return from a function, the scratchpad is thrown
>> away and the caller recovers instead the scratchpad it had before the
>> call (scrachpads get "pushed on/popped off the stack" behind the scene).
>>
>> So it's easy for programmers and compilers to use it as a kind of "slow
>> register file with register-windows".
>
> I've been toying with the idea of having a small scratchpad at a fixed
> memory location in the MRISC32 for the sake of doing things like vector
> register permutations (a la POWER vperm): Since the machine has
> scatter/gather it is trivial to do permutations via memory in just two
> instructions (a linear store + a gather load), and since the idea is
> that the vector registers should be long enough to hide any pipeline
> latencies the permutation operation basically has the same cost as two
> back-to-back ALU operations, /provided/ that there are no cache misses.
>
> The size of a vector register is 64+ bytes (implementation dependent),
> so the memory requirement for a single register is typically more than
> one cache line.
>
> So, what do you people think?

What happens if you take an interrupt and want to do a context switch
between the store and the gather load?

> Would a HW scratchpad have a performance
> advantage over, say, using the local stack as scratchpad area for the
> purpose of doing vector permutations and similar?

Probably, but I suspect there are better ways of accomplishing this
functionality.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Split register files

<2021Jul8.091204@mips.complang.tuwien.ac.at>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18491&group=comp.arch#18491

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Thu, 08 Jul 2021 07:12:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 16
Message-ID: <2021Jul8.091204@mips.complang.tuwien.ac.at>
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me> <sc4bab$8f2$1@newsreader4.netcologne.de> <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org> <sc4n1k$h30$1@newsreader4.netcologne.de> <jwvr1gaowte.fsf-monnier+comp.arch@gnu.org> <sc4sne$l6p$1@newsreader4.netcologne.de> <sc4t71$vrf$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="4046eb382dfb208213332e8b4156df24";
logging-data="29361"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KR5IbDhtvVMXTsGbUfe4j"
Cancel-Lock: sha1:UGMSQPrBjXahXyMDJiBXuIpseSM=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 8 Jul 2021 07:12 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>There are embedded applications that have a required response time to
>some event. That is, when that event occurs, you must respond within a
>specified amount of time. By having a fixed amount of storage that you
>know will respond within that time, you can guarantee meeting the
>timing. If you put it in a cache, you risk getting a cache miss on that
>data, and thus missing timing.

There exists hardware where you can lock cache lines (they cannot be
evicted) for that purpose. AFAIK some PowerPC CPUs have this, and I
expect that ARM Cortex-R CPUs have this, too.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Scratchpads (was: Split register files)

<sc6pcq$1rf$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18496&group=comp.arch#18496

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Scratchpads (was: Split register files)
Date: Thu, 8 Jul 2021 14:00:25 +0200
Organization: A noiseless patient Spider
Lines: 72
Message-ID: <sc6pcq$1rf$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<jwvwnq3vc90.fsf-monnier+comp.arch@gnu.org> <sc65fd$197$1@dont-email.me>
<sc677g$aeh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 8 Jul 2021 12:00:26 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="aee15b0c6682d2c7fc3bd87d6c28d548";
logging-data="1903"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ravKPYVJ4eoZEkltk/5ei8CGl06j6nRM="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:pATS3cb0cqZYdXoxZenOP7G5bj8=
In-Reply-To: <sc677g$aeh$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Thu, 8 Jul 2021 12:00 UTC

On 2021-07-08, Stephen Fuld wrote:
> On 7/7/2021 11:20 PM, Marcus wrote:
>> On 2021-07-06, Stefan Monnier wrote:
>>>> I have used architectures with scratchpads or that could configure
>>>> half the
>>>> cache as scratchpad, yuck.
>>>
>>> I don't think these scratchpads are very much like the Mill's
>>> scratchpad.
>>>
>>> I suspect the "yuck" above refers to the problems you end up having of
>>> administering this scratchpad, sharing it between unrelated functions.
>>>
>>> Mill's scratchpad is more like a CPU-supported notion of frame
>>> activation record.  Every time you enter a function you get a fresh new
>>> scratchpad and when you return from a function, the scratchpad is thrown
>>> away and the caller recovers instead the scratchpad it had before the
>>> call (scrachpads get "pushed on/popped off the stack" behind the scene).
>>>
>>> So it's easy for programmers and compilers to use it as a kind of "slow
>>> register file with register-windows".
>>
>> I've been toying with the idea of having a small scratchpad at a fixed
>> memory location in the MRISC32 for the sake of doing things like vector
>> register permutations (a la POWER vperm): Since the machine has
>> scatter/gather it is trivial to do permutations via memory in just two
>> instructions (a linear store + a gather load), and since the idea is
>> that the vector registers should be long enough to hide any pipeline
>> latencies the permutation operation basically has the same cost as two
>> back-to-back ALU operations, /provided/ that there are no cache misses.
>>
>> The size of a vector register is 64+ bytes (implementation dependent),
>> so the memory requirement for a single register is typically more than
>> one cache line.
>>
>> So, what do you people think?
>
> What happens if you take an interrupt and want to do a context switch
> between the store and the gather load?
>

Interrupts and exceptions are generally problematic for vector loads and
stores and need sufficient support for handling things like a mid-vector
page fault. I have not implemented any solutions for that yet (but there
*are* solutions).

Specifically for a scratchpad you'd have to define the semantics, of
course. I have not come that far yet but you would typically want it to
behave like a register file that is managed during context switches (one
option is to have many small areas - one for each thread, until you run
out of hard areas and need to do software switching).

>> Would a HW scratchpad have a performance
>> advantage over, say, using the local stack as scratchpad area for the
>> purpose of doing vector permutations and similar?
>
> Probably, but I suspect there are better ways of accomplishing this
> functionality.

The best (as in most performant) way is to have a dedicated vector
permutation unit, so that you do not have to do the cache/scratchpad
roundtrip, but such a thing is usually quite costly (and it adds
requirements on how the VRF must work).

Hmm... I guess you could code up a "macro instruction" that acts like
a combined store + load but against an internal memory that is hidden
from the programmer. If the instruction is guaranteed to be atomic (i.e.
no interrupts are allowed mid instruction), there would be no need to
ever care about the contents of the memory (e.g. during context
switches). I do not have any similar instructions ATM, though.

/Marcus

Re: Split register files

<sc76oa$6o4$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18505&group=comp.arch#18505

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Thu, 8 Jul 2021 08:48:24 -0700
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <sc76oa$6o4$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de>
<jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
<sc4sne$l6p$1@newsreader4.netcologne.de> <sc4t71$vrf$1@dont-email.me>
<2021Jul8.091204@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 8 Jul 2021 15:48:26 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e86c943a1bed7cd0d127ab445dcf1243";
logging-data="6916"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KhruCB0Im4iqO5HktPyBnihhGIgeFBXc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:VZulSgreTchp9dKD0QYOvq9Hn2k=
In-Reply-To: <2021Jul8.091204@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Stephen Fuld - Thu, 8 Jul 2021 15:48 UTC

On 7/8/2021 12:12 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> There are embedded applications that have a required response time to
>> some event. That is, when that event occurs, you must respond within a
>> specified amount of time. By having a fixed amount of storage that you
>> know will respond within that time, you can guarantee meeting the
>> timing. If you put it in a cache, you risk getting a cache miss on that
>> data, and thus missing timing.
>
> There exists hardware where you can lock cache lines (they cannot be
> evicted) for that purpose. AFAIK some PowerPC CPUs have this, and I
> expect that ARM Cortex-R CPUs have this, too.

Yes, another solution to the problem. It has the advantage of not
requiring the HW to have a specific hardware defined amount of dedicated
memory. It's disadvantage is that you are "wasting" some amount of HW
(the tags, etc. for the locked lines), which could be significant if you
need a relatively large amount of locked space.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Split register files

<2021Jul8.192030@mips.complang.tuwien.ac.at>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18514&group=comp.arch#18514

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Thu, 08 Jul 2021 17:20:30 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 18
Message-ID: <2021Jul8.192030@mips.complang.tuwien.ac.at>
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me> <sc4bab$8f2$1@newsreader4.netcologne.de> <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org> <sc4n1k$h30$1@newsreader4.netcologne.de> <jwvr1gaowte.fsf-monnier+comp.arch@gnu.org> <sc4sne$l6p$1@newsreader4.netcologne.de> <sc4t71$vrf$1@dont-email.me> <2021Jul8.091204@mips.complang.tuwien.ac.at> <sc76oa$6o4$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="4046eb382dfb208213332e8b4156df24";
logging-data="7314"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18ABf8C4kcJi05Pu5YjhpVD"
Cancel-Lock: sha1:ZpnCRLa9WTO9EmU9ihEHXJoXhWo=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 8 Jul 2021 17:20 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 7/8/2021 12:12 AM, Anton Ertl wrote:
>> There exists hardware where you can lock cache lines (they cannot be
>> evicted) for that purpose. AFAIK some PowerPC CPUs have this, and I
>> expect that ARM Cortex-R CPUs have this, too.
>
>Yes, another solution to the problem. It has the advantage of not
>requiring the HW to have a specific hardware defined amount of dedicated
>memory. It's disadvantage is that you are "wasting" some amount of HW
>(the tags, etc. for the locked lines)

If you want to lock lines that are not in a hardware-defined subset of
the address space, the tags are not wasted.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Split register files

<sc7elp$tu1$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18518&group=comp.arch#18518

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Split register files
Date: Thu, 8 Jul 2021 11:03:35 -0700
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <sc7elp$tu1$1@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de>
<jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de>
<jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
<sc4sne$l6p$1@newsreader4.netcologne.de> <sc4t71$vrf$1@dont-email.me>
<2021Jul8.091204@mips.complang.tuwien.ac.at> <sc76oa$6o4$1@dont-email.me>
<2021Jul8.192030@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 8 Jul 2021 18:03:37 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e86c943a1bed7cd0d127ab445dcf1243";
logging-data="30657"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Y/L0FH1No5WjgJ84p8QURYVO18TYoaGI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:zgPEXEZUTIM9+hUgZVgeaidtQD4=
In-Reply-To: <2021Jul8.192030@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Stephen Fuld - Thu, 8 Jul 2021 18:03 UTC

On 7/8/2021 10:20 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 7/8/2021 12:12 AM, Anton Ertl wrote:
>>> There exists hardware where you can lock cache lines (they cannot be
>>> evicted) for that purpose. AFAIK some PowerPC CPUs have this, and I
>>> expect that ARM Cortex-R CPUs have this, too.
>>
>> Yes, another solution to the problem. It has the advantage of not
>> requiring the HW to have a specific hardware defined amount of dedicated
>> memory. It's disadvantage is that you are "wasting" some amount of HW
>> (the tags, etc. for the locked lines)
>
> If you want to lock lines that are not in a hardware-defined subset of
> the address space, the tags are not wasted.

I probably wasn't clear. I know that you need the tags for the locked
lines, but if you used dedicated memory instead of a cache, then you
don't need tags, etc. for the dedicated memory.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Split register files

<62545488-12cd-4706-a2ef-8819681206b2n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18528&group=comp.arch#18528

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ac8:564b:: with SMTP id 11mr6330776qtt.60.1625781444644;
Thu, 08 Jul 2021 14:57:24 -0700 (PDT)
X-Received: by 2002:aca:dac5:: with SMTP id r188mr25056476oig.78.1625781444403;
Thu, 08 Jul 2021 14:57:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 14:57:24 -0700 (PDT)
In-Reply-To: <sc7elp$tu1$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:f5e2:724e:7e91:6c05;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:f5e2:724e:7e91:6c05
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc4bab$8f2$1@newsreader4.netcologne.de> <jwv1r8arzpx.fsf-monnier+comp.arch@gnu.org>
<sc4n1k$h30$1@newsreader4.netcologne.de> <jwvr1gaowte.fsf-monnier+comp.arch@gnu.org>
<sc4sne$l6p$1@newsreader4.netcologne.de> <sc4t71$vrf$1@dont-email.me>
<2021Jul8.091204@mips.complang.tuwien.ac.at> <sc76oa$6o4$1@dont-email.me>
<2021Jul8.192030@mips.complang.tuwien.ac.at> <sc7elp$tu1$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <62545488-12cd-4706-a2ef-8819681206b2n@googlegroups.com>
Subject: Re: Split register files
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Thu, 08 Jul 2021 21:57:24 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Thu, 8 Jul 2021 21:57 UTC

On Thursday, July 8, 2021 at 12:03:39 PM UTC-6, Stephen Fuld wrote:
> On 7/8/2021 10:20 AM, Anton Ertl wrote:

> > If you want to lock lines that are not in a hardware-defined subset of
> > the address space, the tags are not wasted.

> I probably wasn't clear. I know that you need the tags for the locked
> lines, but if you used dedicated memory instead of a cache, then you
> don't need tags, etc. for the dedicated memory.

You still don't need to waste the tags.

If it happens that a cache line is 64 bits, since a tag is a 64-bit address,
you could simply use the tags too and thus have to lock only half as many
cache entries to get your buffer.

Of course, a cache line is more likely to be something like 512 bits.

John Savard

Re: Scratchpads (was: Split register files)

<sc84pd$4tn$2@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18535&group=comp.arch#18535

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Scratchpads (was: Split register files)
Date: Thu, 8 Jul 2021 17:21:01 -0700
Organization: A noiseless patient Spider
Lines: 85
Message-ID: <sc84pd$4tn$2@dont-email.me>
References: <sb6s70$dip$1@newsreader4.netcologne.de>
<sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de>
<sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me>
<sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<jwvwnq3vc90.fsf-monnier+comp.arch@gnu.org> <sc65fd$197$1@dont-email.me>
<sc677g$aeh$1@dont-email.me> <sc6pcq$1rf$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 9 Jul 2021 00:21:02 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5afffdc2df28b90ccc9a573cf00e64fa";
logging-data="5047"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+icRXYcVFcZwdPTY4r7XtMYcbGhvL5KI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:5M5MNAPhUvc/KAoO075Dcsyy/SQ=
In-Reply-To: <sc6pcq$1rf$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Fri, 9 Jul 2021 00:21 UTC

On 7/8/2021 5:00 AM, Marcus wrote:
> On 2021-07-08, Stephen Fuld wrote:
>> On 7/7/2021 11:20 PM, Marcus wrote:
>>> On 2021-07-06, Stefan Monnier wrote:
>>>>> I have used architectures with scratchpads or that could configure
>>>>> half the
>>>>> cache as scratchpad, yuck.
>>>>
>>>> I don't think these scratchpads are very much like the Mill's
>>>> scratchpad.
>>>>
>>>> I suspect the "yuck" above refers to the problems you end up having of
>>>> administering this scratchpad, sharing it between unrelated functions.
>>>>
>>>> Mill's scratchpad is more like a CPU-supported notion of frame
>>>> activation record.  Every time you enter a function you get a fresh new
>>>> scratchpad and when you return from a function, the scratchpad is
>>>> thrown
>>>> away and the caller recovers instead the scratchpad it had before the
>>>> call (scrachpads get "pushed on/popped off the stack" behind the
>>>> scene).
>>>>
>>>> So it's easy for programmers and compilers to use it as a kind of "slow
>>>> register file with register-windows".
>>>
>>> I've been toying with the idea of having a small scratchpad at a fixed
>>> memory location in the MRISC32 for the sake of doing things like vector
>>> register permutations (a la POWER vperm): Since the machine has
>>> scatter/gather it is trivial to do permutations via memory in just two
>>> instructions (a linear store + a gather load), and since the idea is
>>> that the vector registers should be long enough to hide any pipeline
>>> latencies the permutation operation basically has the same cost as two
>>> back-to-back ALU operations, /provided/ that there are no cache misses.
>>>
>>> The size of a vector register is 64+ bytes (implementation dependent),
>>> so the memory requirement for a single register is typically more than
>>> one cache line.
>>>
>>> So, what do you people think?
>>
>> What happens if you take an interrupt and want to do a context switch
>> between the store and the gather load?
>>
>
> Interrupts and exceptions are generally problematic for vector loads and
> stores and need sufficient support for handling things like a mid-vector
> page fault. I have not implemented any solutions for that yet (but there
> *are* solutions).
>
> Specifically for a scratchpad you'd have to define the semantics, of
> course. I have not come that far yet but you would typically want it to
> behave like a register file that is managed during context switches (one
> option is to have many small areas - one for each thread, until you run
> out of hard areas and need to do software switching).

Ugh!

>>> Would a HW scratchpad have a performance
>>> advantage over, say, using the local stack as scratchpad area for the
>>> purpose of doing vector permutations and similar?
>>
>> Probably, but I suspect there are better ways of accomplishing this
>> functionality.
>
> The best (as in most performant) way is to have a dedicated vector
> permutation unit, so that you do not have to do the cache/scratchpad
> roundtrip, but such a thing is usually quite costly (and it adds
> requirements on how the VRF must work).
>
> Hmm... I guess you could code up a "macro instruction" that acts like
> a combined store + load but against an internal memory that is hidden
> from the programmer. If the instruction is guaranteed to be atomic (i.e.
> no interrupts are allowed mid instruction), there would be no need to
> ever care about the contents of the memory (e.g. during context
> switches). I do not have any similar instructions ATM, though.

That was along the lines of what I was thinking.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Pages:12345678
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor