
Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trib30$1acn8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30698&group=comp.arch#30698

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Introducing ForwardCom: An open ISA with variable-length vector
registers
Date: Fri, 3 Feb 2023 07:54:23 +0100
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <trib30$1acn8$1@dont-email.me>
References: <12155892-ec89-4bee-85bb-d1491a5dd20dn@googlegroups.com>
<trdrft$1rac$1@newsreader4.netcologne.de>
<967f0435-af12-452d-9aa9-ddf9f1580760n@googlegroups.com>
<tree26$28u3$1@newsreader4.netcologne.de> <treorr$hb6f$1@dont-email.me>
<trflaf$32et$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 3 Feb 2023 06:54:24 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="a161405709ba2f4693c0a068654c88dd";
logging-data="1389288"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+yFNp3j9x24sKxnI0Dv6uTqdj5a3QRrrbi2Iigfr/w7w=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.15
Cancel-Lock: sha1:eT/m3kosC4ghRo+7mqvvGI7EcHE=
In-Reply-To: <trflaf$32et$1@newsreader4.netcologne.de>

by: Terje Mathisen - Fri, 3 Feb 2023 06:54 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>> Thomas Koenig wrote:
>>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>>> On Wednesday, February 1, 2023 at 8:03:45 AM UTC-6, Thomas Koenig wrote:
>>>>> Agner Fog <ag...@dtu.dk> schrieb:
>>>>>> The ForwardCom assembly language is simple and immediately
>>>>>> intelligible. Adding two integers is as simple as: int r1 =
>>>>>> r2 + r3. Branches and loops are written just like in C or Java,
>>>>>> for example:
>>>>>
>>>>>> for (int r0 = 0; r0 < 100; r0++) { }
>>>>> Quite interesting, thanks a lot for sharing!
>>>>>
>>>>> One question: What would be the best way to handle loop-carried
>>>>> dependencies (let's say a memmove, where operands can overlap,
>>>>> or C's
>>>> <
>>>> MM Rto,Rfrom,Rcnt
>>>>>
>>>>> void add (const int *a, int *b, int *c, int n)
>>>>> {
>>>>> for (int i=0; i<n; i++)
>>>>> a[i] = b[i] + c[i];
>>>>> }
>>>> <
>>>> I fail to see a loop carried dependence. Nothing loaded or
>>>> calculated in the loop is used in the subsequent loop. Do I
>>>> have a bad interpretation of "loop carried" ?
>>>
>>> It's not obvious, but (see my reply to Terje) the pointers can
>>> point to overlapping parts of an array. In this case, a
>>> non-obvious loop-carried dependence can be introduced (note the
>>> lack of restrict in the argument list), apart from the
>>> const which is wrong.
>>
>> IMHO, this is exactly like the IBM RLL-decoding using intentionally
>> overlapping moves (MVC?): You really, really should not do it, and if
>> you do, then you deserve whatever ills befall you.
>
> I concur. The question is who is "you" in your sentence above...
>
> Compiler writers have to get it right, according to the language
> specification, so wrong code is not an option. If the language
> specification results in pessimized code to cater for a case that
> hardly anybody uses intentionally, well... what is a compiler
> writer to do?

Generate correct code: for maximum optimization, give a warning, but
produce code that tests the parameters and branches to alternate
implementations, i.e. in your example both b[] and c[] must be checked
for overlap with a[]. The fallback can still be fast, as in memmove(),
if zero or one input overlaps; then fall back on scalar code for the
really bad cases?
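
A minimal sketch of such a test-and-branch wrapper, assuming a flat
address space. The names ranges_overlap and add_checked are made up for
illustration, the stray const on a[] is dropped, and only two tiers are
shown (a restrict-qualified fast path and the strictly ordered scalar
loop), not the memmove()-like middle tier:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: do [p, p+n) and [q, q+n) share any element?
   Comparing through uintptr_t avoids relational comparison of
   pointers into different objects, at the cost of assuming a flat
   address space. */
static int ranges_overlap(const int *p, const int *q, size_t n)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return a < b + bytes && b < a + bytes;
}

/* a[i] = b[i] + c[i] with the C semantics of the original loop.
   Returns 1 if the fast path ran, 0 if the scalar fallback ran. */
int add_checked(int *a, const int *b, const int *c, size_t n)
{
    if (!ranges_overlap(a, b, n) && !ranges_overlap(a, c, n)) {
        /* No input aliases the output, so restrict is a true promise
           here and the compiler is free to vectorize this loop. */
        int *restrict ra = a;
        const int *restrict rb = b;
        const int *restrict rc = c;
        for (size_t i = 0; i < n; i++)
            ra[i] = rb[i] + rc[i];
        return 1;
    }
    /* Some input overlaps the output: keep strict element order. */
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    return 0;
}
```

The check is deliberately conservative: exact aliasing such as a == b
would also be safe to vectorize, but it takes the scalar path here.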

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<36156f23-56e8-4624-9e06-58334c48ccd6n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=30701&group=comp.arch#30701


X-Received: by 2002:a05:620a:6209:b0:71f:fa55:dc3d with SMTP id ou9-20020a05620a620900b0071ffa55dc3dmr846255qkn.258.1675413941071;
Fri, 03 Feb 2023 00:45:41 -0800 (PST)
X-Received: by 2002:a05:6870:b003:b0:163:32c8:bb97 with SMTP id
y3-20020a056870b00300b0016332c8bb97mr666980oae.61.1675413940806; Fri, 03 Feb
2023 00:45:40 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 3 Feb 2023 00:45:40 -0800 (PST)
In-Reply-To: <trib30$1acn8$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=99.251.79.92; posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 99.251.79.92
References: <12155892-ec89-4bee-85bb-d1491a5dd20dn@googlegroups.com>
<trdrft$1rac$1@newsreader4.netcologne.de> <967f0435-af12-452d-9aa9-ddf9f1580760n@googlegroups.com>
<tree26$28u3$1@newsreader4.netcologne.de> <treorr$hb6f$1@dont-email.me>
<trflaf$32et$1@newsreader4.netcologne.de> <trib30$1acn8$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <36156f23-56e8-4624-9e06-58334c48ccd6n@googlegroups.com>
Subject: Re: Introducing ForwardCom: An open ISA with variable-length vector registers
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Fri, 03 Feb 2023 08:45:41 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2080

by: robf...@gmail.com - Fri, 3 Feb 2023 08:45 UTC

Re: memory management,

I skimmed through the manual and was somewhat surprised by the memory
management section. It sounds rather like a segmented system.
Is it possible to get a little more detail? If I have got this right, there is
a table (or tables) of block base addresses and block lengths, and a number of
comparators are used to map a virtual address to a physical one. Can an object
span multiple blocks? Is the block table swapped on a context switch, or are
there entries for multiple contexts?
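
To make the question concrete, here is a software model of the scheme
as it is read above. It is not taken from the ForwardCom manual: the
entry layout and the names block_entry and translate are assumptions,
and hardware would do all the comparisons in parallel rather than in a
loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map entry: one contiguous virtual block. */
typedef struct {
    uint64_t virt_base;  /* first virtual address of the block */
    uint64_t length;     /* block size in bytes */
    uint64_t phys_base;  /* physical address that virt_base maps to */
} block_entry;

/* Translate vaddr using a small table of blocks.
   Returns 1 and stores the result in *paddr on a hit, 0 on a miss. */
int translate(const block_entry *map, size_t entries,
              uint64_t vaddr, uint64_t *paddr)
{
    for (size_t i = 0; i < entries; i++) {
        uint64_t off = vaddr - map[i].virt_base;  /* wraps if below base */
        if (vaddr >= map[i].virt_base && off < map[i].length) {
            *paddr = map[i].phys_base + off;
            return 1;
        }
    }
    return 0;  /* no block covers vaddr: treat as an access violation */
}
```

In this model an object can only span two blocks if they happen to be
adjacent in both the virtual and the physical address space.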

While many systems contain block address translations, they also typically
include TLBs and paged memory. That style of memory management seems
to have won out.

Agner Fog <agfo@dtu.dk> writes:
>Anton Ertl wrote:
>> In recent time we have had the problem that CPU manufacturers combine
>> different cores on the same CPU, and want to migrate threads between
>> these cores. So they use the same SIMD lengths on all cores, even, in
>> the case of Intel, disabling AVX-512 on CPUs that do not have the
>> smaller cores, i.e., where all cores could do AVX-512 if it was not
>> disabled.
>Yes, this was a big mistake in my opinion. The first version of Intel's
>Alder Lake had access to different cores with and without AVX512 support.

What I read is that you could have a BIOS setting where you could
enable AVX-512 but you could only do that if you disabled the E-cores.
Intel soon provided BIOS updates that eliminated this option.

>Your program will crash if the software detects that it can use AVX512 and
>later jumps to another core without AVX512. So they had to disable the
>AVX512 completely, including the new half-precision floating point
>instructions. Did Intel expect the OS to move a thread back to a capable
>core in this case? You cannot expect an OS to give special treatment to
>every new processor on the market with new quirks.

One way to deal with this would be to not report the availability of
AVX-512 (and maybe even disable the instructions) unless the OS tells
the CPU through a model-specific register that it is prepared for
heterogeneous AVX-512 support.

An OS that is unaware of the issue would just work, without AVX-512,
as is currently the case with Alder Lake; but the option I outlined
would open the possibility for other approaches:

One way an OS could handle the issue is to only tell the CPU that it
should report AVX-512 for processes that are limited to P-cores (with
thread-affinity or somesuch).

Another way would be to always tell the CPU that, and, as you wrote,
to migrate the process back to a P-Core if it executes an AVX-512
instruction.

Intel was not very coordinated when designing Alder Lake. They had
two teams designing cores with different instruction sets, then did
not provide OS workarounds for that, then dealt with the resulting
fallout by first restricting AVX-512 severely, but for whatever reason
even that was not enough for them, and finally they disabled AVX-512
completely.

>Unfortunately, ARM does the same in their big.LITTLE architecture.

A little more coordinated, it seems: All their cores are limited to
128-bit SIMD in SVE, which however eliminates the USP of SVE:
scalability to wider SIMD units.

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary.

I think it's something the programming language design should take
care of. If the language has vectors as a value type, this eliminates
the problem. The Fortran array sub-language is not quite there, but
still pretty good.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<a2074c58-8c59-4e33-8222-4661611dd92fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=30703&group=comp.arch#30703


X-Received: by 2002:a05:620a:4311:b0:71d:bbfe:6af0 with SMTP id u17-20020a05620a431100b0071dbbfe6af0mr807722qko.327.1675428117906;
Fri, 03 Feb 2023 04:41:57 -0800 (PST)
X-Received: by 2002:a05:6870:b008:b0:163:36d5:35db with SMTP id
y8-20020a056870b00800b0016336d535dbmr625573oae.113.1675428117612; Fri, 03 Feb
2023 04:41:57 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 3 Feb 2023 04:41:57 -0800 (PST)
In-Reply-To: <2023Feb2.234358@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:d512:a614:a30b:924c;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:d512:a614:a30b:924c
References: <12155892-ec89-4bee-85bb-d1491a5dd20dn@googlegroups.com> <2023Feb2.234358@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a2074c58-8c59-4e33-8222-4661611dd92fn@googlegroups.com>
Subject: Re: Introducing ForwardCom: An open ISA with variable-length vector registers
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 03 Feb 2023 12:41:57 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4513

by: Michael S - Fri, 3 Feb 2023 12:41 UTC

On Friday, February 3, 2023 at 1:03:22 AM UTC+2, Anton Ertl wrote:
> Agner Fog <ag...@dtu.dk> writes:
> >ForwardCom can vectorize array loops in a new way that automatically
> >adjusts to the maximum vector length supported by the CPU. It works in
> >the following way:
> >
> >A register used as loop counter is initialized to the array length. This
> >counter is decremented by the vector register length and repeats as long
> >as it is positive. The counter register also specifies the vector register
> >length. The vector registers used in the loop will have the maximum length
> >as long as the counter exceeds this value. There is no loop tail because
> >the vector length is automatically adjusted in the last iteration of the
> >loop to fit the remaining number of array elements.
> ...
> >There is no global vector length register. You can have different vectors
> >with different lengths at the same time. The length is stored in the
> >vector register itself. When you save and restore a vector register, it
> >will only save the part of the register that is actually used.
>
> In recent time we have had the problem that CPU manufacturers combine
> different cores on the same CPU, and want to migrate threads between
> these cores. So they use the same SIMD lengths on all cores, even, in
> the case of Intel, disabling AVX-512 on CPUs that do not have the
> smaller cores, i.e., where all cores could do AVX-512 if it was not
> disabled.
>
> It seems to me that you don't have an answer for that. And I guess
> that as long as there are architectural (programmer-visible) SIMD
> registers, the problem will persist. The SIMD registers must not be
> an architectural feature, as in VVM (or maybe Helium?), to make
> migration from cores with longer to cores with shorter vectors
> possible.

SIMD registers are an architectural feature of Helium/MVE.
They overlap with the scalar FP registers in a manner different from
what is common in the x86 world, but consistent with the way in which
ARMv7-M FPv5 double-precision registers overlap with single-precision ones.

If anything, the actual execution width (1, 2 or 4 beats per architecture tick)
is more exposed to SW in Helium than in most other SIMD environments.

"The SIMD registers must not be an architectural feature" appears
to be a fashion of comp.arch newsgroup, but it's not a popular
view in wider world.

>
> The other option would be for Intel to implement AVX-512 in the small
> cores without 512-bit units, the same way that AMD has been using time
> and again: XMM=2x64bits on K8, YMM=2x128 bits on later heavy machinery
> and Zen1, and ZMM is 2x256 bits on Zen4.
>

Intel did it many times in the past. All SSEx implementations before Core had
64-bit floating-point execution units. They did it later, too.
Even Silvermont, a direct ancestor of Alder Lake E-cores, had 64-bit FP multiplier.
Alder Lake was/is just a mess, as you correctly state in other post.
Although somewhat less messy than its predecessor Lakefield.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<ca496fc9-9001-44ba-b7f6-89aea76c07e2n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=30704&group=comp.arch#30704


X-Received: by 2002:a37:f50f:0:b0:71c:ecec:9dca with SMTP id l15-20020a37f50f000000b0071cecec9dcamr749127qkk.55.1675428592595;
Fri, 03 Feb 2023 04:49:52 -0800 (PST)
X-Received: by 2002:a05:6871:8a3:b0:163:aadd:a457 with SMTP id
r35-20020a05687108a300b00163aadda457mr777108oaq.201.1675428592320; Fri, 03
Feb 2023 04:49:52 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 3 Feb 2023 04:49:52 -0800 (PST)
In-Reply-To: <2023Feb3.120201@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:d512:a614:a30b:924c;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:d512:a614:a30b:924c
References: <12155892-ec89-4bee-85bb-d1491a5dd20dn@googlegroups.com>
<2023Feb2.234358@mips.complang.tuwien.ac.at> <f68efae2-ac92-4f83-9221-191a89438510n@googlegroups.com>
<2023Feb3.120201@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ca496fc9-9001-44ba-b7f6-89aea76c07e2n@googlegroups.com>
Subject: Re: Introducing ForwardCom: An open ISA with variable-length vector registers
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 03 Feb 2023 12:49:52 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 5023

by: Michael S - Fri, 3 Feb 2023 12:49 UTC

On Friday, February 3, 2023 at 1:35:44 PM UTC+2, Anton Ertl wrote:
> Agner Fog <ag...@dtu.dk> writes:
> >Anton Ertl wrote:
> >> In recent time we have had the problem that CPU manufacturers combine
> >> different cores on the same CPU, and want to migrate threads between
> >> these cores. So they use the same SIMD lengths on all cores, even, in
> >> the case of Intel, disabling AVX-512 on CPUs that do not have the
> >> smaller cores, i.e., where all cores could do AVX-512 if it was not
> >> disabled.
> >Yes, this was a big mistake in my opinion. The first version of Intel's
> >Alder Lake had access to different cores with and without AVX512 support.
> What I read is that you could have a BIOS setting where you could
> enable AVX-512 but you could only do that if you disabled the E-cores.
> Intel soon provided BIOS updates that eliminated this option.
>
> >Your program will crash if the software detects that it can use AVX512
> >and later jumps to another core without AVX512. So they had to disable
> >the AVX512 completely, including the new half-precision floating point
> >instructions. Did Intel expect the OS to move a thread back to a capable
> >core in this case? You cannot expect an OS to give special treatment to
> >every new processor on the market with new quirks.
> One way to deal with this would be to not report the availability of
> AVX-512 (and maybe even disable the instructions) unless the OS tells
> the CPU through a model-specific register that it is prepared for
> heterogeneous AVX-512 support.
>
> An OS that is unaware of the issue would just work, without AVX-512,
> as is currently the case with Alder Lake; but the option I outlined
> would open the possibility for other approaches:
>
> One way an OS could handle the issue is to only tell the CPU that it
> should report AVX-512 for processes that are limited to P-cores (with
> thread-affinity or somesuch).
>
> Another way would be to always tell the CPU that, and, as you wrote,
> to migrate the process back to a P-Core if it executes an AVX-512
> instruction.
>
> Intel was not very coordinated when designing Alder Lake. They had
> two teams designing cores with different instruction sets, then did
> not provide OS workarounds for that, then dealt with the resulting
> fallout by first restricting AVX-512 severely, but for whatever reason
> even that was not enough for them, and finally they disabled AVX-512
> completely.
> >Unfortunately, ARM does the same in their big.LITTLE architecture.
> A little more coordinated, it seems: All their cores are limited to
> 128-bit SIMD in SVE, which however eliminates the USP of SVE:
> scalability to wider SIMD units.
>

Yes, Arm Inc. manages their big.LITTLE combos well.
It was Samsung that did bad things like combining Cortex-A53
as LITTLE with their own Mongoose cores as big, despite the
mismatch in cache line width, which is certainly visible to
application software.
Mongoose is dead now, so happiness has returned.

> >Regarding pointer aliasing:
> >I think it is the responsibility of the programmer to consider pointer
> >aliasing problems and do a loop backwards if necessary.
> I think it's something the programming language design should take
> care of. If the language has vectors as a value type, this eliminates
> the problem. The Fortran array sub-language is not quite there, but
> still pretty good.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Agner Fog <agfo@dtu.dk> writes:
>Anton Ertl wrote:

>Regarding pointer aliasing:
>I think it is the responsibility of the programmer to consider pointer
>aliasing problems and do a loop backwards if necessary. If you want the
>hardware to fix all kinds of bad programming, you will soon find yourself
>down a very deep rabbit hole.

Which is why memmove(3) was standardized decades ago.
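
For readers who have not looked inside it, the "loop backwards if
necessary" decision that memmove() hides looks roughly like this. This
is a byte-wise sketch only; the name move_bytes is made up, a flat
address space is assumed for the pointer comparison, and real library
versions are far more heavily optimized.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise move that tolerates overlapping buffers. */
void *move_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Destination starts at or below the source: copying
           forwards never overwrites bytes not yet read. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Destination is above the source: copy backwards. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```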

Michael S <already5chosen@yahoo.com> writes:
>"The SIMD registers must not be an architectural feature" appears
>to be a fashion of comp.arch newsgroup, but it's not a popular
>view in wider world.

Yet? Certainly there have been complaints about SIMD width as an
instruction set feature (as in SSE, AVX, AVX-512), and they have
resulted in SVE, and AFAIK the RISC-V vector extension. But with that
the SIMD width is still architectural, which prevents migration to
implementations with a different SIMD width; that's very clear in
Alder Lake, but is also a problem when migrating VMs to different
computers, e.g., migrating an SVE program from a Fujitsu supercomputer
to some ARMv9 server.

Keeping things that you want to be able to change as
microarchitectural features rather than turning them into architectural
features has been used successfully in the past. If we could use it
for SIMD/vectors, that would be great. I am not sure how realistic
VVM is, but if it can be realized, the fact that the SIMD size is a
microarchitectural rather than architectural feature is a major
benefit as far as I am concerned (I am not a believer in
auto-vectorization, even if it happens in the hardware and avoids the
aliasing problems).

>Intel did it many times in the past. All SSEx implementations before Core had
>64-bit floating-point execution units. They did it later, too.
>Even Silvermont, a direct ancestor of Alder Lake E-cores, had 64-bit FP multiplier.
>Alder Lake was/is just a mess, as you correctly state in other post.

So the fact that Gracemont (and therefore Alder Lake) does not support
AVX-512 is probably a result of Intel trying to perform market
segmentation with instruction set extensions. If so, apparently Intel
marketing does not understand the market very well.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Thomas Koenig <tkoenig@netcologne.de> writes:
>For Fortran, there is a related problem. While arguments cannot
>overlap, the language prescribes that the right-hand side of
>an assignment is evaluated before the actual assignment.
>This leads to generation of temporary arrays for code like
>
> a(n:m) = a(n-1:m-1) + a(n+1:m+1)
>
>or when overlap / non-overlap cannot be detected at compile time.

And what is the problem? How would you define it instead?

I think that Fortran defines this well. I think that it's even better
to have vectors as value types not just for intermediate results of
expressions, but also in variables, but what Fortran does here is a
good intermediate step.

Having the result in different memory than the source is a fine
solution (and, if all of a is overwritten, and a had a value type, it
could even be done without extra copying). But in the present case,
there is another solution (which might be found by your compiler):

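    /* computes a(n:m) = a(n-1:m-1) + a(n+1:m+1) in place;
       i is declared outside the loop because it is used after it */
    if (n<=m) {
        double aim1 = a[n-1] + a[n+1];    /* new a[n], not yet stored */
        for (i=n+1; i<=m; i++) {
            double ai = a[i-1] + a[i+1];  /* still reads the old a[i-1] */
            a[i-1] = aim1;                /* now a[i-1] may be overwritten */
            aim1 = ai;
        }
        a[i-1] = aim1;                    /* i == m+1 here: store new a[m] */
    }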

You can unroll the loop by a factor of 2 and perform some renaming to
eliminate "aim1 = ai;". You can use SIMD instead of double, with the
same logic otherwise to make use of SIMD.
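
A sketch of that unrolling, checked against the temporary-array
semantics. The function names are made up for illustration; both
assume n >= 1 and that a[m+1] exists, and tmp must hold at least
m-n+1 elements.

```c
#include <stddef.h>

/* Reference semantics: evaluate the whole right-hand side into a
   temporary first, then assign. */
void stencil_with_temp(double *a, size_t n, size_t m, double *tmp)
{
    for (size_t i = n; i <= m; i++)
        tmp[i - n] = a[i - 1] + a[i + 1];
    for (size_t i = n; i <= m; i++)
        a[i] = tmp[i - n];
}

/* In place, unrolled by 2: p and q alternate as the delayed value,
   so the "aim1 = ai" copy disappears from the main loop. */
void stencil_in_place(double *a, size_t n, size_t m)
{
    if (n > m)
        return;
    double p = a[n - 1] + a[n + 1];     /* new a[n], not yet stored */
    size_t i = n + 1;
    for (; i + 1 <= m; i += 2) {
        double q = a[i - 1] + a[i + 1]; /* new a[i]; a[i-1] still old */
        a[i - 1] = p;
        p = a[i] + a[i + 2];            /* new a[i+1]; a[i] still old */
        a[i] = q;
    }
    if (i <= m) {                       /* one element left over */
        double q = a[i - 1] + a[i + 1];
        a[i - 1] = p;
        p = q;
        i++;
    }
    a[i - 1] = p;                       /* i == m+1: store new a[m] */
}
```

The invariant at the top of each iteration is that p holds the new
value for a[i-1] and everything from a[i-1] upwards is still unmodified.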

Your example looks similar to a convolution filter, but those usually
have more than one dimension, which makes the whole optimization
harder (e.g., for a 2D convolution filter you would have to store more
than one line in intermediate storage before copying it back to the
original location). Just having the result in a new location looks
much more attractive then.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trjker$1i2jr$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=30708&group=comp.arch#30708


Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Introducing ForwardCom: An open ISA with variable-length vector
registers
Date: Fri, 3 Feb 2023 12:40:22 -0600
Organization: A noiseless patient Spider
Lines: 131
Message-ID: <trjker$1i2jr$1@dont-email.me>
References: <12155892-ec89-4bee-85bb-d1491a5dd20dn@googlegroups.com>
<2023Feb2.234358@mips.complang.tuwien.ac.at>
<f68efae2-ac92-4f83-9221-191a89438510n@googlegroups.com>
<2023Feb3.120201@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 3 Feb 2023 18:40:27 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="831e33877a214f0e42e78acaf6027a54";
logging-data="1641083"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+F91DLkLimOJPqSKrsX5sA"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Cancel-Lock: sha1:TLQxRIjweLaOkOQlevtxWnJlVcs=
Content-Language: en-US
In-Reply-To: <2023Feb3.120201@mips.complang.tuwien.ac.at>

by: BGB - Fri, 3 Feb 2023 18:40 UTC

On 2/3/2023 5:02 AM, Anton Ertl wrote:
> Agner Fog <agfo@dtu.dk> writes:
>> Anton Ertl wrote:
>>> In recent time we have had the problem that CPU manufacturers combine
>>> different cores on the same CPU, and want to migrate threads between
>>> these cores. So they use the same SIMD lengths on all cores, even, in
>>> the case of Intel, disabling AVX-512 on CPUs that do not have the
>>> smaller cores, i.e., where all cores could do AVX-512 if it was not
>>> disabled.
>> Yes, this was a big mistake in my opinion. The first version of Intel's
>> Alder Lake had access to different cores with and without AVX512 support.
>
> What I read is that you could have a BIOS setting where you could
> enable AVX-512 but you could only do that if you disabled the E-cores.
> Intel soon provided BIOS updates that eliminated this option.
>
>> Your program will crash if the software detects that it can use AVX512
>> and later jumps to another core without AVX512. So they had to disable
>> the AVX512 completely, including the new half-precision floating point
>> instructions. Did Intel expect the OS to move a thread back to a capable
>> core in this case? You cannot expect an OS to give special treatment to
>> every new processor on the market with new quirks.
>
> One way to deal with this would be to not report the availability of
> AVX-512 (and maybe even disable the instructions) unless the OS tells
> the CPU through a model-specific register that it is prepared for
> heterogeneous AVX-512 support.
>
> An OS that is unaware of the issue would just work, without AVX-512,
> as is currently the case with Alder Lake; but the option I outlined
> would open the possibility for other approaches:
>
> One way an OS could handle the issue is to only tell the CPU that it
> should report AVX-512 for processes that are limited to P-cores (with
> thread-affinity or somesuch).
>
> Another way would be to always tell the CPU that, and, as you wrote,
> to migrate the process back to a P-Core if it executes an AVX-512
> instruction.
>
> Intel was not very coordinated when designing Alder Lake. They had
> two teams designing cores with different instruction sets, then did
> not provide OS workarounds for that, then dealt with the resulting
> fallout by first restricting AVX-512 severely, but for whatever reason
> even that was not enough for them, and finally they disabled AVX-512
> completely.
>
>> Unfortunately, ARM does the same in their big.LITTLE architecture.
>
> A little more coordinated, it seems: All their cores are limited to
> 128-bit SIMD in SVE, which however eliminates the USP of SVE:
> scalability to wider SIMD units.
>

IMO:
This isn't a huge loss, as IME the general utility of SIMD drops off
quickly as one goes wider.

So, 4-element vectors, 128-bit vectors, ... can almost be considered as
magic numbers.

Likewise, if one does go to 8-element vectors, it is often "more useful"
to handle them more as "2x 4-elements" than as a homogeneous 8.

Main use-case for 256-bit would be 4x Binary64, but not as much else.
But, if one has a CPU where this has no advantage over 128-bit (because
the CPU is just faking it), well then, 128-bit remains on top.

And, ironically, the main alternate use case for SIMD being for memory
copies, which are then "however wide the CPU can natively handle"
(multiplies by a scale of pipeline latency).

So, for BJX2 in my current cores, this leads to copying 512-bits at a
time being the "most efficient" case for general memcpy. Though larger
blocks can still see an advantage by reducing loop overhead.

Copying 2048 or 4096 bits is enough to basically "hide" looping related
overheads, but is only really "useful" for bulk copying (say, one needs
to copy a buffer of 16K or more before it makes much difference).

For something like an LZ77 decoder, it may make sense to ignore
bulk-copy cases (the majority of matches being relatively short).

In this case, it makes more sense to try to be "most efficient" for
copying in the 4-16 byte range. Copies much over 256 bytes being
relatively uncommon.
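
As an illustration of why short matches are their own case, here is a
sketch of an LZ77 match copy. The name lz_copy_match is made up and no
particular decoder or format is implied; the point is that a match
whose distance is shorter than its length overlaps its own output on
purpose, so it cannot simply be handed to a wide block copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical LZ77 match copy: append len bytes taken from dist
   bytes behind the write position out. Requires 1 <= dist <= number
   of bytes already written. Returns the new write position. */
unsigned char *lz_copy_match(unsigned char *out, size_t dist, size_t len)
{
    const unsigned char *src = out - dist;

    if (dist >= len) {
        /* Source and destination do not overlap: plain block copy. */
        memcpy(out, src, len);
    } else {
        /* Overlap is intentional: bytes written earlier in this same
           copy are read again, repeating a dist-byte pattern. This
           must run forwards; memmove() would give a different
           result here. */
        for (size_t i = 0; i < len; i++)
            out[i] = src[i];
    }
    return out + len;
}
```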

Otherwise, I have skepticism about compiler autovectorization being able
to make effective use of wider SIMD types, when (as a programmer) it is
also difficult to make effective use of them.

>> Regarding pointer aliasing:
>> I think it is the responsibility of the programmer to consider pointer
>> aliasing problems and do a loop backwards if necessary.
>
> I think it's something the programming language design should take
> care of. If the language has vectors as a value type, this eliminates
> the problem. The Fortran array sub-language is not quite there, but
> still pretty good.
>

Would be useful if C gave a little more here.

Ideally, I would like keywords both for aliasing (and unaligned
load/store, etc.), along with maybe a way to specify vector types and
vector ops that works across compilers and platforms, regardless of
whether the target actually supports them natively.

Though, GCC's vectors can give at least one of these points.
* Works across targets.
But:
* Not really portable outside of GCC (and Clang)
* Still seems to depend on native SIMD support AFAIK.
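
For reference, a small example of the GCC vector notation in question
(also accepted by Clang). The type name v4f and the two functions are
made up for illustration. As far as I recall, the GCC manual says that
operations with no matching native instruction are synthesized from
narrower ones, but that is worth checking for a given target.

```c
/* GCC/Clang vector extension: four floats in a 16-byte vector. */
typedef float v4f __attribute__((vector_size(16)));

/* Element-wise a*b + c, written with ordinary operators. */
v4f madd4(v4f a, v4f b, v4f c)
{
    return a * b + c;
}

/* Sum of the four lanes, using subscripting on the vector type. */
float hsum4(v4f v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```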

For BGBCC, I had defined my own system, with partial compatibility with
GCC's notation. Operations are defined relative to the vector size/type,
rather than whether or not it maps to a native HW vector (though, with
the caveat that trying to use vectors when no SIMD is enabled, will
generate runtime calls...).

....

Quadibloc <jsavard@ecn.ab.ca> schrieb:
> On Monday, January 30, 2023 at 4:03:36 AM UTC-7, Agner Fog wrote:
>
>> ForwardCom can vectorize array loops in a new way that automatically
>> adjusts to the maximum vector length supported by the CPU. It works
>> in the following way:
>
>> A register used as loop counter is initialized to the array length. This
>> counter is decremented by the vector register length and repeats as
>> long as it is positive. The counter register also specifies the vector
>> register length. The vector registers used in the loop will have the
>> maximum length as long as the counter exceeds this value. There
>> is no loop tail because the vector length is automatically adjusted
>> in the last iteration of the loop to fit the remaining number of array
>> elements.
>
> This isn't completely new. The IBM System/370 at one point added a
> vector feature which was largely modelled on that of the Cray I. But
> unlike the Cray I, it was designed so that one model of the 370 might
> have vector registers with 64 elements, and another one might have
> vector registers with 256 elements.

A colleague used it on an IBM 3090. Compared to the Fujitsu VP
they also had at the computer center, it was as slow as molasses.
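
As an illustration of the loop scheme quoted above, here is a minimal sketch in Python. It only simulates the control flow described in the quote. The function name, the `max_vector_len` parameter, and the `chunk_lengths` bookkeeping are assumptions made for this example and are not ForwardCom names.

```python
def vector_add(a, b, max_vector_len=8):
    """Add two equal-length lists chunk by chunk, following the quoted description.

    max_vector_len is a hypothetical stand-in for the maximum vector
    register length of the CPU, counted in elements.
    """
    assert len(a) == len(b)
    result = []
    chunk_lengths = []            # recorded only to show how the length adapts
    counter = len(a)              # loop counter starts at the array length
    pos = 0
    while counter > 0:            # repeat as long as the counter is positive
        # The counter also determines the vector length: full length while
        # the counter exceeds the maximum, shorter in the last iteration.
        vlen = min(counter, max_vector_len)
        # Stand-in for one vector instruction working on vlen elements.
        result.extend(x + y for x, y in zip(a[pos:pos + vlen], b[pos:pos + vlen]))
        chunk_lengths.append(vlen)
        pos += vlen
        counter -= max_vector_len  # decrement by the vector register length
    return result, chunk_lengths
```

With 19 elements and a maximum length of 8, the chunks come out as 8, 8 and 3 elements, so no separate tail loop is needed.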

[...]

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<883abab8-4670-4b73-a17a-c1c58b71d110n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30712&group=comp.arch#30712

by: Agner Fog - Sat, 4 Feb 2023 07:34 UTC

Re: memory management
The main purpose of the memory map in ForwardCom is not virtual address translation, but memory protection. Virtual address translation is not needed in simple well-behaved cases.

> Can an object span across multiple blocks?
Yes, but there would be no point in doing so because blocks can have arbitrary sizes. For example, if a data object spans two blocks of 3 kB and 4 kB, you can join them into one block of 7 kB. You only need to subdivide a data structure if different parts have different access permissions. A typical application has initialized static data, uninitialized static data, stack, and heap. These all have the same access permissions (read and write) so they can be joined into a single block large enough to serve all the needs of the running process for read/write data. A typical application program has three memory blocks with different access permissions: 1. execute only, 2. read only, 3. read and write. If there is a child thread with thread-private memory, then this thread will have a memory map with one extra entry for private data that is not accessible to parent and sibling threads. Thread-private memory is a ForwardCom feature.

ForwardCom has features to predict the maximum size of stack and heap and reserve the calculated size of memory when a program is loaded. If this prediction fails, then the memory becomes fragmented. Then you will need an extra entry in the memory map to access a vacant physical memory block that is not contiguous with the rest, and possibly use virtual address translation to make it appear to be contiguous.

ForwardCom uses various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for a TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel processors support up to 5 levels of page tables.

The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.
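
To make the idea of a small map of variable-size blocks concrete, here is a minimal sketch in Python. It is not taken from the ForwardCom specification: the `Block` layout, the `check_access` function, and all addresses and sizes are assumptions chosen for illustration. The sequential loop stands in for comparators that hardware would evaluate in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    base: int    # first address covered by this entry (hypothetical value)
    size: int    # arbitrary size in bytes, not tied to any page size
    perms: str   # any combination of "r", "w" and "x"


def check_access(memory_map, addr, access):
    """Return True if `access` ("r", "w" or "x") to `addr` is permitted."""
    for block in memory_map:
        # Two comparisons per entry: lower bound and upper bound.
        if block.base <= addr < block.base + block.size:
            return access in block.perms
    return False  # address is not covered by any entry


# Three entries, as in the typical application described above.
typical_map = [
    Block(base=0x10000, size=5000, perms="x"),        # execute only
    Block(base=0x20000, size=1234, perms="r"),        # read only
    Block(base=0x30000, size=7 * 1024, perms="rw"),   # static data, stack, heap
]
```

For example, `check_access(typical_map, 0x30000 + 100, "w")` is allowed, while a write to the read-only block or a read from the execute-only block is refused.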

On 3 February 2023 at 09:45:42 UTC+1, robf...gmail.com wrote:
> Re: memory management,
>
> I skimmed through the manual and was somewhat surprised by the memory
> management section. It sounds almost somewhat like a segmented system.
> Is it possible to get a little more detail? If I have got this right, there is a table(s)
> of base block addresses and block lengths and a number of comparators are
> used to map a virtual address to a physical one. Can an object span across
> multiple blocks? Is the block table swapped on a context switch, or are there
> entries for multiple contexts?
>
> While many systems contain block address translations, they also typically
> include TLB's and paged memory. That style of memory management seems
> to have won-out.

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trle1r$6nm0$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30715&group=comp.arch#30715

by: Thomas Koenig - Sat, 4 Feb 2023 11:03 UTC

Agner Fog <agfo@dtu.dk> wrote:

> ForwardCom has features to predict the maximum size of stack
> and heap and reserve the calculated size of memory when a program
> is loaded.

That sounds like an extremely hard thing to do. What do these
heuristics look like, and how would they estimate (for example)
the heap size of the program

program main
real, dimension(:,:), allocatable :: a, b, c
read (*,*) n
allocate (a(n,n), b(n,n))
read (*,*) a, b
c = matmul(a,b)
print *,c
end

? Do they rely on statistics gathered in earlier runs (sort
of profile-guided feedback, which is not very reliable), or are
they functions of the program itself? What about self-compiled
programs?

> If this prediction fails, then the memory becomes
> fragmented. Then you will need an extra entry in the memory map
> to access a vacant physical memory block that is not contiguous
> with the rest, and possibly use virtual address translation to
> make it appear to be contiguous.

> ForwardCom is using various features to avoid memory
> fragmentation, including an innovative system for function
> libraries. This should keep the number of entries in the memory
> map of each process or thread small enough to fit on the CPU chip
> without the need for TLB and multi-level page tables. A small memory
> map with few variable-size entries is implemented most efficiently
> with comparators. A large memory map for a highly fragmented memory
> cannot use this method, but must use fixed-size pages and large
> page tables. Intel has up to 5 levels of page tables.

It is also a question of the size of the page tables...

Do you envision having page tables as a backup, or will your method
converge to something equivalent to page tables when the application
manages to beat all of your heuristics?

And is the current performance of page tables expected to be better
or worse than what you propose, when your heuristics don't work well?

Really, how well your heuristics work is a central question. I guess
answering it is not possible without extensive profiling of typical
workloads, where the question of what "typical" means is always
problematic.

Very probably, profiling existing applications on other processors
can help a lot there.

> The memory map needs to be saved and restored on context
> switches. It will typically be smaller than the register file
> so that it can be saved efficiently. The CPU may or may not have
> multiple memory maps on-chip to switch between.

So, if the memory map exceeds the (probably) fixed number of
entries that are on the chip, what happens?

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trlidn$6qgc$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30716&group=comp.arch#30716

by: Thomas Koenig - Sat, 4 Feb 2023 12:17 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:

> Also note: The C (or Fortran) source has manually unrolled this loop 4
> times. This just multiplies the number of instructions in the loop by
> 4 while leaving a single LOOP instruction for iteration control.

Is such unrolling actually (likely to be) helpful in a VVM
implementation for speed?

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<12cb8be5-3f45-472d-af7d-6a55898beae6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30717&group=comp.arch#30717

by: Agner Fog - Sat, 4 Feb 2023 12:20 UTC

Thomas Koenig wrote:
> how would they estimate (for example) heap size of the program?
> Do they rely on statistics gathered in earlier runs

The maximum stack size is calculated by the compiler and the linker. In case of deeply recursive functions, the programmer may specify an estimated maximum recursion level.

And yes, the maximum heap size relies on statistics gathered in earlier runs. The programmer may add a suggested value.

If a heap grows beyond the estimated maximum, then the OS may add a large extra block of memory, perhaps doubling the total size each time. Assume, for example, that a program is using 4 memory map entries when it is loaded, and the hardware has space for 256 entries. Then 252 entries are free, and the heap can grow to 2^252 times the initial size before we run out of memory map entries. This will never happen even if we use disk space as virtual memory.
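
The arithmetic behind the 2^252 figure can be checked with a short sketch. It assumes that every extra block is as large as everything allocated so far, so each new memory map entry doubles the heap. The function name and its parameters are made up for this example.

```python
def max_heap_growth(total_entries=256, entries_in_use=4):
    """Return (free entries, growth factor) under the doubling assumption."""
    free_entries = total_entries - entries_in_use
    heap = 1  # initial heap size in arbitrary units
    for _ in range(free_entries):
        heap += heap  # the extra block equals the current total, so it doubles
    return free_entries, heap
```

With the default values this gives 252 free entries and a growth factor of exactly `2**252`.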

There can be other causes of filling the memory map, though. The most likely is demand-paged memory. The use of demand-paged memory is strongly discouraged in ForwardCom. Another possibility is starting and ending so many processes that the entire memory space gradually becomes totally fragmented.

> So, if the memory map exceeds the (probably) fixed number of entries that are on the chip, what happens?
The best solution will probably be to start a garbage collector that can suspend running processes for a short time while rearranging the fragmented memory they are using. A fallback to fixed-size page tables is possible if the hardware supports it, but that is the expensive solution we are trying to avoid.

The performance problem of garbage collection exists in current systems as well. The programmer of a performance-critical memory-hungry program will take care to avoid garbage collection. This can be done by allocating a large memory block when starting, and by recycling allocated memory. Another solution is to start a new thread when a large block of memory is needed, for example in a video editing program. The main thread will not need an extra memory map entry if the new memory block is private to the new thread.

A memory map with space for, for example, 256 entries will require at most 512 comparators. This will be much more efficient than a traditional TLB with millions of pages in multiple levels, and we will have no TLB misses.

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<d1bd5b73-a6e6-4884-8e2d-282e826784f2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30718&group=comp.arch#30718

by: Quadibloc - Sat, 4 Feb 2023 13:03 UTC

On Saturday, February 4, 2023 at 5:18:02 AM UTC-7, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > Also note: The C (or Fortran) source has manually unrolled this loop 4
> > times. This just multiplies the number of instructions in the loop by
> > 4 while leaving a single LOOP instruction for iteration control.

> Is such unrolling actually (likely to be) helpful in a VVM
> implementation for speed?

It will be interesting to hear Mitch's answer to this.

I would be (initially) inclined to say yes. Why?

Mitch's VVM instructions _look_ like ordinary loop instructions. But
the processor treats them as vector instructions instead.

To me, that implies they *take the place* of Cray-like vector
instructions, removing complexity from the instruction set.
That would mean that they have _less_ overhead than a regular
loop, but not that they have _zero_ overhead. And zero
overhead is what an unrolled loop has, right?

And that's where I see why I'm wrong. Compared to an ideal
vector instruction, an unrolled loop certainly does have
"overhead" - all those additional times you fetch and decode
the instructions for the later copies of the loop. (The execute,
of course, you have to do anyways.)

But as long as even a loop with VVM has some nonzero
amount of overhead, it's certainly possible that unrolling it
a very _small_ number of times might still be an optimization.
However, that wouldn't be true if the small number was less
than 2, and given Mitch's description of how he intends
VVM to be implemented, it does seem as though
this would be quite possible.

John Savard

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<aAvDL.621866$vBI8.420635@fx15.iad>

https://www.novabbs.com/devel/article-flat.php?id=30719&group=comp.arch#30719

by: EricP - Sat, 4 Feb 2023 16:18 UTC

Agner Fog wrote:
> Re: memory management
>
> ForwardCom is using various features to avoid memory fragmentation, including an innovative system for function libraries. This should keep the number of entries in the memory map of each process or thread small enough to fit on the CPU chip without the need for TLB and multi-level page tables. A small memory map with few variable-size entries is implemented most efficiently with comparators. A large memory map for a highly fragmented memory cannot use this method, but must use fixed-size pages and large page tables. Intel has up to 5 levels of page tables.
>
> The memory map needs to be saved and restored on context switches. It will typically be smaller than the register file so that it can be saved efficiently. The CPU may or may not have multiple memory maps on-chip to switch between.

A hardware table of comparators costs more than twice as much per entry
as a TLB binary CAM. A binary CAM compare == entry is a bunch of XOR's
and an AND gate. A binary range comparator (low <= addr <= upr) entry is
two bunches of XOR's plus a priority selector (find-last-set),
so double the wire loading on the compare value bus, and more than
double the power consumption per entry, plus longer propagation delay.

Real TLBs for radix page tables can have multiple CAMs to cache
table interior nodes, allowing a reverse page table walk to speed
TLB miss resolution, which adds to the whole TLB cost.

Also each EXE or DLL has its own set of virtual memory section
requirements, plus all the DLLs referenced by other DLLs,
and any of these virtual sections may be relocated when created.
A single instance of a running program can have dozens to hundreds
of memory sections, each with its range of addresses and attributes.

A HW range table would have a limited number of entries, so only a
subset of a program's memory section entries can be loaded at once.
How would a range table miss work?

How does virtual to physical relocation work for this range table?
A radix page table TLB does relocation by substitution so there is
zero cost beyond the TLB lookup for a table hit.
A range table can't use substitution and must use arithmetic relocation,
which costs an N-bit addition propagation delay above the table lookup.
For a 64-bit address this could add a clock to every memory access.

How would paging work for the range table? Hardware needs to track which
pages in each memory section are present, or trigger a page fault if not.
Where are these page Present bits stored for each memory section,
or is this going to use whole-segment faulting/swapping?
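
The difference between relocation by substitution and arithmetic relocation can be sketched in Python as follows. This is a software illustration only and says nothing about gate counts or timing. The 4 KiB page size, the table layouts, and both function names are assumptions made for the example.

```python
PAGE_BITS = 12  # assumption: 4 KiB pages


def translate_paged(vaddr, tlb):
    """tlb: dict mapping virtual page number -> physical frame number."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    pfn = tlb[vpn]  # a KeyError stands in for a TLB miss
    # Substitution: the upper bits are replaced, the offset passes through.
    return (pfn << PAGE_BITS) | offset


def translate_range(vaddr, ranges):
    """ranges: list of (low, high, delta) with inclusive bounds."""
    for low, high, delta in ranges:
        if low <= vaddr <= high:   # two magnitude comparisons per entry
            # Arithmetic relocation: a full-width addition with carry.
            return vaddr + delta
    raise LookupError("range table miss")
```

Both calls below map virtual address 0x1234 to physical address 0x80234: `translate_paged(0x1234, {1: 0x80})` and `translate_range(0x1234, [(0x1000, 0x1FFF, 0x7F000)])`. The first replaces the page number, the second needs an addition.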

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<f1c9c185-9779-4006-9736-b418edd97539n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30720&group=comp.arch#30720

by: Agner Fog - Sat, 4 Feb 2023 16:39 UTC

On 4 February 2023 at 17:19:22 UTC+1, EricP wrote:
> Agner Fog wrote:
> > Re: memory management
> A hardware table of comparators is more than double the cost per-entry
> as a TLB binary CAM.

Yes, but there are fewer entries

> Real TLBs for radix page tables can have multiple CAMs to cache
> table interior nodes, allowing a reverse page table walk to speed
> TLB miss resolution, which adds to the whole TLB cost.

No page tables = no table walk

> Also each EXE or DLL has its own set of virtual memory section
> requirements, plus all the DLLs referenced by other DLLs,
> and any of these virtual sections may be relocated when created.
> A single instance of a running program can have dozens to hundreds
> of memory sections, each with its range of addresses and attributes.

Yes, exactly. ForwardCom has no DLLs, but a new library system with a relink feature. This removes the biggest source of memory fragmentation and avoids a lot of waste of partially used pages. See https://www.forwardcom.info/forwardcom_libraries.php

> How would paging work for the range table? Hardware needs to track which
> pages in each memory section are present, or trigger a page fault if not.

Usually, all memory used by a process will be loaded when the process is loaded. If a memory block is too big for loading it all at once, it needs one entry in the memory map for the present part and one entry for the absent part.

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<2245a823-2b91-4008-a38f-f708ead44832n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30722&group=comp.arch#30722

by: MitchAlsup - Sat, 4 Feb 2023 17:53 UTC

On Saturday, February 4, 2023 at 6:18:02 AM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
<
> > Also note: The C (or Fortran) source has manually unrolled this loop 4
> > times. This just multiplies the number of instructions in the loop by
> > 4 while leaving a single LOOP instruction for iteration control.
<
> Is such unrolling actually (likely to be) helpful in a VVM
> implementation for speed?
<
I guess I failed to actually address the point I was trying to make::
<
Many of the BLAS subroutines were hand unrolled (typically 4 times).
<
Is this something the programmer should be doing in this day and age,
or is this something that is better left to the compiler ???
<
Many of the BLAS matrix subroutines are partitioned into 4 loops,
one for each kind of transpose, to optimize memory performance.
<
Is this something the programmer should be doing in this day and age,
or is this something that is better left to the compiler ???
<
-------------------------------------------------------------------------------------------------------------
<
As far as loop unrolling and VVM goes, unrolled loops provide no
impediment to VVM as long as the unrolled loop still fits in the
VVM buffering (which is implementation defined.)
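
For readers who have not seen the hand unrolling discussed above, here is a minimal sketch. Python is used only for readability (the BLAS originals are Fortran), and the function names axpy_rolled and axpy_unrolled4 are made up for this illustration. It shows the shape of a 4-way unrolled loop with a cleanup loop; it says nothing about how fast either form runs on any particular machine.

```python
def axpy_rolled(a, x, y):
    """y[i] += a * x[i], one element per loop iteration."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y


def axpy_unrolled4(a, x, y):
    """Same computation, loop body unrolled 4 times plus a cleanup loop."""
    n = len(x)
    main = n - n % 4              # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):   # one loop-control step per 4 elements
        y[i] += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
    for i in range(main, n):      # remainder: 0 to 3 leftover elements
        y[i] += a * x[i]
    return y
```

The unrolled version has four times as many instructions in the loop body and still only one piece of loop control, which is the situation described in the post above.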

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trm9m8$23989$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30725&group=comp.arch#30725

by: Stephen Fuld - Sat, 4 Feb 2023 18:55 UTC

On 2/4/2023 9:53 AM, MitchAlsup wrote:
> On Saturday, February 4, 2023 at 6:18:02 AM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
> <
>>> Also note: The C (or Fortran) source has manually unrolled this loop 4
>>> times. This just multiplies the number of instructions in the loop by
>>> 4 while leaving a single LOOP instruction for iteration control.
> <
>> Is such unrolling actually (likely to be) helpful in a VVM
>> implementation for speed?
> <
> I guess I failed to actually address the point I was trying to make::
> <
> Many of the BLAS subroutines were hand unrolled (typically 4 times).
> <
> Is this something the programmer should be doing in this day and age,
> or is this something that is better left to the compiler ???

I guess probably the compiler.

> Many of the BLAS matrix subroutines are partitioned into 4 loops,
> one for each kind of transpose, to optimize memory performance.

> <
> Is this something the programmer should be doing in this day and age,
> or is this something that is better left to the compiler ???
> <

I am less sure about this. Loops are pretty general, so the question of
loop unrolling has wide applicability. Matrix stuff is less general, so
the effort in the compiler would have less of a payback.

-------------------------------------------------------------------------------------------------------------
> <
> As far as loop unrolling and VVM goes, unrolled loops provide no
> impediment to VVM as long as the unrolled loop still fits in the
> VVM buffering (which is implementation defined.)

While I agree with this, I think that VVM reduces the benefits of loop
unrolling: the loop overhead is only one instruction, and there is
no cost for the non-sequential control transfer.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<824271124.697231964.049764.acolvin-efunct.com@news.eternal-september.org>

https://www.novabbs.com/devel/article-flat.php?id=30726&group=comp.arch#30726

by: mac - Sat, 4 Feb 2023 19:45 UTC

> But if your point was to illustrate that C has bad semantics
> wrt array aliasing, I have to agree. The Fortran version has
> no loop carried dependence.

Strictly speaking, the Fortran version *ignores* any loop-carried
dependence.

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<8f67991a-2398-4e64-8d60-a417e32fb7e9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30727&group=comp.arch#30727

by: MitchAlsup - Sat, 4 Feb 2023 19:48 UTC

On Saturday, February 4, 2023 at 1:45:50 PM UTC-6, mac wrote:
> > But if your point was to illustrate that C has bad semantics
> > wrt array aliasing, I have to agree. The Fortran version has
> > no loop carried dependence.
>
> Strictly speaking, the Fortran version *ignores* any loop-carried
> dependence
<
Only because the language explicitly puts memory aliasing on the
shoulders of the programmer. In certain circumstances, the language
explicitly frees the compiler of the burden of detecting memory
aliasing.
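
To make the aliasing point concrete, here is a small sketch in Python (chosen only for readability). Memory is modelled as one flat list, and dst and src are offsets into it; the function names are hypothetical. The first function performs the loads and stores in program order. The second performs all loads before any store, which is what a compiler may effectively do once it is allowed to assume the two regions do not overlap.

```python
def add_one_sequential(mem, dst, src, n):
    """mem[dst+i] = mem[src+i] + 1, strictly in program order."""
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1


def add_one_vectorized(mem, dst, src, n):
    """All loads happen before any store, as if the loop had been
    vectorized under the assumption that the regions do not overlap."""
    loaded = [mem[src + i] for i in range(n)]
    for i in range(n):
        mem[dst + i] = loaded[i] + 1


# Overlapping regions: dst is src shifted by one element.
m_seq = [0, 0, 0, 0, 0]
m_vec = [0, 0, 0, 0, 0]
add_one_sequential(m_seq, 1, 0, 4)   # m_seq becomes [0, 1, 2, 3, 4]
add_one_vectorized(m_vec, 1, 0, 4)   # m_vec becomes [0, 1, 1, 1, 1]
```

With disjoint regions both functions give the same result. With overlapping regions they differ, so a compiler that cannot rule out overlap has to honor the loop-carried dependence, while a language that forbids the overlap lets the compiler ignore it.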

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<trmhuc$7hnk$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30728&group=comp.arch#30728

by: Thomas Koenig - Sat, 4 Feb 2023 21:15 UTC

Agner Fog <agfo@dtu.dk> schrieb:
> Thomas Koenig wrote:
>> how would they estimate (for example) heap size of the program?
>>Do they rely on statistics gathered in earlier runs
>
> The maximum stack size is calculated by the compiler and the
> linker.

Hmm... "calculated" is a strong word. Is it then necessary to make
a complete call graph, and to estimate the recursion depth?
This sounds very hard to do reliably, as compiler estimates
of levels of recursion are likely to be as wrong as programmer's
estimates :-)

>In case of deeply recursive functions, the programmer may
>specify an estimated maximum recursion level.

Ah, the days of the REGION parameter of JCL... (but slurm isn't
better, it is just the syntax that is different).

There is actually another reason to use stack instead of dynamic
memory: Speed. Allocating arrays on the stack instead of using
dynamic memory has several advantages: No overhead for malloc/free(),
and unused values on the stack are much more likely to be overwritten
by new values, so they no longer take up stack space.

This is why gfortran, for example, also puts arrays on the stack
up to a certain limit at -Ofast.

Your scheme is actually better than a hard error on an unreasonably
low value as a stack size, which some operating systems impose.
MacOS is particularly egregious in this respect, especially
regarding pthreads...

> And yes, the maximum heap size relies on statistics gathered in
> earlier runs. The programmer may add a suggested value.
> If a heap grows beyond the estimated maximum, then the OS may add
> a large extra block of memory, perhaps double the previous
> size.

OK...

>Assume, for example, that a program is using 4 memory map
> entries when it is loaded, and the hardware has space for 256
> entries,

256 entries for each process, each consisting of a 64-bit pointer
and a 64-bit offset, would be quite large (4 kB), also compared to
a reasonable register set. However, you obviously don't need that
(I assume that you use 64-bit addresses).

> then the heap can grow to 2^252 times the initial size
> before we run out of memory map entries. This will never happen
> even if we use disk space as virtual memory.

... and that is if your architecture can even address that much memory.
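
A quick check of the numbers in this exchange, as a Python sketch. It assumes that every heap extension doubles the heap and costs exactly one memory map entry. The 1 MiB initial heap and the flat 64-bit address space are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def max_heap_growth(total_entries=256, entries_at_load=4):
    """Spare entries, and the growth factor if each one doubles the heap."""
    spare = total_entries - entries_at_load
    return spare, 2 ** spare


def doublings_to_fill(address_bits=64, initial_heap_bytes=1 << 20):
    """Doublings until the heap alone would cover the whole address space."""
    size = initial_heap_bytes
    count = 0
    while size < (1 << address_bits):
        size *= 2
        count += 1
    return count


spare, factor = max_heap_growth()   # 252 spare entries, factor 2**252
filled_after = doublings_to_fill()  # 44 doublings under these assumptions
```

Under these assumptions the address space is exhausted after 44 doublings, long before the 252 spare entries are used up, which is consistent with both statements above.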

>
> There can be other causes of filling the memory map, though. The
> most likely is demand-paged memory. The use of demand-paged memory
> is strongly discouraged in ForwardCom.

What is then your strategy for when virtual memory exceeds physical
memory? Swapping out whole processes? That has been replaced by
paging for interactive systems for good reasons.

What is the strategy if the size of a single process exceeds total physical
memory, but (as is often the case) only a small part of that memory is in use?
(Having recently done a Debug build of clang, which took 20 GB on my
16GB home machine, I had reason to observe that the behavior, while not
great, was not too bad for responsiveness).

What is your strategy for memory-mapped I/O?

> Another possibility is
> starting and ending so many processes that the entire memory space
> gradually becomes totally fragmented.

Would you run into trouble over time if the processes finish again,
or would that set things right?

>> So, if the memory map exceeds the (probably) fixed number of entries that are on the chip, what happens?
> The best solution will probably be to start a garbage collector
> that can suspend running processes for a short time while
> rearranging the fragmented memory it is using. A fallback to
> fixed-size page tables is possible if the hardware supports it,
> but that is the expensive solution we are trying to avoid.

Sounds like a thing better to avoid, indeed...

> The performance problem of garbage collection exists in current
> systems as well. The programmer of a performance-critical
> memory-hungry program will take care to avoid garbage
> collection. This can be done by allocating a large memory block
> when starting, and by recycling allocated memory. Another solution
> is to start a new thread when a large block of memory is needed,
> for example in a video editing program. The main thread will not
> need an extra memory map entry if the new memory block is private
> to the new thread.

> A memory map with space for, for example,
> 256 entries will require at most 512 comparators. This will be much
> more efficient than a traditional TLB with millions of pages in
> multiple levels, and we will have no TLB misses.

Sounds like an interesting concept, but I still have doubts that
it will actually work. How far are you on the way of demonstrating
this? Given enough resources (probably a few PhDs would not be
enough) it might be possible to put such a system in a current
operating system (Linux kernel?) and try out how it works. Or,
maybe more reasonably, it might be possible to modify the kernel
to gather enough data so that a simulator of your system can be fed.

How far along are you in verifying that your system works?

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<5923848a-2504-4487-975c-40f1137beb2cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30729&group=comp.arch#30729

by: MitchAlsup - Sat, 4 Feb 2023 21:32 UTC

On Saturday, February 4, 2023 at 10:39:16 AM UTC-6, Agner Fog wrote:
> 4. februar 2023 kl. 17.19.22 UTC+1 EricP wrote:
> > Agner Fog wrote:
> > > Re: memory management
> > A hardware table of comparators is more than double the cost per-entry
> > as a TLB binary CAM.
> Yes, but there are fewer entries
> > Real TLBs for radix page tables can have multiple CAMs to cache
> > table interior nodes, allowing a reverse page table walk to speed
> > TLB miss resolution, which adds to the whole TLB cost.
> No page tables = no table walk
> > Also each EXE or DLL has its own set of virtual memory section
> > requirements, plus all the DLLs referenced by other DLLs,
> > and any of these virtual sections may be relocated when created.
> > A single instance of a running program can have dozens to hundreds
> > of memory sections, each with its range of addresses and attributes.
> Yes, exactly. ForwardCom has no DLLs, but a new library system with a relink feature. This removes the biggest source of memory fragmentation and avoids a lot of waste of partially used pages. See https://www.forwardcom.info/forwardcom_libraries.php
> > How would paging work for the range table? Hardware needs to track which
> > pages in each memory section are present, or trigger a page fault if not.
> Usually, all memory used by a process will be loaded when the process is loaded. If a memory block is too big for loading it all at once, it needs one entry in the memory map for the present part and one entry for the absent part.
<
Consider a web browser which is intended to be initialized and then run for years without ever being turned off. Are you handling pages on the browser as unique processes, or as additions to the current process ??
<
Then consider a database engine with 10K users and supporting a 100GB database::
Each user having to be totally isolated from every other user:: Are 256 mapping entries anywhere close to what is needed for the <common> DB engine ?
<
Consider the context switching overhead of migrating 256 quadword entries from process to process. Are you still capable of hard real time performance ?
<
Have you considered that each mapping entry might want its own ASID so that shared memory is accessed through a common ASID ??

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<3abd6fd9-1fc7-49a3-8f0f-92030ffa3a5cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30730&group=comp.arch#30730

by: Agner Fog - Sun, 5 Feb 2023 07:55 UTC

Thomas Koenig wrote:
>> The maximum stack size is calculated by the compiler and the linker.

>Hmm... "calculated" is a strong word. Is it then necessary to make
>a complete call graph, and to estimate the recursion depth?

If library function A calls functions B and C, then the object file for A includes sufficient information to calculate the difference between the stack pointer at B and A, and between C and A. This includes, of course, fixed-size arrays on the stack. The linker sums this information and finds the branch with the deepest stack use.
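
The linker-side calculation described above can be pictured roughly as follows. This is a sketch under assumptions, not the ForwardCom linker: the per-function frame sizes and the call lists are made-up inputs standing in for whatever the object files actually record, and max_stack_depth is a hypothetical name. Recursion has no static bound, so the sketch simply reports it; that is the case where the programmer would have to supply an estimate.

```python
def max_stack_depth(frame_size, calls, root):
    """Deepest stack use reachable from root.

    frame_size: dict, function name -> bytes of stack the function uses
    calls:      dict, function name -> list of functions it calls
    Raises ValueError on recursion, where no static bound exists.
    """
    on_path = set()   # functions on the call chain currently being explored
    memo = {}         # function name -> deepest stack use starting there

    def visit(fn):
        if fn in memo:
            return memo[fn]
        if fn in on_path:
            raise ValueError(f"recursive call through {fn}: needs a programmer-supplied bound")
        on_path.add(fn)
        deepest_callee = max((visit(c) for c in calls.get(fn, [])), default=0)
        on_path.remove(fn)
        memo[fn] = frame_size[fn] + deepest_callee
        return memo[fn]

    return visit(root)


# A calls B and C; C calls D. Sizes in bytes are invented for the example.
frames = {"A": 32, "B": 64, "C": 16, "D": 128}
graph = {"A": ["B", "C"], "C": ["D"]}
depth = max_stack_depth(frames, graph, "A")   # 32 + 16 + 128 = 176
```

The deepest branch here is A, C, D, not the one through the largest direct callee of A, which is why the sum has to be taken along whole call chains.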

alloca in C makes variable-size arrays on the stack. This is rarely used. The programmer may specify an estimated maximum size when using alloca, or the system can use the same strategy as for the heap.

>This is why gfortran, for example, also puts arrays on the stack up to a certain limit at -Ofast.

Most programming languages put fixed-size arrays on the stack.

>256 entries for each process, each consisting of a 64-bit pointer
>and a 64-bit offset, would be quite large (4 kB), also compared to
>a reasonable register set. However, you obviously don't need that

Most processes would use only 3 entries. The rest is reserved for the worst case. You don't need to save the unused entries.
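
The arithmetic behind the 4 kB figure, and the saving from skipping unused entries, as a small Python sketch. The 16 bytes per entry is the assumption made in the quoted question (a 64-bit pointer plus a 64-bit offset), not a figure taken from the ForwardCom specification, and map_state_bytes is a name invented here.

```python
ENTRY_BYTES = 8 + 8   # assumed: 64-bit pointer + 64-bit offset per entry


def map_state_bytes(entries_in_use):
    """Bytes of memory map state to save if only used entries are saved."""
    return entries_in_use * ENTRY_BYTES


full_map = map_state_bytes(256)   # 4096 bytes, the "4 kB" in the question
typical = map_state_bytes(3)      # 48 bytes for a process using 3 entries
```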

>What is the strategy if the size of a single process exceeds total physical
>memory, but (as is often the case) only a small part of that memory is in use?

We would have to divide the process's memory into a few blocks, some of which are swapped out.

>What is your strategy for memory-mapped I/O?

The user process is calling a system function. The system call instruction has a feature for sharing a block of memory. The system function does not have access to everything, as current systems have, but only to the memory block shared by the application. The system function may use DMA or whatever the system designer sees fit.

>> Another possibility is starting and ending so many processes that the entire memory space
>> gradually becomes totally fragmented.

>Would you run into trouble over time if the processes finish again,
>or would that set things right?

The OS will collect as much garbage as it can without disturbing running processes.

>Sounds like an interesting concept, but I still have doubts that it will actually work.

You probably cannot avoid TLB and fixed-size pages in large multi-user servers, but you can in small systems for dedicated purposes. There is so much to gain by avoiding TLB and multilevel page tables that this is something worth pursuing. ForwardCom does not ban TLB, but many niche applications can certainly work without it. I don't see ForwardCom replacing x86 or ARM in the near future because users need backward compatibility. The first uses will probably be research, embedded FPGA systems, single-purpose niche products, and large vector processors. Some wasteful programming habits have to go if you want top performance. At least you should avoid memory-mapped files.

>How far along are you in verifying that your system works?

The library system and relinking mechanism work already. This removes the biggest source of memory fragmentation. The rest is playground for whoever needs an interesting university project.

MitchAlsup wrote:
>Consider a web browser which is intended to be initialized and then run for years without ever
>being turned off. Are you handling pages on the browser as unique processes,
>or as additions to the current process ??

Each thread has its own memory map. You can put each browser page in a separate thread.

>Then consider a database engine with 10K users and supporting a 100GB database::
>Each user having to be totally isolated from every other user::
>Are 256 mapping entries anywhere close to what is needed for the <common> DB engine ?

You probably wouldn't use this for a many-user server, as I explained above.
But even if you have multiple users, you would still service each user in a separate process or a separate thread with its own small memory map.

>Consider the context switching overhead of migrating 256 quadword entries from process to process.
>Are you still capable of hard real time performance ?

99% of processes will have 2-5 memory map entries. You don't save unused entries. The hardware may still have space for many entries to deal with rare worst-case events.

>Have you considered that each mapping entry might want its own ASID so that shared memory is
>accessed through a common ASID ??

Each process and each thread has its own little memory map. The map of which process has access to what is not stored inside the CPU. Transfer of large data from one process to another or between system and a user process goes through a memory block shared between the two.
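
A per-thread memory map with a handful of variable-size entries can be pictured along these lines. This is only a sketch under assumptions: the entry layout (start, end, offset) and the names MapEntry and translate are invented for illustration and are not taken from the ForwardCom specification. Hardware would test all entries in parallel with two comparisons each, which matches the "256 entries, at most 512 comparators" figure quoted earlier in the thread; the loop is a software stand-in.

```python
from typing import NamedTuple, Optional


class MapEntry(NamedTuple):
    start: int    # first virtual address covered by this block
    end: int      # one past the last virtual address covered
    offset: int   # added to a virtual address to get the physical one


def translate(entries, vaddr) -> Optional[int]:
    """Return the physical address, or None for an access violation."""
    for e in entries:
        if e.start <= vaddr < e.end:   # the two comparisons per entry
            return vaddr + e.offset
    return None


# A thread with two blocks, e.g. code and data; all numbers are invented.
thread_map = [
    MapEntry(0x1000, 0x2000, 0x10000),
    MapEntry(0x8000, 0x9000, -0x4000),
]
hit = translate(thread_map, 0x1800)    # 0x11800
miss = translate(thread_map, 0x2000)   # None: outside every block
```

Because blocks are of arbitrary size, a thread needs one entry per contiguous block rather than one per page, which is why a few entries can cover most processes and why fragmentation is the thing that makes the entry count grow.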

Re: Introducing ForwardCom: An open ISA with variable-length vector registers

<tro4tu$8fvr$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30734&group=comp.arch#30734

by: Thomas Koenig - Sun, 5 Feb 2023 11:46 UTC

Agner Fog <agfo@dtu.dk> schrieb:
> Thomas Koenig wrote:
>>> The maximum stack size is calculated by the compiler and the linker.
>
>>Hmm... "calculated" is a strong word. Is it then necessary to make
>>a complete call graph, and to estimate the recursion depth?
>
> If library function A calls functions B and C, then the object file for A includes sufficient information to calculate the difference between the stack pointer at B and A, and between C and A. This includes, of course, fixed-size arrays on the stack. The linker is summing this information and finds the branch with the deepest stack use.
>
> alloca in C makes variable-size arrays on the stack. This is
> rarely used.

Well, alloca is not part of the standardized C language, it is an
extension.

>The programmer may specify an estimated maximum size
> when using alloca, or the system can use the same strategy as for
> the heap.

alloca is not part of the standardized C language proper (and systems
have IMHO unreasonable restrictions on stack size). Plus, a stack
overflow is usually not handled gracefully.

>>This is why gfortran, for example, also puts arrays on the stack up to a certain limit at -Ofast.
>
> Most programming languages put fixed-size arrays on the stack.

I was not talking about fixed-size arrays only.

In modern Fortran (since 1991) you can use

    subroutine foo(a,n)
      real, dimension(n), intent(inout) :: a
      real, dimension(n) :: b

which declares a variable-sized array, which the compiler can put
on the heap or the stack. Since Fortran 2008, there is also

    n = ...
    block
      real, dimension(n) :: a
    end block

C99 later had something like this as variable-length arrays, but
they were declared optional in subsequent revisions of the C standard,
and if I remember correctly, they are not supported in Microsoft C.

>>256 entries for each process, each consisting of a 64-bit pointer
>>and a 64-bit offset, would be quite large (4 kB), also compared to
>>a reasonable register set. However, you obviously don't need that
>
> Most processes would use only 3 entries. The rest is reserve
> for worst case. You don't need to save the unused entries.

OK.

>>What is the strategy if the size of a single process exceeds total physical
>>memory, but (as is often the case) only a small part of that memory is in use?
>
> We would have to divide the process's memory into a few blocks,
> some of which are swapped out.

So, this would then somewhat look like a system with huge pages.

Hm, then your system would be a bit like having huge pages (1 GB
or something like that) with an additional size limitation?

>>What is your strategy for memory-mapped I/O?
>
> The user process is calling a system function. The system call
> instruction has a feature for sharing a block of memory. The system
> function does not have access to everything, as current systems
> have, but only to the memory block shared by the application. The
> system function may use DMA or whatever the system designer
> sees fit.

So, each shared memory block would then have its own entry in the
table for each of the processes, correct?

Do I see it correctly that an application which, for whatever reason,
opened 1000 files via memory-mapped I/O at the same time would run
out of descriptors?

Same thing could apply for shared memory between processes.

>>> Another possibility is starting and ending so many processes that the entire memory space
>>> gradually becomes totally fragmented.
>
>>Would you run into trouble over time if the processes finish again,
>>or would that set things right?
>
> The OS will collect as much garbage as it can without disturbing running processes.

OK.

>
>>Sounds like an interesting concept, but I still have doubts that it will actually work.
>
> You probably cannot avoid TLB and fixed-size pages in large
> multi-user servers, but you can in small systems for dedicated
> purposes. There is so much to gain by avoiding TLB and multilevel
> page tables that this is something worth pursuing. ForwardCom does
> not ban TLB, but many niche applications can certainly work without
> it. I don't see ForwardCom replacing x86 or ARM in any near future
> because users need backward compatibility.

Source code compatibility would be the first step; for this you
need a compiler. I read that there is currently none available;
a port to either gcc or LLVM would be a logical first step.
Is work ongoing on that?

> The first uses will
> probably be research, embedded FPGA systems, single-purpose niche
> products, and large vector processors. Some wasteful programming
> habits have to go if you want top performance. At least you should
> avoid memory mapped files.

Memory-mapped files: They would be wasteful on your architecture,
but I'm not aware that they are wasteful on what is in use today
(or are they?).

Research, sure. FPGA/embedded systems: Possible, you also don't
necessarily need an operating system for that. Large vector
processors: For a cluster, you would need access to a pretty
full operating system including networked file access, shells etc.
Job submission and cross-compilation could then be done on special
servers, and people using HPC are used to specifying maximum memory
requirements via slurm (or whatever system they are using) anyway.

So, what about a port of a reasonably fully-featured OS? Linux
or one of the BSD variants comes to mind.

Regarding special applications: I think today's situation of binary
compatibility between largely different classes of computers,
from laptops to supercomputers, has shown that a general-purpose
architecture has the greatest potential.

>>How far along are you to verifying that your system works?
>
> The library system and relinking mechanism works already. This
> removes the biggest source of memory fragmentation. The rest is
> playground for whoever needs an interesting university project.

In other words, you have removed the biggest source of memory
fragmentation, but there is no (semi-)hard data on the effectiveness
of your solutions for real-world applications. This will make it
hard to convince sceptics that your model would actually work,
I'm afraid.
