devel / comp.arch / Re: VVM versus single instruction

Subject                                          Author
* VVM versus single instruction                  MitchAlsup
+* Re: VVM versus single instruction             Quadibloc
|+- Re: VVM versus single instruction            MitchAlsup
|`* Re: VVM versus single instruction            BGB
| `* Re: VVM versus single instruction           MitchAlsup
|  `* Re: VVM versus single instruction          BGB
|   `* Re: VVM versus single instruction         Stefan Monnier
|    `* Re: VVM versus single instruction        BGB
|     `* Re: VVM versus single instruction       Stefan Monnier
|      +- Re: VVM versus single instruction      BGB
|      `* Re: VVM versus single instruction      EricP
|       `- Re: VVM versus single instruction     BGB
+* Re: VVM versus single instruction             Stefan Monnier
|`* Re: VVM versus single instruction            MitchAlsup
| `- Re: VVM versus single instruction           Stefan Monnier
`* Re: VVM versus single instruction             Stephen Fuld
 +* Re: VVM versus single instruction            MitchAlsup
 |`- Re: VVM versus single instruction           Stephen Fuld
 `- Re: VVM versus single instruction            BGB

VVM versus single instruction

<777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23295&group=comp.arch#23295
X-Received: by 2002:a05:620a:1136:: with SMTP id p22mr698716qkk.685.1644260772742;
Mon, 07 Feb 2022 11:06:12 -0800 (PST)
X-Received: by 2002:a05:6808:8d:: with SMTP id s13mr164447oic.227.1644260772194;
Mon, 07 Feb 2022 11:06:12 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 11:06:12 -0800 (PST)
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:902d:1b92:ecee:b701;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:902d:1b92:ecee:b701
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
Subject: VVM versus single instruction
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 07 Feb 2022 19:06:12 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 66
 by: MitchAlsup - Mon, 7 Feb 2022 19:06 UTC

A while back we had a conversation about vectorized memory-to-memory
copy activities, with good arguments on both sides.

While refining some of the more esoteric portions of
My 66000, I stumbled onto a new realization for including
certain instructions rather than performing the function as
a vectorized loop.

The primary reason is unCacheable memory!
The secondary reason is cache coherence traffic!

PCIe has a maximum payload length of 4096 bytes. My 66000
interconnect transport supports messages of such length!
Why should the instruction set not have access to that
functionality?

It seems to me that instructions like LDM, STM, and MM
should be able to access that functionality "At Least"
when the access is to unCacheable memory (that is, no
coherence traffic is associated with the interior parts
of the message to screw up cache coherence). My 66000
can use this capability to perform message passing since
all of the data is transferred in a single payload and
will appear ATOMIC to all interested 3rd-parties.

LDM and STM can only access 4 cache lines of data (the
size of the register file). Still, this would allow a
device driver to send an entire PCIe command to a device
without needing to grab a big OS LOCK! There must be
dozens of usages that would eliminate a lot of LOCKing
found in lots of OSs simply because the entire message
arrives atomically!
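
As a rough C-level sketch of the difference (the device structure,
the lock helpers, and the __stm64() intrinsic are all made up here,
purely to illustrate the point):

  #include <stdint.h>

  struct dev_cmd { uint64_t w[8]; };   /* 64-byte device command        */

  /* Conventional driver: the slot is filled by several ordinary
     stores, so a lock keeps the device from observing a half-written
     command (acquire/release stand in for whatever the OS provides). */
  void post_cmd_locked(volatile struct dev_cmd *slot,
                       const struct dev_cmd *c)
  {
      acquire(&cmd_lock);
      for (int i = 0; i < 8; i++)
          slot->w[i] = c->w[i];
      release(&cmd_lock);
  }

  /* With an STM-style instruction that emits the whole command as one
     unCacheable interconnect message, the device observes the update
     atomically and the lock disappears.                               */
  void post_cmd_atomic(volatile struct dev_cmd *slot,
                       const struct dev_cmd *c)
  {
      __stm64(slot, c);   /* hypothetical intrinsic: one 64-byte message */
  }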

MM can access essentially unbounded lengths, but PCIe
already chops the unbounded DMA into chunks and then
ships as many chunks as are required. Thus, the logic is
already in the chip; why not instantiate another instance!

{This may have to be restricted to Guest OS or HyperVisor
"privilege" levels.}

In another realization, imagine that the ROM (typically
the BOOT ROM) contains a page filled with zeros. ROM is in
unCacheable space, typically. So, after allocating a new
page, the Guest OS could fill it with zeros in a single
instruction! Which happens to traverse the interconnect
in a single message, saving lots of message headers on the
interconnect and ALL of the coherence traffic!
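
In plain C terms the contrast is roughly the following; memset/memcpy
stand in for what would be lowered to an MM instruction, and the
rom_zero_page symbol is an assumed name for the zero-filled ROM page:

  #include <string.h>
  #include <stdint.h>

  #define PAGE_SIZE 4096

  extern const uint8_t rom_zero_page[PAGE_SIZE];   /* assumed location */

  /* Today: a store loop (or a memset lowered to one), i.e. many stores
     and, for cacheable memory, a page's worth of coherence traffic.   */
  void zero_page_loop(void *page)
  {
      memset(page, 0, PAGE_SIZE);
  }

  /* The idea above: one MM copying from the zero page in unCacheable
     ROM space; on the interconnect this becomes a single 4096-byte
     message with no coherence traffic for the source.                 */
  void zero_page_mm(void *page)
  {
      memcpy(page, rom_zero_page, PAGE_SIZE);
  }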

Since we KNOW that ROM will never change its stored values,
we can use the knowledge of this to eliminate coherence
traffic on ANY reads to ROM! {there are no writes!} even
if the ROM address is translated through a Cacheable access
PTE, too. {In practice Flash ROM can change its value,
but this happens so infrequently that nobody will be
harmed by making this subSpace coherence traffic free.
Generally when Flash ROM is updated one goes through a
ReBoot sequence shortly thereafter.}

Now that ROM access is efficient, entire libraries of
shared code can be placed in ROM. All of this code can
be Trusted!

So one gets streaming stores,
low-overhead Loads, and
lower-overhead shared libraries,
simply by detecting "a couple of things".

Re: VVM versus single instruction

<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23296&group=comp.arch#23296
X-Received: by 2002:a05:620a:22d4:: with SMTP id o20mr833564qki.90.1644262487776;
Mon, 07 Feb 2022 11:34:47 -0800 (PST)
X-Received: by 2002:a05:6830:233d:: with SMTP id q29mr347400otg.331.1644262487505;
Mon, 07 Feb 2022 11:34:47 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 11:34:47 -0800 (PST)
In-Reply-To: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:cc97:2170:c420:2f16;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:cc97:2170:c420:2f16
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
Subject: Re: VVM versus single instruction
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 07 Feb 2022 19:34:47 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 17
 by: Quadibloc - Mon, 7 Feb 2022 19:34 UTC

On Monday, February 7, 2022 at 12:06:14 PM UTC-7, MitchAlsup wrote:

> Since we KNOW that ROM will never change its stored values,
> we can use the knowledge of this to eliminate coherence
> traffic on ANY reads to ROM!

What, existing architectures, which allow one to mark
memory used for memory-mapped I/O as not cacheable,
don't also allow indicating the other common thing
usually found in a computer's memory address space
already?

Or, perhaps, did they figure that if nobody ever attempts to
write to ROM, there won't _be_ any cache coherence traffic,
even if it's cached normally and the computer doesn't know
it isn't RAM?

John Savard

Re: VVM versus single instruction

<a506d819-7863-403b-9f09-df30f2483351n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23297&group=comp.arch#23297
X-Received: by 2002:a05:620a:15cc:: with SMTP id o12mr1085315qkm.152.1644268770273;
Mon, 07 Feb 2022 13:19:30 -0800 (PST)
X-Received: by 2002:a05:6830:1493:: with SMTP id s19mr683127otq.85.1644268770030;
Mon, 07 Feb 2022 13:19:30 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 13:19:29 -0800 (PST)
In-Reply-To: <5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:902d:1b92:ecee:b701;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:902d:1b92:ecee:b701
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com> <5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a506d819-7863-403b-9f09-df30f2483351n@googlegroups.com>
Subject: Re: VVM versus single instruction
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 07 Feb 2022 21:19:30 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 34
 by: MitchAlsup - Mon, 7 Feb 2022 21:19 UTC

On Monday, February 7, 2022 at 1:34:49 PM UTC-6, Quadibloc wrote:
> On Monday, February 7, 2022 at 12:06:14 PM UTC-7, MitchAlsup wrote:
>
> > Since we KNOW that ROM will never change its stored values,
> > we can use the knowledge of this to eliminate coherence
> > traffic on ANY reads to ROM!
<
> What, existing architectures, which allow one to mark
> memory used for memory-mapped I/O as not cacheable,
> don't also allow indicating the other common thing
> usually found in a computer's memory address space
> already?
<
Well, for example no payloads in HyperTransport are longer than
a cache line. Here I am allowing instruction access to 4096 byte
payloads.
<
unCacheable memory space uses the incoherent command set
of HT. And there are some complications due to MTRRs to
cover streaming stores and a few other things.
>
> Or, perhaps, did they figure that if nobody ever attempts to
> write to ROM, there won't _be_ any cache coherence traffic,
> even if it's cached normally and the computer doesn't know
> it isn't RAM?
<
In HT, if one marked the ROM Cacheable, Snoop requests were
not suppressed (99% confidence level). The coherent command
set of HT did not have qualifications; however, nothing prevents
whatever controller sits at ROM from simply not generating the
coherent requests appropriate to Reading ROM. My guess is that
ROM access is so infrequent in operation that it never made the
front page of the designers' work list.
>
> John Savard

Re: VVM versus single instruction

<sts6dr$3jh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23298&group=comp.arch#23298
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 16:28:08 -0600
Organization: A noiseless patient Spider
Lines: 172
Message-ID: <sts6dr$3jh$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 7 Feb 2022 22:28:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="0cf1e5531404321b57c8451ad1db7702";
logging-data="3697"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/d90I5oh9sIl/Xm9Pt0dVS"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:wv7Vki/XITU14P/AiKWUsdQvyRE=
In-Reply-To: <5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 7 Feb 2022 22:28 UTC

On 2/7/2022 1:34 PM, Quadibloc wrote:
> On Monday, February 7, 2022 at 12:06:14 PM UTC-7, MitchAlsup wrote:
>
>> Since we KNOW that ROM will never change its stored values,
>> we can use the knowledge of this to eliminate coherence
>> traffic on ANY reads to ROM!
>
> What, existing architectures, which allow one to mark
> memory used for memory-mapped I/O as not cacheable,
> don't also allow indicating the other common thing
> usually found in a computer's memory address space
> already?
>
> Or, perhaps, did they figure that if nobody ever attempts to
> write to ROM, there won't _be_ any cache coherence traffic,
> even if it's cached normally and the computer doesn't know
> it isn't RAM?
>

Yeah, if you don't store to ROM, the cache lines don't get marked dirty,
so no reason to write them back.

Probably depends a lot on how one implements the cache consistency
mechanism though.

In my case, it is pretty loose:
Each cache merely copies the cache-line from the outer-level caches, or
from DRAM (NINE style, so a copy of a cache line in L1 may or may not
have another copy in L2);
Parties are responsible for writing back dirty cache lines in a timely
manner;
There is at present no (general) mechanism to invalidate cache lines in
other caches.

Consistency generally requires explicit intervention:
Manual cache flushes;
Use of "volatile" memory access (cache lines are auto self flushing);
Encoded either via a special address range (physical) or TLB bits;
Volatile lines auto-flush after a certain number of clock cycles;
...

The volatile cache lines can maintain some semblance of consistency
between cores; however, their being self-flushing isn't ideal for performance.
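
A minimal sketch of what "explicit intervention" means for the software,
assuming hypothetical __flush_dcache_range()/__inval_dcache_range()
primitives and placeholder msg/fill/use helpers (these names are
illustrative, not the actual BJX2 interface):

  /* Producer core: write the data, then push the dirty lines out so
     another core can see them. */
  void publish(struct msg *m, volatile int *ready_flag)
  {
      fill_message(m);
      __flush_dcache_range(m, sizeof *m);   /* write back dirty lines */
      *ready_flag = 1;                      /* flag sits in "volatile"
                                               (self-flushing) memory */
  }

  /* Consumer core: drop any stale copy before reading. */
  void consume(struct msg *m, volatile int *ready_flag)
  {
      while (!*ready_flag)
          ;                                 /* poll the flag          */
      __inval_dcache_range(m, sizeof *m);   /* discard stale lines    */
      use_message(m);
  }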

Another possibility could be a way to flag memory as "exclusive" which
causes only a single party to be allowed a copy of a given cache line
(or, only a single copy if any party has a "dirty" copy, *). This would
require more complicated signaling between the caches though.

*: So, say, if an L1 cache tries to write to an "Exclusive" line, it
first has to signal to the bus that it is doing so, and all other caches
will flush this cache line, and then the L2 cache puts the cache line
into an "Acquired Exclusive" state, and the L1 can proceed once it gets
back a response from the L2.

If an L2 miss happens which affects this cache line, the L2 sends out a
signal that whatever L1 cache holds this cache line should write it back
(in which case it turns back into a normal "dirty" cache line).

Potentially, the L1's would only be allowed to have a fairly limited
number of dirty cache lines in this mode, with similar "auto-evict after
N cycles" behavior to the volatile lines (albeit maybe a larger
timeout), ...
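
As a software model of the sequence being described (states, message
names, and helpers are made up; this is a sketch, not an implementation):

  enum line_state { INVALID, SHARED, EXCLUSIVE, ACQUIRED_EXCLUSIVE, DIRTY };

  /* An L1 that wants to write a line marked "Exclusive" must first win
     exclusivity from the L2 before the store can proceed. */
  void l1_write_to_exclusive(struct cache_line *ln)
  {
      send_to_l2(MSG_ACQUIRE_EXCLUSIVE, ln->addr); /* 1: announce intent   */
      /* 2: L2 tells every other L1 to flush its copy of this line,
            then marks its own copy ACQUIRED_EXCLUSIVE.                    */
      wait_for_l2_ack(ln->addr);                   /* 3: grant arrives     */
      ln->state = DIRTY;                           /* 4: the store retires */
  }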

Though, for "general purpose" read/write shared memory, this would have
similar drawbacks to using "volatile". It could be better for
read-dominant or read-only memory, but then has little advantage over
the current "default" strategy.

Though, one could argue that a drawback of my current strategy would be
that it hinders traditional multi-threading by not allowing multiple
threads to operate at the same time on the same memory without incurring
memory consistency issues (or using volatile memory and incurring a
performance hit).

....

Oh well, in my case, getting the new (128-bit) C ABI working is being a
bit more of a challenge on the debugging front than I had originally
imagined.

Gradually making progress, but it has turned into a pretty massive bug
storm (partly by shifting a bulk of the code generation in this case
over to "mostly untested" and "only partially implemented" code paths).

Types of frequent problem cases:
Code not dealing correctly with 128 bit pointers;
Problem cases involving R32..R63 not being handled, ...;
A few bugs were resulting in pointers / etc becoming misaligned;
Bugs in the register allocator involving 128-bit types;
...

So, summary of ABI differences from normal 64-bit ABI:
Uses all 64 GPRs by default;
Having XGPR and XMOV is required for this ABI;
Also assumes ALUX (128-bit ALU ops), ...
Switches to using 16 registers for argument passing;
This allows up to 8 pointer (or int128/vector) arguments;
R4..R7, R20..R23, R36..R39, R52..R55
Pointers and variant types are now 128 bits;
All pointers exist as register pairs;
Pointers are generally also now assumed to be tag-refs by default.
Various register assignments have changed;
The 'this' argument has been moved from R3 to R19:R18;
Struct Return: R2 to R3:R2;
...
Alignment for pointers, etc, is now 16 bytes.

Limitations and non-changes:
Keeps prior organization of scratch and callee preserve registers;
Assumes all image sections, ..., to exist within a single quadrant;
Assumes all branches to also be within the same quadrant;
The ABIs are neither link nor data-sharing compatible;
Many existing ASM blobs are not compatible with the ABI;
Pointer arithmetic will be slower than with the 64-bit ABI;
Normal pointers now require a strict 16B alignment;
...

Changing the above would have required more significant changes, or in
some cases handling call/return from function pointers via runtime calls
or thunks or additional ISA-level support.

Simpler option for now is to assume that one can put the program image
and all of its associated DLLs within a single 48-bit quadrant (however,
even as such, function pointers are also still widened to 128 bits...).

ISA additions needed for this:
XLEA.B, Does LEA on 128-bit pointers, ...
Adjusts bounds-checks for bounds-checked pointers;
Currently requires manual scaling for indexing (*).
MOVTT, Modifies the type-tag value on a register.
Sub-variants exist for the top 4 and top 16 bits.
XMOVTT, Loads an Imm32 of type-tags into a 128-bit pointer.
Needed mostly to avoid steep penalties with bounded pointers.
...

*: At present, XLEA.B behaves more like an ADD with bounds adjustment
than a LEA. Currently only does arithmetic on the low 48 bits.
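
A small C model of the stated XLEA.B behavior; the layout here (address
in the low register of the pair, tag/bounds metadata in the high one)
is an assumption for illustration, and the bounds adjustment is elided:

  #include <stdint.h>

  typedef struct { uint64_t lo; uint64_t hi; } xptr128;   /* register pair */

  #define ADDR48_MASK ((1ULL << 48) - 1)

  /* Add the displacement to the low 48 bits only, leaving the upper
     address bits and the tag half untouched (bounds update not shown). */
  xptr128 xlea_b(xptr128 p, int64_t disp)
  {
      uint64_t addr = (p.lo + (uint64_t)disp) & ADDR48_MASK;
      p.lo = (p.lo & ~ADDR48_MASK) | addr;
      return p;
  }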

No real change in operating mode or ISA level behavior (in general), for
the most part nearly everything is happening at the compiler and ABI level.

....

Not sure if this will be all that useful in a larger sense, but it does
at least (potentially) give more of a chance to debug some areas of my
compiler...

It is possible that parts of the newer ABI could be useful as a
higher-performance mode with 64-bit pointers, though.

Say, it enables the use of more registers for call/return, ..., but
without the relative penalty of making all the pointers twice the size,
or all the extra register pressure. Register pressure was the main
reason the ABI was expanded to use all 64 registers; the compiler seems
to have "register pressure" issues when using 128b pointers with 32
GPRs, which in this case behaving as 14 128-bit registers, with only 7
being usable for holding local variables. When using 64 GPRs, it is
possible to hold 14 128-bit values in callee-save registers (so,
register pressure is less of an issue in this case).

....

Re: VVM versus single instruction

<jwvzgn2uvtx.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23299&group=comp.arch#23299
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 07 Feb 2022 18:06:24 -0500
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <jwvzgn2uvtx.fsf-monnier+comp.arch@gnu.org>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="c5af4fff4d67e6d27e2becb666adee8e";
logging-data="1715"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/nvd3ftu8pWysGJ9UlN17h"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:rgJ7doGzKGUH016tRwyhPT9TQe8=
sha1:OVdDUcxJU1rMoa98kn8jYJ/cX44=
 by: Stefan Monnier - Mon, 7 Feb 2022 23:06 UTC

> In annother realizatino, imagine that the ROM (typically
> BOOT ROM) contains a page filled with zeros. ROM is in
> unCacheable space, typically. So, after allocating a new
> page Guest OS could fill it with zeros in a single
> instruction!

I thought boot ROMs (as well as boot Flash) were typically rather slow
memories, so I wouldn't want to use it as a source for zeroing pages.

> Since we KNOW that ROM will never change its stored values, we can
> use the knowledge of this to eliminate coherence traffic on ANY reads
> to ROM! {there are no writes!} even if ROM address is translated
> through a Cacheable access PTE, too.

It might make sense to be able to mark some pages as "not coherent", in
case coherence traffic is a source of performance issues. Such memory
region would then be cached in a naive way and it'd be up to software to
flush the caches if/when necessary to maintain coherence.

I could imagine some HPC codes using it when processes get pinned to
a specific CPU: they could mark all of the memory that is 100% private
as "not coherent".

Stefan

Re: VVM versus single instruction

<stsdc5$gbh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23300&group=comp.arch#23300
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 16:26:42 -0800
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <stsdc5$gbh$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 00:26:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e59659670c90a923675983372e5901d1";
logging-data="16753"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19CvHd6XmjnHYCMdG3q1YnQdFjMgRc4QG4="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:v8AXndMKYbhGy/6yqSoYpP1PTt0=
In-Reply-To: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 8 Feb 2022 00:26 UTC

On 2/7/2022 11:06 AM, MitchAlsup wrote:
snip

> In annother realizatino, imagine that the ROM (typically
> BOOT ROM) contains a page filled with zeros. ROM is in
> unCacheable space, typically. So, after allocating a new
> page Guest OS could fill it with zeros in a single
> instruction! Which happens to traverse the interconnect
> in a single message saving lots of message headers on the
> interconnect and ALL of the coherence traffic!

If filling a page with zeros is important, why would you waste time and
power "loading" 4K bytes of zeros from ROM, when you could just load the
value once and write it as many times as required, albeit using a single
instruction?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM versus single instruction

<3b494441-cbc7-48eb-b406-bafadfd9a933n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23301&group=comp.arch#23301
X-Received: by 2002:ac8:509:: with SMTP id u9mr1537688qtg.530.1644280818459;
Mon, 07 Feb 2022 16:40:18 -0800 (PST)
X-Received: by 2002:a05:6830:233d:: with SMTP id q29mr741291otg.331.1644280818107;
Mon, 07 Feb 2022 16:40:18 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 16:40:17 -0800 (PST)
In-Reply-To: <jwvzgn2uvtx.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:902d:1b92:ecee:b701;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:902d:1b92:ecee:b701
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com> <jwvzgn2uvtx.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3b494441-cbc7-48eb-b406-bafadfd9a933n@googlegroups.com>
Subject: Re: VVM versus single instruction
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 08 Feb 2022 00:40:18 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 43
 by: MitchAlsup - Tue, 8 Feb 2022 00:40 UTC

On Monday, February 7, 2022 at 5:06:27 PM UTC-6, Stefan Monnier wrote:
> > In annother realizatino, imagine that the ROM (typically
> > BOOT ROM) contains a page filled with zeros. ROM is in
> > unCacheable space, typically. So, after allocating a new
> > page Guest OS could fill it with zeros in a single
> > instruction!
<
> I thought boot ROMs (as well as boot Flash) were typically rather slow
> memories, so I wouldn't want to use it as a source for zeroing pages.
<
Imagine a little piece of logic near <whatever> ROM, that when accessed
spits out enough zero-bytes to satisfy the request. This does not have to
be as slow as ROM, nor does the zero-area have to exist in a unit of store.
This can be faster than ANY form of storage.
<
As to ROM being slow--I might be a little out of touch w/ HW implementations.
<
> > Since we KNOW that ROM will never change its stored values, we can
> > use the knowledge of this to eliminate coherence traffic on ANY reads
> > to ROM! {there are no writes!} even if ROM address is translated
> > through a Cacheable access PTE, too.
<
> It might make sense to be able to mark some pages as "not coherent", in
> case coherence traffic is a source of performance issues. Such memory
> region would then be cached in a naive way and it'd be up to software to
> flush the caches if/when necessary to maintain coherence.
<
General PTE based control of this has "run out of encoding bits".
<
I was also going to use Cacheable but inCoherent for my safe Stack:
each Thread would have its own safe stack, you want high performance,
and you know that nobody else can see it. So your use of this memory
is inCoherent, but something like the debugger or swapper would snoop
the cache for the up-to-date data.
>
> I could imagine some HPC codes using it when processes get pinned to
> a specific CPU: they could mark all of the memory that is 100% private
> as "not coherent".
<
Basically any memory known to be used by exactly 1 thread can use the
mechanism.
>
>
> Stefan

Re: VVM versus single instruction

<c5f51205-49e3-4f37-b699-acb1bdbe1385n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23302&group=comp.arch#23302
X-Received: by 2002:ac8:5f50:: with SMTP id y16mr1480279qta.307.1644281116751;
Mon, 07 Feb 2022 16:45:16 -0800 (PST)
X-Received: by 2002:a9d:628e:: with SMTP id x14mr962010otk.38.1644281116448;
Mon, 07 Feb 2022 16:45:16 -0800 (PST)
Path: i2pn2.org!rocksolid2!news.neodome.net!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 16:45:16 -0800 (PST)
In-Reply-To: <stsdc5$gbh$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:902d:1b92:ecee:b701;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:902d:1b92:ecee:b701
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com> <stsdc5$gbh$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c5f51205-49e3-4f37-b699-acb1bdbe1385n@googlegroups.com>
Subject: Re: VVM versus single instruction
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 08 Feb 2022 00:45:16 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 34
 by: MitchAlsup - Tue, 8 Feb 2022 00:45 UTC

On Monday, February 7, 2022 at 6:26:48 PM UTC-6, Stephen Fuld wrote:
> On 2/7/2022 11:06 AM, MitchAlsup wrote:
> snip
> > In annother realizatino, imagine that the ROM (typically
> > BOOT ROM) contains a page filled with zeros. ROM is in
> > unCacheable space, typically. So, after allocating a new
> > page Guest OS could fill it with zeros in a single
> > instruction! Which happens to traverse the interconnect
> > in a single message saving lots of message headers on the
> > interconnect and ALL of the coherence traffic!
<
> If filling a page with zeros is important,
<
Anytime an OS allocates a page for a process it should zero
out the page for protection. Back at CMU, one night about 4AM
I was running a program on the IBM 360/67 which allocated a page,
printed its contents, deallocated it, ad infinitum.
<
After a couple dozen pages of garbage, I ran into BELL Telephone
Billing records. Apparently CMU had sold time on the /67 to BT.
<
> why would you waste time and
> power "loading" 4K bytes of zeros from ROM, when you could just load the
> value once and write it as many times as required, albeit using a single
> instruction?
<
Because the MM activity is sent to the location of ROM and the
data passes directly from ROM to memory via something that
smells like DMA (without passing through the processor).
<
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: VVM versus single instruction

<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23303&group=comp.arch#23303
X-Received: by 2002:a37:a855:: with SMTP id r82mr1386880qke.645.1644281360620;
Mon, 07 Feb 2022 16:49:20 -0800 (PST)
X-Received: by 2002:a05:6808:2003:: with SMTP id q3mr702564oiw.133.1644281360370;
Mon, 07 Feb 2022 16:49:20 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 7 Feb 2022 16:49:20 -0800 (PST)
In-Reply-To: <sts6dr$3jh$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:902d:1b92:ecee:b701;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:902d:1b92:ecee:b701
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com> <sts6dr$3jh$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
Subject: Re: VVM versus single instruction
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 08 Feb 2022 00:49:20 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 179
 by: MitchAlsup - Tue, 8 Feb 2022 00:49 UTC

On Monday, February 7, 2022 at 4:28:14 PM UTC-6, BGB wrote:
> On 2/7/2022 1:34 PM, Quadibloc wrote:
> > On Monday, February 7, 2022 at 12:06:14 PM UTC-7, MitchAlsup wrote:
> >
> >> Since we KNOW that ROM will never change its stored values,
> >> we can use the knowledge of this to eliminate coherence
> >> traffic on ANY reads to ROM!
> >
> > What, existing architectures, which allow one to mark
> > memory used for memory-mapped I/O as not cacheable,
> > don't also allow indicating the other common thing
> > usually found in a computer's memory address space
> > already?
> >
> > Or, perhaps, did they figure that if nobody ever attempts to
> > write to ROM, there won't _be_ any cache coherence traffic,
> > even if it's cached normally and the computer doesn't know
> > it isn't RAM?
> >
> Yeah, if you don't store to ROM, the cache lines don't get marked dirty,
> so no reason to write them back.
<
You cannot do this on RWE = 101 pages because you don't know that
somebody else in the system does not have RWE = 11x permission.
>
> Probably depends a lot on how one implements the cache consistency
> mechanism though.
>
> In my case, it is pretty loose:
> Each cache merely copies the cache-line from the outer-level caches, or
> from DRAM (NINE style, so a copy of a cache line in L1 may or may not
> have another copy in L2);
> Parties are responsible for writing back dirty cache lines in a timely
> manner;
> There is at present no (general) mechanism to invalidate cache lines in
> other caches.
>
> Consistency generally requires explicit intervention:
> Manual cache flushes;
> Use of "volatile" memory access (cache lines are auto self flushing);
> Encoded either via a special address range (physical) or TLB bits;
> Volatile lines auto-flush after a certain number of clock cycles;
> ...
>
>
> The volatile cache lines can maintain some semblance of consistency
> between cores, however them being self-flushing isn't ideal for performance.
>
> Another possibility could be a way to flag memory as "exclusive" which
> causes only a single party to be allowed a copy of a given cache line
> (or, only a single copy if any party has a "dirty" copy, *). This would
> require more complicated signaling between the caches though.
<
Effectively, one has to be able to determine system exclusivity. PTEs
contain too little information to make this determination. However,
when you know some properties of the address space, you may be
able to derive such information.
>
> *: So, say, if an L1 cache tries to write to an "Exclusive" line, it
> first has to signal to the bus that it is doing so, and all other caches
> will flush this cache line, and then the L2 cache puts the cache line
> into an "Acquired Exclusive" state, and the L1 can proceed once it gets
> back a response from the L2.
>
> If an L2 miss happens which effects this cache line, the L2 sends out a
> signal that whatever L1 cache holds this cache line should write it back
> (in which cache it turns back into a normal "dirty" cache line).
>
> Potentially, the L1's would only be allowed to have a fairly limited
> number of dirty cache lines in this mode, with similar "auto-evict after
> N cycles" behavior to the volatile lines (albeit maybe a larger
> timeout), ...
>
>
> Though, for "general purpose" read/write shared memory, this would have
> similar drawbacks to using "volatile". It could be better for
> read-dominant or read-only memory, but then has little advantage over
> the current "default" strategy.
>
> Though, one could argue that a drawback of my current strategy would be
> that it hinders traditional multi-threading by not allowing multiple
> threads to operate at the same time on the same memory without incurring
> memory consistency issues (or using volatile memory and incurring a
> performance hit).
>
> ...
>
>
>
> Oh well, in my case, getting the new (128-bit) C ABI working is being a
> bit more of a challenge on the debugging front than I had originally
> imagined.
>
> Gradually making progress, but it has turned into a pretty massive bug
> storm (partly by shifting a bulk of the code generation in this case
> over to "mostly untested" and "only partially implemented" code paths).
>
> Types of frequent problem cases:
> Code not dealing correctly with 128 bit pointers;
> Problem cases involving R32..R63 not being handled, ...;
> A few bugs were resulting in pointers / etc becoming misaligned;
> Bugs in the register allocator involving 128-bit types;
> ...
>
>
> So, summary of ABI differences from normal 64-bit ABI:
> Uses all 64 GPRs by default;
> Having XGPR and XMOV is required for this ABI;
> Also assumes ALUX (128-bit ALU ops), ...
> Switches to using 16 registers for argument passing;
> This allows up to 8 pointer (or int128/vector) arguments;
> R4..R7, R20..R23, R36..R39, R52..R55
> Pointers and variant types are now 128 bits;
> All pointers exist as register pairs;
> Pointers are generally also now assumed to be tag-refs by default.
> Various register assignments have changed;
> The 'this' argument has been moved from R3 to R19:R18;
> Struct Return: R2 to R3:R2;
> ...
> Alignment for pointers, etc, is now 16 bytes.
>
>
> Limitations and non-changes:
> Keeps prior organization of scratch and callee preserve registers;
> Assumes all image sections, ..., to exist within a single quadrant;
> Assumes all branches to also be within the same quadrant;
> The ABI's are neither link nor data-sharing compatible;
> Many existing ASM blobs are not compatible with the ABI;
> Pointer arithmetic will be slower than with the 64-bit ABI;
> Normal pointers now require a strict 16B alignment;
> ...
>
> Changing the above would have required more significant changes, or in
> some cases handling call/return from function pointers via runtime calls
> or thunks or additional ISA-level support.
>
> Simpler option for now is to assume that one can put the program image
> and all of its associated DLLs within a single 48-bit quadrant (however,
> even as such, function pointers are also still widened to 128 bits...).
>
>
> ISA additions needed for this:
> XLEA.B, Does LEA on a 128-bit pointers, ...
> Adjusts bounds-checks for bounds-checked pointers;
> Currently requires manual scaling for indexing (*).
> MOVTT, Modifies the type-tag value on a register.
> Sub-variants exit for the top 4 and top 16 bits.
> XMOVTT, Loads an Imm32 of type-tags into a 128-bit pointer.
> Needed mostly to avoid steep penalties with bounded pointers.
> ...
>
> *: At present, XLEA.B behaves more like an ADD with bounds adjustment
> than a LEA. Currently only does arithmetic on the low 48 bits.
>
>
> No real change in operating mode or ISA level behavior (in general), for
> the most part nearly everything is happening at the compiler and ABI level.
>
> ...
>
>
>
> Not sure if this will be all that useful in a larger sense, but it does
> at least (potentially) give more of a chance to debug some areas of my
> compiler...
>
> It is possible though that parts of the newer ABI could be useful as a
> higher-performance mode with 64-bit pointers though.
>
> Say, it enables the use of more registers for call/return, ..., but
> without the relative penalty of making all the pointers twice the size,
> or all the extra register pressure. Register pressure was the main
> reason the ABI was expanded to use all 64 registers; the compiler seems
> to have "register pressure" issues when using 128b pointers with 32
> GPRs, which in this case behaving as 14 128-bit registers, with only 7
> being usable for holding local variables. When using 64 GPRs, it is
> possible to hold 14 128-bit values in callee-save registers (so,
> register pressure is less of an issue in this case).
>
> ...

Re: VVM versus single instruction

<stsgsn$2ek$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23304&group=comp.arch#23304
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 17:26:45 -0800
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <stsgsn$2ek$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<stsdc5$gbh$1@dont-email.me>
<c5f51205-49e3-4f37-b699-acb1bdbe1385n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 01:26:47 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e59659670c90a923675983372e5901d1";
logging-data="2516"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+DpywgY1npyJQ6EJloynMPDUt1GAxFX6M="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:ssTd35H/Qtj0nt5sl49tw+q53Js=
In-Reply-To: <c5f51205-49e3-4f37-b699-acb1bdbe1385n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 8 Feb 2022 01:26 UTC

On 2/7/2022 4:45 PM, MitchAlsup wrote:
> On Monday, February 7, 2022 at 6:26:48 PM UTC-6, Stephen Fuld wrote:
>> On 2/7/2022 11:06 AM, MitchAlsup wrote:
>> snip
>>> In annother realizatino, imagine that the ROM (typically
>>> BOOT ROM) contains a page filled with zeros. ROM is in
>>> unCacheable space, typically. So, after allocating a new
>>> page Guest OS could fill it with zeros in a single
>>> instruction! Which happens to traverse the interconnect
>>> in a single message saving lots of message headers on the
>>> interconnect and ALL of the coherence traffic!
> <
>> If filling a page with zeros is important,
> <
> Anytime an OS allocates a page for a process it should zero
> out the page for protection. Back at CMU, one night about 4AM
> I was running a program on the IBM 360/67 which allocated a page,
> printed its contents, deallocated it, ad infinitum.
> <
> after a couple dozen pages of garbage, I ran into BELL Telephone
> Billing records. Apparently CMU has sold time on the /67 to BT.

You Bad boy!!!!!

One of the things I liked about Exec 8 is that, by default, it wrote
zeros to all memory before loading the program or expanding its memory
footprint. Although it did do this with a loop of double-store
instructions. :-( But this was in the early 1970s.

> <
> < why would you waste time and
>> power "loading" 4K bytes of zeros from ROM, when you could just load the
>> value once and write it as many times as required, albeit using a single
>> instruction?
> <
> Because the MM activity is sent to the location of ROM and the
> data passes directly from ROM to memory via something that
> smells like DMA (without passing through the processor.)

But then why not add an optional bit to the DMA that says essentially
"Don't increment the source address; just reuse the first value you
loaded"? That still uses fewer resources than loading redundant zeros.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM versus single instruction

<stsgt9$2g3$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23305&group=comp.arch#23305
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 19:27:02 -0600
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <stsgt9$2g3$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<stsdc5$gbh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 01:27:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="092b2592cbc62eb6de5ddde8185c49c1";
logging-data="2563"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX195aONQVioI7caP2hxnbnEu"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:QUEkBO/MwNwzoDVny5p/d5yw1fY=
In-Reply-To: <stsdc5$gbh$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 8 Feb 2022 01:27 UTC

On 2/7/2022 6:26 PM, Stephen Fuld wrote:
> On 2/7/2022 11:06 AM, MitchAlsup wrote:
> snip
>
>> In annother realizatino, imagine that the ROM (typically
>> BOOT ROM) contains a page filled with zeros. ROM is in
>> unCacheable space, typically. So, after allocating a new
>> page Guest OS could fill it with zeros in a single
>> instruction! Which happens to traverse the interconnect
>> in a single message saving lots of message headers on the
>> interconnect and ALL of the coherence traffic!
>
> If filling a page with zeros is important, why would you waste time and
> power "loading" 4K bytes of zeros from ROM, when you could just load the
> value once and write it as many times as required, albeit using a single
> instruction?
>

In my case, I have a page full of zeroes built into the ring-bus
interface for the Boot ROM.

There are also pages of NOP, RTS instructions, and BREAK instructions.
Each of these pages is 64K. These cases have different uses.

I could potentially also add pages for the RISC-V equivalents of these
instructions, if RISC-V mode sees more use (but, apart from being more
popular, there is little "good" reason to use RISC-V mode, and I am left
to suspect a core specialized for running RV32IM or similar might be a
better fit for "general use" than one crudely running RV64I on top of
the BJX2 core, *1).

Unlike actual pages, or storing them in ROM, these don't really take up
any storage space. It is more like "well, a request fell into the
address range for the page full of zeroes". Which sub-range it falls
into effectively selects the fill pattern to use.

Accessing these pages should actually be faster than accessing DRAM or
similar, as they will have roughly similar latency to accessing the L2
cache, and will never experience an L2 miss.
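
A toy model of that bus-side decode (base address and fill patterns
here are placeholders, not the actual BJX2 memory map or encodings):

  #include <stdint.h>

  #define CONST_BASE  0xFFFF0000u   /* assumed base of the constant pages */
  #define PAGE_ZERO   0x0           /* 64K reading as zeroes              */
  #define PAGE_NOP    0x1           /* 64K reading as NOP encodings       */
  #define PAGE_BREAK  0x2           /* 64K reading as BREAK encodings     */

  /* The "pages" occupy no storage: the response is synthesized from
     whichever sub-range the request falls into. */
  uint32_t const_page_read(uint32_t addr)
  {
      switch ((addr - CONST_BASE) >> 16) {
      case PAGE_ZERO:  return 0x00000000u;
      case PAGE_NOP:   return 0x30003000u;   /* placeholder NOP pattern   */
      case PAGE_BREAK: return 0xC000C000u;   /* placeholder BREAK pattern */
      default:         return 0;             /* not a constant page       */
      }
  }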

*1: Say, a core running RV32IM or similar with a 2-wide superscalar
pipeline, probably with a 4R+2W register file. Could then consider a few
(modest) ISA extensions to allow for 64-bit register-pair operations to
allow the core to interact with MMIO devices on the ringbus and similar.

Though, a full 'G' profile still looks like a stretch (supporting a
RISC-V 'G' profile, even for a 32-bit core, would not be likely to be
particularly cheap in terms of LUT budget).

....

Re: VVM versus single instruction

<jwvpmnyyu01.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23306&group=comp.arch#23306
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 07 Feb 2022 21:26:18 -0500
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <jwvpmnyyu01.fsf-monnier+comp.arch@gnu.org>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<jwvzgn2uvtx.fsf-monnier+comp.arch@gnu.org>
<3b494441-cbc7-48eb-b406-bafadfd9a933n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="827d9f2ae16f3600cb3c85aec17aeb74";
logging-data="10178"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+K00j9yJcQYJX2hBX2E3OP"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:N1CGtcyKnv+qkEvAmKm0BOqeAdg=
sha1:HnsODzbxLzEsi2BlBOHVMbq6QIc=
 by: Stefan Monnier - Tue, 8 Feb 2022 02:26 UTC

>> I could imagine some HPC codes using it when processes get pinned to
>> a specific CPU: they could mark all of the memory that is 100% private
>> as "not coherent".
> Basically any memory known to be used by exactly 1 thread can use the
> mechanism.

At the cost of explicit extra flushes when the thread is moved from one
CPU to another, tho.

Stefan

Re: VVM versus single instruction

<stsn0d$2hf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23307&group=comp.arch#23307
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 21:11:06 -0600
Organization: A noiseless patient Spider
Lines: 236
Message-ID: <stsn0d$2hf$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 03:11:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="092b2592cbc62eb6de5ddde8185c49c1";
logging-data="2607"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+m9AHwvM3oIHES/9CIakni"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:QXxMNMFQqjHFNjNhy4cpQODCb1Q=
In-Reply-To: <e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 8 Feb 2022 03:11 UTC

On 2/7/2022 6:49 PM, MitchAlsup wrote:
> On Monday, February 7, 2022 at 4:28:14 PM UTC-6, BGB wrote:
>> On 2/7/2022 1:34 PM, Quadibloc wrote:
>>> On Monday, February 7, 2022 at 12:06:14 PM UTC-7, MitchAlsup wrote:
>>>
>>>> Since we KNOW that ROM will never change its stored values,
>>>> we can use the knowledge of this to eliminate coherence
>>>> traffic on ANY reads to ROM!
>>>
>>> What, existing architectures, which allow one to mark
>>> memory used for memory-mapped I/O as not cacheable,
>>> don't also allow indicating the other common thing
>>> usually found in a computer's memory address space
>>> already?
>>>
>>> Or, perhaps, did they figure that if nobody ever attempts to
>>> write to ROM, there won't _be_ any cache coherence traffic,
>>> even if it's cached normally and the computer doesn't know
>>> it isn't RAM?
>>>
>> Yeah, if you don't store to ROM, the cache lines don't get marked dirty,
>> so no reason to write them back.
> <
> You cannot do this on RWE = 101 pages because you don't know
> somebody else in the system does not have RWE = 11x permission.

Only if one assumes that writes to executable memory will be seen by the
instruction cache without requiring a manual cache flush.

Both BJX2 and ARM fall into the "unless you manually flush the caches,
you may end up executing stale data" camp...

Granted, this is unlike x86, where one can modify an instruction just
before it is executed and it uses the up-to-date value (I suspect it is
likely that x86 might also need to invoke a pipeline flush or similar to
be able to deal with stuff like this).
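
For example, on the "flush it yourself" architectures a JIT or loader
has to do something like the following after writing instructions
(__builtin___clear_cache is the real GCC/Clang builtin for this):

  #include <string.h>

  /* Copy freshly generated code into an executable buffer, then make
     sure the instruction cache cannot execute stale bytes. */
  void install_code(void *exec_buf, const void *code, size_t len)
  {
      memcpy(exec_buf, code, len);
      __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
  }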

>>
>> Probably depends a lot on how one implements the cache consistency
>> mechanism though.
>>
>> In my case, it is pretty loose:
>> Each cache merely copies the cache-line from the outer-level caches, or
>> from DRAM (NINE style, so a copy of a cache line in L1 may or may not
>> have another copy in L2);
>> Parties are responsible for writing back dirty cache lines in a timely
>> manner;
>> There is at present no (general) mechanism to invalidate cache lines in
>> other caches.
>>
>> Consistency generally requires explicit intervention:
>> Manual cache flushes;
>> Use of "volatile" memory access (cache lines are auto self flushing);
>> Encoded either via a special address range (physical) or TLB bits;
>> Volatile lines auto-flush after a certain number of clock cycles;
>> ...
>>
>>
>> The volatile cache lines can maintain some semblance of consistency
>> between cores, however them being self-flushing isn't ideal for performance.
>>
>> Another possibility could be a way to flag memory as "exclusive" which
>> causes only a single party to be allowed a copy of a given cache line
>> (or, only a single copy if any party has a "dirty" copy, *). This would
>> require more complicated signaling between the caches though.
> <
> Effectively, one has to be able to determine system exclusivity. PTEs
> contain too little information do make this determination. However,
> when you know some properties of the address space, you may be
> able to derive such information.

The exclusive flag would not guarantee exclusivity; it would tell the
memory subsystem whether or not to "actually bother" with trying to
maintain exclusivity.

This would probably only be for data caches though, since self-modifying
code is infrequent enough that it is effective to require that any code
which may cause self-modifying behavior also be responsible for
initiating a cache flush.

So, it allows selecting between a "go fast and don't bother with memory
consistency" option, and a "go a little slower but keep memory
consistent via exclusivity checks" option (contrast with volatile as a
"go slowly but keep memory consistent via immediate auto-flush").

As-is, there is an "epoch" mechanism, but it is too slow to be useful
for traditional multithreading (one wants stores to propagate
immediately, not "probably sometime within the next 10k-20k clock cycles
or so").

But, alas, "better" would require some sort of "actual" cache coherency
protocol...

The current model does sort of work though, with a few restrictions:
Multi-core is mostly for multiple processes;
Each logical thread in the same parent process executes in the same core
(cooperative or preemptive threading without SMP or "true" multithreading);
One approaches multi-threading in a more Erlang-like style (threads do
not directly share mutable memory, but instead work by passing messages
or data objects via copying them via a "mailbox" or similar, with only
immutable data being directly sharable);
....

In the latter scenario, one would use the memory in volatile mode mostly
to implement the mailboxes.
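
A minimal sketch of that mailbox style, assuming the mailbox lives in a
region mapped with the "volatile" (self-flushing) attribute so stores
become visible to the other core (sizes and fields are placeholders):

  #include <stdint.h>
  #include <string.h>

  struct mailbox {
      volatile uint32_t ready;      /* 0 = empty, 1 = message present */
      uint32_t          len;
      uint8_t           data[240];  /* message is copied, not shared  */
  };

  /* Sender: copy the message into the mailbox, then flag it. */
  void mbox_send(struct mailbox *mb, const void *msg, uint32_t len)
  {
      while (mb->ready)
          ;                         /* wait for the receiver to drain */
      memcpy(mb->data, msg, len);
      mb->len   = len;
      mb->ready = 1;
  }

  /* Receiver: poll, copy the message out, release the slot. */
  uint32_t mbox_recv(struct mailbox *mb, void *out)
  {
      while (!mb->ready)
          ;
      uint32_t len = mb->len;
      memcpy(out, mb->data, len);
      mb->ready = 0;
      return len;
  }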

>>
>> *: So, say, if an L1 cache tries to write to an "Exclusive" line, it
>> first has to signal to the bus that it is doing so, and all other caches
>> will flush this cache line, and then the L2 cache puts the cache line
>> into an "Acquired Exclusive" state, and the L1 can proceed once it gets
>> back a response from the L2.
>>
>> If an L2 miss happens which effects this cache line, the L2 sends out a
>> signal that whatever L1 cache holds this cache line should write it back
>> (in which cache it turns back into a normal "dirty" cache line).
>>
>> Potentially, the L1's would only be allowed to have a fairly limited
>> number of dirty cache lines in this mode, with similar "auto-evict after
>> N cycles" behavior to the volatile lines (albeit maybe a larger
>> timeout), ...
>>
>>
>> Though, for "general purpose" read/write shared memory, this would have
>> similar drawbacks to using "volatile". It could be better for
>> read-dominant or read-only memory, but then has little advantage over
>> the current "default" strategy.
>>
>> Though, one could argue that a drawback of my current strategy would be
>> that it hinders traditional multi-threading by not allowing multiple
>> threads to operate at the same time on the same memory without incurring
>> memory consistency issues (or using volatile memory and incurring a
>> performance hit).
>>
>> ...
>>
>>
>>
>> Oh well, in my case, getting the new (128-bit) C ABI working is being a
>> bit more of a challenge on the debugging front than I had originally
>> imagined.
>>
>> Gradually making progress, but it has turned into a pretty massive bug
>> storm (partly by shifting a bulk of the code generation in this case
>> over to "mostly untested" and "only partially implemented" code paths).
>>
>> Types of frequent problem cases:
>> Code not dealing correctly with 128 bit pointers;
>> Problem cases involving R32..R63 not being handled, ...;
>> A few bugs were resulting in pointers / etc becoming misaligned;
>> Bugs in the register allocator involving 128-bit types;
>> ...
>>
>>
>> So, summary of ABI differences from normal 64-bit ABI:
>> Uses all 64 GPRs by default;
>> Having XGPR and XMOV is required for this ABI;
>> Also assumes ALUX (128-bit ALU ops), ...
>> Switches to using 16 registers for argument passing;
>> This allows up to 8 pointer (or int128/vector) arguments;
>> R4..R7, R20..R23, R36..R39, R52..R55
>> Pointers and variant types are now 128 bits;
>> All pointers exist as register pairs;
>> Pointers are generally also now assumed to be tag-refs by default.
>> Various register assignments have changed;
>> The 'this' argument has been moved from R3 to R19:R18;
>> Struct Return: R2 to R3:R2;
>> ...
>> Alignment for pointers, etc, is now 16 bytes.
>>
>>
>> Limitations and non-changes:
>> Keeps prior organization of scratch and callee preserve registers;
>> Assumes all image sections, ..., to exist within a single quadrant;
>> Assumes all branches to also be within the same quadrant;
>> The ABI's are neither link nor data-sharing compatible;
>> Many existing ASM blobs are not compatible with the ABI;
>> Pointer arithmetic will be slower than with the 64-bit ABI;
>> Normal pointers now require a strict 16B alignment;
>> ...
>>
>> Changing the above would have required more significant changes, or in
>> some cases handling call/return from function pointers via runtime calls
>> or thunks or additional ISA-level support.
>>
>> Simpler option for now is to assume that one can put the program image
>> and all of its associated DLLs within a single 48-bit quadrant (however,
>> even as such, function pointers are also still widened to 128 bits...).
>>
>>
>> ISA additions needed for this:
>> XLEA.B, Does LEA on a 128-bit pointers, ...
>> Adjusts bounds-checks for bounds-checked pointers;
>> Currently requires manual scaling for indexing (*).
>> MOVTT, Modifies the type-tag value on a register.
>> Sub-variants exit for the top 4 and top 16 bits.
>> XMOVTT, Loads an Imm32 of type-tags into a 128-bit pointer.
>> Needed mostly to avoid steep penalties with bounded pointers.
>> ...
>>
>> *: At present, XLEA.B behaves more like an ADD with bounds adjustment
>> than a LEA. Currently only does arithmetic on the low 48 bits.
>>
>>
>> No real change in operating mode or ISA level behavior (in general), for
>> the most part nearly everything is happening at the compiler and ABI level.
>>
>> ...
>>
>>
>>
>> Not sure if this will be all that useful in a larger sense, but it does
>> at least (potentially) give more of a chance to debug some areas of my
>> compiler...
>>
>> It is possible that parts of the newer ABI could be useful as a
>> higher-performance mode with 64-bit pointers though.
>>
>> Say, it enables the use of more registers for call/return, ..., but
>> without the relative penalty of making all the pointers twice the size,
>> or all the extra register pressure. Register pressure was the main
>> reason the ABI was expanded to use all 64 registers; the compiler seems
>> to have "register pressure" issues when using 128b pointers with 32
>> GPRs, which in this case behave as 14 128-bit registers, with only 7
>> being usable for holding local variables. When using 64 GPRs, it is
>> possible to hold 14 128-bit values in callee-save registers (so,
>> register pressure is less of an issue in this case).
>>
>> ...


Re: VVM versus single instruction

<jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23308&group=comp.arch#23308

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 07 Feb 2022 22:20:46 -0500
Organization: A noiseless patient Spider
Lines: 13
Message-ID: <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
<stsn0d$2hf$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="827d9f2ae16f3600cb3c85aec17aeb74";
logging-data="31084"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/zAX9n+FbutK4gMw1UdHwq"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:QozVsRn7SwACZrsPtQiCuuiaiMo=
sha1:otyBXmnEI09icW579zB/zKIi30g=
 by: Stefan Monnier - Tue, 8 Feb 2022 03:20 UTC

> One approaches multi-threading in a more Erlang-like style (threads do not
> directly share mutable memory, but instead work by passing messages or data
> objects via copying them via a "mailbox" or similar, with only immutable
> data being directly sharable);

Note that Erlang's data is immutable at the abstraction level of
the language. The GC will recover unused memory and recycle it for
other uses, resulting in that "immutable" memory being modified.
So unless your GC is careful to flush that "immutable" data from other
CPU's caches, it still needs cache coherence.

Stefan

Re: VVM versus single instruction

<stsvph$9mg$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23309&group=comp.arch#23309

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Mon, 7 Feb 2022 23:41:03 -0600
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <stsvph$9mg$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
<stsn0d$2hf$1@dont-email.me> <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 05:41:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="092b2592cbc62eb6de5ddde8185c49c1";
logging-data="9936"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Sg1izCHGRaoFvXkbqupmx"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:WElSUwXF3Z+7Pc/H0IxXOomvMT0=
In-Reply-To: <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Tue, 8 Feb 2022 05:41 UTC

On 2/7/2022 9:20 PM, Stefan Monnier wrote:
>> One approaches multi-threading in a more Erlang-like style (threads do not
>> directly share mutable memory, but instead work by passing messages or data
>> objects via copying them via a "mailbox" or similar, with only immutable
>> data being directly sharable);
>
> Note that Erlang's data is immutable at the abstraction level of
> the language. The GC will recover unused memory and recycle it for
> other uses, resulting in that "immutable" memory being modified.
> So unless your GC is careful to flush that "immutable" data from other
> CPU's caches, it still needs cache coherence.
>

This is not as much of an issue if the program and GC run on the same
core, and every core/thread also has its own heap.

If most of the shared mutable state can be eliminated (including things
like shared memory heaps) then the need for cache coherency can be
greatly reduced.

FWIW, similar issues can come up when dealing with distributed shared
memory systems (typically running over a network), where the latency is
high enough that the cost of maintaining strict coherence would be
prohibitive, and an "eventual consistency" model tends to be used instead.

People can write code that works in such a model, but it does require
some care so that it does not blow up in their face.

....

Re: VVM versus single instruction

<jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23310&group=comp.arch#23310

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Tue, 08 Feb 2022 08:21:05 -0500
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
<stsn0d$2hf$1@dont-email.me>
<jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
<stsvph$9mg$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="827d9f2ae16f3600cb3c85aec17aeb74";
logging-data="8936"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/IG1TDD1JMk2SJAlk7PAf7"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:V6MOywgrH+wvpxnL/hiLC4f7Evo=
sha1:xVamcdOnamRZRxJzvd6MtMvdOPg=
 by: Stefan Monnier - Tue, 8 Feb 2022 13:21 UTC

BGB [2022-02-07 23:41:03] wrote:
> On 2/7/2022 9:20 PM, Stefan Monnier wrote:
>>> One approaches multi-threading in a more Erlang-like style (threads do not
>>> directly share mutable memory, but instead work by passing messages or data
>>> objects via copying them via a "mailbox" or similar, with only immutable
>>> data being directly sharable);
>> Note that Erlang's data is immutable at the abstraction level of
>> the language. The GC will recover unused memory and recycle it for
>> other uses, resulting in that "immutable" memory being modified.
>> So unless your GC is careful to flush that "immutable" data from other
>> CPU's caches, it still needs cache coherence.
> This is not as much of an issue if the program and GC run on the same core,
> and every core/thread also has its own heap.

I was thinking of the situation where thread A on CPU A sends
a reference to an immutable object to thread B on CPU B.
Then later on that same thread A sends "another" reference to some new
immutable object to same thread B still on CPU B.
Both of those objects may actually reside at the same place in memory
(because the first object got GC'd in the mean time), so in order to
make sure CPU B doesn't use outdated info from its cache, you need some
coherence work.

If the messages actually send the data (as opposed to sending
just references), then the heaps can be 100% private, but at that point
you don't need the objects to be immutable either.

Stefan

Re: VVM versus single instruction

<stucuq$gpg$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23311&group=comp.arch#23311

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Tue, 8 Feb 2022 12:31:51 -0600
Organization: A noiseless patient Spider
Lines: 189
Message-ID: <stucuq$gpg$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
<stsn0d$2hf$1@dont-email.me> <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
<stsvph$9mg$1@dont-email.me> <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 8 Feb 2022 18:31:54 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="092b2592cbc62eb6de5ddde8185c49c1";
logging-data="17200"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18M5FRf0QOTnuGwXwNERQ7B"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:oqYRjPpUowDsUHJJTRjnqI5o8n8=
In-Reply-To: <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Tue, 8 Feb 2022 18:31 UTC

On 2/8/2022 7:21 AM, Stefan Monnier wrote:
> BGB [2022-02-07 23:41:03] wrote:
>> On 2/7/2022 9:20 PM, Stefan Monnier wrote:
>>>> One approaches multi-threading in a more Erlang-like style (threads do not
>>>> directly share mutable memory, but instead work by passing messages or data
>>>> objects via copying them via a "mailbox" or similar, with only immutable
>>>> data being directly sharable);
>>> Note that Erlang's data is immutable at the abstraction level of
>>> the language. The GC will recover unused memory and recycle it for
>>> other uses, resulting in that "immutable" memory being modified.
>>> So unless your GC is careful to flush that "immutable" data from other
>>> CPU's caches, it still needs cache coherence.
>> This is not as much of an issue if the program and GC run on the same core,
>> and every core/thread also has its own heap.
>
> I was thinking of the situation where thread A on CPU A sends
> a reference to an immutable object to thread B on CPU B.
> Then later on that same thread A sends "another" reference to some new
> immutable object to same thread B still on CPU B.
> Both of those objects may actually reside at the same place in memory
> (because the first object got GC'd in the mean time), so in order to
> make sure CPU B doesn't use outdated info from its cache, you need some
> coherence work.
>
> If the messages actually send the data (as opposed to sending
> just references), then the heaps can be 100% private, but at that point
> you don't need the objects to be immutable either.
>

Yeah, I was thinking of a "pass data by copy" system, with a disjoint
heap in each thread, as opposed to "pass by reference" with a heap
shared across multiple threads running on multiple processors.

For pass by copy, only the mailbox memory would need to be kept
consistent. Messages would likely be passed in the form of binary
serialized blobs.
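
Rough sketch of the sort of thing I mean (a single-producer /
single-consumer mailbox; the names and layout here are made up for
illustration, not actual code from my projects). The point is that
only 'buf' needs any inter-core consistency; the decoded objects live
entirely in each thread's private heap:

  #define MBOX_SIZE 4096   /* power of two; size is arbitrary here */

  typedef struct {
      volatile unsigned head;        /* advanced only by the sender   */
      volatile unsigned tail;        /* advanced only by the receiver */
      unsigned char buf[MBOX_SIZE];  /* the only shared memory        */
  } mbox_t;

  /* Sender: copy a serialized blob into the mailbox. */
  int mbox_send(mbox_t *mb, const void *msg, unsigned len)
  {
      unsigned i, h = mb->head;
      if (((h - mb->tail) + len + 4) > MBOX_SIZE)
          return 0;                               /* not enough room */
      for (i = 0; i < 4; i++)                     /* length prefix   */
          mb->buf[(h++) % MBOX_SIZE] = (len >> (i * 8)) & 0xFF;
      for (i = 0; i < len; i++)                   /* the blob itself */
          mb->buf[(h++) % MBOX_SIZE] = ((const unsigned char *)msg)[i];
      mb->head = h;                               /* publish         */
      return 1;
  }

  /* Receiver: copy the next blob out into its own private heap. */
  int mbox_recv(mbox_t *mb, void *dst, unsigned max)
  {
      unsigned i, len = 0, t = mb->tail;
      if (mb->head == t)
          return -1;                              /* empty */
      for (i = 0; i < 4; i++)
          len |= (unsigned)mb->buf[(t++) % MBOX_SIZE] << (i * 8);
      if (len > max)
          return -2;                              /* message left in place */
      for (i = 0; i < len; i++)
          ((unsigned char *)dst)[i] = mb->buf[(t++) % MBOX_SIZE];
      mb->tail = t;                               /* consume */
      return (int)len;
  }

On a normal coherent multicore, 'volatile' alone would not be enough
(one would still want barriers around the head/tail updates); here I
am assuming the mailbox lives in the sort of uncached / auto-flushed
memory described earlier.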

For example, for a 3D engine I had been working on recently (before
getting distracted with the 128-bit ABI thing), I had been using this:
https://github.com/cr88192/bgbtech_bt3mini/blob/master/docs/2021-12-28_ABXE1.txt

Or, basically a binary serialized XML format I am calling ABXE.

Though, not for inter-thread passing, but for storing things like the
entity graph, where entities are converted to XML and then the XML is
stored in a serialized form (a lot of the code for the XML nodes was
copy/pasted from BGBCC, effectively it is using a repurposed version of
the compiler's AST nodes).

Granted, arguably, it could be more efficient to skip the XML step and
serialize the entities using a format more like that used for the
network protocol in the Quake games (tag bytes, combined with
bitmap-bytes and fields encoded according to the bitmap bits), eg:
  TAG_ENTITY              //encoding an entity
    TAG_ENT_BASE_ATTRIB   //base entity attributes
      0x03                //Say, ORIGIN|YAW
      U24 org_x;          //16.8 fixed
      U24 org_y;          //16.8 fixed
      U16 org_z;          //8.8 fixed
      U8  yaw;            //yaw angle
    TAG_ENT_INVEN         //inventory slot
      ...
    ...
  TAG_END                 //end of entity
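
Roughly what an encoder for this sort of format looks like (a sketch;
the tag values, struct, and fixed-point scales here are made up for
illustration):

  #define TAG_ENTITY           0x10    /* hypothetical tag values */
  #define TAG_ENT_BASE_ATTRIB  0x11
  #define TAG_END              0x00

  #define ATTR_ORIGIN          0x01    /* bits in the bitmap byte */
  #define ATTR_YAW             0x02

  typedef struct { float org_x, org_y, org_z, yaw; } entity_t;

  static unsigned char *emit_u24(unsigned char *p, int v)
  {
      *p++ = v; *p++ = v >> 8; *p++ = v >> 16;
      return p;
  }

  /* Emit: entity tag, attribute-group tag, bitmap byte, then only the
     fields whose bits are set in the bitmap. */
  unsigned char *encode_entity(unsigned char *p, const entity_t *e)
  {
      int z;
      *p++ = TAG_ENTITY;
      *p++ = TAG_ENT_BASE_ATTRIB;
      *p++ = ATTR_ORIGIN | ATTR_YAW;              /* 0x03: ORIGIN|YAW */
      p = emit_u24(p, (int)(e->org_x * 256));     /* 16.8 fixed       */
      p = emit_u24(p, (int)(e->org_y * 256));     /* 16.8 fixed       */
      z = (int)(e->org_z * 256);                  /* 8.8 fixed        */
      *p++ = z; *p++ = z >> 8;
      *p++ = (unsigned char)(e->yaw * 256 / 360); /* yaw as 0..255    */
      *p++ = TAG_END;
      return p;
  }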

However, this sort of format tends to be more brittle, since if one adds
new features it tends to break any existing decoders.

Something like binary serialized XML is a little more flexible in that
it allows new/unrecognized features to be ignored.

Granted, a properly designed TLV format could also allow for extension
without breaking stuff, but this is a tradeoff in terms of storage space
(using FOURCC tags for everything would bulk everything up pretty bad;
and formats like ASN.1 BER tend to be barely any better than the Quake
style approach in this respect).

One could also use a textual serialization, such as ASCII-serialized
S-Expressions, but, due to things like the computational cost of parsing
tokens and converting strings to numbers, etc, this would likely be less
efficient than using something like my ABXE format (where the nodes can
be readily recycled for new messages and events; or potentially a more
specialized decoder could be implemented which skips over the use of an
intermediate node graph).

Granted, one can also serialize things using a stack-machine bytecode
format rather than a flattened tree structure, but this is its own set
of tradeoffs.

Though, not implemented yet, I had imagined a network protocol for this
3D engine which would likely be a mix of TLV for the outer layers, with
a combination of ABXE and more specialized binary serializations for
other things (such as chunk updates).

On the Windows port, it is multi-threaded, albeit with a more
traditional "shared memory and mutexes" style, and considerable
debugging annoyance because directly sharing memory across multiple
threads in this case results in a whole lot of areas where
corruption can occur (I spent a fair bit of time hunting down places
where thread related issues were mangling the terrain).

Usually, when I divide stuff up into threads, I tend to prefer to divide
things up in terms of functional area or units of work. But, in this
case, multiple parties have ended up interacting with the terrain
system, which was a source of issues.

Main reason for using threading in this case was mostly because when I
switched it over to rendering via the GPU, using raycasts to build a
list of visible blocks and similar became the primary bottleneck in the
engine (so, I made an initial crude attempt to split up the player
rendering and raycast into separate threads).

Ideally, raycast and game tick could also be split up, but doing so
would effectively involve similar changes to what would be needed for
networking (and likely also result in effectively having two copies of
the terrain in memory; one copy on the server, and one copy on the
client; though with the client-side copy being effectively volatile).

....

This engine was started as a fork of the "BtMini" engine I had written
as an attempt to run a simplistic Minecraft style engine on BJX2. In
this case, the use of raycast can save a lot of memory, and works well
at short distances, but with the drawback that its computational cost
goes up very sharply as draw distance increases (roughly with the
volume of the view sphere, 4/3*PI*r^3).
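
To put rough numbers on it (assuming the cost scales with the number of
blocks inside the draw-distance sphere, ~4/3*PI*r^3): a draw distance
of 16 blocks is roughly 17k candidate blocks, 32 is ~137k, and 64 is
~1.1M, so each doubling of the draw distance is roughly 8x the work.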

There is a memory component to draw distance as well, but it is mostly
in terms of having an array big enough to hold the list of visible
blocks, and vertex arrays big enough to hold all of the block faces, ...

Arguably makes less sense on a desktop PC where it ends up being more
CPU bound than memory bound.

Unclear whether working on this 3D engine, or trying to get the new
128-bit C ABI working in BJX2, is a better use of my time...

Also in testing realized a new annoyance in the new ABI:
All the pointers are now tag-refs;
I am using MSB tagging and bounds-checked pointers;
MSB bounds-checking breaks expected behavior of relative compare.

In effect, the bounds checking can break the use of bare int128
compares on pointers.

Pointer comparisons then end up needing to be something like:
XMOV ptr1, R4
XMOV ptr2, R6
XMOVTT 0, R4 //Only has "Imm32, Rn" encoding
XMOVTT 0, R6
CMPXGT R6, R4

This kinda sucks pretty bad...
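
In C-ish terms, what the compare has to amount to is something like
this (a sketch; the exact tag layout is simplified / assumed here):

  #include <stdint.h>

  /* Simplified model of a 128-bit tagged, bounds-checked pointer;
     type-tag and bounds metadata live in the high bits of 'hi'.  */
  typedef struct { uint64_t lo, hi; } xptr_t;

  /* Compare only the address parts; the tag/bounds bits have to be
     masked off first (which is what the XMOVTT 0, Rn pair above is
     doing before the CMPXGT).                                      */
  int xptr_gt(xptr_t a, xptr_t b)
  {
      uint64_t ahi = a.hi & 0x0000FFFFFFFFFFFFull;  /* strip top 16 bits */
      uint64_t bhi = b.hi & 0x0000FFFFFFFFFFFFull;
      if (ahi != bhi)
          return ahi > bhi;
      return a.lo > b.lo;
  }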

Though, could consider adding an "XMOVZT" instruction:
XMOVZT Xm, Xn
Which copies the 128-bit value from Xm to Xn while also zeroing all the
tag bits, which can at least shave a few instructions off the pointer
compare, and 4 DWORDs off the encoding. Still not great, but cheaper on
the FPGA side of things than adding compare instructions which ignore
the high 16 bits.

Internally, the XMOVZT encoding would likely be decoded as:
MOVTT Rm+1, ZZR, Rn+1 | MOVTT Rm+0, ZZR, Rn+0
So, would reuse an existing mechanism at least.

Though, relatedly, may as well also add a MOVZT, which MOVs a 64-bit
value while zeroing the high 16 bits (this task also comes up
semi-often, and also uses the same basic mechanism as the prior
instructions).

Granted, arguably, I wouldn't need new instructions for these if the ISA
had an exposed zero register, but alas...

....

Re: VVM versus single instruction

<ltSMJ.10203$jwf9.320@fx24.iad>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23321&group=comp.arch#23321

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx24.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com> <5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com> <sts6dr$3jh$1@dont-email.me> <e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com> <stsn0d$2hf$1@dont-email.me> <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org> <stsvph$9mg$1@dont-email.me> <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
In-Reply-To: <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 60
Message-ID: <ltSMJ.10203$jwf9.320@fx24.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 09 Feb 2022 17:03:13 UTC
Date: Wed, 09 Feb 2022 12:03:06 -0500
X-Received-Bytes: 3949
 by: EricP - Wed, 9 Feb 2022 17:03 UTC

Stefan Monnier wrote:
> BGB [2022-02-07 23:41:03] wrote:
>> On 2/7/2022 9:20 PM, Stefan Monnier wrote:
>>>> One approaches multi-threading in a more Erlang-like style (threads do not
>>>> directly share mutable memory, but instead work by passing messages or data
>>>> objects via copying them via a "mailbox" or similar, with only immutable
>>>> data being directly sharable);
>>> Note that Erlang's data is immutable at the abstraction level of
>>> the language. The GC will recover unused memory and recycle it for
>>> other uses, resulting in that "immutable" memory being modified.
>>> So unless your GC is careful to flush that "immutable" data from other
>>> CPU's caches, it still needs cache coherence.
>> This is not as much of an issue if the program and GC run on the same core,
>> and every core/thread also has its own heap.
>
> I was thinking of the situation where thread A on CPU A sends
> a reference to an immutable object to thread B on CPU B.
> Then later on that same thread A sends "another" reference to some new
> immutable object to same thread B still on CPU B.
> Both of those objects may actually reside at the same place in memory
> (because the first object got GC'd in the mean time), so in order to
> make sure CPU B doesn't use outdated info from its cache, you need some
> coherence work.

Yes, this is a coherence protocol, just implemented in software not HW.
This one approximates a write invalidate protocol but implemented in
software over a network stack.

To ensure that there is one and only one valid object value at a time,
i.e. to ensure that the protocol is multi-copy atomic and that
updates are seen to propagate to all cores at the same time,
it needs to interlock on invalidate messages waiting for reply ACK's,
while processing other cores' invalidate requests so it does not deadlock,
with appropriate flushing of inbound and outbound message queues.

Without that, cores could have different values for the same object X
at the same time.

Also this would need an index to tell it which core has the current
copy of each object so it knows who to ask for the current value.
And updates to that index need to be coordinated.

It has race conditions on updates to multiple objects on multiple cores.
To update multiple objects you would need to implement a mutex
using this coherence protocol so that the messages pass
through the same network queues and are processed in order.
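
Roughly, the write path ends up looking something like this (a sketch;
the directory / message helpers are hypothetical, just to show the
ordering constraints):

  typedef struct { int kind, src, obj_id; } msg_t;
  enum { MSG_INVALIDATE, MSG_INVALIDATE_ACK };

  extern int   num_cores, my_core;
  extern int   directory_has_copy(int obj_id, int core);
  extern void  directory_set_owner(int obj_id, int core);
  extern void  send_msg(int core, int kind, int obj_id);
  extern msg_t recv_msg(void);
  extern void  local_cache_drop(int obj_id);
  extern void  local_store(int obj_id, const void *val, int len);

  void sw_coherent_write(int obj_id, const void *new_val, int len)
  {
      int c, pending = 0;

      /* 1: ask every core that may hold a copy to invalidate it */
      for (c = 0; c < num_cores; c++)
          if (c != my_core && directory_has_copy(obj_id, c)) {
              send_msg(c, MSG_INVALIDATE, obj_id);
              pending++;
          }

      /* 2: wait for all the ACKs, but keep servicing incoming
         invalidate requests while waiting, otherwise two cores
         invalidating each other's objects can deadlock */
      while (pending > 0) {
          msg_t m = recv_msg();
          if (m.kind == MSG_INVALIDATE_ACK) {
              pending--;
          } else if (m.kind == MSG_INVALIDATE) {
              local_cache_drop(m.obj_id);
              send_msg(m.src, MSG_INVALIDATE_ACK, m.obj_id);
          }
      }

      /* 3: only now publish the new value and claim ownership */
      local_store(obj_id, new_val, len);
      directory_set_owner(obj_id, my_core);
  }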

Personally I'd prefer to have cache HW take care of all this.

> If the messages actually send the data (as opposed to sending
> just references), then the heaps can be 100% private, but at that point
> you don't need the objects to be immutable either.

This is a weakly ordered write invalidate protocol.
It also has race conditions on updates to multiple objects on
multiple cores and will require mutexes and barriers.

Re: VVM versus single instruction

<su143g$pif$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=23325&group=comp.arch#23325

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: VVM versus single instruction
Date: Wed, 9 Feb 2022 13:19:08 -0600
Organization: A noiseless patient Spider
Lines: 179
Message-ID: <su143g$pif$1@dont-email.me>
References: <777d6a6b-a296-4e82-93ff-365136421e62n@googlegroups.com>
<5124c4f8-4f6a-4852-bbcc-bbcb113e6e22n@googlegroups.com>
<sts6dr$3jh$1@dont-email.me>
<e0cd0c30-e42f-41d3-8f59-6fe02803a700n@googlegroups.com>
<stsn0d$2hf$1@dont-email.me> <jwvk0e6yrjg.fsf-monnier+comp.arch@gnu.org>
<stsvph$9mg$1@dont-email.me> <jwv4k59zef5.fsf-monnier+comp.arch@gnu.org>
<ltSMJ.10203$jwf9.320@fx24.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 9 Feb 2022 19:19:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2394e7d82d878e21280076653c8a4a16";
logging-data="26191"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18epvIp7+1Jab3/zMCIDze0"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:ffVHCbuaRtnD24hj3zzYxc3leSg=
In-Reply-To: <ltSMJ.10203$jwf9.320@fx24.iad>
Content-Language: en-US
 by: BGB - Wed, 9 Feb 2022 19:19 UTC

On 2/9/2022 11:03 AM, EricP wrote:
> Stefan Monnier wrote:
>> BGB [2022-02-07 23:41:03] wrote:
>>> On 2/7/2022 9:20 PM, Stefan Monnier wrote:
>>>>> One approaches multi-threading in a more Erlang-like style (threads
>>>>> do not
>>>>> directly share mutable memory, but instead work by passing messages
>>>>> or data
>>>>> objects via copying them via a "mailbox" or similar, with only
>>>>> immutable
>>>>> data being directly sharable);
>>>> Note that Erlang's data is immutable at the abstraction level of
>>>> the language.  The GC will recover unused memory and recycle it for
>>>> other uses, resulting in that "immutable" memory being modified.
>>>> So unless your GC is careful to flush that "immutable" data from other
>>>> CPU's caches, it still needs cache coherence.
>>> This is not as much of an issue if the program and GC run on the same
>>> core,
>>> and every core/thread also has its own heap.
>>
>> I was thinking of the situation where thread A on CPU A sends
>> a reference to an immutable object to thread B on CPU B.
>> Then later on that same thread A sends "another" reference to some new
>> immutable object to same thread B still on CPU B.
>> Both of those objects may actually reside at the same place in memory
>> (because the first object got GC'd in the mean time), so in order to
>> make sure CPU B doesn't use outdated info from its cache, you need some
>> coherence work.
>
> Yes, this is a coherence protocol, just implemented in software not HW.
> This one approximates a write invalidate protocol but implemented in
> software over a network stack.
>
> To ensure that there is one and only one valid object value at a time,
> i.e. to ensure that the protocol is multi-copy atomic and that
> updates are seen to propagate to all cores at the same time,
> it needs to interlock on invalidate messages waiting for reply ACK's,
> while processing others invalidate requests so it does not deadlock,
> with appropriate flushing of inbound and outbound message queues.
>
> Without that, cores could have different values for the same object X
> at the same time.
>
> Also this would need an index to tell it which core has the current
> copy of each object so it knows who to ask for the current value.
> And updates to that index need to be coordinated.
>
> It has race conditions on updates to multiple objects on multiple cores.
> To update multiple objects you would need to implement a mutex
> using this coherence protocol so that the messages pass
> through the same network queues and are processed in order.
>
> Personally I'd prefer to have cache HW take care of all this.
>

In my case, it is possible to get the hardware to "sorta" handle it,
mostly by using "volatile" memory (auto-flush cache lines within a few
clock cycles after every access).

Big drawback is performance...

So, say, with normal caching, one can get ~ 250-320 MB/s for accesses
within the L1, but with 'volatile' memory it drops to ~ 15-30 MB/s (in
which case it would likely be faster to use manual cache flushing if
doing larger copies, but can work well enough for small data).

Though, there is another mechanism I refer to as "cache sniping", where
one can exploit the core's use of direct-mapped caches to knock things
out of the L1 by accessing another address which maps to the same spot
in the L1 cache as the data one wants to evict.

There are a few special instructions for this:
If the trick will work, they give an address that will evict the line;
if it will not, they give NULL.

On a core with a set-associative L1, it would probably always return
NULL (in which case, an actual cache flush is needed).
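
For a direct-mapped cache the clashing address is also easy to compute
directly in software (a sketch; assumes a 32K direct-mapped L1 D$ and
that the aliased address is itself valid, mapped, cacheable memory):

  #include <stdint.h>

  #define L1D_SIZE (32 * 1024)   /* assumed direct-mapped L1 D$ size */

  /* Two addresses that are equal modulo the cache size land on the
     same line of a direct-mapped cache, so a dummy read of the alias
     evicts the victim line.                                          */
  static inline void cache_snipe(volatile void *addr)
  {
      volatile char *alias =
          (volatile char *)((uintptr_t)addr ^ L1D_SIZE);
      (void)*alias;   /* the fill for 'alias' evicts 'addr' */
  }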

Side note:
I have experimented with set-associative L1's, but in my tests the
relative gains were (at best) fairly modest, and it was harder to
justify this against the higher LUT cost.

Also, 2-way L1 I$ seemed to be more effective at reducing L1 miss rate
than 2-way L1 D$, but this is kinda rendered moot as the L1 I$ in
general has a somewhat lower miss rate (and the direct-mapped L1 I$
still does pretty decent).

The effectiveness of set-associativity for an L1 D$ seems to be
positively correlated with the L1's size, which was sort of a
disincentive. It can't make a 2kB or 4kB L1 "not suck", and while it
works "better" on a 16kB or 32kB L1, the miss rate has already dropped
off enough due to the larger size by that point that it has become
largely moot (however, it comes with the drawback of a fairly obvious
jump in LUT cost; but maybe gets that 96% avg L1 hit-rate to 97.5%).

So, ATM, generally using:
16K L1 I$ (direct-mapped), 32K L1 D$ (direct-mapped).
256K L2 (2-way set-associative, *)
(*: This part uses up nearly all the BRAM in the FPGA I am using. )

For the smaller cores, it may make more sense to use 2K L1's (generally
held entirely in LUTRAM).

I am currently using a 2-way associative L2 cache though (though,
combined with the use of 64B cache lines in the L2, the LUT cost of the
L2 cache is pretty massive).

Though, I guess one could argue that by the time one is willing to pay
the costs for a set-associative L1 they can probably also justify paying
the cost of a proper hardware coherency protocol.

....

>> If the messages actually send the data (as opposed to sending
>> just references), then the heaps can be 100% private, but at that point
>> you don't need the objects to be immutable either.
>
> This is a weakly ordered write invalidate protocol.
> It also has race conditions on updates to multiple objects on
> multiple cores and will require mutexes and barriers.
>

Yeah, if one assumes shared mutable objects.

If one assumes shared immutable data, with only locally mutable objects,
this becomes less of an issue.

If one also abandons the notion of an "object as a discrete entity with
a deterministic state", this issue also goes away. One might have to
deal with objects which have different state depending on the observer.

One big consequence though is that one can no longer use dynamically
modified linked lists or similar, since multiple parties interacting
with a linked list at the same time (or seeing it in an inconsistent
state) are prone to create cyclic loops, lose entities, or otherwise
break things in all sorts of ways.

But, things like arrays could work, if one could avoid threads stomping
on each other (such as by partitioning the array into non-overlapping
sections and having each thread only operate within its own section).
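
E.g., a trivial sketch of the partitioning:

  /* Each worker gets a disjoint slice of the array, so no two threads
     ever write the same elements (ideally the slice boundaries are
     also padded to a multiple of the cache-line size).              */
  void process_slice(float *arr, int n, int thread_id, int num_threads)
  {
      int per   = (n + num_threads - 1) / num_threads;
      int start = thread_id * per;
      int end   = start + per;
      int i;
      if (end > n)
          end = n;
      for (i = start; i < end; i++)
          arr[i] = arr[i] * 2.0f;   /* whatever per-element work */
  }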

Things like cons-lists can also work, given they are external to the
object, and (in an abstract sense) are created once, used immutably, and
then discarded.

However, one would need a way to ensure that a cons-list is in a
consistent state when passing it to another party. In this case, one
can (typically) serialize the cons-list on one side, pass it as a
data-blob, and rebuild it on the other.

In this concept, things like "objects" can be less of a shared entity
with shared state, and more of an abstract entity identified with an ID
number or similar (messages related to the same ID correspond to the
same logical object).

So, both parties have their own (loosely correlated) notion of "object
1234", but this object may exist in different places in memory, and in
different internal states, within each thread.

One could possibly also choose to see the object as an abstract ID
within a set of associated tables or arrays, rather than as a discrete
entity existing somewhere in memory, but this is a separate issue.

....
