Rocksolid Light



devel / comp.arch / Hardware assisted message passing

Subject / Author
* Hardware assisted message passing / Stephen Fuld
+* Re: Hardware assisted message passing / Timothy McCaffrey
|+* Re: Hardware assisted message passing / robf...@gmail.com
||+* Re: Hardware assisted message passing / Stephen Fuld
|||`* Re: Hardware assisted message passing / MitchAlsup
||| `- Re: Hardware assisted message passing / Stephen Fuld
||`- Re: Hardware assisted message passing / MitchAlsup
|`- Re: Hardware assisted message passing / Stephen Fuld
+* Re: Hardware assisted message passing / Theo Markettos
|+- Re: Hardware assisted message passing / Stephen Fuld
|+* Re: Hardware assisted message passing / MitchAlsup
||`* Re: Hardware assisted message passing / EricP
|| `* Re: Hardware assisted message passing / MitchAlsup
||  `- Re: Hardware assisted message passing / BGB
|`* Re: Hardware assisted message passing / EricP
| `* Re: Hardware assisted message passing / George Neuner
|  `* Re: Hardware assisted message passing / EricP
|   +- Re: Hardware assisted message passing / EricP
|   `* Re: Hardware assisted message passing / George Neuner
|    `- Re: Hardware assisted message passing / EricP
`* Re: Hardware assisted message passing / MitchAlsup
 +* Re: Hardware assisted message passing / Thomas Koenig
 |`- Re: Hardware assisted message passing / MitchAlsup
 +* Re: Hardware assisted message passing / Stephen Fuld
 |+- Re: Hardware assisted message passing / MitchAlsup
 |`* Re: Hardware assisted message passing / MitchAlsup
 | `- Re: Hardware assisted message passing / Stephen Fuld
 +* Re: Hardware assisted message passing / EricP
 |+* Re: Hardware assisted message passing / MitchAlsup
 ||+* Re: Hardware assisted message passing / Ivan Godard
 |||`- Re: Hardware assisted message passing / MitchAlsup
 ||+* Re: Hardware assisted message passing / Stephen Fuld
 |||+* Re: Hardware assisted message passing / MitchAlsup
 ||||`- Re: Hardware assisted message passing / MitchAlsup
 |||`- Re: Hardware assisted message passing / EricP
 ||`- Re: Hardware assisted message passing / EricP
 |`* Re: Hardware assisted message passing / Niklas Holsti
 | +- Re: Hardware assisted message passing / MitchAlsup
 | `* Re: Hardware assisted message passing / EricP
 |  `* Re: Hardware assisted message passing / Niklas Holsti
 |   `* Re: Hardware assisted message passing / EricP
 |    +* Re: Hardware assisted message passing / Niklas Holsti
 |    |`* Re: Hardware assisted message passing / EricP
 |    | +* Re: Hardware assisted message passing / Stefan Monnier
 |    | |`- Re: Hardware assisted message passing / EricP
 |    | `* Re: Hardware assisted message passing / Niklas Holsti
 |    |  +* Re: Hardware assisted message passing / MitchAlsup
 |    |  |`* Re: Hardware assisted message passing / Niklas Holsti
 |    |  | `* Re: Hardware assisted message passing / MitchAlsup
 |    |  |  `* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |   +* Re: Hardware assisted message passing / MitchAlsup
 |    |  |   |`- Re: Hardware assisted message passing / Niklas Holsti
 |    |  |   `* Re: Hardware assisted message passing / Stephen Fuld
 |    |  |    `* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |     `* Re: Hardware assisted message passing / Stephen Fuld
 |    |  |      +* Re: Hardware assisted message passing / MitchAlsup
 |    |  |      |`- Re: Hardware assisted message passing / Stephen Fuld
 |    |  |      `* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |       +* Re: Hardware assisted message passing / MitchAlsup
 |    |  |       |`* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |       | `* Re: Hardware assisted message passing / MitchAlsup
 |    |  |       |  `* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |       |   +* Re: Hardware assisted message passing / Ivan Godard
 |    |  |       |   |`* Re: Hardware assisted message passing / Niklas Holsti
 |    |  |       |   | +- Re: Hardware assisted message passing / Ivan Godard
 |    |  |       |   | `* Re: Hardware assisted message passing / Stephen Fuld
 |    |  |       |   |  `* Re: Hardware assisted message passing / MitchAlsup
 |    |  |       |   |   `* Re: Hardware assisted short message passing / MitchAlsup
 |    |  |       |   |    `* Re: Hardware assisted short message passing / Stephen Fuld
 |    |  |       |   |     `* Re: Hardware assisted short message passing / MitchAlsup
 |    |  |       |   |      `* Re: Hardware assisted short message passing / Stephen Fuld
 |    |  |       |   |       +- Re: Hardware assisted short message passing / MitchAlsup
 |    |  |       |   |       +- Re: Hardware assisted short message passing / robf...@gmail.com
 |    |  |       |   |       `- Re: Hardware assisted short message passing / MitchAlsup
 |    |  |       |   `- Re: Hardware assisted message passing / MitchAlsup
 |    |  |       `- Re: Hardware assisted message passing / Ivan Godard
 |    |  `* Re: Hardware assisted message passing / EricP
 |    |   `* Re: Hardware assisted message passing / Niklas Holsti
 |    |    `* Re: Hardware assisted message passing / EricP
 |    |     `- Re: Hardware assisted message passing / Niklas Holsti
 |    `* Re: Hardware assisted message passing / MitchAlsup
 |     `* Re: Hardware assisted message passing / EricP
 |      +* Re: Hardware assisted message passing / Niklas Holsti
 |      |`- Re: Hardware assisted message passing / EricP
 |      `- Re: Hardware assisted message passing / Niklas Holsti
 `* Re: Hardware assisted message passing / Bakul Shah
  `* Re: Hardware assisted message passing / MitchAlsup
   +* Re: Hardware assisted message passing / Bakul Shah
   |+- Re: Hardware assisted message passing / MitchAlsup
   |`* Re: Hardware assisted message passing / Ivan Godard
   | +- Re: Hardware assisted message passing / MitchAlsup
   | `* Re: Hardware assisted message passing / Bakul Shah
   |  `* Re: Hardware assisted message passing / Ivan Godard
   |   `* Re: Hardware assisted message passing / Bakul Shah
   |    `- Re: Hardware assisted message passing / MitchAlsup
   `* Re: Hardware assisted message passing / Bakul Shah
    `- Re: Hardware assisted message passing / MitchAlsup

Hardware assisted message passing

<so2u87$tbg$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22174&group=comp.arch#22174

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Hardware assisted message passing
Date: Mon, 29 Nov 2021 08:13:57 -0800
Organization: A noiseless patient Spider
Lines: 83
Message-ID: <so2u87$tbg$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 29 Nov 2021 16:13:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f536148f6266cf7f6efc88bc504bf437";
logging-data="30064"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/IZrFMYN0steHbDv8KUvtVrMl5/faZrM8="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:9Z4fAPRubGAh7cGWthykChANQKs=
Content-Language: en-US
 by: Stephen Fuld - Mon, 29 Nov 2021 16:13 UTC

When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
the “unifying” power of hardware (well, in their case microcode)
assisted message passing. By unifying power, I mean the idea of
combining several otherwise diverse functions into a single, efficient
mechanism. This allows elimination of several other “specialized”
mechanisms. Putting the basics into hardware is what allows the
efficiency to make this all reasonable.

Seeing Mitch Alsup’s posts in several threads here about what appears to
be at least a similar mechanism he is putting into his design reawakened
my interest in this, so I thought it might be interesting to see how far
it is reasonable to “push” on this to expand its utility into more than
its basic use. I will list some possibilities, and hope others will
comment on “reasonableness” and add other potential uses.

So, the basic use is to send a “message” of typically up to several
hundred bytes (though often much less) from one process to another
process that is willing to accept it. An extension is to allow
the hardware itself to send or receive a message. So on to other uses.

1. The first, and pretty obvious, mechanism to subsume into this
paradigm is that once you have it, you no longer need a separate
instruction to request service from the OS. This is replaced, of
course, with a message from the user program to a process within the OS.

2. Elxsi also used the message passing mechanism to process faults such
as divide by zero. The hardware sends a message to the offending
process, which can choose to process it or not. Thus another mechanism
can be subsumed into message passing.

3. Similarly, with suitable hardware, you no longer need a separate
mechanism for inter-processor interrupts.

4. Elxsi could send messages transparently to processes on other CPUs in
a shared memory multiprocessor system. As Mitch has pointed out, this
can be extended to multiple processors, even without shared memory, as
long as there is some interconnect. This allows arbitrary distribution
of computations across an arbitrary set of systems, and hopefully allows
changing that distribution without requiring source code changes or even
better without recompilation, or best yet, without even taking down the
application. (I think Erlang supports something like this.) In fact,
once you allow that, at least in theory, the sender and receiver could
even be across the internet.

5. Mitch talked about what (at least to me) seems like a very nice
mechanism to use hardware sourced message passing to eliminate the
“first level I/O interrupt handling” in software. This saves some code,
an interrupt and a lock. I think this could possibly be extended to
save some more on the I/O initiation side. The idea is to have the
device handler format a message appropriately, then send it to the
hardware/device. This could eliminate any queuing code in the device
handler and the need for a device lock. It would also improve I/O
throughput as it would reduce the idle time between finishing one I/O
and starting the next (for those devices that don’t have internal
queuing) as the start of the new command would be done in the hardware
without requiring any software intervention.

6. It is not a requirement that the sending and receiving entities be
using the same ISA, as long as they both “understand” the same message
format. As an example, you could regard a host sending a message with a
SATA command to a disk drive, as a program on the host sending a message
to the program running on the CPU on the disk drive, which may very well
have a different ISA from the host. A result of this is that you could
use the message passing mechanism to use co-processors, such as perhaps
an encryption or compression engine. A nice advantage of this is that
for a lower end system, without the hardware co-processor, a software
process can be substituted for the hardware with no source code changes
to the user program and without the overhead of an illegal instruction
interrupt.

7. Last, MPI seems to be the de facto standard for high performance
distributed scientific programs. While I am not advocating directly
implementing all of MPI in hardware, it would probably be useful to see
what capabilities it is reasonable to include to improve the
efficiency of MPI.

Note that there is a related discussion about what features the message
passing hardware should support. I won’t pretend to understand all the
trade offs here, but welcome others’ contributions.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22181&group=comp.arch#22181

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:4446:: with SMTP id w6mr37445944qkp.631.1638281750920;
Tue, 30 Nov 2021 06:15:50 -0800 (PST)
X-Received: by 2002:a9d:4d8b:: with SMTP id u11mr49905708otk.144.1638281750474;
Tue, 30 Nov 2021 06:15:50 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 30 Nov 2021 06:15:50 -0800 (PST)
In-Reply-To: <so2u87$tbg$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=73.188.126.34; posting-account=ujX_IwoAAACu0_cef9hMHeR8g0ZYDNHh
NNTP-Posting-Host: 73.188.126.34
References: <so2u87$tbg$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: timcaff...@aol.com (Timothy McCaffrey)
Injection-Date: Tue, 30 Nov 2021 14:15:50 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3404
 by: Timothy McCaffrey - Tue, 30 Nov 2021 14:15 UTC

On Monday, November 29, 2021 at 11:14:02 AM UTC-5, Stephen Fuld wrote:
( A bunch of stuff )
> Note that there is a related discussion about what features the message
> passing hardware should support. I won’t pretend to understand all the
> trade offs here, but welcome others’ contributions.
>
And where might that thread be hiding?

Anyway, I would like to mention a message passing OS: CTOS (Convergent Technologies OS).

It used the concept of an "Exchange". An exchange could have messages or processes queued to it
(or neither, but not both). Exchanges could exist on the current CPU (in those days there were no
multiple CPU systems), or a remote CPU. The remote CPU could be on a local bus or accessed via a LAN,
or even over TCP/IP.

Typically, services registered an exchange, and processes could send messages to an exchange. Such
services could include things like printers, file services or even network services. You might see that
things like file servers, print servers, and such just kind of "fall out" of the design, you don't do anything
different at the application (or even the OS level) to support a remote printer vs. a local printer.

You could even have a dedicated box that did your WAN (e.g. TCP/IP stack), and just provided that as
a service to other systems.

You also didn't need to have things like semaphores or locks at the user level. If you needed a lock, you
just used a dummy message on an exchange: if the message was queued, then you had the "lock"; if not,
you got queued up waiting for it. When you released the lock, you sent the message back to the exchange.

The message format was designed so that if you had additional data as part of the message, and it was being
sent to a "remote" exchange, the OS knew how to append the data (so there was no back-and-forth from the remote CPU).

It was incredibly well designed, although it had a few warts (like static global assignment of service IDs, IIRC).

This all worked in the early 80s on 8086 CPUs, with a 1 megabit LAN. Made MS-DOS seem like a toy.
Unfortunately, Convergent tried to use the OS to sell overpriced hardware instead of selling it separately.

- Tim

Re: Hardware assisted message passing

<eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22182&group=comp.arch#22182

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:4495:: with SMTP id x21mr37443531qkp.604.1638285980146;
Tue, 30 Nov 2021 07:26:20 -0800 (PST)
X-Received: by 2002:a9d:4f0b:: with SMTP id d11mr50684629otl.227.1638285979909;
Tue, 30 Nov 2021 07:26:19 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 30 Nov 2021 07:26:19 -0800 (PST)
In-Reply-To: <3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:1891:c3e2:63f:f703;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:1891:c3e2:63f:f703
References: <so2u87$tbg$1@dont-email.me> <3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 30 Nov 2021 15:26:20 +0000
Content-Type: text/plain; charset="UTF-8"
 by: robf...@gmail.com - Tue, 30 Nov 2021 15:26 UTC

I have toyed a little bit with hardware message passing. Rather than passing hundreds of bytes around
I planned on using much smaller messages, for example 128 bits. Some OSes like Windows and
MMURTL get by with passing only two words for messages. Larger messages are passed using
pointers or references. Using smaller messages may allow the entire message to be transferred in a
single clock cycle. A small message may be atomic.
I think most of the use can be handled with small messages. The hardware would be simpler for small
messages.

Re: Hardware assisted message passing

<so5lug$lmu$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22184&group=comp.arch#22184

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Tue, 30 Nov 2021 09:10:39 -0800
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <so5lug$lmu$1@dont-email.me>
References: <so2u87$tbg$1@dont-email.me>
<3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 30 Nov 2021 17:10:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="79b4b5152152d06bd366e9e711ccdca2";
logging-data="22238"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/BdguvMtw55fFOcRYSGgNpKJNrqJZUx20="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:O2kmatQvTfZk5iosjaGCbNrqQlI=
In-Reply-To: <3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 30 Nov 2021 17:10 UTC

On 11/30/2021 6:15 AM, Timothy McCaffrey wrote:
> On Monday, November 29, 2021 at 11:14:02 AM UTC-5, Stephen Fuld wrote:
> ( A bunch of stuff )
>> Note that there is a related discussion about what features the message
>> passing hardware should support. I won’t pretend to understand all the
>> trade offs here, but welcome others’ contributions.
>>
> And where might that thread be hiding?

Sorry for my clumsy wording. I know of no existing thread like this,
but just wanted to leave open the possibility of having that discussion.
Happily, you took up that possibility! :-)

>
> Anyway, I would like to mention a message passing OS: CTOS (Convergent Technologies OS).

I had forgotten about CTOS. :-( I never used it, but came into
contact with others who did, once Convergent was bought out by
Burroughs, and CTOS was renamed to BTOS. The people who used it really
liked it.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<so5m98$obp$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22185&group=comp.arch#22185

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Tue, 30 Nov 2021 09:16:22 -0800
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <so5m98$obp$1@dont-email.me>
References: <so2u87$tbg$1@dont-email.me>
<3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
<eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 30 Nov 2021 17:16:24 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="79b4b5152152d06bd366e9e711ccdca2";
logging-data="24953"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+MKZ/QDzHy6JDId9QM/8eythj8qX0K8uk="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:nYGVPVkIUdcATnpBYc/8z33O58o=
In-Reply-To: <eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 30 Nov 2021 17:16 UTC

On 11/30/2021 7:26 AM, robf...@gmail.com wrote:
>
> I have toyed a little bit with hardware message passing. Rather than passing hundreds of bytes around
> I planned on using much smaller messages, for example 128 bits. Some OSes like Windows and
> MMURTL get by with passing only two words for messages. Larger messages are passed using
> pointers or references. Using smaller messages may allow the entire message to be transferred in a
> single clock cycle. A small message may be atomic.
> I think most of the use can be handled with small messages. The hardware would be simpler for small
> messages.

Elxsi had both "small" and "large" messages. While allowing only small
messages does make the hardware simpler, it really hurts when you want
to transfer more data than the size of a small message. Even passing a
pointer or such means the hardware has to do some additional work to
transfer the bulk of the data, which gets icky unless you restrict the
system to a single shared memory.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<cPf*upDAy@news.chiark.greenend.org.uk>


https://www.novabbs.com/devel/article-flat.php?id=22186&group=comp.arch#22186

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!.POSTED!not-for-mail
From: theom+n...@chiark.greenend.org.uk (Theo Markettos)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: 01 Dec 2021 10:32:50 +0000 (GMT)
Organization: University of Cambridge, England
Lines: 200
Message-ID: <cPf*upDAy@news.chiark.greenend.org.uk>
References: <so2u87$tbg$1@dont-email.me>
NNTP-Posting-Host: chiark.greenend.org.uk
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: chiark.greenend.org.uk 1638354772 9540 212.13.197.229 (1 Dec 2021 10:32:52 GMT)
X-Complaints-To: abuse@chiark.greenend.org.uk
NNTP-Posting-Date: Wed, 1 Dec 2021 10:32:52 +0000 (UTC)
User-Agent: tin/1.8.3-20070201 ("Scotasay") (UNIX) (Linux/3.16.0-11-amd64 (x86_64))
Originator: theom@chiark.greenend.org.uk ([212.13.197.229])
 by: Theo Markettos - Wed, 1 Dec 2021 10:32 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
> the “unifying” power of hardware (well, in their case microcode)
> assisted message passing. By unifying power, I mean the idea of
> combining several otherwise diverse functions into a single, efficient
> mechanism. This allows elimination of several other “specialized”
> mechanisms. Putting the basics into hardware is what allows the
> efficiency to make this all reasonable.

We've done a bit of work in this area:
https://www.repository.cam.ac.uk/handle/1810/317181
and related publications at:
https://poets-project.org/publications/

Basically that makes passing messages from one thread to another a
first-class communication primitive, and one that doesn't care if the thread
is running on the other side of a cluster. Messages are kept small to
simplify the networking and keep latency very low (minimal buffering, simple
congestion control etc).

Viewed as inter-thread-communication (ITC), you're right it can
subsume a number of use cases. In particular when you aren't operating on a
shared-memory machine, 'everything is a message' makes a certain degree of
sense.

I think a key point is a software/hardware distinction. Hardware message
formats to me are the easy bit, the hard part is dealing with them at the
other end. This is the reason things use (R)DMA, because memory
reads/writes are a simple API that can be implemented directly in hardware
(what the memory actually means is left to higher level software). If the
hardware messages you're sending are rich enough, you need to wake up
software at the other end to handle them. By the time you've woken up
software and are parsing packet headers you've burnt a lot of performance.

There are several aspects to think about though. Is this synchronous or
asynchronous? For example a trap instruction is synchronous - the thread
ends up in kernel mode. If you're using messages you need to block the
thread and wait for the kernel thread to pick up the reply. If you're
scheduling that synchronously, a message looks pretty similar to a trap
instruction, just with extra context switching overhead (because it's
a generic message to be handled rather than a trap with specific hardware
support for setting up registers appropriately). If you're doing it
asynchronously, the latency before the kernel thread picks it up can be
unbounded.

> Seeing Mitch Alsup’s posts in several threads here about what appears to
> be at least a similar mechanism he is putting into his design reawakened
> my interest in this, so I thought it might be interesting to see how far
> it is reasonable to “push” on this to expand its utility into more than
> its basic use. I will list some possibilities, and hope others will
> comment on “reasonableness” and add other potential uses.
>
> So, the basic use is to send a “message” of typically up to several
> hundred bytes (though often much less) from one process to another
> process that is willing to accept it. An extension is to allow
> the hardware itself to send or receive a message. So on to other uses.
>
> 1. The first, and pretty obvious, mechanism to subsume into this
> paradigm is that once you have it, you no longer need a separate
> instruction to request service from the OS. This is replaced, of
> course, with a message from the user program to a process within the OS.

Isn't a syscall already leaning in this direction? If the service you want
is anything complicated, you write your request into a data structure, put
the pointer to that data structure into a register, and call the trap
instruction. That context switches into the OS which then dispatches the
request to the right handler. If the service you want is something simple
you don't need to bother with the data structure and pass parameters in
registers so it's faster. By forcing a different paradigm, aren't you just
adding more overhead?

> 2. Elxsi also used the message passing mechanism to process faults
> such as divide by zero. The hardware sends a message to the offending
> process, which can choose to process it or not. Thus another mechanism
> can be subsumed into message passing.

Faults are just variations of traps, so the above also applies.

> 3. Similarly, with suitable hardware, you no longer need a separate
> mechanism for inter-processor interrupts.

PCIe offers message-signalled interrupts, so things are already in that
direction. Again the mechanism for how you signal interrupts isn't that
important, it's how you handle them that's the tricky part. The above about
trap handlers applies, although a nice thing about MSIs is that you could in
theory associate a handler with a specific source (ie MSI 0x12345678 has a
specific function pointer to call, rather than going into a dispatcher to
work out what to do) which reduces latency. I suppose you could argue that
might apply to other messages like syscalls (ie not just a 'trap'
instruction, but a 'trap 0x12345678' instruction that has a specific handler
associated with it, rather than a generic dispatcher handling all traps).

> 4. Elxsi could send messages transparently to processes on other CPUs
> in a shared memory multiprocessor system. As Mitch has pointed out, this
> can be extended to multiple processors, even without shared memory, as
> long as there is some interconnect. This allows arbitrary distribution of
> computations across an arbitrary set of systems, and hopefully allows
> changing that distribution without requiring source code changes or even
> better without recompilation, or best yet, without even taking down the
> application. (I think Erlang supports something like this.) In fact, once
> you allow that, at least in theory, the sender and receiver could even be
> across the internet.

This is what we do in POETS. The wish to keep tight bounds on latency
imposes certain requirements on buffering and reliability, but if you
weren't so fussed by latency there's no reason why you couldn't go over a
wider area. Another of our papers explores the paradigms that helps to
bridge the messaging into software (ie you can signal to other threads in
what seems like a natural way, rather than something that requires explicit
management, eg marshalling RDMA transactions):
https://eprints.ncl.ac.uk/file_store/production/267875/5454FB9A-0A47-4C08-AD85-3D464D286E93.pdf

> 5. Mitch talked about what (at least to me) seems like a very nice
> mechanism to use hardware sourced message passing to eliminate the
> “first level I/O interrupt handling” in software. This saves some code,
> an interrupt and a lock. I think this could possibly be extended to
> save some more on the I/O initiation side. The idea is to have the
> device handler format a message appropriately, then send it to the
> hardware/device. This could eliminate any queuing code in the device
> handler and the need for a device lock. It would also improve I/O
> throughput as it would reduce the idle time between finishing one I/O
> and starting the next (for those devices that don’t have internal
> queuing) as the start of the new command would be done in the hardware
> without requiring any software intervention.

I think it depends what the content of the message is. I like to compare
SATA and NVMe here. SATA is a message-based protocol, there are commands
and responses, with various data structures packed into the commands as to
what you want to do. Essentially what's happening is your OS is chatting to
the firmware on your HDD/SSD via these messages - in effect it's ITC but
the threads are running across heterogeneous hardware. But actually you
can't do native ITC, so what happens is you interpose a host controller, eg
AHCI on PCIe, whose host-OS driver is actually putting messages in memory
buffers and then orchestrating DMA to the AHCI controller, who pulls the
message out again and pushes it down the SATA wire. So you have this
illusion of ITC but actually there's lots of DMA wrangling happening behind
the scenes to make it happen.

Meanwhile NVMe is a DMA protocol. It's simple memory reads and writes made
from the device. The device driver sets up the transfer (organises where
they're going to go from and to, opens up IOMMU windows to allow it, etc)
but the device can then just do it directly. There's no faux conversation
between the host OS and the firmware going on, it's just PCIe reads and
writes.

(Of course the firmware does have side channels for things that aren't reads
and writes, like asking about flash health, and there are messages for
those. But they're off the critical path)

You get much better performance by simply binning the command/response model
completely, if that's something you are able to do.
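The contrast above can be sketched in C. These are deliberately simplified, invented layouts, not the real AHCI or NVMe structures: the command-message model packs a request that firmware must parse, while the DMA model just describes memory in a queue entry and rings a doorbell.

```c
#include <stdint.h>
#include <string.h>

/* Deliberately simplified, invented layouts -- not the real AHCI/NVMe
 * structures.  The command-message model packs a request that firmware
 * must parse; the DMA model just describes memory and rings a doorbell. */

struct cmd_message {          /* SATA/AHCI-style: a chatty command      */
    uint8_t  opcode;          /* what to do                             */
    uint64_t lba;             /* where on the medium                    */
    uint32_t count;           /* how much -- firmware parses all of it  */
};

struct sq_entry {             /* NVMe-style: describe memory, then go   */
    uint8_t  opcode;
    uint64_t prp;             /* physical address the device DMAs into  */
    uint64_t lba;
    uint16_t nblocks;
};

#define SQ_SLOTS 16
static struct sq_entry sq[SQ_SLOTS];     /* submission queue in host RAM */
static volatile uint32_t doorbell;       /* stands in for an MMIO write  */

/* Queue a read: fill a slot, then advance the doorbell.  No firmware
 * conversation -- the device reads sq[] and DMAs the data directly.    */
uint32_t submit_read(uint64_t buf_phys, uint64_t lba, uint16_t nblocks)
{
    uint32_t slot = doorbell % SQ_SLOTS;
    sq[slot] = (struct sq_entry){ .opcode = 0x02, .prp = buf_phys,
                                  .lba = lba, .nblocks = nblocks };
    doorbell = doorbell + 1;             /* "ring": tail has advanced    */
    return slot;
}
```

The point of the sketch: once the entry is in host memory, the only host->device signal is a single store, which is why the DMA model has no command/response conversation on the critical path.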

> 6. It is not a requirement that the sending and receiving entities be
> using the same ISA, as long as they both “understand” the same message
> format. As an example, you could regard a host sending a message with a
> SATA command to a disk drive, as a program on the host sending a message
> to the program running on the CPU on the disk drive, which may very well
> have a different ISA from the host. A result of this is that you could
> use the message passing mechanism to use co-processors, such as perhaps
> an encryption or compression engine. A nice advantage of this is that
> for a lower end system, without the hardware co-processor, a software
> process can be substituted for the hardware with no source code changes
> to the user program and without the overhead of an illegal instruction
> interrupt.


Re: Hardware assisted message passing

<so8v7l$m16$1@dont-email.me>

 by: Stephen Fuld - Wed, 1 Dec 2021 23:07 UTC

On 12/1/2021 2:32 AM, Theo Markettos wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
>> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
>> the “unifying” power of hardware (well, in their case microcode)
>> assisted message passing. By unifying power, I mean the idea of
>> combining several otherwise diverse functions into a single, efficient
>> mechanism. This allows elimination of several other “specialized”
>> mechanisms. Putting the basics into hardware is what allows the
>> efficiency to make this all reasonable.
>
> We've done a bit of work in this area:
> https://www.repository.cam.ac.uk/handle/1810/317181
> and related publications at:
> https://poets-project.org/publications/

I was unaware of your work. Thank you for pointing it out. I need to
spend more time reading and thinking.

>
> Basically that makes passing messages from one thread to another a
> first-class communication primitive, and one that doesn't care if the thread
> is running on the other side of a cluster.

Yes.

> Messages are kept small to
> simplify the networking and keep latency very low (minimal buffering, simple
> congestion control etc).

I understand the motivation, but it can be very limiting. If you allow
messages up to N bytes, there is always some valid use for a message of
N+1 bytes. :-(

I sort of like Elxsi's idea of short and long messages. That way, you
gain the advantages you said for the short messages without sacrificing
too much for the longer ones. But it does complicate things.
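A hypothetical short/long split in that spirit (sizes and field names invented, not Elxsi's actual formats): short messages ride inline in the packet, while long ones carry a descriptor the receiver resolves later, e.g. via RDMA.

```c
#include <stdint.h>
#include <string.h>

/* Invented short/long message split -- not Elxsi's actual formats.
 * Short payloads travel with the message; long payloads stay in memory
 * and only a descriptor travels, to be fetched (e.g. by RDMA) later.  */

#define SHORT_MAX 64u

struct msg {
    uint32_t len;
    union {
        uint8_t inline_bytes[SHORT_MAX];  /* short: payload travels inline */
        struct {                          /* long: payload stays put       */
            const uint8_t *base;          /* where the receiver fetches it */
            uint64_t       token;         /* access/translation handle     */
        } desc;
    } u;
};

/* Returns 0 for a short (inline) message, 1 for a long (descriptor) one. */
int msg_build(struct msg *m, const uint8_t *payload, uint32_t len,
              uint64_t token)
{
    m->len = len;
    if (len <= SHORT_MAX) {               /* fast path: one network hop   */
        memcpy(m->u.inline_bytes, payload, len);
        return 0;
    }
    m->u.desc.base  = payload;            /* slow path: deferred transfer */
    m->u.desc.token = token;
    return 1;
}
```

This is exactly where the complication shows up: the receiver now has two delivery paths to handle, which is the cost of keeping the common case small and fast.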

> Viewed as inter-thread-communication (ITC), you're right it can
> subsume a number of use cases. In particular when you aren't operating on a
> shared-memory machine, 'everything is a message' makes a certain degree of
> sense.

Agreed.

>
> I think a key point is a software/hardware distinction. Hardware message
> formats to me are the easy bit, the hard part is dealing with them at the
> other end. This is the reason things use (R)DMA, because memory
> reads/writes are simple API that can be implemented directly in hardware
> (what the memory actually means is left to higher level software).

Yes. In fact for my idea of using message passing primitives to send
say an ATA packet to a disk drive, I sort of assumed that the actual
data transfer for the read or write would use RDMA.

> If the
> hardware messages you're sending are rich enough, you need to wake up
> software at the other end to handle them. By the time you've woken up
> software and are parsing packet headers you've burnt a lot of performance.

Yes.

> There are several aspects to think about though. Is this synchronous or
> asynchronous? For example a trap instruction is synchronous - the thread
> ends up in kernel mode. If you're using messages you need to block the
> thread and wait for the kernel thread to pick up the reply. If you're
> scheduling that synchronously, a message looks pretty similar to a trap
> instruction, just with extra context switching overhead (because it's
> a generic message to be handled rather than a trap with specific hardware
> support for setting up registers appropriately). If you're doing it
> asynchronously, the latency before the kernel thread picks it up can be
> unbounded.

Agreed. My thought was doing what a trap instruction would do, but
doing it with the message passing instructions (perhaps with a specific,
hardware-defined destination address). At least you save an op-code.

>> Seeing Mitch Alsup’s posts in several threads here about what appears to
>> be at least a similar mechanism he is putting into his design reawakened
>> my interest in this, so I thought it might be interesting to see how far
>> it is reasonable to “push” on this to expand its utility into more than
>> its basic use. I will list some possibilities, and hope others will
>> comment on “reasonableness” and add other potential uses.
>>
>> So, the basic use is to send a “message” of typically up to several
>> hundred bytes (though often much less) from one process to another
>> process that is willing to accept it. An extension is to allow
>> the hardware itself to send or receive a message. So on to other uses.
>>
>> 1. The first, and pretty obvious, mechanism to subsume into this
>> paradigm is that once you have it, you no longer need a separate
>> instruction to request service from the OS. This is replaced, of
>> course, with a message from the user program to a process within the OS.
>
> Isn't a syscall already leaning in this direction?

Yes, but again, you save an op-code.

> If the service you want
> is anything complicated, you write your request into a data structure, put
> the pointer to that data structure into a register, and call the trap
> instruction. That context switches into the OS which then dispatches the
> request to the right handler. If the service you want is something simple
> you don't need to bother with the data structure and pass parameters in
> registers so it's faster. By forcing a different paradigm, aren't you just
> adding more overhead?

I hope not, but obviously there are a lot of details to be worked out.
Note that Elxsi did this with a modified Unix-like OS.

>> 2. Elxsi also used the message passing mechanism to process faults
>> such as divide by zero. The hardware sends a message to the offending
>> process, which can choose to process it or not. Thus another mechanism
>> can be subsumed into message passing.
>
> Faults are just variations of traps, so the above also applies.
>
>> 3. Similarly, with suitable hardware, you no longer need a separate
>> mechanism for inter-processor interrupts.
>
> PCIe offers message-signalled interrupts, so things are already in that
> direction. Again the mechanism for how you signal interrupts isn't that
> important, it's how you handle them that's the tricky part. The above about
> trap handlers applies, although a nice thing about MSIs is that you could in
> theory associate a handler with a specific source (ie MSI 0x12345678 has a
> specific function pointer to call, rather than going into a dispatcher to
> work out what to do) which reduces latency.

Again, Elxsi has a send message to hardware function, which perhaps
could do this already. (I never got into that level of detail of their
design)
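The per-source handler idea could look roughly like the sketch below (names invented, not any real interrupt controller API): each MSI vector indexes straight to a registered function pointer, so delivery skips the generic "work out who this is for" dispatcher.

```c
#include <stdint.h>
#include <stddef.h>

/* Invented per-vector dispatch table -- not a real controller API.
 * Each MSI vector maps directly to a handler, so arrival is a single
 * indexed call rather than a scan through a generic dispatcher.       */

#define NVECS 256

typedef void (*msi_handler)(uint32_t vector, void *ctx);

static struct { msi_handler fn; void *ctx; } vec_table[NVECS];

void msi_register(uint32_t vector, msi_handler fn, void *ctx)
{
    vec_table[vector % NVECS].fn  = fn;
    vec_table[vector % NVECS].ctx = ctx;
}

/* Called on message arrival; returns -1 for a spurious (unregistered)
 * vector, 0 after invoking the bound handler directly.                */
int msi_deliver(uint32_t vector)
{
    msi_handler fn = vec_table[vector % NVECS].fn;
    if (!fn)
        return -1;
    fn(vector, vec_table[vector % NVECS].ctx);   /* direct call, no scan */
    return 0;
}
```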

> I suppose you could argue that
> might apply to other messages like syscalls (ie not just a 'trap'
> instruction, but a 'trap 0x12345678' instruction that has a specific handler
> associated with it, rather than a generic dispatcher handling all traps)
>
>> 4. Elxsi could send messages transparently to processes on other CPUs
>> in a shared memory multiprocessor system. As Mitch has pointed out, this
>> can be extended to multiple processors, even without shared memory, as
>> long as there is some interconnect. This allows arbitrary distribution of
>> computations across an arbitrary set of systems, and hopefully allows
>> changing that distribution without requiring source code changes or even
>> better without recompilation, or best yet, without even taking down the
>> application. (I think Erlang supports something like this.) In fact, once
>> you allow that, at least in theory, the sender and receiver could even be
>> across the internet.
>
> This is what we do in POETS. The wish to keep tight bounds on latency
> imposes certain requirements on buffering and reliability, but if you
> weren't so fussed by latency there's no reason why you couldn't go over a
> wider area. Another of our papers explores the paradigms that helps to
> bridge the messaging into software (ie you can signal to other threads in
> what seems like a natural way, rather than something that requires explicit
> management, eg marshalling RDMA transactions):
> https://eprints.ncl.ac.uk/file_store/production/267875/5454FB9A-0A47-4C08-AD85-3D464D286E93.pdf
>
>> 5. Mitch talked about what (at least to me) seems like a very nice
>> mechanism to use hardware sourced message passing to eliminate the
>> “first level I/O interrupt handling” in software. This saves some code,
>> an interrupt and a lock. I think this could possibly be extended to
>> save some more on the I/O initiation side. The idea is to have the
>> device handler format a message appropriately, then send it to the
>> hardware/device. This could eliminate any queuing code in the device
> handler and the need for a device lock. It would also improve I/O
>> throughput as it would reduce the idle time between finishing one I/O
>> and starting the next (for those devices that don’t have internal
>> queuing) as the start of the new command would be done in the hardware
>> without requiring any software intervention.
>
> I think it depends what the content of the message is. I like to compare
> SATA and NVMe here. SATA is a message-based protocol, there are commands
> and responses, with various data structures packed into the commands as to
> what you want to do. Essentially what's happening is your OS is chatting to
> the firmware on your HDD/SSD via these messages - it's essentially ITC but
> the threads are running across heterogeneous hardware. But actually you
> can't do native ITC, so what happens is you interpose a host controller, eg
> AHCI on PCIe, whose host-OS driver is actually putting messages in memory
> buffers and then orchestrating DMA to the AHCI controller, who pulls the
> message out again and pushes it down the SATA wire. So you have this
> illusion of ITC but actually there's lots of DMA wrangling happening behind
> the scenes to make it happen.


Re: Hardware assisted message passing

<e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com>

 by: MitchAlsup - Sun, 5 Dec 2021 21:09 UTC

First a note stating that my computer went tits up last Thursday and I
have not been able to access comp.arch via my wife's computer since
google (in all its wisdom) thinks her computer could NEVER let her
husband on Google.Groups under a name not hers...................
<
Bad Google, Bad Google........
<
On Monday, November 29, 2021 at 8:14:02 AM UTC-8, Stephen Fuld wrote:
> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
> the “unifying” power of hardware (well, in their case microcode)
> assisted message passing. By unifying power, I mean the idea of
> combining several otherwise diverse functions into a single, efficient
> mechanism. This allows elimination of several other “specialized”
> mechanisms. Putting the basics into hardware is what allows the
> efficiency to make this all reasonable.
>
> Seeing Mitch Alsup’s posts in several threads here about what appears to
> be at least a similar mechanism he is putting into his design reawakened
> my interest in this, so I thought it might be interesting to see how far
> it is reasonable to “push” on this to expand its utility into more than
> its basic use. I will list some possibilities, and hope others will
> comment on “reasonableness” and add other potential uses.
<
Thanks for the referral.
>
> So, the basic use is to send a “message” of typically up to several
> hundred bytes (though often much less) from one process to another
> process that is willing to accept it. An extension is to allow
> the hardware itself to send or receive a message. So on to other uses.
<
My intended use was to explicitly make interrupts a first class citizen
of the architecture. An interrupt causes a waiting thread to receive control
at a specified priority and within a specified set of CPUs (affinity).
<
My southbridge (PCIe) uses/converts wire interrupts into message based
interrupts.
<
CPUs can send interrupts to other threads (and by SW convention other
SPUs.)
<
An interrupt is associated with a thread; that thread has a priority, a
thread header (PSW, Root Pointer, State), and a register file.
<
The receipt of an interrupt either gets queued onto that thread (if it is
already running) or activates that thread if it is not running. The queueing
guarantees that the thread is never active more than once, and that no
interrupts are ever lost.
<
Should the newly active thread have a higher priority on the CPU of its
affinity, a message is sent to the lowest priority thread currently running
in the affinity set of the interrupt handler. This message contains all
data necessary to perform a context switch into the interrupt handling
thread. The arrival of such a message at the CPU causes its current
thread state to be returned to the front of its run queue.
<
The CPU does NOT choose when to context switch, or how. This is
all decided elsewhere.
<
In this model, the interrupt message is one (1) 64-bit container, and
the queueing system is designed to efficiently manage smallish
messages (1-8 doublewords), performing this task without locking
since it is being performed in a place where all requests have already
been serialized and are performed one at a time. Note: this place
is not a CPU !! { Critically important } Nor is it SW that runs on a CPU.
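The queueing discipline described above can be modeled in a few lines (a sketch with invented names; in the real design the serialization point is a HW unit, not code on a CPU, which is why ordinary sequential code stands in for it here and no lock appears): a delivery to an already-active thread is queued, never dropped, and a thread is never activated twice at once.

```c
#include <stdint.h>
#include <stdbool.h>

/* Single-threaded model of the lock-free queueing point.  Invented
 * names; sequential code stands in for the serializing HW unit.      */

#define QDEPTH 32

struct ithread {
    bool     active;              /* at most one activation at a time */
    uint64_t pending[QDEPTH];     /* queued interrupt messages        */
    unsigned head, tail;
};

/* Returns true if this delivery activated the thread, false if it
 * was queued behind an activation already in flight.  Either way the
 * message is recorded, so no interrupt is ever lost.                 */
bool deliver(struct ithread *t, uint64_t msg)
{
    t->pending[t->tail++ % QDEPTH] = msg;
    if (t->active)
        return false;
    t->active = true;
    return true;
}

/* The activated thread drains its queue and only deactivates when the
 * queue is empty; returns how many messages it handled.              */
unsigned drain(struct ithread *t)
{
    unsigned handled = 0;
    while (t->head != t->tail) { t->head++; handled++; }
    t->active = false;
    return handled;
}
```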
>
> 1. The first, and pretty obvious, mechanism to subsume into this
> paradigm is that once you have it, you no longer need a separate
> instruction to request service from the OS. This is replaced, of
> course, with a message from the user program to a process within the OS.
<
I worked Ada rendezvous into the system, too. Rendezvous have almost all
of the properties of the std interrupts, but add in the address space
join/release, and the reactivation of the caller at the end of rendezvous.
>
> 2. Elxsi also used the message passing mechanism to process faults such
> as divide by zero. The hardware sends a message to the offending
> process, which can choose to process it or not. Thus another mechanism
> can be subsumed into message passing.
<
Exceptions are not interrupts. Exception handlers are bound to the thread
creating the exception and are synchronous; interrupts are bound to a
particular handler asynchronously and for a long duration (the SATA disk
handler is associated with multiple SATA drive interrupts for the entire
time the system has power turned on).
>
> 3. Similarly, with suitable hardware, you no longer need a separate
> mechanism for inter-processor interrupts.
<
Exactly--and a useful byproduct. But this unification needs to be expressible
at the instruction level. That is the SATA interrupt handler needs a way to
schedule the user-I/O-completion handler efficiently (1 instruction) and that
thread needs an efficient way to take the waiting thread off the wait queue
and drop it back on the run queue (also 1-instruction). There is no particular
reason that any CPU context switches are processed by running instructions!
<
Indeed, having the entire thread state arrive as the interrupt message
gets rid of lots of OS issues (cue EricP)
<
You may also want/need the ability to use IPIs between different virtual
machines, or perform virtual interrupt remapping so what HW uses and
what SW uses are unified but mapped just like memory pages and access
rights.
>
> 4. Elxsi could send messages transparently to processes on other CPUs in
> a shared memory multiprocessor system. As Mitch has pointed out, this
> can be extended to multiple processors, even without shared memory, as
> long as there is some interconnect. This allows arbitrary distribution
> of computations across an arbitrary set of systems, and hopefully allows
> changing that distribution without requiring source code changes or even
> better without recompilation, or best yet, without even taking down the
> application. (I think Erlang supports something like this.) In fact,
> once you allow that, at least in theory, the sender and receiver could
> even be across the internet.
>
> 5. Mitch talked about what (at least to me) seems like a very nice
> mechanism to use hardware sourced message passing to eliminate the
> “first level I/O interrupt handling” in software. This saves some code,
> an interrupt and a lock.
<
The savings are bigger than that!
<
The CPU which will ultimately "take" a context switch performs none
of the work of context switching--so it continues on with what it was
working on until the context of the (ahem) context switch actually
arrives. The context switch is an exchange--new state is loaded while
old state is unloaded, bundled into a message, and passed back to the
queueing system. (Very CDC 6600 Exchange-Jump like.) The CPU goes
from thread-computing to thread-computing without having to run ANY
operating system code !!
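The exchange can be sketched as a single operation on a state bundle (register count and field names invented; the real state would be whatever the architecture's thread header and register file contain): the arriving message *is* the new state, and the displaced state rides back out in the same buffer.

```c
#include <stdint.h>

/* Invented thread-state bundle, in the spirit of an Exchange-Jump:
 * one operation loads the new thread and captures the old one, with
 * no OS code in between.                                             */

struct tstate {
    uint64_t psw, root_ptr;
    uint64_t regs[8];          /* stand-in for the real register file */
};

/* Load new state from the message, bundle old state into the same
 * message so it can be handed back to the queueing system.           */
void exchange(struct tstate *cpu, struct tstate *msg)
{
    struct tstate old = *cpu;  /* unload the current thread           */
    *cpu = *msg;               /* new thread starts computing         */
    *msg = old;                /* old state rides back in the message */
}
```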
<
The queueing system can consume an arriving interrupt every cycle,
and after a short pipeline delay, it can spit out a new context-switch
message every 5-ish cycles (imagine there are 16-64 cores on the chip),
all without any locking !!
<
> I think this could possibly be extended to
> save some more on the I/O initiation side. The idea is to have the
> device handler format a message appropriately, then send it to the
> hardware/device. This could eliminate any queuing code in the device
> handler and the need for a device lock. It would also improve I/O
> throughput as it would reduce the idle time between finishing one I/O
> and starting the next
<
Most of the devices for which this is useful already have input command
queues and many of the devices rearrange requests to optimize the
device's performance. So, I am simply expecting lower overall overhead
allocated to all parts of I/O, while also minimizing the total cycle counts
between I/O-events.
<
> (for those devices that don’t have internal
> queuing) as the start of the new command would be done in the hardware
> without requiring any software intervention.
>
> 6. It is not a requirement that the sending and receiving entities be
> using the same ISA, as long as they both “understand” the same message
> format. As an example, you could regard a host sending a message with a
> SATA command to a disk drive, as a program on the host sending a message
> to the program running on the CPU on the disk drive, which may very well
> have a different ISA from the host.
<
This should (SHOULD) be the SATA device command queue rather than
the CPU running the device. The layer of abstraction is very powerful
here.
<
> A result of this is that you could
> use the message passing mechanism to use co-processors, such as perhaps
> an encryption or compression engine.
<
Just route the data to the decrypter (HW or SW) and set him up to deliver the
decrypted data to the original requestor.
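That hardware/software interchangeability can be sketched as a port binding (all names invented; XOR stands in for real crypto): the sender targets a message port, and whether a hardware engine or a software process sits behind it is purely a binding decision.

```c
#include <stdint.h>
#include <stddef.h>

/* Invented port abstraction: the sender's code is identical whether a
 * HW coprocessor or a SW fallback is bound behind the port.          */

typedef int (*port_sink)(const uint8_t *msg, size_t len);

struct port { port_sink sink; };

void port_bind(struct port *p, port_sink s) { p->sink = s; }

int port_send(struct port *p, const uint8_t *msg, size_t len)
{
    return p->sink(msg, len);      /* same call either way */
}

/* Software fallback "engine": XOR stands in for real decryption.
 * On a bigger system, port_bind() would instead point at a doorbell
 * for the hardware coprocessor; user programs never change.          */
static uint8_t out[64];
static int sw_xor_engine(const uint8_t *msg, size_t len)
{
    for (size_t i = 0; i < len && i < sizeof out; i++)
        out[i] = msg[i] ^ 0x5a;
    return 0;
}
```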
<
> A nice advantage of this is that
> for a lower end system, without the hardware co-processor, a software
> process can be substituted for the hardware with no source code changes
> to the user program and without the overhead of an illegal instruction
> interrupt.
>
> 7. Last, MPI seems to be the de facto standard for high performance
> distributed scientific programs. While I am not advocating directly
> implementing all of MPI in hardware, it would probably be useful to see
> what it is reasonable to do to include some capabilities to improve the
> efficiency of MPI.
<
My plan is to build MPI support into HW such that SW is in charge of setting
stuff up while HW is in charge of transporting messages from producer
to consumer; possibly with virtual->physical translation along the path.
<
Interrupt wires are mapped to virtual MPIs; virtual MPIs are then mapped
to physical MPIs and routed to various systems.
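A minimal sketch of that two-level remapping, analogous to paging (table sizes and names invented): a wire interrupt translates to a virtual interrupt number, and a second SW-managed table translates that to a physical target.

```c
#include <stdint.h>

/* Invented two-level interrupt remap, analogous to page translation:
 * wire -> virtual (fixed by the platform), virtual -> physical
 * (managed by SW, with a valid bit like a page-table entry).         */

#define NWIRES 64
#define NVIRT  256

static uint16_t wire_to_virt[NWIRES];
static struct { uint16_t target; uint8_t valid; } virt_to_phys[NVIRT];

void map_virt(uint16_t v, uint16_t phys_target)
{
    virt_to_phys[v % NVIRT].target = phys_target;
    virt_to_phys[v % NVIRT].valid  = 1;
}

/* Returns the physical target for a wire interrupt, or -1 if the
 * virtual interrupt is unmapped (analogous to a page fault).         */
int route_wire_irq(unsigned wire)
{
    uint16_t v = wire_to_virt[wire % NWIRES];
    if (!virt_to_phys[v % NVIRT].valid)
        return -1;
    return virt_to_phys[v % NVIRT].target;
}
```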
>
> Note that there is a related discussion about what features the message
> passing hardware should support. I won’t pretend to understand all the
> trade offs here, but welcome other’s contributions.
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)


Re: Hardware assisted message passing

<sojb26$tqu$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Sun, 5 Dec 2021 21:30 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> First a note stating that my computer went tits up last Thursday and I
> have not been able to access comp.arch via my wife's computer since
> google (in all its wisdom) thinks her computer could NEVER let her
> husband on Google.Groups under a name not hers...................

Don't you have access to an NNTP server, and a decent newsreader?

Google Groups is probably the worst thing that ever happened
to Usenet.

Re: Hardware assisted message passing

<89dad950-6ca9-4836-9d0d-a5d0d09fcc72n@googlegroups.com>

 by: MitchAlsup - Sun, 5 Dec 2021 21:42 UTC

On Wednesday, December 1, 2021 at 2:32:54 AM UTC-8, Theo Markettos wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
> > When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
> > the “unifying” power of hardware (well, in their case microcode)
> > assisted message passing. By unifying power, I mean the idea of
> > combining several otherwise diverse functions into a single, efficient
> > mechanism. This allows elimination of several other “specialized”
> > mechanisms. Putting the basics into hardware is what allows the
> > efficiency to make this all reasonable.
> We've done a bit of work in this area:
> https://www.repository.cam.ac.uk/handle/1810/317181
> and related publications at:
> https://poets-project.org/publications/
<
Thank you for jumping into this topic !
>
> Basically that makes passing messages from one thread to another a
> first-class communication primitive, and one that doesn't care if the thread
> is running on the other side of a cluster. Messages are kept small to
> simplify the networking and keep latency very low (minimal buffering, simple
> congestion control etc).
<
I went so far as to put them into the ISA--there is (in effect) a
send-interrupt instruction.
>
> Viewed as inter-thread-communication (ITC), you're right it can
> subsume a number of use cases. In particular when you aren't operating on a
> shared-memory machine, 'everything is a message' makes a certain degree of
> sense.
<
I did not go to the "everything is a message" model. Instead I concentrated
on the virtualization of interrupts (device->specific CPU) and on the things
that need to be (more) efficient: delivering interrupts to interrupt handlers,
delivering IPIs between CPUs, and making context switching especially efficient
without blowing register file space inside the CPUs.
>
> I think a key point is a software/hardware distinction. Hardware message
> formats to me are the easy bit, the hard part is dealing with them at the
> other end. This is the reason things use (R)DMA, because memory
> reads/writes are simple API that can be implemented directly in hardware
> (what the memory actually means is left to higher level software).
<
I (a HW guy) see it a bit differently:: You can send small messages (cache
line) efficiently. Within a cache line you can send multiple addresses and
other data. From here, it is fairly easy for SW to request that pages be
mapped from the arrived message. This is how non-local accesses are
performed:: page-based data transfer over the network.
<
> If the
> hardware messages you're sending are rich enough, you need to wake up
> software at the other end to handle them. By the time you've woken up
> software and are parsing packet headers you've burnt a lot of performance.
<
Yep--but if that is the model you want, why should HW get in the way ??
>
> There are several aspects to think about though. Is this synchronous or
> asynchronous? For example a trap instruction is synchronous - the thread
> ends up in kernel mode.
<
Traps and exceptions are synchronous. (you caused them)
<
Interrupts and messages are asynchronous (they come from elsewhere)
From the point where you know you want to send a message, until the
point where the message is being received by recipient, there should
be no locking necessary ! This stuff is serially-reusable and if you can
provide serial reusability (for free !) you don't need locking !!
<
> If you're using messages you need to block the
> thread and wait for the kernel thread to pick up the reply.
<
After a kernel has set up a message channel from thread A to thread B,
thread A should be able to send a message to thread B without any
kernel activity needed !!
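One concrete shape for that (a sketch with invented names): the kernel's only job is to map one shared ring into both address spaces; after that, every send and receive is plain loads and stores. This single-producer/single-consumer model is single-threaded and omits the memory barriers a real shared-memory version would need.

```c
#include <stdint.h>
#include <string.h>

/* Invented user-space channel: after one kernel setup (mapping this
 * struct into both address spaces), sends need no syscall at all.
 * SPSC, so the indices need no lock; barriers omitted in this model. */

#define SLOTS 8
#define MSGSZ 32

struct ring {                     /* lives in the shared mapping      */
    volatile uint32_t head, tail;
    uint8_t slot[SLOTS][MSGSZ];
};

int chan_send(struct ring *r, const void *msg, size_t len)  /* thread A */
{
    if (r->tail - r->head == SLOTS || len > MSGSZ)
        return -1;                            /* full: back-pressure  */
    memcpy(r->slot[r->tail % SLOTS], msg, len);
    r->tail++;                                /* publish, no syscall  */
    return 0;
}

int chan_recv(struct ring *r, void *msg, size_t len)        /* thread B */
{
    if (r->head == r->tail)
        return -1;                            /* empty                */
    memcpy(msg, r->slot[r->head % SLOTS], len > MSGSZ ? MSGSZ : len);
    r->head++;
    return 0;
}
```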
<
> If you're
> scheduling that synchronously, a message looks pretty similar to a trap
> instruction, just with extra context switching overhead (because it's
> a generic message to be handled rather than a trap with specific hardware
> support for setting up registers appropriately). If you're doing it
> asynchronously, the latency before the kernel thread picks it up can be
> unbounded.
<
You are still under the assumption of 2 things that may not be what you
think they are::
1) you are assuming the CPU morphs from context to context.
2) you are assuming the queueing is performed by a CPU
<
Given these constraints, what you say is correct.
<
But there are ways to configure the system where those assumptions are
incorrect.
<
> > Seeing Mitch Alsup’s posts in several threads here about what appears to
> > be at least a similar mechanism he is putting into his design reawakened
> > my interest in this, so I thought it might be interesting to see how far
> > is is reasonable to “push” on this to expand its utility into more than
> > its basic use. I will list some possibilities, and hope others will
> > comment on “reasonableness” and add other potential uses.
> >
> > So, the basic use is to send a “message” of typically up to several
> > hundred bytes (though often much less) from one process to another
> > another process that is willing to accept it. An extension is to allow
> > the hardware itself to send or receiver a message. So on to other uses.
> >
> > 1. The first, and pretty obvious, mechanism to subsume into this
> > paradigm is that once you have it, you no longer need a separate
> > instruction to request service from the OS. This is replaced, of
> > course, with a message from the user program to a process within the OS.
<
> Isn't a syscall already leaning in this direction? If the service you want
> is anything complicated, you write your request into a data structure, put
> the pointer to that data structure into a register, and call the trap
> instruction.
<
Sure
<
> That context switches into the OS which then dispatches the
> request to the right handler.
<
No need to get the OS involved !!! (99% of the time) Just arrive at the right
handler with its register file already set up properly (which may be on a
different CPU or on a different chip).
<
> If the service you want is something simple
> you don't need to bother with the data structure and pass parameters in
> registers so it's faster. By forcing a different paradigm, aren't you just
> adding more overhead?
<
Agreed.
<
> > 2. Elxsi also used the message passing mechanism to process faults
> > such as divide by zero. The hardware sends a message to the offending
> > process, which can choose to process it or not. Thus another mechanism
> > can be subsumed into message passing.
<
> Faults are just variations of traps, so the above also applies.
<
Pedantic mode=on
<
No, faults are not traps! A trap is something requested by an instruction
(the TRAP instruction, for example), while a fault is an unexpected event
during the performance of an instruction.
<
Yes you can "map" traps and faults into the same "vector" space, but
you should be very careful when doing this.
<
> > 3. Similarly, with suitable hardware, you no longer need a separate
> > mechanism for inter-processor interrupts.
<
> PCIe offers message-signalled interrupts, so things are already in that
> direction. Again the mechanism for how you signal interrupts isn't that
> important, it's how you handle them that's the tricky part.
<
No, it is WHO handles them that is the tricky part !! And for the most
part, all current implementations are sorely lacking here.
<
In the model I am proposing, there is no SW between the sending of an
interrupt (wire or MPI) and the arrival of a context switch message at
a CPU in the affinity class of the interrupt handler (typically 1 CPU).
Everything in between is handled by HW which was previously setup
by SW.
<
Number of CPU instructions performed between the sending of an interrupt
and the interrupt handler gaining a complete context switch on an
appropriate CPU: zero !
<
> The above about
> trap handlers applies, although a nice thing about MSIs is that you could in
> theory associate a handler with a specific source (ie MSI 0x12345678 has a
> specific function pointer to call,
<
To a good first order, this is how I am doing ADA rendezvous.
<
> rather than going into a dispatcher to
> work out what to do) which reduces latency. I suppose you could argue that
> might apply to other messages like syscalls (ie not just a 'trap'
> instruction, but a 'trap 0x12345678' instruction that has a specific handler
> associated with it, rather than a generic dispatcher handling all traps)
<
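The "trap with a specific handler" idea above can be sketched as a direct-dispatch table: each trap/MSI number indexes straight to its own function pointer, so no generic dispatcher has to decode the request. This is only an illustration; the table size and handler signature are invented.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of per-source dispatch: trap/MSI number -> handler,
 * with no generic decode step on the path. */
typedef void (*handler_fn)(uint64_t arg);

#define NVEC 16
static handler_fn vector_table[NVEC];

/* Example handler a driver might bind to one vector. */
static void noop_handler(uint64_t arg) { (void)arg; }

static int dispatch(uint32_t vec, uint64_t arg)
{
    if (vec >= NVEC || vector_table[vec] == NULL)
        return -1;              /* unbound vector: fall back / fault */
    vector_table[vec](arg);     /* direct call, no dispatcher */
    return 0;
}
```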
> > 4. Elxsi could send messages transparently to processes on other CPUs
> > in a shared memory multiprocessor system. As Mitch has pointed out, this
> > can be extended to multiple processors, even without shared memory, as
> > long as there is some interconnect. This allows arbitrary distribution of
> > computations across an arbitrary set of systems, and hopefully allows
> > changing that distribution without requiring source code changes or even
> > better without recompilation, or best yet, without even taking down the
> > application. (I think Erlang supports something like this.) In fact, once
> > you allow that, at least in theory, the sender and receiver could even be
> > across the internet.
> This is what we do in POETS. The wish to keep tight bounds on latency
> imposes certain requirements on buffering and reliability, but if you
> weren't so fussed by latency there's no reason why you couldn't go over a
> wider area. Another of our papers explores the paradigms that help to
> bridge the messaging into software (ie you can signal to other threads in
> what seems like a natural way, rather than something that requires explicit
> management, eg marshalling RDMA transactions):
> https://eprints.ncl.ac.uk/file_store/production/267875/5454FB9A-0A47-4C08-AD85-3D464D286E93.pdf
> > 5. Mitch talked about what (at least to me) seems like a very nice
> > mechanism to use hardware sourced message passing to eliminate the
> > “first level I/O interrupt handling” in software. This saves some code,
> > an interrupt and a lock. I think this could possibly be extended to
> > save some more on the I/O initiation side. The idea is to have the
> > device handler format a message appropriately, then send it to the
> > hardware/device. This could eliminate any queuing code in the device
> > handler and the need for a device lock. It would also improve I/O
> > throughput as it would reduce the idle time between finishing one I/O
> > and starting the next (for those devices that don’t have internal
> > queuing) as the start of the new command would be done in the hardware
> > without requiring any software intervention.
> I think it depends what the content of the message is. I like to compare
> SATA and NVMe here. SATA is a message-based protocol, there are commands
> and responses, with various data structures packed into the commands as to
> what you want to do. Essentially what's happening is your OS is chatting to
> the firmware on your HDD/SSD via these messages -
<
With complete I/O virtualization, one can allow a non-OS thread to perform
said "chatting".
<
> it's essentially ITC but
> the threads are running across heterogeneous hardware. But actually you
> can't do native ITC, so what happens is you interpose a host controller, eg
> AHCI on PCIe, whose host-OS driver is actually putting messages in memory
> buffers and then orchestrating DMA to the AHCI controller, who pulls the
> message out again and pushes it down the SATA wire. So you have this
> illusion of ITC but actually there's lots of DMA wrangling happening behind
> the scenes to make it happen.
>
> Meanwhile NVMe is a DMA protocol. It's simple memory reads and writes made
> from the device. The device driver sets up the transfer (organises where
> they're going to go from and to, opens up IOMMU windows to allow it, etc)
> but the device can then just do it directly. There's no faux conversation
> between the host OS and the firmware going on, it's just PCIe reads and
> writes.
<
One could create a SATA extension that treats the whole disk as directly
addressable memory. {With no disrespect to SATA protocols}
>
> (Of course the firmware does have side channels for things that aren't reads
> and writes, like asking about flash health, and there are messages for
> those. But they're off the critical path)
>
> You get much better performance by simply binning the command/response model
> completely, if that's something you are able to do.


Re: Hardware assisted message passing

<1b7bf651-969e-4a89-92d3-490675606bc0n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22198&group=comp.arch#22198
 by: MitchAlsup - Sun, 5 Dec 2021 21:43 UTC

On Sunday, December 5, 2021 at 1:30:48 PM UTC-8, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > First a note stating that my computer went tits up last Thursday and I
> > have not been able to access comp.arch via my wife's computer since
> > google (in all its wisdom) thinks her computer could NEVER let her
> > husband on Google.Groups under a name not hers...................
> Don't you have access to an NNTP server, and a decent newsreader?
>
> Google Groups is probably the worst thing that ever happened
> to Usenet.
<
Google.Groups is better than usenet simply ceasing to exist !
But not much.

Re: Hardware assisted message passing

<NNqrJ.50503$zF3.50320@fx03.iad>
https://www.novabbs.com/devel/article-flat.php?id=22201&group=comp.arch#22201
 by: EricP - Mon, 6 Dec 2021 16:13 UTC

MitchAlsup wrote:
> On Wednesday, December 1, 2021 at 2:32:54 AM UTC-8, Theo Markettos wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
>>> 2. Elxsi also used the message passing mechanism to process faults
>>> such as divide by zero. The hardware sends a message to the offending
>>> process, which can choose to process it or not. Thus another mechanism
>>> can be subsumed into message passing.
> <
>> Faults are just variations of traps, so the above also applies.
> <
> Pedantic mode=on
> <
> No, faults are not traps! A trap is something requested by an instruction
> (the TRAP instruction, for example), while a fault is an unexpected event
> during the performance of an instruction.
> <
> Yes you can "map" traps and faults into the same "vector" space, but
> you should be very careful when doing this.
> <

BTW you left Pedantic mode on.

Re: Hardware assisted message passing

<ONqrJ.50504$zF3.12794@fx03.iad>
https://www.novabbs.com/devel/article-flat.php?id=22202&group=comp.arch#22202
 by: EricP - Mon, 6 Dec 2021 16:23 UTC

Theo Markettos wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
>> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
>> the “unifying” power of hardware (well, in their case microcode)
>> assisted message passing. By unifying power, I mean the idea of
>> combining several otherwise diverse functions into a single, efficient
>> mechanism. This allows elimination of several other “specialized”
>> mechanisms. Putting the basics into hardware is what allows the
>> efficiency to make this all reasonable.
>
> We've done a bit of work in this area:
> https://www.repository.cam.ac.uk/handle/1810/317181
> and related publications at:
> https://poets-project.org/publications/
>
> Basically that makes passing messages from one thread to another a
> first-class communication primitive, and one that doesn't care if the thread
> is running on the other side of a cluster. Messages are kept small to
> simplify the networking and keep latency very low (minimal buffering, simple
> congestion control etc).
>
> ...
>
>> 4. Elxsi could send messages transparently to processes on other CPUs
>> in a shared memory multiprocessor system. As Mitch has pointed out, this
>> can be extended to multiple processors, even without shared memory, as
>> long as there is some interconnect. This allows arbitrary distribution of
>> computations across an arbitrary set of systems, and hopefully allows
>> changing that distribution without requiring source code changes or even
>> better without recompilation, or best yet, without even taking down the
>> application. (I think Erlang supports something like this.) In fact, once
>> you allow that, at least in theory, the sender and receiver could even be
>> across the internet.
>
> This is what we do in POETS. The wish to keep tight bounds on latency
> imposes certain requirements on buffering and reliability, but if you
> weren't so fussed by latency there's no reason why you couldn't go over a
> wider area. Another of our papers explores the paradigms that help to
> bridge the messaging into software (ie you can signal to other threads in
> what seems like a natural way, rather than something that requires explicit
> management, eg marshalling RDMA transactions):
> https://eprints.ncl.ac.uk/file_store/production/267875/5454FB9A-0A47-4C08-AD85-3D464D286E93.pdf

I've been looking at your paper (and some that it cited)
Termination detection for fine-grained message-passing architectures, 2020

and thinking about Safra’s Algorithm for distributed termination
detection, and I have some ideas on ways to do this more efficiently,
in particular for neural nets but possibly also for other situations.

IIUC Safra is basically serial and requires two sweeps across all
nodes to detect "termination", otherwise known as a "quorum",
aka arrival of all nodes at a rendezvous point.

In the example scenario I'm thinking of there are, say, 64k processor
nodes in a 256*256 2-D torus network, each node with a 16-bit address.
Each node executes some number of neurons, say 16,
so each neuron has a 20-bit address.
Each neuron can have up to 128 synapses, arbitrarily connected,
Any neuron can send a "fired" message addressed from one of its
axons (outputs) to any different neuron dendrite (input).

So there are, say, 1M neurons and if every neuron fired and was
maximally connected then there would be 128M messages _per iteration_.

Since neurons can be arbitrarily connected, messages can travel
in any direction through the torus to get to their destination,
with variable routings and numbers of hops,
and therefore different delivery latencies.
Two messages can leave in the order A B and arrive in the order B A.

To ensure experiments are repeatable the system must be globally
synchronous. That is, for each clock tick a set of neurons fire
and complete their operation. IOW it is NOT a free-running collection
of asynchronous nodes firing messages at each other ASAP.

For efficiency nodes may temporarily be locally asynchronous.

For efficiency we do not want to send messages for neurons that
don't fire in a clock cycle.

The problem is, how do we know when the neurons should fire if
we can't just count message arrivals since some messages may not
be sent in a particular cycle,
and when a cycle has finished so we can start the next iteration?

Does that basically summarize the problem space, for neural nets at least?
Because I had some ideas on how to do this which might be more efficient
than Safra but I didn't want to launch into a great long explanation if
I had misunderstood the problem.
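For reference, the counting idea behind Safra-style termination detection can be sketched serially. Each node tracks sent-minus-received messages and a "black" flag set when it receives a message; a sweep announces termination only when every node is passive, the counters sum to zero, and nothing was blackened since the previous visit, which is why a second, clean sweep is needed. This is just a baseline illustration, not the POETS implementation.

```c
#include <stdbool.h>

/* Per-node bookkeeping for Safra-style termination detection:
 * counter = messages sent minus messages received,
 * black   = received a message since the token last visited. */
typedef struct {
    int  counter;
    bool black;
    bool passive;   /* not currently processing anything */
} node_t;

/* One serial sweep of the token around the ring. Returns true if
 * this sweep announces termination. Real systems pipeline this;
 * the serial form shows why one clean (all-white, zero-sum) sweep
 * over passive nodes is required after a dirty one. */
static bool token_sweep(node_t *n, int nnodes)
{
    int sum = 0;
    bool dirty = false;
    for (int i = 0; i < nnodes; i++) {
        if (!n[i].passive) return false;  /* someone is still busy */
        sum += n[i].counter;
        if (n[i].black) dirty = true;
        n[i].black = false;               /* whiten for the next sweep */
    }
    return !dirty && sum == 0;
}
```

In the 64k-node torus scenario above, the pain point is exactly this: the sweep is serial in the number of nodes and has to run once per simulated clock tick.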

Re: Hardware assisted message passing

<soldi4$d49$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22203&group=comp.arch#22203
 by: Stephen Fuld - Mon, 6 Dec 2021 16:25 UTC

On 12/5/2021 1:09 PM, MitchAlsup wrote:
> First a note stating that my computer went tits up last Thursday and I
> have not been able to access comp.arch via my wife's computer since
> google (in all its wisdom) thinks her computer could NEVER let her
> husband on Google.Groups under a name not hers...................
> <
> Bad Google, Bad Google........

I was wondering why you hadn't replied in this thread. Welcome back!

> <
> On Monday, November 29, 2021 at 8:14:02 AM UTC-8, Stephen Fuld wrote:
>> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
>> the “unifying” power of hardware (well, in their case microcode)
>> assisted message passing. By unifying power, I mean the idea of
>> combining several otherwise diverse functions into a single, efficient
>> mechanism. This allows elimination of several other “specialized”
>> mechanisms. Putting the basics into hardware is what allows the
>> efficiency to make this all reasonable.
>>
>> Seeing Mitch Alsup’s posts in several threads here about what appears to
>> be at least a similar mechanism he is putting into his design reawakened
>> my interest in this, so I thought it might be interesting to see how far
>> it is reasonable to “push” on this to expand its utility into more than
>> its basic use. I will list some possibilities, and hope others will
>> comment on “reasonableness” and add other potential uses.
> <
> Thanks for the referral.

You are most welcome. And thank you for making me think about this again.

>> So, the basic use is to send a “message” of typically up to several
>> hundred bytes (though often much less) from one process to another
>> process that is willing to accept it. An extension is to allow
>> the hardware itself to send or receive a message. So on to other uses.
> <
> My intended use was to explicitly make interrupts a first class citizen
> of the architecture. An interrupt causes a waiting thread to receive control
> at a specified priority and within a specified set of CPUs (affinity).
> <
> My southbridge (PCIe) uses/converts wire interrupts into message based
> interrupts.

An interesting way to look at it.

> CPUs can send interrupts to other threads (and by SW convention other
> SPUs.)
> <
> An interrupt is associated with a thread, that thread has a priority,
> Thread header (PSW, Root Pointer, State) and a register file.
> <
> The receipt of an interrupt either gets queued onto that thread (if it is
> already running) or activates that thread if it is not running. The queueing
> guarantees that the thread is never active more than once, and that no
> interrupts are ever lost.

I understand.

> Should the newly active thread have a higher priority on the CPU of its
> affinity, a message is sent to the lowest priority thread currently running
> in the affinity set of the interrupt handler. This message contains all
> data necessary to perform a context switch into the interrupt handling
> thread. The arrival of such a message at the CPU causes its current
> thread state to be returned to the front of its run queue.
> <
> The CPU does NOT choose when to context switch, or how. This is
> all decided elsewhere.

Got it. Nice. You move more of what is traditionally OS code
(examining the dispatch queue and activating the highest priority
thread) into hardware.
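As an illustration of that decision moving into hardware (my own sketch, not the actual design), the fabric needs only a priority comparison across the affinity set to choose which CPU receives the context-switch message:

```c
#include <stdint.h>

/* Pick the preemption target for an incoming handler thread: the
 * lowest-priority CPU in the handler's affinity mask, and only if
 * the new thread strictly outranks it. Returns -1 if no CPU in the
 * set should be preempted. Names and types are illustrative. */
static int pick_victim(const uint8_t *cur_prio, int ncpus,
                       uint64_t affinity_mask, uint8_t new_prio)
{
    int victim = -1;
    uint8_t lowest = new_prio;  /* victim must run at lower priority */
    for (int i = 0; i < ncpus; i++) {
        if (!(affinity_mask & (1ull << i))) continue;
        if (cur_prio[i] < lowest) {
            lowest = cur_prio[i];
            victim = i;
        }
    }
    return victim;
}
```

Because this runs at the single serialization point rather than on any CPU, no lock is needed and the chosen CPU keeps computing until the message actually arrives.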

> In this model, the interrupt message is one (1) 64-bit container,
> And the queueing system is designed to efficiently manage smallish
> messages (1-8 Doublewords), and performs this task without locking
> since it is being performed in a place where all requests have already
> been serialized, and performs then one at a time. Note: This place
> is not a CPU !! { Critically important } nor is it SW that runs on a CPU.
>>
>> 1. The first, and pretty obvious, mechanism to subsume into this
>> paradigm is that once you have it, you no longer need a separate
>> instruction to request service from the OS. This is replaced, of
>> course, with a message from the user program to a process within the OS.
> <
> I worked in ADA rendezvous into the system, too. They have almost all of
> the properties of the std interrupts, but add in the address space join/release,
> and the reactivation of caller at the end of rendezvous.

I had to research Ada rendezvous to learn what it was. I agree with
your analysis.

>>
>> 2. Elxsi also used the message passing mechanism to process faults such
>> as divide by zero. The hardware sends a message to the offending
>> process, which can choose to process it or not. Thus another mechanism
>> can be subsumed into message passing.
> <
> Exceptions are not interrupts. Exception handlers are bound to the thread
> creating the exception and are synchronous, Interrupts are bound to a
> particular handler asynchronously and for a long duration (the SATA disk
> handler is associated with multiple SATA drive interrupts and for the entire
> time the system has power turned on.)

Yes, but I think not relevant to the way Elxsi defined its message
system. I don't remember all the details, but I was struck by the fact
that it treated exceptions the way they did. BTW, at least one of the
Elxsi manuals is available on bitsavers, and it gives some details. It
seems to have a lot of options. I don't remember, and haven't taken the
time to understand all the details, but I was struck by its subsuming
exceptions like divide by zero into the message passing system, so I
mentioned it in my description as a further "unification".

>> 3. Similarly, with suitable hardware, you no longer need a separate
>> mechanism for inter-processor interrupts.
> <
> Exactly--and a useful byproduct. But this unification needs to be expressible
> at the instruction level. That is the SATA interrupt handler needs a way to
> schedule the user-I/O-completion handler efficiently (1 instruction) and that
> thread needs an efficient way to take the waiting thread off the wait queue
> and drop it back on the run queue (also 1-instruction). There is no particular
> reason that any CPU context switches are processed by running instructions!
> <
> Indeed, having the entire thread state arrive as the interrupt message
> gets rid of lots of OS issues (cue EricP)

I'm not sure, but I think that goes beyond what Elxsi did.

> You may also want/need the ability to use IPIs between different virtual
> machines, or perform virtual interrupt remapping so what HW uses and
> what SW uses are unified but mapped just like memory pages and access
> rights.

Good point.

>>
>> 4. Elxsi could send messages transparently to processes on other CPUs in
>> a shared memory multiprocessor system. As Mitch has pointed out, this
>> can be extended to multiple processors, even without shared memory, as
>> long as there is some interconnect. This allows arbitrary distribution
>> of computations across an arbitrary set of systems, and hopefully allows
>> changing that distribution without requiring source code changes or even
>> better without recompilation, or best yet, without even taking down the
>> application. (I think Erlang supports something like this.) In fact,
>> once you allow that, at least in theory, the sender and receiver could
>> even be across the internet.
>>
>> 5. Mitch talked about what (at least to me) seems like a very nice
>> mechanism to use hardware sourced message passing to eliminate the
>> “first level I/O interrupt handling” in software. This saves some code,
>> an interrupt and a lock.
> <
> The savings are bigger than that!
> <
> The CPU which will ultimately "take" a context switch performs none
> of the work of context switching--so it continues on with what it was
> working on until the context of the (ahem) context switch is actually
> arriving. The Context switch is an exchange--new state is loaded while
> old state is unloaded-bundled into a message--and passed back to the
> queueing system. (Very CDC 6600 Exchange-Jump like) The CPU goes
> from thread-computing to thread-computing without having to run ANY
> operating system code !!
> <
> The queueing system can consume an arriving interrupt every cycle,
> And after a short pipeline delay, it can spit out a new context-switch
> message every 5-ish cycles (imagining there are 16-64 cores on the chip.)
> All without any locking !!
> <
>> I think this could possibly be extended to
>> save some more on the I/O initiation side. The idea is to have the
>> device handler format a message appropriately, then send it to the
>> hardware/device. This could eliminate any queuing code in the device
>> handler and the need for a device lock. It would also improve I/O
>> throughput as it would reduce the idle time between finishing one I/O
>> and starting the next
> <
> Most of the devices for which this is useful already have input command
> queues and many of the devices rearrange requests to optimize the
> device's performance.


Re: Hardware assisted message passing

<070dfc1f-599d-4fb2-ba97-d5c3a3d942edn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22204&group=comp.arch#22204
 by: MitchAlsup - Mon, 6 Dec 2021 16:43 UTC

On Monday, December 6, 2021 at 10:25:43 AM UTC-6, Stephen Fuld wrote:
> On 12/5/2021 1:09 PM, MitchAlsup wrote:

> >> 6. It is not a requirement that the sending and receiving entities be
> >> using the same ISA, as long as they both “understand” the same message
> >> format. As an example, you could regard a host sending a message with a
> >> SATA command to a disk drive, as a program on the host sending a message
> >> to the program running on the CPU on the disk drive, which may very well
> >> have a different ISA from the host.
> > <
> > This should (SHOULD) be the SATA device command queue rather than
> > the CPU running the device. The layer of abstraction is very powerful
> > here.
<
> I'm not sure what you are saying here. The CPU on the SATA drive is
> where the command queue is maintained.
<
This command queue is adequately served by speaking SATA commands.
No need to understand the ISA of the device controller.
<
SATA command set is the abstraction one wants.
<
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<7a30e960-70a4-45d4-bf96-83c7da5c3db7n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22205&group=comp.arch#22205
 by: MitchAlsup - Mon, 6 Dec 2021 19:55 UTC

On Tuesday, November 30, 2021 at 9:26:20 AM UTC-6, robf...@gmail.com wrote:
> I have toyed a little bit with hardware message passing. Rather than passing hundreds of bytes around
> I planned on using much smaller messages, for example 128 bits. Some OSes like Windows and
> MMURTL get by with passing only two words for messages. Larger messages are passed using
> pointers or references. Using smaller messages may allow the entire message to be transferred in a
> single clock cycle. A small message may be atomic.
<
I came to the "general" conclusion that any HW implementation should be optimized around
messages the size of a cache line; after all, the busses, coherence, and protocols are based
on cache lines.
<
This means messages of 8-doublewords (or smaller) are optimal, can be ATOMIC, and have
other "good" properties.
<
I suspect that 8-Doublewords covers way more than the 90%-ile messages--but, in practice
I can do ADA rendezvous in messages of length 2-Doubleword.
<
> I think most of the use can be handled with small messages. The hardware would be simpler for small
> messages.
<
HW has already been optimized for cache lines--just use these as your principal message size.

Re: Hardware assisted message passing

<39fa6e83-c813-4ec7-80a4-a246fde4cd39n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22206&group=comp.arch#22206

X-Received: by 2002:ac8:5fc1:: with SMTP id k1mr41762784qta.303.1638820660275;
Mon, 06 Dec 2021 11:57:40 -0800 (PST)
X-Received: by 2002:a4a:3451:: with SMTP id n17mr23863616oof.28.1638820660066;
Mon, 06 Dec 2021 11:57:40 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 6 Dec 2021 11:57:39 -0800 (PST)
In-Reply-To: <so5m98$obp$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:8421:e3c:af7a:335a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:8421:e3c:af7a:335a
References: <so2u87$tbg$1@dont-email.me> <3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
<eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com> <so5m98$obp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <39fa6e83-c813-4ec7-80a4-a246fde4cd39n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 06 Dec 2021 19:57:40 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 23
 by: MitchAlsup - Mon, 6 Dec 2021 19:57 UTC

On Tuesday, November 30, 2021 at 11:16:26 AM UTC-6, Stephen Fuld wrote:
> On 11/30/2021 7:26 AM, robf...@gmail.com wrote:
> >
> > I have toyed a little bit with hardware message passing. Rather than passing hundreds of bytes around
> > I planned on using much smaller messages, for example 128 bits. Some OSes like Windows and
> > MMURTL get by with passing only two words for messages. Larger messages are passed using
> > pointers or references. Using smaller messages may allow the entire message to be transferred in a
> > single clock cycle. A small message may be atomic.
> > I think most of the use can be handled with small messages. The hardware would be simpler for small
> > messages.
> Elxsi had both "small" and "large" messages. While allowing only small
> messages does make the hardware simpler, it really hurts when you want
> to transfer more data than the size of a small message. Even passing a
> pointer or such means the hardware has to do some additional work to
> transfer the bulk of the data, which gets icky unless you restrict the
> system to a single shared memory.
<
Pass a pointer (base) and a limit (bounds) and have the other end of the transport
perform RDMA to access the message data. You can regain the illusion of ATOMICity
by doing this.
<
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<27b35f1c-83dc-402f-9f5d-db0c90ca2a45n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22207&group=comp.arch#22207

X-Received: by 2002:a05:620a:b8b:: with SMTP id k11mr35407904qkh.746.1638820743984;
Mon, 06 Dec 2021 11:59:03 -0800 (PST)
X-Received: by 2002:a9d:1b0f:: with SMTP id l15mr30521050otl.38.1638820743779;
Mon, 06 Dec 2021 11:59:03 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 6 Dec 2021 11:59:03 -0800 (PST)
In-Reply-To: <NNqrJ.50503$zF3.50320@fx03.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:8421:e3c:af7a:335a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:8421:e3c:af7a:335a
References: <so2u87$tbg$1@dont-email.me> <cPf*upDAy@news.chiark.greenend.org.uk>
<89dad950-6ca9-4836-9d0d-a5d0d09fcc72n@googlegroups.com> <NNqrJ.50503$zF3.50320@fx03.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <27b35f1c-83dc-402f-9f5d-db0c90ca2a45n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 06 Dec 2021 19:59:03 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 22
 by: MitchAlsup - Mon, 6 Dec 2021 19:59 UTC

On Monday, December 6, 2021 at 10:23:12 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, December 1, 2021 at 2:32:54 AM UTC-8, Theo Markettos wrote:
> >> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
> >>> 2. Elxsi also used the message passing mechanism to process faults
> >>> such as divide by zero. The hardware sends a message to the offending
> >>> process, which can choose to process it or not. Thus another mechanism
> >>> can be subsumed into message passing.
> > <
> >> Faults are just variations of traps, so the above also applies.
> > <
> > Pedantic mode=on
> > <
> > No, faults are not traps! A trap is something requested by an instruction
> > (the TRAP instruction, for example), while a fault is an unexpected event
> > during the performance of an instruction.
> > <
> > Yes you can "map" traps and faults into the same "vector" space, but
> > you should be very careful when doing this.
> > <
> BTW you left Pedantic mode on.
<
I can't decide if I should let this one go (or not) !!

Re: Hardware assisted message passing

<solr78$hme$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22208&group=comp.arch#22208

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Mon, 6 Dec 2021 12:18:47 -0800
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <solr78$hme$1@dont-email.me>
References: <so2u87$tbg$1@dont-email.me>
<3116bac9-ecc5-4917-a341-74849309bb22n@googlegroups.com>
<eb21218a-7383-4e20-9e50-3e3b40791ad5n@googlegroups.com>
<so5m98$obp$1@dont-email.me>
<39fa6e83-c813-4ec7-80a4-a246fde4cd39n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 6 Dec 2021 20:18:48 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c50df2fb3a4b4c607fb50b0776f2d5a2";
logging-data="18126"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Br7e3WOvSmgvMwqxZPN4hDlvSvGeJF2s="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:gpvXApqDlU/ENIfDGV2HKTgIjd4=
In-Reply-To: <39fa6e83-c813-4ec7-80a4-a246fde4cd39n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Mon, 6 Dec 2021 20:18 UTC

On 12/6/2021 11:57 AM, MitchAlsup wrote:
> On Tuesday, November 30, 2021 at 11:16:26 AM UTC-6, Stephen Fuld wrote:
>> On 11/30/2021 7:26 AM, robf...@gmail.com wrote:
>>>
>>> I have toyed a little bit with hardware message passing. Rather than passing hundreds of bytes around
>>> I planned on using much smaller messages, for example 128 bits. Some OSes like Windows and
>>> MMURTL get by with passing only two words for messages. Larger messages are passed using
>>> pointers or references. Using smaller messages may allow the entire message to be transferred in a
>>> single clock cycle. A small message may be atomic.
>>> I think most of the use can be handled with small messages. The hardware would be simpler for small
>>> messages.
>> Elxsi had both "small" and "large" messages. While allowing only small
>> messages does make the hardware simpler, it really hurts when you want
>> to transfer more data than the size of a small message. Even passing a
>> pointer or such means the hardware has to do some additional work to
>> transfer the bulk of the data, which gets icky unless you restrict the
>> system to a single shared memory.
> <
> Pass a pointer (base) and a limit (bounds) and have the other end of the transport
> perform RDMA to access the message data. You can regain the illusion of ATOMICity
> by doing this.

Good, provided there is a bit in the "send" instruction, or in a known
place in the message, that tells the hardware at the receiving end to do
the RDMA without any software intervention (and, as I said above, if the
receiver is across the internet, it gets "icky"). And I guess to
maintain atomicity, the sending hardware has to wait until it is told by
the receiving hardware that the RDMA is complete, so the sending program
knows it can modify the locations containing the long part of the message.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<7c83ecbc-ad31-4455-87b7-4dcb1ca3c228n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22209&group=comp.arch#22209

X-Received: by 2002:a05:620a:28d0:: with SMTP id l16mr35569374qkp.500.1638822658646;
Mon, 06 Dec 2021 12:30:58 -0800 (PST)
X-Received: by 2002:aca:eb0b:: with SMTP id j11mr929482oih.51.1638822658520;
Mon, 06 Dec 2021 12:30:58 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 6 Dec 2021 12:30:58 -0800 (PST)
In-Reply-To: <soldi4$d49$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:8421:e3c:af7a:335a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:8421:e3c:af7a:335a
References: <so2u87$tbg$1@dont-email.me> <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com>
<soldi4$d49$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7c83ecbc-ad31-4455-87b7-4dcb1ca3c228n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 06 Dec 2021 20:30:58 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 143
 by: MitchAlsup - Mon, 6 Dec 2021 20:30 UTC

On Monday, December 6, 2021 at 10:25:43 AM UTC-6, Stephen Fuld wrote:
> On 12/5/2021 1:09 PM, MitchAlsup wrote:

> > My intended use was to explicitly make interrupts a first class citizen
> > of the architecture. An interrupt causes a waiting thread to receive control
> > at a specified priority and within a specified set of CPUs (affinity).
> > <
> > My southbridge (PCIe) uses/converts wire interrupts into message based
> > interrupts.
> An interesting way to look at it.
> > CPUs can send interrupts to other threads (and by SW convention other
> > SPUs.)
> > <
> > An interrupt is associated with a thread, that thread has a priority,
> > Thread header (PSW, Root Pointer, State) and a register file.
> > <
> > The receipt of an interrupt either gets queued onto that thread (if it is
> > already running) or activates that thread if it is not running. The queueing
> > guarantees that the thread is never active more than once, and that no
> > interrupts are ever lost.
> I understand.
> > Should the newly active thread have a higher priority on the CPU of its
> > affinity, a message is sent to the lowest priority thread currently running
> > in the affinity set of the interrupt handler. This message contains all
> > data necessary to perform a context switch into the interrupt handling
> > thread. The arrival of such a message at the CPU causes its current
> > thread state to be returned to the front of its run queue.
> > <
> > The CPU does NOT choose when to context switch, or how. This is
> > all decided elsewhere.
> Got it. Nice. You move more of what is traditionally OS code
> (examining the dispatch queue and activating the highest priority
> thread) into hardware.
<
There are LOTS of advantages to performing the OS queueing outside of
the CPUs.
1) one can enqueue an entire message in 1 cycle (pipelined)
2) one can determine if the new enqueue needs to cause a context switch
.....also 1 cycle (not pipelined)
3) one can gather up the data required for a context switch and send it
.....out over the link in 5-cycles (pipelined)
4) none of this requires locks (or any sort of ATOMIC activity)
5) std interrupts and MPIs have been unified
6) no need for the ISR to lower its priority after saving state
7) scheduling a thread on a different core in a different chip is no
different than scheduling a thread on this very core.
8) the scheduler operates after the OS/HV have chosen CPU affinities.
<
Given that a "chip" today will have multiple cores (8-64), a builtin
northbridge (memory controller and DRAM controller) and a builtin
southbridge (PCIe host) and an interconnect fabric::
<
I have chosen that the proper place for said scheduler is over at
the Memory Controller--it is like a function unit that is kicked into action
by interrupts.
<
Basically, the normal definition of an interrupt is to kick a particular thread
into running at a specified priority and in a specified environment (thread
state). We can extend this notion to moving between wait states and run
states (bidirectionally) based on arrival of interrupt.
<
{Warning: handwaving accuracy}
So, something like the time-slicer gets a timer interrupt, looks at who is
currently running, and sends an interrupt telling that thread to go to the
"end" of the run queue of its current priority (about 5 instructions in the
CPU, 2 cycles in the scheduler, and a 5-cycle message to context switch the
thread).
<
> > In this model, the interrupt message is one (1) 64-bit container,
> > And the queueing system is designed to efficiently manage smallish
> > messages (1-8 Doublewords), and performs this task without locking
> > since it is being performed in a place where all requests have already
> > been serialized, and performs them one at a time. Note: This place
> > is not a CPU !! { Critically important } nor is it SW that runs on a CPU.
> >>
> >> 1. The first, and pretty obvious, mechanism to subsume into this
> >> paradigm is that once you have it, you no longer need a separate
> >> instruction to request service from the OS. This is replaced, of
> >> course, with a message from the user program to a process within the OS.
> > <
> > I worked in ADA rendezvous into the system, too. They have almost all of
> > the properties of the std interrupts, but add in the address space join/release,
> > and the reactivation of caller at the end of rendezvous.
<
> I had to research Ada rendezvous to learn what it was. I agree with
> your analysis.
> >>
> >> 2. Elxsi also used the message passing mechanism to process faults such
> >> as divide by zero. The hardware sends a message to the offending
> >> process, which can choose to process it or not. Thus another mechanism
> >> can be subsumed into message passing.
> > <
> > Exceptions are not interrupts. Exception handlers are bound to the thread
> > creating the exception and are synchronous; interrupts are bound to a
> > particular handler asynchronously and for a long duration (the SATA disk
> > handler is associated with multiple SATA drive interrupts and for the entire
> > time the system has power turned on).
<
> Yes, but I think not relevant to the way Elxsi defined its message
> system. I don't remember all the details, but I was struck by the fact
> that it treated exceptions the way they did. BTW, at least one of the
> Elxsi manuals is available on bitsavers, and it gives some details. It
> seems to have a lot of options. I don't remember, and haven't taken the
> time to understand all the details, but I was struck by its subsuming
> exceptions like divide by zero into the message passing system, so I
> mentioned it in my description as a further "unification".
<
It is currently unclear if such unification retains value. Certainly, they all
fall under the general concept of an unintentional subroutine call, but
unifying the mechanics of traps with that of exceptions or context-switch
seems dubious at best. At worst it shows that the thought pattern remains
CPU-centric (CPUs do everything--everything else is simply a slave).
<
> >> 3. Similarly, with suitable hardware, you no longer need a separate
> >> mechanism for inter-processor interrupts.
> > <
> > Exactly--and a useful byproduct. But this unification needs to be expressible
> > at the instruction level. That is the SATA interrupt handler needs a way to
> > schedule the user-I/O-completion handler efficiently (1 instruction) and that
> > thread needs an efficient way to take the waiting thread off the wait queue
> > and drop it back on the run queue (also 1-instruction). There is no particular
> > reason that any CPU context switches are processed by running instructions!
> > <
> > Indeed, having the entire thread state arrive as the interrupt message
> > gets rid of lots of OS issues (cue EricP)
<
> I'm not sure, but I think that goes beyond what Elxsi did.
<
If you remember your history, CDC 6600 had peripheral processors, and these
PPs RAN the OS, and only when the PPs wanted the CPU to do something else
did the PPs send the Exchange Jump instruction. I basically like this model
but wanted to bring it up to date with modern system design. The OS scheduler
resides in the logic block associated with the Memory Controller. There are several
reasons for this:: 1) events arrive in a defined order (memory order), 2) queues
are HW managed so that queue insert, extract, rotate (front->rear) can be
performed in 1 cycle, 3) if an event changes the run priorities such that a core
is running a thread of lower priority than that of a run queue, a context switch
message is built and shipped to the lower priority core causing the higher priority
thread to displace the lower priority thread. 4) the core receiving the context-switch
sends back a message containing the previously running thread's state. 5) the arrival
of said state marks the thread as not-running, ...
<snip>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Hardware assisted message passing

<u1xrJ.95050$np6.11511@fx46.iad>


https://www.novabbs.com/devel/article-flat.php?id=22210&group=comp.arch#22210

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx46.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
References: <so2u87$tbg$1@dont-email.me> <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com>
In-Reply-To: <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 141
Message-ID: <u1xrJ.95050$np6.11511@fx46.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 06 Dec 2021 23:29:30 UTC
Date: Mon, 06 Dec 2021 18:29:00 -0500
X-Received-Bytes: 8090
 by: EricP - Mon, 6 Dec 2021 23:29 UTC

MitchAlsup wrote:
> First a note stating that my computer went tits up last Thursday and I
> have not been able to access comp.arch via my wife's computer since
> google (in all its wisdom) thinks her computer could NEVER let her
> husband on Google.Groups under a name not hers...................
> <
> Bad Google, Bad Google........
> <
> On Monday, November 29, 2021 at 8:14:02 AM UTC-8, Stephen Fuld wrote:
>> When I first encountered the Elxsi 6400 system in the mid 1980s, I saw
>> the “unifying” power of hardware (well, in their case microcode)
>> assisted message passing. By unifying power, I mean the idea of
>> combining several otherwise diverse functions into a single, efficient
>> mechanism. This allows elimination of several other “specialized”
>> mechanisms. Putting the basics into hardware is what allows the
>> efficiency to make this all reasonable.
>>
>> Seeing Mitch Alsup’s posts in several threads here about what appears to
>> be at least a similar mechanism he is putting into his design reawakened
>> my interest in this, so I thought it might be interesting to see how far
>> is is reasonable to “push” on this to expand its utility into more than
>> its basic use. I will list some possibilities, and hope others will
>> comment on “reasonableness” and add other potential uses.
> <
> Thanks for the referral.
>> So, the basic use is to send a “message” of typically up to several
>> hundred bytes (though often much less) from one process to another
>> process that is willing to accept it. An extension is to allow
>> the hardware itself to send or receive a message. So on to other uses.
> <
> My intended use was to explicitly make interrupts a first class citizen
> of the architecture. An interrupt causes a waiting thread to receive control
> at a specified priority and within a specified set of CPUs (affinity).
> <
> My southbridge (PCIe) uses/converts wire interrupts into message based
> interrupts.
> <
> CPUs can send interrupts to other threads (and by SW convention other
> SPUs.)
> <
> An interrupt is associated with a thread, that thread has a priority,
> Thread header (PSW, Root Pointer, State) and a register file.
> <
> The receipt of an interrupt either gets queued onto that thread (if it is
> already running) or activates that thread if it is not running. The queueing
> guarantees that the thread is never active more than once, and that no
> interrupts are ever lost.

What if the same device sends another interrupt request before an earlier
one has been serviced (like, say, receipt of another network packet)?

Basically asking if interrupts are one-shot or multi-shot,
similar to level sensitive vs edge triggered?
Because one-shot implies the device is responsible for buffering multiple
requests, and multi-shot could imply growing resource usage until some
buffer overflows and new interrupts are lost.
Just asking.

> <
> Should the newly active thread have a higher priority on the CPU of its
> affinity, a message is sent to the lowest priority thread currently running
> in the affinity set of the interrupt handler. This message contains all
> data necessary to perform a context switch into the interrupt handling
> thread. The arrival of such a message at the CPU causes its current
> thread state to be returned to the front of its run queue.
> <
> The CPU does NOT choose when to context switch, or how. This is
> all decided elsewhere.

What does this sentence mean, since the CPU is the HW?

> <
> In this model, the interrupt message is one (1) 64-bit container,
> And the queueing system is designed to efficiently manage smallish
> messages (1-8 Doublewords), and performs this task without locking
> since it is being performed in a place where all requests have already
> been serialized, and performs them one at a time. Note: This place
> is not a CPU !! { Critically important } nor is it SW that runs on a CPU.
>> 1. The first, and pretty obvious, mechanism to subsume into this
>> paradigm is that once you have it, you no longer need a separate
>> instruction to request service from the OS. This is replaced, of
>> course, with a message from the user program to a process within the OS.
> <
> I worked in ADA rendezvous into the system, too. They have almost all of
> the properties of the std interrupts, but add in the address space join/release,
> and the reactivation of caller at the end of rendezvous.

I have a certain amount of apprehension about this approach.

Note that Ada rendezvous are inherently _synchronous_ client-server only,
wherein the client waits until the server finishes the rendezvous.
There are no events, mutexes, semaphores, etc in that model.
It does have timed entry and accept statements,
but that is synchronous polling.

You can build things like events and mutexes out of Ada tasks and
rendezvous but it is ridiculously expensive because they require more
tasks each with its header and stack, etc to create "mutex servers".

Your ESM multi-update atomics are important here, as they allow updates
to things like double linked lists without the above expensive mutexes.

The first thing people found out about synchronous RPC is that they really
didn't want synchronous RPC, they wanted asynchronous RPC so they could
submit multiple concurrent requests at once then wait for all to be done.

I would want to work through some realistic examples, from soup to nuts,
enqueuing multiple async IO packets to various parts of the IO subsystem
using the rendezvous model, just to be sure there are no gotchas
and that it can be as efficient as hoped.

>> 2. Elxsi also used the message passing mechanism to process faults such
>> as divide by zero. The hardware sends a message to the offending
>> process, which can choose to process it or not. Thus another mechanism
>> can be subsumed into message passing.
> <
> Exceptions are not interrupts. Exception handlers are bound to the thread
> creating the exception and are synchronous; interrupts are bound to a
> particular handler asynchronously and for a long duration (the SATA disk
> handler is associated with multiple SATA drive interrupts and for the entire
> time the system has power turned on).
>> 3. Similarly, with suitable hardware, you no longer need a separate
>> mechanism for inter-processor interrupts.
> <
> Exactly--and a useful byproduct. But this unification needs to be expressible
> at the instruction level. That is the SATA interrupt handler needs a way to
> schedule the user-I/O-completion handler efficiently (1 instruction) and that
> thread needs an efficient way to take the waiting thread off the wait queue
> and drop it back on the run queue (also 1-instruction). There is no particular
> reason that any CPU context switches are processed by running instructions!
> <
> Indeed, having the entire thread state arrive as the interrupt message
> gets rid of lots of OS issues (cue EricP)

Is this the same as the Thread Interrupt Procedures (TIPs)
I was talking about earlier?

TIPs are handy for working around the limitations of synchronous
interfaces and notifying threads of async completions.

Re: Hardware assisted message passing

<2b4aa260-7cec-4891-9813-569d66a15626n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22211&group=comp.arch#22211

X-Received: by 2002:a05:6214:4007:: with SMTP id kd7mr40808406qvb.52.1638837469814;
Mon, 06 Dec 2021 16:37:49 -0800 (PST)
X-Received: by 2002:a05:6808:211c:: with SMTP id r28mr2203695oiw.155.1638837469570;
Mon, 06 Dec 2021 16:37:49 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 6 Dec 2021 16:37:49 -0800 (PST)
In-Reply-To: <u1xrJ.95050$np6.11511@fx46.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:8421:e3c:af7a:335a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:8421:e3c:af7a:335a
References: <so2u87$tbg$1@dont-email.me> <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com>
<u1xrJ.95050$np6.11511@fx46.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2b4aa260-7cec-4891-9813-569d66a15626n@googlegroups.com>
Subject: Re: Hardware assisted message passing
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 07 Dec 2021 00:37:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 209
 by: MitchAlsup - Tue, 7 Dec 2021 00:37 UTC

On Monday, December 6, 2021 at 5:31:04 PM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > My intended use was to explicitly make interrupts a first class citizen
> > of the architecture. An interrupt causes a waiting thread to receive control
> > at a specified priority and within a specified set of CPUs (affinity).
> > <
> > My southbridge (PCIe) uses/converts wire interrupts into message based
> > interrupts.
> > <
> > CPUs can send interrupts to other threads (and by SW convention other
> > SPUs.)
> > <
> > An interrupt is associated with a thread, that thread has a priority,
> > Thread header (PSW, Root Pointer, State) and a register file.
> > <
> > The receipt of an interrupt either gets queued onto that thread (if it is
> > already running) or activates that thread if it is not running. The queueing
> > guarantees that the thread is never active more than once, and that no
> > interrupts are ever lost.
<
> What if the same device sends another interrupt request before an earlier
> one has been serviced (like, say, receipt of another network packet)?
<
If an interrupt arrives while its ISR is still queued or still running, the interrupt
is queued and will be played out in priority order. As soon as the ISR quits, the
queued interrupt will context switch back to the (then no longer running) ISR
(making all threads serially reusable).
>
> Basically asking if interrupts are one-shot or multi-shot,
> similar to level sensitive vs edge triggered?
<
My guess is neither: each sent interrupt causes its ISR to run once. The ISR
is serially reusable, pending interrupts are queued.
<
> Because one-shot implies the device is responsible for buffering multiple
> requests, and multi-shot could imply growing resource usage until some
> buffer overflows and new interrupts are lost.
> Just asking.
<
At this point I am simply trying to be sufficiently flexible.
<
In any event, the time from an interrupt being sent (from the northbridge)
to the system queue HW, to the system queue HW sending a context switch to
the targeted CPU, to the context switch arriving and the targeted CPU running
the ISR thread state, is expected to be in the ~50-cycle range:
(10+5+5+10)×1.5 cycles (or ~10 ns)
> > <
> > Should the newly active thread have a higher priority on the CPU of its
> > affinity, a message is sent to the lowest priority thread currently running
> > in the affinity set of the interrupt handler. This message contains all
> > data necessary to perform a context switch into the interrupt handling
> > thread. The arrival of such a message at the CPU causes its current
> > thread state to be returned to the front of its run queue.
> > <
> > The CPU does NOT choose when to context switch, or how. This is
> > all decided elsewhere.
<
> What does this sentence mean, since the CPU is the HW?
<
The memory controller, the DRAM controller, the southbridge, and fabric
controllers are all HW too--and none of these is a CPU or CPU-like, nor
do they process ISA instructions.
<
I chose to put the system queue in the memory controller to be close to
memory, and have access to the buffers* between MC and DC and use them
like a cache of the queues, and thread state.
<
(*)Buffers could easily be a DRAM cache at this point in time.
> > <
> > In this model, the interrupt message is one (1) 64-bit container,
> > And the queueing system is designed to efficiently manage smallish
> > messages (1-8 Doublewords), and performs this task without locking
> > since it is being performed in a place where all requests have already
> > been serialized, and performs them one at a time. Note: This place
> > is not a CPU !! { Critically important } nor is it SW that runs on a CPU.
> >> 1. The first, and pretty obvious, mechanism to subsume into this
> >> paradigm is that once you have it, you no longer need a separate
> >> instruction to request service from the OS. This is replaced, of
> >> course, with a message from the user program to a process within the OS.
> > <
> > I worked Ada rendezvous into the system, too. They have almost all of
> > the properties of the standard interrupts, but add in the address space
> > join/release, and the reactivation of the caller at the end of the rendezvous.
<
> I have a certain amount of apprehension about this approach.
>
> Note that Ada rendezvous are inherently _synchronous_ client-server only,
> wherein the client waits until the server finishes the rendezvous.
> There are no events, mutexes, semaphores, etc in that model.
> It does have timed entry and accept statements,
> but that is synchronous polling.
<
Yes, got that all figured out::
<
An Acceptor can send an interrupt to system scheduler to accept a call
on a set of accept entry locations.
<
When a call hits a waiting acceptor, the call is dispatched to the acceptor,
and all other accept entry points are no longer capable of accepting a call
in that particular acceptor thread. So, multiple Ada tasks can be waiting on
multiple accept entry points, and each call gets connected to exactly one
thread of the acceptor. The caller goes to a wait state, and when the acceptor
is done, the caller is moved back into the run state.
<
Notice, no SW intermediaries and no ATOMIC stuff.
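Those dispatch rules can be modeled in a few lines. This is a pure simulation of the semantics described above (the class and state names are made up for illustration, not part of any real interface):

```python
# Toy model of the accept/dispatch rules: an acceptor waits on a set of
# entry points; an arriving call binds exactly one entry, the remaining
# entries stop accepting for that acceptor, the caller waits, and is
# made runnable again when the acceptor finishes.

class Caller:
    def __init__(self):
        self.state = "RUN"

class Acceptor:
    def __init__(self, entries):
        self.waiting_on = set(entries)  # entries armed by the accept
        self.bound = None               # entry currently in rendezvous

    def try_call(self, entry, caller):
        if self.bound is not None or entry not in self.waiting_on:
            return False                # in a full model the call would queue
        self.bound = entry              # dispatch: exactly one entry wins
        self.waiting_on.clear()         # other entries no longer accept
        caller.state = "WAIT"           # caller blocks for the rendezvous
        return True

    def complete(self, caller):
        caller.state = "RUN"            # caller moved back to run state
        self.bound = None
```

Note that nothing in the model needs a lock or an atomic: the dispatch decision is a single state transition made at one serialization point.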
>
> You can build things like events and mutexes out of Ada tasks and
> rendezvous but it is ridiculously expensive because they require more
> tasks each with its header and stack, etc to create "mutex servers".
<
Understood, and it is possible to do ATOMIC stuff like this (in Ada or other
languages) using the rendezvous functionality.
>
> Your ESM multi-update atomics is important here, as it allows updates
> to things like doubly linked lists without the above expensive mutexes.
<
Doing the queues over in the system queue HW gets rid of the locks and
makes everything operate in O(1), and everything "happens" in the order
things arrive at the memory controller.
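A minimal sketch of why the serialized queue needs no locks: once all requests funnel through a single serialization point, enqueue and dequeue are plain O(1) pointer updates, with no atomics anywhere:

```python
# FIFO mutated only at the single serialization point (the memory
# controller in the scheme above), so no locks or atomics are needed.

class Msg:
    def __init__(self, payload):
        self.payload = payload      # 1-8 doublewords in the real design
        self.next = None

class SerializedQueue:
    def __init__(self):
        self.head = self.tail = None

    def enqueue(self, node):        # O(1): link at the tail
        node.next = None
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node

    def dequeue(self):              # O(1): unlink at the head
        node = self.head
        if node is not None:
            self.head = node.next
            if self.head is None:
                self.tail = None
        return node
```

The same code run concurrently from multiple CPUs would of course need locks or ESM; the whole trick is that it never is.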
>
> The first thing people found out about synchronous RPC is that they really
> didn't want synchronous RPC, they wanted asynchronous RPC so they could
> submit multiple concurrent requests at once then wait for all to be done.
<
Why can't an Ada async be set up by having an accept entry point
enqueue the rendezvous calls onto (one or more) queues that this task
shares with a worker task?
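The pattern in that question can be sketched as follows. All names are illustrative, and the explicit `worker_step` call stands in for a worker task that would really run independently:

```python
# Asynchronous RPC built from synchronous pieces: the accept entry only
# enqueues the request and lets the caller continue; a worker task
# drains the queue; the caller collects completions later.

from collections import deque

class AsyncRendezvous:
    def __init__(self):
        self.pending = deque()          # queue shared with the worker task
        self.done = {}                  # req_id -> result

    def submit(self, req_id, work):
        """The accept entry: enqueue and return immediately (async)."""
        self.pending.append((req_id, work))
        return req_id

    def worker_step(self):
        """One step of the worker task: drain one queued request."""
        if self.pending:
            req_id, work = self.pending.popleft()
            self.done[req_id] = work()

    def wait_all(self, ids):
        """Caller later blocks until all its requests have completed."""
        while not all(i in self.done for i in ids):
            assert self.pending, "request lost"
            self.worker_step()          # stand-in for blocking on the worker
        return [self.done[i] for i in ids]
```

This is exactly the "submit several concurrent requests, then wait for all" shape EricP describes, expressed with only queues and rendezvous-style completion.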
<
>
> I would want to work through some realistic examples, from soup to nuts,
> enqueuing multiple async IO packets to various parts of the IO subsystem
> using the rendezvous model. Just to be sure there are no gotcha's
> and can be as efficient as hoped.
<
I do not see any fundamental reason why asynchronous RPC could not be worked
into what I currently have. While it is not the <strict> Ada rendezvous, it uses
this piece here and that piece there FROM the rendezvous feature set.
<
> >> 3. Similarly, with suitable hardware, you no longer need a separate
> >> mechanism for inter-processor interrupts.
> > <
> > Exactly--and a useful byproduct. But this unification needs to be expressible
> > at the instruction level. That is the SATA interrupt handler needs a way to
> > schedule the user-I/O-completion handler efficiently (1 instruction) and that
> > thread needs an efficient way to take the waiting thread off the wait queue
> > and drop it back on the run queue (also 1-instruction). There is no particular
> > reason that any CPU context switches are processed by running instructions!
> > <
> > Indeed, having the entire thread state arrive as the interrupt message
> > gets rid of lots of OS issues (cue EricP)
<
> Is this the same as the Thread Interrupt Procedures (TIP's)
> I was talking about earlier?
<
Possibly, but the "jist" of the model is to remove context switching (instructions
visible control registers, instruction ordering) from the CPU and put decision
making elsewhere in the system (1 per chip)
>
> TIPs are handy for working around the limitations of synchronous
> interfaces and notifying threads of async completions.
<
Sounds more like I unified TIPs into IPIs. Any CPU anywhere in the system can
send an interrupt to any virtual machine running any OS running any set of
threads in a single instruction.

Re: Hardware assisted message passing

From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Mon, 06 Dec 2021 20:19:21 -0500
Message-ID: <4pbtqg98to3k4ib0qi4iuu7mlv7r2iafbc@4ax.com>
References: <so2u87$tbg$1@dont-email.me> <cPf*upDAy@news.chiark.greenend.org.uk> <ONqrJ.50504$zF3.12794@fx03.iad>
 by: George Neuner - Tue, 7 Dec 2021 01:19 UTC

On Mon, 06 Dec 2021 11:23:00 -0500, EricP
<ThatWouldBeTelling@thevillage.com> wrote:

>In the example scenario I'm thinking of there are, say, 64k processor
>nodes in a 256*256 2-D torus network, each node with a 16-bit address.
>Each node executes some number of neurons, say 16,
>so each neuron has a 20-bit address.
>Each neuron can have up to 128 synapses, arbitrarily connected,
>Any neuron can send a "fired" message addressed from one of its
>axons (outputs) to any different neuron dendrite (input).
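The addressing arithmetic in the quoted scheme works out as follows (a quick sketch of the bit layout, with an 8-bit coordinate per torus axis assumed):

```python
# 256x256 nodes -> 16-bit node address (8 bits per axis);
# 16 neurons per node -> 4 more bits; so every neuron has a
# unique 20-bit address.

def neuron_addr(x, y, local):
    """20-bit neuron address from torus coordinates and local index."""
    assert 0 <= x < 256 and 0 <= y < 256 and 0 <= local < 16
    node = (y << 8) | x                 # 16-bit node address
    return (node << 4) | local          # 20-bit neuron address
```

And 1M neurons × 128 synapses gives the quoted worst case of 128M messages per iteration.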

Torus has a lot of communication overhead with so many nodes.

Connection Machines used a hypercube (SIMD) or hypertree (SPMD).
The SIMD models implemented single "machine" cycle any->any messaging.
[Note: "machine" cycle /not/ CPU cycle. Each CPU implemented one or
more vCPUs, and (being SIMD) every active vCPU sent its message in the
same cycle].
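The rough numbers behind that comparison, for the same 64k nodes:

```python
# Worst-case hop counts: 64k nodes as a 256x256 torus versus the
# 16-dimension hypercube a Connection Machine style network would use.

def torus_hops(ax, ay, bx, by, dim=256):
    """Minimal hops between two nodes of a dim x dim 2-D torus."""
    dx = abs(ax - bx); dy = abs(ay - by)
    return min(dx, dim - dx) + min(dy, dim - dy)   # wrap-around links

def hypercube_hops(a, b):
    """Hops in a hypercube = number of differing address bits."""
    return bin(a ^ b).count("1")
```

Worst case on the 256×256 torus is 128+128 = 256 hops; on a 16-d hypercube of 64k nodes it is 16 hops, which is where the overhead difference comes from.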

>So there are, say, 1M neurons and if every neuron fired and was
>maximally connected then there would be 128M messages _per iteration_.
>
>Since neurons can be arbitrarily connected, messages can travel
>in any direction through the torus to get to their destination,
>with variable routings and numbers of hops,
>and therefore different delivery latencies.
>Two messages can leave in the order A B and arrive in the order B A.

Can happen with any topology unless you implement causal messaging.

>To ensure experiments are repeatable the system must be globally
>synchronous. That is, for each clock tick a set of neurons fire
>and complete their operation. IOW it is NOT a free-running collection
>of asynchronous nodes firing messages at each other ASAP.
>
>For efficiency nodes may temporarily be locally asynchronous.
>
>For efficiency we do not want to send messages for neurons that
>don't fire in a clock cycle.

Which is problematic as you note below.

But if you can arrange that neurons output /something/ on every graph
cycle - even if it is a null or "no change" message - then every
neuron can process input on every graph cycle.
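That observation can be sketched directly: with exactly one message, real or null, arriving per input edge per tick, cycle completion becomes a fixed count, and no separate termination-detection protocol (such as Safra's algorithm) is needed:

```python
# If every input edge delivers exactly one message per graph cycle
# (a real "fired" or an explicit null), a node knows its cycle is
# complete after a fixed, precomputed number of arrivals.

def run_tick(fan_in, arrivals):
    """Process one tick; arrivals may come in any order.

    fan_in:   number of input edges (messages expected this tick)
    arrivals: iterable of messages, None meaning a null/'no change'
    Returns the list of non-null inputs once the tick is complete.
    """
    count, fired_inputs = 0, []
    for msg in arrivals:
        count += 1
        if msg is not None:
            fired_inputs.append(msg)
        if count == fan_in:
            return fired_inputs         # safe to fire and start next tick
    raise RuntimeError("tick incomplete: missing messages")
```

The cost, as noted, is sending the null messages; the benefit is that out-of-order arrival (the B-before-A case above) no longer matters within a tick.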

>The problem is, how do we know when the neurons should fire if
>we can't just count message arrivals since some messages may not
>be sent in a particular cycle,
>and when a cycle has finished, so we can start the next iteration?

Again the answer is causal messaging.

>Does that basically summarize the problem space, for neural nets at least?
>Because I had some ideas on how to do this which might be more efficient
>than Safra but I didn't want to launch into a great long explanation if
>I had misunderstood the problem.

YMMV,
George

Re: Hardware assisted message passing

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Mon, 6 Dec 2021 17:19:31 -0800
Message-ID: <somcr4$rfa$1@dont-email.me>
References: <so2u87$tbg$1@dont-email.me> <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com> <u1xrJ.95050$np6.11511@fx46.iad> <2b4aa260-7cec-4891-9813-569d66a15626n@googlegroups.com>
In-Reply-To: <2b4aa260-7cec-4891-9813-569d66a15626n@googlegroups.com>
 by: Ivan Godard - Tue, 7 Dec 2021 01:19 UTC

On 12/6/2021 4:37 PM, MitchAlsup wrote:
> On Monday, December 6, 2021 at 5:31:04 PM UTC-6, EricP wrote:

>> Is this the same as the Thread Interrupt Procedures (TIP's)
>> I was talking about earlier?
> <
> Possibly, but the "jist" of the model is to remove context switching (instructions
> visible control registers, instruction ordering) from the CPU and put decision
> making elsewhere in the system (1 per chip)

>> interfaces and notifying threads of async completions.

Gist?

Re: Hardware assisted message passing

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Hardware assisted message passing
Date: Mon, 6 Dec 2021 18:39:17 -0800 (PST)
Message-ID: <422ed541-b8fb-46ee-aabb-1f03238ea085n@googlegroups.com>
References: <so2u87$tbg$1@dont-email.me> <e6d5d393-a57e-4e5d-9632-fa2fda932dc0n@googlegroups.com> <u1xrJ.95050$np6.11511@fx46.iad> <2b4aa260-7cec-4891-9813-569d66a15626n@googlegroups.com> <somcr4$rfa$1@dont-email.me>
In-Reply-To: <somcr4$rfa$1@dont-email.me>
 by: MitchAlsup - Tue, 7 Dec 2021 02:39 UTC

On Monday, December 6, 2021 at 7:19:35 PM UTC-6, Ivan Godard wrote:
> On 12/6/2021 4:37 PM, MitchAlsup wrote:
> > On Monday, December 6, 2021 at 5:31:04 PM UTC-6, EricP wrote:
>
> >> Is this the same as the Thread Interrupt Procedures (TIP's)
> >> I was talking about earlier?
> > <
> > Possibly, but the "jist" of the model is to remove context switching (instructions
> > visible control registers, instruction ordering) from the CPU and put decision
> > making elsewhere in the system (1 per chip)
> >> interfaces and notifying threads of async completions.
> Gist?
<
Ever since taking Latin in 7th grade, I can no longer spell.
{Not that I was all that good prior.}
