

devel / comp.arch / Re: Processor Affinity in the age of unbounded cores

Subject  (Author)
* Processor Affinity in the age of unbounded cores  (MitchAlsup)
+* Re: Processor Affinity in the age of unbounded cores  (Stephen Fuld)
|`* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
| `- Re: Processor Affinity in the age of unbounded cores  (Stephen Fuld)
+* Re: Processor Affinity in the age of unbounded cores  (EricP)
|`* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
| `* Re: Processor Affinity in the age of unbounded cores  (EricP)
|  `* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
|   `* Re: Processor Affinity in the age of unbounded cores  (EricP)
|    `* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
|     +* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
|     |`* Re: Processor Affinity in the age of unbounded cores  (JimBrakefield)
|     | `* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
|     |  +- Re: Processor Affinity in the age of unbounded cores  (JimBrakefield)
|     |  `* Re: Processor Affinity in the age of unbounded cores  (Ivan Godard)
|     |   +- Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
|     |   `- Re: Processor Affinity in the age of unbounded cores  (Stefan Monnier)
|     `* Re: Processor Affinity in the age of unbounded cores  (EricP)
|      `- Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
+- Re: Processor Affinity in the age of unbounded cores  (JimBrakefield)
+- Re: Processor Affinity in the age of unbounded cores  (Terje Mathisen)
+- Re: Processor Affinity in the age of unbounded cores  (pec...@gmail.com)
`* Re: Processor Affinity in the age of unbounded cores  (Quadibloc)
 +* Re: Processor Affinity in the age of unbounded cores  (Stefan Monnier)
 |`* Re: Processor Affinity in the age of unbounded cores  (Quadibloc)
 | `* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
 |  `* Re: Processor Affinity in the age of unbounded cores  (Timothy McCaffrey)
 |   `* Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
 |    `* Re: Processor Affinity in the age of unbounded cores  (Ivan Godard)
 |     `- Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)
 `* Re: Processor Affinity in the age of unbounded cores  (EricP)
  `- Re: Processor Affinity in the age of unbounded cores  (MitchAlsup)

Re: Processor Affinity in the age of unbounded cores

<f3fb8c68-eddc-4770-9b78-e9b22154d1a9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20235&group=comp.arch#20235

 by: MitchAlsup - Wed, 1 Sep 2021 17:40 UTC

On Wednesday, September 1, 2021 at 9:16:49 AM UTC-5, EricP wrote:
> Quadibloc wrote:
> > On Monday, August 23, 2021 at 12:24:05 PM UTC-6, MitchAlsup wrote:
> >
> >> The current model of 1-bit to denote "any core" and then a bit vector
> >> to say "any of these cores can run this task/thread" falls apart when
> >> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> >> The bit vector approach "does not scale".
> >
> > One simplistic way to represent processor affinity for a chip, or
> > multi-chip module, containing a very large number of cores,
> > where it's awkward to use a bit vector of more than 64 bits, would
> > be this:
> >
> > First, one designates the "home core" of a program. This is the ID of
> > the core that the program will start from.
> >
> > That core would belong to a group of 64 cores, and so a bit vector
> > would then be used to indicate which of the (other) processors in
> > that group could also be used for that task or thread.
> >
> > These groups of 64 cores would then be organized into bundles of
> > 64 such groups. If the task or thread could be executed by cores
> > _outside_ the initial group of 64 cores, the next bit vector would indicate
> > which other groups of 64 cores could also be used for the task. (One
> > assumes that in that case, if the task is that large in scale, any individual
> > core in an indicated group of 64 cores could be used.)
> >
> > And then, of course, how it scales up is obvious. The next 64-bit vector says
> > which groups of 4,096 cores outside the initial group of 4,096 cores can be
> > used for the program, and so on and so forth.
> >
> > John Savard
> Combining what Stephen Fuld and I wrote I'd summarize it
> as there are two kinds of thread affinities:
> hard which are fixed by the configuration of a particular system,
> and soft which are dynamic.
>
> The basic hard thread affinity is a capability bit vector,
> which cores are capable of running a thread.
> Maybe only some cores have a piece of hardware like FPU.
>
> Next hard thread affinity is a ranked preference vector,
> and there can be multiple cores of equal rank.
> In a big-little system low priority threads might give equal
> preference to the multiple little cores.
>
> Another example of hard preference is based on network topology.
> A particular core might have an attached device where it is
> cheapest to talk to the device from that core.
>
> 0--1 core 0 1 2 3
> | | rank 0 1 2 1
> 3--2
> \
> IO
>
> Soft affinity is based on the recent history of a thread, any cache
> it has invested in loading or the home directory of physical pages.
> As a thread runs it invests more on its current local core
> and lowers its investment in prior cores.
>
> Soft affinity interacts with the network topology graph you describe,
> to give a dynamic cost to locating threads at different cores.
>
> All the affinity info folds into how the OS thread scheduler works.
>
> One consideration is how much effort one wants to put into the
> scheduler algorithm. The above affinities aren't just a traveling
> salesman problem, it's S traveling salesmen visiting C cities,
> some with lovers, sometimes multiple ones, at different cities
> that keep changing.
<
This is an excellent observation, missing only 1 point:
<
it's S traveling salesmen visiting C cities where the next route is being
decided by K schedulers randomly placed around the country.
<
Where K is the number of OS/HV threads "taking a look" at the current
schedule.
<
>
> Another scheduler consideration is SMP data structures and locking.
> If all the scheduler data is in one pile guarded by a single spinlock,
> all cores can contend for it and you can wind up with a long queue
> spin-waiting to reschedule.
>
> That is why many SMP schedulers now separate their scheduling into
> core-local lists which do not contend and require no spinlocks,
> and global lists which are accessed infrequently and spinlock guarded.
<
Yes, this is where I think it is heading. There is some master schedule
on which essentially all runnable processes reside. The first time a
process runs it may be placed rather randomly on a node (=chip), and
as long as that node is not saturated with work, the process is
adequately served with compute cycles, so process[p] sticks around
on node[n] unless a compelling case arises where the OS
wants to run [p] on some other node[!=n]. The OS then removes the
original affinity and applies a new affinity to the process.
<
There is an OS thread on node[n] that manages the schedule on
node[n] which is subservient to the master OS process scheduler.
>
> Also in a 4000 node NUMA SMP system, a remote cache hit can be 400 nsec
> away and DRAM can be 500 nsec. The more things the scheduler touches,
> the more chance it has of touching one of those 400 or 500 nsec things,
> and then it winds up sitting around playing the kazoo for a while.
<
It would surprise me if a remote cache were not a full microsecond or more away.
Similar for DRAM. Most of this is network (link) latency.
<
>
> It may not be worthwhile over-thinking thread affinity because
> in the end it may be too complicated for the scheduler to make
> use of in a reasonable window of time.
>
> It might be worthwhile working backwards from the OS scheduler
> to see what affinity information it can make use of,
> then looking at how that information can be acquired.
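For concreteness, a minimal C sketch of the core-local vs. global run-queue split
described above; the struct names and locking primitives are illustrative only,
not drawn from any particular OS:

#include <pthread.h>
#include <stddef.h>

struct task {
    struct task *next;
    int          priority;
};

struct core_rq {                 /* touched only by its own core: no lock needed */
    struct task *head;
};

struct global_rq {               /* shared: guarded by a single spinlock */
    pthread_spinlock_t lock;
    struct task        *head;
};

/* Fast path: the core-local list needs no synchronization at all. */
static struct task *pick_local(struct core_rq *rq)
{
    struct task *t = rq->head;
    if (t)
        rq->head = t->next;
    return t;
}

/* Slow path: take work from the infrequently used global list under its spinlock. */
static struct task *pick_global(struct global_rq *g)
{
    pthread_spin_lock(&g->lock);
    struct task *t = g->head;
    if (t)
        g->head = t->next;
    pthread_spin_unlock(&g->lock);
    return t;
}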

Re: Processor Affinity in the age of unbounded cores

<7d983951-c0fc-4fc0-82e6-dd4687dff46fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20236&group=comp.arch#20236

 by: Quadibloc - Thu, 2 Sep 2021 03:43 UTC

On Wednesday, September 1, 2021 at 6:26:11 AM UTC-6, Stefan Monnier wrote:
> > And then, of course, how it scales up is obvious. The next 64-bit vector says
> > which groups of 4,096 cores outside the initial group of 4,096 cores can be
> > used for the program, and so on and so forth.

> So you're proposing a kind of floating-point system, where the
> granularity/mantissa specifies "64 things" and the "exponent" specified
> if those things are cores, groups of cores (e.g. chips), or groups of
> chips (e.g. boards), ...

Not quite, because I realized that one other thing is needed: the subthreads
down in one of the other groups of cores should have their _own_ bit map at
the single-core level. The boundaries between groups may be artificial,
so one will want processor affinity with single-core granularity that crosses group
boundaries.

Still, what you've said is sort of partly true...

John Savard
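For concreteness, a small C sketch of the hierarchical affinity being discussed:
a home core, a 64-bit mask over the home group, a 64-bit mask of other groups,
and, per the refinement above, a per-group core mask so single-core granularity
can cross group boundaries. All field names are invented, and the sketch stops
at two levels:

#include <stdint.h>
#include <stdbool.h>

#define CORES_PER_GROUP 64
#define GROUPS          64              /* two levels only, for the sketch */

struct affinity {
    uint32_t home_core;                 /* core the program starts from           */
    uint64_t home_group_mask;           /* cores in the home group it may use     */
    uint64_t groups_mask;               /* which other 64-core groups it may use  */
    uint64_t group_core_mask[GROUPS];   /* per-group mask: single-core granularity
                                           across group boundaries                */
};

static bool may_run_on(const struct affinity *a, uint32_t core)
{
    uint32_t group      = core / CORES_PER_GROUP;
    uint32_t bit        = core % CORES_PER_GROUP;
    uint32_t home_group = a->home_core / CORES_PER_GROUP;

    if (group >= GROUPS)
        return false;                   /* beyond the two levels sketched here */
    if (group == home_group)
        return (a->home_group_mask >> bit) & 1;
    if (!((a->groups_mask >> group) & 1))
        return false;
    return (a->group_core_mask[group] >> bit) & 1;
}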

Re: Processor Affinity in the age of unbounded cores

<758177f3-732e-4c34-b7e2-a1cfb53ce701n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20526&group=comp.arch#20526

 by: MitchAlsup - Sun, 19 Sep 2021 20:19 UTC

Let us consider a system made out of 16 "processor" chips; each chip contains
16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
<
One process has asked the OS to perform some I/O and wake it back up when done.
This process[p] was originally running on node[n] in core[c] and the OS decides that
the accessed device is over on node[IO].
<
Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
bridge to the actual device: when the I/O is complete, the device sends an interrupt
to node[IO] core[IH].
<
The interrupt handler runs, captures the I/O completion status, signals the {lower
priority} device driver of the completion, and exits. The device driver runs, sees the
completion of the request from process[p], and schedules process[p] to run on node[n].
<
Process[p] is affinitized to ANY core in node[n], so the device driver sends an IPI to some
core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c];
it places process[p] on the run queue of core[j]. If process[p] is of higher priority than
what is currently running on core[j], and IPI != j, core[IPI] then sends an IPI to core[j]
to context switch into this process[p].
<
So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
complexity.
<
<
Now consider that the time delay from sending an IPI to receiving the IPI on a different
node may be 200-400 cycles! So, anytime a decision is made to "talk" to another core
on another node, by the time the IPI arrives, the state of the system can change markedly.
{The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
be on the order of 50 cycles, and then there is link delay from sending node to receiving
node, so there are lots of cycles from the sending of an IPI to the receipt of the IPI, no matter whether
these are routed as memory accesses or messages. In addition there is context switch
time to get to the IPI handler.}
<
So, the IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to the affinitized
core to run process[p]. Does process[p] get taken off the run queue at this point, or does
core[j] have to discover this all by itself?
<
This is beginning to smell like a difficult journey; and, much like the CPU{core} is seldom
in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
possible, it seems none of the threads in a multi-threaded OS is in a position to "take a look
at the run queues" without grabbing a big "lock".
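A rough C sketch of the wakeup step in this scenario, as it might run on core[IPI];
every helper below is a placeholder for OS machinery, not a real kernel API:

struct process;

extern int  pick_runnable_core(const struct process *p); /* some core[j != c] allowed by affinity */
extern void enqueue_on_core(int core, struct process *p);
extern int  priority_of(const struct process *p);
extern int  current_priority(int core);
extern int  this_core(void);
extern void send_ipi(int core);

/* Runs on core[IPI] of node[n] after the device driver's IPI arrives. */
static void ipi_wake_handler(struct process *p)
{
    int j = pick_runnable_core(p);      /* choose core[j] for process[p]          */
    enqueue_on_core(j, p);              /* put p on core[j]'s run queue           */
    if (priority_of(p) > current_priority(j) && j != this_core())
        send_ipi(j);                    /* tell core[j] to context switch into p  */
}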

Re: Processor Affinity in the age of unbounded cores

<1474cbf1-2460-4866-bd0b-d3b44c8e498fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20611&group=comp.arch#20611

 by: Timothy McCaffrey - Tue, 21 Sep 2021 18:13 UTC

On Sunday, September 19, 2021 at 4:19:23 PM UTC-4, MitchAlsup wrote:
> Let us consider a system made out of 16 "processor" chips, each chip contains
> 16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
> <
> One process has ask the OS to perform some I/O and wake it back up when done.
> This process[p] was originally running on node[n] in core[c] and the OS decides that
> the accessed device is over on node[IO].
> <
> Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
> bridge to the actual device: when the I/O is complete, device sends interrupt
> to node[IO] core[IH].
> <
> Interrupt handler runs, captures I/O completion status, and signals the {lower
> priority] device driver of the completion; and exits. Device driver runs, sees the
> completion of request from process[p] and schedules process[p] to run on node[n].
> <
> Process[p] is affinitized to ANY core in node[n], so Device driver sends IPI to some
> core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c];
> places process[p] on the run queue of core[j]. If process[p] is of higher priority than
> what is currently running on core[j], and IPI != j, core[IPI] and then sends an IPI to node[j]
> to context switch into this process[p].
> <
> So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
> complexity.
> <
> <
> Now consider that the time delay from sending an IPI to receiving the IPI on a different
> node may be 200-400 cycles ! So, anytime a decision is made to "talk" to another core
> on another node, by the time the IPI arrives, the state of the system can change markedly.
> {The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
> be on the order of 50 cycles, and then there is link delay from sending node to receiving
> node, so there is lots of cycles from sending of an IPI to receipt of IPI no mater whether
> these are routed as memory accesses or messages. In addition there is context switch
> time to get to the IPI handler.}
> <
> So, IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to affinitized
> core to run process[p]. Does process[p] get taken off the run queue at this point, or does
> core[j] have to discover this all by itself ?
> <
> This is beginning to smell like a difficult journey, and much like the CPU{core} is seldom
> in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
> possible; it seems none of the threads in a multi-threaded OS is in a position to "take a look
> at the run queues" without grabbing a big "lock".

For some time I have thought that the model of:
(Sending): Get lock -> Put stuff in queue -> release lock -> send IPI
(Receiving): Rcv IPI -> get lock -> get stuff out of queue -> release lock -> process stuff from queue

was pretty inefficient.

I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
between processors. Then the processing becomes:
(Sending): Create message -> push message pointer to other CPU via hardware register
(Receiving): (probably get an interrupt when the hardware message queue goes from empty to non-empty)
Interrupt ->
Loop:
get pointer from hardware queue -> process stuff in message.
if hardware queue not empty go to Loop

Ta Da! No locks!

(well, they are conceptually still there, they are (sort of) part of the hardware queue).

- Tim
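A C sketch of the receive side Tim describes, against a hypothetical memory-mapped
pop register; the address, register behavior, and names are all invented for
illustration:

#include <stdint.h>

/* Hypothetical MMIO register: each read pops one message pointer, 0 = queue empty. */
#define HWQ_POP ((volatile uint64_t *)0xFFFF0000u)

struct message;                              /* layout agreed on by sender and receiver */
extern void process_message(struct message *m);

/* Interrupt fires when the hardware queue goes from empty to non-empty. */
void hwq_interrupt(void)
{
    uint64_t ptr;
    /* Drain the queue; no software lock is needed because the hardware
       serializes the concurrent senders pushing into the queue. */
    while ((ptr = *HWQ_POP) != 0)
        process_message((struct message *)(uintptr_t)ptr);
}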

Re: Processor Affinity in the age of unbounded cores

<4104bf88-5743-4405-82c9-d95ff13bca49n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20612&group=comp.arch#20612

 by: MitchAlsup - Tue, 21 Sep 2021 21:06 UTC

On Tuesday, September 21, 2021 at 1:13:48 PM UTC-5, timca...@aol.com wrote:
> On Sunday, September 19, 2021 at 4:19:23 PM UTC-4, MitchAlsup wrote:
> > Let us consider a system made out of 16 "processor" chips, each chip contains
> > 16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
> > <
> > One process has ask the OS to perform some I/O and wake it back up when done.
> > This process[p] was originally running on node[n] in core[c] and the OS decides that
> > the accessed device is over on node[IO].
> > <
> > Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
> > bridge to the actual device: when the I/O is complete, device sends interrupt
> > to node[IO] core[IH].
> > <
> > Interrupt handler runs, captures I/O completion status, and signals the {lower
> > priority] device driver of the completion; and exits. Device driver runs, sees the
> > completion of request from process[p] and schedules process[p] to run on node[n].
> > <
> > Process[p] is affinitized to ANY core in node[n], so Device driver sends IPI to some
> > core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c];
> > places process[p] on the run queue of core[j]. If process[p] is of higher priority than
> > what is currently running on core[j], and IPI != j, core[IPI] and then sends an IPI to node[j]
> > to context switch into this process[p].
> > <
> > So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
> > complexity.
> > <
> > <
> > Now consider that the time delay from sending an IPI to receiving the IPI on a different
> > node may be 200-400 cycles ! So, anytime a decision is made to "talk" to another core
> > on another node, by the time the IPI arrives, the state of the system can change markedly.
> > {The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
> > be on the order of 50 cycles, and then there is link delay from sending node to receiving
> > node, so there is lots of cycles from sending of an IPI to receipt of IPI no mater whether
> > these are routed as memory accesses or messages. In addition there is context switch
> > time to get to the IPI handler.}
> > <
> > So, IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to affinitized
> > core to run process[p]. Does process[p] get taken off the run queue at this point, or does
> > core[j] have to discover this all by itself ?
> > <
> > This is beginning to smell like a difficult journey, and much like the CPU{core} is seldom
> > in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
> > possible; it seems none of the threads in a multi-threaded OS is in a position to "take a look
> > at the run queues" without grabbing a big "lock".
<
> For some time I have thought that the model of:
> (Sending): Get lock -> Put stuff in queue -> release lock -> send IPI
> (Recieving): Rcv IPI -> get lock -> get stuff out of queue -> release lock -> process stuff from queue
>
> Was pretty inefficient.
<
Modern times have proven that assumption dubious at best.
>
> I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
> between processors.
<
This is the direction I am trying to head without using those words.
<
There is some HW facility where you send it messages (and in this context an IPI is simply
a message) and in response the facility sends to an affinitized processor a context switch
message, all based on the priorities in some kind of queuing system.
<
Since there are multiple chips, and each chip contains multiple cores, in any given clock
there could be multiple interrupts, messages, and what-not flying around; a core is not an
appropriate place to "perform this work".
<
So I am investigating a "facility" located in the memory controller on 'chip'. This facility is
configured with a couple of pointers in config space which give facility exclusive access
to 'some memory'. In this memory there are interrupt vectors, exception vectors, and ADA
entry vectors. Associated with these 'vectors' is a set of priority queues (about 64 levels
seems sufficient). These vectors are the targets of messages {interrupts, exceptions,
calls, taps} and such messages use the priority in the vector to find the appropriate queue.
Facility is a lot like a function unit, except that it is located away from any core, and close
to the memory it has been assigned. Its instructions are the first doubleword of the
messages, its operands the rest of the message, and its results are messages to cores,
and it performs a bit of lookup and queuing as its raison d'être.
<
In My 66000 the entire facility is accessed via a single address (per chip) in non-cacheable
memory. Facility heavily caches this space, but cores all see this space as non-cacheable.
The ISA has been augmented to send messages to facility {size 1-64 doublewords}, and
core HW has been augmented to receive a context switch message by absorbing all of the
message while producing an "I got context switched out" message.
<
By using a single address (misaligned to boot) for all messages (within a chip) to facility,
messages are forced to arrive in transit order, and arrive completely, and without interference.
Inter-chip messages use a different address which is singular to the addressed chip.
<
Thus, context switch is a LOT like the CDC 6600, where the peripheral processors sent an XCHG
instruction to the cores to context switch from one thread to another in 16 cycles. Facility
is this peripheral processor here.
<
>Then the processing becomes:
> (sending):Create message -> push message pointer to other CPU via hardware register
> (Receiving): (probably get interrupt when Hardware message queue goes from empty to non-empty)
> Interrupt->
> Loop:
> get pointer from hardware queue->process stuff in message.
> if hardware queue not empty go to loop
>
> Ta Da! No locks!
<
Getting rid of the locks is of paramount importance.
<
One MUST include ADA call-accept queuing in the SAME overall structure.
So a call to an ADA entry point is a message--perhaps the message is as simple as the
pointer to thread-state {which contains the thread header and register file. The thread header
contains the root pointer to translations}.
An ADA <conditional> accept is a similarly small message.
Should the call arrive before the accept, the call is placed on the tail of the queue of
that entry point.
Should the <unconditional> accept arrive before any call, it is placed on the tail=head
of the <empty> queue.
When a call is joined with an accept, a context switch to the rendezvous entry point
is performed.
The rendezvous has access to the caller via the thread state, where it can access
the register file and memory using something that smells a lot like the PDP-11/70
LD-from-previous, ST-to-previous memory reference instructions. These are played
out in the address space of the caller. The rendezvous has access to its own memory
via normal LD and ST instructions.
When the rendezvous is done, a message is sent to facility, which then re-enables both
caller and ADA task simultaneously.
<
Rendezvous have the NATURAL property that they can only be ACTIVE once--hence
no re-entrancy problems or even the need to perform locks.
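A tiny C sketch of the call/accept matching just described, as it might look inside
such a facility; single-slot queues stand in for the real entry-point queues, and all
types and names are guesses:

struct thread_state;                       /* thread header + register file         */

struct entry_queue {                       /* one per ADA entry point               */
    struct thread_state *queued_call;      /* caller waiting for an accept, if any  */
    struct thread_state *queued_accept;    /* accept waiting for a call, if any     */
};

extern void switch_into_rendezvous(struct thread_state *caller,
                                   struct thread_state *acceptor);

static void on_call(struct entry_queue *e, struct thread_state *caller)
{
    if (e->queued_accept) {                /* accept got here first: rendezvous now */
        struct thread_state *a = e->queued_accept;
        e->queued_accept = 0;
        switch_into_rendezvous(caller, a);
    } else {
        e->queued_call = caller;           /* otherwise the call waits at the entry */
    }
}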
<
I plan on making Interrupt service routines have this property, and on using facility to
perform the interrupt queuing. ISRs are serially reusable.
<
Back at facility: messages from cores to facility are short; the message from facility
to a core is "everything needed to finish performing a context switch", 8 thread state
doublewords and 32 register doublewords. You do not tell a core to go and context
switch into something; you send it everything it needs to already be in that context.
In return, if the core is already running something, it sends back the current state of
the running thread, which facility puts on the front of its priority run queue.
<
Thus multiple scheduling and context switches can be simultaneously in progress
without any locking going on--all because it is not being performed in a core !
<
Additionally, since the context switch is a single message, there is no chance something
can get in the way while it is transiting the links, thus removing any need for core to
perform locks. In My 66000 such a context switch message is 9 cache lines long.
<
Also note: Facility remembers at what priority level all cores (in a chip) are operating,
and does not send messages unless they are of higher priority than what is currently
running.
<
Other than the latency from facility to core, each core is always running the highest
priority thread that has been affinitized to it; and there is no (0, zero, nada, zilch)
OS overhead associated with normal thread context switch behavior. The OS only needs
to be involved in run<->wait management and affinity.
<
All the operating system is left to do is to manage the run<->wait status of threads
and perform affinitization based on whatever mode of affinity model the OS is using.
<
<-------------------------------------------------------------------------------------------------------------------------------
<
So, say we have a core running user thread[t] and along comes an interrupt. Facility
sees the message, finds the priority, sees that core[c] running thread[t] is of lower priority than
ISR[i], so facility sends the ISR thread state to core[c]; core[c] sends back the current context of
thread[t] and begins running ISR[i] while facility places thread[t] on the head of its run
queue.
<
In the performance of its duties, ISR[i] sends a message to run I/O cleanup task[k]
to facility, and then <shortly> exits.
<
Sooner or later, facility sends the context switch into I/O cleanup task[k] to some core[?],
where it takes thread[m!=t] off the wait queue and places it on the run queue by sending
a run message to facility, then <after a while> exits.
<
After all the flurry of context switching has transpired, the original thread[t] returns to running
on core[c] and the waiting thread[k] is running on some core[m!=c].
>
> (well, they are conceptually still there, they are (sort of) part of the hardware queue).
>
> - Tim
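For concreteness, a C sketch of the facility-to-core context-switch message as
described: 8 thread-state doublewords plus the 32 registers. The field names are
guesses; only the sizes come from the post.

#include <stdint.h>

struct context_switch_msg {
    uint64_t thread_state[8];   /* thread header, incl. root pointer to translations */
    uint64_t regs[32];          /* the register file                                  */
};
/* A core receiving this absorbs it and replies with a message of the same
   shape carrying the thread it was running, which the facility re-queues. */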


Re: Processor Affinity in the age of unbounded cores

<sie07o$jnv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20623&group=comp.arch#20623

 by: Ivan Godard - Wed, 22 Sep 2021 01:17 UTC

On 9/21/2021 2:06 PM, MitchAlsup wrote:
> On Tuesday, September 21, 2021 at 1:13:48 PM UTC-5, timca...@aol.com wrote:
>> On Sunday, September 19, 2021 at 4:19:23 PM UTC-4, MitchAlsup wrote:
>>> Let us consider a system made out of 16 "processor" chips, each chip contains
>>> 16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
>>> <
>>> One process has ask the OS to perform some I/O and wake it back up when done.
>>> This process[p] was originally running on node[n] in core[c] and the OS decides that
>>> the accessed device is over on node[IO].
>>> <
>>> Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
>>> bridge to the actual device: when the I/O is complete, device sends interrupt
>>> to node[IO] core[IH].
>>> <
>>> Interrupt handler runs, captures I/O completion status, and signals the {lower
>>> priority] device driver of the completion; and exits. Device driver runs, sees the
>>> completion of request from process[p] and schedules process[p] to run on node[n].
>>> <
>>> Process[p] is affinitized to ANY core in node[n], so Device driver sends IPI to some
>>> core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c];
>>> places process[p] on the run queue of core[j]. If process[p] is of higher priority than
>>> what is currently running on core[j], and IPI != j, core[IPI] and then sends an IPI to node[j]
>>> to context switch into this process[p].
>>> <
>>> So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
>>> complexity.
>>> <
>>> <
>>> Now consider that the time delay from sending an IPI to receiving the IPI on a different
>>> node may be 200-400 cycles ! So, anytime a decision is made to "talk" to another core
>>> on another node, by the time the IPI arrives, the state of the system can change markedly.
>>> {The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
>>> be on the order of 50 cycles, and then there is link delay from sending node to receiving
>>> node, so there is lots of cycles from sending of an IPI to receipt of IPI no mater whether
>>> these are routed as memory accesses or messages. In addition there is context switch
>>> time to get to the IPI handler.}
>>> <
>>> So, IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to affinitized
>>> core to run process[p]. Does process[p] get taken off the run queue at this point, or does
>>> core[j] have to discover this all by itself ?
>>> <
>>> This is beginning to smell like a difficult journey, and much like the CPU{core} is seldom
>>> in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
>>> possible; it seems none of the threads in a multi-threaded OS is in a position to "take a look
>>> at the run queues" without grabbing a big "lock".
> <
>> For some time I have thought that the model of:
>> (Sending): Get lock -> Put stuff in queue -> release lock -> send IPI
>> (Recieving): Rcv IPI -> get lock -> get stuff out of queue -> release lock -> process stuff from queue
>>
>> Was pretty inefficient.
> <
> Modern times have proven that assumption dubious at best.
>>
>> I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
>> between processors.
> <
> This is the direction I am trying to head without using those words.
> <
> There is some HW facility where you send it messages (and in this context and IPI is simply
> a message) and in response the facility sends to an affinitized processor a context switch
> message all based on the priorities in some kind of queuing system.
> <
> Since there are multiple chips, and each chip contains multiple cores, in any given clock
> there could be multiple interrupts, messages, and what-not flying around, a core is not an
> appropriate place to "perform this work".
> <
> So I am investigating a "facility" located in memory controller on 'chip'. This facility is
> configured with a couple of pointers in config space which give facility exclusive access
> to 'some memory'. In this memory there are interrupt vectors, exception vectors, and ADA
> entry vectors. Associated with these 'vectors' is a set of priority queues (about 64 levels
> seems sufficient). These vectors are the targets of messages {interrupts, exceptions,
> calls, taps} and such messages use the priority in the vector to find the appropriate queue.
> Facility is a lot like a function unit, except that it is located away from any core, and close
> the the memory it has been assigned. Its instructions are the first doubleword of the
> messages, its operands the rest of the message, and its results are messages to cores,
> and it performs a bit of lookup and queuing as its raison détra.
> <
> In My 66000 the entire facility is accessed via a single address (per chip) in non-cacheable
> memory. Facility heavily caches this space, but cores all see this space as non-cacheable.
> The ISA has been augmented to send messages to facility {size 1-64 doublewords}, and
> core HW has been augmented to receive a context switch message by absorbing all of the
> message while producing a "I got context switched out" message.
> <
> By using a single address (misaligned to boot) for all messages (within a chip) to facility,
> messages are forced to arrive in transit order, and arrive completely, and without interference.
> Inter-chip messages use a different address which is singular to the addressed chip.
> <
> Thus, context switch is a LOT like CDC 6600 where the peripheral processors sent XCHG
> instruction to the cores to context switch form on thread to another in 16 cycles. Facillity
> is this peripheral processor here.
> <
>> Then the processing becomes:
>> (sending):Create message -> push message pointer to other CPU via hardware register
>> (Receiving): (probably get interrupt when Hardware message queue goes from empty to non-empty)
>> Interrupt->
>> Loop:
>> get pointer from hardware queue->process stuff in message.
>> if hardware queue not empty go to loop
>>
>> Ta Da! No locks!
> <
> Getting rid to the locks is of paramount importance.
> <
> One MUST include ADA call-accept queuing in the SAME overall structure.
> So a call to an ADA entry point is a message--perhaps the message is as simple at the
> pointer to thread-state {which contains thread header and register file. Thread header
> contain the root pointer to translations}.
> An ADA <conditional> accept is a similarly small message.
> Should the call arrive before the accept, the call is placed on the tail of the queue of
> that entry point.
> Should the <unconditional> accept arrive before any call, it is placed on the tail=head
> of the <empty> queue.
> When a call is joined with an accept a context switch to the rendezvous entry point
> is performed.
> The rendezvous has access to the caller via the thread state where it can access
> the register file and memory using something that smells a lot like the PDP-11/70
> LD-from-previous, ST-to-previous memory reference instructions. These are played
> out in the address space of the caller. Rendezbvous has access to its own memory
> via normal LD and ST instructions.
> When rendezvous is done, a message is sent to facility when then re-enables both
> caller and ADA task simultaneously.
> <
> Rendezvous have the NATURAL property that they can only be ACTIVE once--hence
> no re-entrancy problems or even the need to perform locks.
> <
> I plan on making Interrupt service routines have his property, and use facility to
> perform the interrupt queuing. ISRs are serially reusable.
> <
> Back at facility: messages from cores to facility are short, messages from facility
> to core is "everything needed to finish performing a context switch", 8 thread state
> doublewords and 32 register doublewords. You do not tell a core to go and context
> switch into something, you send it everything it needs to already be in that context.
> In return, if core is already running something, it sends back the current state of
> the running thread, which facility puts on the front of of its priority run queue.
> <
> Thus multiple scheduling and context switches can be simultaneously in progress
> without any locking going on--all because it is not being performed in a core !
> <
> Additionally, since the context switch is a single message, there is no chance something
> can get in the way while it is transiting the links, thus removing any need for core to
> perform locks. In My 66000 such a context switch message is 9 cache lines long.
> <
> Also note: Facillity remembers at what priority level all cores (in a chip) are operating,
> and does not send messages unless they are of higher priority that what is currently
> running.
> <
> Other than the latency from facility to core, each core is always running the highest
> priority thread that has been afinitized to it; and there is no (0, zero, nada, zilch)
> OS overhead associated in normal thread context switch behavior. The OS only needs
> to be involved run<->wait management and affinity.
> <
> All the operating system is left to do is to manage the run<->wait status of threads
> and perform affinitization based on whatever mode of affinity model the OS is using.
> <
> <-------------------------------------------------------------------------------------------------------------------------------
> <
> So, say we have a core running user thread[t] and along comes an interrupt. Facility'
> sees message, finds priority, sees that core[c] running thread[t] is of lower priority than
> ISR[i] so facility sends ISR thread state to core[c]; core[c] sends back current context of
> thread[t] and begins running ISR[i] while facility places thread[t] on the head of its run
> queue.
> <
> In the performance of its duties; ISR[i] sends a message to run I/O cleanup task[k]
> to facillity, and then <shortly> exits.
> <
> Sooner or later, facility sends the context switch into I/O cleanup take[k] to some core[?]
> where it takes thread[m!=t] off the wait queue and places it on the run queue by sending
> a run message to facility, then <after a while> exits.
> <
> After all the flury of context switching has transpired, original thread[t] return to running
> on core[c] and waiting thread[k] is running on some core[m!=c].


Re: Processor Affinity in the age of unbounded cores

<ff98a455-58f0-46c5-9c99-f6067e0089b5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20624&group=comp.arch#20624

 by: MitchAlsup - Wed, 22 Sep 2021 01:39 UTC

On Tuesday, September 21, 2021 at 8:17:46 PM UTC-5, Ivan Godard wrote:
> On 9/21/2021 2:06 PM, MitchAlsup wrote:
> > On Tuesday, September 21, 2021 at 1:13:48 PM UTC-5, timca...@aol.com wrote:
> >> On Sunday, September 19, 2021 at 4:19:23 PM UTC-4, MitchAlsup wrote:
> >>> Let us consider a system made out of 16 "processor" chips, each chip contains
> >>> 16 cores, a PCIe bridge, a Memory Controller, and "processor to processor" links.
> >>> <
> >>> One process has ask the OS to perform some I/O and wake it back up when done.
> >>> This process[p] was originally running on node[n] in core[c] and the OS decides that
> >>> the accessed device is over on node[IO].
> >>> <
> >>> Given that Interrupt handlers are affinitized to the "chip" containing the PCIe
> >>> bridge to the actual device: when the I/O is complete, device sends interrupt
> >>> to node[IO] core[IH].
> >>> <
> >>> Interrupt handler runs, captures I/O completion status, and signals the {lower
> >>> priority] device driver of the completion; and exits. Device driver runs, sees the
> >>> completion of request from process[p] and schedules process[p] to run on node[n].
> >>> <
> >>> Process[p] is affinitized to ANY core in node[n], so Device driver sends IPI to some
> >>> core[IPI] on node[n], and this core[IPI] determines that process[p] can be run on core[j!=c];
> >>> places process[p] on the run queue of core[j]. If process[p] is of higher priority than
> >>> what is currently running on core[j], and IPI != j, core[IPI] and then sends an IPI to node[j]
> >>> to context switch into this process[p].
> >>> <
> >>> So, it seems to me that affinity has added at least 1 to the exponent of process scheduling
> >>> complexity.
> >>> <
> >>> <
> >>> Now consider that the time delay from sending an IPI to receiving the IPI on a different
> >>> node may be 200-400 cycles ! So, anytime a decision is made to "talk" to another core
> >>> on another node, by the time the IPI arrives, the state of the system can change markedly.
> >>> {The typical L1 cache delay is 3-4 cycles (as seen at the core), L2 12-24 cycles, L3 may
> >>> be on the order of 50 cycles, and then there is link delay from sending node to receiving
> >>> node, so there is lots of cycles from sending of an IPI to receipt of IPI no mater whether
> >>> these are routed as memory accesses or messages. In addition there is context switch
> >>> time to get to the IPI handler.}
> >>> <
> >>> So, IPI arrives, OS.process[IPI] looks at the run queue, and decides to IPI over to affinitized
> >>> core to run process[p]. Does process[p] get taken off the run queue at this point, or does
> >>> core[j] have to discover this all by itself ?
> >>> <
> >>> This is beginning to smell like a difficult journey, and much like the CPU{core} is seldom
> >>> in the proper place in the system to perform ATOMIC stuff in the fewest number of cycles
> >>> possible; it seems none of the threads in a multi-threaded OS is in a position to "take a look
> >>> at the run queues" without grabbing a big "lock".
> > <
> >> For some time I have thought that the model of:
> >> (Sending): Get lock -> Put stuff in queue -> release lock -> send IPI
> >> (Recieving): Rcv IPI -> get lock -> get stuff out of queue -> release lock -> process stuff from queue
> >>
> >> Was pretty inefficient.
> > <
> > Modern times have proven that assumption dubious at best.
> >>
> >> I would like a hardware queue (perhaps similar to the old I2O interface that Intel pushed (but better!))
> >> between processors.
> > <
> > This is the direction I am trying to head without using those words.
> > <
> > There is some HW facility where you send it messages (and in this context and IPI is simply
> > a message) and in response the facility sends to an affinitized processor a context switch
> > message all based on the priorities in some kind of queuing system.
> > <
> > Since there are multiple chips, and each chip contains multiple cores, in any given clock
> > there could be multiple interrupts, messages, and what-not flying around, a core is not an
> > appropriate place to "perform this work".
> > <
> > So I am investigating a "facility" located in memory controller on 'chip'. This facility is
> > configured with a couple of pointers in config space which give facility exclusive access
> > to 'some memory'. In this memory there are interrupt vectors, exception vectors, and ADA
> > entry vectors. Associated with these 'vectors' is a set of priority queues (about 64 levels
> > seems sufficient). These vectors are the targets of messages {interrupts, exceptions,
> > calls, taps} and such messages use the priority in the vector to find the appropriate queue.
> > Facility is a lot like a function unit, except that it is located away from any core, and close
> > the the memory it has been assigned. Its instructions are the first doubleword of the
> > messages, its operands the rest of the message, and its results are messages to cores,
> > and it performs a bit of lookup and queuing as its raison détra.
> > <
> > In My 66000 the entire facility is accessed via a single address (per chip) in non-cacheable
> > memory. Facility heavily caches this space, but cores all see this space as non-cacheable.
> > The ISA has been augmented to send messages to facility {size 1-64 doublewords}, and
> > core HW has been augmented to receive a context switch message by absorbing all of the
> > message while producing a "I got context switched out" message.
> > <
> > By using a single address (misaligned to boot) for all messages (within a chip) to facility,
> > messages are forced to arrive in transit order, and arrive completely, and without interference.
> > Inter-chip messages use a different address which is singular to the addressed chip.
> > <
> > Thus, context switch is a LOT like CDC 6600 where the peripheral processors sent XCHG
> > instruction to the cores to context switch form on thread to another in 16 cycles. Facillity
> > is this peripheral processor here.
> > <
> >> Then the processing becomes:
> >> (sending):Create message -> push message pointer to other CPU via hardware register
> >> (Receiving): (probably get interrupt when Hardware message queue goes from empty to non-empty)
> >> Interrupt->
> >> Loop:
> >> get pointer from hardware queue->process stuff in message.
> >> if hardware queue not empty go to loop
> >>
> >> Ta Da! No locks!
> > <
> > Getting rid to the locks is of paramount importance.
> > <
> > One MUST include ADA call-accept queuing in the SAME overall structure.
> > So a call to an ADA entry point is a message--perhaps the message is as simple at the
> > pointer to thread-state {which contains thread header and register file.. Thread header
> > contain the root pointer to translations}.
> > An ADA <conditional> accept is a similarly small message.
> > Should the call arrive before the accept, the call is placed on the tail of the queue of
> > that entry point.
> > Should the <unconditional> accept arrive before any call, it is placed on the tail=head
> > of the <empty> queue.
> > When a call is joined with an accept a context switch to the rendezvous entry point
> > is performed.
> > The rendezvous has access to the caller via the thread state where it can access
> > the register file and memory using something that smells a lot like the PDP-11/70
> > LD-from-previous, ST-to-previous memory reference instructions. These are played
> > out in the address space of the caller. Rendezbvous has access to its own memory
> > via normal LD and ST instructions.
> > When rendezvous is done, a message is sent to facility when then re-enables both
> > caller and ADA task simultaneously.
> > <
> > Rendezvous have the NATURAL property that they can only be ACTIVE once--hence
> > no re-entrancy problems or even the need to perform locks.
> > <
> > I plan on making Interrupt service routines have his property, and use facility to
> > perform the interrupt queuing. ISRs are serially reusable.
> > <
> > Back at facility: messages from cores to facility are short, messages from facility
> > to core is "everything needed to finish performing a context switch", 8 thread state
> > doublewords and 32 register doublewords. You do not tell a core to go and context
> > switch into something, you send it everything it needs to already be in that context.
> > In return, if core is already running something, it sends back the current state of
> > the running thread, which facility puts on the front of of its priority run queue.
> > <
> > Thus multiple scheduling and context switches can be simultaneously in progress
> > without any locking going on--all because it is not being performed in a core !
> > <
> > Additionally, since the context switch is a single message, there is no chance something
> > can get in the way while it is transiting the links, thus removing any need for core to
> > perform locks. In My 66000 such a context switch message is 9 cache lines long.
> > <
> > Also note: Facillity remembers at what priority level all cores (in a chip) are operating,
> > and does not send messages unless they are of higher priority that what is currently
> > running.
> > <
> > Other than the latency from facility to core, each core is always running the highest
> > priority thread that has been afinitized to it; and there is no (0, zero, nada, zilch)
> > OS overhead associated in normal thread context switch behavior. The OS only needs
> > to be involved run<->wait management and affinity.
> > <
> > All the operating system is left to do is to manage the run<->wait status of threads
> > and perform affinitization based on whatever mode of affinity model the OS is using.
> > <
> > <-------------------------------------------------------------------------------------------------------------------------------
> > <
> > So, say we have a core running user thread[t] and along comes an interrupt. Facility'
> > sees message, finds priority, sees that core[c] running thread[t] is of lower priority than
> > ISR[i] so facility sends ISR thread state to core[c]; core[c] sends back current context of
> > thread[t] and begins running ISR[i] while facility places thread[t] on the head of its run
> > queue.
> > <
> > In the performance of its duties; ISR[i] sends a message to run I/O cleanup task[k]
> > to facillity, and then <shortly> exits.
> > <
> > Sooner or later, facility sends the context switch into I/O cleanup take[k] to some core[?]
> > where it takes thread[m!=t] off the wait queue and places it on the run queue by sending
> > a run message to facility, then <after a while> exits.
> > <
> > After all the flury of context switching has transpired, original thread[t] return to running
> > on core[c] and waiting thread[k] is running on some core[m!=c].
<
> How do you track which quanta budget to charge for the switched-in runtime?
<
Facility knows 'when' the context switch message was sent out and 'when' that thread's
<then> current state gets sent back.
<
Thus, with access to Real Time Clock, the subtraction is easy.
<
It is more precise than when the transition through microcode takes 500 cycles to "take"
the first interrupt of the flurry.
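A minimal sketch of that accounting: the facility stamps the outgoing context-switch
message and charges the thread when its state comes back. rtc_now() and the fields
are placeholders:

#include <stdint.h>

extern uint64_t rtc_now(void);            /* Real Time Clock, in some fixed unit */

struct thread_acct {
    uint64_t switched_in_at;              /* stamped when the context-switch message is sent */
    uint64_t time_used;                   /* running total charged to the thread             */
};

static void on_context_sent(struct thread_acct *t)
{
    t->switched_in_at = rtc_now();
}

static void on_state_returned(struct thread_acct *t)
{
    t->time_used += rtc_now() - t->switched_in_at;
}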

