comp.arch: Processor Affinity in the age of unbounded cores

Subject (Author)
* Processor Affinity in the age of unbounded cores (MitchAlsup)
+* Re: Processor Affinity in the age of unbounded cores (Stephen Fuld)
|`* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
| `- Re: Processor Affinity in the age of unbounded cores (Stephen Fuld)
+* Re: Processor Affinity in the age of unbounded cores (EricP)
|`* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
| `* Re: Processor Affinity in the age of unbounded cores (EricP)
|  `* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
|   `* Re: Processor Affinity in the age of unbounded cores (EricP)
|    `* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
|     +* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
|     |`* Re: Processor Affinity in the age of unbounded cores (JimBrakefield)
|     | `* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
|     |  +- Re: Processor Affinity in the age of unbounded cores (JimBrakefield)
|     |  `* Re: Processor Affinity in the age of unbounded cores (Ivan Godard)
|     |   +- Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
|     |   `- Re: Processor Affinity in the age of unbounded cores (Stefan Monnier)
|     `* Re: Processor Affinity in the age of unbounded cores (EricP)
|      `- Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
+- Re: Processor Affinity in the age of unbounded cores (JimBrakefield)
+- Re: Processor Affinity in the age of unbounded cores (Terje Mathisen)
+- Re: Processor Affinity in the age of unbounded cores (pec...@gmail.com)
`* Re: Processor Affinity in the age of unbounded cores (Quadibloc)
 +* Re: Processor Affinity in the age of unbounded cores (Stefan Monnier)
 |`* Re: Processor Affinity in the age of unbounded cores (Quadibloc)
 | `* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
 |  `* Re: Processor Affinity in the age of unbounded cores (Timothy McCaffrey)
 |   `* Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
 |    `* Re: Processor Affinity in the age of unbounded cores (Ivan Godard)
 |     `- Re: Processor Affinity in the age of unbounded cores (MitchAlsup)
 `* Re: Processor Affinity in the age of unbounded cores (EricP)
  `- Re: Processor Affinity in the age of unbounded cores (MitchAlsup)

Processor Affinity in the age of unbounded cores

<3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20065&group=comp.arch#20065

 by: MitchAlsup - Mon, 23 Aug 2021 18:24 UTC

I have been giving some thought to processor affinity. But I want to
look at the future where there may be an essentially unbounded
number of cores.
<
The current model of 1-bit to denote "any core" and then a bit vector
to say "any of these cores can run this task/thread" falls apart when
number of cores is bigger than 32-cores (or 64-cores or 128-cores).
The bit vector approach "does not scale".
<
So, I ask: what would a programming model for assigning processor
affinity when the number of cores is not easily mapped into common
register widths ?
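
For reference, the bit-vector model being criticized is roughly what Linux exposes today through sched_setaffinity(); a minimal sketch of that existing interface (glibc CPU_* macros):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;                   /* fixed-width bit vector (1024 bits in glibc) */

    CPU_ZERO(&mask);                  /* start from "no cores" */
    CPU_SET(0, &mask);                /* allow core 0 */
    CPU_SET(3, &mask);                /* allow core 3 */

    /* Pin the calling thread to the cores set in the mask. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
    return 0;
}

glibc adds CPU_ALLOC() so the mask can be sized at run time, but the interface remains one flat bit vector per thread, which is exactly the scaling concern raised here.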

Re: Processor Affinity in the age of unbounded cores

<sg0rae$5c3$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20067&group=comp.arch#20067

 by: Stephen Fuld - Mon, 23 Aug 2021 19:01 UTC

On 8/23/2021 11:24 AM, MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
> <
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?

Interesting question. I think we should take a step back and determine
what problem processor affinity is needed/desirable to solve. I can
think of four, though there are probably others.

1. A process needs to run on a particular specific core, say to run some
kind of diagnostic.

2. All the cores are not equivalent. Some could be faster or slower, or
some may have particular extensions that others don't have.

3. Some cores share some level of cache with some, but not all other
cores. So you may want a process to be in the same group or a different
group of cores from some other process, depending upon their cache usage
pattern.

4. There may be some other characteristic of the processes that is
relevant, such as one process produces an intermediate result used by
another process, so there is no use having the second process bidding
for another core and thus potentially using resources needlessly. I am
somewhat fuzzy about this.

It may be that there are different solutions for some of these. For
example, for the first, the program could use a "core number" rather
than a bit mask. For the second, you could have the OS define "groups"
each with particular characteristics. Then a process would request a
"group number", and be assigned to any processor in that group.

For three and four, the software could specify "same core as process
"X", or "different core from process "Y". This get the applications
software out of specifying physical core sets.
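
A hypothetical sketch of that relative "same core as X / different core from Y" idea, with the interface mocked in user space (sched_affine() and affinity_hint are invented names for illustration, not an existing API):

#include <stdio.h>
#include <sys/types.h>

enum relation { AFFINE_WITH, AFFINE_AWAY };

struct affinity_hint {
    pid_t         subject;   /* the thread/process asking */
    pid_t         other;     /* the thread/process it is placed relative to */
    enum relation rel;
};

/* Mocked "system call": here we just print what a scheduler would record. */
static int sched_affine(pid_t subject, pid_t other, enum relation rel)
{
    struct affinity_hint h = { subject, other, rel };
    printf("pid %d wants to run %s pid %d\n",
           (int)h.subject, h.rel == AFFINE_WITH ? "near" : "away from",
           (int)h.other);
    return 0;
}

int main(void)
{
    sched_affine(1234, 1233, AFFINE_WITH);  /* consumer near its producer */
    sched_affine(1234, 4321, AFFINE_AWAY);  /* away from a cache-hostile job */
    return 0;
}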

Obviously, this is pretty much off the top of my head, and needs a lot
of "refinement". :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Processor Affinity in the age of unbounded cores

<CASUI.9322$vA6.4623@fx23.iad>


https://www.novabbs.com/devel/article-flat.php?id=20068&group=comp.arch#20068

 by: EricP - Mon, 23 Aug 2021 19:22 UTC

MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
> <
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?

Physical memory is allocated to minimize the distance between a
thread's preferred core, and that core's memory or neighbor memory.
The bit vector model does not take NUMA distance to memory into account.

A thread would be allocated NUMA memory as close as possible,
and would continue to use that memory until it is recycled.

Having two or more threads with shared process address space running on
different cores raises the question which NUMA node should the memory
be allocated on? Presumably each thread's stack would allocate on
the current core.

Moving thread T1 from core C0 to core C1 might have a shared L2,
to core C3 have 1 hop back to the memory previously allocated on C0,
to core C4 have 2 hops back to previous memory allocated on C0.
This extra hop cost continues until physical pages are recycled.

Moving a thread between cores also usually does not take into account
the loss of investment in the cache, and the delay to re-charge the cache.
Just guessing, but capacitor charge-discharge model probably applies.

This suggests a hierarchy of costs to move up, across, and then down
between cores for load balancing to take into account.
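
A rough sketch of the cost model this hierarchy implies; the constants and the toy numa_hops() topology are made up purely for illustration, only the shape of the estimate matters:

#include <stdio.h>

/* Hypothetical cost, in arbitrary units, of moving a thread from core
 * `from` to core `to`, given how many NUMA hops now separate it from
 * the memory it already allocated, and how warm its old cache was. */
static unsigned numa_hops(int from, int to)
{
    /* Placeholder topology: pretend 4 cores per node, nodes in a line. */
    int node_from = from / 4, node_to = to / 4;
    return (unsigned)(node_from > node_to ? node_from - node_to
                                          : node_to - node_from);
}

static unsigned migration_cost(int from, int to, unsigned cache_warmth)
{
    unsigned hop_penalty    = 100 * numa_hops(from, to);  /* extra memory latency */
    unsigned refill_penalty = cache_warmth;               /* lost cache investment */
    return hop_penalty + refill_penalty;
}

int main(void)
{
    /* Shared-L2 neighbor vs. a core two NUMA hops away, with a warm cache. */
    printf("C0 -> C1: %u\n", migration_cost(0, 1, 500));
    printf("C0 -> C9: %u\n", migration_cost(0, 9, 500));
    return 0;
}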

Re: Processor Affinity in the age of unbounded cores

<928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20070&group=comp.arch#20070

 by: MitchAlsup - Mon, 23 Aug 2021 19:58 UTC

On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > I have been giving some thought to processor affinity. But I want to
> > look at the future where there may be an essentially unbounded
> > number of cores.
> > <
> > The current model of 1-bit to denote "any core" and then a bit vector
> > to say "any of these cores can run this task/thread" falls apart when
> > number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> > The bit vector approach "does not scale".
> > <
> > So, I ask: what would a programming model for assigning processor
> > affinity when the number of cores is not easily mapped into common
> > register widths ?
<
> Physical memory is allocated to minimize the distance between a
> thread's preferred core, and that cores memory or neighbor memory.
> The bit vector model does not take NUMA distance to memory into account.
>
> A thread would be allocated NUMA memory as close as possible,
> and would continue to use that memory until it is recycled.
<
So, would an I/O driver (SATA) want to be closer to his memory or closer
to his I/O (Host Bridge) ?
<
Would interrupt from the Host Bridge be directed to a core in this "Node" ?
So, could the device driver be situated close to its memory but its interrupt
handler closer to the device ?
<
Would a device page fault be directed at the interrupt handler or at I/O
page fault handling driver ?
>
> Having two or more threads with shared process address space running on
> different cores raises the question which NUMA node should the memory
> be allocated on? Presumably each thread's stack would allocate on
> the current core.
>
> Moving thread T1 from core C0 to core C1 might have a shared L2,
> to core C3 have 1 hop back to the memory previously allocated on C0,
> to core C4 have 2 hops back to previous memory allocated on C0.
> This extra hop cost continues until physical pages are recycled.
>
> Moving a thread between cores also usually does not take into account
> the loss of investment in the cache, and the delay to re-charge the cache.
> Just guessing, but capacitor charge-discharge model probably applies.
>
> This suggests a hierarchy of costs to move up, across, and then down
> between cores for load balancing to take into account.
<
Thanks for this observation.
<
Now what about multi-threaded cores, the cache cost of migrating around a
MT core is low, but the throughput could be lower due to both (or many) cores
sharing execution resources?

Re: Processor Affinity in the age of unbounded cores

<ee87abaa-04ec-455d-987d-c15cbe3a45f0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20071&group=comp.arch#20071

 by: MitchAlsup - Mon, 23 Aug 2021 20:03 UTC

On Monday, August 23, 2021 at 2:01:37 PM UTC-5, Stephen Fuld wrote:
> On 8/23/2021 11:24 AM, MitchAlsup wrote:
> > I have been giving some thought to processor affinity. But I want to
> > look at the future where there may be an essentially unbounded
> > number of cores.
> > <
> > The current model of 1-bit to denote "any core" and then a bit vector
> > to say "any of these cores can run this task/thread" falls apart when
> > number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> > The bit vector approach "does not scale".
> > <
> > So, I ask: what would a programming model for assigning processor
> > affinity when the number of cores is not easily mapped into common
> > register widths ?
<
> Interesting question. I think we should take a step back and determine
> what problem processor affinity is needed/desirable to solve. I can
> think of four, though there are probably others.
>
> 1. A process needs to run on a particular specific core, say to run some
> kind of diagnostic.
<
Is pinning merely a subset of affinity ?
>
> 2. All the cores are not equivalent. Some could be faster or slower, or
> some may have particular extensions that other don't have.
<
This begs the question of powering back on various cores and the
latency to do just such!
>
> 3. Some cores share some level of cache with some, but not all other
> cores. So you may want a process to be in the same group or a different
> group of cores from some other process, depending upon their cache usage
> pattern.
<
This is similar to EricP's hierarchy observation.
>
> 4. There may be some other characteristic of the processes that is
> relevant, such as one process produces an intermediate result used by
> another process, so there is no use having the second process bidding
> for another core and thus potentially using resources needlessly. I am
> somewhat fuzzy about this.
<
And all sorts of transitivity issues between threads/tasks.
>
> It may be that there are different solutions for some of these. For
> example, for the first, the program could use a "core number" rather
> than a bit mask. For the second, you could have the OS define "groups"
> each with particular characteristics. The a process would request a
> "group number", and be assigned to any processor in that group.
<
This gets at the gist of my question:: what is the right model to
give users to control their machines, which are vastly more core
intensive than the ones we are using today ?
>
> For three and four, the software could specify "same core as process
> "X", or "different core from process "Y". This get the applications
> software out of specifying physical core sets.
<
But how does a user express this ? and how does an OS/HV do this ?
>
> Obviously, this is pretty much off the top of my head, and needs a lot
> of "refinement". :-)
<
Thanks for a go at it.
>
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Processor Affinity in the age of unbounded cores

<sg172b$uhb$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20074&group=comp.arch#20074

 by: Stephen Fuld - Mon, 23 Aug 2021 22:22 UTC

On 8/23/2021 1:03 PM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 2:01:37 PM UTC-5, Stephen Fuld wrote:
>> On 8/23/2021 11:24 AM, MitchAlsup wrote:
>>> I have been giving some thought to processor affinity. But I want to
>>> look at the future where there may be an essentially unbounded
>>> number of cores.
>>> <
>>> The current model of 1-bit to denote "any core" and then a bit vector
>>> to say "any of these cores can run this task/thread" falls apart when
>>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>>> The bit vector approach "does not scale".
>>> <
>>> So, I ask: what would a programming model for assigning processor
>>> affinity when the number of cores is not easily mapped into common
>>> register widths ?
> <
>> Interesting question. I think we should take a step back and determine
>> what problem processor affinity is needed/desirable to solve. I can
>> think of four, though there are probably others.
>>
>> 1. A process needs to run on a particular specific core, say to run some
>> kind of diagnostic.
> <
> Is pinning merely a subset of affinity ?

No. I am thinking of a "group" concept. See below. I suppose you
could have the OS define each processor as a single member group. That
might make sense.

>>
>> 2. All the cores are not equivalent. Some could be faster or slower, or
>> some may have particular extensions that other don't have.
> <
> This begs the question of powering back on various cores and the
> latency to do just such!

The OS would predefine each processor type as a "group", e.g. GBOoO
processors, small in-order processors, processors with a specific extra
functionality, etc. These would be well known to the application. Then
the application would do some sort of syscall to specify to which group
it wants affinity.

>>
>> 3. Some cores share some level of cache with some, but not all other
>> cores. So you may want a process to be in the same group or a different
>> group of cores from some other process, depending upon their cache usage
>> pattern.
> <
> This is similar to EricP's hierarchy observation.
>>
>> 4. There may be some other characteristic of the processes that is
>> relevant, such as one process produces an intermediate result used by
>> another process, so there is no use having the second process bidding
>> for another core and thus potentially using resources needlessly. I am
>> somewhat fuzzy about this.
> <
> And all sorts of transitivity issues between threads/tasks.

Here, I would allow user processes to ask the OS to create a new group.
Then other processes could ask the OS to join that group. Processes
in the same group would share an affinity. Alternatively, a process
could ask to be scheduled "away" from a particular group.

The idea is that individual processes don't need your bitmap of cores,
just a group name. The OS maps group membership to specific cores as
needed.
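
A minimal sketch of what such a user-visible interface could look like, with the kernel side mocked as an in-process table (group_create()/group_join() are invented names, not an existing API):

#include <stdio.h>

/* Hypothetical named-group affinity: processes deal only in group names;
 * the OS privately maps each group onto whatever cores it sees fit. */
#define MAX_GROUPS 16

struct group { char name[32]; int members; };
static struct group groups[MAX_GROUPS];
static int ngroups;

static int group_create(const char *name)
{
    if (ngroups == MAX_GROUPS) return -1;
    snprintf(groups[ngroups].name, sizeof groups[ngroups].name, "%s", name);
    groups[ngroups].members = 0;
    return ngroups++;                    /* group handle */
}

static int group_join(int handle)        /* "schedule me with this group" */
{
    if (handle < 0 || handle >= ngroups) return -1;
    groups[handle].members++;
    return 0;
}

int main(void)
{
    int db = group_create("db-workers");     /* e.g. a database's worker pool */
    group_join(db);                          /* each worker thread joins */
    printf("group %s has %d member(s)\n", groups[db].name, groups[db].members);
    return 0;
}

A real interface would presumably also offer the "schedule me away from this group" hint described above; the point is that the application never sees a core bitmap.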

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Processor Affinity in the age of unbounded cores

<26856585-2914-4d4a-a307-9288c4b99066n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20075&group=comp.arch#20075

 by: JimBrakefield - Mon, 23 Aug 2021 22:25 UTC

On Monday, August 23, 2021 at 1:24:05 PM UTC-5, MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
> <
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?

Here's my go at it (explained as simply as possible):
Intrinsically going for a bottom-up approach, start with the communications
network. Store and forward (SNF) at one clock per step. For a modest
network of cores, a squashed torus is used. Take a fat torus, squash, fold
in half in both the X and Y directions. The packet size is of the order of 256
bits or so. For a large network, a 3D torus is needed. Take the 3rd dimension
and squash the cores into a small square. Say 4096 cores total, so 16 cores
is a small square. The SNF network needs to hop the 16 cores in a single
clock cycle. Core numbering is such that 3D blocks of any size or aspect ratios
can be specified. A packet can be sent to a 3D block of any size. This is a
mechanism similar to an OoO machine functional unit broadcasting its result.

Ok, so how are cores allocated? A core requests a block of cores. To minimize
store and forward jumps the block should be as cubic as possible and as
near as possible. Perhaps this is something the SNF can perform?

These ideas resulted from a study for a NN machine for spiking neurons using
FPGA sticks for each node.
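
One way to picture the block-addressed delivery is below; the coordinate layout and field names are assumptions for illustration, not Jim's actual numbering scheme:

#include <stdbool.h>
#include <stdio.h>

/* A destination "block" on a 3D-numbered core array: a packet is accepted
 * by every core whose (x,y,z) coordinate falls inside the block. */
struct block3d { int x0, y0, z0; int xs, ys, zs; };  /* origin and size */

static bool core_in_block(int x, int y, int z, const struct block3d *b)
{
    return x >= b->x0 && x < b->x0 + b->xs &&
           y >= b->y0 && y < b->y0 + b->ys &&
           z >= b->z0 && z < b->z0 + b->zs;
}

int main(void)
{
    /* A 2x2x4 block of cores starting at (4,0,0). */
    struct block3d b = { 4, 0, 0, 2, 2, 4 };
    printf("core (5,1,3) in block: %d\n", core_in_block(5, 1, 3, &b));
    printf("core (6,1,3) in block: %d\n", core_in_block(6, 1, 3, &b));
    return 0;
}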

Re: Processor Affinity in the age of unbounded cores

<sg28it$d7f$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=20083&group=comp.arch#20083

 by: Terje Mathisen - Tue, 24 Aug 2021 07:54 UTC

MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
> <
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?
>
Same answer as for network routing tables and huge core count clusters?

I.e. directories so that only the interested cores need to bother. This
works well as long as you don't have many threads that you want to allow
to run "on any core except the first and last". I.e. you need either
"any core" as a single bit, or "small_list_of_cores".

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Processor Affinity in the age of unbounded cores

<13cd4415-ea0a-4c8a-8ce0-eb0bc309f84an@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20086&group=comp.arch#20086

 by: pec...@gmail.com - Tue, 24 Aug 2021 10:41 UTC

On Monday, August 23, 2021 at 20:24:05 UTC+2, MitchAlsup wrote:
> I have been giving some thought to processor affinity. But I want to
> look at the future where there may be an essentially unbounded
> number of cores.
> <
> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".
It scales perfectly (linearly) with the number of cores as long as strict affinity makes sense.
> So, I ask: what would a programming model for assigning processor
> affinity when the number of cores is not easily mapped into common
> register widths ?
A small memory area is the solution; a cache line will be sufficient for many years.
In practice, processor allocation will be similar to memory allocation: you rarely ask for specific physical bytes, you just ask for compact areas.
Short- to medium-scale connectivity should have redundant capacity that can absorb imperfect allocations.
The virtualization layer should allow for dynamic core allocation, core migration, etc. It is a complex optimization problem.

In a large system the topology defined by core connectivity will be important.
On a small scale the topology will be tree-like, with shared memory and address space.
On larger scales a 3-dimensional connection mesh, the topology of physical space, will win.
Physical realisations of 3D topology can have a fractal-like tree structure of connections with shortcuts and "routers".
At the largest scale physics forces us to use the 2-dimensional surface of a celestial body, then a Dyson sphere ;)

Re: Processor Affinity in the age of unbounded cores

<OV9VI.25431$%r2.16417@fx37.iad>


https://www.novabbs.com/devel/article-flat.php?id=20103&group=comp.arch#20103

 by: EricP - Tue, 24 Aug 2021 17:22 UTC

MitchAlsup wrote:
> On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> I have been giving some thought to processor affinity. But I want to
>>> look at the future where there may be an essentially unbounded
>>> number of cores.
>>> <
>>> The current model of 1-bit to denote "any core" and then a bit vector
>>> to say "any of these cores can run this task/thread" falls apart when
>>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>>> The bit vector approach "does not scale".
>>> <
>>> So, I ask: what would a programming model for assigning processor
>>> affinity when the number of cores is not easily mapped into common
>>> register widths ?
> <
>> Physical memory is allocated to minimize the distance between a
>> thread's preferred core, and that cores memory or neighbor memory.
>> The bit vector model does not take NUMA distance to memory into account.
>>
>> A thread would be allocated NUMA memory as close as possible,
>> and would continue to use that memory until it is recycled.
> <
> So, would an I/O driver (SATA) want to be closer to his memory or closer
> to his I/O (Host Bridge) ?
> <
> Would interrupt from the Host Bridge be directed to a core in this "Node" ?
> So, could the device driver be situated close to its memory but its interrupt
> handler closer to the device ?

I started to answer but the answer turned into a book. I'll try again.

Here we are talking specifically about HW threads servicing interrupts.
These would not be like normal application threads as they are part of
the OS kernel and are working on behalf of a cpu.
Prioritized HW interrupt threads are equivalent to prioritized
interrupt service routines (ISR).

If I was doing this there would be one HW thread for each core that can
service interrupts, for each HW priority interrupt request level (IRQL).
Each interrupt thread has a thread header plus a small (12-16 kB) stack.
These threads are created at OS boot, each has a hard affinity with
one core and the pages should be allocated from that core's home memory.

The Priority Interrupt Controller (PIC), or what x64 calls APIC,
controls to which core interrupts are routed. If any interrupt can be
routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
and if we have 64 cores, then there are 448 HW threads dedicated
to servicing interrupts (which is why we don't do this).
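
To make the arithmetic concrete, here is a sketch of the per-core, per-level thread table such a scheme would build at boot (sizes and names are illustrative only; as noted above, real systems avoid doing this):

#include <stdio.h>

#define NCORES  64
#define NLEVELS 7          /* IRQL 1 (low) .. 7 (high) */
#define ISTACK  (16 * 1024)

/* One dedicated interrupt-service thread per core per priority level. */
struct int_thread {
    int  core;             /* hard affinity: never migrates */
    int  irql;
    char stack[ISTACK];    /* ideally allocated from `core`'s home memory */
};

static struct int_thread ithreads[NCORES][NLEVELS];

int main(void)
{
    for (int c = 0; c < NCORES; c++)
        for (int l = 0; l < NLEVELS; l++) {
            ithreads[c][l].core = c;
            ithreads[c][l].irql = l + 1;
        }
    printf("%d interrupt threads, %zu KiB of thread state\n",
           NCORES * NLEVELS, sizeof ithreads / 1024);
    return 0;
}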

The SMP interrupt service threads use per-device, per-interrupt spinlocks
to serialize their execution of each device's ISR.

To answer your question, we could program the PIC to route the
interrupts to the physically closest core, and that core's
interrupt service thread would use that core's memory for its stack.

If we know a particular device driver will always be serviced by a
single core then when the driver is loaded or the devices are mounted,
its code and data should also be allocated from that core's home memory.

In reality there is no need for full SMP servicing of interrupts
on any core. Alpha VMS handled all interrupts on core-0.
There are likely only a couple of cores that are near a bridge,
so in reality maybe we only need 2 cores with 7 interrupt threads each.

Also each core has a HW thread for executing at its
software interrupt level, executing things like what WNT calls
Deferred Procedure Calls (DPC) callback work packets
which can be posted by, amongst other things, ISR's to continue
processing an interrupt at a lower priority level.

Each core needs an idle thread to execute when it has no other work.
It does various housekeeping like pre-zeroing physical memory pages
for the free list.

> <
> Would a device page fault be directed at the interrupt handler or at I/O
> page fault handling driver ?

OS kernel code and data falls into two categories, pagable and non-pagable.
Some OS only allow non-pagable kernel memory.
If a page fault occurs in a non-pagable memory section, the OS crashes.
In any OS that I am familiar with, all interrupt code and data are
non-pagable in order to avoid all sorts of nasty race conditions.

when a device driver is loaded, the memory sections specified by its exe
are allocated and assigned physical pages at create time,
the pages are pinned so they never get outswapped,
and the code or data is copied from the exe file at driver load.
Similarly, as each device of that driver is mounted,
any local memory for that device is allocated from non-paged heap.

As I said above, if we know a certain driver or device will always
be serviced by a single core, OS should ensure that only local
physical memory is allocated at driver load and device mount.

>> Having two or more threads with shared process address space running on
>> different cores raises the question which NUMA node should the memory
>> be allocated on? Presumably each thread's stack would allocate on
>> the current core.
>>
>> Moving thread T1 from core C0 to core C1 might have a shared L2,
>> to core C3 have 1 hop back to the memory previously allocated on C0,
>> to core C4 have 2 hops back to previous memory allocated on C0.
>> This extra hop cost continues until physical pages are recycled.
>>
>> Moving a thread between cores also usually does not take into account
>> the loss of investment in the cache, and the delay to re-charge the cache.
>> Just guessing, but capacitor charge-discharge model probably applies.
>>
>> This suggests a hierarchy of costs to move up, across, and then down
>> between cores for load balancing to take into account.
> <
> Thanks for this observation.
> <
> Now what about multi-threaded cores, the cache cost of migrating around a
> MT core is low, but the throughput could be lower due to both (or many) cores
> sharing execution resources?

If an application, say a database server, creates multiple threads,
one per core and binds it to that core, then the physical memory for
the stack of each thread should come from the local home memory and
that would be optimal for that thread.

But there is still shared code and data pages. Whichever home the physical
memory for code and data is allocated from, it will be non-optimal for
all but one thread. No way around this.

If threads move from core to core to balance load, the OS scheduler needs
to take into account that this move is not free by having some hysteresis
in the algorithm to make things a bit sticky.

Re: Processor Affinity in the age of unbounded cores

<a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20105&group=comp.arch#20105

 by: MitchAlsup - Tue, 24 Aug 2021 19:01 UTC

On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> I have been giving some thought to processor affinity. But I want to
> >>> look at the future where there may be an essentially unbounded
> >>> number of cores.
> >>> <
> >>> The current model of 1-bit to denote "any core" and then a bit vector
> >>> to say "any of these cores can run this task/thread" falls apart when
> >>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> >>> The bit vector approach "does not scale".
> >>> <
> >>> So, I ask: what would a programming model for assigning processor
> >>> affinity when the number of cores is not easily mapped into common
> >>> register widths ?
> > <
> >> Physical memory is allocated to minimize the distance between a
> >> thread's preferred core, and that cores memory or neighbor memory.
> >> The bit vector model does not take NUMA distance to memory into account.
> >>
> >> A thread would be allocated NUMA memory as close as possible,
> >> and would continue to use that memory until it is recycled.
> > <
> > So, would an I/O driver (SATA) want to be closer to his memory or closer
> > to his I/O (Host Bridge) ?
> > <
> > Would interrupt from the Host Bridge be directed to a core in this "Node" ?
> > So, could the device driver be situated close to its memory but its interrupt
> > handler closer to the device ?
<
> I started to answer but the answer turned into a book. I'll try again.
<
LoL
<
>
> Here we are talking specifically about HW threads servicing interrupts.
> These would not be like normal application threads as they are part of
> the OS kernel and are working on behalf of a cpu.
> Prioritized HW interrupt threads are equivalent to prioritized
> interrupt service routines (ISR).
>
> If I was doing this there would be one HW thread for each core that can
> service interrupts, for each HW priority interrupt request level (IRQL).
> Each interrupt thread has a thread header plus a small (12-16 kB) stack.
> These threads are created at OS boot, each has a hard affinity with
> one core and the pages should be allocated from that core's home memory.
<
Ok, but let me throw a monkey wrench into this::
<
Say we have an environment supporting virtual devices. That is, there is
a table in memory that maps the virtual device into a physical device
so that the hypervisor can allow the hosted OS to perform its own I/O
to something like a SATA drive. The virtualized SATA driver gets a
virtual Bus:Device,Function, and his page map contains a page in
memory mapped I/O space pointing at virtual device MMIO control
registers. The virtual device driver, running down near user priority level
does 5-10 stores to the virtual device control registers to initiate a read or
write to the disk. The disk schedules the access as it chooses, and later on
when DMA is ready, the device table is used to associate virtual
Bus:Device,Function with physical Bus:Device,Function and also the
virtual machine and mapping tables of the OS this DMA request should
use.
<
>
> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> controls to which core interrupts are routed. If any interrupt can be
> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> and if we have 64 cores, then there are 448 HW threads dedicated
> to servicing interrupts (which is why we don't do this).
<
My recent research on this shows the number of interrupt "levels"
increasing to 256-ish interrupts, each interrupt vector having its
own priority and a few related things. AMD virtualization via I/O-MMU.
<
So
<
The DMA address gets translated through the OS mapping tables: the
requesting thread's virtual address is translated to a host virtual
address, which is then translated by the virtual machine mapping tables
to a physical address.
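
A sketch of that two-stage walk, guest-virtual to guest-physical to system-physical; the two lookup helpers are stubs standing in for whatever radix tables the I/O-MMU actually walks:

#include <stdint.h>
#include <stdio.h>

/* Stage 1: guest OS page tables,  guest-virtual  -> guest-physical.
 * Stage 2: hypervisor/VM tables,  guest-physical -> system-physical.
 * Both lookups are stubs here; a real I/O-MMU walks radix tables. */
static uint64_t guest_translate(uint64_t gva) { return gva ^ 0x100000u; }    /* stub */
static uint64_t host_translate(uint64_t gpa)  { return gpa + 0x40000000u; }  /* stub */

static uint64_t dma_translate(uint64_t guest_va)
{
    uint64_t guest_pa  = guest_translate(guest_va);  /* OS mapping tables */
    uint64_t system_pa = host_translate(guest_pa);   /* VM mapping tables */
    return system_pa;
}

int main(void)
{
    printf("DMA target: 0x%llx\n",
           (unsigned long long)dma_translate(0x7f001000ULL));
    return 0;
}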
<
{Now this is where it gets interesting}
<
After DMA is complete, the device sends an interrupt to the device's interrupt
handler (which is IN the OS under the HV). The interrupt causes a control
transfer into the interrupt handler; the interrupt handler reads a few control
registers in memory mapped I/O space, builds a message for a lower
level Task handler, and exits. The interrupt handler ran at "rather" high priority;
the Task handler runs at medium priority. The Task handler takes the message,
rummages through OS tables and figures out who needs (and what kind
of) notifications, and performs that work. Then the Task handler exits (in one
way or another) to the system scheduler. The system scheduler figures out who
to run next--which may have changed since the arrival of the interrupt.
<
All of the last several paragraphs are transpiring as other interrupts
are raised and various control transfers take place. {Sometimes in
rather bizarre orders due to timing; but it all works out.}
<
In this virtual device case:: the OS may not KNOW where its device actually
is {with paravirtualization it would} and so may not be able to properly
affinitize the handlers.
<
Anyway, I have figured out the vast majority of the device virtualization
requirements, and have been working on the kinds of HW services
that are required to make this stuff fast {Mostly just caching and careful
Table organization.}
>
> The SMP interrupt service threads use per-device, per-interrupt spinlocks
> to serialize their execution of each device's ISR.
<
I have figured out a way to perform this serialization without any
kind of locking......
>
> To answer your question, we could program the PIC to route the
> interrupts to the physically closest core, and that core's
> interrupt service thread would use that core's memory for its stack.
<
Yes, route interrupts from the virtual device to the virtual requestor
interrupt handler and perform control transfer {based on priority of
the interrupt and the priority of the core set it is allowed to run on}
>
> If we know a particular device driver will always be serviced by a
> single core then when the driver is loaded or the devices are mounted,
> its code and data should also be allocated from that core's home memory.
<
So somebody in the chain of command granting guest access to virtual
device does the affinity or pinning of the interrupt threads to core set.
<
Check.
<
I suspect for completeness, the 'task' handler is pinned to the same
core set.
<
Now imagine that the interrupt handler can transfer control to the
task handler without an excursion through the OS (or HV) process
schedulers.
<
And that when Task Handler is done, it can return to interrupted process
(letting it run) while OS (and HV) search through their scheduling tables
to figure out what is the next process to run. {Remember I am talking
about a machine with inherent parallelism.}
<
When the OS/HV decides that process[k] is the next thing to run on
core[j] there is a means to cause a remote core[j] to context switch
from what it was doing to process[k] in a single instruction. Control
is transferred and core[j] begins running process[k], while the OS/HV
thread continues on unabated. It spawned process[k] onto core[j]
without losing control of who it was or where it is.
>
> In reality there is no need for full SMP servicing of interrupts
> on any core. Alpha VMS handled all interrupts on core-0.
> There are likely only a couple of cores that are near a bridge,
> so in reality maybe we only need 2 cores with 7 interrupt threads each.
<
Simplifications are nice. But Alpha was from an era where there
were not that many nodes in a system. I wonder what the
best model is for the future when those assumptions no longer
hold.
>
> Also each core has a HW thread for executing at its
> software interrupt level, executing things like what WNT calls
> Deferred Procedure Calls (DPC) callback work packets
> which can be posted by, amongst other things, ISR's to continue
> processing an interrupt at a lower priority level.
>
> Each core needs an idle thread to execute when it has no other work.
> It does various housekeeping like pre-zeroing physical memory pages
> for the free list.
<
Unless you have a means to make it simply appear as if the page
were filled with zeroes. In which case you allocate the page upon
call to malloc, and as you touch every line it automagically gets filled
with zeros (without reading memory to fill the lines--like Mill).
<
But there always seems to be an idle process somewhere running
at a priority lower than everyone else.
<
Thanks for reminding me of this.
> > <
> > Would a device page fault be directed at the interrupt handler or at I/O
> > page fault handling driver ?
<
> OS kernel code and data falls into two categories, pagable and non-pagable.
> Some OS only allow non-pagable kernel memory.
> If a page fault occurs in a non-pagable memory section, the OS crashes.
<
Question: In an OS under HV, does the HV have to know that OS page
is pinned ? Or since OS is under HV, HV can page this out, OS takes
page fault into HV, HV services and returns, so as far as the OS "sees"
the page is "resident". On the other hand, enough latency might have
transpired for the device to have failed before interrupt handler could
service interrupt. !?!?
<
> In any OS that I am familiar with, all interrupt code and data are
> non-pagable in order to avoid all sorts of nasty race conditions.
<
Check, and good to know (be reminded of)
>
> when a device driver is loaded, the memory sections specified by its exe
> are allocated and assigned physical pages at create time,
> the pages are pinned so they never get outswapped,
> and the code or data is copied from the exe file at driver load.
> Similarly, as each device of that driver is mounted,
> any local memory for that device is allocated from non-paged heap.
<
I suspect that when one plugs in a USB device this happens at that instant.
>
> As I said above, if we know a certain driver or device will always
> be serviced by a single core, OS should ensure that only local
> physical memory is allocated at driver load and device mount.
<
Check
<
> >> Having two or more threads with shared process address space running on
> >> different cores raises the question which NUMA node should the memory
> >> be allocated on? Presumably each thread's stack would allocate on
> >> the current core.
> >>
> >> Moving thread T1 from core C0 to core C1 might have a shared L2,
> >> to core C3 have 1 hop back to the memory previously allocated on C0,
> >> to core C4 have 2 hops back to previous memory allocated on C0.
> >> This extra hop cost continues until physical pages are recycled.
> >>
> >> Moving a thread between cores also usually does not take into account
> >> the loss of investment in the cache, and the delay to re-charge the cache.
> >> Just guessing, but capacitor charge-discharge model probably applies.
> >>
> >> This suggests a hierarchy of costs to move up, across, and then down
> >> between cores for load balancing to take into account.
> > <
> > Thanks for this observation.
> > <
> > Now what about multi-threaded cores, the cache cost of migrating around a
> > MT core is low, but the throughput could be lower due to both (or many) cores
> > sharing execution resources?
<
> If an application, say a database server, creates multiple threads,
> one per core and binds it to that core, then the physical memory for
> the stack of each thread should come from the local home memory and
> that would be optimal for that thread.
>
> But there is still shared code and data pages. Whichever home the physical
> memory for code and data is allocated from, it will be non-optimal for
> all but one thread. No way around this.
<
Yes, no way around the database actually occupying good chunks of
memory on every node on which there is memory.
>
> If threads move from core to core to balance load, the OS scheduler needs
> to take into account that this move is not free by having some hysteresis
> in the algorithm to make things a bit sticky.
<
My guess is that the database wants to move units of work from DBThread to
DBThread so each DBThread remains near its working memory. That is,
the work might migrate around, but the core-cache footprint remains
localized.
<
While the OS threads are managed as the OS sees fit, under
whatever affinity is supported in the OS.


Re: Processor Affinity in the age of unbounded cores

<wmDVI.3213$z%4.1108@fx37.iad>


https://www.novabbs.com/devel/article-flat.php?id=20129&group=comp.arch#20129

 by: EricP - Thu, 26 Aug 2021 02:52 UTC

MitchAlsup wrote:
> On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
>>>> MitchAlsup wrote:
>>>>> I have been giving some thought to processor affinity. But I want to
>>>>> look at the future where there may be an essentially unbounded
>>>>> number of cores.
>>>>> <
>>>>> The current model of 1-bit to denote "any core" and then a bit vector
>>>>> to say "any of these cores can run this task/thread" falls apart when
>>>>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>>>>> The bit vector approach "does not scale".
>>>>> <
>>>>> So, I ask: what would a programming model for assigning processor
>>>>> affinity when the number of cores is not easily mapped into common
>>>>> register widths ?
>>> <
>>>> Physical memory is allocated to minimize the distance between a
>>>> thread's preferred core, and that cores memory or neighbor memory.
>>>> The bit vector model does not take NUMA distance to memory into account.
>>>>
>>>> A thread would be allocated NUMA memory as close as possible,
>>>> and would continue to use that memory until it is recycled.
>>> <
>>> So, would an I/O driver (SATA) want to be closer to his memory or closer
>>> to his I/O (Host Bridge) ?
>>> <
>>> Would interrupt from the Host Bridge be directed to a core in this "Node" ?
>>> So, could the device driver be situated close to its memory but its interrupt
>>> handler closer to the device ?
> <
>> I started to answer but the answer turned into a book. I'll try again.
> <
> LoL
> <
>> Here we are talking specifically about HW threads servicing interrupts.
>> These would not be like normal application threads as they are part of
>> the OS kernel and are working on behalf of a cpu.
>> Prioritized HW interrupt threads are equivalent to prioritized
>> interrupt service routines (ISR).
>>
>> If I was doing this there would be one HW thread for each core that can
>> service interrupts, for each HW priority interrupt request level (IRQL).
>> Each interrupt thread has a thread header plus a small (12-16 kB) stack.
>> These threads are created at OS boot, each has a hard affinity with
>> one core and the pages should be allocated from that core's home memory.
> <
> Ok, but let me throw a monkey wrench into this::
> <
> Say we have an environment supporting virtual devices. That is, there is
> a table in memory that maps the virtual device into a physical device
> so that the hypervisor can allow the hosted OS to perform its own I/O
> to something like a SATA drive. The virtualized SATA driver gets a
> virtual Bus:Device,Function, and his page map contains a page in
> memory mapped I/O space pointing at virtual device MMIO control
> registers. The virtual device driver, running down near user priority level
> does 5-10 stores to the virtual device control registers to initiate a read or
> write to the disk. The disk schedules the access as it chooses, and later on
> when DMA is ready, the device table is used to associate virtual
> Bus:Device,Function with physical Bus:Device,Function and also the
> virtual machine and mapping tables of the OS this DMA request should
> use.

You left out a few steps.

- mapping the virtual device ids (or device handle) to an actual device
- checking permissions for that operation on that device
- validate IO arguments (e.g. buffer address and range)
- checking IO size vs quotas, maybe break 1 big IO into many smaller
- translate virtual addresses to physical, choose number to pin
- if virtual page not Present then inswap
- pin physical page to prevent removal/recycling
- queue request to device
- when device available, queue for a free DMA channel
- when DMA channel available program actual device hardware
- raise interrupt priority to block device interrupts
- spinlock to prevent concurrent access to device interrupt data
- write device control registers

On IO completion interrupt undo all the things we did above,
releasing resources and possibly allowing other queued IO to continue.

Maybe loop over all of this if the IO was larger than resource allocations
allow to be allocated at once to a single requester.

There is also asynchronous IO cancellation to deal with,
both when IO is queued and, depending on the device, possibly while running.
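
The first item on that list - mapping the virtual device id to the physical
device and the translation root its DMA should use - is in essence one table
lookup. A minimal, self-contained sketch (the entry layout, field names, and
numbers are all invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical device-table entry: maps a guest-visible (virtual)
       Bus:Device,Function to a physical one, plus the I/O-MMU translation
       root to use for that guest's DMA. */
    typedef struct {
        uint16_t virt_bdf;     /* guest bus:dev.fn (8:5:3 bits)         */
        uint16_t phys_bdf;     /* physical bus:dev.fn                   */
        uint64_t dma_root;     /* translation-table root for this guest */
        uint8_t  valid;
    } vdev_entry_t;

    static const vdev_entry_t *
    vdev_lookup(const vdev_entry_t *tbl, int n, uint16_t virt_bdf)
    {
        for (int i = 0; i < n; i++)
            if (tbl[i].valid && tbl[i].virt_bdf == virt_bdf)
                return &tbl[i];
        return 0;              /* no such virtual device for this guest */
    }

    int main(void)
    {
        vdev_entry_t tbl[] = {
            { 0x0810, 0x2318, 0x7f400000ull, 1 },      /* made-up values */
        };
        const vdev_entry_t *e = vdev_lookup(tbl, 1, 0x0810);
        if (e)
            printf("virt %04x -> phys %04x, dma root %#llx\n",
                   e->virt_bdf, e->phys_bdf, (unsigned long long)e->dma_root);
        return 0;
    }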

> <
>> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
>> controls to which core interrupts are routed. If any interrupt can be
>> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
>> and if we have 64 cores, then there are 448 HW threads dedicated
>> to servicing interrupts (which is why we don't do this).
> <
> My recent research on this shows the number of interrupt "levels"
> increasing to 256-ish interrupts, each interrupt vector having its
> own priority and a few related things. AMD virtualization via I/O-MMU.

There are two issues: one is the number of interrupt priority levels,
and one is the ability to determine directly which device is interrupting.

I have heard of 255 levels on embedded systems where designs have a fixed,
well-known set of devices to deal with; they like to use one interrupt level
per device interrupt. Because the OS is often doing hard real-time,
having a unique priority for each device source fits nicely into
the deadline scheduling model.

For general purpose computers, a large number of priority interrupt
levels offers nothing.
However being able to directly determine which device id requested
an interrupt saves polling all the devices on a daisy chain.
One could have a small number of priorities, say 7 or 9,
but also something like PCI's Message Queued Interrupts to put
the requesting device's ID number into the queue for that priority,
even when connected to a huge number of devices.
If a request queue is not empty then a hardware interrupt is
requested at the queue's priority.
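
A toy sketch of that combination - a handful of priority levels, each with a
ring of pending requester IDs, so the handler learns directly which device
asked instead of polling a daisy chain (structure and sizes are invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define NPRIO  8          /* small number of priority levels           */
    #define QDEPTH 64         /* entries per level, sized for illustration */

    /* One ring of pending device IDs per interrupt priority level. */
    typedef struct {
        uint16_t dev_id[QDEPTH];      /* e.g. PCI requester ID (bus:dev.fn) */
        uint32_t head, tail;
    } irq_ring_t;

    static irq_ring_t irq_q[NPRIO];

    /* Device (or MSI write) side: post an interrupt request.  A non-empty
       ring means a hardware interrupt is asserted at that priority. */
    bool irq_post(unsigned prio, uint16_t dev_id)
    {
        irq_ring_t *q = &irq_q[prio];
        if (q->tail - q->head == QDEPTH)
            return false;                         /* ring full */
        q->dev_id[q->tail++ % QDEPTH] = dev_id;
        return true;
    }

    /* CPU side: service the highest-priority non-empty ring. */
    bool irq_fetch(unsigned *prio_out, uint16_t *dev_out)
    {
        for (int p = NPRIO - 1; p >= 0; p--) {
            irq_ring_t *q = &irq_q[p];
            if (q->tail != q->head) {
                *prio_out = (unsigned)p;
                *dev_out  = q->dev_id[q->head++ % QDEPTH];
                return true;
            }
        }
        return false;
    }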

This is why I defined my interrupt architecture as model specific,
because there is no one size fits all. Similarly, the cpu's interface
for interrupt handling is also model specific.

> <
> So
> <
> DMA address gets translated though the OS mapping tables using
> the requesting thread virtual address translated to host virtual address
> which is translated by the virtual machine mapping tables to physical
> address.

check

> <
> {Now this is where it gets interesting}
> <
> After DMA is complete, device sends an interrupt to devices interrupt
> handler (which is IN the OS under the HV). Interrupt causes a control
> transfer into interrupt handler, interrupt handler reads a few control
> registers in memory mapped I/O space, builds a message for a lower
> level Task handler , and exits. Interrupt handler ran at "rather" high priority,
> task handler runs at medium priority. Task handler takes message
> rummages through OS tables and figures out who needs (and what kind
> of) notifications, and performs that work. Then Task handler exits (in one
> way or another) to the system scheduler. System scheduler figures out who
> to run next--which may have changed since the arrival of the interrupt.

Yes.
The notifications to other tasks (or oneself) can complete a wait operation.
Completing a wait may transition a thread from Wait to Ready state
if it was not already Ready or Running.
Transition of thread to Ready state requests scheduler.
Eventually scheduler selects thread that requested IO which performs
any final resource deallocation inside the process address space.

I just realized there are two different schedulers.
One for threads that service HW like interrupts and other non-pageable code.
One for threads that can perform waits, like for IO or page faults.

> <
> All of the last several paragraphs are transpiring as other interrupts
> are raised and various control transfers take place. {Sometimes in
> rather bazaar orders due to timing; but it all works out.}

Yes

> <
> In this virtual device case:: the OS may not KNOW where its device actually
> is {with paravirtualization it would} and so may not be able to properly
> affinitize the handlers.
> <
> Anyway, I have figured out the vast majority of the device virtualization
> requirements, and have been working on the kinds of HW services
> are required to make this stuff fast {Mostly just caching and careful
> Table organization.}
>> The SMP interrupt service threads use per-device, per-interrupt spinlocks
>> to serialize their execution of each device's ISR.
> <
> I have figured out a way to perform this serialization without any
> kind of locking......


Click here to read the complete article
Re: Processor Affinity in the age of unbounded cores

<d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20134&group=comp.arch#20134

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:4495:: with SMTP id x21mr4962863qkp.378.1630001802995;
Thu, 26 Aug 2021 11:16:42 -0700 (PDT)
X-Received: by 2002:a05:6830:2809:: with SMTP id w9mr4596625otu.114.1630001802695;
Thu, 26 Aug 2021 11:16:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 26 Aug 2021 11:16:42 -0700 (PDT)
In-Reply-To: <wmDVI.3213$z%4.1108@fx37.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 26 Aug 2021 18:16:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 191
 by: MitchAlsup - Thu, 26 Aug 2021 18:16 UTC

On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:

> > Say we have an environment supporting virtual devices. That is, there is
> > a table in memory that maps the virtual device into a physical device
> > so that the hypervisor can allow the hosted OS to perform its own I/O
> > to something like a SATA drive. The virtualized SATA driver gets a
> > virtual Bus:Device,Function, and his page map contains a page in
> > memory mapped I/O space pointing at virtual device MMIO control
> > registers. The virtual device driver, running down near user priority level
> > does 5-10 stores to the virtual device control registers to initiate a read or
> > write to the disk. The disk schedules the access as it chooses, and later on
> > when DMA is ready, the device table is used to associate virtual
> > Bus:Device,Function with physical Bus:Device,Function and also the
> > virtual machine and mapping tables of the OS this DMA request should
> > use.
> You left out a few steps.
>
> - mapping the virtual device ids (or device handle) to an actual device
> - checking permissions for that operation on that device
> - validate IO arguments (e.g. buffer address and range)
> - checking IO size vs quotas, maybe break 1 big IO into many smaller
> - translate virtual addresses to physical, choose number to pin
> - if virtual page not Present then inswap
> - pin physical page to prevent removal/recycling
> - queue request to device
<
I am assuming this is just std OS/HV stuff running privileged but
not at very high priority.
<
> - when device available, queue for a free DMA channel
> - when DMA channel available program actual device hardware
<
OS or HV on paravirtualized OS
<
> - raise interrupt priority to block device interrupts
> - spinlock to prevent concurrent access to device interrupt data
<
This spinlock is what I think I can get rid of.
<
> - write device control registers
>
> On IO completion interrupt undo all the things we did above,
> releasing resources and possibly allowing other queued IO to continue.
<
Is the above paragraph done at the Task handler level or the interrupt handler level?
>
> Maybe loop over all of this if the IO was larger than resource allocations
> allow to be allocated at once to a single requester.
>
> There is also asynchronous IO cancellation to deal with,
> both when IO is queued and, depending on the device, possibly while running.
> > <
> >> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> >> controls to which core interrupts are routed. If any interrupt can be
> >> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> >> and if we have 64 cores, then there are 448 HW threads dedicated
> >> to servicing interrupts (which is why we don't do this).
> > <
> > My recent research on this shows the number of interrupt "levels"
> > increasing to 256-ish interrupts, each interrupt vector having its
> > own priority and a few related things. AMD virtualization via I/O-MMU.
<
> There are two issues: one is the number of interrupt priority levels,
<
There are 64 priority levels; both handlers and processes run on these,
although this is not fixed in stone.
<
> and one is the ability to determine directly which device is interrupting.
<
The physical device sends a virtual interrupt; the interrupt vector table
determines the priority and which processor and core the interrupt handler
should run on.
>
> I have heard of 255 levels on embedded systems where designs have a fixed,
> well known set devices to deal with, they like to use one interrupt level
> per device interrupt. Because the OS is often doing hard real-time,
> having a unique priority for each device source fits nicely into
> the deadline scheduling model.
>
> For general purpose computers, a large number of priority interrupt
> levels offers nothing.
<
How few are enough? In any event I can fit 256 interrupt vectors,
64 exception vectors, 256 processes, and the 64 priority Queues
on a single page of memory. 64 was chosen merely because it fit.
<
> However being able to directly determine which device id requested
> an interrupt saves polling all the devices on a daisy chain.
<
The interrupt handler receives the virtual Bus:Device,Function, the Node,
the interrupt number, and the process which initiated the I/O.
<
> One could have a small number of priorities, say 7 or 9,
> but also something like PCI's Message Queued Interrupts to put
> the requesting device's ID number into the queue for that priority
> connected to a huge number of devices.
<
Yes, a large part of this priority queue is to provide a place where
message-signaled interrupts can land and wait for their turn at execution
resources.
<
> If a request queue is not empty then a hardware interrupt is
> requested at the queue's priority.
<
I was thinking more along the lines that this Queue reads from
the highest-priority non-blocked queue, builds Thread State, and
sends a context switch to a particular processor:core. The
message contains all the bits one needs to run the process; upon
arrival, whatever the processor:core was doing gets messaged
back to the queue.
<
Thus, for interrupts, one has thread state of the interrupt
handler along with the notification of (ahem) the interrupt.
That is, as seen at a processor:core, the notification (interrupt)
arrives only 1 cycle before the thread state, including registers.
>
> This is why I defined my interrupt architecture as model specific,
> because there is no one size fits all. Similarly, the cpu's interface
> for interrupt handling is also model specific.
> > <
> > So
> > <
> > DMA address gets translated though the OS mapping tables using
> > the requesting thread virtual address translated to host virtual address
> > which is translated by the virtual machine mapping tables to physical
> > address.
> check
> > <
> > {Now this is where it gets interesting}
> > <
> > After DMA is complete, device sends an interrupt to devices interrupt
> > handler (which is IN the OS under the HV). Interrupt causes a control
> > transfer into interrupt handler, interrupt handler reads a few control
> > registers in memory mapped I/O space, builds a message for a lower
> > level Task handler , and exits. Interrupt handler ran at "rather" high priority,
> > task handler runs at medium priority. Task handler takes message
> > rummages through OS tables and figures out who needs (and what kind
> > of) notifications, and performs that work. Then Task handler exits (in one
> > way or another) to the system scheduler. System scheduler figures out who
> > to run next--which may have changed since the arrival of the interrupt.
> Yes.
> The notifications to other tasks (or oneself) can complete a wait operation.
> Completing a wait may transition a thread from Wait to Ready state
> if it was not already Ready or Running.
> Transition of thread to Ready state requests scheduler.
> Eventually scheduler selects thread that requested IO which performs
> any final resource deallocation inside the process address space.
<
In my model: the completing I/O can send (MSI) an interrupt to the
waiting thread, and the hardware places the waiting process on the
appropriate priority queue, where it awaits execution cycles. No
need to invoke the scheduler (directly). Eventually the unWaited process
reaches the front of the highest non-Empty non-Blocked Queue,
and gets launched into execution with a Thread State message.
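
A toy software model of that flow, just to pin down the ordering: completion
parks the process directly on its priority queue, and "scheduling" is nothing
more than taking the front of the highest non-empty queue. Everything here is
invented for illustration; a real design would not shift arrays around:

    #include <stdio.h>
    #include <stdint.h>

    #define NPRIO 64
    #define NPROC 256

    typedef struct {
        int     id;
        uint8_t prio;
        int     waiting;                /* 1 = parked waiting on I/O */
    } proc_t;

    static proc_t *queue[NPRIO][NPROC];
    static int     qlen[NPRIO];

    /* What the hardware would do when the completion MSI arrives. */
    void io_complete(proc_t *p)
    {
        p->waiting = 0;
        queue[p->prio][qlen[p->prio]++] = p;   /* straight onto its queue */
    }

    /* When a core needs work: front of the highest-priority non-empty
       queue, with no software scheduler involved. */
    proc_t *pick_next(void)
    {
        for (int pr = NPRIO - 1; pr >= 0; pr--) {
            if (qlen[pr] > 0) {
                proc_t *p = queue[pr][0];
                for (int i = 1; i < qlen[pr]; i++)
                    queue[pr][i - 1] = queue[pr][i];
                qlen[pr]--;
                return p;
            }
        }
        return 0;
    }

    int main(void)
    {
        proc_t a = { 1, 10, 1 }, b = { 2, 40, 1 };
        io_complete(&a);
        io_complete(&b);
        proc_t *n = pick_next();
        printf("next to run: process %d (priority %u)\n", n->id, (unsigned)n->prio);
        return 0;
    }
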
>
> I just realized there are two different schedulers.
> One for threads that service HW like interrupts and other non-pagable code.
> One for threads that can can perform waits like for IO or page faults.
<
I saw this as two levels, your second level always being dependent on
the first. The first level runs non-pageable, the second runs pageable.
<
> > <
> > All of the last several paragraphs are transpiring as other interrupts
> > are raised and various control transfers take place. {Sometimes in
> > rather bazaar orders due to timing; but it all works out.}
> Yes
> > <
> > Unless you have a means to make it simply appear as if the page
> > were filled with zeroes. In which case you allocate the page upon
> > call to malloc, and as you touch every line it automagically gets filled
> > with zeros (without reading memory to fill the lines--like Mill).
<
> I mean inside the OS when physical pages are recycled.
> First they go onto the free-dirty list.
>
> The OS can have a pool of pre-zeroed physical pages
> for when someone faults a zero-init page.
>
> One of the things a background thread does is take free-dirty
> physical pages, zero them, put them on the free-zero list.
<
We could make this another Idle process (or at least really low priority)
until the free page pool runs rather dry, at which point its priority is
raised and it runs more often.
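
A sketch of what that worker's inner loop might look like, with the priority
answer folded in: zero one page per pass, and ask for a boost only when the
zeroed pool runs low. The list types, water mark, and priority plumbing are
invented, and locking against concurrent allocators is omitted:

    #include <string.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096
    #define LOW_WATER 32          /* boost when fewer zeroed pages than this */

    typedef struct page { struct page *next; unsigned char data[PAGE_SIZE]; } page_t;
    typedef struct { page_t *head; size_t count; } page_list_t;

    static page_t *pop(page_list_t *l) {
        page_t *p = l->head;
        if (p) { l->head = p->next; l->count--; }
        return p;
    }
    static void push(page_list_t *l, page_t *p) {
        p->next = l->head; l->head = p; l->count++;
    }

    /* One pass of the zeroing thread's main loop.  Returns the priority it
       should run at next: idle normally, boosted when the pool is dry. */
    int zero_one_page(page_list_t *dirty, page_list_t *zeroed,
                      int idle_prio, int boosted_prio)
    {
        page_t *p = pop(dirty);
        if (p) {
            memset(p->data, 0, PAGE_SIZE);
            push(zeroed, p);
        }
        return (zeroed->count < LOW_WATER) ? boosted_prio : idle_prio;
    }
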
> > <
> > But there always seems to be an idle process somewhere running
> > at a priority lower than everyone else.
> > <
> > Thanks for reminding me of this.


Click here to read the complete article
Re: Processor Affinity in the age of unbounded cores

<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20197&group=comp.arch#20197

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a37:8242:: with SMTP id e63mr19418512qkd.294.1630275855430; Sun, 29 Aug 2021 15:24:15 -0700 (PDT)
X-Received: by 2002:a9d:7204:: with SMTP id u4mr17746421otj.276.1630275855218; Sun, 29 Aug 2021 15:24:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 29 Aug 2021 15:24:15 -0700 (PDT)
In-Reply-To: <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com> <CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com> <OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com> <wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 29 Aug 2021 22:24:15 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 22
 by: MitchAlsup - Sun, 29 Aug 2021 22:24 UTC

On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
<
Given a definition where Node means chip; and chip can contain
multiple processors, each processor having several cores (SMT or
other), along with memory controllers, Node repeater links, and
host bridges::
<
It seems to me that in an age of unbounded processors and cores,
using the affinity relationship to find a processor suitable for running
an affinitized process is harder than simply moving an already affinitized
process between RUN<->WAIT.
<
Thus, one can suitably "throw" HW at the run<->wait latency*, but
one should never bother to throw HW at the affinity problem--that
should always remain a problem the OS(HV) has to solve.
<
But this could lead to it taking longer to affinitize a process to
a processor (and a suitable core thereon) than it is to context switch
into an already affinitized process.
<
(*) both as seen in total overhead cycles and as seen by either OS(HV)
controlling a process, or by process being controlled.

Re: Processor Affinity in the age of unbounded cores

<f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20201&group=comp.arch#20201

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:622a:34c:: with SMTP id r12mr4655593qtw.147.1630280716120;
Sun, 29 Aug 2021 16:45:16 -0700 (PDT)
X-Received: by 2002:a9d:4e96:: with SMTP id v22mr16856373otk.110.1630280715871;
Sun, 29 Aug 2021 16:45:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 29 Aug 2021 16:45:15 -0700 (PDT)
In-Reply-To: <1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=136.50.182.0; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 136.50.182.0
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Sun, 29 Aug 2021 23:45:16 +0000
Content-Type: text/plain; charset="UTF-8"
 by: JimBrakefield - Sun, 29 Aug 2021 23:45 UTC

On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
> On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> <
> Given a definition where Node means chip; and chip can contain
> multiple processors, each processor having several cores (SMT or
> other), along with memory controllers, Node repeater links, and
> host bridges::
> <
> It seems to me that in an age of unbounded processors and cores,
> using the affinity relationship to find a processor suitable for running
> an affinitized process is harder than simply moving an already affinitized
> process between RUN<->WAIT.
> <
> Thus, one can suitably "throw" HW at the run<->wait latency*, but
> one should never bother to throw HW at the affinity problem--that
> should always remain a problem the OS(HV) has to solve.
> <
> But this could lead to it being taking longer to affinitize a process to
> a processor (and suitable core thereon) that it is to context switch
> into an already affinitized process.
> <
> (*) both as seen in total overhead cycles and as seen by either OS(HV)
> controlling a process, or by process being controlled.

Affinity allocation of cores would seem to require three aspects:

A) Definition of an affinity metric.
B) Parameters in the core(s) request (Number of cores, duration, memory
scratch space, existing memory allocation, cost for moving memory and
cache contents, etc).
C) Hardware to evaluate the affinity metric across available resources.

Re: Processor Affinity in the age of unbounded cores

<08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20204&group=comp.arch#20204

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a0c:d804:: with SMTP id h4mr21582347qvj.37.1630288720988;
Sun, 29 Aug 2021 18:58:40 -0700 (PDT)
X-Received: by 2002:a05:6808:1283:: with SMTP id a3mr12753863oiw.99.1630288720780;
Sun, 29 Aug 2021 18:58:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 29 Aug 2021 18:58:40 -0700 (PDT)
In-Reply-To: <f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com> <f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 30 Aug 2021 01:58:40 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 30 Aug 2021 01:58 UTC

On Sunday, August 29, 2021 at 6:45:17 PM UTC-5, JimBrakefield wrote:
> On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
> > On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> > > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> > <
> > Given a definition where Node means chip; and chip can contain
> > multiple processors, each processor having several cores (SMT or
> > other), along with memory controllers, Node repeater links, and
> > host bridges::
> > <
> > It seems to me that in an age of unbounded processors and cores,
> > using the affinity relationship to find a processor suitable for running
> > an affinitized process is harder than simply moving an already affinitized
> > process between RUN<->WAIT.
> > <
> > Thus, one can suitably "throw" HW at the run<->wait latency*, but
> > one should never bother to throw HW at the affinity problem--that
> > should always remain a problem the OS(HV) has to solve.
> > <
> > But this could lead to it being taking longer to affinitize a process to
> > a processor (and suitable core thereon) that it is to context switch
> > into an already affinitized process.
> > <
> > (*) both as seen in total overhead cycles and as seen by either OS(HV)
> > controlling a process, or by process being controlled.
> Affinity allocation of cores would seem to require three aspects:
>
> A) Definition of an affinity metric.
<
Yes, we agreed earlier in this thread that this metric would take on the
kind of addressing used in networks--more hierarchical and less flat
than the current bit vector approach.
<
> B) Parameters in the core(s) request (Number of cores, duration, memory
> scratch space, existing memory allocation, cost for moving memory and
> cache contents, etc).
<
I think we agreed earlier that there would be a number of Nodes in a system
each node would have several processors and at least one Host Bridge
and at least one Memory Controller. Each processor could have several
cores, each core could run several threads.
<
In addition, we agreed that the reason to use affinity was to preserve
the momentum of the cache hierarchy (including TLBs and Table Walk
caches), either by running "things that conflict" in different processors,
or by running "things that share" in the same processors.
<
Greatest performance occurs when a process continues to run on the
original core.
<
Next greatest comes when there is another core in the same cache hierarchy
that can run process.
<
But sooner or later, someone has to decide that process[k] wants to run
and since core[c] is idle, why not just run k on c.
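
As a sketch of how a hierarchical, network-style affinity address could encode
exactly that preference order - same core beats same cache hierarchy beats
anything idle - here is one possible shape; the field widths and the 0..3
distance scale are arbitrary illustration choices:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint16_t node;        /* chip                        */
        uint8_t  processor;   /* processor (cache hierarchy) */
        uint8_t  core;        /* core within that processor  */
    } affinity_t;

    static int distance(affinity_t a, affinity_t b)
    {
        if (a.node      != b.node)      return 3;   /* cross-chip                 */
        if (a.processor != b.processor) return 2;   /* different cache hierarchy  */
        if (a.core      != b.core)      return 1;   /* same hierarchy, other core */
        return 0;                                   /* original core: best case   */
    }

    /* process[k] wants to run: among the idle cores, take the one closest
       to where k last ran. */
    int pick_core(affinity_t last, const affinity_t *idle, size_t n_idle)
    {
        int best = -1, best_d = 4;
        for (size_t i = 0; i < n_idle; i++) {
            int d = distance(last, idle[i]);
            if (d < best_d) { best_d = d; best = (int)i; }
        }
        return best;          /* index into idle[], or -1 if none idle */
    }
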
<
> C) Hardware to evaluate the affinity metric across available resources.
<
Since we don't know what affinity data looks or even smells like (other than
smelling like network addressing) it is a bit early to think about HW
looking across the affinity data and deciding what goes on who.
<
But I think we can agree that the process of making this decision is considerably
harder than the job of picking up a process[k] already loosely bound to core[c]
and running it on core[c] again, and again, and again.
<
In fact, I am working on HW that performs context switching amongst
processes that have already been affinitized, leaving the OS to only need
to bind the process to the processor.......................

Re: Processor Affinity in the age of unbounded cores

<ae920718-3b77-4c6c-825e-0183f57c02e2n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20205&group=comp.arch#20205

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a37:b4d:: with SMTP id 74mr19727639qkl.92.1630290935171;
Sun, 29 Aug 2021 19:35:35 -0700 (PDT)
X-Received: by 2002:aca:a80a:: with SMTP id r10mr14259763oie.119.1630290934874;
Sun, 29 Aug 2021 19:35:34 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 29 Aug 2021 19:35:34 -0700 (PDT)
In-Reply-To: <08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=136.50.182.0; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 136.50.182.0
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com> <f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
<08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ae920718-3b77-4c6c-825e-0183f57c02e2n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Mon, 30 Aug 2021 02:35:35 +0000
Content-Type: text/plain; charset="UTF-8"
 by: JimBrakefield - Mon, 30 Aug 2021 02:35 UTC

On Sunday, August 29, 2021 at 8:58:42 PM UTC-5, MitchAlsup wrote:
> On Sunday, August 29, 2021 at 6:45:17 PM UTC-5, JimBrakefield wrote:
> > On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
> > > On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> > > > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> > > <
> > > Given a definition where Node means chip; and chip can contain
> > > multiple processors, each processor having several cores (SMT or
> > > other), along with memory controllers, Node repeater links, and
> > > host bridges::
> > > <
> > > It seems to me that in an age of unbounded processors and cores,
> > > using the affinity relationship to find a processor suitable for running
> > > an affinitized process is harder than simply moving an already affinitized
> > > process between RUN<->WAIT.
> > > <
> > > Thus, one can suitably "throw" HW at the run<->wait latency*, but
> > > one should never bother to throw HW at the affinity problem--that
> > > should always remain a problem the OS(HV) has to solve.
> > > <
> > > But this could lead to it being taking longer to affinitize a process to
> > > a processor (and suitable core thereon) that it is to context switch
> > > into an already affinitized process.
> > > <
> > > (*) both as seen in total overhead cycles and as seen by either OS(HV)
> > > controlling a process, or by process being controlled.
> > Affinity allocation of cores would seem to require three aspects:
> >
> > A) Definition of an affinity metric.
> <
> Yes, we agreed earlier in this thread that this metric would take on the
> kind of addressing used in networks--more hierarchical and less flat
> than the current bit vector approach.
> <
> > B) Parameters in the core(s) request (Number of cores, duration, memory
> > scratch space, existing memory allocation, cost for moving memory and
> > cache contents, etc).
> <
> I think we agreed earlier that there would be a number of Nodes in a system
> each node would have several processors and at least one Host Bridge
> and at least one Memory Controller. Each processor could have several
> cores, each core could run several threads.
> <
> In addition, we agreed that the reason to use affinity was to preserve
> the momentum of the cache hierarchy (including TLBs and Table Walk
> caches), either by running "things that conflict" in different processors,
> or by running "things that share" in the same processors.
> <
> Greatest performance occurs when a process continues to run on the
> original core.
> <
> Next greatest comes when there is another core in the same cache hierarchy
> that can run process.
> <
> But sooner or later, someone has to decide that process[k] wants to run
> and since core[c] is idle, why not just run k on c.
> <
> > C) Hardware to evaluate the affinity metric across available resources.
> <
> Since we don't know what affinity data looks or even smells like (other than
> smelling like network addressing) it is a bit early to think about HW
> looking across the affinity data and deciding what goes on who.
> <
> But I think we can agree that the process of making this decision is considerably
> harder than the job of picking up a process[k] already loosely bound to core[c]
> and running it on core[c] again, and again, and again.
> <
> In fact, I am working on HW that performs context switching amongst
> processes that have already been affinitized, leaving the OS to only need
> to bind the process to the processor.......................

Thanks, now I understand some of the full context of the subject.
And next time I promise to web-search any new terms, no matter how straightforward they sound.

Re: Processor Affinity in the age of unbounded cores

<sghj1r$ou0$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20207&group=comp.arch#20207

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Processor Affinity in the age of unbounded cores
Date: Sun, 29 Aug 2021 20:24:43 -0700
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <sghj1r$ou0$1@dont-email.me>
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad>
<928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad>
<a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad>
<d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>
<f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
<08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 30 Aug 2021 03:24:43 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4df7050554dd6e9f39923c89d95a7082";
logging-data="25536"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18BZms5OyULIvsj6I5O+F5z"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:i2SV4V5GShMCBGPFi9FrmtnhRtE=
In-Reply-To: <08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Mon, 30 Aug 2021 03:24 UTC

On 8/29/2021 6:58 PM, MitchAlsup wrote:
> On Sunday, August 29, 2021 at 6:45:17 PM UTC-5, JimBrakefield wrote:
>> On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
>>> On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
>>>> On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
>>> <
>>> Given a definition where Node means chip; and chip can contain
>>> multiple processors, each processor having several cores (SMT or
>>> other), along with memory controllers, Node repeater links, and
>>> host bridges::
>>> <
>>> It seems to me that in an age of unbounded processors and cores,
>>> using the affinity relationship to find a processor suitable for running
>>> an affinitized process is harder than simply moving an already affinitized
>>> process between RUN<->WAIT.
>>> <
>>> Thus, one can suitably "throw" HW at the run<->wait latency*, but
>>> one should never bother to throw HW at the affinity problem--that
>>> should always remain a problem the OS(HV) has to solve.
>>> <
>>> But this could lead to it being taking longer to affinitize a process to
>>> a processor (and suitable core thereon) that it is to context switch
>>> into an already affinitized process.
>>> <
>>> (*) both as seen in total overhead cycles and as seen by either OS(HV)
>>> controlling a process, or by process being controlled.
>> Affinity allocation of cores would seem to require three aspects:
>>
>> A) Definition of an affinity metric.
> <
> Yes, we agreed earlier in this thread that this metric would take on the
> kind of addressing used in networks--more hierarchical and less flat
> than the current bit vector approach.
> <
>> B) Parameters in the core(s) request (Number of cores, duration, memory
>> scratch space, existing memory allocation, cost for moving memory and
>> cache contents, etc).
> <
> I think we agreed earlier that there would be a number of Nodes in a system
> each node would have several processors and at least one Host Bridge
> and at least one Memory Controller. Each processor could have several
> cores, each core could run several threads.
> <
> In addition, we agreed that the reason to use affinity was to preserve
> the momentum of the cache hierarchy (including TLBs and Table Walk
> caches), either by running "things that conflict" in different processors,
> or by running "things that share" in the same processors.
> <
> Greatest performance occurs when a process continues to run on the
> original core.
> <
> Next greatest comes when there is another core in the same cache hierarchy
> that can run process.
> <
> But sooner or later, someone has to decide that process[k] wants to run
> and since core[c] is idle, why not just run k on c.
> <
>> C) Hardware to evaluate the affinity metric across available resources.
> <
> Since we don't know what affinity data looks or even smells like (other than
> smelling like network addressing) it is a bit early to think about HW
> looking across the affinity data and deciding what goes on who.
> <
> But I think we can agree that the process of making this decision is considerably
> harder than the job of picking up a process[k] already loosely bound to core[c]
> and running it on core[c] again, and again, and again.
> <
> In fact, I am working on HW that performs context switching amongst
> processes that have already been affinitized, leaving the OS to only need
> to bind the process to the processor.......................

Seems like a hierarchy like that is a baby step toward
processor-in-memory - why not go all the way?

Re: Processor Affinity in the age of unbounded cores

<5v8XI.3995$6U3.3542@fx43.iad>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20210&group=comp.arch#20210

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.swapon.de!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Processor Affinity in the age of unbounded cores
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com> <CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com> <OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com> <wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
In-Reply-To: <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 242
Message-ID: <5v8XI.3995$6U3.3542@fx43.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 30 Aug 2021 17:24:17 UTC
Date: Mon, 30 Aug 2021 13:23:31 -0400
X-Received-Bytes: 12675
 by: EricP - Mon, 30 Aug 2021 17:23 UTC

MitchAlsup wrote:
> On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
>
>>> Say we have an environment supporting virtual devices. That is, there is
>>> a table in memory that maps the virtual device into a physical device
>>> so that the hypervisor can allow the hosted OS to perform its own I/O
>>> to something like a SATA drive. The virtualized SATA driver gets a
>>> virtual Bus:Device,Function, and his page map contains a page in
>>> memory mapped I/O space pointing at virtual device MMIO control
>>> registers. The virtual device driver, running down near user priority level
>>> does 5-10 stores to the virtual device control registers to initiate a read or
>>> write to the disk. The disk schedules the access as it chooses, and later on
>>> when DMA is ready, the device table is used to associate virtual
>>> Bus:Device,Function with physical Bus:Device,Function and also the
>>> virtual machine and mapping tables of the OS this DMA request should
>>> use.
>> You left out a few steps.
>>
>> - mapping the virtual device ids (or device handle) to an actual device
>> - checking permissions for that operation on that device
>> - validate IO arguments (e.g. buffer address and range)
>> - checking IO size vs quotas, maybe break 1 big IO into many smaller
>> - translate virtual addresses to physical, choose number to pin
>> - if virtual page not Present then inswap
>> - pin physical page to prevent removal/recycling
>> - queue request to device
> <
> I am assuming this is just std OS/HV stuff running privileged but
> not at very high priority.

Yes. Most of these are cpu-level resources so are managed
at the same reentrancy level as kernel heap or scheduling.

> <
>> - when device available, queue for a free DMA channel
>> - when DMA channel available program actual device hardware
> <
> OS or HV on paravirtualized OS

Both, as the guest OS is going to manage them as though
it is a real SMP machine.

And at some point the HV has to turn them into real resources,
which means it does the same things.

> <
>> - raise interrupt priority to block device interrupts
>> - spinlock to prevent concurrent access to device interrupt data
> <
> This spinlock is what I think I can get rid of.

Replaced spinlocks by what?
The concurrent access to the shared memory & resources must be coordinated.
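
For what it is worth, one well-known way to coordinate this without a spinlock
(not necessarily what MitchAlsup has in mind) is to make the interrupt side a
lock-free producer and keep a single dispatch-level consumer; a minimal C11
sketch:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Multiple producers (ISRs on any core) push; the one consumer at
       dispatch level swaps the whole list out.  One CAS loop, one exchange,
       no lock. */
    typedef struct work {
        struct work *next;
        /* per-I/O completion data would live here */
    } work_t;

    static _Atomic(work_t *) pending = NULL;

    /* Called from interrupt level on any core. */
    void work_post(work_t *w)
    {
        work_t *old = atomic_load_explicit(&pending, memory_order_relaxed);
        do {
            w->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &pending, &old, w,
                     memory_order_release, memory_order_relaxed));
    }

    /* Called from the single dispatch-level service loop.  Returns a LIFO
       list of everything posted since the last call, or NULL. */
    work_t *work_take_all(void)
    {
        return atomic_exchange_explicit(&pending, NULL, memory_order_acquire);
    }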

> <
>> - write device control registers
>>
>> On IO completion interrupt undo all the things we did above,
>> releasing resources and possibly allowing other queued IO to continue.
> <
> Is the above paragraph done at the Task handler level or the interrupt handler level?

The OS scheduler and heap management level, whatever that is called.

It's all down to reentrancy - most of this code and these data structures
are not thread or interrupt reentrant.

This is the same problem as was discussed in the earlier thread about
asynchronous signal delivery and, say, memory heap management.
In that case heap access is disallowed from inside a signal routine
or any routine called by that signal routine. So we convert an
asynchronous signal to a synchronous message queued to the main
routine and poll the queue in the main loop.
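
In user space that conversion can be as small as this sketch: the handler only
records the event, and the non-reentrant work (heap, stdio, ...) happens when
the main loop gets around to it. A real program would use sigaction plus a
self-pipe or sigsuspend rather than a sleep-poll:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal;

    static void on_sigusr1(int sig)
    {
        (void)sig;
        got_signal = 1;            /* async-signal-safe: just set a flag */
    }

    int main(void)
    {
        signal(SIGUSR1, on_sigusr1);
        for (;;) {
            if (got_signal) {       /* "poll the queue in the main loop" */
                got_signal = 0;
                printf("servicing deferred signal work\n");  /* heap use OK here */
            }
            sleep(1);               /* placeholder for the real event wait */
        }
    }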

OS has the same reentrancy problems but for more things.
So we create this fiction of a software interrupt priority level
that sits just above where threads run and below where ISRs run,
where we loop servicing our queue of messages from interrupt routines
and accessing all the non-reentrant data structures, like the kernel heap
or scheduler, and doing so without blocking interrupts.
In WinNT the name of that fictional SWI level is Dispatch level.

Most of these resources are managed at "Dispatch" level.

Things like allocating and freeing a bounce buffer for DMA.
Or a DMA channel from a common pool. These might have multiple
IO requests waiting in a queue to use them.
As each IO completes, it frees its resources, which may dequeue the
next IO request and let it carry on with its own request while this
IO finishes completing.

By restricting allocate & free of these resources from being performed
at interrupt level, they do not have to block interrupts.
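
A sketch of one such pooled resource with its queue of waiters, showing how a
release at dispatch level hands the resource straight to the next queued IO
without ever blocking interrupts. The names and the fixed-size pool are
invented:

    #include <stddef.h>

    typedef struct io_req {
        struct io_req *next;
        void (*start)(struct io_req *, int channel);  /* continue this IO */
    } io_req_t;

    typedef struct {
        int       free_channels[8];
        int       n_free;
        io_req_t *wait_head, *wait_tail;
    } dma_pool_t;

    /* Either get a channel now, or park the request until one frees up. */
    int dma_acquire(dma_pool_t *p, io_req_t *req)
    {
        if (p->n_free > 0)
            return p->free_channels[--p->n_free];
        req->next = NULL;                      /* no channel: queue the request */
        if (p->wait_tail) p->wait_tail->next = req; else p->wait_head = req;
        p->wait_tail = req;
        return -1;
    }

    /* On completion: pass the channel to the next waiter, or return it. */
    void dma_release(dma_pool_t *p, int channel)
    {
        io_req_t *next = p->wait_head;
        if (next) {
            p->wait_head = next->next;
            if (!p->wait_head) p->wait_tail = NULL;
            next->start(next, channel);        /* queued IO continues */
        } else {
            p->free_channels[p->n_free++] = channel;
        }
    }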

>> Maybe loop over all of this if the IO was larger than resource allocations
>> allow to be allocated at once to a single requester.
>>
>> There is also asynchronous IO cancellation to deal with,
>> both when IO is queued and, depending on the device, possibly while running.
>>> <
>>>> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
>>>> controls to which core interrupts are routed. If any interrupt can be
>>>> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
>>>> and if we have 64 cores, then there are 448 HW threads dedicated
>>>> to servicing interrupts (which is why we don't do this).
>>> <
>>> My recent research on this shows the number of interrupt "levels"
>>> increasing to 256-ish interrupts, each interrupt vector having its
>>> own priority and a few related things. AMD virtualization via I/O-MMU.
> <
>> There are two issues: one is the number of interrupt priority levels,
> <
> There are 64 priority levels, both handlers and processes run on these.
> Although this is not fixed in stone
> <
>> and one is the ability to determine directly which device is interrupting.
> <
> physical device send virtual interrupt, interrupt vector table determine priority
> and which processor and core the interrupt handler should run.
>> I have heard of 255 levels on embedded systems where designs have a fixed,
>> well known set devices to deal with, they like to use one interrupt level
>> per device interrupt. Because the OS is often doing hard real-time,
>> having a unique priority for each device source fits nicely into
>> the deadline scheduling model.
>>
>> For general purpose computers, a large number of priority interrupt
>> levels offers nothing.
> <
> How few is enough, but in any event I can fit 256 interrupt voctors,
> 64 exception vectors, 256 processes, and the 64-priority Queues
> on a single page of memory. 64 was chosen merely because it fit.

For external devices I had three: low, medium, and high,
with three sub-levels in each, numbered from low to high,
as a concession that not all devices are equal.

Priority assignment is based on the nature of the interrupt.

Low is for patient devices.
Mostly signaling completion of a prior request, with no time deadline.
Disk read or write completion is an example.
Network card buffer transmit is also (but usually not buffer receive).

Medium is for impatient but recoverable devices.
This would be for signaling receipt of some message where
a buffer overrun can occur, but can be recovered at some cost.
Network card buffer receive is an example.
Also tape drive buffer read or write.

High is for impatient and non-recoverable devices.
This would be for signaling receipt of some message where
a buffer overrun can occur and cannot be recovered.
Reading an analog to digital converter could be an example.
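
Those classes and sub-levels boil down to a small ordered encoding, something
like the following; the exact numbers and the example assignments are
illustrative only:

    enum irq_class { IRQ_LOW = 0, IRQ_MEDIUM = 1, IRQ_HIGH = 2 };

    /* class*4 + sub gives a strictly ordered numeric priority, sub = 0..2 */
    static inline int irq_priority(enum irq_class c, int sub)
    {
        return (int)c * 4 + sub;
    }

    /* disk completion, NIC transmit done:  irq_priority(IRQ_LOW, 0)    */
    /* NIC receive, tape buffer:            irq_priority(IRQ_MEDIUM, 1) */
    /* ADC sample ready (unrecoverable):    irq_priority(IRQ_HIGH, 2)   */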

> <
>> However being able to directly determine which device id requested
>> an interrupt saves polling all the devices on a daisy chain.
> <
> Interrupt handler receives virtual Bus:Device,Function Node,
> Interrupt number, and the process which initiated the I/O.
> <
>> One could have a small number of priorities, say 7 or 9,
>> but also something like PCI's Message Queued Interrupts to put
>> the requesting device's ID number into the queue for that priority
>> connected to a huge number of devices.
> <
> Yes a large part of this priority queue is to put a place where message
> signaled interrupts can land and wait for their turn at execution
> resources.
> <
>> If a request queue is not empty then a hardware interrupt is
>> requested at the queue's priority.
> <
> I was thinking more along the lines that this Queue reads from
> the highest priority non-blocked queue, builds Thread State and
> sends a context switch to a particular processor:core. The
> message contains all the bits on needs to run the process, upon
> arrival, whatever the processor:core was doing gets messaged
> back to queue.
> <
> Thus, for interrupts, one has thread state of the interrupt
> handler along with the notification of (ahem) the interrupt.
> That is, as seen at a processor:core, the notification (interrupt)
> arrives only 1 cycle before the thread state, including registers.
>> This is why I defined my interrupt architecture as model specific,
>> because there is no one size fits all. Similarly, the cpu's interface
>> for interrupt handling is also model specific.
>>> <
>>> So
>>> <
>>> DMA address gets translated though the OS mapping tables using
>>> the requesting thread virtual address translated to host virtual address
>>> which is translated by the virtual machine mapping tables to physical
>>> address.
>> check
>>> <
>>> {Now this is where it gets interesting}
>>> <
>>> After DMA is complete, device sends an interrupt to devices interrupt
>>> handler (which is IN the OS under the HV). Interrupt causes a control
>>> transfer into interrupt handler, interrupt handler reads a few control
>>> registers in memory mapped I/O space, builds a message for a lower
>>> level Task handler , and exits. Interrupt handler ran at "rather" high priority,
>>> task handler runs at medium priority. Task handler takes message
>>> rummages through OS tables and figures out who needs (and what kind
>>> of) notifications, and performs that work. Then Task handler exits (in one
>>> way or another) to the system scheduler. System scheduler figures out who
>>> to run next--which may have changed since the arrival of the interrupt.
>> Yes.
>> The notifications to other tasks (or oneself) can complete a wait operation.
>> Completing a wait may transition a thread from Wait to Ready state
>> if it was not already Ready or Running.
>> Transition of thread to Ready state requests scheduler.
>> Eventually scheduler selects thread that requested IO which performs
>> any final resource deallocation inside the process address space.
> <
> In my model: The completing I/O can send (MSI) an interrupt to the
> waiting thread, and the hardware places the waiting process on
> appropriate priority queue where it awaits execution cycles. No
> need to invoke scheduler (directly). Eventually the unWaited process
> reached the front of the highest non-Empty non-Blocked Queue,
> and gets launched into execution with a Thread State message.
>> I just realized there are two different schedulers.
>> One for threads that service HW like interrupts and other non-pagable code.
>> One for threads that can can perform waits like for IO or page faults.
> <
> I saw this as 2-levels, your second level always being dependent on
> the first. The first level runs non-pagable, the second runs pagable.


Click here to read the complete article
Re: Processor Affinity in the age of unbounded cores

<f5202e1f-e68e-4472-b0ee-a23f3bdfe960n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=20214&group=comp.arch#20214

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ac8:7b47:: with SMTP id m7mr21693050qtu.178.1630345960941;
Mon, 30 Aug 2021 10:52:40 -0700 (PDT)
X-Received: by 2002:a9d:d35:: with SMTP id 50mr20255763oti.22.1630345960731;
Mon, 30 Aug 2021 10:52:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!feeder1.cambriumusenet.nl!feed.tweak.nl!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 30 Aug 2021 10:52:40 -0700 (PDT)
In-Reply-To: <sghj1r$ou0$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com> <f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
<08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com> <sghj1r$ou0$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f5202e1f-e68e-4472-b0ee-a23f3bdfe960n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 30 Aug 2021 17:52:40 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 30 Aug 2021 17:52 UTC

On Sunday, August 29, 2021 at 10:24:46 PM UTC-5, Ivan Godard wrote:
> On 8/29/2021 6:58 PM, MitchAlsup wrote:
> > On Sunday, August 29, 2021 at 6:45:17 PM UTC-5, JimBrakefield wrote:
> >> On Sunday, August 29, 2021 at 5:24:16 PM UTC-5, MitchAlsup wrote:
> >>> On Thursday, August 26, 2021 at 1:16:44 PM UTC-5, MitchAlsup wrote:
> >>>> On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> >>> <
> >>> Given a definition where Node means chip; and chip can contain
> >>> multiple processors, each processor having several cores (SMT or
> >>> other), along with memory controllers, Node repeater links, and
> >>> host bridges::
> >>> <
> >>> It seems to me that in an age of unbounded processors and cores,
> >>> using the affinity relationship to find a processor suitable for running
> >>> an affinitized process is harder than simply moving an already affinitized
> >>> process between RUN<->WAIT.
> >>> <
> >>> Thus, one can suitably "throw" HW at the run<->wait latency*, but
> >>> one should never bother to throw HW at the affinity problem--that
> >>> should always remain a problem the OS(HV) has to solve.
> >>> <
> >>> But this could lead to it being taking longer to affinitize a process to
> >>> a processor (and suitable core thereon) that it is to context switch
> >>> into an already affinitized process.
> >>> <
> >>> (*) both as seen in total overhead cycles and as seen by either OS(HV)
> >>> controlling a process, or by process being controlled.
> >> Affinity allocation of cores would seem to require three aspects:
> >>
> >> A) Definition of an affinity metric.
> > <
> > Yes, we agreed earlier in this thread that this metric would take on the
> > kind of addressing used in networks--more hierarchical and less flat
> > than the current bit vector approach.
> > <
> >> B) Parameters in the core(s) request (Number of cores, duration, memory
> >> scratch space, existing memory allocation, cost for moving memory and
> >> cache contents, etc).
> > <
> > I think we agreed earlier that there would be a number of Nodes in a system
> > each node would have several processors and at least one Host Bridge
> > and at least one Memory Controller. Each processor could have several
> > cores, each core could run several threads.
> > <
> > In addition, we agreed that the reason to use affinity was to preserve
> > the momentum of the cache hierarchy (including TLBs and Table Walk
> > caches), either by running "things that conflict" in different processors,
> > or by running "things that share" in the same processors.
> > <
> > Greatest performance occurs when a process continues to run on the
> > original core.
> > <
> > Next greatest comes when there is another core in the same cache hierarchy
> > that can run process.
> > <
> > But sooner or later, someone has to decide that process[k] wants to run
> > and since core[c] is idle, why not just run k on c.
> > <
> >> C) Hardware to evaluate the affinity metric across available resources.
> > <
> > Since we don't know what affinity data looks or even smells like (other than
> > smelling like network addressing) it is a bit early to think about HW
> > looking across the affinity data and deciding what goes on who.
> > <
> > But I think we can agree that the process of making this decision is considerably
> > harder than the job of picking up a process[k] already loosely bound to core[c]
> > and running it on core[c] again, and again, and again.
> > <
> > In fact, I am working on HW that performs context switching amongst
> > processes that have already been affinitized, leaving the OS to only need
> > to bind the process to the processor.......................
<
> Seems like a hierarchy like that is a baby step toward
> processor-in-memory - why not go all the way?
<
The place where it is easiest to see the scheduling data and the places
where the scheduled processes actually run need not correspond at all, and
those running processes have access to the entire 64-bit virtual address
space, not just local memory.

Re: Processor Affinity in the age of unbounded cores

<jwv7dg2dd6f.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=20215&group=comp.arch#20215

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Processor Affinity in the age of unbounded cores
Date: Mon, 30 Aug 2021 14:02:41 -0400
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <jwv7dg2dd6f.fsf-monnier+comp.arch@gnu.org>
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad>
<928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad>
<a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad>
<d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<1700af92-639a-4d5f-a58f-d2737dcf22b3n@googlegroups.com>
<f551dd0b-b124-49de-8b5d-47ea649297aen@googlegroups.com>
<08b49c7d-e070-4a50-b19a-3f58f30c78ben@googlegroups.com>
<sghj1r$ou0$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="dca61cc33b6a26671f4e64f8997fc78b";
logging-data="5267"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18er6CJOU2HiBB5vaC44qxj"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:EfFNqP+J8ZHvRgCsXHKTf8ZS4xc=
sha1:o53W8h45Kyw+lsWkQb+945f+sQ0=
 by: Stefan Monnier - Mon, 30 Aug 2021 18:02 UTC

> Seems like a hierarchy like that is a baby step toward processor-in-memory -
> why not go all the way?

I suspect that the design space between the two extremes is probably
more interesting than either extreme.

Stefan

Re: Processor Affinity in the age of unbounded cores

<6c758277-3654-4619-ba0c-65569f824773n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20226&group=comp.arch#20226

Newsgroups: comp.arch
X-Received: by 2002:ad4:58ea:: with SMTP id di10mr30232968qvb.60.1630427498978;
Tue, 31 Aug 2021 09:31:38 -0700 (PDT)
X-Received: by 2002:a9d:7204:: with SMTP id u4mr25510950otj.276.1630427498661;
Tue, 31 Aug 2021 09:31:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 31 Aug 2021 09:31:38 -0700 (PDT)
In-Reply-To: <5v8XI.3995$6U3.3542@fx43.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com>
<OV9VI.25431$%r2.16417@fx37.iad> <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
<wmDVI.3213$z%4.1108@fx37.iad> <d49e31a9-3dc0-4080-a1a0-66d118b01c6fn@googlegroups.com>
<5v8XI.3995$6U3.3542@fx43.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6c758277-3654-4619-ba0c-65569f824773n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 31 Aug 2021 16:31:38 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 239
 by: MitchAlsup - Tue, 31 Aug 2021 16:31 UTC

On Monday, August 30, 2021 at 12:24:21 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
> >
> >>> Say we have an environment supporting virtual devices. That is, there is
> >>> a table in memory that maps the virtual device into a physical device
> >>> so that the hypervisor can allow the hosted OS to perform its own I/O
> >>> to something like a SATA drive. The virtualized SATA driver gets a
> >>> virtual Bus:Device,Function, and his page map contains a page in
> >>> memory mapped I/O space pointing at virtual device MMIO control
> >>> registers. The virtual device driver, running down near user priority level
> >>> does 5-10 stores to the virtual device control registers to initiate a read or
> >>> write to the disk. The disk schedules the access as it chooses, and later on
> >>> when DMA is ready, the device table is used to associate virtual
> >>> Bus:Device,Function with physical Bus:Device,Function and also the
> >>> virtual machine and mapping tables of the OS this DMA request should
> >>> use.
> >> You left out a few steps.
> >>
> >> - mapping the virtual device ids (or device handle) to an actual device
> >> - checking permissions for that operation on that device
> >> - validate IO arguments (e.g. buffer address and range)
> >> - checking IO size vs quotas, maybe break 1 big IO into many smaller
> >> - translate virtual addresses to physical, choose number to pin
> >> - if virtual page not Present then inswap
> >> - pin physical page to prevent removal/recycling
> >> - queue request to device
> > <
> > I am assuming this is just std OS/HV stuff running privileged but
> > not at very high priority.
> Yes. Most of these are cpu-level resources so are managed
> at the same reentrancy level as kernel heap or scheduling.
> > <
> >> - when device available, queue for a free DMA channel
> >> - when DMA channel available program actual device hardware
> > <
> > OS or HV on paravirtualized OS
> Both as the guest OS is going to manage them as though
> it is a real SMP machine.
>
> And at some point HV has to turn them into real resources
> which means it does the same things.
> > <
> >> - raise interrupt priority to block device interrupts
> >> - spinlock to prevent concurrent access to device interrupt data
> > <
> > This spinlock is what I think I can get rid of.
> Replaced spinlocks by what?
> The concurrent access to the shared memory & resources must be coordinated.
> > <
> >> - write device control registers
> >>
> >> On IO completion interrupt undo all the things we did above,
> >> releasing resources and possibly allowing other queued IO to continue.
> > <
> > Is the above paragraph done at the Task handler level or the interrupt handler level?
> The OS scheduler and heap management level, whatever that is called.
>
> It's all down to reentrancy - most of this code and these data structures
> are not thread or interrupt reentrant.
<
I can agree that you put your finger on the problem. A set of state (a process)
can be running in at most one place at a time.
>
> This is the same problem as was discussed in the earlier thread about
> asynchronous signal delivery and, say, memory heap management.
> In that case heap access is disallowed from inside a signal routine
> or any routine called by that signal routine. So we convert an
> asynchronous signal to a synchronous message queued to the main
> routine and poll the queue in the main loop.
>
> OS has the same reentrancy problems but for more things.
> So we create this fiction of a software interrupt priority level,
> that sits just above where threads run and below where ISR's run,
> where we loop servicing our queue of message from interrupts routines
> and accessing all the non-reentrant data structures, like kernel heap
> or scheduler, and doing so without blocking interrupts.
> In WinNT the name of that fictional SWI level is Dispatch level.
>
> Most of these resources are managed at "Dispatch" level.
>
> Things like allocating and freeing a bounce buffer for DMA.
> Or a DMA channel from a common pool. These might have multiple
> IO requests waiting in a queue to use them.
> As each IO completes, it frees its resources, which may dequeue the
> next IO request, and cause it to continue its request while this
> IO continues to complete its.
>
> By restricting allocate & free of these resources from being performed
> at interrupt level, they do not have to block interrupts.
<
After thinking about this for 2 days, I think if I said any more I would
lose patentability. Sorry--really I am.
<
> >> Maybe loop over all of this if the IO was larger than resource allocations
> >> allow to be allocated at once to a single requester.
> >>
> >> There is also asynchronous IO cancellation to deal with,
> >> both when IO is queued and, depending on the device, possibly while running.
> >>> <
> >>>> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> >>>> controls to which core interrupts are routed. If any interrupt can be
> >>>> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> >>>> and if we have 64 cores, then there are 448 HW threads dedicated
> >>>> to servicing interrupts (which is why we don't do this).
> >>> <
> >>> My recent research on this shows the number of interrupt "levels"
> >>> increasing to 256-ish interrupts, each interrupt vector having its
> >>> own priority and a few related things. AMD virtualization via I/O-MMU.
> > <
> >> There are two issues: one is the number of interrupt priority levels,
> > <
> > There are 64 priority levels, both handlers and processes run on these.
> > Although this is not fixed in stone
> > <
> >> and one is the ability to determine directly which device is interrupting.
> > <
> > The physical device sends a virtual interrupt; the interrupt vector table determines the priority
> > and on which processor and core the interrupt handler should run.
> >> I have heard of 255 levels on embedded systems where designs have a fixed,
> >> well-known set of devices to deal with; they like to use one interrupt level
> >> per device interrupt. Because the OS is often doing hard real-time,
> >> having a unique priority for each device source fits nicely into
> >> the deadline scheduling model.
> >>
> >> For general purpose computers, a large number of priority interrupt
> >> levels offers nothing.
> > <
> > How few is enough? But in any event I can fit 256 interrupt vectors,
> > 64 exception vectors, 256 processes, and the 64 priority Queues
> > on a single page of memory. 64 was chosen merely because it fit.
> For external devices I had three, low, medium and high,
> with three sub levels in each numbered from low to high
> as a concession that not all devices are equal.
>
> Priority assignment is based on the nature of interrupt.
>
> Low is for patient devices.
> Mostly signaling completion of a prior request and have no time deadline.
> Disk read or write completion is an example.
> Network card buffer transmit is also (but usually not buffer receive).
>
> Medium is for impatient but recoverable devices.
> This would be for signaling receipt of some message where
> a buffer overrun can occur, but can be recovered at some cost.
> Network card buffer receive is an example.
> Also tape drive buffer read or write.
>
> High is for impatient and non-recoverable devices.
> This would be for signaling receipt of some message where
> a buffer overrun can occur and cannot be recovered.
> Reading an analog to digital converter could be an example.
> > <
> >> However being able to directly determine which device id requested
> >> an interrupt saves polling all the devices on a daisy chain.
> > <
> > Interrupt handler receives virtual Bus:Device,Function Node,
> > Interrupt number, and the process which initiated the I/O.
> > <
> >> One could have a small number of priorities, say 7 or 9,
> >> but also something like PCI's Message Queued Interrupts to put
> >> the requesting device's ID number into the queue for that priority
> >> connected to a huge number of devices.
> > <
> > Yes a large part of this priority queue is to put a place where message
> > signaled interrupts can land and wait for their turn at execution
> > resources.
> > <
> >> If a request queue is not empty then a hardware interrupt is
> >> requested at the queue's priority.
> > <
> > I was thinking more along the lines that this Queue reads from
> > the highest priority non-blocked queue, builds Thread State and
> > sends a context switch to a particular processor:core. The
> > message contains all the bits one needs to run the process; upon
> > arrival, whatever the processor:core was doing gets messaged
> > back to the queue.
> > <
> > Thus, for interrupts, one has thread state of the interrupt
> > handler along with the notification of (ahem) the interrupt.
> > That is, as seen at a processor:core, the notification (interrupt)
> > arrives only 1 cycle before the thread state, including registers.
> >> This is why I defined my interrupt architecture as model specific,
> >> because there is no one size fits all. Similarly, the cpu's interface
> >> for interrupt handling is also model specific.
> >>> <
> >>> So
> >>> <
> >>> DMA address gets translated though the OS mapping tables using
> >>> the requesting thread virtual address translated to host virtual address
> >>> which is translated by the virtual machine mapping tables to physical
> >>> address.
> >> check
> >>> <
> >>> {Now this is where it gets interesting}
> >>> <
> >>> After DMA is complete, the device sends an interrupt to the device's interrupt
> >>> handler (which is IN the OS under the HV). The interrupt causes a control
> >>> transfer into the interrupt handler, which reads a few control
> >>> registers in memory mapped I/O space, builds a message for a lower
> >>> level Task handler, and exits. The interrupt handler runs at "rather" high priority,
> >>> task handler runs at medium priority. Task handler takes message
> >>> rummages through OS tables and figures out who needs (and what kind
> >>> of) notifications, and performs that work. Then Task handler exits (in one
> >>> way or another) to the system scheduler. System scheduler figures out who
> >>> to run next--which may have changed since the arrival of the interrupt.
> >> Yes.
> >> The notifications to other tasks (or oneself) can complete a wait operation.
> >> Completing a wait may transition a thread from Wait to Ready state
> >> if it was not already Ready or Running.
> >> Transition of thread to Ready state requests scheduler.
> >> Eventually scheduler selects thread that requested IO which performs
> >> any final resource deallocation inside the process address space.
> > <
> > In my model: the completing I/O can send (MSI) an interrupt to the
> > waiting thread, and the hardware places the waiting process on the
> > appropriate priority queue where it awaits execution cycles. No
> > need to invoke the scheduler (directly). Eventually the unWaited process
> > reaches the front of the highest non-Empty non-Blocked Queue,
> > and gets launched into execution with a Thread State message.
> >> I just realized there are two different schedulers.
> >> One for threads that service HW like interrupts and other non-pagable code.
> >> One for threads that can perform waits like for IO or page faults.
> > <
> > I saw this as 2-levels, your second level always being dependent on
> > the first. The first level runs non-pagable, the second runs pagable.
> Yes, except I'm now thinking of 3 levels:
> - normal user and kernel threads which can do things like wait for IO
> - OS level HW threads which can do things like allocate and free
> resources, and schedule and wake up normal threads.
> - interrupt level thread to service device interrupts.
>
> It's all about controlling reentrancy, and where and when thread Wait
> states for things like page fault IO can occur, so that they cannot
> indefinitely block other OS activity.
<
A mighty thanks to EricP for annotating in great detail the interplay
between threads, interrupts, and OS functionality.
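
For concreteness, here is a minimal C sketch of the interrupt-handler-to-task-handler
handoff described above. The names, message fields, and queue shape are assumptions
made only to illustrate the split in reentrancy levels, not anyone's actual design:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical message from interrupt level down to the task/Dispatch
   level: just enough for the lower-priority handler to find the work. */
struct io_msg {
    uint32_t bus_dev_fn;   /* virtual Bus:Device,Function            */
    uint32_t vector;       /* interrupt number                       */
    uint64_t status;       /* snapshot of the device status register */
};

#define QLEN 64u           /* power of two; overflow handling omitted */

static struct io_msg q[QLEN];
static atomic_uint   q_head;   /* advanced by the interrupt handler */
static atomic_uint   q_tail;   /* advanced by the task handler      */

/* Runs at high (interrupt) priority: grab the facts and get out. */
void device_isr(uint32_t bdf, uint32_t vec, uint64_t status)
{
    unsigned h = atomic_load_explicit(&q_head, memory_order_relaxed);
    q[h % QLEN] = (struct io_msg){ bdf, vec, status };
    atomic_store_explicit(&q_head, h + 1, memory_order_release);
}

/* Runs at task/Dispatch level: free to touch the scheduler, kernel
   heap and wait lists, since no interrupt handler ever does. */
void task_handler(void)
{
    unsigned t = atomic_load_explicit(&q_tail, memory_order_relaxed);
    while (t != atomic_load_explicit(&q_head, memory_order_acquire)) {
        struct io_msg m = q[t++ % QLEN];
        /* complete waits, release the DMA channel, wake the requester */
        (void)m;
    }
    atomic_store_explicit(&q_tail, t, memory_order_relaxed);
}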


Re: Processor Affinity in the age of unbounded cores

<90b693a3-261d-4e2e-9aca-2b748cbeea85n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20231&group=comp.arch#20231

Newsgroups: comp.arch
X-Received: by 2002:a0c:df04:: with SMTP id g4mr22264044qvl.10.1630473967863;
Tue, 31 Aug 2021 22:26:07 -0700 (PDT)
X-Received: by 2002:a9d:d35:: with SMTP id 50mr26736557oti.22.1630473967606;
Tue, 31 Aug 2021 22:26:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 31 Aug 2021 22:26:07 -0700 (PDT)
In-Reply-To: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:f39d:2c00:b8fa:1f7d:7346:5a77;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:f39d:2c00:b8fa:1f7d:7346:5a77
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <90b693a3-261d-4e2e-9aca-2b748cbeea85n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Wed, 01 Sep 2021 05:26:07 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2656
 by: Quadibloc - Wed, 1 Sep 2021 05:26 UTC

On Monday, August 23, 2021 at 12:24:05 PM UTC-6, MitchAlsup wrote:

> The current model of 1-bit to denote "any core" and then a bit vector
> to say "any of these cores can run this task/thread" falls apart when
> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> The bit vector approach "does not scale".

One simplistic way to represent processor affinity for a chip, or
multi-chip module, containing a very large number of cores,
where it's awkward to use a bit vector of more than 64 bits, would
be this:

First, one designates the "home core" of a program. This is the ID of
the core that the program will start from.

That core would belong to a group of 64 cores, and so a bit vector
would then be used to indicate which of the (other) processors in
that group could also be used for that task or thread.

These groups of 64 cores would then be organized into bundles of
64 such groups. If the task or thread could be executed by cores
_outside_ the initial group of 64 cores, the next bit vector would indicate
which other groups of 64 cores could also be used for the task. (One
assumes that in that case, if the task is that large in scale, any individual
core in an indicated group of 64 cores could be used.)

And then, of course, how it scales up is obvious. The next 64-bit vector says
which groups of 4,096 cores outside the initial group of 4,096 cores can be
used for the program, and so on and so forth.
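
A minimal C sketch of that encoding, with made-up field names and the depth
fixed at three levels just to make the lookup concrete:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical three-level descriptor for the scheme above: a home
   core, its group of 64, and the surrounding groups of 4,096. */
typedef struct {
    uint32_t home_core;       /* core the program starts on                  */
    uint64_t core_mask;       /* other usable cores in the home group of 64  */
    uint64_t group_mask;      /* other usable groups of 64 (any core inside) */
    uint64_t supergroup_mask; /* other usable groups of 4,096                */
} affinity_t;

/* May a thread with affinity 'a' run on absolute core id 'core'? */
static bool affinity_allows(const affinity_t *a, uint32_t core)
{
    if (core == a->home_core)
        return true;
    if (core / 64 == a->home_core / 64)        /* same group of 64    */
        return (a->core_mask >> (core % 64)) & 1;
    if (core / 4096 == a->home_core / 4096)    /* same group of 4,096 */
        return (a->group_mask >> ((core / 64) % 64)) & 1;
    return (a->supergroup_mask >> ((core / 4096) % 64)) & 1;
}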

John Savard

Re: Processor Affinity in the age of unbounded cores

<jwv35qoscxl.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=20232&group=comp.arch#20232

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Processor Affinity in the age of unbounded cores
Date: Wed, 01 Sep 2021 08:26:08 -0400
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <jwv35qoscxl.fsf-monnier+comp.arch@gnu.org>
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com>
<90b693a3-261d-4e2e-9aca-2b748cbeea85n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="8c73a2455d4f44a0e20d3a8de4360a09";
logging-data="25086"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19LIoTD/SnptNfTFCmWVTbG"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:YCeALlf79eth0Wr4PUSyd2arbnc=
sha1:Uj9h2uHDqi6X+0kafnK8dbSp/os=
 by: Stefan Monnier - Wed, 1 Sep 2021 12:26 UTC

> And then, of course, how it scales up is obvious. The next 64-bit vector says
> which groups of 4,096 cores outside the initial group of 4,096 cores can be
> used for the program, and so on and so forth.

So you're proposing a kind of floating-point system, where the
granularity/mantissa specifies "64 things" and the "exponent" specifies
whether those things are cores, groups of cores (e.g. chips), or groups of
chips (e.g. boards), ...
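
One compact representation of that reading, sketched in C with assumed
names (the hierarchy levels here are illustrative, not anyone's design):

#include <stdint.h>

/* Hypothetical "floating-point" affinity: a 64-bit mask (the mantissa)
   plus a level (the exponent) saying what each mask bit stands for. */
enum aff_level { AFF_CORE, AFF_CHIP, AFF_BOARD };   /* assumed hierarchy */

typedef struct {
    enum aff_level level;   /* granularity of each bit             */
    uint64_t       mask;    /* which of the 64 units are permitted */
} aff_fp_t;

/* Example: any core on chips 2 and 5 of the home board. */
static const aff_fp_t wide_affinity = { AFF_CHIP, (1ull << 2) | (1ull << 5) };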

Stefan

Re: Processor Affinity in the age of unbounded cores

<hXLXI.14168$7w6.11238@fx42.iad>

https://www.novabbs.com/devel/article-flat.php?id=20233&group=comp.arch#20233

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx42.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Processor Affinity in the age of unbounded cores
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com> <90b693a3-261d-4e2e-9aca-2b748cbeea85n@googlegroups.com>
In-Reply-To: <90b693a3-261d-4e2e-9aca-2b748cbeea85n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 96
Message-ID: <hXLXI.14168$7w6.11238@fx42.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 01 Sep 2021 14:16:45 UTC
Date: Wed, 01 Sep 2021 10:16:22 -0400
X-Received-Bytes: 5043
 by: EricP - Wed, 1 Sep 2021 14:16 UTC

Quadibloc wrote:
> On Monday, August 23, 2021 at 12:24:05 PM UTC-6, MitchAlsup wrote:
>
>> The current model of 1-bit to denote "any core" and then a bit vector
>> to say "any of these cores can run this task/thread" falls apart when
>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
>> The bit vector approach "does not scale".
>
> One simplistic way to represent processor affinity for a chip, or
> multi-chip module, containing a very large number of cores,
> where it's awkward to use a bit vector of more than 64 bits, would
> be this:
>
> First, one designates the "home core" of a program. This is the ID of
> the core that the program will start from.
>
> That core would belong to a group of 64 cores, and so a bit vector
> would then be used to indicate which of the (other) processors in
> that group could also be used for that task or thread.
>
> These groups of 64 cores would then be organized into bundles of
> 64 such groups. If the task or thread could be executed by cores
> _outside_ the initial group of 64 cores, the next bit vector would indicate
> which other groups of 64 cores could also be used for the task. (One
> assumes that in that case, if the task is that large in scale, any individual
> core in an indicated group of 64 cores could be used.)
>
> And then, of course, how it scales up is obvious. The next 64-bit vector says
> which groups of 4,096 cores outside the initial group of 4,096 cores can be
> used for the program, and so on and so forth.
>
> John Savard

Combining what Stephen Fuld and I wrote, I'd summarize it as follows.
There are two kinds of thread affinities:
hard, which are fixed by the configuration of a particular system,
and soft, which are dynamic.

The basic hard thread affinity is a capability bit vector:
which cores are capable of running a thread at all.
Maybe only some cores have a piece of hardware like an FPU.

The next hard thread affinity is a ranked preference vector,
and there can be multiple cores of equal rank.
In a big-little system, low-priority threads might give equal
preference to the multiple little cores.

Another example of hard preference is based on network topology.
A particular core might have an attached device where it is
cheapest to talk to the device from that core.

   0--1        core  0 1 2 3
   |  |        rank  0 1 2 1
   3--2
       \
        IO
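
A minimal C sketch of those two hard-affinity pieces, with assumed names
and the four-core example above baked in:

#include <stdint.h>

#define NCORES 4   /* the four-core ring in the example above */

/* Hypothetical hard affinity for one thread: which cores are capable
   of running it at all, plus a static per-core preference rank
   (several cores may share a rank). */
typedef struct {
    uint64_t capable;        /* bit c set: core c has the needed HW, e.g. an FPU */
    uint8_t  rank[NCORES];   /* topology-derived preference, as in the figure    */
} hard_affinity_t;

/* The example above: all four cores capable, ranks 0 1 2 1. */
static const hard_affinity_t io_thread = {
    .capable = 0xF,
    .rank    = { 0, 1, 2, 1 },
};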

Soft affinity is based on the recent history of a thread: any cache
contents it has invested in loading, or the home directory of its physical pages.
As a thread runs it invests more in its current local core
and lowers its investment in prior cores.

Soft affinity interacts with the network topology graph you describe,
to give a dynamic cost to locating threads at different cores.
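
One way to make that "invests more / lowers its investment" bookkeeping
concrete, sketched in C under assumed names and constants:

#include <stdint.h>

#define NCORES 64

/* Hypothetical soft-affinity state: a per-thread estimate of how much
   cache/TLB state it still has on each core.  Bumped wherever it runs,
   decayed everywhere, so old homes fade as the thread migrates. */
typedef struct {
    uint32_t invest[NCORES];
} soft_affinity_t;

/* Called once per timeslice the thread spends on core 'running_on'. */
static void soft_affinity_tick(soft_affinity_t *sa, unsigned running_on)
{
    for (unsigned c = 0; c < NCORES; c++)
        sa->invest[c] -= sa->invest[c] / 4;   /* exponential decay        */
    sa->invest[running_on] += 256;            /* reinvest in current core */
}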

All the affinity info folds into how the OS thread scheduler works.

One consideration is how much effort one wants to put into the
scheduler algorithm. The above affinity problem isn't just a traveling
salesman problem; it's S traveling salesmen visiting C cities,
some with lovers, sometimes multiple ones, in different cities
that keep changing.

Another scheduler consideration is SMP data structures and locking.
If all the scheduler data is in one pile guarded by a single spinlock,
all cores can contend for it and you can wind up with a long queue
spin-waiting to reschedule.

That is why many SMP schedulers now separate their scheduling into
core-local lists which do not contend and require no spinlocks,
and global lists which are accessed infrequently and spinlock guarded.
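
A rough C11 sketch of that split (assumed names; push paths only):

#include <stdatomic.h>
#include <stddef.h>

struct thread {
    struct thread *next;
    /* register state, affinity, priority, ... */
};

struct runq { struct thread *head; };

/* Per-core ready list: only its own core touches it, so no lock. */
static struct runq local_rq[64];

/* Global balance list: any core may touch it, so it is spinlock guarded. */
static struct runq  global_rq;
static atomic_flag  global_lock = ATOMIC_FLAG_INIT;

static void local_push(unsigned core, struct thread *t)
{
    t->next = local_rq[core].head;            /* no contention, no lock */
    local_rq[core].head = t;
}

static void global_push(struct thread *t)
{
    while (atomic_flag_test_and_set_explicit(&global_lock,
                                             memory_order_acquire))
        ;                                     /* spin */
    t->next = global_rq.head;
    global_rq.head = t;
    atomic_flag_clear_explicit(&global_lock, memory_order_release);
}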

Also in a 4000 node NUMA SMP system, a remote cache hit can be 400 nsec
away and DRAM can be 500 nsec. The more things the scheduler touches,
the more chance it has of touching one of those 400 or 500 nsec things,
and then it winds up sitting around playing the kazoo for a while.

It may not be worthwhile over-thinking thread affinity because
in the end it may be too complicated for the scheduler to make
use of in a reasonable window of time.

It might be worthwhile working backwards from the OS scheduler
to see what affinity information it can make use of,
then looking at how that information can be acquired.
