Re: Processor Affinity in the age of unbounded cores

Newsgroups: comp.arch
Date: Tue, 24 Aug 2021 12:01:42 -0700 (PDT)
In-Reply-To: <OV9VI.25431$%r2.16417@fx37.iad>
References: <3ba9db9c-d4c9-4348-97a1-fe114e90249fn@googlegroups.com> <CASUI.9322$vA6.4623@fx23.iad> <928d098c-947e-4d19-85e8-d8c5261f9f3dn@googlegroups.com> <OV9VI.25431$%r2.16417@fx37.iad>
Message-ID: <a766152e-1c28-436a-af9c-a344c3565222n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)

On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, August 23, 2021 at 2:23:16 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> I have been giving some thought to processor affinity. But I want to
> >>> look at the future where there may be an essentially unbounded
> >>> number of cores.
> >>> <
> >>> The current model of 1-bit to denote "any core" and then a bit vector
> >>> to say "any of these cores can run this task/thread" falls apart when
> >>> number of cores is bigger than 32-cores (or 64-cores or 128-cores).
> >>> The bit vector approach "does not scale".
> >>> <
> >>> So, I ask: what would a programming model for assigning processor
> >>> affinity when the number of cores is not easily mapped into common
> >>> register widths ?
> > <
> >> Physical memory is allocated to minimize the distance between a
> >> thread's preferred core, and that core's memory or neighbor memory.
> >> The bit vector model does not take NUMA distance to memory into account.
> >>
> >> A thread would be allocated NUMA memory as close as possible,
> >> and would continue to use that memory until it is recycled.
> > <
> > So, would an I/O driver (SATA) want to be closer to his memory or closer
> > to his I/O (Host Bridge) ?
> > <
> > Would interrupt from the Host Bridge be directed to a core in this "Node" ?
> > So, could the device driver be situated close to its memory but its interrupt
> > handler closer to the device ?
<
> I started to answer but the answer turned into a book. I'll try again.
<
LoL
<
>
> Here we are talking specifically about HW threads servicing interrupts.
> These would not be like normal application threads as they are part of
> the OS kernel and are working on behalf of a cpu.
> Prioritized HW interrupt threads are equivalent to prioritized
> interrupt service routines (ISR).
>
> If I was doing this there would be one HW thread for each core that can
> service interrupts, for each HW priority interrupt request level (IRQL).
> Each interrupt thread has a thread header plus a small (12-16 kB) stack.
> These threads are created at OS boot, each has a hard affinity with
> one core and the pages should be allocated from that core's home memory.
<
Ok, but let me throw a monkey wrench into this::
<
Say we have an environment supporting virtual devices. That is, there is
a table in memory that maps the virtual device into a physical device
so that the hypervisor can allow the hosted OS to perform its own I/O
to something like a SATA drive. The virtualized SATA driver gets a
virtual Bus:Device,Function, and his page map contains a page in
memory-mapped I/O space pointing at the virtual device's MMIO control
registers. The virtual device driver, running down near user priority level,
does 5-10 stores to the virtual device control registers to initiate a read or
write to the disk. The disk schedules the access as it chooses, and later on,
when DMA is ready, the device table is used to associate the virtual
Bus:Device,Function with the physical Bus:Device,Function and also with the
virtual machine and mapping tables of the OS that this DMA request should
use.
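
Roughly, one entry in such a device table might carry something like the
following (every field name below is invented for illustration, not a
description of any shipping I/O-MMU format)::

    #include <stdint.h>

    /* Hypothetical device-table entry: maps a guest-visible virtual
       Bus:Device,Function to the physical one, and records which guest
       and which translation tables its DMA and interrupts should use. */
    struct vdev_entry {
        uint16_t virt_bdf;     /* virtual Bus:Device,Function            */
        uint16_t phys_bdf;     /* physical Bus:Device,Function           */
        uint32_t intr_vector;  /* vector to deliver on DMA completion    */
        uint64_t guest_id;     /* which virtual machine owns this device */
        uint64_t os_table;     /* root of the guest OS mapping tables    */
        uint64_t vm_table;     /* root of the virtual machine tables     */
        uint64_t mmio_page;    /* guest page mapped to the MMIO window   */
    };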
<
>
> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> controls to which core interrupts are routed. If any interrupt can be
> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> and if we have 64 cores, then there are 448 HW threads dedicated
> to servicing interrupts (which is why we don't do this).
<
My recent research on this shows the number of interrupt "levels"
increasing to 256-ish interrupt vectors, each vector having its
own priority and a few related things, as in AMD's virtualization via the I/O-MMU.
<
So
<
The DMA address gets translated through the OS mapping tables: the
requesting thread's virtual address is translated to a host virtual
address, which is then translated by the virtual machine mapping tables
to a physical address.
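
In outline, that two-stage walk is just the following (walk_tables() here
stands in for what the I/O-MMU hardware actually does; it is not a real API)::

    #include <stdint.h>

    /* Two-stage translation of a DMA address, as described above:
       requesting thread's virtual address -> (OS mapping tables) ->
       host virtual address -> (virtual machine tables) -> physical. */
    extern uint64_t walk_tables(uint64_t table_root, uint64_t addr);  /* placeholder */

    uint64_t dma_translate(uint64_t os_table, uint64_t vm_table, uint64_t thread_va)
    {
        uint64_t host_va = walk_tables(os_table, thread_va);
        return walk_tables(vm_table, host_va);
    }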
<
{Now this is where it gets interesting}
<
After DMA is complete, the device sends an interrupt to the device's interrupt
handler (which is IN the OS under the HV). The interrupt causes a control
transfer into the interrupt handler; the interrupt handler reads a few control
registers in memory-mapped I/O space, builds a message for a lower-level
task handler, and exits. The interrupt handler runs at "rather" high priority;
the task handler runs at medium priority. The task handler takes the message,
rummages through OS tables, figures out who needs notifications (and what kind),
and performs that work. Then the task handler exits (in one
way or another) to the system scheduler. The system scheduler figures out who
to run next--which may have changed since the arrival of the interrupt.
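
A minimal sketch of the interrupt-handler half of that split (the message
layout, register offsets, and post_to_task_handler() are placeholders, not
any real OS interface)::

    #include <stdint.h>

    /* Hypothetical message the high-priority interrupt handler hands
       down to the medium-priority task handler. */
    struct io_msg { uint32_t status, count; uint16_t virt_bdf; };

    enum { REG_STATUS = 0, REG_XFER_COUNT = 1 };         /* made-up MMIO offsets */

    extern void post_to_task_handler(struct io_msg m);   /* placeholder */

    /* Runs at "rather" high priority: a few MMIO reads, build the message,
       hand it off, exit. The task handler later rummages through the OS
       tables and delivers the notifications at medium priority. */
    void vdisk_isr(volatile uint32_t *mmio, uint16_t virt_bdf)
    {
        struct io_msg m = { mmio[REG_STATUS], mmio[REG_XFER_COUNT], virt_bdf };
        post_to_task_handler(m);
    }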
<
All of the last several paragraphs are transpiring while other interrupts
are raised and various control transfers take place. {Sometimes in
rather bizarre orders due to timing; but it all works out.}
<
In this virtual device case:: the OS may not KNOW where its device actually
is {with paravirtualization it would} and so may not be able to properly
affinitize the handlers.
<
Anyway, I have figured out the vast majority of the device virtualization
requirements, and have been working on the kinds of HW services that
are required to make this stuff fast {mostly just caching and careful
table organization}.
>
> The SMP interrupt service threads use per-device, per-interrupt spinlocks
> to serialize their execution of each device's ISR.
<
I have figured out a way to perform this serialization without any
kind of locking......
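
Not guessing at that scheme here; for comparison only, one conventional way
to get the same serialization without a spinlock is to funnel all of a
device's work through a single consumer::

    #include <stdatomic.h>
    #include <stddef.h>

    /* Conventional lock-free pattern (not the scheme alluded to above):
       any core can push a work item with a CAS; one designated consumer
       drains the whole list with a single exchange, so the device's
       service code runs on exactly one thread at a time, no spinlock. */
    struct work { struct work *next; /* ...payload... */ };

    struct dev_queue { _Atomic(struct work *) head; };

    extern void service_one(struct work *w);   /* placeholder for the ISR work */

    void push_work(struct dev_queue *q, struct work *w)
    {
        struct work *old = atomic_load(&q->head);
        do {
            w->next = old;
        } while (!atomic_compare_exchange_weak(&q->head, &old, w));
    }

    void drain(struct dev_queue *q)            /* single consumer only */
    {
        struct work *list = atomic_exchange(&q->head, NULL);
        while (list) {                         /* note: newest-first order */
            struct work *next = list->next;
            service_one(list);
            list = next;
        }
    }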
>
> To answer your question, we could program the PIC to route the
> interrupts to the physically closest core, and that core's
> interrupt service thread would use that core's memory for its stack.
<
Yes, route interrupts from the virtual device to the virtual requestor's
interrupt handler and perform the control transfer {based on the priority of
the interrupt and the priority of the core set it is allowed to run on}.
>
> If we know a particular device driver will always be serviced by a
> single core then when the driver is loaded or the devices are mounted,
> its code and data should also be allocated from that core's home memory.
<
So somebody in the chain of command granting guest access to the virtual
device does the affinitizing or pinning of the interrupt threads to a core set.
<
Check.
<
I suspect that, for completeness, the 'task' handler is pinned to the same
core set.
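
For a present-day point of reference, pinning a handler thread to a core set
looks roughly like this on Linux; CPU_ALLOC also happens to be one answer to
the original scaling question, since the mask is sized at run time rather
than to a register width (core numbers are arbitrary)::

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>

    /* Pin the calling thread to cores [first, first+count) of a machine
       with ncores cores; the mask is allocated to fit ncores, so nothing
       assumes the core count fits in 32 or 64 bits. */
    int pin_to_cores(int first, int count, int ncores)
    {
        cpu_set_t *set = CPU_ALLOC(ncores);
        size_t     sz  = CPU_ALLOC_SIZE(ncores);
        if (!set)
            return -1;
        CPU_ZERO_S(sz, set);
        for (int c = first; c < first + count; c++)
            CPU_SET_S(c, sz, set);
        int rc = sched_setaffinity(0, sz, set);   /* 0 = calling thread */
        CPU_FREE(set);
        return rc;                                /* 0 on success */
    }

e.g., pin_to_cores(8, 4, 1024) pins the caller to cores 8..11 of a
1024-core box.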
<
Now imagine that the interrupt handler can transfer control to the
task handler without an excursion through the OS (or HV) process
schedulers.
<
And that when the task handler is done, it can return to the interrupted process
(letting it run) while the OS (and HV) search through their scheduling tables
to figure out what is the next process to run. {Remember I am talking
about a machine with inherent parallelism.}
<
When the OS/HV decides that process[k] is the next thing to run on
core[j], there is a means to cause the remote core[j] to context switch
from what it was doing to process[k] in a single instruction. Control
is transferred and core[j] begins running process[k], while the OS/HV
thread continues on unabated. It has spawned process[k] onto core[j]
without losing control of who it was or where it is.
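
No current ISA has such an instruction, so purely to illustrate the software
shape, treat it as an intrinsic::

    struct context;                              /* saved register/state image */

    /* Hypothetical single-instruction remote dispatch; this intrinsic
       does not exist in any real ISA, it just names the operation. */
    extern void hw_dispatch_to_core(int core, struct context *ctx);

    void run_on(int core_j, struct context *process_k)
    {
        hw_dispatch_to_core(core_j, process_k);  /* core[j] switches to process[k] */
        /* ...and this OS/HV thread keeps scanning its scheduling tables... */
    }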
>
> In reality there is no need for full SMP servicing of interrupts
> on any core. Alpha VMS handled all interrupts on core-0.
> There are likely only a couple of cores that are near a bridge,
> so in reality maybe we only need 2 cores with 7 interrupt threads each.
<
Simplifications are nice. But Alpha was from an era where there
were not that many nodes in a system. I am wondering what the
best model is for the future when those assumptions no longer
hold.
>
> Also each core has a HW thread for executing at its
> software interrupt level, executing things like what WNT calls
> Deferred Procedure Calls (DPC) callback work packets
> which can be posted by, amongst other things, ISR's to continue
> processing an interrupt at a lower priority level.
>
> Each core needs an idle thread to execute when it has no other work.
> It does various housekeeping like pre-zeroing physical memory pages
> for the free list.
<
Unless you have a means to make it simply appear as if the page
were filled with zeroes. In which case you allocate the page upon the
call to malloc, and as you touch each line it automagically gets filled
with zeros (without reading memory to fill the lines--like Mill).
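
The software-visible half of this already exists: an anonymous mmap() reads
as zero and only gets a (pre-zeroed) physical frame on first touch; the
hardware trick above goes further by never reading memory to fill the lines::

    #include <stddef.h>
    #include <sys/mman.h>

    /* Anonymous mappings read as zero; a physical (pre-zeroed) frame is
       only assigned on first touch. This is the software view of "the
       page simply appears to be full of zeroes". */
    void *alloc_zeroed(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }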
<
But there always seems to be an idle process somewhere running
at a priority lower than everyone else.
<
Thanks for reminding me of this.
> > <
> > Would a device page fault be directed at the interrupt handler or at I/O
> > page fault handling driver ?
<
> OS kernel code and data falls into two categories, pagable and non-pagable.
> Some OS only allow non-pagable kernel memory.
> If a page fault occurs in a non-pagable memory section, the OS crashes.
<
Question: In an OS under a HV, does the HV have to know that an OS page
is pinned? Or, since the OS is under the HV, can the HV page this out, the
OS takes a page fault into the HV, the HV services it and returns, so as far
as the OS "sees" the page is "resident"? On the other hand, enough latency
might have transpired for the device to have failed before the interrupt
handler could service the interrupt. !?!?
<
> In any OS that I am familiar with, all interrupt code and data are
> non-pagable in order to avoid all sorts of nasty race conditions.
<
Check, and good to know (be reminded of)
>
> when a device driver is loaded, the memory sections specified by its exe
> are allocated and assigned physical pages at create time,
> the pages are pinned so they never get outswapped,
> and the code or data is copied from the exe file at driver load.
> Similarly, as each device of that driver is mounted,
> any local memory for that device is allocated from non-paged heap.
<
I suspect that when one plugs in a USB device this happens at that instant.
>
> As I said above, if we know a certain driver or device will always
> be serviced by a single core, OS should ensure that only local
> physical memory is allocated at driver load and device mount.
<
Check
<
> >> Having two or more threads with shared process address space running on
> >> different cores raises the question which NUMA node should the memory
> >> be allocated on? Presumably each thread's stack would allocate on
> >> the current core.
> >>
> >> Moving thread T1 from core C0 to core C1 might have a shared L2,
> >> to core C3 have 1 hop back to the memory previously allocated on C0,
> >> to core C4 have 2 hops back to previous memory allocated on C0.
> >> This extra hop cost continues until physical pages are recycled.
> >>
> >> Moving a thread between cores also usually does not take into account
> >> the loss of investment in the cache, and the delay to re-charge the cache.
> >> Just guessing, but capacitor charge-discharge model probably applies.
> >>
> >> This suggests a hierarchy of costs to move up, across, and then down
> >> between cores for load balancing to take into account.
> > <
> > Thanks for this observation.
> > <
> > Now what about multi-threaded cores, the cache cost of migrating around a
> > MT core is low, but the throughput could be lower due to both (or many) cores
> > sharing execution resources?
<
> If an application, say a database server, creates multiple threads,
> one per core and binds it to that core, then the physical memory for
> the stack of each thread should come from the local home memory and
> that would be optimal for that thread.
>
> But there is still shared code and data pages. Whichever home the physical
> memory for code and data is allocated from, it will be non-optimal for
> all but one thread. No way around this.
<
Yes, there is no way around the database actually occupying good chunks of
memory on every node on which there is memory.
>
> If threads move from core to core to balance load, the OS scheduler needs
> to take into account that this move is not free by having some hysteresis
> in the algorithm to make things a bit sticky.
<
My guess is that the database wants to move units of work from DBThread to
DBThread so that each DBThread remains near its working memory. That is,
the processing might migrate around, but the core-cache footprint remains
localized.
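
A toy sketch of that move-the-work-not-the-thread idea (node_of(), enqueue(),
and the worker array are assumptions, not any database's real interface)::

    /* Route a unit of work to the DBThread whose home NUMA node holds
       the data it will touch, so each thread's cache and memory
       footprint stays local while the work itself migrates. */
    struct db_worker { int home_node; /* ...per-thread work queue... */ };

    extern int  node_of(const void *data);                    /* assumption */
    extern void enqueue(struct db_worker *w, void *request);  /* assumption */

    void dispatch(void *request, const void *data,
                  struct db_worker *workers, int nworkers)
    {
        int node = node_of(data);
        for (int i = 0; i < nworkers; i++)
            if (workers[i].home_node == node) {
                enqueue(&workers[i], request);
                return;
            }
        enqueue(&workers[0], request);           /* no local worker: pick any */
    }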
<
Meanwhile the OS threads are managed as the OS sees fit, and are instructed under
whatever affinity is supported in the OS.
