Re: Processor Affinity in the age of unbounded cores

Newsgroups: comp.arch
Date: Tue, 31 Aug 2021 09:31:38 -0700 (PDT)
In-Reply-To: <5v8XI.3995$6U3.3542@fx43.iad>
Message-ID: <6c758277-3654-4619-ba0c-65569f824773n@googlegroups.com>
Subject: Re: Processor Affinity in the age of unbounded cores
From: MitchAl...@aol.com (MitchAlsup)

On Monday, August 30, 2021 at 12:24:21 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, August 25, 2021 at 9:53:19 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Tuesday, August 24, 2021 at 12:22:57 PM UTC-5, EricP wrote:
> >
> >>> Say we have an environment supporting virtual devices. That is, there is
> >>> a table in memory that maps the virtual device into a physical device
> >>> so that the hypervisor can allow the hosted OS to perform its own I/O
> >>> to something like a SATA drive. The virtualized SATA driver gets a
> >>> virtual Bus:Device,Function, and its page map contains a page in
> >>> memory-mapped I/O space pointing at the virtual device's MMIO control
> >>> registers. The virtual device driver, running down near user priority level,
> >>> does 5-10 stores to the virtual device control registers to initiate a read or
> >>> write to the disk. The disk schedules the access as it chooses, and later on
> >>> when DMA is ready, the device table is used to associate virtual
> >>> Bus:Device,Function with physical Bus:Device,Function and also the
> >>> virtual machine and mapping tables of the OS this DMA request should
> >>> use.
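
For concreteness, a minimal C sketch of the kind of device table this
implies; every name and field below is an assumption for illustration,
not something specified in the thread:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical entry mapping a guest's virtual Bus:Device,Function
       to the physical device and the translation context for its DMA. */
    struct dev_table_entry {
        uint16_t virt_bdf;      /* virtual  Bus:Device,Function      */
        uint16_t phys_bdf;      /* physical Bus:Device,Function      */
        uint64_t vm_id;         /* owning virtual machine            */
        uint64_t guest_pt_root; /* guest OS mapping-table root used
                                   when translating this DMA         */
    };

    /* Find the physical device for a DMA request tagged with a virtual
       BDF; returns NULL if the guest was never given such a device.  */
    static const struct dev_table_entry *
    dev_table_lookup(const struct dev_table_entry *tbl, int n,
                     uint16_t virt_bdf)
    {
        for (int i = 0; i < n; i++)
            if (tbl[i].virt_bdf == virt_bdf)
                return &tbl[i];
        return NULL;
    }
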
> >> You left out a few steps.
> >>
> >> - mapping the virtual device ids (or device handle) to an actual device
> >> - checking permissions for that operation on that device
> >> - validate IO arguments (e.g. buffer address and range)
> >> - checking IO size vs quotas, maybe breaking one big IO into many smaller
> >> - translate virtual addresses to physical, choose the number of pages to pin
> >> - if virtual page not Present then inswap
> >> - pin physical page to prevent removal/recycling
> >> - queue request to device
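
As a sketch only, EricP's steps might line up with code like the
following; every helper and type named here is hypothetical, not a real
kernel API:

    #include <stddef.h>
    #include <errno.h>

    struct device;                 /* opaque kernel types, assumed */
    struct pinned_pages;

    /* Hypothetical helpers, one per step in the list above. */
    struct device *map_virtual_device(int vdev);
    int  check_perms(struct device *d, int op);
    int  validate_range(void *buf, size_t len);
    int  inswap_and_pin(void *buf, size_t len, struct pinned_pages **pp);
    int  queue_request(struct device *d, int op, struct pinned_pages *pp);

    int io_submit(int vdev, int op, void *buf, size_t len)
    {
        struct device *dev = map_virtual_device(vdev);  /* id -> device */
        if (!dev)                      return -ENODEV;
        if (!check_perms(dev, op))     return -EACCES;  /* permissions  */
        if (!validate_range(buf, len)) return -EFAULT;  /* buffer+range */
        /* quota check and splitting of oversized IOs elided */

        struct pinned_pages *pp;
        if (inswap_and_pin(buf, len, &pp) < 0)  /* fault in + pin so the */
            return -ENOMEM;                     /* DMA target stays put  */

        return queue_request(dev, op, pp);      /* device picks it up    */
    }
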
> > <
> > I am assuming this is just std OS/HV stuff running privileged but
> > not at very high priority.
> Yes. Most of these are cpu-level resources so are managed
> at the same reentrancy level as kernel heap or scheduling.
> > <
> >> - when device available, queue for a free DMA channel
> >> - when DMA channel available program actual device hardware
> > <
> > OS or HV on paravirtualized OS
> Both, as the guest OS is going to manage them as though
> it were a real SMP machine.
>
> And at some point HV has to turn them into real resources
> which means it does the same things.
> > <
> >> - raise interrupt priority to block device interrupts
> >> - spinlock to prevent concurrent access to device interrupt data
> > <
> > This spinlock is what I think I can get rid of.
> Replaced spinlocks by what?
> The concurrent access to the shared memory & resources must be coordinated.
> > <
> >> - write device control registers
> >>
> >> On IO completion interrupt undo all the things we did above,
> >> releasing resources and possibly allowing other queued IO to continue.
> > <
> > Is the above paragraph done at the Task handler level or the interrupt handler level?
> The OS scheduler and heap management level, whatever that is called.
>
> It's all down to reentrancy - most of this code and these data structures
> are not thread- or interrupt-reentrant.
<
I can agree that you put your finger on the problem. A set of state (process)
can be running at most once.
>
> This is the same problem as was discussed in the earlier thread about
> asynchronous signal delivery and, say, memory heap management.
> In that case heap access is disallowed from inside a signal routine
> or any routine called by that signal routine. So we convert an
> asynchronous signal to a synchronous message queued to the main
> routine and poll the queue in the main loop.
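
The conversion EricP describes is a well-worn userland pattern; a
minimal, runnable C example (POSIX signals assumed):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* The handler does nothing but post a flag; the main loop polls it
       and does the real work (heap use included) synchronously.       */
    static volatile sig_atomic_t got_signal = 0;

    static void on_sigusr1(int sig)
    {
        (void)sig;
        got_signal = 1;        /* async-signal-safe: set a flag only */
    }

    int main(void)
    {
        signal(SIGUSR1, on_sigusr1);
        for (;;) {
            if (got_signal) {  /* poll the "queue" in the main loop */
                got_signal = 0;
                printf("servicing the signal synchronously\n");
            }
            pause();           /* sleep until the next signal; a real
                                  program would use sigsuspend() or the
                                  self-pipe trick to close the race
                                  between the check and this call    */
        }
    }
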
>
> OS has the same reentrancy problems but for more things.
> So we create this fiction of a software interrupt priority level
> that sits just above where threads run and below where ISRs run,
> where we loop servicing our queue of message from interrupts routines
> and accessing all the non-reentrant data structures, like kernel heap
> or scheduler, and doing so without blocking interrupts.
> In WinNT the name of that fictional SWI level is Dispatch level.
>
> Most of these resources are managed at "Dispatch" level.
>
> Things like allocating and freeing a bounce buffer for DMA.
> Or a DMA channel from a common pool. These might have multiple
> IO requests waiting in a queue to use them.
> As each IO completes, it frees its resources, which may dequeue the
> next IO request and let it proceed while this IO finishes
> completing its own.
>
> By restricting allocate & free of these resources from being performed
> at interrupt level, they do not have to block interrupts.
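
A minimal sketch of such a dispatch-level queue; the interrupt-safe
locking a real kernel would need around the list is noted but elided:

    /* Hypothetical dispatch-level work queue in the WinNT style EricP
       describes: ISRs only enqueue; this loop, running just above
       thread level but below ISR level, frees DMA channels, bounce
       buffers, etc. without ever blocking interrupts.               */
    struct work { struct work *next; void (*fn)(void *); void *arg; };

    static struct work *dpc_head;   /* drained only at dispatch level */

    void dpc_queue(struct work *w)  /* callable from an ISR */
    {
        /* a real kernel would use an interrupt-safe atomic push here */
        w->next = dpc_head;
        dpc_head = w;
    }

    void dpc_drain(void)            /* runs at "Dispatch" level */
    {
        while (dpc_head) {
            struct work *w = dpc_head;
            dpc_head = w->next;
            w->fn(w->arg);          /* e.g. free a bounce buffer, then
                                       dequeue and start the next IO */
        }
    }
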
<
After thinking about this for 2 days, I think if I said any more I would
lose patentability. Sorry--really I am.
<
> >> Maybe loop over all of this if the IO was larger than the resources
> >> that can be allocated at once to a single requester.
> >>
> >> There is also asynchronous IO cancellation to deal with,
> >> both when IO is queued and, depending on the device, possibly while running.
> >>> <
> >>>> The Priority Interrupt Controller (PIC), or what x64 calls APIC,
> >>>> controls to which core interrupts are routed. If any interrupt can be
> >>>> routed to any core, and if interrupt priorities are 1 (low) to 7 (high),
> >>>> and if we have 64 cores, then there are 448 HW threads dedicated
> >>>> to servicing interrupts (which is why we don't do this).
> >>> <
> >>> My recent research on this shows the number of interrupt "levels"
> >>> increasing to 256-ish interrupts, each interrupt vector having its
> >>> own priority and a few related things. AMD virtualization via I/O-MMU.
> > <
> >> There are two issues: one is the number of interrupt priority levels,
> > <
> > There are 64 priority levels; both handlers and processes run on these,
> > although this is not fixed in stone.
> > <
> >> and one is the ability to determine directly which device is interrupting.
> > <
> > The physical device sends a virtual interrupt; the interrupt vector table
> > determines the priority and which processor and core the interrupt handler
> > should run on.
> >> I have heard of 255 levels on embedded systems: where designs have a fixed,
> >> well-known set of devices to deal with, they like to use one interrupt level
> >> per device interrupt. Because the OS is often doing hard real-time,
> >> having a unique priority for each device source fits nicely into
> >> the deadline scheduling model.
> >>
> >> For general purpose computers, a large number of priority interrupt
> >> levels offers nothing.
> > <
> > How few are enough? In any event I can fit 256 interrupt vectors,
> > 64 exception vectors, 256 processes, and the 64 priority queues
> > on a single page of memory. 64 was chosen merely because it fit.
> For external devices I had three, low, medium and high,
> with three sub levels in each numbered from low to high
> as a concession that not all devices are equal.
>
> Priority assignment is based on the nature of interrupt.
>
> Low is for patient devices.
> Mostly signaling completion of a prior request and have no time deadline.
> Disk read or write completion is an example.
> Network card buffer transmit is also an example (but usually not buffer receive).
>
> Medium is for impatient but recoverable devices.
> This would be for signaling receipt of some message where
> a buffer overrun can occur, but can be recovered at some cost.
> Network card buffer receive is an example.
> Also tape drive buffer read or write.
>
> High is for impatient and non-recoverable devices.
> This would be for signaling receipt of some message where
> a buffer overrun can occur and cannot be recovered.
> Reading an analog to digital converter could be an example.
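
EricP's three classes with three sub-levels each could be tabulated as
below; the devices and the numeric encoding are illustrative only:

    /* Three-tier priority assignment by the nature of the interrupt. */
    enum irq_class {
        IRQ_LOW    = 1,  /* patient: no deadline (disk completion,
                            NIC transmit-done)                        */
        IRQ_MEDIUM = 2,  /* impatient, overrun recoverable at a cost
                            (NIC receive, tape buffer)                */
        IRQ_HIGH   = 3   /* impatient, overrun unrecoverable
                            (free-running A/D converter)              */
    };

    /* Three sub-levels within each class, numbered low to high. */
    static inline int irq_priority(enum irq_class c, int sub /* 0..2 */)
    {
        return (int)c * 3 + sub;   /* yields nine distinct priorities */
    }
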
> > <
> >> However being able to directly determine which device id requested
> >> an interrupt saves polling all the devices on a daisy chain.
> > <
> > The interrupt handler receives the virtual Bus:Device,Function node,
> > the interrupt number, and the process which initiated the I/O.
> > <
> >> One could have a small number of priorities, say 7 or 9,
> >> but also something like PCI's Message Queued Interrupts to put
> >> the requesting device's ID number into the queue for that priority
> >> connected to a huge number of devices.
> > <
> > Yes, a large part of this priority queue is to provide a place where
> > message-signaled interrupts can land and wait for their turn at
> > execution resources.
> > <
> >> If a request queue is not empty then a hardware interrupt is
> >> requested at the queue's priority.
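
A sketch of that queue-per-priority scheme in C; assert_irq() and the
sizing are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    /* Each priority has a queue of requesting device IDs; hardware
       asserts an interrupt at a priority whenever its queue goes
       non-empty. Overflow handling is elided.                      */
    #define NPRIO  8
    #define QDEPTH 256           /* power of two, so wraparound works */

    struct msi_queue {
        uint16_t dev_id[QDEPTH];
        unsigned head, tail;     /* head == tail means empty */
    };

    static struct msi_queue q[NPRIO];

    void assert_irq(int prio);   /* hypothetical: raise the HW line */

    void msi_post(int prio, uint16_t dev_id)
    {
        struct msi_queue *p = &q[prio];
        bool was_empty = (p->head == p->tail);
        p->dev_id[p->tail++ % QDEPTH] = dev_id;
        if (was_empty)
            assert_irq(prio);    /* queue just became non-empty */
    }
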
> > <
> > I was thinking more along the lines that this Queue reads from
> > the highest-priority non-blocked queue, builds Thread State, and
> > sends a context switch to a particular processor:core. The
> > message contains all the bits one needs to run the process; upon
> > arrival, whatever the processor:core was doing gets messaged
> > back to the queue.
> > <
> > Thus, for interrupts, one has the thread state of the interrupt
> > handler along with the notification of (ahem) the interrupt.
> > That is, as seen at a processor:core, the notification (interrupt)
> > arrives only 1 cycle before the thread state, including registers.
> >> This is why I defined my interrupt architecture as model specific,
> >> because there is no one size fits all. Similarly, the cpu's interface
> >> for interrupt handling is also model specific.
> >>> <
> >>> So
> >>> <
> >>> DMA address gets translated though the OS mapping tables using
> >>> the requesting thread virtual address translated to host virtual address
> >>> which is translated by the virtual machine mapping tables to physical
> >>> address.
> >> check
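
In code, the two-stage walk Mitch describes might look like this;
walk() stands in for both (unspecified) table walkers:

    #include <stdint.h>

    uint64_t walk(uint64_t table_root, uint64_t addr);  /* hypothetical */

    /* Stage 1: the requesting thread's address through the guest OS
       mapping tables; stage 2: the result through the virtual-machine
       mapping tables down to a physical address.                      */
    uint64_t dma_translate(uint64_t guest_pt_root,
                           uint64_t vm_pt_root,
                           uint64_t guest_virt)
    {
        uint64_t host_virt = walk(guest_pt_root, guest_virt);
        uint64_t host_phys = walk(vm_pt_root,    host_virt);
        return host_phys;   /* what the device's DMA actually targets */
    }
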
> >>> <
> >>> {Now this is where it gets interesting}
> >>> <
> >>> After DMA is complete, the device sends an interrupt to the device's
> >>> interrupt handler (which is IN the OS under the HV). The interrupt causes
> >>> a control transfer into the interrupt handler, which reads a few control
> >>> registers in memory-mapped I/O space, builds a message for a lower-level
> >>> Task handler, and exits. The interrupt handler ran at "rather" high priority;
> >>> the Task handler runs at medium priority. The Task handler takes the message,
> >>> rummages through OS tables, figures out who needs (and what kind
> >>> of) notifications, and performs that work. Then the Task handler exits (in one
> >>> way or another) to the system scheduler. System scheduler figures out who
> >>> to run next--which may have changed since the arrival of the interrupt.
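
A hedged sketch of that interrupt-handler half; the MMIO helper, the
message queue, and the register offsets are all invented for
illustration:

    #include <stdint.h>

    struct io_done_msg { uint16_t bdf; uint32_t status; uint32_t count; };

    /* Assumed primitives: an MMIO read and a queue to the Task handler. */
    uint32_t mmio_read(volatile void *regs, unsigned off);
    void     post_to_task_handler(struct io_done_msg m);

    /* Runs at "rather" high priority: a few control-register reads,
       build a message, exit.                                        */
    void device_isr(volatile void *regs, uint16_t bdf)
    {
        struct io_done_msg m;
        m.bdf    = bdf;
        m.status = mmio_read(regs, 0x00);
        m.count  = mmio_read(regs, 0x08);
        post_to_task_handler(m);   /* table rummaging and notification
                                      happen later, at medium priority */
    }
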
> >> Yes.
> >> The notifications to other tasks (or oneself) can complete a wait operation.
> >> Completing a wait may transition a thread from Wait to Ready state
> >> if it was not already Ready or Running.
> >> Transition of thread to Ready state requests scheduler.
> >> Eventually scheduler selects thread that requested IO which performs
> >> any final resource deallocation inside the process address space.
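
The state transitions EricP walks through, in miniature;
request_scheduler() is a placeholder:

    enum tstate { T_RUNNING, T_READY, T_WAITING };

    struct thread { enum tstate state; };

    void request_scheduler(void);    /* hypothetical */

    /* Completing a wait moves a Waiting thread to Ready and pokes the
       scheduler; already Running or Ready threads are left alone.    */
    void complete_wait(struct thread *t)
    {
        if (t->state == T_WAITING) {
            t->state = T_READY;
            request_scheduler();     /* eventually selects t, which then
                                        frees its remaining IO resources */
        }
    }
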
> > <
> > In my model: the completing I/O can send (MSI) an interrupt to the
> > waiting thread, and the hardware places the waiting process on the
> > appropriate priority queue, where it awaits execution cycles. No
> > need to invoke the scheduler (directly). Eventually the unWaited process
> > reaches the front of the highest non-Empty non-Blocked Queue,
> > and gets launched into execution with a Thread State message.
> >> I just realized there are two different schedulers.
> >> One for threads that service HW like interrupts and other non-pagable code.
> >> One for threads that can perform waits, like for IO or page faults.
> > <
> > I saw this as two levels, your second level always being dependent on
> > the first. The first level runs non-pageable, the second runs pageable.
> Yes, except I'm now thinking of 3 levels:
> - normal user and kernel threads which can do things like wait for IO
> - OS level HW threads which can do things like allocate and free
> resources, and schedule and wake up normal threads.
> - interrupt level thread to service device interrupts.
>
> It's all about controlling reentrancy, and where and when thread Wait
> states for things like page-fault IO can occur, so that they cannot
> indefinitely block other OS activity.
<
A mighty thanks to EricP for annotating in great detail the interplay
between threads, interrupts, and OS functionality.
