novaBBS - comp.arch - Re: Packing Hash Tables

On Thursday, March 10, 2022 at 1:15:31 PM UTC-5, BGB wrote:
> On 3/10/2022 9:04 AM, robf...@gmail.com wrote:
> > On Wednesday, March 9, 2022 at 12:37:26 PM UTC-5, BGB wrote:
> >> On 3/9/2022 10:20 AM, EricP wrote:
> >
> >>> Note that operating systems such as Linux the hierarchical page table is
> >>> part of its design. To port it to Power with its hash/inverted page table
> >>> they kept the hierarchical table managed in memory as before,
> >>> and move valid PTE entries into and out of the hash table.
> >>>
> >>> In effect, the hierarchical table is still the OS page table,
> >>> and the hash table is treated like a giant software managed TLB.
> >>> Software could implement, say, a FIFO mechanism to rotate a working set
> >>> of Present hierarchical PTE's through hash PTE slots,
> >>> similar to how a HW TLB might rotate through its entries.
> >>>
> >>> In this arrangement, Address Space ID's (ASID) for the hash table entries
> >>> would be very important for efficiency, as without them a process switch
> >>> would require invalidating all hash table entries just like a real TLB.
> >>>
> > Seems kind of kludgy. I am wondering why now this approach was used.
> >
> > I have coded in the bus interface unit the capability to use either hashed or
> > hierarchical page tables. I was not so insane as to ignore the popularity of
> > hierarchical page tables. Can probably use both types at the same time.
> > Hashed tables are a better way to go for a large address
> > space in my opinion. They use less memory and may have faster look-ups.
> >
> Hashed tables could also make sense as an alternative to B-Trees.
>
> A B-Tree lookup is likely to be relatively complicated and expensive if
> compared to a hash lookup.
>
> It is possible that I could also combine a hashed top-level with
> 2-levels of nested page table (optionally collapsing down the upper 6
> levels into a single page).
> >
> >> OK. This gives more context for what I was confused about in my response.
> >>
> >> In my case, I use a software managed TLB, with hierarchical page tables,
> >> and no sort of intermediate stage representation.
> >>
> >> TLB miss interrupt occurs, handler walks the page table, and loads the
> >> resulting entry into the TLB. Basic design for the mechanism was
> >> more-or-less lifted from SH-4, but with the major difference of going
> >> from a 64-entry fully associative TLB to a 64x or 256x 4-way TLB.
> >>
> >> Arguably, a 6-way or 8-way TLB could be better, but more expensive.
> >>
> >>
> >> I had considered potentially using B-Trees for the upper levels (instead
> >> of hierarchical nesting) for the 96-bit mode, mostly because as-is this
> >> leads to a "very sparse" 8-level table, and a B-Tree could be more
> >> compact (with more traditional nested page tables for the lower 2 or 3
> >> levels).
> >>
> >> Though, there are other (potentially simpler/cheaper) ways to deal with
> >> this.
> >>
> >> This would not really effect the hardware in my case.
> >
> > Page tables are walked via hardware for Thor. Software only gets involved on a
> > page fault. While software managing the TLB is probably comparable in
> > performance and more flexible than hardware, TLB updates with hardware
> > walking are almost invisible to software.
> >
> This was one annoyance with BJX2 with the use of software managed TLB:
> Getting the software TLB miss interrupts stable was a major source of
> annoyance, as there were a whole lot of "little things" that can go
> wrong in these mechanisms.
>
> The L1 caches needing to be able to stop what they are doing, then
> quietly transfer control to an ISR, then resume normal operation without
> any adverse side effects, was a bit of a pain to debug.
>
> Some of the cases were infrequent, so one might to get Doom or Quake
> launched (in a simulation) before the crash happens, where firing these
> up inside a Verilog simulation is not exactly a fast process (the time
> issue is also a major reason why I am using LZ compressed binaries and
> WADs).
>
>
> As noted, the MMU is temporarily disabled within ISRs, partly because
> there is no way for an ISR to deal with a TLB miss (unlike some CPU
> architectures, ISRs are not re-entrant in my case).
>
>
> Similarly, the TLB miss handler would have to resolve page-fault
> scenarios (for virtual memory), without access to the main filesystem
> driver (it has to bypass the driver and access the SDCard directly), but
> does also mean that a page fault (to virtual memory) during some other
> SDcard IO request would be "potentially fatal".
>
> ...
>
> Nevermind whether or not it makes sense to put a pagefile on an SDcard,
> there is nowhere else to put it in this case.
>
>
> What to do with a page-fault to an unmapped address (eg: not backed to
> the page file) is currently an unresolved issue, but this is more a
> software thing than a hardware thing (in a real OS, would probably
> terminate the process or similar here; for the time being, the only real
> solution is to trigger a breakpoint).
>
> I may need to consider going and adding preemptive multitasking, as it
> looks like most "real OS" stuff would depend on this (cooperative
> multitasking doesn't scale particularly well, or leave any real good
> solution for "OK, this program has gone into an infinite loop or
> crashed", which basically leaves no real solution otherwise).
>
>
>
> The debugging was getting bad enough that I almost started to debate
> whether it would be worthwhile to add hardware page walking and not have
> to deal with this as much (but would still need a Page Fault handler, so
> it is not entirely avoidable).
>
>
> Sorta spent most of February beating on trying to debug stuff related to
> the L1 Caches and their interaction with the TLB Miss ISR, partly as
> "poking at something" in the L1 D$ set off an unholy bees nest of bugs.
>
> Likewise, much of my past debugging effort here was more focused on
> "quick and dirty" fixes which typically don't really resolve the
> underlying issues; but, trying to figure out "why" something is going
> wrong is often a lot harder than adding "make it work" hacks (and then
> notice the bug hasn't gone away, it has merely moved to a new location).
>
>
> I still can't say (with much confidence) that the TLB is now stable, but
> in my testing at the moment, it appears to work.
>
> Though, this debugging effort wasn't helped with Verilator also having
> the annoying behavior that adding and removing $display statements can
> sometimes change the behavior of the logic...
> > The TLB can be updated via software, its almost possible then to use a software
> > managed TLB. I guess if the TLB miss signal were fed into the interrupt system,
> > then maybe things could have the option of software management.
> >
> Could be.
>
> In my case, TLB miss triggers an interrupt, and the ISR deals with it.
>
> In the ISR, there is a 'TEA' register that holds the address of whatever
> caused the interrupt (target address for TLB miss), and an 'EXSR'
> register where the low bits encode which interrupt had occurred (as a
> 16-bit number divided into multiple bit fields).
> > I get the feeling there are too many choices available in Thor. Maybe a configuration
> > utility would be better than building in everything.
> >
> > Wondering what entity to place share counts, access keys, and privilege levels in. I
> > had these in the PTE but took them out. There is only one set required for each page
> > of memory.
> >
> I have this stuff in the PTEs/TLBEs.
>
> There is also an ACL Cache with an ACL Miss interrupt as well.
>
> The software side of things isn't really developed all that well yet, as
> most of the time thus far had gone more into trying to get things stable
> enough that the address translation and TLB miss interrupts weren't
> prone to crash stuff.
>
> I seem to have this part stable now, so now it should be possible to
> start making forward progress in this area.

For Thor, the TLBE / PTE is 128-bits in size with no room to spare without having the
share-counts, etc. included in it. A larger PTE would really complicate things.
The TLBE value can be loaded or stored from / to the TLB using a single
128-bit wide register. Anything wider means using a pair of registers. I could
maybe reduce the size of the page number fields a few bits. The fields are 48-bit.
I had the thought that since there only needs to be one copy of the information it
could be stored in another entity. But this means potentially adding to the memory
access time since another table is accessed after the physical page number is
known.

I do not know much about ACL except that I suspect it can be quite complicated to do
with hardware. Is there ACL for individual pages of memory? There can be users or
groups of users and system processes all with separate access requirements. I suspect
there would be some limits engineered into ACL. For instance, limited numbers of
groups and users to keep the reference fields a reasonable size. Can you elaborate on
the ACL?

In my PTE eight bits are used for access control to indicate read/write/execute/cache
for either user or system. Maybe these bits would be better used to refer to an ACL. If
I were to trim four bits off the page number fields that means a 16-bit ACL reference
field could be supported.

Subject	Replies	Author
Packing Hash Tables By: robf...@gmail.com on Wed, 9 Mar 2022	77	robf...@gmail.com

Garbage In -- Gospel Out.

computers / comp.arch / Re: Packing Hash Tables