novaBBS - comp.arch - Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

On 3/25/2022 3:01 PM, MitchAlsup wrote:
> On Friday, March 25, 2022 at 1:25:13 PM UTC-5, BGB wrote:
>> On 3/25/2022 12:15 PM, MitchAlsup wrote:
>>> On Friday, March 25, 2022 at 11:51:19 AM UTC-5, BGB wrote:
>>>> On 3/25/2022 8:19 AM, EricP wrote:
>>>>> BGB wrote:
>>>>>> On 3/21/2022 3:19 AM, BGB wrote:
>>>>>>
>>>>>>>
>>>>>>> I had also been experimenting with alternatives to page tables, and
>>>>>>> ended up going with AVL trees as the main alternative.
>>>>>>>
>>>>>>
>>>>>> Add B-Trees to this list.
>>>>>> I realized that B-Trees have a few capabilities which the AVL trees lack.
>>>>>>
>>>> Also with some design tweaks and optimizations, the B-Trees now seem to
>>>> be ahead.
>>>>
>>>> I am likely now back to putting emphasis on B-Trees rather than AVL Trees.
>>>>>>
>>>>>>> * For small virtual-address spaces, the overhead of the AVL walk is
>>>>>>> relatively modest compared with the cost of saving/restoring all of
>>>>>>> the CPU registers and similar for the ISR. For larger address spaces,
>>>>>>> they were faster than using a chain-hash (lookup cost follows a log2
>>>>>>> curve relative to memory footprint, whereas the chain-hash follows a
>>>>>>> linear cost curve).
>>>>>>>
>>>>>>> * Along with chain-hashing, they are the most space-efficient option
>>>>>>> at present. Memory overhead is significantly lower than traditional
>>>>>>> page tables, and slightly smaller than B-Trees in this case (due to
>>>>>>> the comparably large "keys" in this case).
>>>>>>>
>>>>>>>
>>>>>>> So, this leaves as current options:
>>>>>>> 4K pages, 4-level page table (48b VA);
>>>>>>> 16K pages, 3-level page table (47b VA, currently used);
>>>>>>> 64K pages, 2-level page table (42b VA);
>>>>>>> 4K/16K/64K, AVL, 48b VA (variable height);
>>>>>>> 4K/16K/64K, AVL, 96b VA (variable height).
>>>>>
>>>>> I don't see how it is possible for AVL to compete with a radix tree.
>>>>> You must be doing something different than what I thought you were saying.
>>>>>
>>>>> Because I don't see how an AVL tree with 1 node per 4kB page could
>>>>> complete with the fixed 4 memory accesses to translate a radix tree.
>>>>> A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
>>>>> are 4 times the size of a PTE with key, left&right ptr plus data word.
>>>>>
>>>> I am mostly using 16K pages.
>>>>
>>>>
>>>> The AVL Tree currently uses 32B per node.
>>>> I had to bit-pack some stuff to fit it into 32B.
>>>>
>>>> The B-Tree currently uses 20B per entry (for 96B addressing).
>>>> Nodes being 16K and holding 816 entries.
>>>>
>>>> The Page Tables (Radix Tree) used 8B per entry.
>>>> They would win if they were more than (on average):
>>>> 25% full (vs AVL)
>>>> 40% full (vs B-Tree)
>>>>
>>>>
>>>> Throw ASLR in the mix, and most of the page table (Radix Tree) pages
>>>> will end up mostly empty (far below the break-even point).
>>>> Most of the pages ending up mostly full of zeroes.
>>> <
>>> Most MMU entries in a hierarchical page table ARE zeros already.
>> Yes, however ASLR doesn't exactly help.
>>
>>
>> If the table doesn't need to encode the vast expanses of zero entries,
>> it works out smaller than a table which does need to store all of these
>> zeroes, even when the per entry cost is larger...
>>
>>
>> As can be noted, the earlier form of the B-Trees (with 510x 32B entries)
>> were losing to AVL trees (with 32B nodes).
>>
>> Redesigning the B-Trees to use 816 entries (at the cost of them no
>> longer being a power of 2 size), seems to have shifted the advantage
>> mostly back in favor of B-Trees.
>>
>>
>> Well, at least for fetching stuff; filling the B-Tree is slower than
>> filling the AVL.
>>
>> I suspect this is because, whenever one adds an entry into the B-Tree
>> node, it is necessary to shift everything in the node after this point
>> forward by 1 position (the "shift everything over" loops kinda "eat it"
>> during insertion tasks).
>>
>>
>> This cost would be reduced with smaller nodes, but there is a certain
>> advantage to keeping the B-Tree node size the same as the page size.
>>
>>
>> Then briefly imagines "What if I put an AVL tree inside each B-Tree
>> node?" and, no, not gonna go this route...
>>>>
>>>> The ASLR strategy I am using is mostly using a random number generator
>>>> to generate the addresses for memory allocations. Where one generates a
>>>> random number and then verify it doesn't hit anything (if it does, try
>>>> again).
>>> <
>>> Why does ALSR not take position independent code and end up
>>> requiring a Linker to load the a.out ? {like in the old days before
>>> PIC}
>> ?...
>>
>> I am using a PE/COFF variant.
>>
>> So, in theory, the loader would pick a random address and then
>> base-relocate the image to that address (in my ABI, all the images are
>> required to be relocatable).
> <
> This prevents using original a.out location as backing store for unwritten
> memory pages.

Potentially, though given I am also LZ compressing them, this is sorta
already off the table. They could still be mapped to the swapfile
though, but this isn't an immediate plan.

Current thinking is that code addresses will be randomized, but for now
program code will be direct-mapped.

>>
>> I am not yet putting program images in pagefile backed memory, but this
>> concept still works with direct remapping.
>>
>>
>> ASLR would apply both to loading programs, and also to things like
>> mmap/VirtualAlloc, and (by extension) things like malloc and similar.
>>
>> Where, internally, mmap is built on top of mechanisms with names like
>> VaVirtualAlloc and VaCreateFileMapping and similar...
>>
>> Cough, yes, these are "inspired" by the calls in the Win32 API. TestKern
>> internals sorta flip-flop between taking design inspiration from Linux
>> and Windows (excluding cases where I went and did my own thing).
>>
>>
>> Note, malloc uses a strategy:
>> Is allocation over 64K?
>> If yes, handle it with the page allocator.
>> Is allocation over 4K?
>> If yes, allocated it with the linked-list of blocks allocator.
>> Else, use cell-and-bitmap allocator.
>> This allocates blocks 64K at a time.
>>
>> In the linked-list case, memory is allocated in 1MB chunks (via the page
>> allocator).
>>>>
>>>>
>>>> There is also the possibility of hybrid tables (last level using 8B
>>>> entries), which should do well if the last level is sufficiently full.
>>> <
>>> Skipping of levels, means that for most programs, there is only 1 page
>>> needed to perform mapping.
>> But, most schemes here are very limited in what they can skip (eg: all
>> 0s or 1s). This doesn't work with RNG output.
> <
> Good point. But if the mapping entries are mostly zeros, switching them
> around is childsplay.

OK.

>>
>>
>> The possible workaround (namely small tables which only contained a few
>> entries in place of full sized pages) is partly what originally led down
>> the AVL rabbit hole.
>>>>
>>>> However, this still requires either allocating objects adjacent to each
>>>> other (going against ASLR), or requires average allocation sizes to be
>>>> considerably larger (8MB or 16MB rather than 1MB).
>>>>
>>>> However, hybrid tables are a bit faster.
>>> <
>>> Faster than a 1-level table ?!?
>> No, erm, faster than a pure AVL or B-Tree...
>>
>>
>> The hybrid tables can roughly double the speed relative to a pure AVL or
>> B-Tree.
>>
>> They do partially inherit the drawback (from traditional page tables)
>> that they don't deal well with sparse allocations, albeit nowhere near
>> as bad as with a more deeply nested page table.
>>>>
>>>> ...
>>>>> Are you doing AVL tree variable length *segments* rather than pages?
>>>>>
>>>> No, it is still pages.
>>>> It was more about memory-use efficiency than speed in this case.
>>>>
>>>>
>>>> Compared with page-tables, the AVL Trees and B-Trees are very slow.
>>>>
>>>> However, they also end up using significantly less memory in many
>>>> scenarios (namely with ASLR).
>>>>
>>>>
>>>> However, despite the slowness, it doesn't make much difference relative
>>>> to the average cycle count of the ISR, which is bad enough (due to other
>>>> ISR related overheads), as to mostly mask the difference.
>>>>
>>>>
>>>>
>>>> The problem area isn't really with 3 and 4 level tables (which are used
>>>> for 48-bit addressing), but rather with 8 and 10 level tables used in
>>>> 96-bit addressing.
>>> <
>>> Compare the percentage of applications that fit in 48-bits versus the
>>> number which do not...........
>> This isn't really about how much stuff will fit at this point...
>>
>> The 96-bit space is more about making it nearly impossible to guess the
>> addresses of things.
> <
> It seems to me that making "guessing an address" worthless is the solution.
> <
> So, lets say we have an application that is guessing addresses (and has a SIG
> handler setup to <basically> catch and ignore guessed address faults: now if
> this <dangerous> application cannot see a change in timing between guessing
> addresses actually mapped to someone and not mapped to someone, then
> being able to guess does him no good whatsoever.
> <
> You get this property with the "no flying blind" mantra of execution window design.
> <

Click here to read the complete article

On 3/26/2022 2:05 AM, BGB wrote:
> On 3/25/2022 3:01 PM, MitchAlsup wrote:
>> On Friday, March 25, 2022 at 1:25:13 PM UTC-5, BGB wrote:
>>> On 3/25/2022 12:15 PM, MitchAlsup wrote:
>>>> On Friday, March 25, 2022 at 11:51:19 AM UTC-5, BGB wrote:
>>>>> On 3/25/2022 8:19 AM, EricP wrote:
>>>>>> BGB wrote:
>>>>>>> On 3/21/2022 3:19 AM, BGB wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I had also been experimenting with alternatives to page tables, and
>>>>>>>> ended up going with AVL trees as the main alternative.
>>>>>>>>
>>>>>>>
>>>>>>> Add B-Trees to this list.
>>>>>>> I realized that B-Trees have a few capabilities which the AVL
>>>>>>> trees lack.
>>>>>>>
>>>>> Also with some design tweaks and optimizations, the B-Trees now
>>>>> seem to
>>>>> be ahead.
>>>>>
>>>>> I am likely now back to putting emphasis on B-Trees rather than AVL
>>>>> Trees.
>>>>>>>
>>>>>>>> * For small virtual-address spaces, the overhead of the AVL walk is
>>>>>>>> relatively modest compared with the cost of saving/restoring all of
>>>>>>>> the CPU registers and similar for the ISR. For larger address
>>>>>>>> spaces,
>>>>>>>> they were faster than using a chain-hash (lookup cost follows a
>>>>>>>> log2
>>>>>>>> curve relative to memory footprint, whereas the chain-hash
>>>>>>>> follows a
>>>>>>>> linear cost curve).
>>>>>>>>
>>>>>>>> * Along with chain-hashing, they are the most space-efficient
>>>>>>>> option
>>>>>>>> at present. Memory overhead is significantly lower than traditional
>>>>>>>> page tables, and slightly smaller than B-Trees in this case (due to
>>>>>>>> the comparably large "keys" in this case).
>>>>>>>>
>>>>>>>>
>>>>>>>> So, this leaves as current options:
>>>>>>>> 4K pages, 4-level page table (48b VA);
>>>>>>>> 16K pages, 3-level page table (47b VA, currently used);
>>>>>>>> 64K pages, 2-level page table (42b VA);
>>>>>>>> 4K/16K/64K, AVL, 48b VA (variable height);
>>>>>>>> 4K/16K/64K, AVL, 96b VA (variable height).
>>>>>>
>>>>>> I don't see how it is possible for AVL to compete with a radix tree.
>>>>>> You must be doing something different than what I thought you were
>>>>>> saying.
>>>>>>
>>>>>> Because I don't see how an AVL tree with 1 node per 4kB page could
>>>>>> complete with the fixed 4 memory accesses to translate a radix tree.
>>>>>> A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
>>>>>> are 4 times the size of a PTE with key, left&right ptr plus data
>>>>>> word.
>>>>>>
>>>>> I am mostly using 16K pages.
>>>>>
>>>>>
>>>>> The AVL Tree currently uses 32B per node.
>>>>> I had to bit-pack some stuff to fit it into 32B.
>>>>>
>>>>> The B-Tree currently uses 20B per entry (for 96B addressing).
>>>>> Nodes being 16K and holding 816 entries.
>>>>>
>>>>> The Page Tables (Radix Tree) used 8B per entry.
>>>>> They would win if they were more than (on average):
>>>>> 25% full (vs AVL)
>>>>> 40% full (vs B-Tree)
>>>>>
>>>>>
>>>>> Throw ASLR in the mix, and most of the page table (Radix Tree) pages
>>>>> will end up mostly empty (far below the break-even point).
>>>>> Most of the pages ending up mostly full of zeroes.
>>>> <
>>>> Most MMU entries in a hierarchical page table ARE zeros already.
>>> Yes, however ASLR doesn't exactly help.
>>>
>>>
>>> If the table doesn't need to encode the vast expanses of zero entries,
>>> it works out smaller than a table which does need to store all of these
>>> zeroes, even when the per entry cost is larger...
>>>
>>>
>>> As can be noted, the earlier form of the B-Trees (with 510x 32B entries)
>>> were losing to AVL trees (with 32B nodes).
>>>
>>> Redesigning the B-Trees to use 816 entries (at the cost of them no
>>> longer being a power of 2 size), seems to have shifted the advantage
>>> mostly back in favor of B-Trees.
>>>
>>>
>>> Well, at least for fetching stuff; filling the B-Tree is slower than
>>> filling the AVL.
>>>
>>> I suspect this is because, whenever one adds an entry into the B-Tree
>>> node, it is necessary to shift everything in the node after this point
>>> forward by 1 position (the "shift everything over" loops kinda "eat it"
>>> during insertion tasks).
>>>
>>>
>>> This cost would be reduced with smaller nodes, but there is a certain
>>> advantage to keeping the B-Tree node size the same as the page size.
>>>
>>>
>>> Then briefly imagines "What if I put an AVL tree inside each B-Tree
>>> node?" and, no, not gonna go this route...
>>>>>
>>>>> The ASLR strategy I am using is mostly using a random number generator
>>>>> to generate the addresses for memory allocations. Where one
>>>>> generates a
>>>>> random number and then verify it doesn't hit anything (if it does, try
>>>>> again).
>>>> <
>>>> Why does ALSR not take position independent code and end up
>>>> requiring a Linker to load the a.out ? {like in the old days before
>>>> PIC}
>>> ?...
>>>
>>> I am using a PE/COFF variant.
>>>
>>> So, in theory, the loader would pick a random address and then
>>> base-relocate the image to that address (in my ABI, all the images are
>>> required to be relocatable).
>> <
>> This prevents using original a.out location as backing store for
>> unwritten
>> memory pages.
>
> Potentially, though given I am also LZ compressing them, this is sorta
> already off the table. They could still be mapped to the swapfile
> though, but this isn't an immediate plan.
>
> Current thinking is that code addresses will be randomized, but for now
> program code will be direct-mapped.
>

Current Status:
Program is loaded to a randomized address 000020000000..00007FFFFFFF;
Programs which use the swapfile (with enough memory usage to start
paging stuff out of RAM) seem to be working.

Had a bit of another crap-storm of bugs when I started trying to ASLR
things, eventually realizing that there was a race condition between my
L1 caches and TLB:
TLB would disable itself (forcing physical addressing) when an interrupt
would occur;
However, this did not necessarily mean that prior requests were not
already on the ring, which would then be erroneously treated as physical
addresses once they passed through the TLB.

I moved the responsibility for switching from virtual to physical
addressing during an interrupt from the TLB over to the L1 caches in
order to avoid this issue.

Hopefully I have most of these issues mostly sorted out now. But, have
had multiple cases where everything seems to be working, until
everything starts exploding, so can't be too sure.

There are still issues though with trying to load program binaries
outside of the low 2GB though, but I suspect these are more of a
software issue.

For the pagefile, did end up adding a fairly simplistic compression
scheme, more optimized for encoding speed than good compression:
It logically breaks the block into 16-byte lines;
It encodes a bitmap for which lines are non-zero
(1024 bits for a 16K page);
It then encodes the contents of any non-zero lines
(stored end-to-end).

Decoding process:
Clear target page to zero;
Loop over the lines in the page;
If line's bit is set, copy line to target page;
Advance pointers as appropriate.

Click here to read the complete article

E = MC ** 2 +- 3db

devel / comp.arch / Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

devel / comp.arch / Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

Subject	Author
Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: Misc: Page Tables, Virtual Memory, Swapfile	robf...@gmail.com
Re: Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: Misc: Page Tables, Virtual Memory, Swapfile	EricP
Re: Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: Misc: Page Tables, Virtual Memory, Swapfile	Anton Ertl
Re: Misc: Page Tables, Virtual Memory, Swapfile	John Levine
Re: Misc: Page Tables, Virtual Memory, Swapfile	Anton Ertl
Re: Misc: Page Tables, Virtual Memory, Swapfile	Stephen Fuld
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	John Levine
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	Stephen Fuld
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	MitchAlsup
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	Ivan Godard
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	BGB
Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile	Thomas Koenig
Re: Misc: Page Tables, Virtual Memory, Swapfile	BGB
Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	BGB
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	EricP
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	BGB
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	MitchAlsup
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	BGB
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	MitchAlsup
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	MitchAlsup
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	BGB
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	BGB
Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual	EricP