
Subject                                                                         Author
* Misc: Page Tables, Virtual Memory, Swapfile                                   BGB
+* Re: Misc: Page Tables, Virtual Memory, Swapfile                              robf...@gmail.com
|`- Re: Misc: Page Tables, Virtual Memory, Swapfile                             BGB
+* Re: Misc: Page Tables, Virtual Memory, Swapfile                              EricP
|+- Re: Misc: Page Tables, Virtual Memory, Swapfile                             BGB
|`- Re: Misc: Page Tables, Virtual Memory, Swapfile                             Anton Ertl
+* Re: Misc: Page Tables, Virtual Memory, Swapfile                              John Levine
|+* Re: Misc: Page Tables, Virtual Memory, Swapfile                             Anton Ertl
||`* Re: Misc: Page Tables, Virtual Memory, Swapfile                            Stephen Fuld
|| `* Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile          John Levine
||  +* Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile         BGB
||  |`* Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile        Stephen Fuld
||  | +* Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile       MitchAlsup
||  | |`- Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile      Ivan Godard
||  | `* Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile       BGB
||  |  `- Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile      BGB
||  `- Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile         Thomas Koenig
|`- Re: Misc: Page Tables, Virtual Memory, Swapfile                             BGB
`* Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual             BGB
 `* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual        EricP
  `* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual       BGB
   +* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual      MitchAlsup
   |`* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual     BGB
   | `* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual    MitchAlsup
   |  +- Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual   MitchAlsup
   |  `* Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual   BGB
   |   `- Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual  BGB
   `- Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual      EricP

Misc: Page Tables, Virtual Memory, Swapfile

<t19cei$aj7$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24357&group=comp.arch#24357

Newsgroups: comp.arch
 by: BGB - Mon, 21 Mar 2022 08:19 UTC

Well, as of earlier today I was (more or less) able to get
pagefile/swapfile stuff working.

More testing/debugging is likely still needed (such as whether it holds
up when a program uses enough memory to cause the pagefile to start
thrashing), but "so far so good" it seems.

It was a bit of a pain, but I eventually realized an issue:
The "VaVirtualAlloc" function (used for pagefile backed memory) would
try to reserve an exact number of pages;
Accesses would sometimes trigger an access to the adjacent cache-line,
which would occasionally cross a page boundary into the following page
(even if the access itself did not cross the boundary);
A TLB miss would then load whatever was in the following PTE into the TLB.

If another VaVirtualAlloc call created an allocation starting at the
next page, it would not be aware that the address was already loaded
into the TLB, with the newly assigned page still using whatever was
previously at that location in the page table.

In theory, I could pedantically flush this part of the TLB every time
something was loaded or modified (*), but I figured that, since this is
virtual space located above the 4GB mark, I can "live a little" and
simply add an extra dummy page to each VaVirtualAlloc (it doesn't
actually need to be memory-backed).

*: It does flush parts of the TLB whenever something is free'd or
evicted, but not currently when loading a page from swap, as it assumes
that the virtual-address is not already in the TLB.

Where:
VaVirtualAlloc: Allocate virtual memory
VaVirtualFree: Free virtual memory
VaMProtect: Modify access mode for memory
...
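
As a rough sketch of the dummy-page trick in C (hedged: the names and
low-level helpers here are illustrative stand-ins, not the actual
TestKern code):

  #include <stddef.h>

  #define PAGE_SIZE (16*1024)   /* 16K pages, per the current config */

  /* hypothetical low-level helpers */
  extern void *va_reserve_pages(size_t npages);
  extern void  va_commit_backing(void *base, size_t npages);

  void *VaVirtualAlloc_sketch(size_t size)
  {
      size_t npages = (size + PAGE_SIZE - 1) / PAGE_SIZE;

      /* +1 trailing dummy page: never backed by the pagefile, it just
         absorbs adjacent cache-line probes so a stale PTE for the next
         allocation's first page never gets pulled into the TLB. */
      void *base = va_reserve_pages(npages + 1);
      if (base)
          va_commit_backing(base, npages);
      return base;
  }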

I had also been experimenting with alternatives to page tables, and
ended up going with AVL trees as the main alternative.

* For small virtual-address spaces, the overhead of the AVL walk is
relatively modest compared with the cost of saving/restoring all of the
CPU registers and similar for the ISR. For larger address spaces, they
were faster than using a chain-hash (lookup cost follows a log2 curve
relative to memory footprint, whereas the chain-hash follows a linear
cost curve).

* Along with chain-hashing, they are the most space-efficient option at
present. Memory overhead is significantly lower than traditional page
tables, and slightly smaller than B-Trees in this case (due to the
comparably large "keys" in this case).
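
As a rough sketch of the AVL lookup on the TLB-miss path (hedged: node
layout and names are illustrative, not my actual structures):

  #include <stdint.h>

  typedef struct AvlNode AvlNode;
  struct AvlNode {
      uint64_t vpn;       /* virtual page number, the key           */
      uint64_t pte;       /* translation + access flags             */
      AvlNode *lo, *hi;   /* children; tree kept balanced on insert */
  };

  /* O(log n) walk; a hit gets loaded into the TLB, a miss (return 0)
     becomes a page fault. */
  uint64_t avl_lookup_pte(const AvlNode *n, uint64_t vpn)
  {
      while (n) {
          if (vpn == n->vpn)
              return n->pte;
          n = (vpn < n->vpn) ? n->lo : n->hi;
      }
      return 0;
  }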

So, this leaves as current options:
4K pages, 4-level page table (48b VA);
16K pages, 3-level page table (47b VA, currently used);
64K pages, 2-level page table (42b VA);
4K/16K/64K, AVL, 48b VA (variable height);
4K/16K/64K, AVL, 96b VA (variable height).

The 10/8/7 level page-table formats are (most likely) now effectively
deprecated.

While hybrid tables are a possibility (AVL with the bottom level as a
page-table), they were kind of lackluster in my tests (neither
particularly fast nor particularly space efficient).

At present, the pagefile is a fairly naive format, where each
virtual-memory page has an "index" which maps directly to its location
in the pagefile. When a page is committed, it is assigned a spot in the
pagefile.
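
In C terms, the mapping is roughly this (a hedged sketch; the 16K
PAGE_SIZE matches the currently used configuration):

  #include <stdint.h>

  #define PAGE_SIZE (16*1024)

  /* a page's "index" maps directly to a fixed byte offset */
  static inline uint64_t pagefile_offset(uint32_t page_index)
  {
      return (uint64_t)page_index * PAGE_SIZE;
  }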

One other thing I was considering is whether or not it would be
"effective" to use a quick/dirty LZ compressor on the pages to reduce
the amount of data that needs to be written to or read from the SD card.

This is because IO happens in 512B blocks, and I can reduce the number
of IO blocks via LZ compression; however, this assumes a fast LZ compressor.

One concern though is that the SD cards internally operate via much
larger blocks, meaning that saving a few K here and there would not be
effective for lifespan. About the only real option here would be to
gather up evicted pages, and then write them out all at once in a big chunk.

Say, we have a 4MB "Page Evict Buffer":
Evicted pages are LZ compressed, and added to this buffer;
When the buffer is full, it is written to the pagefile;
The pagefile is written as a big circular log buffer.
Likely would need a special case to avoid overwriting "still live" pages
(if the next chunk includes live pages, they are copied as-is into the
evict buffer);
....

So, for every page, one will need to keep track of its current position
within the pagefile and, for every position, have a way to query for any
still-live pages, ...
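
As a rough sketch of the evict-buffer flow in C (hedged: lz_compress,
pagefile_write, and the sizes are illustrative stand-ins, not an
existing API):

  #include <stdint.h>
  #include <stddef.h>

  #define EVICT_BUF_SIZE (4*1024*1024)     /* 4MB staging buffer    */
  #define PAGEFILE_SIZE  (256*1024*1024)   /* example pagefile size */

  /* stand-ins: compressor returns bytes written, 0 if out of space */
  extern size_t lz_compress(void *dst, size_t max,
                            const void *src, size_t n);
  extern void   pagefile_write(uint64_t ofs, const void *buf, size_t n);

  static uint8_t  evict_buf[EVICT_BUF_SIZE];
  static size_t   evict_pos;   /* fill position in the buffer          */
  static uint64_t log_head;    /* circular write offset into the file  */

  void evict_page(const void *page, size_t page_size)
  {
      size_t n = lz_compress(evict_buf + evict_pos,
                             EVICT_BUF_SIZE - evict_pos,
                             page, page_size);
      if (n == 0) {
          /* buffer full: one big aligned write, then wrap; still-live
             pages in the chunk about to be overwritten would first be
             re-queued into the buffer (the special case above). */
          pagefile_write(log_head, evict_buf, EVICT_BUF_SIZE);
          log_head = (log_head + EVICT_BUF_SIZE) % PAGEFILE_SIZE;
          evict_pos = 0;
          n = lz_compress(evict_buf, EVICT_BUF_SIZE, page, page_size);
      }
      /* the caller records (chunk, offset, size) for this page */
      evict_pos += n;
  }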

It is a question of what would be the best LZ design to use in this
case. Ideally, I would want something that favors encoder speed over
ratio (likely a specialized encoder either for RP2, a simplified RP2
variant, or some other more specialized format).

....

Though, it may not be as big of an immediate issue, as the amount of
"thrashing" to the page file seems to be a lot smaller than I was
originally expecting (with currently 32MB of RAM being set aside to be
used as pagefile backed virtual memory).

Any thoughts?...

Re: Misc: Page Tables, Virtual Memory, Swapfile

<2f446e73-67e3-458a-a18e-9ceff3edd9cen@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=24361&group=comp.arch#24361

Newsgroups: comp.arch
 by: robf...@gmail.com - Mon, 21 Mar 2022 12:06 UTC

On Monday, March 21, 2022 at 4:19:33 AM UTC-4, BGB wrote:
> [...]
> One other thing I was considering is whether or not it would be
> "effective" to use a quick/dirty LZ compressor on the pages to reduce
> the amount of data that needs to be written to or read from the SD card.
> [rest of the original post snipped]

If larger pages are used, they could be partially compressed and decompressed.
Say 64kB pages are used. On a page load only the 4kB block that is accessed is
decompressed. The rest of the page is decompressed on demand. It could turn
out most of the page is not accessed and would then not need to be recompressed
when written back to the pagefile, because it would still be compressed.
It does need a bitmap of which blocks are compressed: 16 bits in the PTE.
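
A rough sketch of that bookkeeping in C (hedged: the names and the
decompress routine are illustrative stand-ins):

  #include <stdint.h>

  #define SUBBLOCK_SIZE 4096   /* 64kB page = 16 x 4kB sub-blocks */

  /* stand-in: decompress one 4kB sub-block of the page in place */
  extern void decompress_subblock(uint8_t *page, unsigned idx);

  void touch_subblock(uint8_t *page, uint16_t *comp_mask, unsigned idx)
  {
      if (*comp_mask & (1u << idx)) {        /* still compressed? */
          decompress_subblock(page, idx);
          *comp_mask &= ~(1u << idx);        /* now plain data    */
      }
      /* on writeback, sub-blocks still set in *comp_mask can be copied
         out as-is; only the cleared ones need recompressing */
  }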

Re: Misc: Page Tables, Virtual Memory, Swapfile

<Fo0_J.260950$7F2.145604@fx12.iad>


https://www.novabbs.com/devel/article-flat.php?id=24362&group=comp.arch#24362

Newsgroups: comp.arch
 by: EricP - Mon, 21 Mar 2022 14:58 UTC

BGB wrote:
>
> At present, the pagefile is a fairly naive format, where each
> virtual-memory page has an "index" which maps directly to its location
> in the pagefile. When a page is committed, it is assigned a spot in the
> pagefile.

It's not naive, it's simple and straightforward. That's a good thing.
The "in use" space tracking is a bit vector and allocation is a find-first.
Its drawback is that pages that are contiguous in virtual space
wind up scattered about the page file. Most page fault handlers have
a "cluster" optimization that tries to bring in multiple pages at once
if the blocks are also contiguous on disk. However this optimization
doesn't work with the dynamically allocated scattered page file blocks.
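
A rough sketch of that allocator in C (hedged: the sizes are examples,
and __builtin_ctzll assumes GCC/Clang):

  #include <stdint.h>

  #define NSLOTS 16384                    /* example pagefile slot count */
  static uint64_t in_use[NSLOTS / 64];    /* one bit per pagefile slot   */

  int alloc_slot(void)
  {
      for (unsigned w = 0; w < NSLOTS / 64; w++) {
          if (in_use[w] != ~0ull) {                /* word has a free bit */
              int b = __builtin_ctzll(~in_use[w]); /* find first zero     */
              in_use[w] |= 1ull << b;
              return (int)(w * 64 + b);
          }
      }
      return -1;   /* pagefile full */
  }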

A different way is to allocate the page file blocks as a contiguous
segment at the time the memory section is created. That leaves virtually
contiguous pages as physically adjacent in the page file.
The downside is it gets complicated if the memory section starts to
change size and ends up with multiple file segments as backing store.

> One other thing I was considering is whether or not it would be
> "effective" to use a quick/dirty LZ compressor on the pages to reduce
> the amount of data that needs to be written to or read from the SD card.

It is doable but complicated.
The problem is that compressing a fixed-size block produces a
variable-sized block. So a fixed "logical block number" winds up at a
variable offset. To find a particular block again you need something
that does the equivalent of what the i-node file system structure does,
though this structure can stay in memory because it doesn't need to be
retained.

And you need some way to recover deleted file space but retain the
non-deleted blocks.

This is starting to sound like building your own
log file system but layered onto the page file.

> This is because IO happens in 512B blocks, and I can reduce the number
> of IO blocks via LZ compression, however, this assumes a fast LZ
> compressor.
>
> One concern though is that the SD cards internally operate via much
> larger blocks, meaning that saving a few K here and there would not be
> effective for lifespan. About the only real option here would be to
> gather up evicted pages, and then write them out all at once in a big
> chunk.

There is the larger block size, but also the write amplification because
the SSD has a map from the disk "physical block number" to its internal
actual block number, and that map has to also be stored when it changes.

In general, paging to a SSD is discouraged because it wears it out.

> Say, we have a 4MB "Page Evict Buffer":
> Evicted pages are LZ compressed, and added to this buffer;
> When the buffer is full, it is written to the pagefile;
> The pagefile is written as big circular log buffer.
> Likely would need a special case to avoid overwriting "still live" pages
> (if the next chunk includes live pages, they are copied as-is into the
> evict buffer);
> ....
>
> So, for every page, one will need to keep track of its current position
> within the pagefile, and for every position, an ability to query for any
> still-live pages, ...

Yes. That block number can be stored in the Not-Present PTE so if the
page is touched again you know exactly where to find the backing store.
If the page gets swapped back in, the old page file block number
can be stashed in the frame table (aka the struct page table) while
the PTE contains the page frame number (PFN) (aka physical page number).
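
A rough sketch of such a PTE encoding in C (hedged: the field widths
are arbitrary examples, not a specific architecture's layout):

  #include <stdint.h>

  #define PTE_PRESENT 1ull

  /* resident: low bit set, remaining bits hold the physical frame */
  static inline uint64_t pte_resident(uint64_t pfn)
  { return (pfn << 1) | PTE_PRESENT; }

  /* swapped out: low bit clear, remaining bits hold the pagefile
     block number, so a later fault knows where the backing store is */
  static inline uint64_t pte_swapped(uint64_t block)
  { return block << 1; }

  static inline int pte_is_present(uint64_t pte)
  { return (int)(pte & PTE_PRESENT); }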

> It is a question of what would be the best LZ design to use in this
> case. Ideally, I would want something that favors encoder speed over
> ratio (likely a specialized encoder either for RP2, a simplified RP2
> variant, or some other more specialized format).
>
> ....
>
> Though, it may not be as big of an immediate issue, as the amount of
> "thrashing" to the page file seems to be a lot smaller than I was
> originally expecting (with currently 32MB of RAM being set aside to be
> used as pagefile backed virtual memory).
>
>
> Any thoughts?...

Every time it writes an SSD block it has to update its internal map.
The only way to avoid that is don't page swap, or use an HDD.
Or page swap to a different SSD device, like a cheap-o USB drive
that costs $5 and you don't care if it wears out.
And skip that whole compression approach.

Re: Misc: Page Tables, Virtual Memory, Swapfile

<t1ab4o$20j$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24363&group=comp.arch#24363

Newsgroups: comp.arch
 by: BGB - Mon, 21 Mar 2022 17:03 UTC

On 3/21/2022 9:58 AM, EricP wrote:
> BGB wrote:
>>
>> At present, the pagefile is a fairly naive format, where each
>> virtual-memory page has an "index" which maps directly to its location
>> in the pagefile. When a page is committed, it is assigned a spot in
>> the pagefile.
>
> It's not naive, it's simple and straightforward. That's a good thing.
> The "in use" space tracking is a bit vector and allocation is a find-first.
> Its drawback is that pages that are contiguous in virtual space
> wind up scattered about the page file. Most page fault handlers have
> a "cluster" optimization that tries to bring in multiple pages at once
> if the blocks are also contiguous on disk. However this optimization
> doesn't work with the dynamically allocated scattered page file blocks.
>

Clustering is mostly N/A for an SDcard, as "seek time" doesn't really
exist. What is potentially a problem is that while the SDcard is
written in terms of 512-byte sectors, the area being rewritten is
actually significantly larger than a given sector.

Then again, it is possible the SDcard could have logic to lessen the
impact of a bunch of scattered small writes, dunno...

Though, from what I can gather, the usual way to extend the lifetime of
an SDcard is to try to do a smaller number of bigger writes aligned on
the SDcard's native block boundary. Historically, they don't hold up
well with small scattered writes.

> A different way is to allocate the page file blocks as a contiguous
> segment at the time the memory section is created. That leaves virtually
> contiguous pages as physically adjacent in the page file.
> The downside is it gets complicated if the memory section starts to
> change size and ends up with multiple file segments as backing store.
>

As noted, this part likely matters a lot more for an HDD than an SDcard.

>> One other thing I was considering is whether or not it would be
>> "effective" to use a quick/dirty LZ compressor on the pages to reduce
>> the amount of data that needs to be written to or read from the SD card.
>
> It is doable but complicated.
> The problem is that compressing a fixed size block becomes a variable
> sized block. So a fixed "logical block number" winds up at a variable
> offset. To find a particular block again you need something that does
> the equivalent to what the i-node file system structure does,
> but this structure is in memory because it doesn't need to be retained.
>
> And you need some way to recover deleted file space but retain the
> non-deleted blocks.
>
> This is starting to sound like building your own
> log file system but layered onto the page file.
>

Probably.

Haven't really looked much into log file systems, but am aware that some
projects use them (in preference to FAT32 and similar) on SD cards and SSDs.

A log structured approach would seem to be a way to minimize the amount
of writes to an SDcard, and thus hopefully extend its lifespan.

Every page in the pagefile would need to keep track of which chunk it is
in, and its size/offset within the chunk.

If compressed data is stored at a granularity of 64B, we need 16b for
the offset and size within the chunk, plus ~ 8-16b to encode the chunk
within the pagefile.

So, say, 48b to store the location and size of an LZ compressed page
(along with a few flags).

For a 256MB pagefile, this is ~ 128K (with 16K pages).

One would also need a way to daisy-chain all the pages within a chunk,
but this would need ~ 18 or 20 bits, increasing the overhead to 256K.

If I assume a 512B granularity (maps nicely to the sector size), I need
13b each for the size and offset.

So, say:
(63:60): Page Flags
(59:40): Chunk Link Chain
(39:26): Chunk Index
(25:13): Compressed Size
(12: 0): Compressed Offset

The chain is needed, as otherwise one would need to sweep "all" of the
pages to build a list of which pages might be impacted by overwriting a
given chunk.

Though, this operation is likely to be rare enough, and would coincide
with writing ~ 4MB to the SDcard, that it may make sense to skip the
chain and do a brute-force query (leaving these bits for other uses).
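
As a rough sketch of packing that record in C (hedged: purely
illustrative bit twiddling over the layout above):

  #include <stdint.h>

  /* (63:60) flags | (59:40) link | (39:26) chunk |
     (25:13) size  | (12:0) offset */
  static inline uint64_t pageloc_pack(unsigned flags, unsigned link,
                                      unsigned chunk, unsigned csz,
                                      unsigned cofs)
  {
      return ((uint64_t)(flags & 0xF)     << 60) |
             ((uint64_t)(link  & 0xFFFFF) << 40) |
             ((uint64_t)(chunk & 0x3FFF)  << 26) |
             ((uint64_t)(csz   & 0x1FFF)  << 13) |
             ((uint64_t)(cofs  & 0x1FFF));
  }

  static inline unsigned pageloc_chunk(uint64_t v)
  { return (unsigned)((v >> 26) & 0x3FFF); }
  static inline unsigned pageloc_size(uint64_t v)
  { return (unsigned)((v >> 13) & 0x1FFF); }
  static inline unsigned pageloc_ofs(uint64_t v)
  { return (unsigned)(v & 0x1FFF); }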

It could also be possible to use a log-structured format without LZ
compressing the pages, but this would mean around 3x or 4x as much IO
traffic.

>> This is because IO happens in 512B blocks, and I can reduce the number
>> of IO blocks via LZ compression, however, this assumes a fast LZ
>> compressor.
>>
>> One concern though is that the SD cards internally operate via much
>> larger blocks, meaning that saving a few K here and there would not be
>> effective for lifespan. About the only real option here would be to
>> gather up evicted pages, and then write them out all at once in a big
>> chunk.
>
> There is the larger block size, but also the write amplification because
> the SSD has a map from the disk "physical block number" to its internal
> actual block number, and that map has to also be stored when it changes.
>
> In general, paging to a SSD is discouraged because it wears it out.
>

Yeah, though in this case, can't really connect an HDD up to an FPGA
board. Main option is an SDcard, but I would prefer not to ruin the
lifespan of the SD cards.

>> Say, we have a 4MB "Page Evict Buffer":
>> Evicted pages are LZ compressed, and added to this buffer;
>> When the buffer is full, it is written to the pagefile;
>> The pagefile is written as big circular log buffer.
>> Likely would need a special case to avoid overwriting "still live"
>> pages (if the next chunk includes live pages, they are copied as-is
>> into the evict buffer);
>> ....
>>
>> So, for every page, one will need to keep track of its current position
>> within the pagefile, and for every position, an ability to query for
>> any still-live pages, ...
>
> Yes. That block number can be stored in the Not-Present PTE so if the
> page is touched again you know exactly where to find the backing store.
> If the page gets swapped back in, the old page file block number
> can be stashed in the frame table (aka the struct page table) while
> the PTE contains the page frame number (PFN ) (aka physical page number).
>

The PTE's for non-live pages encode a logical "page index" within the
pagefile.

This gets more complicated with an indirect compressed-page scheme, as
now we need both a logical page number, and another table to encode
where it is currently stored in its LZ compressed form (which may change
dynamically).

It doesn't seem like a great idea to store the compressed location
directly within the PTE, and also my PTE scheme doesn't really leave
enough bits for this.

>> It is a question of what would be the best LZ design to use in this
>> case. Ideally, I would want something that favors encoder speed over
>> ratio (likely a specialized encoder either for RP2, a simplified RP2
>> variant, or some other more specialized format).
>>
>> ....
>>
>> Though, it may not be as big of an immediate issue, as the amount of
>> "thrashing" to the page file seems to be a lot smaller than I was
>> originally expecting (with currently 32MB of RAM being set aside to be
>> used as pagefile backed virtual memory).
>>
>>
>> Any thoughts?...
>
> Every time it writes an SSD block it has to update its internal map.
> The only way to avoid that is don't page swap, or use an HDD.
> Or page swap to a different SSD device, like a cheap-o USB drive
> that cost $5 and you don't care if it wears out.
> And skip that whole compression approach.

I am not dealing with SSD's in this case, but mostly SDcards.

For the physical testing, I mostly have a bunch of 16GB SDcards I got
off Amazon, forget how much they cost now, wasn't all that much. They
were a lot cheaper than 64GB or 128GB SDcards at least.

For my testing on an FPGA, 16GB seemed like plenty.
Nevermind if I hack off several GB to use as a pagefile.

Still generally using FAT32 as the main filesystem, partly as Windows
doesn't really have anything else it can use (otherwise, would have
assumed using EXTn or similar).

There is an option for mounting WAD4 images as part of the filesystem,
but this isn't really used as much yet.

Unlike Linux and friends, most of the basic commands are built into the
shell (which is also the OS kernel for now). Eventually, it may make
sense to split the shell and kernel into separate binaries, but
consolidating most of the basic command-line tools into the shell makes
more sense from an overhead perspective.

Well, and it is also sort of like Cygwin in that binaries typically
still have an "EXE" file extension (if you launch "whatever", the shell
looks for "whatever.exe" or similar).


[... rest of article truncated ...]
Re: Misc: Page Tables, Virtual Memory, Swapfile

<t1aepg$2dio$1@gal.iecc.com>


https://www.novabbs.com/devel/article-flat.php?id=24365&group=comp.arch#24365

Newsgroups: comp.arch
 by: John Levine - Mon, 21 Mar 2022 18:05 UTC

According to BGB <cr88192@gmail.com>:
>One other thing I was considering is whether or not it would be
>"effective" to use a quick/dirty LZ compressor on the pages to reduce
>the amount of data that needs to be written to or read from the SD card.

I wouldn't do that. LZ can only compress data where it can find redundancy,
and on random-looking data LZ can end up larger than the original.

There's also the issue that you're now storing blocks of variable and
unpredictable size. I suppose if you compress 4.5K to 3.5K you win since
you are rewriting one SD block rather than two, but it seems like
asking for trouble.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Misc: Page Tables, Virtual Memory, Swapfile

<t1af2a$7gu$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24366&group=comp.arch#24366

Newsgroups: comp.arch
 by: BGB - Mon, 21 Mar 2022 18:10 UTC

On 3/21/2022 7:06 AM, robf...@gmail.com wrote:
> On Monday, March 21, 2022 at 4:19:33 AM UTC-4, BGB wrote:
>> [original post snipped]
>
> If larger pages are used, they could be partially compressed and decompressed.
> Say 64kB pages are used. On a page load only the 4kB block that is accessed is
> decompressed. The rest of the page is decompressed on demand. It could turn
> out most of the page is not accessed and would then not need to be recompressed
> when written back to the pagefile, because it would still be compressed.
> It does need a bitmap of which blocks are compressed: 16 bits in the PTE.

As noted, for now I am dealing with 16K pages, which will probably
compress down to around 4..6K on average.

Granted, 64K pages would be slightly better from a compression POV, but
I had previously noted that a drawback with 64K pages is that they are
coarse enough to adversely affect the memory footprint of programs.

As for LZ scheme, possible would be RP2:
* dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
* dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191, r=0..7)
* dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071, r=0..7)
* rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
* rrr01111 (Long Match, N/A)
* rr011111 (r=1..3 bytes, 0=EOB)
* rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)
** d: Distance
** l: Match Length
** r: Literal Length

Though, one concern here is that the encoding is more complicated than
ideal for a fast encoder, and it assumes a 128K sliding window which is
N/A for 16K pages.

A slight tweak could be, say:
* dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
* dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191, r=0..7)
* dddddddd-ddddddll-llllllll-rrrrr011 (l=4..1027, d=0..16383, r=0..31)
* rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)

As in my LZ4 variants, (d==0) would be an escape case (used for EOB).

Doesn't seem like this would gain enough over RP2 to be "worthwhile"
though, vs just using my RP2 format as-is (nevermind if the sliding
window is needlessly large).

Also possible could be a WORD or DWORD based format.

Eg, 16-bit WORD based:
* dddddddd-dlllrrr0 (l=4..18, d=0..511, r=0..14)
* dddddddd-ddddddll-llllllll-rrrrrr01 (l=4..2050, d=0..16383, r=0..126)

The primary difference being that literal and match lengths are
constrained to a multiple of 16 bits. This would trade some compression
for a simpler encoder/decoder (but would still allow for byte-aligned LZ
copies).

(d==0):
* (r==0), (l==0): EOB
* (r==0), (l!=0): Bare Literals
* (r!=0): Escape Command

....
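
A rough sketch of decoding the short (16-bit) token form in C (hedged:
this assumes one plausible reading of the bit layout above, with the
'0' tag in bit 0, r in bits 3:1, l in bits 6:4, and d in bits 15:7;
lengths in 16-bit units, and the d==0 escapes are not handled here):

  #include <stdint.h>

  void decode_short_token(uint16_t tok, const uint8_t **src,
                          uint8_t **dst)
  {
      unsigned rl = ((tok >> 1) & 0x7) * 2;       /* literal bytes, 0..14 */
      unsigned ml = ((tok >> 4) & 0x7) * 2 + 4;   /* match bytes, 4..18   */
      unsigned d  =  (tok >> 7) & 0x1FF;          /* match distance       */
      const uint8_t *in = *src;
      uint8_t *out = *dst;

      while (rl--) *out++ = *in++;                /* copy literals        */
      const uint8_t *m = out - d;
      while (ml--) *out++ = *m++;                 /* overlap-safe copy    */

      *src = in; *dst = out;
  }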

A DWORD based format would be similar, except that everything is
constrained to a multiple of 32 bits (I have done similar formats
before). This pays more in terms of compression ratio.

....

Though, the simplest option here is to just reuse RP2, as I already have
code for this format in TestKern (even if its sliding window size is
much larger than is useful for this use-case).

Re: Misc: Page Tables, Virtual Memory, Swapfile

<2022Mar21.190513@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=24368&group=comp.arch#24368

Newsgroups: comp.arch
 by: Anton Ertl - Mon, 21 Mar 2022 18:05 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>In general, paging to a SSD is discouraged because it wears it out.

A few years ago I read (in c't) about an attempt to wear out SSDs by
writing to them at maximum rate. After half a year or so they had
managed to destroy a few SSDs with that, but not all. More recently
they explained why they did not repeat these experiments: First, these
experiments would take too long with current SSDs. But more
importantly, the results are not representative of more usual
conditions: SSDs survive a lot more writes if you write to them when they
are hot (conversely, they keep the data for longer when they are cold).

My conclusion from this is: If you need swap, using an SSD is ok.
Either you swap a lot, then the SSD will be hot and survive many more
writes than the SSD is rated for. Or you swap little, then there is
no problem anyway.

Moreover, if you page at all, page to SSDs. Paging to HDDs just takes
too long at current memory sizes.

But we certainly have not allocated swap space on our recent machines,
e.g.:

> free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        70Gi       8.1Gi       7.0Mi        47Gi        54Gi
Swap:             0B          0B          0B

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Page Tables, Virtual Memory, Swapfile

<2022Mar21.192939@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=24369&group=comp.arch#24369

Newsgroups: comp.arch
 by: Anton Ertl - Mon, 21 Mar 2022 18:29 UTC

John Levine <johnl@taugh.com> writes:
>According to BGB <cr88192@gmail.com>:
>>One other thing I was considering is whether or not it would be
>>"effective" to use a quick/dirty LZ compressor on the pages to reduce
>>the amount of data that needs to be written to or read from the SD card.
>
>I wouldn't do that. LZ can only compress data where it can find redundancy,
>and on random-looking data LZ can end up larger than the original.

The Starfive Visionfive V1 we acquired recently has 8GB RAM. The
Fedora image for it is configured to swap by compressing into RAM
(/dev/zram0). Given how slow the SD-card interface is and the fears
of wearing out the flash memory, that's a reasonable choice. Of
course, if you fill your memory with compressed (or random) data, you
are out of luck.

>There's also the issue that you're now storing blocks of variable and
>unpredictable size. I suppose if you compress 4.5K to 3.5K you win since
>you are rewriting one SD block rather than two, but it seems like
>asking for trouble.

However, SSD manufacturers and tape drive manufacturers have been
using compression despite the block-oriented nature of the storage. I
am not sure how widespread it is for SSDs, but last I looked, it was
universal for tape drives.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Page Tables, Virtual Memory, Swapfile

<t1ak5f$jr6$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24375&group=comp.arch#24375

Newsgroups: comp.arch
 by: Stephen Fuld - Mon, 21 Mar 2022 19:37 UTC

On 3/21/2022 11:29 AM, Anton Ertl wrote:
> John Levine <johnl@taugh.com> writes:
>> According to BGB <cr88192@gmail.com>:
>>> One other thing I was considering is whether or not it would be
>>> "effective" to use a quick/dirty LZ compressor on the pages to reduce
>>> the amount of data that needs to be written to or read from the SD card.
>>
>> I wouldn't do that. LZ can only compress data where it can find redundancy,
>> and on random-looking data LZ can end up larger than the original.
>
> The Starfive Visionfive V1 we acquired recently has 8GB RAM. The
> Fedora image for it is configured to swap by compressing into RAM
> (/dev/zram0). Given how slow the SD-card interface is and the fears
> of wearing out the flash memory, that's a reasonable choice. Of
> course, if you fill your memory with compressed (or random) data, you
> are out of luck.
>
>> There's also the issue that you're now storing blocks of variable and
>> unpredictable size. I suppose if you compress 4.5K to 3.5K you win since
>> you are rewriting one SD block rather than two, but it seems like
>> asking for trouble.
>
> However, SSD manufacturers and tape drive manufacturers have been
> using compression despite the block-oriented nature of the storage. I
> am not sure how widespread it is for SSDs, but last I looked, it was
> universal for tape drives.

It has been universal for tape drives and before that tape drive
controllers, for decades. Of course, it is much easier for them, as
they are inherently sequential access, not random access, so addressing
the varying block sizes just isn't an issue.

One other point that might be relevant: if the compression is put into
the hardware on the CPU side of the device interface, you will transfer
less data over the (presumably slower-than-memory) device interface,
thus speeding up I/O operations without requiring CPU cycles to do the
compression.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1aliq$81f$1@gal.iecc.com>


https://www.novabbs.com/devel/article-flat.php?id=24378&group=comp.arch#24378

Newsgroups: comp.arch
 by: John Levine - Mon, 21 Mar 2022 20:01 UTC

According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>It has been universal for tape drives and before that tape drive
>controllers, for decades. Of course, it is much easier for them, as
>they are inherently sequential access, not random access, so addressing
>the varying block sizes just aren't an issue.

I think they have buffers big enough that they can tell whether a block
has perverse compression behavior and store it uncompressed. If the
stuff you are storing is already compressed, e.g., powerpoint files which
are in fact ZIP files with the usual ZIP compression, trying to compress
them again is counterproductive.

I ran into this a while ago doing daily archives of TLD zone files.
ICANN provides each file in gzip form and I realized that making a tgz
of all of them was a bad idea.

R's,
John
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Misc: Page Tables, Virtual Memory, Swapfile

<t1ambp$2e3$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24379&group=comp.arch#24379

Newsgroups: comp.arch
 by: BGB - Mon, 21 Mar 2022 20:14 UTC

On 3/21/2022 1:05 PM, John Levine wrote:
> According to BGB <cr88192@gmail.com>:
>> One other thing I was considering is whether or not it would be
>> "effective" to use a quick/dirty LZ compressor on the pages to reduce
>> the amount of data that needs to be written to or read from the SD card.
>
> I wouldn't do that. LZ can only compress data where it can find redundancy,
> and on random-looking data LZ can end up larger than the original.
>

Granted, but one can special-case this as a "stored" (uncompressed) block.

I already have a special case to detect and skip storing pages which are
all zeroes, which turn out to be a fairly common case (the program
commits a page, memset's it to zero, and then doesn't touch it for a while).

This case was particularly notable for program startup, where there is
initially a big flurry of activity mostly in the form of allocating and
then zeroing pages.
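
A rough sketch of the zero-page check in C (hedged: illustrative; on a
hit the caller stores a marker and skips the pagefile write entirely):

  #include <stdint.h>
  #include <stddef.h>

  int page_is_zero(const void *page, size_t size)
  {
      const uint64_t *p = (const uint64_t *)page;
      size_t n = size / sizeof(uint64_t);
      for (size_t i = 0; i < n; i++)
          if (p[i] != 0)
              return 0;
      return 1;
  }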

Generally, compressible patterns tend to be more common than
non-compressible ones among the "actually used" pages.

> There's also the issue that you're now storing blocks of variable and
> unpredictable size. I suppose if you compress 4.5K to 3.5K you win since
> you are rewriting one SD block rather than two, but it seems like
> asking for trouble.
>

Some of the stuff I was reading was saying that the SD cards internally
used blocks between 128K and 4MB.

I am mostly dealing with 16K pages.
This isn't ideal WRT an LZ compressor, but should still be plenty usable.

But, yeah, I was at first considering a simpler "in place" compacting,
which could optimize IO bandwidth, but would do little to avoid burning
out the SD card if one assumes it is using a large block size.

Granted, if the SD card were using a 4K block size, even a simple
in-place scheme could potentially help here (say, where it still
reserves/used 16K, but then compresses it down to 4..6K or similar
before storing it to the SD card).

May need to try to look into it more and see if I can better figure out
SDcard behavior in these areas.

LZ decompression is generally significantly faster than reading from the
SDcard, though compressing data tends to be slower (mostly due to checking
for LZ matches and similar). Would need to try to also optimize for a
fast LZ encoder in this case.

Eg: Using a direct 1:1 hash table for lookups rather than chain-hashing
and checking multiple matches.
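
A rough sketch of the direct-mapped match table in C (hedged:
illustrative constants; the encoder still has to verify the candidate):

  #include <stdint.h>
  #include <string.h>

  #define HASH_BITS 13
  static uint32_t last_pos[1 << HASH_BITS];   /* last position per hash */

  static inline uint32_t hash4(const uint8_t *p)
  {
      uint32_t v;
      memcpy(&v, p, 4);                       /* unaligned-safe load */
      return (v * 2654435761u) >> (32 - HASH_BITS);
  }

  /* one probe, no chains: return the prior candidate, record the new
     position; the caller checks whether the bytes actually match */
  static inline uint32_t probe(const uint8_t *src, uint32_t pos)
  {
      uint32_t h = hash4(src + pos);
      uint32_t cand = last_pos[h];
      last_pos[h] = pos;
      return cand;
  }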

....

Well, it is this, or go looking for compiler bugs. I have noticed that a
bug has popped up recently which is seemingly causing incorrect behavior
in Hexen's ACS script interpreter (also seems to be causing demos to
desync, ...). Also ROTT seems to be kinda broken at the moment as well.

The issues in this case seem unrelated either to virtual memory or to a
recent change to which encodings I am using for branch instructions:
Went from F0dd_Eddd/F0dd_Fddd to E0dd_Cddd/E4dd_Cddd mostly so that I
can (eventually) reclaim their encoding space.

Also enabled this because the virtual memory did already introduce the
(partial) breaking change in that it effectively moves MMIO from
0000_Fzzzzzzz to FFFF_Fzzzzzzz. This is mostly because, to use addresses
above the 4GB mark, I needed to set the JQ flag, which causes bits
(47:32) to no longer be ignored, which (implicitly) moves the MMIO space
to a different address (technically, any address matching the pattern
zzzz_Fzzzzzzz is MMIO when the JQ flag is clear; as this mode is meant
to mimic the behavior of a 32-bit machine).

Or, more specifically, MMIO is at:
  0000_00000000_FFFF_Fzzzzzzz
within the larger 96-bit address space. Effectively, MMIO and similar
would be disabled outside of Quadrant 0; all non-zero quadrants are
effectively virtual-address only.
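
In C terms, the address-decode rule is roughly (a sketch of the rule as
described above, not the actual core logic):

  #include <stdint.h>

  /* Checks the low 48 bits: bits (31:28)=0xF selects MMIO. With JQ
     set, bits (47:32) must also be all 1s; with JQ clear, those bits
     are ignored (mimicking a 32-bit machine). */
  static int is_mmio(uint64_t addr, int jq)
  {
      if (((addr >> 28) & 0xF) != 0xF)
          return 0;
      if (jq && ((addr >> 32) & 0xFFFF) != 0xFFFF)
          return 0;
      return 1;
  }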

Moving MMIO (in a properly designed OS) would have been invisible to
applications, except that TestKern is still DOS-like in that it doesn't
have (or enforce) a proper HAL. This is something that will need to
change eventually though (though, preferably in a way that doesn't ruin
performance worse than it is already, *).

*: For example, if I went the route of trying to force "blit"
operations through the filesystem API, this could abstract the display
hardware at least, but would not be ideal for performance, as it would
significantly increase the cost of copying an image to the screen.

Granted, it may make sense to change this, as the MMIO framebuffer
interface is now itself effectively a wrapper over a DRAM backed
framebuffer, and it could be faster to instead "blit" into this
underlying screen RAM than doing so via MMIO operations, ...

But, I am on the fence about things like a dedicated "blit image from
this buffer to this location in screen memory" system call (and "Unix
tradition" would be "shoehorn it through ioctl or write or
something..."; and the "faster" options are generally "do it from within
userland without going through a syscall", *).

....

*: Or, the RasPi option of mmap'ing the MMIO space into userland and
then sidestepping the kernel, which is arguably "not ideal" either...

Though, partly this is because it has grown out of random crap I threw
together mostly as test-case logic, which has taken an OS-like form.

One could probably infer that it is kind of lazily hacked-together in
that the "OS" is located as a sub-directory under the Quake source tree,
which would likely not be the case in a sanely designed OS (partly this
is because Quake came first, and the "OS" was an offshoot of the C
library and support code needed to make Quake work...).

Could reorganize stuff, but this would be its own thing.
Probably also good would be if application binaries didn't link in most
of the kernel (done so that they can be launched bare-metal); this has
pros/cons, as it makes debugging harder in the emulator (the symbol
table mechanism only matches up symbols for things loaded to a fixed
address, which isn't true of binaries loaded via the TestKern shell).

....

Also because porting Linux to BJX2 looks like too much effort at present
(with Linux being huge and complicated).

....

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1ao8a$abq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24380&group=comp.arch#24380

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Mon, 21 Mar 2022 15:47:05 -0500
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <t1ao8a$abq$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 21 Mar 2022 20:47:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4a46ca344aa2f35062157743cce59a47";
logging-data="10618"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Tv0upZdKuO6l9HYcSbsP1"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:+yXeHhj8P1Cqrvzmr3t+IuPs+KE=
In-Reply-To: <t1aliq$81f$1@gal.iecc.com>
Content-Language: en-US
 by: BGB - Mon, 21 Mar 2022 20:47 UTC

On 3/21/2022 3:01 PM, John Levine wrote:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>> It has been universal for tape drives and before that tape drive
>> controllers, for decades. Of course, it is much easier for them, as
>> they are inherently sequential access, not random access, so addressing
>> the varying block sizes just aren't an issue.
>
> I think they have buffers big enough that they can tell whether a block
> has perverse compression behavior and store it uncompressed. If the
> stuff you are storing is already compressed, e.g., powerpoint files which
> are in fact ZIP files with the usual ZIP compression, trying to compress
> them again is counterproductive.
>

You can detect these cases easily enough:
You try to compress it, and it gets bigger (or, at least, doesn't get
sufficiently smaller);
If this happens, the pagefile logic can be like "well, that didn't work,
store it uncompressed".

The encoder could also have a fixed size output buffer it encodes into,
and it checks how much space is left. If we hit the end of this buffer,
the encoder aborts, and then the page is assumed not to be compressible.

Say, for example, if the page size is 16K, the encoder can abort early
if it exceeds 13.5K or so (at this point, the compression being "no
longer worthwhile").

Provided one has a few bits to encode a method, this is good, say:
00: Zeroed Page
01: Stored Page (couldn't compress)
10: Use RP2 or similar (TBD).
11: ?
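
As a sketch of how the pagefile side might drive this (method codes as
above; rp2_encode() is a stand-in assumed to return -1 once it runs
past the output limit, and page_is_zero() is the obvious all-zeroes
scan):

  #include <stdint.h>
  #include <string.h>

  enum { PG_ZERO = 0, PG_STORED = 1, PG_RP2 = 2 };

  extern int page_is_zero(const void *page);
  extern int rp2_encode(uint8_t *dst, int dst_max,
                        const uint8_t *src, int n);

  /* Pack a 16K page into 'out'; returns the method code, with
     *out_len giving the number of bytes actually stored. */
  static int pack_page(const uint8_t *page, uint8_t *out, int *out_len)
  {
      int n;
      if (page_is_zero(page)) {
          *out_len = 0;                  /* store nothing at all */
          return PG_ZERO;
      }
      n = rp2_encode(out, 13824, page, 16384);  /* bail past ~13.5K */
      if (n < 0) {
          memcpy(out, page, 16384);      /* didn't help; store raw */
          *out_len = 16384;
          return PG_STORED;
      }
      *out_len = n;
      return PG_RP2;
  }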

I am mostly skipping LZ4 as an option, because:
It tends to compress worse than RP2;
Its performance (on BJX2) tends to be worse than RP2.

Some of this is mostly because LZ4 needs a lot of byte loads, and its
mod-255 length encoding "kinda sucks" in this area.

RP2 needs a lot of bit-shifting and masking, but BJX2 does pretty OK at
this part. However, decoding LZ4 does tend to be faster on x86 and similar.

Both formats would have similar cost on the encoder side of things.

The choice can be pretty much arbitrary here, since pagefiles are not
used for data interchange...

Though, I guess it is possible that someone could have a pagefile which
also serves as a hibernation file (provided some additional metadata).

> I ran into this a while ago doing daily archives of TLD zone files.
> ICANN provides each file in gzip form and I realized that making a tgz
> of all of them was a bad idea.
>

I don't think anyone is claiming here that non-compressible data doesn't
exist...

But, a lot of stuff one might find in an applications' heap memory or
similar is fairly readily compressible.

Programs don't (usually) flood their heap space with random garbage,
and even if one does, the result isn't really any worse than if a
non-compressed pagefile had been used.

> R's,
> John

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1ap7m$rp2$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24382&group=comp.arch#24382

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Mon, 21 Mar 2022 14:03:50 -0700
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <t1ap7m$rp2$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com> <t1ao8a$abq$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 21 Mar 2022 21:03:50 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d276618c96160569f3105435f9bb5476";
logging-data="28450"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18k3uScKXan77Yolrn7d4DeLn0UnzQEWAs="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Cancel-Lock: sha1:nOheasTA7AD/EvUMxrDmrLl45tM=
In-Reply-To: <t1ao8a$abq$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Mon, 21 Mar 2022 21:03 UTC

On 3/21/2022 1:47 PM, BGB wrote:
> On 3/21/2022 3:01 PM, John Levine wrote:
>> According to Stephen Fuld  <sfuld@alumni.cmu.edu.invalid>:
>>> It has been universal for tape drives and before that tape drive
>>> controllers, for decades.  Of course, it is much easier for them, as
>>> they are inherently sequential access, not random access, so addressing
>>> the varying block sizes just aren't an issue.
>>
>> I think they have buffers big enough that they can tell whether a block
>> has perverse compression behavior and store it uncompressed.  If the
>> stuff you are storing is already compressed, e.g., powerpoint files which
>> are in fact ZIP files with the usual ZIP compression, trying to compress
>> them again is counterproductive.

I think that is right. Given that you only need a tape block size
buffer, it isn't a big problem. Of course, you need to prepend a bit to
indicate not to try decompressing it when you read the tape back.

> You can detect these cases easily enough:
> You try to compress it, and it gets bigger (or, at least, doesn't get
> sufficiently smaller);
> If this happens, the pagefile logic can be like "well, that didn't work,
> store it uncompressed".

You are talking about a different situation than John and I are. We are
talking about hardware compression in the tape drive (or tape
controller), not some logic in the host software. And, of course, no
one is going to use a tape drive for a page file. :-)

snip

> I am mostly skipping LZ4 as an option, because:
>   It tends to compress worse than RP2;
>   Its performance (on BJX2) tends to be worse than RP2.
>
> Some of this is mostly because LZ4 needs a lot of byte loads, and its
> mod-255 length encoding "kinda sucks" in this area.
>
> RP2 needs a lot of bit-shifting and masking, but BJX2 does pretty OK at
> this part. However, decoding LZ4 does tend to be faster on x86 and similar.

Again, the difference between doing it in software versus hardware. The
original IBM 3480 used arithmetic compression. I don't know what
current drives use.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<5e2eed77-78f0-4526-a579-c5f59839d4f0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=24386&group=comp.arch#24386

X-Received: by 2002:ac8:5a95:0:b0:2e2:e4f:63c with SMTP id c21-20020ac85a95000000b002e20e4f063cmr8003074qtc.537.1647900320035;
Mon, 21 Mar 2022 15:05:20 -0700 (PDT)
X-Received: by 2002:a05:6870:f2a5:b0:da:b3f:2b50 with SMTP id
u37-20020a056870f2a500b000da0b3f2b50mr453806oap.239.1647900319768; Mon, 21
Mar 2022 15:05:19 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 21 Mar 2022 15:05:19 -0700 (PDT)
In-Reply-To: <t1ap7m$rp2$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:30:df54:dce3:b07d;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:30:df54:dce3:b07d
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com> <t1ao8a$abq$1@dont-email.me> <t1ap7m$rp2$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5e2eed77-78f0-4526-a579-c5f59839d4f0n@googlegroups.com>
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 21 Mar 2022 22:05:20 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 47
 by: MitchAlsup - Mon, 21 Mar 2022 22:05 UTC

On Monday, March 21, 2022 at 4:03:53 PM UTC-5, Stephen Fuld wrote:
> On 3/21/2022 1:47 PM, BGB wrote:
> > On 3/21/2022 3:01 PM, John Levine wrote:
> >> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
> >>> It has been universal for tape drives and before that tape drive
> >>> controllers, for decades. Of course, it is much easier for them, as
> >>> they are inherently sequential access, not random access, so addressing
> >>> the varying block sizes just aren't an issue.
> >>
> >> I think they have buffers big enough that they can tell whether a block
> >> has perverse compression behavior and store it uncompressed. If the
> >> stuff you are storing is already compressed, e.g., powerpoint files which
> >> are in fact ZIP files with the usual ZIP compression, trying to compress
> >> them again is counterproductive.
> I think that is right. Given that you only need a tape block size
> buffer, it isn't a big problem. Of course, you need to prepend a bit to
> indicate not to try decompressing it when you read the tape back.
> > You can detect these cases easily enough:
> > You try to compress it, and it gets bigger (or, at least, doesn't get
> > sufficiently smaller);
> > If this happens, the pagefile logic can be like "well, that didn't work,
> > store it uncompressed".
<
> You are talking about a different situation than John and I are. We are
> talking about a hardware compression in the tape drive (or tape
> controller), not some logic in the host software. And, of course, no
> one is going to use a tape drive for a page file. :-)
<
I used a paper tape punch to make a PDP-11/20 into a time sharing machine.
You just feed the output tape into the reader and hit go. The mess is rather
spectacular at the end.
>
> snip
> > I am mostly skipping LZ4 as an option, because:
> > It tends to compress worse than RP2;
> > Its performance (on BJX2) tends to be worse than RP2.
> >
> > Some of this is mostly because LZ4 needs a lot of byte loads, and its
> > mod-255 length encoding "kinda sucks" in this area.
> >
> > RP2 needs a lot of bit-shifting and masking, but BJX2 does pretty OK at
> > this part. However, decoding LZ4 does tend to be faster on x86 and similar.
> Again, difference between doing it software versus hardware. The
> original IBM 3480 used arithmetic compression. I don't know what
> current drives use.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1ateg$d9h$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=24388&group=comp.arch#24388

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Mon, 21 Mar 2022 22:15:44 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t1ateg$d9h$1@newsreader4.netcologne.de>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com>
Injection-Date: Mon, 21 Mar 2022 22:15:44 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:30bd:0:7285:c2ff:fe6c:992d";
logging-data="13617"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 21 Mar 2022 22:15 UTC

John Levine <johnl@taugh.com> schrieb:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>>It has been universal for tape drives and before that tape drive
>>controllers, for decades. Of course, it is much easier for them, as
>>they are inherently sequential access, not random access, so addressing
>>the varying block sizes just aren't an issue.
>
> I think they have buffers big enough that they can tell whether a block
> has perverse compression behavior and store it uncompressed.

Encrypted blocks should have that behavior, or the algorithm is
no good :-)

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1ats6$v9a$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24389&group=comp.arch#24389

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Mon, 21 Mar 2022 17:23:01 -0500
Organization: A noiseless patient Spider
Lines: 87
Message-ID: <t1ats6$v9a$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com> <t1ao8a$abq$1@dont-email.me>
<t1ap7m$rp2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 21 Mar 2022 22:23:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4a46ca344aa2f35062157743cce59a47";
logging-data="32042"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18lYfQft6I0GkMqlShkwQiI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:1P3E7BQx5Fnr3qkyw4DH10EwYto=
In-Reply-To: <t1ap7m$rp2$1@dont-email.me>
Content-Language: en-US
 by: BGB - Mon, 21 Mar 2022 22:23 UTC

On 3/21/2022 4:03 PM, Stephen Fuld wrote:
> On 3/21/2022 1:47 PM, BGB wrote:
>> On 3/21/2022 3:01 PM, John Levine wrote:
>>> According to Stephen Fuld  <sfuld@alumni.cmu.edu.invalid>:
>>>> It has been universal for tape drives and before that tape drive
>>>> controllers, for decades.  Of course, it is much easier for them, as
>>>> they are inherently sequential access, not random access, so addressing
>>>> the varying block sizes just aren't an issue.
>>>
>>> I think they have buffers big enough that they can tell whether a block
>>> has perverse compression behavior and store it uncompressed.  If the
>>> stuff you are storing is already compressed, e.g., powerpoint files
>>> which
>>> are in fact ZIP files with the usual ZIP compression, trying to compress
>>> them again is counterproductive.
>
> I think that is right.  Given that you only need a tape block size
> buffer, it isn't a big problem.  Of course, you need to prepend a bit to
> indicate not to try decompressing it when you read the tape back.
>
>
>> You can detect these cases easily enough:
>> You try to compress it, and it gets bigger (or, at least, doesn't get
>> sufficiently smaller);
>> If this happens, the pagefile logic can be like "well, that didn't
>> work, store it uncompressed".
>
> You are talking about a different situation than John and I are.  We are
> talking about a hardware compression in the tape drive (or tape
> controller), not some logic in the host software.  And, of course, no
>> one is going to use a tape drive for a page file.  :-)
>

OK, had missed that the topic had got redirected here...

> snip
>
>> I am mostly skipping LZ4 as an option, because:
>>    It tends to compress worse than RP2;
>>    Its performance (on BJX2) tends to be worse than RP2.
>>
>> Some of this is mostly because LZ4 needs a lot of byte loads, and its
>> mod-255 length encoding "kinda sucks" in this area.
>>
>> RP2 needs a lot of bit-shifting and masking, but BJX2 does pretty OK
>> at this part. However, decoding LZ4 does tend to be faster on x86 and
>> similar.
>
> Again, difference between doing it software versus hardware.  The
> original IBM 3480 used arithmetic compression.  I don't know what
> current drives use.
>

OK, yeah. Arithmetic encoding is also very different from byte-oriented
LZ77...

I am not considering any sort of entropy coding in the pagefile case, as
the extra cost and complexity here is not likely to be worthwhile.

I guess maybe someone could evaluate the tradeoffs of LZW or similar in
this case, but LZW kinda sucked in my past testing (not great in terms
of either speed or compression ratio when compared with LZ77 variants,
also more complicated).

There is also LZP (which uses a hash-table-based match predictor), which
can work pretty OK for compression ratio (it allows encoding matches
without needing to encode a match distance). However, the need for the
decoder to maintain a hash table which is bit-exact to the encoder's has
some serious drawbacks.

An encoding for a simple LZP style format might be, say:
llllrrr0, Emit 0..7 literal bytes, followed by a 4..18 byte match
*01, ... Longer cases.

Where the preceding 3 or 4 bytes are mashed down to 8..12 bits to be
used as a lookup into a hash table (which holds the window position at
the last time this hash index was seen; a match is copied if one is
encountered).

Actually, such a format could make sense for a pagefile, since
compression and decompression are roughly at parity in speed, with no
real need for format interchange.
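
The core of it, in rough C (illustrative only; table initialization and
the pos>=3 edge case are glossed over, and a real decoder also has to
update the table inside match copies to stay bit-exact):

  #include <stdint.h>

  #define LZP_BITS 10

  /* Hash the 3 bytes preceding 'pos' down to LZP_BITS bits. */
  static uint32_t lzp_hash(const uint8_t *src, uint32_t pos)
  {
      uint32_t v = src[pos-1] | (src[pos-2] << 8) |
                   ((uint32_t)src[pos-3] << 16);
      return (v * 2654435761u) >> (32 - LZP_BITS);
  }

  /* Probe the predictor: return the predicted match length at 'pos'
     (0 on a miss) and update the table. The decoder runs the same
     probe/update, which is why no distance needs to be stored. */
  static int lzp_predict(uint32_t *tab, const uint8_t *src,
                         uint32_t pos, uint32_t end)
  {
      uint32_t h = lzp_hash(src, pos);
      uint32_t pred = tab[h];
      int len = 0;
      tab[h] = pos;
      while (pos + len < end && src[pred + len] == src[pos + len])
          len++;
      return len;
  }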

....

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1b0eg$h1a$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24391&group=comp.arch#24391

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Mon, 21 Mar 2022 16:06:57 -0700
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <t1b0eg$h1a$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com> <t1ao8a$abq$1@dont-email.me>
<t1ap7m$rp2$1@dont-email.me>
<5e2eed77-78f0-4526-a579-c5f59839d4f0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 21 Mar 2022 23:06:56 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="497a4cf3baa82b393718ff3361e6e263";
logging-data="17450"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19vXkuzE7xsnersgZkpLjZ5"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.2
Cancel-Lock: sha1:e9wcsFjFzS/Iz3biuz//xXTiq4E=
In-Reply-To: <5e2eed77-78f0-4526-a579-c5f59839d4f0n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Mon, 21 Mar 2022 23:06 UTC

On 3/21/2022 3:05 PM, MitchAlsup wrote:
> On Monday, March 21, 2022 at 4:03:53 PM UTC-5, Stephen Fuld wrote:
>> On 3/21/2022 1:47 PM, BGB wrote:
>>> On 3/21/2022 3:01 PM, John Levine wrote:
>>>> According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
>>>>> It has been universal for tape drives and before that tape drive
>>>>> controllers, for decades. Of course, it is much easier for them, as
>>>>> they are inherently sequential access, not random access, so addressing
>>>>> the varying block sizes just aren't an issue.
>>>>
>>>> I think they have buffers big enough that they can tell whether a block
>>>> has perverse compression behavior and store it uncompressed. If the
>>>> stuff you are storing is already compressed, e.g., powerpoint files which
>>>> are in fact ZIP files with the usual ZIP compression, trying to compress
>>>> them again is counterproductive.
>> I think that is right. Given that you only need a tape block size
>> buffer, it isn't a big problem. Of course, you need to prepend a bit to
>> indicate not to try decompressing it when you read the tape back.
>>> You can detect these cases easily enough:
>>> You try to compress it, and it gets bigger (or, at least, doesn't get
>>> sufficiently smaller);
>>> If this happens, the pagefile logic can be like "well, that didn't work,
>>> store it uncompressed".
> <
>> You are talking about a different situation than John and I are. We are
>> talking about a hardware compression in the tape drive (or tape
>> controller), not some logic in the host software. And, of course, no
>> one is going to use a tape drive for a page file. :-)
> <
> I used a paper tape punch to make a PDP-11/20 into a time sharing machine.
> You just feed the output tape into the reader and hit go. The mess is rather
> spectacular at the end.

Fabulous! In both senses.

Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile

<t1blkr$t7b$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24393&group=comp.arch#24393

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: compressing, was Misc: Page Tables, Virtual Memory, Swapfile
Date: Tue, 22 Mar 2022 00:08:41 -0500
Organization: A noiseless patient Spider
Lines: 131
Message-ID: <t1blkr$t7b$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1aepg$2dio$1@gal.iecc.com>
<2022Mar21.192939@mips.complang.tuwien.ac.at> <t1ak5f$jr6$1@dont-email.me>
<t1aliq$81f$1@gal.iecc.com> <t1ao8a$abq$1@dont-email.me>
<t1ap7m$rp2$1@dont-email.me> <t1ats6$v9a$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 22 Mar 2022 05:08:43 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="704730ca154d89a2457bb61df0c35dcd";
logging-data="29931"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Lf/Z2UoDBoWYY6UeazYlV"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:miCzp7+stPAQ5zujQm9MdU+P4x0=
In-Reply-To: <t1ats6$v9a$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 22 Mar 2022 05:08 UTC

On 3/21/2022 5:23 PM, BGB wrote:
> On 3/21/2022 4:03 PM, Stephen Fuld wrote:
>> On 3/21/2022 1:47 PM, BGB wrote:
>>> On 3/21/2022 3:01 PM, John Levine wrote:
>>>> According to Stephen Fuld  <sfuld@alumni.cmu.edu.invalid>:
>>>>> It has been universal for tape drives and before that tape drive
>>>>> controllers, for decades.  Of course, it is much easier for them, as
>>>>> they are inherently sequential access, not random access, so
>>>>> addressing
>>>>> the varying block sizes just aren't an issue.
>>>>
>>>> I think they have buffers big enough that they can tell whether a block
>>>> has perverse compression behavior and store it uncompressed.  If the
>>>> stuff you are storing is already compressed, e.g., powerpoint files
>>>> which
>>>> are in fact ZIP files with the usual ZIP compression, trying to
>>>> compress
>>>> them again is counterproductive.
>>
>> I think that is right.  Given that you only need a tape block size
>> buffer, it isn't a big problem.  Of course, you need to prepend a bit
>> to indicate not to try decompressing it when you read the tape back.
>>
>>
>>> You can detect these cases easily enough:
>>> You try to compress it, and it gets bigger (or, at least, doesn't get
>>> sufficiently smaller);
>>> If this happens, the pagefile logic can be like "well, that didn't
>>> work, store it uncompressed".
>>
>> You are talking about a different situation than John and I are.  We
>> are talking about a hardware compression in the tape drive (or tape
>> controller), not some logic in the host software.  And, of course, no
>> one is going to use a tape drive for a page file.  :-)
>>
>
> OK, had missed that the topic had got redirected here...
>
>
>> snip
>>
>>> I am mostly skipping LZ4 as an option, because:
>>>    It tends to compress worse than RP2;
>>>    Its performance (on BJX2) tends to be worse than RP2.
>>>
>>> Some of this is mostly because LZ4 needs a lot of byte loads, and its
>>> mod-255 length encoding "kinda sucks" in this area.
>>>
>>> RP2 needs a lot of bit-shifting and masking, but BJX2 does pretty OK
>>> at this part. However, decoding LZ4 does tend to be faster on x86 and
>>> similar.
>>
>> Again, difference between doing it software versus hardware.  The
>> original IBM 3480 used arithmetic compression.  I don't know what
>> current drives use.
>>
>
> OK, yeah. Arithmetic encoding is also very different from byte-oriented
> LZ77 as well...
>
> I am not considering any sort of entropy coding in the pagefile case, as
> the extra cost and complexity here is not likely to be worthwhile.
>
>
> I guess maybe someone could evaluate the tradeoffs of LZW or similar in
> this case, but LZW kinda sucked in my past testing (not great in terms
> of either speed or compression ratio when compared with LZ77 variants,
> also more complicated).
>

....

>
> There is also LZP (uses a hash table based match predictor), which can
> work pretty OK for compression ratio (allows encoding matches without
> needing to encode a match distance). However, the need for the decoder
> to maintain a hash table which is bit exact to the one which existed in
> the encoder has some serious drawbacks.
>
>
> An encoding for a simple LZP style format might be, say:
>  llllrrr0, Emit 0..7 literal bytes, followed by a 4..18 byte match
>       *01, ... Longer cases.
>
> Where, the preceding 3 or 4 bytes are mashed down to 8..12 bits to be
> used as a lookup into a hash table (encodes the window position at the
> last time this hash index was seen; copying a match if one is encountered).
>
> Actually, such a format could make sense for a pagefile, since
> compression and decompression is roughly at parity, with no real need
> for format interchange.
>

After throwing together some code, I tested this idea (on an offline
test, compressing a file as 16K chunks):
Simple LZP variant, 10-bit hash, avg=9.0K (56.3% orig size)
RP2 (Normal Encoder), avg=5.9K (36.9% orig size)
RP2 (Fast Mini Encoder, 10-bit hash), avg=7.3K (45.6% orig size)

For reference:
Deflate (gzip -9): avg=5.54K
LZ4 (lz4 -9): avg=6.59K

The LZP variant is a slightly expanded version of the idea from the
prior post:
  llllrrr0
  llllllll-rrrrrr01

So, RP2 still does better than the LZP variant despite still needing to
encode a distance, mostly because hashing the next several bytes is
somewhat more effective than hashing the preceding bytes (as the LZP
variant does).

The normal encoder uses a bigger hash, and looks up via a hash chain.
The mini encoder only checks for a single match.

RP2 is also faster than LZP, as it can use block-oriented copying,
whereas the LZP variant needs to copy one byte at a time and update the
hash after each byte.

The Mini-RP2 encoder also still does better than the LZP variant even
when using a smaller hash table (8-bit lookup).

Although it is still "to be determined" whether I will bother with
swapfile compression, it looks like RP2 would probably do OK at it.

Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<t1ie0o$l1m$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24459&group=comp.arch#24459

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
Date: Thu, 24 Mar 2022 13:41:25 -0500
Organization: A noiseless patient Spider
Lines: 125
Message-ID: <t1ie0o$l1m$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 24 Mar 2022 18:41:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="82de82794c627957d0b2aa4e1623b11d";
logging-data="21558"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Uo9gsTyjZYh5dm3VJlqcc"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:iSqdOoY8du2ya2e28KhCIHdphSE=
In-Reply-To: <t19cei$aj7$1@dont-email.me>
Content-Language: en-US
 by: BGB - Thu, 24 Mar 2022 18:41 UTC

On 3/21/2022 3:19 AM, BGB wrote:
> Well, as of earlier today I was (more or less) now able to get
> pagefile/swapfile stuff working.
>
> More testing/debugging is likely still needed, but "so far so good" it
> seems (such as whether it holds up when the program starts using enough
> memory to cause the pagefile to start thrashing, ...).
>
(snip)

Thus far, this stuff still seems to be working.

Though, this isn't to say it is yet confirmed to work for programs
which start thrashing the pagefile. It appears that the 32MB I set aside
for this is going a lot further than initially expected.

Compared with physically-mapped memory, it almost works like a RAM
multiplier.

>
> I had also been experimenting with alternatives to page tables, and
> ended up going with AVL trees as the main alternative.
>

Add B-Trees to this list.
I realized that B-Trees have a few capabilities which the AVL trees lack.

> * For small virtual-address spaces, the overhead of the AVL walk is
> relatively modest compared with the cost of saving/restoring all of the
> CPU registers and similar for the ISR. For larger address spaces, they
> were faster than using a chain-hash (lookup cost follows a log2 curve
> relative to memory footprint, whereas the chain-hash follows a linear
> cost curve).
>
> * Along with chain-hashing, they are the most space-efficient option at
> present. Memory overhead is significantly lower than traditional page
> tables, and slightly smaller than B-Trees in this case (due to the
> comparably large "keys" in this case).
>
>
> So, this leaves as current options:
>   4K pages, 4-level page table (48b VA);
>   16K pages, 3-level page table (47b VA, currently used);
>   64K pages, 2-level page table (42b VA);
>   4K/16K/64K, AVL, 48b VA (variable height);
>   4K/16K/64K, AVL, 96b VA (variable height).
>
> The 10/8/7 level page-table formats are (most likely) now effectively
> deprecated.
>
> While hybrid tables are a possible (AVL with the bottom level as a
> page-table), they were kind of lackluster in my tests (neither
> particularly fast nor particularly space efficient).
>

Have added B-Trees now, and posted a version on PasteBin:
https://pastebin.com/65chYBPs

After some design tweaks (eg, changing how nodes are structured
internally, and going from 510 to 816 entries per 16K node, *) the
B-Trees are now slightly ahead of the AVL trees in terms of memory density.

*: The original design had used 32B entries, placed in contiguous
memory. I debated using contiguous memory for the 20B elements, which,
while better for the L1 cache, requires a multiplication by 5 (granted,
in theory this case could be encoded with a LEA.L instruction).
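
For illustration, the layout is something like (the exact field split
is a guess; 816*20B = 16320B, which plus a 64B header fills a 16K node
exactly):

  #include <stdint.h>

  /* One 16K B-Tree node; the 20B entries are split into parallel
     arrays, so indexing needs no *20 (or *5) multiply per array. */
  typedef struct {
      uint16_t n_used;            /* entries in use             */
      uint16_t flags;             /* leaf/internal, etc.        */
      uint8_t  pad[60];           /* header padded to 64B       */
      uint64_t key_hi[816];       /* high 64 bits of 96-bit VA  */
      uint32_t key_lo[816];       /* low 32 bits of 96-bit VA   */
      uint64_t val[816];          /* PTE or child reference     */
  } btnode_t;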

Though, despite the parallel arrays, this version seems to be
slightly ahead in terms of overall performance.

Fetching pages is faster than with the AVL trees; however, B-Tree
performance suffers (relative to AVL) when inserting new pages (updating
existing pages is still OK, so it mostly adds a penalty for things like
mmap/VirtualAlloc style operations, or for the initial "flood fill"
during start-up).

There could still be ways to make it faster, but my fiddling wasn't
gaining much here.

Granted, one could argue that maybe my implementation of a B-Tree is
kinda crap; won't exactly deny this...

It is possible I could consider a version optimized for 48b addressing
(rather than 96b), though the most obvious option here would be to
switch to a different node structure again (likely with 1022 elements
per node).

But, this would be in head-to-head competition with using a 3-level
page-table. Even a "faster" B-Tree isn't likely to win this fight in
terms of performance.

It mostly comes down to memory density in the face of ASLR.

I guess, more thoughts?...

<snip>

Not much new to add to the rest.

For now, I am leaning against pagefile compression in the near term:
LZ compression is slow;
I am not getting enough compression on average to have much effect.

A reduction to around 65% of the original page size (for non-zeroed
pages) is not sufficient.

The hope was that pages would have had a lot more highly compressible
data, but once one skips over "large spans of zeroed memory" the results
are only modestly compressible.

It seems like one can get OK enough results with either zero-skipping,
or maybe skipping zeroed sub-pages (only storing the nonzero parts). The
"compressor" for this scheme can also be made a bit faster.

....

Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<Pkj%J.214706$r6p7.200598@fx41.iad>


https://www.novabbs.com/devel/article-flat.php?id=24467&group=comp.arch#24467

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx41.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
In-Reply-To: <t1ie0o$l1m$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 43
Message-ID: <Pkj%J.214706$r6p7.200598@fx41.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 25 Mar 2022 13:20:47 UTC
Date: Fri, 25 Mar 2022 09:19:40 -0400
X-Received-Bytes: 2563
 by: EricP - Fri, 25 Mar 2022 13:19 UTC

BGB wrote:
> On 3/21/2022 3:19 AM, BGB wrote:
>
>>
>> I had also been experimenting with alternatives to page tables, and
>> ended up going with AVL trees as the main alternative.
>>
>
> Add B-Trees to this list.
> I realized that B-Trees have a few capabilities which the AVL trees lack.
>
>
>> * For small virtual-address spaces, the overhead of the AVL walk is
>> relatively modest compared with the cost of saving/restoring all of
>> the CPU registers and similar for the ISR. For larger address spaces,
>> they were faster than using a chain-hash (lookup cost follows a log2
>> curve relative to memory footprint, whereas the chain-hash follows a
>> linear cost curve).
>>
>> * Along with chain-hashing, they are the most space-efficient option
>> at present. Memory overhead is significantly lower than traditional
>> page tables, and slightly smaller than B-Trees in this case (due to
>> the comparably large "keys" in this case).
>>
>>
>> So, this leaves as current options:
>> 4K pages, 4-level page table (48b VA);
>> 16K pages, 3-level page table (47b VA, currently used);
>> 64K pages, 2-level page table (42b VA);
>> 4K/16K/64K, AVL, 48b VA (variable height);
>> 4K/16K/64K, AVL, 96b VA (variable height).

I don't see how it is possible for AVL to compete with a radix tree.
You must be doing something different than what I thought you were saying.

Because I don't see how an AVL tree with 1 node per 4kB page could
compete with the fixed 4 memory accesses to translate a radix tree.
A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
are 4 times the size of a PTE with key, left&right ptr plus data word.

Are you doing AVL tree variable length *segments* rather than pages?

Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<t1kru4$fh7$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24471&group=comp.arch#24471

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
Date: Fri, 25 Mar 2022 11:51:15 -0500
Organization: A noiseless patient Spider
Lines: 119
Message-ID: <t1kru4$fh7$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
<Pkj%J.214706$r6p7.200598@fx41.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 25 Mar 2022 16:51:16 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f2432f73bfac495c94b9e46d8a5e15e8";
logging-data="15911"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/X6SlvA7UMZdxmeVBwyerv"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:Tvxnz1kRB9WZyajH3Ji4fD+wugg=
In-Reply-To: <Pkj%J.214706$r6p7.200598@fx41.iad>
Content-Language: en-US
 by: BGB - Fri, 25 Mar 2022 16:51 UTC

On 3/25/2022 8:19 AM, EricP wrote:
> BGB wrote:
>> On 3/21/2022 3:19 AM, BGB wrote:
>>
>>>
>>> I had also been experimenting with alternatives to page tables, and
>>> ended up going with AVL trees as the main alternative.
>>>
>>
>> Add B-Trees to this list.
>> I realized that B-Trees have a few capabilities which the AVL trees lack.
>>

Also with some design tweaks and optimizations, the B-Trees now seem to
be ahead.

I am likely now back to putting emphasis on B-Trees rather than AVL Trees.

>>
>>> * For small virtual-address spaces, the overhead of the AVL walk is
>>> relatively modest compared with the cost of saving/restoring all of
>>> the CPU registers and similar for the ISR. For larger address spaces,
>>> they were faster than using a chain-hash (lookup cost follows a log2
>>> curve relative to memory footprint, whereas the chain-hash follows a
>>> linear cost curve).
>>>
>>> * Along with chain-hashing, they are the most space-efficient option
>>> at present. Memory overhead is significantly lower than traditional
>>> page tables, and slightly smaller than B-Trees in this case (due to
>>> the comparably large "keys" in this case).
>>>
>>>
>>> So, this leaves as current options:
>>>    4K pages, 4-level page table (48b VA);
>>>    16K pages, 3-level page table (47b VA, currently used);
>>>    64K pages, 2-level page table (42b VA);
>>>    4K/16K/64K, AVL, 48b VA (variable height);
>>>    4K/16K/64K, AVL, 96b VA (variable height).
>
> I don't see how it is possible for AVL to compete with a radix tree.
> You must be doing something different than what I thought you were saying.
>
> Because I don't see how an AVL tree with 1 node per 4kB page could
> compete with the fixed 4 memory accesses to translate a radix tree.
> A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
> are 4 times the size of a PTE with key, left&right ptr plus data word.
>

I am mostly using 16K pages.

The AVL Tree currently uses 32B per node.
I had to bit-pack some stuff to fit it into 32B.

The B-Tree currently uses 20B per entry (for 96-bit addressing).
Nodes being 16K and holding 816 entries.

The Page Tables (Radix Tree) used 8B per entry.
They would win if they were more than (on average):
25% full (vs AVL)
40% full (vs B-Tree)
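
(Spelling out the break-even arithmetic: a radix level spends 8B on
every slot whether used or not, while the trees spend a roughly fixed
amount per page actually mapped, ignoring node slack; so the radix
table breaks even at 8B/32B = 25% average fill vs AVL, and at
8B/20B = 40% vs the B-Tree.)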

Throw ASLR in the mix, and most of the page table (Radix Tree) pages
will end up mostly empty (far below the break-even point).
Most of the pages ending up mostly full of zeroes.

The ASLR strategy I am using mostly uses a random number generator to
generate the addresses for memory allocations: one generates a random
number and then verifies it doesn't hit anything (if it does, try
again).
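
In sketch form (rng64() and the overlap check are placeholders; the
mask keeps the base 16K-aligned within the low 48 bits):

  #include <stdint.h>

  extern uint64_t rng64(void);
  extern int range_overlaps_existing(uint64_t base, uint64_t size);

  /* Pick a random 16K-aligned base address; retry on collision with
     any existing mapping. */
  static uint64_t aslr_pick_base(uint64_t size)
  {
      for (;;) {
          uint64_t base = rng64() & 0x0000FFFFFFFFC000ull;
          if (!range_overlaps_existing(base, size))
              return base;
      }
  }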

There is also the possibility of hybrid tables (last level using 8B
entries), which should do well if the last level is sufficiently full.

However, this still requires either allocating objects adjacent to each
other (going against ASLR), or requires average allocation sizes to be
considerably larger (8MB or 16MB rather than 1MB).

However, hybrid tables are a bit faster.

....

> Are you doing AVL tree variable length *segments* rather than pages?
>

No, it is still pages.
It was more about memory-use efficiency than speed in this case.

Compared with page-tables, the AVL Trees and B-Trees are very slow.

However, they also end up using significantly less memory in many
scenarios (namely with ASLR).

However, despite the slowness, it doesn't make much difference relative
to the average cycle count of the ISR, which is bad enough (due to other
ISR-related overheads) to mostly mask the difference.

The problem area isn't really with 3 and 4 level tables (which are used
for 48-bit addressing), but rather with 8 and 10 level tables used in
96-bit addressing.

When ASLR was used (spreading memory chunks all over the place, using
the same chunking size I typically used for malloc, namely 1MB), the
space overhead was *absurd* (albeit, not anywhere near as bad with a
3-level table).

With an 8 level table, doing 256 randomized 1MB allocations eats ~ 29MB.
With a B-Tree, it is 480K, with an AVL it is 528K.

....

Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=24472&group=comp.arch#24472

X-Received: by 2002:a05:620a:1127:b0:67e:7670:b5c with SMTP id p7-20020a05620a112700b0067e76700b5cmr7780078qkk.367.1648228542552;
Fri, 25 Mar 2022 10:15:42 -0700 (PDT)
X-Received: by 2002:a05:6808:1451:b0:2ec:cfe4:21e with SMTP id
x17-20020a056808145100b002eccfe4021emr10966134oiv.147.1648228542293; Fri, 25
Mar 2022 10:15:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 10:15:42 -0700 (PDT)
In-Reply-To: <t1kru4$fh7$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
<Pkj%J.214706$r6p7.200598@fx41.iad> <t1kru4$fh7$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com>
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Mar 2022 17:15:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 126
 by: MitchAlsup - Fri, 25 Mar 2022 17:15 UTC

On Friday, March 25, 2022 at 11:51:19 AM UTC-5, BGB wrote:
> On 3/25/2022 8:19 AM, EricP wrote:
> > BGB wrote:
> >> On 3/21/2022 3:19 AM, BGB wrote:
> >>
> >>>
> >>> I had also been experimenting with alternatives to page tables, and
> >>> ended up going with AVL trees as the main alternative.
> >>>
> >>
> >> Add B-Trees to this list.
> >> I realized that B-Trees have a few capabilities which the AVL trees lack.
> >>
> Also with some design tweaks and optimizations, the B-Trees now seem to
> be ahead.
>
> I am likely now back to putting emphasis on B-Trees rather than AVL Trees.
> >>
> >>> * For small virtual-address spaces, the overhead of the AVL walk is
> >>> relatively modest compared with the cost of saving/restoring all of
> >>> the CPU registers and similar for the ISR. For larger address spaces,
> >>> they were faster than using a chain-hash (lookup cost follows a log2
> >>> curve relative to memory footprint, whereas the chain-hash follows a
> >>> linear cost curve).
> >>>
> >>> * Along with chain-hashing, they are the most space-efficient option
> >>> at present. Memory overhead is significantly lower than traditional
> >>> page tables, and slightly smaller than B-Trees in this case (due to
> >>> the comparably large "keys" in this case).
> >>>
> >>>
> >>> So, this leaves as current options:
> >>> 4K pages, 4-level page table (48b VA);
> >>> 16K pages, 3-level page table (47b VA, currently used);
> >>> 64K pages, 2-level page table (42b VA);
> >>> 4K/16K/64K, AVL, 48b VA (variable height);
> >>> 4K/16K/64K, AVL, 96b VA (variable height).
> >
> > I don't see how it is possible for AVL to compete with a radix tree.
> > You must be doing something different than what I thought you were saying.
> >
> > Because I don't see how an AVL tree with 1 node per 4kB page could
> > compete with the fixed 4 memory accesses to translate a radix tree.
> > A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
> > are 4 times the size of a PTE with key, left&right ptr plus data word.
> >
> I am mostly using 16K pages.
>
>
> The AVL Tree currently uses 32B per node.
> I had to bit-pack some stuff to fit it into 32B.
>
> The B-Tree currently uses 20B per entry (for 96-bit addressing).
> Nodes being 16K and holding 816 entries.
>
> The Page Tables (Radix Tree) used 8B per entry.
> They would win if they were more than (on average):
> 25% full (vs AVL)
> 40% full (vs B-Tree)
>
>
> Throw ASLR in the mix, and most of the page table (Radix Tree) pages
> will end up mostly empty (far below the break-even point).
> Most of the pages ending up mostly full of zeroes.
<
Most MMU entries in a hierarchical page table ARE zeros already.
>
> The ASLR strategy I am using is mostly using a random number generator
> to generate the addresses for memory allocations. Where one generates a
> random number and then verify it doesn't hit anything (if it does, try
> again).
<
Why does ASLR not take position independent code and end up
requiring a Linker to load the a.out ? {like in the old days before
PIC}
>
>
> There is also the possibility of hybrid tables (last level using 8B
> entries), which should do well if the last level is sufficiently full.
<
Skipping of levels means that for most programs, there is only 1 page
needed to perform mapping.
>
> However, this still requires either allocating objects adjacent to each
> other (going against ASLR), or requires average allocation sizes to be
> considerably larger (8MB or 16MB rather than 1MB).
>
> However, hybrid tables are a bit faster.
<
Faster than a 1-level table ?!?
>
> ...
> > Are you doing AVL tree variable length *segments* rather than pages?
> >
> No, it is still pages.
> It was more about memory-use efficiency than speed in this case.
>
>
> Compared with page-tables, the AVL Trees and B-Trees are very slow.
>
> However, they also end up using significantly less memory in many
> scenarios (namely with ASLR).
>
>
> However, despite the slowness, it doesn't make much difference relative
> to the average cycle count of the ISR, which is bad enough (due to other
> ISR related overheads), as to mostly mask the difference.
>
>
>
> The problem area isn't really with 3 and 4 level tables (which are used
> for 48-bit addressing), but rather with 8 and 10 level tables used in
> 96-bit addressing.
<
Compare the percentage of applications that fit in 48-bits versus the
number which do not...........
>
> When ASLR was used (spreading memory chunks all over the place, using
> the same chunking size I typically used for malloc, namely 1MB), the
> space overhead was *absurd* (albeit, not anywhere near as bad with a
> 3-level table).
>
> With an 8 level table, doing 256 randomized 1MB allocations eats ~ 29MB.
> With a B-Tree, it is 480K, with an AVL it is 528K.
>
>
> ...

Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<t1l1e6$b6q$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24477&group=comp.arch#24477

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
Date: Fri, 25 Mar 2022 13:25:07 -0500
Organization: A noiseless patient Spider
Lines: 229
Message-ID: <t1l1e6$b6q$1@dont-email.me>
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
<Pkj%J.214706$r6p7.200598@fx41.iad> <t1kru4$fh7$1@dont-email.me>
<b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 25 Mar 2022 18:25:10 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f2432f73bfac495c94b9e46d8a5e15e8";
logging-data="11482"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Mcp0jVuimD01brtwkpQOn"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:Jc9zQEd5X0cJLxJg97f2568hvXQ=
In-Reply-To: <b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com>
Content-Language: en-US
 by: BGB - Fri, 25 Mar 2022 18:25 UTC

On 3/25/2022 12:15 PM, MitchAlsup wrote:
> On Friday, March 25, 2022 at 11:51:19 AM UTC-5, BGB wrote:
>> On 3/25/2022 8:19 AM, EricP wrote:
>>> BGB wrote:
>>>> On 3/21/2022 3:19 AM, BGB wrote:
>>>>
>>>>>
>>>>> I had also been experimenting with alternatives to page tables, and
>>>>> ended up going with AVL trees as the main alternative.
>>>>>
>>>>
>>>> Add B-Trees to this list.
>>>> I realized that B-Trees have a few capabilities which the AVL trees lack.
>>>>
>> Also with some design tweaks and optimizations, the B-Trees now seem to
>> be ahead.
>>
>> I am likely now back to putting emphasis on B-Trees rather than AVL Trees.
>>>>
>>>>> * For small virtual-address spaces, the overhead of the AVL walk is
>>>>> relatively modest compared with the cost of saving/restoring all of
>>>>> the CPU registers and similar for the ISR. For larger address spaces,
>>>>> they were faster than using a chain-hash (lookup cost follows a log2
>>>>> curve relative to memory footprint, whereas the chain-hash follows a
>>>>> linear cost curve).
>>>>>
>>>>> * Along with chain-hashing, they are the most space-efficient option
>>>>> at present. Memory overhead is significantly lower than traditional
>>>>> page tables, and slightly smaller than B-Trees in this case (due to
>>>>> the comparably large "keys" in this case).
>>>>>
>>>>>
>>>>> So, this leaves as current options:
>>>>> 4K pages, 4-level page table (48b VA);
>>>>> 16K pages, 3-level page table (47b VA, currently used);
>>>>> 64K pages, 2-level page table (42b VA);
>>>>> 4K/16K/64K, AVL, 48b VA (variable height);
>>>>> 4K/16K/64K, AVL, 96b VA (variable height).
>>>
>>> I don't see how it is possible for AVL to compete with a radix tree.
>>> You must be doing something different than what I thought you were saying.
>>>
>>> Because I don't see how an AVL tree with 1 node per 4kB page could
>>> compete with the fixed 4 memory accesses to translate a radix tree.
>>> A 4GB memory space = 1M pages = 20 level AVL tree, plus the AVL nodes
>>> are 4 times the size of a PTE with key, left&right ptr plus data word.
>>>
>> I am mostly using 16K pages.
>>
>>
>> The AVL Tree currently uses 32B per node.
>> I had to bit-pack some stuff to fit it into 32B.
>>
>> The B-Tree currently uses 20B per entry (for 96-bit addressing).
>> Nodes being 16K and holding 816 entries.
>>
>> The Page Tables (Radix Tree) used 8B per entry.
>> They would win if they were more than (on average):
>> 25% full (vs AVL)
>> 40% full (vs B-Tree)
>>
>>
>> Throw ASLR in the mix, and most of the page table (Radix Tree) pages
>> will end up mostly empty (far below the break-even point).
>> Most of the pages ending up mostly full of zeroes.
> <
> Most MMU entries in a hierarchical page table ARE zeros already.

Yes, however ASLR doesn't exactly help.

If the table doesn't need to encode the vast expanses of zero entries,
it works out smaller than a table which does need to store all of these
zeroes, even when the per entry cost is larger...

As can be noted, the earlier form of the B-Trees (with 510x 32B entries)
were losing to AVL trees (with 32B nodes).

Redesigning the B-Trees to use 816 entries (at the cost of them no
longer being a power of 2 size), seems to have shifted the advantage
mostly back in favor of B-Trees.

Well, at least for fetching stuff; filling the B-Tree is slower than
filling the AVL.

I suspect this is because, whenever one adds an entry into the B-Tree
node, it is necessary to shift everything in the node after this point
forward by 1 position (the "shift everything over" loops kinda "eat it"
during insertion tasks).
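
That is, each insert is basically a memmove over the tail of each
parallel array; roughly (reusing the btnode_t layout guessed at
earlier, and assuming the caller has already split full nodes):

  #include <stdint.h>
  #include <string.h>

  /* Insert (khi, klo, v) at index 'at': everything from 'at' onward
     shifts up one slot, per array; with near-full 816-entry nodes,
     these shifts are the expensive part of an insert. */
  static void bt_insert_at(btnode_t *nd, int at,
                           uint64_t khi, uint32_t klo, uint64_t v)
  {
      int n = nd->n_used;
      memmove(&nd->key_hi[at+1], &nd->key_hi[at],
              (n - at) * sizeof(nd->key_hi[0]));
      memmove(&nd->key_lo[at+1], &nd->key_lo[at],
              (n - at) * sizeof(nd->key_lo[0]));
      memmove(&nd->val[at+1], &nd->val[at],
              (n - at) * sizeof(nd->val[0]));
      nd->key_hi[at] = khi;
      nd->key_lo[at] = klo;
      nd->val[at]    = v;
      nd->n_used     = n + 1;
  }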

This cost would be reduced with smaller nodes, but there is a certain
advantage to keeping the B-Tree node size the same as the page size.

Then briefly imagines "What if I put an AVL tree inside each B-Tree
node?" and, no, not gonna go this route...

>>
>> The ASLR strategy I am using is mostly using a random number generator
>> to generate the addresses for memory allocations: one generates a
>> random number and then verifies it doesn't hit anything (if it does,
>> try again).
> <
> Why does ASLR not take position independent code and end up
> requiring a Linker to load the a.out ? {like in the old days before
> PIC}

?...

I am using a PE/COFF variant.

So, in theory, the loader would pick a random address and then
base-relocate the image to that address (in my ABI, all the images are
required to be relocatable).
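
As a minimal sketch of that pick-and-retry loop (Rng_Next64 and
Va_RangeIsFree are hypothetical helper names, not the actual TestKern
API):

#include <stdint.h>

extern uint64_t Rng_Next64(void);                      /* random number source */
extern int Va_RangeIsFree(uint64_t base, uint64_t sz); /* candidate hits nothing? */

/* Pick a random, aligned base in [lo, hi) for an allocation of 'sz'
 * bytes, retrying until the candidate range doesn't hit anything. */
uint64_t Aslr_PickBase(uint64_t lo, uint64_t hi, uint64_t sz, uint64_t align)
{
    uint64_t base;
    do {
        base = lo + (Rng_Next64() % (hi - lo - sz));
        base &= ~(align - 1);          /* align down; may fall below lo */
    } while ((base < lo) || !Va_RangeIsFree(base, sz));
    return base;
}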

I am not yet putting program images in pagefile-backed memory, but this
concept still works with direct remapping.

ASLR would apply both to loading programs, and also to things like
mmap/VirtualAlloc, and (by extension) things like malloc and similar.

Where, internally, mmap is built on top of mechanisms with names like
VaVirtualAlloc and VaCreateFileMapping and similar...

Cough, yes, these are "inspired" by the calls in the Win32 API. TestKern
internals sorta flip-flop between taking design inspiration from Linux
and Windows (excluding cases where I went and did my own thing).

Note, malloc uses a strategy:
Is allocation over 64K?
If yes, handle it with the page allocator.
Is allocation over 4K?
If yes, allocate it with the linked-list of blocks allocator.
Else, use cell-and-bitmap allocator.
This allocates blocks 64K at a time.

In the linked-list case, memory is allocated in 1MB chunks (via the page
allocator).
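
As a sketch, that dispatch amounts to something like the following (the
three allocator entry points are hypothetical names for the mechanisms
described above):

#include <stddef.h>

extern void *PageAlloc(size_t sz);    /* page allocator */
extern void *BlkListAlloc(size_t sz); /* linked-list of blocks, fed 1MB chunks */
extern void *CellBmpAlloc(size_t sz); /* cell-and-bitmap, grabs 64K at a time */

void *tk_malloc(size_t sz)
{
    if (sz > 65536)                   /* over 64K: straight to the page allocator */
        return PageAlloc(sz);
    if (sz > 4096)                    /* over 4K: linked-list-of-blocks allocator */
        return BlkListAlloc(sz);
    return CellBmpAlloc(sz);          /* small objects: cell-and-bitmap */
}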

>>
>>
>> There is also the possibility of hybrid tables (last level using 8B
>> entries), which should do well if the last level is sufficiently full.
> <
> Skipping of levels, means that for most programs, there is only 1 page
> needed to perform mapping.

But, most schemes here are very limited in what they can skip (eg: all
0s or 1s). This doesn't work with RNG output.

The possible workaround (namely, small tables which only contain a few
entries, in place of full-sized pages) is partly what originally led down
the AVL rabbit hole.

>>
>> However, this still requires either allocating objects adjacent to each
>> other (going against ASLR), or requires average allocation sizes to be
>> considerably larger (8MB or 16MB rather than 1MB).
>>
>> However, hybrid tables are a bit faster.
> <
> Faster than a 1-level table ?!?

No, erm, faster than a pure AVL or B-Tree...

The hybrid tables can roughly double the speed relative to a pure AVL or
B-Tree.

They do partially inherit the drawback (from traditional page tables)
that they don't deal well with sparse allocations, albeit nowhere near
as bad as with a more deeply nested page table.
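
To make the hybrid idea concrete, a rough sketch of the lookup, assuming
the sorted structure (B-Tree or AVL) keys only the upper VA bits and
each leaf is a flat array of 8B entries indexed like a radix-tree leaf
(sizes and names here are illustrative guesses, not the actual layout):

#include <stdint.h>

#define PG_BITS 14   /* 16K pages */
#define LL_BITS 11   /* 2048 entries * 8B = one 16K leaf, covering 32MB of VA */

typedef struct {
    uint64_t pte[1 << LL_BITS];  /* flat 8B entries, as in a page-table leaf */
} HybridLeaf;

/* Assumed: the tree maps the upper VA bits to a leaf table (or NULL). */
extern HybridLeaf *BTree_LookupLeaf(uint64_t va_hi);

uint64_t Hybrid_Translate(uint64_t va)
{
    HybridLeaf *leaf = BTree_LookupLeaf(va >> (LL_BITS + PG_BITS));
    if (!leaf)
        return 0;                /* not mapped */
    return leaf->pte[(va >> PG_BITS) & ((1 << LL_BITS) - 1)];
}

One tree walk then stands in for several radix levels, which is roughly
where the ~2x speedup would come from; the flat leaf is also what
partially reintroduces the sparse-allocation cost.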

>>
>> ...
>>> Are you doing AVL tree variable length *segments* rather than pages?
>>>
>> No, it is still pages.
>> It was more about memory-use efficiency than speed in this case.
>>
>>
>> Compared with page-tables, the AVL Trees and B-Trees are very slow.
>>
>> However, they also end up using significantly less memory in many
>> scenarios (namely with ASLR).
>>
>>
However, despite the slowness, it doesn't make much difference relative
to the average cycle count of the ISR, which is high enough (due to
other ISR-related overheads) to mostly mask the difference.
>>
>>
>>
>> The problem area isn't really with 3 and 4 level tables (which are used
>> for 48-bit addressing), but rather with 8 and 10 level tables used in
>> 96-bit addressing.
> <
> Compare the percentage of applications that fit in 48-bits versus the
> number which do not...........

This isn't really about how much stuff will fit at this point...

The 96-bit space is more about making it nearly impossible to guess the
addresses of things.

>>
>> When ASLR was used (spreading memory chunks all over the place, using
>> the same chunking size I typically used for malloc, namely 1MB), the
>> space overhead was *absurd* (albeit, not anywhere near as bad with a
>> 3-level table).
>>
>> With an 8 level table, doing 256 randomized 1MB allocations eats ~ 29MB.
>> With a B-Tree, it is 480K, with an AVL it is 528K.
>>
>>
>> ...


Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<a4495ef0-2682-4080-9431-f66b975512a7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24478&group=comp.arch#24478

Newsgroups: comp.arch
X-Received: by 2002:ac8:7f46:0:b0:2e1:ec7c:c461 with SMTP id g6-20020ac87f46000000b002e1ec7cc461mr10906371qtk.463.1648238475648;
Fri, 25 Mar 2022 13:01:15 -0700 (PDT)
X-Received: by 2002:a05:6808:152b:b0:2ec:f48f:8120 with SMTP id
u43-20020a056808152b00b002ecf48f8120mr6502726oiw.58.1648238475432; Fri, 25
Mar 2022 13:01:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 13:01:15 -0700 (PDT)
In-Reply-To: <t1l1e6$b6q$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
<Pkj%J.214706$r6p7.200598@fx41.iad> <t1kru4$fh7$1@dont-email.me>
<b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com> <t1l1e6$b6q$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a4495ef0-2682-4080-9431-f66b975512a7n@googlegroups.com>
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Mar 2022 20:01:15 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 229
 by: MitchAlsup - Fri, 25 Mar 2022 20:01 UTC

On Friday, March 25, 2022 at 1:25:13 PM UTC-5, BGB wrote:
[snip]
> I am using a PE/COFF variant.
>
> So, in theory, the loader would pick a random address and then
> base-relocate the image to that address (in my ABI, all the images are
> required to be relocatable).
<
This prevents using original a.out location as backing store for unwritten
memory pages.
[snip]
> > <
> > Skipping of levels, means that for most programs, there is only 1 page
> > needed to perform mapping.
> But, most schemes here are very limited in what they can skip (eg: all
> 0s or 1s). This doesn't work with RNG output.
<
Good point. But if the mapping entries are mostly zeros, switching them
around is child's play.
[snip]
> > Compare the percentage of applications that fit in 48-bits versus the
> > number which do not...........
> This isn't really about how much stuff will fit at this point...
>
> The 96-bit space is more about making it nearly impossible to guess the
> addresses of things.
<
It seems to me that making "guessing an address" worthless is the solution.
<
So, let's say we have an application that is guessing addresses (and has a SIG
handler set up to <basically> catch and ignore guessed-address faults): now if
this <dangerous> application cannot see a change in timing between guessing
addresses actually mapped to someone and not mapped to someone, then
being able to guess does him no good whatsoever.
<
You get this property with the "no flying blind" mantra of execution window design.
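
For illustration, the sort of probe loop being described might look like
the sketch below (a generic userland sketch, not tied to any particular
system); the defense is precisely that the two cases come out
indistinguishable in time:

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <time.h>

static sigjmp_buf probe_env;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);    /* catch and ignore the guessed-address fault */
}

/* Time a single touch of 'addr'; the attacker hopes that addresses
 * mapped to someone time differently than addresses mapped to no one. */
static uint64_t probe_ns(volatile char *addr)
{
    struct timespec t0, t1;
    signal(SIGSEGV, on_segv);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (sigsetjmp(probe_env, 1) == 0)
        (void)*addr;             /* may fault */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (uint64_t)((t1.tv_sec - t0.tv_sec) * 1000000000LL
                    + (t1.tv_nsec - t0.tv_nsec));
}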
<
> >>
> >> When ASLR was used (spreading memory chunks all over the place, using
> >> the same chunking size I typically used for malloc, namely 1MB), the
> >> space overhead was *absurd* (albeit, not anywhere near as bad with a
> >> 3-level table).
> >>
> >> With an 8 level table, doing 256 randomized 1MB allocations eats ~ 29MB.
> >> With a B-Tree, it is 480K, with an AVL it is 528K.
> >>
> >>
> >> ...


Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual Memory, Swapfile)

<b5e1b9d1-72e9-4ff7-939c-ad15f2a2b7b0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24482&group=comp.arch#24482

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:150e:b0:67d:3243:12dd with SMTP id i14-20020a05620a150e00b0067d324312ddmr8915249qkk.229.1648255003786;
Fri, 25 Mar 2022 17:36:43 -0700 (PDT)
X-Received: by 2002:a05:6830:22ea:b0:5b2:35c1:de3c with SMTP id
t10-20020a05683022ea00b005b235c1de3cmr5654142otc.282.1648255003482; Fri, 25
Mar 2022 17:36:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 17:36:43 -0700 (PDT)
In-Reply-To: <a4495ef0-2682-4080-9431-f66b975512a7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <t19cei$aj7$1@dont-email.me> <t1ie0o$l1m$1@dont-email.me>
<Pkj%J.214706$r6p7.200598@fx41.iad> <t1kru4$fh7$1@dont-email.me>
<b277ca66-0703-4929-a6ab-d67e6073c9d3n@googlegroups.com> <t1l1e6$b6q$1@dont-email.me>
<a4495ef0-2682-4080-9431-f66b975512a7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b5e1b9d1-72e9-4ff7-939c-ad15f2a2b7b0n@googlegroups.com>
Subject: Re: Page-table Alternatives (B-Tree) (Re: Misc: Page Tables, Virtual
Memory, Swapfile)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 26 Mar 2022 00:36:43 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 292
 by: MitchAlsup - Sat, 26 Mar 2022 00:36 UTC

On Friday, March 25, 2022 at 3:01:17 PM UTC-5, MitchAlsup wrote:
[snip]
> It seems to me that making "guessing an address" worthless is the solution.
> <
> So, let's say we have an application that is guessing addresses (and has a SIG
> handler set up to <basically> catch and ignore guessed-address faults): now if
> this <dangerous> application cannot see a change in timing between guessing
> addresses actually mapped to someone and not mapped to someone, then
> being able to guess does him no good whatsoever.
> <
> You get this property with the "no flying blind" mantra of execution window design.
<
What if you could detect the access pattern of Spectre-1, and take the attacking
application out of the run state and place it at the back of the priority queue,
below its current level?
<
So, every time the attack is performed, 20,000-odd clocks are added to the
wall-clock time of the attacking process. And if there is anything else to run, a
lot more than that.
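
For reference, the access pattern in question is the classic Spectre-v1
bounds-check-bypass gadget, roughly (the standard textbook form, not
something from this thread):

#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 4096];

/* The branch is trained to be taken, then presented with an
 * out-of-bounds x; array1[x] is read speculatively, and the dependent
 * array2 access leaves a cache footprint indexed by the secret byte. */
void victim(size_t x)
{
    if (x < 16)                  /* bounds check, speculated past */
    {
        uint8_t tmp = array2[array1[x] * 4096];
        (void)tmp;
    }
}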

