Rocksolid Light




Subject  Author
* Packing Hash Tables  robf...@gmail.com
+* Re: Packing Hash Tables  BGB
|`- Re: Packing Hash Tables  robf...@gmail.com
`* Re: Packing Hash Tables  EricP
 `* Re: Packing Hash Tables  BGB
  `* Re: Packing Hash Tables  robf...@gmail.com
   `* Re: Packing Hash Tables  BGB
    `* Re: Packing Hash Tables  robf...@gmail.com
     `* Re: Packing Hash Tables  BGB
      `* Re: Packing Hash Tables  robf...@gmail.com
       `* Re: Packing Hash Tables  MitchAlsup
        +* Re: Packing Hash Tables  robf...@gmail.com
        |`* Re: Packing Hash Tables  MitchAlsup
        | `* Re: Packing Hash Tables  EricP
        |  +- Re: Packing Hash Tables  BGB
        |  +* Re: Packing Hash Tables  MitchAlsup
        |  |`- Re: Packing Hash Tables  EricP
        |  `* Re: Packing Hash Tables  MitchAlsup
        |   +* Re: Packing Hash Tables  robf...@gmail.com
        |   |`* Re: Packing Hash Tables  MitchAlsup
        |   | `* Re: Packing Hash Tables  robf...@gmail.com
        |   |  `* Re: Packing Hash Tables  robf...@gmail.com
        |   |   +- Re: Packing Hash Tables  EricP
        |   |   `* Re: Packing Hash Tables  MitchAlsup
        |   |    `* Re: Packing Hash Tables  EricP
        |   |     `* Re: Packing Hash Tables  MitchAlsup
        |   |      +- Re: Packing Hash Tables  robf...@gmail.com
        |   |      +- Re: Packing Hash Tables  MitchAlsup
        |   |      +* Re: Packing Hash Tables  BGB
        |   |      |+* Re: Packing Hash Tables  MitchAlsup
        |   |      ||`* Re: Packing Hash Tables  BGB
        |   |      || +* Re: Packing Hash Tables  robf...@gmail.com
        |   |      || |+* Re: Packing Hash Tables  Thomas Koenig
        |   |      || ||+* Re: Packing Hash Tables  robf...@gmail.com
        |   |      || |||`- Re: Packing Hash Tables  Thomas Koenig
        |   |      || ||`- Re: Packing Hash Tables  MitchAlsup
        |   |      || |+- Re: Packing Hash Tables  MitchAlsup
        |   |      || |`- Re: Packing Hash Tables  MitchAlsup
        |   |      || +* Re: Packing Hash Tables  robf...@gmail.com
        |   |      || |`* Re: Packing Hash Tables  BGB
        |   |      || | `* Re: Packing Hash Tables  aph
        |   |      || |  `* Re: Packing Hash Tables  BGB
        |   |      || |   `* Re: Packing Hash Tables  MitchAlsup
        |   |      || |    +* Re: Packing Hash Tables  robf...@gmail.com
        |   |      || |    |`- Re: Packing Hash Tables  EricP
        |   |      || |    `* Re: Packing Hash Tables  BGB
        |   |      || |     `* Re: Packing Hash Tables  robf...@gmail.com
        |   |      || |      `* Re: Packing Hash Tables  BGB
        |   |      || |       `- Re: Packing Hash Tables  Ivan Godard
        |   |      || `- Re: Packing Hash Tables  robf...@gmail.com
        |   |      |+- Re: Packing Hash Tables  robf...@gmail.com
        |   |      |`* Re: Packing Hash Tables  EricP
        |   |      | +- Re: Packing Hash Tables  robf...@gmail.com
        |   |      | `* Re: Packing Hash Tables  MitchAlsup
        |   |      |  +* Re: Packing Hash Tables  EricP
        |   |      |  |`* Re: Packing Hash Tables  MitchAlsup
        |   |      |  | `* Re: Packing Hash Tables  EricP
        |   |      |  |  `* Re: Packing Hash Tables  MitchAlsup
        |   |      |  |   `* Re: Packing Hash Tables  EricP
        |   |      |  |    +* Re: Packing Hash Tables  robf...@gmail.com
        |   |      |  |    |`- Re: Packing Hash Tables  MitchAlsup
        |   |      |  |    `* Re: Packing Hash Tables  robf...@gmail.com
        |   |      |  |     `* Re: Packing Hash Tables  BGB
        |   |      |  |      `* Re: Packing Hash Tables  robf...@gmail.com
        |   |      |  |       +- Re: Packing Hash Tables  BGB
        |   |      |  |       +- Re: Packing Hash Tables  MitchAlsup
        |   |      |  |       `- Re: Packing Hash Tables  robf...@gmail.com
        |   |      |  `* Re: Packing Hash Tables  Ivan Godard
        |   |      |   `* Re: Packing Hash Tables  MitchAlsup
        |   |      |    `* Re: Packing Hash Tables  Thomas Koenig
        |   |      |     `* Re: Packing Hash Tables  MitchAlsup
        |   |      |      `* Re: Packing Hash Tables  Thomas Koenig
        |   |      |       `* Re: Packing Hash Tables  Stephen Fuld
        |   |      |        `- Re: Packing Hash Tables  EricP
        |   |      `* Re: Packing Hash Tables  EricP
        |   |       `- Re: Packing Hash Tables  MitchAlsup
        |   `- Re: Packing Hash Tables  EricP
        `- Re: Packing Hash Tables  BGB

Re: Packing Hash Tables

<t1jtiu$l70$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=24462&group=comp.arch#24462

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Packing Hash Tables
Date: Fri, 25 Mar 2022 03:12:57 -0500
Organization: A noiseless patient Spider
Lines: 436
Message-ID: <t1jtiu$l70$1@dont-email.me>
References: <43b7bfee-85be-4bab-842d-ef0d966f6dcdn@googlegroups.com>
<d3d52db1-7513-465e-b120-86e0cab2cf44n@googlegroups.com>
<de548517-85c5-4f38-b012-bf5699d44daan@googlegroups.com>
<db8ab033-b373-46dc-9381-2b5e2c82a3a0n@googlegroups.com>
<IY3YJ.108064$3jp8.15219@fx33.iad>
<f4b408a9-a145-4a7f-ae80-a3f3fd46babdn@googlegroups.com>
<t0r6hu$p5m$1@dont-email.me> <3TbYJ.121927$GjY3.90377@fx01.iad>
<ddba6add-19ab-43b6-b98f-58110a886f8cn@googlegroups.com>
<6d38f64c-ff74-45f9-9ee5-3d3619518d15n@googlegroups.com>
<4VoYJ.108094$3jp8.77411@fx33.iad>
<63fe646c-495b-4a18-934f-3d3b51dce9e9n@googlegroups.com>
<D72ZJ.163258$Tr18.67505@fx42.iad>
<3be6ed80-9ce6-4c38-947b-2a5e77b15060n@googlegroups.com>
<%UqZJ.22841$4T.21873@fx24.iad>
<c99570ff-5f9e-435c-a98b-62515bf352ffn@googlegroups.com>
<0bf63080-454f-4405-9f97-c452c82c0e52n@googlegroups.com>
<06618615-124e-4cf1-a0e9-f9a1f5c66539n@googlegroups.com>
<t1dcq3$qst$1@dont-email.me>
<023ca389-bf9d-47a4-a1df-a24e93419a45n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 25 Mar 2022 08:13:19 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f2432f73bfac495c94b9e46d8a5e15e8";
logging-data="21728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19bUBsBKtzcCQWVafXBgLp2"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:kun4l3iliUOjjkT000yVTE1FNIE=
In-Reply-To: <023ca389-bf9d-47a4-a1df-a24e93419a45n@googlegroups.com>
Content-Language: en-US
 by: BGB - Fri, 25 Mar 2022 08:12 UTC

On 3/25/2022 12:40 AM, robf...@gmail.com wrote:
> On Tuesday, March 22, 2022 at 4:50:14 PM UTC-4, BGB wrote:
>> On 3/22/2022 10:59 AM, robf...@gmail.com wrote:
>>> On Tuesday, March 22, 2022 at 11:07:15 AM UTC-4, MitchAlsup wrote:
>>>> On Tuesday, March 22, 2022 at 1:28:05 AM UTC-5, robf...@gmail.com wrote:
>>>>> On Saturday, March 19, 2022 at 4:19:12 PM UTC-4, EricP wrote:
>>>>>> MitchAlsup wrote:
>>>>>>> On Friday, March 18, 2022 at 11:08:08 AM UTC-5, EricP wrote:
>>>>>>>> MitchAlsup wrote:
>>>>>>>>> <
>>>>>>>>> Page Fault gets attached to thread number (which accesses the faulting IP)
>>>>>>>>> ....................gets the faulting instruction-specifier in R1
>>>>>>>> I assume this is the instruction pointer.
>>>>>>> <
>>>>>>> It is the first 32-bits of the instruction itself. This contains everything but the
>>>>>>> constants.
>>>>>> If it faults during instruction fetch then this instruction specifier
>>>>>> buffer could be empty or partially filled.
>>>>>>
>>>>>> It absolutely needs the address of the start of the faulting instruction.
>>>>>> Structured exception handlers. Debugger watch points.
>>>>>> These can't function without it.
>>>>>>
>>>>>> You can provide this partial instruction bits in addition to the standard
>>>>>> exception details. Though as I point out below I don't think anyone
>>>>>> will need this if the page fault syndrome bits are provided.
>>>>>>> <
>>>>>>>>> ....................gets the faulting virtual address in R2
>>>>>>>>> ....................gets the address of fault generating PTE/PTP in R3
>>>>>>>> Passing the physical address of the faulting PTE to the page fault
>>>>>>>> handler doesn't harm anything but it doesn't really help.
>>>>>>>> Though I've not seen a page fault exception frame that does this.
>>>>>>>>
>>>>>>>> This misses the critical page fault reason info. The page fault handler
>>>>>>>> needs to know what the thread was attempting to do when it triggered
>>>>>>>> the fault. Looking at x86 that is a set of bit flags indicating:
>>>>>>>>
>>>>>>>> P = Caused by Not-Present PTE, or other protection codes
>>>>>>>> R/W = Caused by Read, or caused by Write
>>>>>>>> U/S = Caused by User/Super mode
>>>>>>>> RSVD = Reserved bit non-zero
>>>>>>>> I/D = access was instruction fetch, or data access
>>>>>>>> PK = Protection Key violation
>>>>>>>> SGX = SGX Software Guard Extensions violation
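For reference, the flag list EricP gives above matches the error code that x86 pushes for a page fault (vector 14). A minimal decoding sketch in C, with bit positions per the Intel SDM:

```c
#include <stdint.h>
#include <stdbool.h>

/* x86 page-fault error code bits (positions per the Intel SDM). */
#define PF_P    (1u << 0)   /* 0 = not-present PTE, 1 = protection violation */
#define PF_WR   (1u << 1)   /* 0 = read access, 1 = write access */
#define PF_US   (1u << 2)   /* 0 = supervisor mode, 1 = user mode */
#define PF_RSVD (1u << 3)   /* reserved bit set in a paging structure */
#define PF_ID   (1u << 4)   /* 1 = instruction fetch */
#define PF_PK   (1u << 5)   /* protection-key violation */
#define PF_SGX  (1u << 15)  /* SGX access-control violation */

static bool pf_is_write(uint32_t err)       { return (err & PF_WR) != 0; }
static bool pf_is_user(uint32_t err)        { return (err & PF_US) != 0; }
static bool pf_is_ifetch(uint32_t err)      { return (err & PF_ID) != 0; }
static bool pf_is_not_present(uint32_t err) { return (err & PF_P) == 0; }
```

A handler branches on these predicates directly, with no need to parse the faulting instruction.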
>>>>>>> <
>>>>>>> The given address points at the PT[PE] where the fault was detected.
>>>>>>> So, for example, if you were trying to Read, the first PT[PE] that does not
>>>>>>> have read permission is where you end up pointing {same for write and execute}.
>>>>>>> <
>>>>>>> The instruction tells you what the instruction stream was doing.
>>>>>>> Inst<31..26> = 001001 & Inst<10..5> = 0000xx -> LD
>>>>>>> Inst<31..26> = 001001 & Inst<10..5> = 00x1xx -> ST
>>>>>>> Inst<31..26> = 1000xx -> LD
>>>>>>> Inst<31..26> = 1001xx -> ST
>>>>>>> <
>>>>>>> So, there is a clever trick: because My 66000 detects all mal-formed instructions,
>>>>>>> {mal-formed instructions arrive at the Operation exception handler}
>>>>>>> the page fault handler can look at Instruction<31> and decide to look at
>>>>>>> Inst<28> or Inst<7> where 0->LD and 1->ST.
>>>>>>> So 3 PREDs in a row parse the instruction that caused the page fault.
>>>>>>> <
>>>>>>> But you make a good point:: I have another 32-bits in R1 I could completely
>>>>>>> specify what problem was detected. Let me work on that.
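Mitch's two-test decode trick above can be written out directly. This is a sketch based only on the bit patterns quoted in this thread; treat the encodings as illustrative, not as an authoritative My 66000 reference:

```c
#include <stdint.h>
#include <stdbool.h>

/* Classify a faulting instruction word as load or store, using the
 * trick quoted above:
 *   Inst<31> == 1 : memory-format major op; Inst<28> selects
 *                   (1000xx -> LD, 1001xx -> ST)
 *   Inst<31> == 0 : the 001001 major group; Inst<7> selects
 *                   (Inst<10..5> = 0000xx -> LD, 00x1xx -> ST)
 * Bit fields are as posted in the thread, not from a spec. */
static bool my66000_is_store(uint32_t inst)
{
    if (inst & (1u << 31))
        return ((inst >> 28) & 1u) != 0;  /* 1000xx = LD, 1001xx = ST */
    else
        return ((inst >> 7) & 1u) != 0;   /* bit 7 of Inst<10..5> */
}
```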
>>>>>> The fault syndrome bits should be sufficient.
>>>>>> I don't think anyone wants to have to reverse guesstimate what the
>>>>>> problem was in a page fault handler by parsing instructions.
>>>>>>
>>>>>> An emulator might want to parse the faulting instructions but that
>>>>>> is an application specific use for the general page fault mechanism
>>>>>> and would take place in an application specific user mode handler.
>>>>> For my core I have both the address of the faulting instruction stored in an exception
>>>>> instruction pointer register, and the faulting address stored in a bad address register.
>>>>> The cause of an exception is stored in a cause register. Maybe the cause register
>>>>> does not have enough information, but it can tell if the fault was data or instruction,
>>>>> or read, write, or execute violations.
>>>> <
>>>> There is also the case of a poorly-formed table entry.
>>>>>
>>>>> Looks like it's back to the drawing board for me. Planning on using compressed
>>>>> pages, an idea gleaned from BGB’s posting.
>>>>> I was using an inverted hash table, but had not considered memory paged out to disk.
>>>>> Being able to swap out to disk means making the page table larger, meaning it will no
>>>>> longer fit in block RAM. So, back to using a TLB. I was thinking of increasing the page
>>>>> table size by a factor of four: ¼ real memory, ¾ disk storage. I gather it's not a good idea
>>>>> to have too large of a swap file.
>>>>>
>> I think the size of a swap file is a matter of interpretation.
>>
>> In current tests, sizes are:
>> RAM = 128MB (set to match the Nexys A7 I am using)
>> SWAP = 256MB or 384MB in emulation and simulation tests.
>>
>>
>> On my SDcard, IIRC, the swapfile was set up as 2GB (and I didn't
>> originally think it would take this long before getting this
>> stuff working).
>>
>> I am not assuming any particular limit at the moment, apart from the 4GB
>> size limit implied by FAT32 (at present, this would be 256k pages).
>>
>>
>> Currently, the swap code sets up a 32MB region of RAM to be used as swap
>> pages, with the rest (for now) being left for either physical mapping or
>> direct remapping (where pages are mapped into virtual addresses, but are
>> not backed by swap).
>>
>> I initially thought 32MB would be too limiting, but it appears that Doom
>> and similar can sit pretty easily in this space.
>>
>> Doom doesn't really generate any swapfile traffic, but Quake uses enough
>> memory that it does start accessing swap.
>>
>> One limit for swapfile size would be the relative cost of any management
>> structures.
>>
>> So, for example, with a 4GB swapfile, if spending 8B per page to track
>> page location / compression, this would be a ~2MB array, ...
>>
>> In the current (uncompressed) page-table scheme, page management is
>> mostly limited to a few bitmaps and similar.
>>
>>
>>
>> Though, this stuff still doesn't work particularly reliably as of yet
>> (trying to run programs from swapfile backed memory is still pretty buggy).
>>
>> There is also some hackery, like needing to make sure that everything in
>> the FAT driver and similar is using physically mapped memory.
>>
>> There are ending up being various functions for memory allocation with
>> various properties:
>> tk_malloc: Generic malloc, may use Swap
>> tk_malloc_usr: Usermode accessible (kernel), may use Swap
>> tk_malloc_wxe: Read/Write/Execute, Physically Mapped
>> tk_malloc_krn: Kernel-Mode memory, Physically Mapped
>> ...
>>
>> Internally, these map to a "tk_malloc_cat" function, which adds a
>> "category" argument describing the type of memory one wants (currently
>> as a "magic number" scheme).
>>
>>
>> Generally, anything allocated from "userland" (with the exception of WXE
>> memory) is currently assumed to use Swap.
>>
>> Anything which may be accessed from an ISR needs to be physically mapped
>> (ISRs don't use address translation).
>>
>> ...
>>
>>
>>
>> However, the program loaders still use direct-mapping for the '.text'
>> sections and similar (for now at least, may change this later). It is
>> likely though that the '.data'/'.bss' sections could be put into
>> swap-backed memory (in my ABI, the read-only and read/write sections are
>> handled independently of each other, *).
>>
>> *: This is a little hacky as PE/COFF wasn't really designed to be used
>> this way, but either way.
>>
>>
>>
>> Because both the swap code and the FAT driver may access the SD card, trying to page stuff in or out from
>> within another IO operation is potentially catastrophic (though the
>> swapfile code does mostly bypass the FAT driver; it figures out the LBA
>> range and goes from there).
>>
>> It doesn't currently deal with any fragmentation in the pagefile; it is
>> possible I could try to deal with this, to be decided if doing so is
>> worthwhile.
>>
>> Would require either:
>> Pretranslating and caching the FAT cluster chains;
>> Hacking into the FAT FS driver internals;
>> Feeding swapfile operations through normal filesystem IO interfaces.
>>
>> Sending swapfile stuff through the filesystem driver is undesirable, for
>> various reasons. For the time being, it is easier to require an
>> unfragmented swapfile (which, granted, does require some special care
>> when setting up the SDcard).
>>
>> Would have almost gone the Linux route of requiring a swap partition,
>> except I am using Windows and there is no good way to set this up short
>> of doing it on a Linux machine, so using a file was the easier option.
>>
>>
>>
>> I am at the moment again torn between wanting to post the code to GitHub
>> (since it has been a few weeks), and not wanting to post the code when
>> stuff is still obviously buggy or broken.
>>
>>
>> I am left to wonder if all this stuff is usually so much of a PITA to
>> debug for everyone else, or if I just kinda suck at it...
>>>>> The TLBE is 256 bits in size, and about 90% of the bits are used. The PTE could be
>>>>> smaller, but I am having it mimic the TLBE to keep things simple.
>>>>> For the hash table, hash groups are used to alleviate the need to walk chains. The hash
>>>>> chain walks by a group not by a single entry. Groups contain eight entries that hashed to
>>>>> the same value or are displaced, so there is seldom a need to walk more than two or
>>>>> three steps.
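The grouped hash walk robf describes above (probe eight entries at a time, then follow a link to the next group rather than to the next entry) can be sketched like this; the names, field layout, and trivial hash are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define GROUP_SIZE 8
#define NGROUPS    1024            /* must be a power of two */

typedef struct {
    uint64_t vpn;                  /* virtual page number; 0 = empty slot */
    uint64_t pte;                  /* translation + access-rights word */
} HashEntry;

typedef struct {
    HashEntry e[GROUP_SIZE];       /* eight entries per hash group */
    int32_t   next;                /* overflow/displacement group, -1 = end */
} HashGroup;

static HashGroup table[NGROUPS];

static void table_init(void)
{
    for (int g = 0; g < NGROUPS; g++)
        table[g].next = -1;        /* no overflow chains yet */
}

/* Walk advances a whole group per step, so two or three steps cover
 * up to 24 candidate entries. */
static const HashEntry *lookup(uint64_t vpn)
{
    for (int32_t g = (int32_t)(vpn & (NGROUPS - 1)); g >= 0; g = table[g].next)
        for (int i = 0; i < GROUP_SIZE; i++)
            if (table[g].e[i].vpn == vpn)
                return &table[g].e[i];
    return NULL;
}
```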
>>>
>>> There are only a couple of things I could think of for poorly formed table entries. If the
>>> physical address is not a legal value, or if the page mask begin is greater than the page
>>> mask end. Otherwise, the remaining fields could have just about any value. If the physical
>>> address is not legal a PMA exception will occur when the address is mapped. If the page
>>> mask begin is greater than the end, then there will be no access to the page resulting in
>>> read, write, or execute violations. I suppose the cause code could be expanded to add
>>> syndrome bits. They may be useful for other units too.
>>>
>>> I have the page compression indicators covering 4kB sub-page blocks, while the
>>> page mask begin/end divides the page into 1kB blocks. It may be better to cover the
>>> same range with both fields. 4kB blocks might be used to emulate an older system.
>>> The first half of the PTE is mainly the translation info. The second half is access rights,
>>> key and compression indicators. The second half of the PTE is loaded from a per-page
>>> page management table. It is info that needs to be in the TLBE so the PTE is just used to
>>> cache the information. The second half of the PTE could be stuffed with other information
>>> for software use until the access rights info is made valid.
>>>
>> So, considering something simplistic like "skip zeroed parts of a
>> not-entirely-zero page" or something?...
>>
>>
>> I guess this is possible, vs what I was considering of having one of 3
>> options:
>> Zeroed / Skipped;
>> Stored;
>> LZ Compressed.
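The three-state scheme above can be sketched as a page classifier. The toy byte-RLE below stands in for a real LZ codec purely so the sketch is runnable; names and structure are mine, not BGB's:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE 16384            /* 16K logical pages, as in the thread */

typedef enum { PG_ZEROED, PG_STORED, PG_LZ } PageClass;

static bool page_is_zero(const uint8_t *p)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (p[i] != 0)
            return false;
    return true;
}

/* Toy byte-RLE stand-in for the real LZ step; emits (run, byte) pairs
 * and returns PAGE_SIZE when the page does not shrink. */
static size_t toy_compress(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t o = 0, i = 0;
    while (i < n) {
        uint8_t b = src[i];
        size_t run = 1;
        while (i + run < n && src[i + run] == b && run < 255)
            run++;
        if (o + 2 > n)
            return n;              /* expanding: caller stores it raw */
        dst[o++] = (uint8_t)run;
        dst[o++] = b;
        i += run;
    }
    return o;
}

static PageClass classify_page(const uint8_t *page, uint8_t *out, size_t *outlen)
{
    if (page_is_zero(page)) {      /* state 1: zeroed / skipped */
        *outlen = 0;
        return PG_ZEROED;
    }
    size_t n = toy_compress(page, PAGE_SIZE, out);
    if (n < PAGE_SIZE) {           /* state 3: worth compressing */
        *outlen = n;
        return PG_LZ;
    }
    memcpy(out, page, PAGE_SIZE);  /* state 2: store raw */
    *outlen = PAGE_SIZE;
    return PG_STORED;
}
```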
>>
>> At present (with an uncompressed pagefile), only the Zeroed and Stored
>> states exist.
>>
>>
>> Though, I still have a few concerns with my idea:
>> Making LZ compression sufficiently fast is still an issue;
>> Doing a chunk buffer turns a large number of small stalls into a smaller
>> number of large stalls.
>>
>> It will probably be painfully obvious if the whole system stalls for
>> several seconds at a time to write several MB of compressed pagefile
>> data to the SD card.
>>
>>
>> Though, looking stuff up, it appears I may not need to write the data
>> all at once, rather it may be sufficient to (merely) write to the data
>> in a roughly sequential order.
>>
>> This would imply abandoning the 4MB chunking idea, and instead treating
>> the swap as a circular log. This would, however, require some way to
>> relocate or skip over existing pages if they are "still live".
>>
>> I could probably manage this with a cell bitmap (probably with 512B or
>> 1K cells), effectively treating the pagefile like a large bitmap-managed
>> heap. This could work OK, but its effectiveness (outside of an explicit
>> "compactor" mechanism) would depend highly on fragmentation.
>>
>>
>> Assuming 1K cells with a 2GB pagefile, would need 256K at 1b/cell.
>> Would need ~ 28b to encode the size and location of a compressed page.
>> Assuming there are 128k pages (each 16K), with 4B overhead per page,
>> this part costs 512K.
>>
>> Probably doable...
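The arithmetic above checks out, and suggests one possible packing for the per-page "size + location" record. The constants and bit layout below are a sketch of the estimate, not an actual implementation:

```c
#include <stdint.h>

/* Sizing per the thread: 2GB pagefile, 1K cells, 16K logical pages. */
#define PAGEFILE_BYTES (2ull << 30)
#define CELL_BYTES     1024ull
#define PAGE_BYTES     (16ull << 10)

#define NCELLS   (PAGEFILE_BYTES / CELL_BYTES)  /* 2M cells */
#define BITMAP_B (NCELLS / 8)                   /* 256K of cell bitmap */
#define NPAGES   (PAGEFILE_BYTES / PAGE_BYTES)  /* 128K pages, 4B each = 512K */

/* 21 bits of cell index (2^21 = 2M cells) plus 5 bits of length in cells
 * (0..16 cells = up to one whole 16K page) = 26 bits, inside the ~28b
 * estimate, so the record fits a plain uint32_t. */
static uint32_t pack_desc(uint32_t cell, uint32_t len_cells)
{
    return (cell & 0x1FFFFFu) | (len_cells << 21);
}
static uint32_t desc_cell(uint32_t d) { return d & 0x1FFFFFu; }
static uint32_t desc_len (uint32_t d) { return d >> 21; }
```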
>>
>>
>> Though, it is possible (if enough pages are poorly compressible)
>> that fragmentation would result in the pagefile turning into an unusable
>> state with this design.
>>
>> Assuming a 1:1 parity between pagefile size and logical page count; this
>> will likely only work OK if average compressed page size is below around
>> 8K or so. The alternative being to only fill the pagefile to 50% or 75%
>> capacity, such that the pagefile still works even if a majority of the
>> pages are poorly compressible.
>>
>> But, at that point, one would be better off using an uncompressed pagefile.
>>
>>
>> Some of my mock-up tests aren't giving as good of compression as I had
>> hoped. This idea would work a lot better if pages got an average
>> compression ratio of around 75% or so, but the 45%..65% I am seeing in
>> my mock-up tests is likely to be borderline at best.
>>
>>
>> From what I can find, most OS's seem to be using uncompressed pagefiles.
>>> I am assuming the PPN field can be used to store the page-file swap address when the
>>> page is not resident in memory.
>>>
>> Storing the logical pagefile page-number in non-resident pages is how I
>> was doing it.
>>
>> Address is Virtual:
>> Present==1:
>> Page number maps to a physical page address;
>> Present==0:
>> Page Number==0: PTE is Free (PTE==0) or Reserved (PTE!=0)
>> Page Number!=0: PTE points at a pagefile page
>>
>> ...
>>
>> Pagefile page 0 is currently reserved.
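BGB's non-resident PTE convention above can be captured in a small decoder. The bit positions assumed here (present in bit 0, page number above a 14-bit offset for 16K pages) are my guesses, not his actual layout:

```c
#include <stdint.h>

#define PTE_PRESENT  1ull
#define PTE_PFN(pte) ((pte) >> 14)   /* 16K pages leave the low 14 bits free */

typedef enum { PTE_MAPPED, PTE_SWAPPED, PTE_FREE, PTE_RESERVED } PteKind;

/* Present==1: page number is a physical page.
 * Present==0, page number != 0: page number is a pagefile page
 *   (pagefile page 0 is reserved, so 0 never aliases a swap location).
 * Present==0, page number == 0: free if the whole PTE is zero,
 *   otherwise reserved. */
static PteKind pte_kind(uint64_t pte)
{
    if (pte & PTE_PRESENT)
        return PTE_MAPPED;
    if (PTE_PFN(pte) != 0)
        return PTE_SWAPPED;
    return (pte == 0) ? PTE_FREE : PTE_RESERVED;
}
```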
>
> Ran out of block RAMs for my design. I counted them up, but the toolset must have
> replicated a few of them. It worked out to about 800 block RAMs and there are only
> 730 available. So, now I am scaling things down a bit. Going back to a hierarchically
> oriented table, since with MMU caching it probably offers the fastest access. I am
> also thinking of using software to reduce the hardware footprint. Done some thinking
> about the purpose of what I am doing. I want the thing to run in a low-cost FPGA.
>


Re: Packing Hash Tables

<625516b6-1703-4b83-bd9e-3f524bf84256n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=24469&group=comp.arch#24469

X-Received: by 2002:ac8:7d91:0:b0:2e0:6b65:c76c with SMTP id c17-20020ac87d91000000b002e06b65c76cmr9545012qtd.564.1648221271008;
Fri, 25 Mar 2022 08:14:31 -0700 (PDT)
X-Received: by 2002:a4a:9772:0:b0:320:f0e1:c82d with SMTP id
v47-20020a4a9772000000b00320f0e1c82dmr4014547ooi.54.1648221270698; Fri, 25
Mar 2022 08:14:30 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 08:14:30 -0700 (PDT)
In-Reply-To: <023ca389-bf9d-47a4-a1df-a24e93419a45n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <43b7bfee-85be-4bab-842d-ef0d966f6dcdn@googlegroups.com>
<d28b11ac-d42c-470b-9335-a31014654dc6n@googlegroups.com> <d3d52db1-7513-465e-b120-86e0cab2cf44n@googlegroups.com>
<de548517-85c5-4f38-b012-bf5699d44daan@googlegroups.com> <db8ab033-b373-46dc-9381-2b5e2c82a3a0n@googlegroups.com>
<IY3YJ.108064$3jp8.15219@fx33.iad> <f4b408a9-a145-4a7f-ae80-a3f3fd46babdn@googlegroups.com>
<t0r6hu$p5m$1@dont-email.me> <3TbYJ.121927$GjY3.90377@fx01.iad>
<ddba6add-19ab-43b6-b98f-58110a886f8cn@googlegroups.com> <6d38f64c-ff74-45f9-9ee5-3d3619518d15n@googlegroups.com>
<4VoYJ.108094$3jp8.77411@fx33.iad> <63fe646c-495b-4a18-934f-3d3b51dce9e9n@googlegroups.com>
<D72ZJ.163258$Tr18.67505@fx42.iad> <3be6ed80-9ce6-4c38-947b-2a5e77b15060n@googlegroups.com>
<%UqZJ.22841$4T.21873@fx24.iad> <c99570ff-5f9e-435c-a98b-62515bf352ffn@googlegroups.com>
<0bf63080-454f-4405-9f97-c452c82c0e52n@googlegroups.com> <06618615-124e-4cf1-a0e9-f9a1f5c66539n@googlegroups.com>
<t1dcq3$qst$1@dont-email.me> <023ca389-bf9d-47a4-a1df-a24e93419a45n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <625516b6-1703-4b83-bd9e-3f524bf84256n@googlegroups.com>
Subject: Re: Packing Hash Tables
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Mar 2022 15:14:31 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 441
 by: MitchAlsup - Fri, 25 Mar 2022 15:14 UTC

On Friday, March 25, 2022 at 12:40:05 AM UTC-5, robf...@gmail.com wrote:
> On Tuesday, March 22, 2022 at 4:50:14 PM UTC-4, BGB wrote:
> > On 3/22/2022 10:59 AM, robf...@gmail.com wrote:
> > > On Tuesday, March 22, 2022 at 11:07:15 AM UTC-4, MitchAlsup wrote:
> > >> On Tuesday, March 22, 2022 at 1:28:05 AM UTC-5, robf...@gmail.com wrote:
> > >>> On Saturday, March 19, 2022 at 4:19:12 PM UTC-4, EricP wrote:
> > >>>> MitchAlsup wrote:
> > >>>>> On Friday, March 18, 2022 at 11:08:08 AM UTC-5, EricP wrote:
> > >>>>>> MitchAlsup wrote:
> > >>>>>>> <
> > >>>>>>> Page Fault gets attached to thread number (which accesses the faulting IP)
> > >>>>>>> ....................gets the faulting instruction-specifier in R1
> > >>>>>> I assume this is the instruction pointer.
> > >>>>> <
> > >>>>> It is the first 32-bits of the instruction itself. This contains everything but the
> > >>>>> constants.
> > >>>> If it faults during instruction fetch then this instruction specifier
> > >>>> buffer could be empty or partially filled.
> > >>>>
> > >>>> It absolutely needs the address of the start of the faulting instruction.
> > >>>> Structured exception handlers. Debugger watch points.
> > >>>> These can't function without it.
> > >>>>
> > >>>> You can provide this partial instruction bits in addition to the standard
> > >>>> exception details. Though as I point out below I don't think anyone
> > >>>> will need this if the page fault syndrome bits are provided.
> > >>>>> <
> > >>>>>>> ....................gets the faulting virtual address in R2
> > >>>>>>> ....................gets the address of fault generating PTE/PTP in R3
> > >>>>>> Passing the physical address of the faulting PTE to the page fault
> > >>>>>> handler doesn't harm anything but it doesn't really help.
> > >>>>>> Though I've not seen a page fault exception frame the does this.
> > >>>>>>
> > >>>>>> This misses the critical page fault reason info. The page fault handler
> > >>>>>> needs to know what the thread was attempting to do when it triggered
> > >>>>>> the fault. Looking at x86 that is a set of bit flags indicating:
> > >>>>>>
> > >>>>>> P = Caused by Not-Present PTE, or other protection codes
> > >>>>>> R/W = Caused by Read, or caused by Write
> > >>>>>> U/S = Cause by User/Super mode
> > >>>>>> RSVD = Reserved bit non-zero
> > >>>>>> I/D = access was instruction fetch, or data access
> > >>>>>> PK = Protection Key violation
> > >>>>>> SGX = SGX Software Guard Extensions violation
> > >>>>> <
> > >>>>> The given address points at the PT[PE] where the fault was detected.
> > >>>>> So, for example if you were trying to Read the first PT[PE] that does not have
> > >>>>> read permission is where you end up pointing {same for write and execute}
> > >>>>> <
> > >>>>> The instruction tells you what the instruction stream was doing.
> > >>>>> Inst<31..26> = 001001 & Inst<10..5> = 0000xx -> LD
> > >>>>> Inst<31..26> = 001001 & Inst<10..5> = 00x1xx -> ST
> > >>>>> Inst<31..26> = 1000xx LD
> > >>>>> Inst<31..26> = 1001xx ST
> > >>>>> <
> > >>>>> So, there is a clever trick, Because My 66000 detects all mal-formed instructions,
> > >>>>> {mal-formed instructions arrive at the Operation exception handler}
> > >>>>> the page fault handler can look at Instruction<31> and decide to look at
> > >>>>> Inst<28> or Inst<7> where 0->LD and 1->ST.
> > >>>>> So 3 PREDs in a row parse the instruction that caused the page fault.
> > >>>>> <
> > >>>>> But you make a good point:: I have another 32-bits in R1 I could completely
> > >>>>> specify what problem was detected. Let me work on that.
> > >>>> The fault syndrome bits should be sufficient.
> > >>>> I don't think anyone wants to have to reverse guesstimate what the
> > >>>> problem was in a page fault handler by parsing instructions.
> > >>>>
> > >>>> An emulator might want to parse the faulting instructions but that
> > >>>> is an application specific use for the general page fault mechanism
> > >>>> and would take place in an application specific user mode handler.
> > >>> For my core I have both the address of the faulting instruction stored in an exception
> > >>> instruction pointer register, and the faulting address stored in a bad address register.
> > >>> The cause of an exception is stored in a cause register. Maybe the cause register
> > >>> does not have enough information, but it can tell if the fault was data or instruction,
> > >>> or read, write, or execute violations.
> > >> <
> > >> There is also the case of a poorly-formed table entry.
> > >>>
> > >>> Looks like its back to the drawing board for me. Planning on using compressed
> > >>> pages, an idea gleaned from BGB’s posting.
> > >>> I was using an inverted hash table, but had not considered memory paged out to disk.
> > >>> Being able to swap out to disk means making the page table larger, meaning it will no
> > >>> longer fit in block RAM. So, back to using a TLB. I was thinking of increasing the page
> > >>> table size by a factor four, ¼ real memory ¾ disk storage.. I gather its not a good idea
> > >>> to have too large of a swap file.
> > >>>
> > I think the size of a swap file is a matter of interpretation.
> >
> > In current tests, sizes are:
> > RAM = 128MB (set to match the Nexys A7 I am using)
> > SWAP = 256MB or 384MB in emulation and simulation tests.
> >
> >
> > On my SDcard, IIRC, the swapfile was set up as 2GB (and I didn't
> > originally think it would take this long to before trying to get this
> > stuff working).
> >
> > I am not assuming any particular limit at the moment, apart from the 4GB
> > size limit implied by FAT32 (at present, this would be 256k pages).
> >
> >
> > Currently, the swap code sets up a 32MB region of RAM to be used as swap
> > pages, with the rest (for now) being left for either physical mapping or
> > direct remapping (where pages are mapped into virtual addresses, but are
> > not backed by swap).
> >
> > I initially thought 32MB would be too limiting, but it appears that Doom
> > and similar can sit pretty easily in this space.
> >
> > Doom doesn't really generate any swapfile traffic, but Quake uses enough
> > memory that does start accessing swap.
> >
> > One limit for swapfile size would be the relative cost of any management
> > structures.
> >
> > So, for example, with a 4GB swapfile, if spending 8B per page to track
> > page location / compression, this would be an ~ 2MB array, ...
> >
> > In the current (uncompressed) page-table scheme, page management is
> > mostly limited to a few bitmaps and similar.
> >
> >
> >
> > Though, this stuff still doesn't work particularly reliably as of yet
> > (trying to run programs from swapfile backed memory is still pretty buggy).
> >
> > There is also some hackery, like needing to make sure that everything in
> > the FAT driver and similar is using physically mapped memory.
> >
> > There are ending up being various functions for memory allocation with
> > various properties:
> > tk_malloc: Generic malloc, may use Swap
> > tk_malloc_usr: Usermode accessible (kernel), may use Swap
> > tk_malloc_wxe: Read/Write/Execute, Physically Mapped
> > tk_malloc_krn: Kernel-Mode memory, Physically Mapped
> > ...
> >
> > Internally, these map to a "tk_malloc_cat" function, which adds a
> > "category" argument describing the type of memory one wants (currently
> > as a "magic number" scheme).
> >
> >
> > Generally, anything allocated from "userland" (with the exception of WXE
> > memory) is currently assumed to use Swap.
> >
> > Anything which may be accessed from an ISR needs to be physically mapped
> > (ISRs don't use address translation).
> >
> > ...
> >
> >
> >
> > However, the program loaders still use direct-mapping for the '.text'
> > sections and similar (for now at least, may change this later). It is
> > likely though that the '.data'/'.bss' sections could be put into
> > swap-backed memory (in my ABI, the read-only and read/write sections are
> > handled independently of each other, *).
> >
> > *: This is a little hacky as PE/COFF wasn't really designed to be used
> > this way, but either way.
> >
> >
> >
> > Because both the swapfile code and the FAT driver may access the SD
> > card, trying to page stuff in or out from within another IO operation
> > is potentially catastrophic (though the swapfile code does mostly
> > bypass the FAT driver; it figures out the LBA range and goes from
> > there).
> >
> > It doesn't currently deal with any fragmentation in the pagefile. It is
> > possible I could try to deal with this; to be decided whether doing so
> > is worthwhile.
> >
> > Doing so would require one of:
> > Pretranslating and caching the FAT cluster chains;
> > Hacking into the FAT FS driver internals;
> > Feeding swapfile operations through normal filesystem IO interfaces.
> >
> > Sending swapfile stuff through the filesystem driver is undesirable, for
> > various reasons. For the time being, it is easier to require an
> > unfragmented swapfile (which, granted, does require some special care
> > when setting up the SDcard).
> >
> > Would have almost gone the Linux route of requiring a swap partition,
> > except I am using Windows and there is no good way to set this up short
> > of doing it on a Linux machine, so using a file was the easier option.
> >
> >
> >
> > I am at the moment again torn between wanting to post the code to GitHub
> > (since it has been a few weeks), and not wanting to post the code when
> > stuff is still obviously buggy or broken.
> >
> >
> > I am left to wonder if all this stuff is usually so much of a PITA to
> > debug for everyone else, or if I just kinda suck at it...
> > >>> The TLBE is 256 bits in size, and about 90% of the bits are used. The PTE could be
> > >>> smaller, but I am having it mimic the TLBE to keep things simple.
> > >>> For the hash table, hash groups are used to alleviate the need to walk chains. The hash
> > >>> chain walks by a group not by a single entry. Groups contain eight entries that hashed to
> > >>> the same value or are displaced, so there is seldom a need to walk more than two or
> > >>> three steps.
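The group-walk described above reads roughly like this in C. The struct layout and field widths are illustrative only (the real TLBE is 256 bits); just the probe-eight-entries-then-follow-the-chain shape comes from the post:

```c
#include <stdint.h>

#define GROUP_ENTRIES 8

typedef struct {
    uint64_t vpn;   /* virtual page number (0 = empty slot)              */
    uint64_t pte;   /* translation + rights (stand-in for a 256-bit TLBE) */
} HashEntry;

typedef struct HashGroup {
    HashEntry e[GROUP_ENTRIES];
    struct HashGroup *next;   /* next group in the (short) chain */
} HashGroup;

/* Walk the chain a group at a time: all eight slots of a group are
 * probed before following the chain link, so two or three steps cover
 * up to twenty-four candidate entries. */
const HashEntry *group_lookup(const HashGroup *g, uint64_t vpn)
{
    for (; g; g = g->next)
        for (int i = 0; i < GROUP_ENTRIES; i++)
            if (g->e[i].vpn == vpn)
                return &g->e[i];
    return 0;  /* miss: fall back to a table walk / fault */
}

/* Tiny self-check over a two-group chain (hypothetical entries). */
static uint64_t demo_lookup(uint64_t vpn)
{
    HashGroup g2 = {{{5, 500}}, 0};
    HashGroup g1 = {{{1, 100}, {2, 200}}, &g2};
    const HashEntry *e = group_lookup(&g1, vpn);
    return e ? e->pte : 0;
}
```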
> > >
> > > There are only a couple of things I could think of for poorly formed table entries. If the
> > > physical address is not a legal value, or if the page mask begin is greater than the page
> > > mask end. Otherwise, the remaining fields could have just about any value. If the physical
> > > address is not legal a PMA exception will occur when the address is mapped. If the page
> > > mask begin is greater than the end, then there will be no access to the page resulting in
> > > read, write, or execute violations. I suppose the cause code could be expanded to add
> > > syndrome bits. They may be useful for other units too.
> > >
> > > I have the page compression indicators covering a 4kB block sub-page range, while the
> > > page mask begin/end divide the page into 1kB blocks. It may be better to cover the
> > > same range with both fields. 4kB blocks might be used to emulate an older system.
> > > The first half of the PTE is mainly the translation info. The second half is access rights,
> > > key and compression indicators. The second half of the PTE is loaded from a per-page
> > > page management table. It is info that needs to be in the TLBE so the PTE is just used to
> > > cache the information. The second half of the PTE could be stuffed with other information
> > > for software use until the access rights info is made valid.
> > >
> > So, considering something simplistic like "skip the zeroed parts of a
> > not-entirely-zero page" or something?...
> >
> >
> > I guess this is possible, vs what I was considering of having one of 3
> > options:
> > Zeroed / Skipped;
> > Stored;
> > LZ Compressed.
> >
> > At present (with an uncompressed pagefile), only the Zeroed and Stored
> > states exist.
> >
> >
> > Though, I still have a few concerns with my idea:
> > Making LZ compression sufficiently fast is still an issue;
> > Doing a chunk buffer turns a large number of small stalls into a smaller
> > number of large stalls.
> >
> > It will probably be painfully obvious if the whole system stalls for
> > several seconds at a time to write several MB of compressed pagefile
> > data to the SD card.
> >
> >
> > Though, looking stuff up, it appears I may not need to write the data
> > all at once; rather, it may be sufficient to write the data in roughly
> > sequential order.
> >
> > This would imply abandoning the 4MB chunking idea, and instead treating
> > the swap as a circular log. This would, however, require some way to
> > relocate or skip over existing pages if they are "still live".
> >
> > I could probably manage this with a cell bitmap (probably with 512B or
> > 1K cells), effectively treating the pagefile like a large bitmap-managed
> > heap. This could work OK, but its effectiveness (outside of an explicit
> > "compactor" mechanism) would depend highly on fragmentation.
> >
> >
> > Assuming 1K cells with a 2GB pagefile: 2M cells, needing a 256K bitmap
> > at 1b/cell.
> > Would need ~ 28b to encode the size and location of a compressed page.
> > Assuming there are 128k pages (each 16K), with 4B overhead per page,
> > this part costs 512K.
> >
> > Probably doable...
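A runnable miniature of the cell-bitmap idea above (names hypothetical, and NCELLS scaled far down from the 2M cells a 2GB pagefile would need, so the sketch can execute): first-fit run allocation over a bitmap, with fragmentation showing up as a failed run search.

```c
#include <stdint.h>

#define NCELLS 64                      /* scaled down from 2M cells */
static uint8_t cell_map[(NCELLS + 7) / 8];

static int  cell_used(int i) { return (cell_map[i >> 3] >> (i & 7)) & 1; }
static void cell_set(int i)  { cell_map[i >> 3] |= (uint8_t)(1 << (i & 7)); }

/* First-fit allocation of 'n' contiguous cells, returning the first
 * cell index or -1.  A compressed page's location + size could then be
 * packed into ~28 bits (21-bit cell index plus a run length). */
int cells_alloc(int n)
{
    for (int base = 0; base + n <= NCELLS; base++) {
        int ok = 1;
        for (int i = 0; i < n; i++)
            if (cell_used(base + i)) { ok = 0; break; }
        if (!ok) continue;
        for (int i = 0; i < n; i++)
            cell_set(base + i);
        return base;
    }
    return -1;  /* fragmentation: no contiguous run left */
}
```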
> >
> >
> > Though, it is possible that, if enough pages are poorly compressible,
> > fragmentation would leave the pagefile in an unusable state with this
> > design.
> >
> > Assuming 1:1 parity between pagefile size and logical page count, this
> > will likely only work OK if the average compressed page size is below
> > around 8K or so. The alternative would be to only fill the pagefile to
> > 50% or 75% capacity, such that the pagefile still works even if a
> > majority of the pages are poorly compressible.
> >
> > But, at that point, one would be better off using an uncompressed pagefile.
> >
> >
> > Some of my mock-up tests aren't giving as good of compression as I had
> > hoped. This idea would work a lot better if pages got an average
> > compression ratio of around 75% or so, but the 45%..65% I am seeing in
> > my mock-up tests is likely to be borderline at best.
> >
> >
> > From what I can find, most OS's seem to be using uncompressed pagefiles.
> > > I am assuming the PPN field can be used to store the page-file swap address when the
> > > page is not resident in memory.
> > >
> > Storing the logical pagefile page-number in non-resident pages is how I
> > was doing it.
> >
> > Address is Virtual:
> > Present==1:
> > Page number maps to a physical page address;
> > Present==0:
> > Page Number==0: PTE is Free (PTE==0) or Reserved (PTE!=0)
> > Page Number!=0: PTE points at a pagefile page
> >
> > ...
> >
> > Pagefile page 0 is currently reserved.
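The state encoding above can be written as a small classifier. The bit positions here (Present in bit 0, page number above the 16K page offset) are assumptions for illustration; the post specifies the states, not the layout:

```c
#include <stdint.h>

enum pte_state { PTE_FREE, PTE_RESERVED, PTE_SWAPPED, PTE_PRESENT };

#define PTE_P        1ull              /* assumed Present bit          */
#define PTE_PN(pte)  ((pte) >> 14)    /* page number above 16K offset */

enum pte_state pte_classify(uint64_t pte)
{
    if (pte & PTE_P)
        return PTE_PRESENT;            /* PN maps to a physical page   */
    if (PTE_PN(pte) == 0)
        return (pte == 0) ? PTE_FREE : PTE_RESERVED;
    return PTE_SWAPPED;                /* PN is a pagefile page; page 0
                                          is reserved, so PN != 0 here */
}
```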
> Ran out of block RAMs for my design. I counted them up, but the toolset must have
> replicated a few of them. It worked out to about 800 block RAMs, and there are only
> 730 available. So, now I am scaling things down a bit. Going back to a hierarchically
> oriented table, since with MMU caching it probably offers the fastest access. I am
> also thinking of using software to reduce the hardware footprint. Done some thinking
> about the purpose of what I am doing. I want the thing to run in a low-cost FPGA.
>
> Squeezed the page table entries down to 64-bits for the hierarchical table. Basically,
> using the PTE from Mitch’s design, but I had a couple of extra bits to work with since
> pages are 64kB. I took the “where” bits and added a bit to allow secondary storage
> locations: HDD, SSD, network, etc. Four bits are used to indicate 16kB page slices to access.
<
So, you gained 3-bits on my design, and then blew 4-bits for subPages. Might as well
just use 8K pages and be 1-bit better off.
<
Also note: If the PT[PE] is not present (does not map anything in memory) then the
rest of the bits have no HW meaning (that is they can point at secondary storage.)
<
> “Where” makes me think of “were” as in were-wolf. So, the “were” bits can transform
> from one type to another.


Re: Packing Hash Tables

<7d781492-da00-4d02-aeab-7b4e9d7cec70n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=24481&group=comp.arch#24481

Newsgroups: comp.arch
Message-ID: <7d781492-da00-4d02-aeab-7b4e9d7cec70n@googlegroups.com>
Subject: Re: Packing Hash Tables
From: robfi...@gmail.com (robf...@gmail.com)
 by: robf...@gmail.com - Fri, 25 Mar 2022 22:37 UTC

On Friday, March 25, 2022 at 11:14:32 AM UTC-4, MitchAlsup wrote:
> On Friday, March 25, 2022 at 12:40:05 AM UTC-5, robf...@gmail.com wrote:
> > On Tuesday, March 22, 2022 at 4:50:14 PM UTC-4, BGB wrote:
> > > On 3/22/2022 10:59 AM, robf...@gmail.com wrote:
> > > > On Tuesday, March 22, 2022 at 11:07:15 AM UTC-4, MitchAlsup wrote:
> > > >> On Tuesday, March 22, 2022 at 1:28:05 AM UTC-5, robf...@gmail.com wrote:
> > > >>> On Saturday, March 19, 2022 at 4:19:12 PM UTC-4, EricP wrote:
> > > >>>> MitchAlsup wrote:
> > > >>>>> On Friday, March 18, 2022 at 11:08:08 AM UTC-5, EricP wrote:
> > > >>>>>> MitchAlsup wrote:
> > > >>>>>>> <
> > > >>>>>>> Page Fault gets attached to thread number (which accesses the faulting IP)
> > > >>>>>>> ....................gets the faulting instruction-specifier in R1
> > > >>>>>> I assume this is the instruction pointer.
> > > >>>>> <
> > > >>>>> It is the first 32-bits of the instruction itself. This contains everything but the
> > > >>>>> constants.
> > > >>>> If it faults during instruction fetch then this instruction specifier
> > > >>>> buffer could be empty or partially filled.
> > > >>>>
> > > >>>> It absolutely needs the address of the start of the faulting instruction.
> > > >>>> Structured exception handlers. Debugger watch points.
> > > >>>> These can't function without it.
> > > >>>>
> > > >>>> You can provide this partial instruction bits in addition to the standard
> > > >>>> exception details. Though as I point out below I don't think anyone
> > > >>>> will need this if the page fault syndrome bits are provided.
> > > >>>>> <
> > > >>>>>>> ....................gets the faulting virtual address in R2
> > > >>>>>>> ....................gets the address of fault generating PTE/PTP in R3
> > > >>>>>> Passing the physical address of the faulting PTE to the page fault
> > > >>>>>> handler doesn't harm anything but it doesn't really help.
> > > >>>>>> Though I've not seen a page fault exception frame that does this.
> > > >>>>>>
> > > >>>>>> This misses the critical page fault reason info. The page fault handler
> > > >>>>>> needs to know what the thread was attempting to do when it triggered
> > > >>>>>> the fault. Looking at x86 that is a set of bit flags indicating:
> > > >>>>>>
> > > >>>>>> P = Caused by Not-Present PTE, or other protection codes
> > > >>>>>> R/W = Caused by Read, or caused by Write
> > > >>>>>> U/S = Cause by User/Super mode
> > > >>>>>> RSVD = Reserved bit non-zero
> > > >>>>>> I/D = access was instruction fetch, or data access
> > > >>>>>> PK = Protection Key violation
> > > >>>>>> SGX = SGX Software Guard Extensions violation
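For reference, the x86 flags listed above occupy these positions in the #PF error code; the predicate at the end is one common handler policy (not-present faults from the non-reserved, non-key paths go to the page-in / copy-on-write path), not part of the post:

```c
#include <stdint.h>

/* x86 #PF error-code bits (Intel SDM Vol. 3, exception 14). */
#define PF_P     (1u << 0)   /* 0: not-present, 1: protection violation */
#define PF_WR    (1u << 1)   /* 0: read, 1: write                       */
#define PF_US    (1u << 2)   /* 0: supervisor, 1: user mode             */
#define PF_RSVD  (1u << 3)   /* reserved bit set in a paging structure  */
#define PF_ID    (1u << 4)   /* 1: instruction fetch                    */
#define PF_PK    (1u << 5)   /* protection-key violation                */
#define PF_SGX   (1u << 15)  /* SGX access-control violation            */

/* A not-present fault with no reserved-bit/key/SGX errors is the
 * "demand page it in" path; anything else is a protection error. */
int pf_is_demand_fault(uint32_t err)
{
    return !(err & (PF_P | PF_RSVD | PF_PK | PF_SGX));
}
```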
> > > >>>>> <
> > > >>>>> The given address points at the PT[PE] where the fault was detected.
> > > >>>>> So, for example if you were trying to Read the first PT[PE] that does not have
> > > >>>>> read permission is where you end up pointing {same for write and execute}
> > > >>>>> <
> > > >>>>> The instruction tells you what the instruction stream was doing.
> > > >>>>> Inst<31..26> = 001001 & Inst<10..5> = 0000xx -> LD
> > > >>>>> Inst<31..26> = 001001 & Inst<10..5> = 00x1xx -> ST
> > > >>>>> Inst<31..26> = 1000xx LD
> > > >>>>> Inst<31..26> = 1001xx ST
> > > >>>>> <
> > > >>>>> So, there is a clever trick, Because My 66000 detects all mal-formed instructions,
> > > >>>>> {mal-formed instructions arrive at the Operation exception handler}
> > > >>>>> the page fault handler can look at Instruction<31> and decide to look at
> > > >>>>> Inst<28> or Inst<7> where 0->LD and 1->ST.
> > > >>>>> So 3 PREDs in a row parse the instruction that caused the page fault.
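The three-predicate parse described above, transcribed to C using only the field positions quoted (Inst&lt;31&gt;, then Inst&lt;28&gt; or Inst&lt;7&gt;). It assumes the faulting instruction is already known to be a load or store, since mal-formed instructions go to the Operation exception handler:

```c
#include <stdint.h>

#define BIT(x, n) (((x) >> (n)) & 1u)

/* Returns 1 for a store, 0 for a load, given the first 32 bits of the
 * faulting instruction.  For major opcodes 1000xx/1001xx, bit 28
 * distinguishes LD from ST; for the 001001 major opcode, bit 7 of the
 * Inst<10..5> minor field does. */
int fault_was_store(uint32_t inst)
{
    if (BIT(inst, 31))
        return BIT(inst, 28);  /* 1000xx -> LD, 1001xx -> ST       */
    else
        return BIT(inst, 7);   /* 0000xx -> LD, 00x1xx -> ST minor */
}
```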
> > > >>>>> <
> > > >>>>> But you make a good point:: I have another 32-bits in R1 I could completely
> > > >>>>> specify what problem was detected. Let me work on that.
> > > >>>> The fault syndrome bits should be sufficient.
> > > >>>> I don't think anyone wants to have to reverse guesstimate what the
> > > >>>> problem was in a page fault handler by parsing instructions.
> > > >>>>
> > > >>>> An emulator might want to parse the faulting instructions but that
> > > >>>> is an application specific use for the general page fault mechanism
> > > >>>> and would take place in an application specific user mode handler.
> > > >>> For my core I have both the address of the faulting instruction stored in an exception
> > > >>> instruction pointer register, and the faulting address stored in a bad address register.
> > > >>> The cause of an exception is stored in a cause register. Maybe the cause register
> > > >>> does not have enough information, but it can tell if the fault was data or instruction,
> > > >>> or read, write, or execute violations.
> > > >> <
> > > >> There is also the case of a poorly-formed table entry.
> > > >>>
> > > >>> Looks like its back to the drawing board for me. Planning on using compressed
> > > >>> pages, an idea gleaned from BGB’s posting.
> > > >>> I was using an inverted hash table, but had not considered memory paged out to disk.
> > > >>> Being able to swap out to disk means making the page table larger, meaning it will no
> > > >>> longer fit in block RAM. So, back to using a TLB. I was thinking of increasing the page
> > > >>> table size by a factor of four: ¼ real memory, ¾ disk storage. I gather it's not a good idea
> > > >>> to have too large of a swap file.
> > > >>>
> > > I think the size of a swap file is a matter of interpretation.
> > >
> > > In current tests, sizes are:
> > > RAM = 128MB (set to match the Nexys A7 I am using)
> > > SWAP = 256MB or 384MB in emulation and simulation tests.
> > >
> > >
> > > On my SDcard, IIRC, the swapfile was set up as 2GB (and I didn't
> > > originally think it would take this long to get this stuff working).
> > >
> > > I am not assuming any particular limit at the moment, apart from the 4GB
> > > size limit implied by FAT32 (at present, this would be 256k pages).
> > >
> > >
> > > Currently, the swap code sets up a 32MB region of RAM to be used as swap
> > > pages, with the rest (for now) being left for either physical mapping or
> > > direct remapping (where pages are mapped into virtual addresses, but are
> > > not backed by swap).
> > >
> > > I initially thought 32MB would be too limiting, but it appears that Doom
> > > and similar can sit pretty easily in this space.
> > >