From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Compact representation for common integer constants
Date: Wed, 5 May 2021 22:57:40 -0500

On 5/5/2021 7:07 PM, MitchAlsup wrote:
> On Wednesday, May 5, 2021 at 3:07:43 PM UTC-5, BGB wrote:
>> On 5/5/2021 2:25 AM, JohnG wrote:
>>> Hey All,
>>>
>>> I've had this for ages and figured I'd toss it out for people's consideration.
>>>
>>> Instead of a complete binary representation, represent common integers as a product of a small set of prime factors times a left shift.
>>>
>>> Example, an 8-bit value with 1 bit for the presence of a factor of 3, and 2 bits for factors of 5, 25, and 125, and the remaining 5 bits for left shift. So this would cover integers that were multiples of 1, 3, 5, 15, 25, 75, 125, and 375 then left-shifted between 0 and 31 bits.
>>>
>>> It seemed like a nice way to encode a lot of useful constants in a compact opcode format. Variants of the basic idea are hopefully obvious.
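
(As an aside, to make the format concrete: a minimal C sketch of decoding
this 8-bit scheme; the exact bit placement here is my own guess:)

#include <stdint.h>

/* Decode: bit 7 = optional factor of 3, bits 6:5 = power of 5 (1/5/25/125),
   bits 4:0 = left shift. Covers {1,3,5,15,25,75,125,375} << (0..31). */
uint64_t decode_pf8(uint8_t enc)
{
    static const uint32_t pow5[4] = { 1, 5, 25, 125 };
    uint64_t v = (enc & 0x80) ? 3 : 1;
    v *= pow5[(enc >> 5) & 3];
    return v << (enc & 0x1F);
}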
>>>
>> My experiments with shifted-constants were not particularly promising:
>> Normal integer constants tend to be highly clustered around zero;
> <
> With 80%+ of them positive
> <

Yes, granted.

>> There are few obvious common patterns for most larger constants.
>>
>> And, for this example, not much reason why multiples of 3 or 5 would be
>> significantly more common than non-multiples of 3 or 5, now combined
>> with needing hardware to scale things by a multiple of 3 or 5.
>>
>> This can maybe deal slightly better with "casual" constants, like
>> "i=10000;" or similar, but these cases are a relative minority IME, and
>> probably not worth the cost of needing special handling in hardware.
> <
> Note: 10,000 fits in 14-bits.....

True, but as can be noted, most 3R encodings in my case need a jumbo
prefix to hold a value wider than 9 bits.

Eg:
  ADD R12, 291, R29
can be encoded using a 32-bit form, whereas:
  ADD R12, 9999, R29
needs a 64-bit encoding.
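
Something like the following (a rough sketch of the encoder's decision;
the exact immediate rules may differ between ops, so treat the ranges as
illustrative):

#include <stdint.h>

/* Does 'imm' need a 64-bit jumbo-prefixed encoding? (assumed rule:
   the 32-bit 3R form holds a 9-bit immediate) */
int needs_jumbo(int64_t imm)
{
    int fits_u9 = (imm >= 0    && imm <= 511);  /* zero-extended */
    int fits_s9 = (imm >= -256 && imm <= 255);  /* sign-extended */
    return !(fits_u9 || fits_s9);  /* 291 fits; 9999 does not */
}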

>>
>>
>> The clustering near zero is also lopsided, with significantly more
>> positive constants than negative ones. The probability also seems to
>> fall off with roughly the square of the distance from zero.
> <
> 4-to-6 positive constants to 1 negative constant for integer stuff
> For logical masking stuff there is a bit more on the negative side.

Yep.

>>
>>
>>
>> The majority of constants which are not clustered near zero tend to fall
>> into one of several categories:
>> Hard-coded memory addresses or similar;
>> Bit-masks or similar;
>> Floating-point literals.
>>
>>
>> Memory Addresses: Not a whole lot one can do there, but these do make a
>> strong case for some sort of relative addressing, such as PC-Relative or
>> having a Global Pointer or similar.
> <
> Which My 66000 has...

Likewise, BJX2 has these as well.

Not every ISA has them though...

>>
>> In which case, the number of bits needed is proportional to
>> text+data+bss, which for "most binaries" (at least of the ones I am
>> testing) tends to fall in the range of 20..24 bits.
> <
> I am seeing 2% of the addresses needing to be 64 bits in size on a
> 64-bit address space machine.

Pretty much.

I have a 48-bit space, but thus far pretty much everything sits in the
low 4GB.

I figure I may as well use the full pointer size though, since the
relative cost isn't too drastic, and dealing with 32/64 ABI
compatibility issues is likely to be a bigger issue in the future than
needing to use ~ 3% more RAM.

Code size isn't really affected much either way, though it could be
slightly better if 48-bit Disp24 encodings or similar were reintroduced;
this is unlikely at the moment.

I actually saw a bigger theoretical gain from the 24-bit encodings, but,
as can be noted, even in the cases where they were relevant and
eliminated a fair chunk of the 32-bit encodings, the overall savings
were fairly modest.

Say, program is:
70% 16-bit ops, 30% 32-bit ops.
And, we eliminate 1/3 of the 32-bit ops:
70% 16-bit ops, 20% 32-bit ops, 10% 24-bit ops.

Binary is only 4% smaller...
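
(Average instruction size drops from 0.7*2 + 0.3*4 = 2.6 bytes to
0.7*2 + 0.2*4 + 0.1*3 = 2.5 bytes, so roughly 3.8%.)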

For comparison, LZ4 packing tends to save ~ 40-60%, though with the
tradeoff that it is necessary to unpack the code before it can be
executed (so, doesn't make much sense with a small RAM space).

Some sort of hardware-level LZ packing could be possible, in theory, but
would do evil things to the L1 I$; it would effectively need to be
multiple L1 caches glued together in order to be able to support the
in-flight LZ unpacking. The impact on the compiler would likely also be
kinda evil as it would effectively need to perform both code generation
and dictionary compression at the same time (and the constraints imposed
on it are likely to make it somewhat less effective than a traditional
LZ compressor).

>>
>> Though, for relative addressing for structs and arrays, ~ 8..12 bits
>> seems to be "mostly sufficient" (I am having OK results with a 9-bit
>> displacement for load/store ops; but as noted, I typically need
>> "something bigger" to deal with PC-rel or GBR-rel).
> <
> I remember that the 16-bit immediates on Mc 88K were a lot more useful
> in SPEC{89,93} than the 13-bit immediates in SPARC.

Could be, dunno...

As noted, a 9-bit scaled displacement with a struct, for 32 and 64-bit
load/store, allows a struct of 2K or 4K (512 elements times 4 or 8 bytes).

Given a majority of structs tend to be less than 512 bytes, it works out
OK for the most part.

Albeit, large structs end up needing 64-bit jumbo encodings to deal with
the field displacements.

Some of this is partly a tradeoff of being able to support 16-bit
instructions, WEX encodings, predicated instructions, ... These things
did cut a few bits off the encoding space.

This was compensated for originally by being able to readily load
constants into registers when needed.

>>
>> In my case, since my C ABI is mostly built around GBR relative
>> addressing, the number of full-width addresses (48-bits, in theory) is
>> relatively small (in practice, most of these are references to the MMIO
>> area at 0000_F0000000, *1).

> <
> Why does each I/O device not have its own "base register" which you load
> and then talk at it using small displacements ??

Each MMIO device does have a base address and range...

But one needs to load an address into it, somewhere...

so, eg, something like:

#include <stdint.h>

#define SOMEDEVICE_STATUS  0  /* register indices; illustrative values */
#define SOMEDEVICE_DATAQ   1

volatile uint32_t *somedevice_regs32;
volatile uint64_t *somedevice_regs64;

void somedevice_init()
{
    somedevice_regs32 = (volatile uint32_t *)0x0000F00CD000ULL;
    somedevice_regs64 = (volatile uint64_t *)somedevice_regs32;
}

uint64_t somedevice_access()
{
    uint32_t fl;
    uint64_t v;
    fl = somedevice_regs32[SOMEDEVICE_STATUS];
    /* ... check status or whatever ... */
    v = somedevice_regs64[SOMEDEVICE_DATAQ];
    return (v);
}

If I make the change I mentioned, it would be necessary to instead
write, say:
somedevice_regs32 = (volatile uint32_t *)0xF000000CD000ULL;
somedevice_regs64 = (volatile uint64_t *)somedevice_regs32;

But, they would correspond to the same address as far as the hardware is
concerned.

There is also an MMIO bypass range at 0xC00000000000...

So:
void *mmu_cant_touch_this;
mmu_cant_touch_this = (void *)0xC00001234000ULL;
But, userland code "can't touch this" either...

Though, there is the tradeoff that the L1 cache won't necessarily see
them as the same memory. If a store is performed to one address, and a
read is then done from another which happens to alias to the same
underlying memory, the results of the first store may not (necessarily)
be visible...

However, in simple cases, the direct-mapped L1 is likely to evict one
location before it loads the other (the aliases differ only in the high
address bits, so they index the same cache line). This pattern will fall
on its face if the MMU is used, though.

In the future, though, it is possible that the caches could be made to
keep track of this stuff: the L1 caches would signal their intentions
for a cache-line to the L2, which would keep track of things and then
notify the L1 caches about cases where they need to flush certain cache
lines.

>>
>> *1: I am almost half-tempted to eliminate this region if using 48-bit
>> addressing with the MMU enabled (JQ+MMU). This would force using 48-bit
>> MMIO addresses if the MMU is enabled, but would effectively give an
>> entirely flat userland region (vs, "mostly flat but with a random hole
>> just below the 4GB mark for MMIO and similar").
>>
>>
>>
>> Bit Masks:
>> These are the most likely to benefit from some form of shifted encoding,
>> but need to be able to encode several types of patterns (to be useful).
>> 0000_0010_0000_0000 / 0000_0000_0800_0000 / ...
>> 0000_007F_FFFF_FFFF / 0000_0000_01FF_FFFF / ...
>> 0000_3FE0_0000_0000 / 0000_0000_01FC_0000 / ...
>> FFFF_FFEF_FFFF_FFFF / FFFF_FFFF_F7FF_FFFF / ...
>> FFFF_FF80_0000_0000 / FFFF_FFFF_FE00_0000 / ...
>> FFFF_C01F_FFFF_FFFF / FFFF_FFFF_FE03_FFFF / ...
>> ...
> <
> Mc 88K has a MAKe instruction which created a 000111000 kind of mask
> using two 5-bit values: offset and size of the 1s field. My 66000 dropped
> this easy-to-perform instruction because immediates are more efficient
> in execution time.

OK.

>>
>> Here, one needs to hit "approximately" the right area, and have rules
>> for filling in the rest of the bits.
>>
>> In an ideal case, one would need 16-bits of payload, with a shift, and a
>> few bits as a fill pattern selector. To save bits, the shift could be a
>> multiple of 4 or 8 bits.
>>
>> This would take ~ 22-bits, which is a bit steep.
>> Cramming it down to 12 to 16 bits is possible, but makes it a lot less
>> useful (eg: 10-bit literal, 4-bit nybble-shift, 2-bit fill pattern).
>>
>>
>> Implementation is more difficult, as then one either needs to spend the
>> resources to handle it in the decoder, or re-purpose another unit (such
>> as the shift unit). Doing it in a way that leaves it "sufficiently
>> useful" (vs simpler options), is more difficult.
>>
>> Shift-style units are kind of expensive, so aren't a good thing to put
>> in the instruction decoder.
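
(To make the 16-bit variant concrete, here is a rough sketch of what the
decode logic might look like; the field layout and fill patterns are made
up for illustration:)

#include <stdint.h>

/* Hypothetical 16-bit shifted-mask immediate:
     [9:0]   10-bit literal
     [13:10] shift, in nybbles (0..60 bits)
     [15:14] fill pattern for the bits outside the literal */
uint64_t decode_mask16(uint16_t enc)
{
    uint64_t lit   = enc & 0x3FF;
    int      shift = ((enc >> 10) & 0xF) * 4;
    uint64_t lo    = (1ULL << shift) - 1;          /* bits below literal */
    uint64_t hi    = ~((0x3FFULL << shift) | lo);  /* bits above literal */

    switch ((enc >> 14) & 3) {
    case 0:  return (lit << shift);            /* zero-fill both sides */
    case 1:  return (lit << shift) | lo;       /* one-fill low bits    */
    case 2:  return (lit << shift) | hi;       /* one-fill high bits   */
    default: return (lit << shift) | lo | hi;  /* one-fill both sides  */
    }
}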
>>
>>
>> Floating point literals:
>> If one has a few free spots with 16-bit immediate values, and a
>> Half-Float converter, then this actually works fairly well (a majority
>> of floating point constants which appear in code tend to be able to be
>> expressed exactly as Half-Float).
>>
>> Main exceptions (can't be encoded exactly / at-all):
>> Powers of 10, eg: 1000.0, 0.1, 0.001, ...
>> Series of 9s, eg: 9999.0
>> Values that fall outside the exponent range
>> Values derived from irrational numbers
>> ...
>>
>> However, for: 1.0, 1.5, 0.75, 0.5, ...
>> They work pretty well.
> <
> PDP-10 had a pretty workable scheme for float constants from reasonably
> sized constants.
>
> My 66000 looked at this and decided not to "waste" the 16-bit immediate
> OpCode space and save it for future use.

I have a few spots left for "Imm16, Rn" ops, but I use them sparingly...

Currently:
LDIZ, LDIN, ADD, LDISH, FLDCH, -, -, -

LDIZ/LDIN: Load a zero-extended or ones-extended 16-bit constant.
ADD: Add 16-bit constant, sign-extended.
LDISH: Rn=(Rn<<16)|Imm16u
FLDCH: Load a Half-Float Immediate
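
FLDCH basically amounts to a binary16-to-binary64 widening in the
decoder. Roughly (a simplified C model of the conversion, not the actual
implementation; subnormal and NaN handling abbreviated):

#include <stdint.h>

/* Widen an IEEE binary16 bit pattern to a binary64 bit pattern. */
uint64_t fldch_expand(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    int      exp  = (h >> 10) & 0x1F;   /* 5-bit exponent, bias 15 */
    uint64_t frac = h & 0x3FF;          /* 10-bit fraction         */

    if (exp == 0x1F)                    /* Inf/NaN: max exponent   */
        return sign | (0x7FFULL << 52) | (frac << 42);
    if (exp == 0) {
        if (frac == 0)                  /* signed zero             */
            return sign;
        while (!(frac & 0x400)) {       /* normalize a subnormal   */
            frac <<= 1; exp--;
        }
        frac &= 0x3FF;
        exp++;
    }
    /* rebias 15 -> 1023, widen the 10-bit fraction to 52 bits */
    return sign | ((uint64_t)(exp + (1023 - 15)) << 52) | (frac << 42);
}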

>>
>> Even E4.F4 microfloats (FP8) or similar "aren't completely useless" (and
>> may provide a more compact way to express things like FP-SIMD vectors).
>>
>> These cases may also be able to express a range of patterns which while
>> not themselves FP constants, happen to exist as a bit pattern which can
>> be represented exactly using packed FP16 or FP8. This includes a certain
>> subset of bit-masks.
>>
>>
>> The "exception" cases tend not to follow an obvious pattern (in terms of
>> global statistics) so are "at best" handled with a lookup table, whose
>> optimal contents will tend to vary from one application to another.
>>
>> In this latter case, the main viable options then end up being to encode
>> constants inline, or fetch them from a software managed lookup table,
>> with the usual sorts of tradeoffs.
