
Re: Squeezing Those Bits: Concertina II

https://www.novabbs.com/computers/article-flat.php?id=17368&group=comp.arch#17368

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Thu, 3 Jun 2021 22:18:44 -0500
Message-ID: <s9c62v$i4m$1@dont-email.me>
In-Reply-To: <29140af3-45cc-4aca-9a90-52cfaab6388bn@googlegroups.com>
 by: BGB - Fri, 4 Jun 2021 03:18 UTC

On 6/3/2021 6:26 PM, MitchAlsup wrote:
> On Thursday, June 3, 2021 at 2:47:40 PM UTC-5, BGB wrote:
>> On 6/3/2021 4:56 AM, Marcus wrote:
>>> On 2021-06-02 Anton Ertl wrote:
>>>> Marcus <m.de...@this.bitsnbites.eu> writes:
>>>>> A different axis is the dynamic instruction count (i.e. instruction
>>>>> execution frequencies) which I think is more relevant for
>>>>> performance analysis/tuning of an ISA.
>>>>
>>>> One would think so, but the problem is that in a corpus of large
>>>> programs the dynamically executed instructions are mostly confined to
>>>> a few hot spots, and the rest of the large programs plays hardly
>>>> any role. And as a consequence, the most frequent dynamically
>>>> executed instruction sequences tend to be not very representative of
>>>> what happens in other programs.
>>>
>>> Interesting observation. I suspected that something similar would be the
>>> case, since looping/hot code must give a very strong bias.
>>>
>>> I have planned to add instruction frequency profiling to my simulator (I
>>> already have symbol-based function profiling which has proven very
>>> useful), but I'll keep in mind the difficulties that you pointed out.
>>>
>>> I wonder if you could improve the situation if you used something like
>>> different scales (logarithmic?) and some sort of filtering or
>>> thresholding (e.g. count each memory location at most N times) to
>>> reduce the bias from loops? Or possibly categorize counts into different
>>> bins (e.g. "cold", "medium", "hot").
>>>
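
FWIW, the "at most N times" thresholding could be about this simple
(untested sketch; sizes and names made up):

#include <stdint.h>

#define ICOUNT_CAP 256            /* the "N" above, arbitrary */
static uint16_t icount[1 << 20];  /* one counter per 16-bit instr word */

static void profile_pc(uint32_t pc)
{
    uint32_t ix = (pc >> 1) & ((1u << 20) - 1);
    if (icount[ix] < ICOUNT_CAP)
        icount[ix]++;             /* saturates; hot loops stop counting */
}

Hot-loop instructions then top out at the cap instead of swamping the
totals.
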
> <
>> I have instruction-use stats in my emulator, but they tend to look
>> something like:
>> Program hammers it with various load/store instructions;
>> Slightly down the list, one encounters some branch ops;
>> Then ADD, TEST, SHLD / SHAD, ...;
>> Then CMPxx, and then the other ALU ops;
>>
>> Pretty much everything else tends to be sort of a soup of fractional
>> percentages at the bottom (except under fairly specific workloads).
>>
>>
>> The patterns do tend to imply that some higher-priority features are:
>> Load/Store Ops:
>> Load/Store with constant displacement;
>> Load/Store with register index scaled by element size;
>> Load/Store in both signed and unsigned variants.
>> Followed by Branch ops in (PC, Disp) form;
>> Followed by ADD/SUB and TEST/CMP in Reg/Immed forms;
>> Followed by shift ops in "Reg, Imm, Reg" form;
>> Followed by 3R ALU ops (Both "Reg, Reg, Reg" and "Reg, Imm, Reg");
>> ...
>>
>> Overall instruction counts are also useful to consider for both size and
>> performance optimization.
>>
>>
>> For general-purpose code, there is seemingly no real way to eliminate
>> the high frequency of load/store operations. In theory, "strict
>> aliasing" semantics in C can help, but this only goes so far and is
>> mostly N/A for code which does these optimizations itself.
>>
>> Being able to keep things in registers is good, but there is still a
>> limiting factor when most of one's data is passed around in memory via
>> pointers (as is typical in most C code).
>>
>>
>> Loops tend to be fairly common and particularly biased towards using
>> lots of load/store ops, since frequently they are working on array data,
>> and arrays are almost invariably kept in memory.
> <
> It is moderately difficult to make a loop that is both useful and does not need loads
> or stores! But I digress.........

Pretty much.

Loops are sort of a place where memory loads and stores like to gather...
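
For example (plain C): even about the simplest "useful" loop one can
write still needs a load every iteration, and nothing the compiler can
do gets rid of it:

long sum_array(const int *arr, int n)
{
    long s = 0;
    int i;
    for (i = 0; i < n; i++)
        s += arr[i];    /* one memory load per iteration */
    return s;
}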

>>
>>
>> For loads/stores:
>> The vast majority of constant displacements tend to be positive (unless
>> one designs the ABI to use negative-offsets relative to a frame-pointer
>> or similar);
> <
> And of these, with 16-bit displacements, one would actually want about 7/8ths
> of the displacements to be positive rather than 1/2 positive and 1/2 negative.
> <

I was thinking of x86, where the common sequence was:
PUSH EBP        ;save caller's frame pointer
MOV EBP, ESP    ;establish the new frame
PUSH EDI        ;save callee-preserved registers
PUSH ESI
...

But, then local variables were frequently accessed using [EBP-Disp],
except in GCC, which IIRC typically did [ESP+Disp] or "$disp(%esp)".
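
So, eg (made-up function, offsets illustrative):

int foo(int a, int b)
{
    int x, y;       /* frame-pointer style: x at [EBP-4], y at [EBP-8] */
    x = a + b;      /* ESP-relative style: x and y at small [ESP+disp] */
    y = x << 2;
    return y;
}

Either way the magnitudes stay small; what differs is mostly just the
sign of the displacement.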

>> The vast majority of displacements also tend to be scaled by the element
>> size, so this makes sense as a default.
>>
>> Negative and misaligned displacements are still common enough, though,
>> that one needs some way to encode them without an excessive penalty.
>> In my ISA, these cases were mostly relegated to loading the
>> displacement into R0 and using a special (Rb, R0) encoding with an
>> unscaled index.
>>

Though, I forgot to mention that negative displacements can still be
encoded using a jumbo prefix.

I ended up going with positive-only displacements originally as, in my
testing, a 9-bit zero-extended displacement tended to do better in terms
of code density and performance than a 9-bit sign-extended displacement.
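
The encoder-side check is basically (sketch, not my actual encoder
code; names made up):

#include <stdbool.h>

/* 9-bit zero-extended field: 0..511 elements */
static bool disp_fits_u9(long disp, int elem_size)
{
    if (disp % elem_size != 0)    /* must be scaled by element size */
        return false;
    return (disp / elem_size) >= 0 && (disp / elem_size) <= 511;
}

/* 9-bit sign-extended field: -256..255 elements */
static bool disp_fits_s9(long disp, int elem_size)
{
    if (disp % elem_size != 0)
        return false;
    return (disp / elem_size) >= -256 && (disp / elem_size) <= 255;
}

With mostly-positive displacements, the zero-extended form covers
nearly twice as many of the displacements that actually occur.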

>>
>> IME, one also needs both sign and zero extending loads, since then
>> whatever is the non-default case ends up hammering the sign or zero
>> extension instructions.
> <
> absofriggenlutely !!

Yeah. Several ISAs only provide sign or zero extended loads, which isn't
ideal.
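
Eg, if only sign-extending loads exist, every unsigned narrow load eats
an extra instruction (C sketch of the effect):

#include <stdint.h>

uint32_t load_u16(const uint16_t *p)
{
    int32_t t = *(const int16_t *)p;  /* the sign-extending load     */
    return (uint32_t)t & 0xFFFFu;     /* extra AND to zero-extend it */
}

And vice versa: a zero-extend-only ISA ends up hammering an explicit
sign-extension op instead.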

>>
>> This is also a partial reason why "ADDS.L" and "ADDU.L" and similar
>> exist in my ISA, since typically:
>> Code makes heavy use of 32-bit integers;
>> For better or for worse, wrap-on-overflow is the commonly expected
>> behavior, and some amount of code will fail otherwise;
>> These eliminate the vast majority of cases where one might otherwise
>> need to sign or zero extend a value for a memory access (if the index
>> register is greater than 32 bits).
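
In C terms, these behave roughly like:

#include <stdint.h>

/* 32-bit wrapping add, result sign-extended into a 64-bit register */
static inline int64_t adds_l(int64_t a, int64_t b)
{
    return (int64_t)(int32_t)((uint32_t)a + (uint32_t)b);
}

/* same, but zero-extended (directly usable as an index) */
static inline int64_t addu_l(int64_t a, int64_t b)
{
    return (int64_t)(uint32_t)((uint32_t)a + (uint32_t)b);
}
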
>>
>> Note that AND/OR/XOR don't need 32-bit variants, since their behavior
>> with sign or zero extended values tends to be identical to what a
>> matching sign or zero extended operation would have produced.
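
Quick spot check of that property (low 32 bits agree no matter how the
inputs were extended to 64 bits):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 0x80000001u, b = 0xFFFF0001u;
    uint64_t z = (uint64_t)a & (uint64_t)b;            /* zero-extended */
    uint64_t s = (uint64_t)(int64_t)(int32_t)a &
                 (uint64_t)(int64_t)(int32_t)b;        /* sign-extended */
    assert((uint32_t)z == (uint32_t)s);
    return 0;
}
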
>>
>>
>>
>> In my compiler output, there also tends to be a lot of "MOV Reg,Reg"
>> instructions, but most of these are more due to my compiler being
>> stupid. If the generated code were closer to optimal, there would be a
>> lot fewer of these.
>>
> I see the majority of MOV Rd,Rs1 instructions at the middle of nested loops
> getting set up to do the outer ADD/CMP/BC so the top of the loop code runs
> smoothly.

In my case, there are a lot which exist more as an artifact of
intermediate IR stages and similar.

Front end uses ASTs;
ASTs are transformed into a Stack/RPN form;
RPN is converted to 3AC;
Codegen then uses the 3AC.

In many cases, the stack machine model works via pushing and popping
temporaries, where values are moved between operators, or between source
and destination variables and literals, via said temporaries.
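
A naive RPN -> 3AC lowering shows where the extra MOVs come from (toy
sketch, not the actual compiler code):

#include <stdio.h>

static int stk[64], sp = 0, tno = 0;   /* stack of temp numbers */

static void push_var(const char *v)    /* push: "tN = var", a MOV */
{
    printf("t%d = %s\n", tno, v);
    stk[sp++] = tno++;
}

static void op2(const char *op)        /* binary op on top two temps */
{
    int b = stk[--sp], a = stk[--sp];
    printf("t%d = t%d %s t%d\n", tno, a, op, b);
    stk[sp++] = tno++;
}

static void pop_store(const char *v)   /* pop: "var = tN", another MOV */
{
    printf("%s = t%d\n", v, stk[--sp]);
}

int main(void)
{
    /* "y = x << n;" arrives as RPN: x n << y = */
    push_var("x"); push_var("n"); op2("<<"); pop_store("y");
    return 0;
}

Three of the four statements this emits are plain MOVs; unless a later
pass folds them back into the operator, they survive into the output.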

This RPN form long predates my CPU ISA project (and some of my VM
projects had gone back and forth between RPN and 3AC models). RPN
generally had the advantage of being less painful for a compiler
frontend though.

It was generally easier to deal with some of these complexities in an
RPN -> 3AC/SSA converter stage, but this converter stage might also need
to do things like fold literal values back into an operator, fold the
destination back into the operator, etc., which might not have been
necessary had it gone more directly from the AST to 3AC.

The compiler logic doesn't always filter these out, so there is a
tendency for values to get loaded into a register, shuffled through
other registers, and then operated on after the fact.

Or a bug where:
y=x<<N;
Was being compiled as, effectively:
MOV N, R4          ;load shift amount into a register (unneeded)
EXTS.L R4, R5      ;sign-extend it (also unneeded)
SHAD R8, R5, R9    ;variable-shift form
Rather than:
SHAD R8, N, R9     ;shift-by-immediate form
....

Some other cases turn out to be code which was left over from before
various ops existed (eg: it uses MOV and a 2R op, because when that part
of the codegen was written, a 3R form hadn't been added yet).

And some amount of the compiler logic is a bit crufty as the ISA has
moved a fair bit from where it started out, and a lot of the design
changes were implemented using layers of hacks rather than doing a more
proper rewrite of that part of the compiler.

>>
>>
>> At the moment, there doesn't seem to be much obvious "this needs to be
>> better" stuff left in my core ISA, at least within the confines of the
>> general ISA design (eg, "Load/Store", ...).
> <
> see below::
>>
>> There are a few semi-common patterns which fall outside the load/store
>> model, most commonly using a memory operand as a source to an ALU
>> instruction (typically an integer add/subtract).
>>
>> But, I don't really want to go there, as doing stuff like this would
>> likely require using a longer pipeline.
>>
>> However, if one were willing to spend the cost of having such a longer
>> pipeline, they could potentially gain a few benefits:
>> Single cycle ALU ops against values loaded from memory;
>> Potentially being able to have fully pipelined FPU instructions;
>> ...
>>
>> With the downsides:
>> The resource cost of register forwarding and interlock handling goes up
>> rapidly with pipeline length;
>> Unpredictable branches are not friendly to longer pipelines;
>> ...
>>
>> There is also the potential for added complexity for instruction
>> decoding if one wants to go an x86-like route.
>>
>> Though, it looks like the majority of these cases could be addressed via
>> a "simple" encoding, eg:
>> ADDS.L (Rb, Disp), Rn
>> SUBU.L (Rb, Ri), Rn
>> ...
>> Basically, encoded like a memory load.
> <
> from above::
> <
> There is nothing that prevents an implementation from CoIssuing these
> as paired instructions.
>>
>> Though, the utility is diminished slightly in that these would (in many
>> cases) require an additional register MOV:
>> MOV Rs, Rn
>> ADDS.L (Rb, Disp), Rn
> <
> CoIssue
> <
> If you dig into CoIssue enough, you will see that as many as 30% of RISC
> instructions can be paired.

OK.

Sometimes, parallel encodings can eat the occasional MOV or similar, but
in a lot of cases a memory load directly followed by an ALU op triggers
an interlock stall if the compiler can't reorganize instructions to
eliminate the interlock.
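
The check itself is trivial (toy IR structs, not the real thing):

#include <stdbool.h>

typedef struct { bool is_load; int dst, src1, src2; } Ins;

/* does 'next' consume the register 'ld' is still loading? */
static bool load_use_stall(const Ins *ld, const Ins *next)
{
    return ld->is_load &&
        (next->src1 == ld->dst || next->src2 == ld->dst);
}

The hard part is finding something independent to move into the gap,
not detecting the stall.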

Though, it is unclear how much MemLoad+Op could save, or whether it
would be enough to justify the cost vs pure Load/Store.

> <
>> Whereas:
>> ADDS.L Rs, (Rb, Disp), Rn
>> SUBU.L Rs, (Rb, Ri), Rn
>> Would avoid an additional MOV, but would require being able to encode an
>> instruction with 4 register IDs.
>>
>> Another option, eg:
>> ADDS.L Rt, (Rb), Rn
>> Would be possible, but would be hindered in that it would (nearly
>> always) require a LEA (thus making it "kinda useless").
>>
>> ...
>>
>>
>> Either way, it doesn't seem likely that the added cost/complexity of a
>> longer pipeline would pay off. My guess is that one would end up paying
>> a lot more for the branch misses than one would save by shaving a few
>> cycles here and there on ALU ops.
>>
>> It could also be possible to try to shove it in as a special case
>> without making the pipeline longer, but (to have any hope of passing
>> timing) this would require restricting the range of operand sizes
>> (small integer types only, with a sign/zero extended result).
>>
>> ...
