Re: Squeezing Those Bits: Concertina II

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Sat, 5 Jun 2021 13:55:26 -0500
Organization: A noiseless patient Spider
Lines: 284
Message-ID: <s9ghd6$uuc$1@dont-email.me>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com>
<51734e5c-3a02-4079-a178-f7f46c442504n@googlegroups.com>
<4fb02966-46dc-4218-a26b-836ac68ecbb3n@googlegroups.com>
<ad2a41ce-c25e-4f84-b77c-bea8550f3b7bn@googlegroups.com>
<7d9b1862-5d8d-4b07-8c13-9f1caef37cden@googlegroups.com>
<s9bga3$ljs$1@dont-email.me> <2021Jun4.104421@mips.complang.tuwien.ac.at>
<5c59c673-da7b-49ad-9877-cf3a2ec313c4n@googlegroups.com>
<s9ea03$30n$1@dont-email.me> <2021Jun5.160330@mips.complang.tuwien.ac.at>
In-Reply-To: <2021Jun5.160330@mips.complang.tuwien.ac.at>

On 6/5/2021 9:03 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> In my experience with them, at similar clock speeds, the original Atom
>> gets beaten pretty hard by ARM32.
>
> What do you mean with "ARM32"? If you mean the 32-bit ARM
> architecture, there are many different cores that implement this
> architecture. For the LaTeX benchmark I have:
>
> run time (s)
> - Raspberry Pi 3 (1.2GHz Cortex-A53), Raspbian 8:         5.46
> - OMAP4 Panda board ES (1.2GHz Cortex-A9), Ubuntu 12.04:  2.984
> - Intel Atom 330 (1.6GHz, 512K L2), Zotac ION A, Knoppix 6.1 32bit: 2.323
>

I had a machine with an Atom N270 (an ASUS Eee running Linux); its
performance kinda sucked vs a RasPi2 (900MHz Cortex-A53) running
32-bit Raspbian in my tests.

IIRC, at the time I was mostly testing it with color-cell based video
codecs and similar.

These mostly use some arithmetic to calculate color endpoints, then
typically fill blocks of memory with pixel values using 1- or 2-bit
selectors. These can be implemented either with a small lookup table
or with "c?x:y" (both forms shown below).

Eg:
  ct0=dest;         //row 0 of the block
  ct1=dest+stride;  //row 1
  ...
  ct0[0]=(bpx&0x0001)?clra:clrb;  //bpx: one selector bit per pixel
  ct0[1]=(bpx&0x0002)?clra:clrb;
  ct0[2]=(bpx&0x0004)?clra:clrb;
  ct0[3]=(bpx&0x0008)?clra:clrb;
  ct1[0]=(bpx&0x0010)?clra:clrb;
  ct1[1]=(bpx&0x0020)?clra:clrb;
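
For comparison, the lookup-table variant might look something like
this (a sketch, reusing the names above and assuming 32-bit pixels):

  uint32_t tab[2];
  tab[0]=clrb;  //selector bit clear
  tab[1]=clra;  //selector bit set
  ct0[0]=tab[(bpx>>0)&1];
  ct0[1]=tab[(bpx>>1)&1];
  ct0[2]=tab[(bpx>>2)&1];
  ct0[3]=tab[(bpx>>3)&1];
  ct1[0]=tab[(bpx>>4)&1];
  ct1[1]=tab[(bpx>>5)&1];
  ...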

> For Gforth I have, e.g.:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.410 0.520 0.260 0.635 0.280 Exynos 4 (Cortex A9) 1.6GHz; gcc-4.8.x
> 0.600 0.650 0.310 0.870 0.450 Odroid C2 Cortex A53 32b 1536MHz, gcc 5.3.1
> 0.390 0.490 0.270 0.520 0.260 Odroid C2 Cortex A53 64b 1536MHz, gcc 5.3.1
>
> So, yes, OoO 32-bit ARMs like the Cortex-A9 have better
> performance/clock than Bonnell, but the Cortex-A53 in 32-bit mode not
> so much. Which is quite surprising, because I would expect a RISC to
> suffer less from the in-order implementation than a CISC. This
> expected advantage is realized in the 64b-A53 result.
>

OK.

>> From what I can tell, it appears that x86 benefits a lot more from OoO
>> than ARM did, and Aarch64 is still pretty solid even with in-order
>> implementations.
>
> OoO implementations are a lot faster on both architectures:
>
> LaTeX:
>
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>
> Gforth:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>

OK.

When I tested running interpreters, this was a case that somewhat
favored x86.

>> However, an OoO x86 machine does seem to be a lot more tolerant of
>> lackluster code generation than an in-order ARM machine (where, if the
>> generated code kinda sucks, its performance on an ARM machine also
>> sucks).
>
> Not sure what you mean with "lackluster code generation", but OoO of
> course deals better with code that has not been scheduled for in-order
> architectures. There is also the effect on OoO that instructions can
> often hide in the shadows of long dependency paths. But if you make
> the dependency path longer, you feel that at least as hard on an OoO
> CPU as on an in-order CPU.
>

I was thinking stuff like (pseudocode):
  Load A
  Load B
  Op C=A+B
  Load D
  Store E
  Move E=A
  Store C
  Move C=D
  Op F=C+E
  Store F
  ...

Or (closer to typical GCC "-O0" output):
  Load A
  Load B
  Op C=A+B
  Store C
  Load A
  Load B
  Op C=A-B
  Store C
  ...

This stuff does "sorta OK" on x86 machines (within ~ 3x of optimized
code), but poorly on ARM machines (~ 5x-7x slower than optimized).

In another past test, dynamically recompiling BJX2 code to 32-bit ARM
(on a RasPi3), the result was a fair bit slower than the FPGA
implementation (despite the RasPi3 having a fairly significant
clock-frequency advantage).

>> The x86 machine seems to just sort of take whatever garbage one
>> throws at it and makes it "sorta fast-ish" (even if it is basically just
>> a big mess of memory loads and stores with a bunch of hidden function
>> calls and similar thrown in).
>
> I don't know what you mean with "hidden function calls and similar",
> but the stuff about "a big mess of memory loads and stores" sounds
> like the ancient (and wrong) myth that loads and stores are free on
> IA-32 and AMD64.

They are not "free", but their performance impact is a lot less obvious.

For the hidden function calls: say, for a certain type, the operators
are implemented with function calls, such that:
  z=3*x+y;
compiles as if it were:
  t0=__convsixi(3);
  t1=__mulsxi(t0, x);
  t2=__addxi(t1, y);
  z=t2;
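
A rough sketch of what such runtime helpers could look like (the
helper names come from the example above; the 128-bit representation
and the implementations here are hypothetical):

  #include <stdint.h>

  typedef struct { uint64_t lo, hi; } xi128;

  xi128 __convsixi(int v)           //sign-extend int to 128 bits
  {
      xi128 r;
      r.lo = (uint64_t)(int64_t)v;
      r.hi = (v < 0) ? ~(uint64_t)0 : 0;
      return r;
  }

  xi128 __addxi(xi128 a, xi128 b)   //128-bit add
  {
      xi128 r;
      r.lo = a.lo + b.lo;
      r.hi = a.hi + b.hi + (r.lo < a.lo);  //carry out of the low half
      return r;
  }

  xi128 __mulsxi(xi128 a, xi128 b)  //low 128 bits of the product
  {
      //64x64->128 multiply of the low halves via 32-bit partials
      uint64_t a0 = a.lo & 0xFFFFFFFFu, a1 = a.lo >> 32;
      uint64_t b0 = b.lo & 0xFFFFFFFFu, b1 = b.lo >> 32;
      uint64_t p00 = a0*b0, p01 = a0*b1, p10 = a1*b0, p11 = a1*b1;
      uint64_t mid = (p00>>32) + (p01&0xFFFFFFFFu) + (p10&0xFFFFFFFFu);
      xi128 r;
      r.lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
      r.hi = p11 + (p01>>32) + (p10>>32) + (mid>>32)
           + a.lo*b.hi + a.hi*b.lo;  //cross terms; for the low 128
      return r;                      //bits, signed==unsigned multiply
  }

Every operator then costs a call/return plus argument shuffling,
rather than a single instruction.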

Likewise, for a simpler compiler, the logic implemented by the
code-generator might be fairly minimal, with the generated code
consisting almost entirely of function calls.

In this case, the code-gen mostly consists of logic for looking up
the name of the function to call.

In BGBCC, runtime calls are still used a lot (BJX2):
  Int (and smaller):
    Handled directly: +, -, *, &, |, ^, <<, >>
    Runtime call: /, % (*1)
  Long / Long Long:
    Handled directly: +, -, &, |, ^, <<, >>
    Runtime call: *, /, %
  Int128:
    Handled directly: &, |, ^
    Handled directly (ALUX): +, -, <<, >>
    Runtime call: *, /, %
    Runtime call (Non-ALUX): +, -, <<, >>
  Float / Double:
    Handled directly: +, -, *
    Runtime call: /
  Float128 / Long Double:
    Runtime call: Everything.
  Variant:
    Runtime call: Everything.
  Vector (vec3f / vec4f):
    Handled directly: +, -, * (pairwise)
    Runtime call: /, %, ^ (cross and dot product)
  ...

*1: Constant division may be implemented as ("(x*C)>>32") in some cases.
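
For example, unsigned division by 10 can be done with the usual
multiply-by-reciprocal trick, 0xCCCCCCCD being ceil(2^35/10) (a
sketch of the technique, not BGBCC's actual output):

  #include <stdint.h>

  static inline uint32_t div10(uint32_t x)
  {
      //high bits of the 64-bit product give the quotient
      return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
  }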

Type conversion paths also may or may not involve runtime calls.

The main part of the codegen doesn't necessarily know when function
calls may occur, so it is paranoid by default (it doesn't use any of
the scratch registers). There is logic that detects whether a
basic-block is "pure" (no function calls or complex operations), which
enables the register allocator to use scratch registers within that
block.
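
A hypothetical sketch of that purity check (BGBCC's actual IR and
flag names are not shown in the post):

  typedef enum { OP_ADD, OP_LOAD, OP_STORE, OP_CALL, OP_DIV } OpKind;

  typedef struct Instr {
      OpKind kind;
      int may_call_runtime;   //eg: divide lowered to a runtime helper
      struct Instr *next;
  } Instr;

  typedef struct { Instr *first; } BasicBlock;

  //A block is "pure" if nothing in it can turn into a call; only then
  //may the register allocator hand out the scratch registers.
  int block_is_pure(const BasicBlock *bb)
  {
      const Instr *op;
      for (op = bb->first; op; op = op->next)
          if (op->kind == OP_CALL || op->may_call_runtime)
              return 0;
      return 1;
  }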

Contrast with the SH and BJX1 backends:
  Int (and smaller):
    Handled directly: +, -, &, |, ^
    Runtime call: *, /, %, <<, >> (*2)
  Long Long:
    Runtime call: Everything.
  ...

*2: This was because SH kinda sucked in some ways.
Some of the SH variants also used shift-slides (sequences of
fixed-amount shift instructions), so being able to do a shift directly
was more of a special case. Many of the FPU operations were also
implemented via function calls (or pretty much the entire FPU for
SoftFP cases).

Likewise, for a local array:
  int a[256];  //allocated directly
but:
  int a[4099]; //runtime call (__alloca)
  //(__alloca is in turn built on top of "malloc()")
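
A minimal sketch of how a malloc-backed __alloca can work (the actual
BGBCC runtime isn't shown in the post; names here are hypothetical):

  #include <stdlib.h>

  typedef struct AllocaHdr { struct AllocaHdr *next; } AllocaHdr;

  //Each oversized local gets a heap block, chained into a per-frame
  //list (error handling omitted).
  void *__alloca_impl(AllocaHdr **frame, size_t sz)
  {
      AllocaHdr *h = malloc(sizeof(AllocaHdr) + sz);
      h->next = *frame;
      *frame = h;
      return h + 1;       //user data follows the header
  }

  //Called from the function epilog to release the frame's whole list.
  void __alloca_release(AllocaHdr **frame)
  {
      while (*frame) {
          AllocaHdr *h = *frame;
          *frame = h->next;
          free(h);
      }
  }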

Likewise:
  struct bigstruct_s {
      int arr[1999];
  };

  struct bigstruct_s a, b;  //implicit __alloca calls
  ...
  b=a;  //implicitly calls memcpy()

The prolog and epilog compression may or may not be counted here; in
this case, the called functions are generated by the compiler. This
was done even for performance-optimized code, as it tended to save
more (via fewer I$ misses) than the extra branch instructions and
call/return overhead cost.

The presence of arrays or similar on the stack may also trigger the
addition of "security tokens" and similar to try to detect the stack
being trashed by buffer overruns (an idea kinda borrowed from MSVC).
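
A sketch of the idea (names and the fixed key are hypothetical; the
post doesn't show BGBCC's actual scheme, and real frame layout is up
to the compiler rather than expressible in portable C):

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  static uintptr_t __stack_token_key = 0xC0DE5AFEu;  //would be random

  void copy_input(const char *s)
  {
      uintptr_t token = __stack_token_key;  //prolog: plant token
      char buf[64];

      strcpy(buf, s);               //an overrun of buf clobbers the
                                    //token before the return address

      if (token != __stack_token_key)  //epilog: check before return
          abort();                     //stack corruption detected
  }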

Very little of the C library is handled with builtins; the main
exception is "memcpy" and similar, which may be handled as a
special case (if the size is a small constant value, it may be
transformed into memory loads and stores).
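
As a sketch of that special case (hypothetical function, behavior as
described above):

  #include <string.h>

  void copy8(void *dst, const void *src)
  {
      //size is a small constant, so instead of a call this may be
      //lowered to a single 8-byte load/store pair, roughly:
      //  *(uint64_t *)dst = *(const uint64_t *)src;
      memcpy(dst, src, 8);
  }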

Note that some of the C library functions were rewritten to be a
little more efficient than they were originally.

Eg, the C library I am using originally did strcmp() kinda like:
  while(*srca && *srca++==*srcb++);

And memcpy like:
  while(n--)*dst++=*src++;

Working one byte at a time, which was not exactly efficient.
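
A rough sketch of the sort of rewrite involved, working a word at a
time when alignment allows (a sketch of the general technique; the
actual library version isn't shown in the post):

  #include <stddef.h>
  #include <stdint.h>

  void *memcpy_fast(void *dst, const void *src, size_t n)
  {
      unsigned char *d = dst;
      const unsigned char *s = src;

      //copy 8 bytes at a time while both pointers are 8-byte aligned
      if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
          while (n >= 8) {
              *(uint64_t *)d = *(const uint64_t *)s;
              d += 8; s += 8; n -= 8;
          }
      }
      while (n--)          //byte tail (and unaligned fallback)
          *d++ = *s++;
      return dst;
  }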

Well, among other things...

> I actually have a nice benchmark for that:
>
> The difference between gforth and gforth-fast --ss-number=0 is that
> gforth has some extra loads and stores (it stores and loads the
> top-of-stack all the time, and it stores the Forth instruction pointer
> all the time; let's see how they perform on a Skylake (i5 6600K):
>
> sieve bubble matrix fib fft
> 0.080 0.108 0.044 0.080 0.028 gforth-fast --ss-number=0
> 0.128 0.208 0.084 0.140 0.056 gforth
>
> One might think that the Zen3 (Ryzen 7 5800X) with its improved
> store-to-load forwarding is more tolerant of the extra loads and
> stores of gforth, but there is still a lot of difference:
>
> sieve bubble matrix fib fft
> 0.079 0.062 0.034 0.053 0.022 Zen3 gforth-fast --ss-number=0
> 0.102 0.135 0.053 0.161 0.056 Zen3 gforth
>

OK.
