

devel / comp.arch / Re: Squeezing Those Bits: Concertina II

Re: Squeezing Those Bits: Concertina II

Message-ID: <IpNuI.714160$nn2.306746@fx48.iad>
https://www.novabbs.com/devel/article-flat.php?id=17440&group=comp.arch#17440
Newsgroups: comp.arch
From: ThatWoul...@thevillage.com (EricP)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <b338ea39-397f-43c2-828a-14161e0964fan@googlegroups.com>
 by: EricP - Sat, 5 Jun 2021 16:14 UTC

MitchAlsup wrote:
> On Friday, June 4, 2021 at 8:36:40 AM UTC-5, EricP wrote:
>> Anton Ertl wrote:
>>> Stefan Monnier <mon...@iro.umontreal.ca> writes:
>>>>> With a 20 gate per cycle design point, one can build a 6-wide reservation
>>>>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
>>>>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
>>>>> pipeline.
>>>> If we count 5-gates of delay for the clock-boundary's flip-flop, that
>>>> means:
>>>>
>>>> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
>>>>
>>>>> At 16 cycles this necessarily becomes 9-10 stages.
>>>>> At 12 gates this necessarily becomes 12-15 stages.
>>>> And that gives:
>>>>
>>>> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
>>>> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>>>>
>>>> So at least in terms of the latency of a single instruction going
>>>> through the whole pipeline, the gain of targetting a lower-clocked
>>>> design seems clear ;-)
>>> But that's not particularly relevant. You want to minimize the total
>>> execution time of a program; and, with a few exceptions (e.g., PAUSE),
>>> one instruction does not wait until the previous instruction has left
>>> the pipeline; if it did, there would be no point in pipelining.
>>>
>>> Instead, a data-flow instruction waits until its operands are
>>> available (and the functional unit is available). For simple ALU
>>> operations, this typically takes 1 cycle (exceptions:
>>> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
>>> made deep pipelines a win, until CPUs ran into power limits ~2005.
>>>
>>> - anton
> <
>> The relevance of latency comes in, I think, when one considers the effect
>> of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
>> injects a bubble whose size is independent of the number of stages.
> <
> Make that number of stages TIMES width of execution.
> {and yes, I saw that you wrote that in negative context}
>> If we go from 6 stages, 20+5 gates to 12 stages, 12+5 gates
>> we increase the clock by a factor of (20+5)/(12+5) = 1.47.
>> But a bubble now takes 2x as many clocks to recover from.
> <
> And whereas branch predictors continue to get better, L1 cache hit
> ratios are essentially frozen by size and sets. So what started out as
> branch prediction limited (1990) ends up as L2 latency limited (2000+).
>> Also adding pipeline stages doesn't change the speed of data cache.
> <
> What changes the throughput of the data cache is ports. If you can
> perform 4 accesses per cycle to a 4-way banked cache, throughput
> goes way up and you quickly realize that you have to adequately port
> the L2 similarly. You want simultaneous misses in L1 to be handled
> simultaneously in the L2 !!

There are two design dimensions at play here.
The original discussion was the degree to which adding stages to
an in-order pipeline would improve performance. Another dimension
is making stages wider by carrying multiple uOps per stage packet.

Multiple cache ports might benefit wider stages, but I wonder how much.
To utilize the multiple ports, the packet must contain multiple load ops
(stores that cache miss can be saved in cache MSHR buffers if necessary).
It could allow some optimizations, like load prefetching
under miss for multiple load uOps, or early address translation.
However, for a packet to move on from the load stage,
all the load uOps in it must have finished.

Wide stages are orthogonal to the original question of the
beneficial effects of adding more stages.
I don't see that multiple cache ports could be utilized by adding
more stages (I suppose it allows overlap of translate and load
if two memory uOps are sequential, but that's really it).

> <
> We worried a lot about the number of wires in the 1990s, but even GPUs
> get 10 layers of metal, and IBM is using 17 layers in its modern mainframes.
> With this wire resource, there is little reason NOT to bank the cache hierarchy.
> <
> {Aside: many GPUs run 1024 wires from and another 1024 wires to the L1 cache
> (nor including the addresses and control)} And these busses pass data back
> and forth in the same "beat" structure as the SIMT calculation beat structure.}
> <
>> Adding pipeline stages to increase the frequency means we somewhat
>> decrease the unused cache access time between loads and stores.
>> However if the D$ cache access saturates, stages should have minimal impact.
> <
> Add ports and AGEN width to eliminate saturation.

Yes, but wouldn't adding full cache ports be prohibitively expensive,
requiring a whole extra decoder, word lines, bit lines, and sense amps?
It's one thing for a register file, but for a large-ish cache?

>> It also depends on how one measures performance.
> <
> There is only one sane metric here: wall clock time for entire application.
> <

Which is why one should not assume that adding more stages to increase
the clock frequency will necessarily decrease wall clock time.

>> More stages means higher frequency means higher potential issued MIPS.
>> If instead we count retired MIPS, to take into account bubbles and
> <
> Only compiler and CPU architects should be able to see unretired statistics.

I said this because the discussion at that point seemed to assume
that adding pipeline stages to increase the clock frequency by, say,
a factor of 1.47 would increase pipeline output by 1.47.
Increasing the frequency potentially allows instructions to be stuffed
into the pipeline faster, but other considerations limit
the actual gain to less than 1.47.

Re: Squeezing Those Bits: Concertina II

Message-ID: <s9ghd6$uuc$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=17445&group=comp.arch#17445
Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <2021Jun5.160330@mips.complang.tuwien.ac.at>
 by: BGB - Sat, 5 Jun 2021 18:55 UTC

On 6/5/2021 9:03 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> In my experience with them, at similar clock speeds, the original Atom
>> gets beaten pretty hard by ARM32.
>
> What do you mean with "ARM32"? If you mean the 32-bit ARM
> architecture, there are many different cores that implement this
> architecture. For the LateX benchmark I have:
>
> run time (s)
> - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
> - OMAP4 Panda board ES (1.2GHz Cortex-A9) Ubuntu 12.04 2.984
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
>

I had a machine with an Atom N270 (an ASUS Eee running Linux); its
performance kinda sucked vs. a RasPi2 (900MHz Cortex-A53) running
32-bit Raspbian in my tests.

IIRC, at the time I was mostly testing it with color-cell based video
codecs and similar.

These mostly use some arithmetic to calculate color endpoints, then
typically fill blocks of memory with pixel values using 1 or 2 bit
selectors. These can be implemented either with a small lookup table, or
with "c?x:y".

Eg:
ct0=dest;
ct1=dest+stride;
...
ct0[0]=(bpx&0x0001)?clra:clrb;
ct0[1]=(bpx&0x0002)?clra:clrb;
ct0[2]=(bpx&0x0004)?clra:clrb;
ct0[3]=(bpx&0x0008)?clra:clrb;
ct1[0]=(bpx&0x0010)?clra:clrb;
ct1[1]=(bpx&0x0020)?clra:clrb;
...

> For Gforth I have, e.g.:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.410 0.520 0.260 0.635 0.280 Exynos 4 (Cortex A9) 1.6GHz; gcc-4.8.x
> 0.600 0.650 0.310 0.870 0.450 Odroid C2 Cortex A53 32b 1536MHz, gcc 5.3.1
> 0.390 0.490 0.270 0.520 0.260 Odroid C2 Cortex A53 64b 1536MHz, gcc 5.3.1
>
> So, yes, OoO 32-bit ARMs like the Cortex-A9 have better
> performance/clock than Bonnell, but the Cortex-A53 in 32b-moe not su
> much. Which is quite surprising, because I would expect a RISC to
> suffer less from the in-order implementation than a CISC. This
> expected advantage is realized in the 64b-A53 result.
>

OK.

>> From what I can tell, it appears that x86 benefits a lot more from OoO
>> than ARM did, and Aarch64 is still pretty solid even with in-order
>> implementations.
>
> OoO implementations are a lot faster on both architectures:
>
> LaTeX:
>
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>
> Gforth:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>

OK.

When I tested with running interpreters, this was a case that somewhat
favored x86.

>> However, an OoO x86 machine does seem to be a lot more tolerant of
>> lackluster code generation than an in-order ARM machine (where, if the
>> generated code kinda sucks, its performance on an ARM machine also
>> sucks).
>
> Not sure what you mean with "lackluster code generation", but OoO of
> course deals better with code that has not been scheduled for in-order
> architectures. There is also the effect on OoO that instructions can
> often hide in the shadows of long dependency paths. But if you make
> the dependency path longer, you feel that at least as hard on an OoO
> CPU than on an in-order CPU.
>

I was thinking stuff like (pseudocode):
Load A
Load B
Op C=A+B
Load D
Store E
Move E=A
Store C
Move C=D
Op F=C+E
Store F
...

Or (closer to typical GCC "-O0" output):
Load A
Load B
Op C=A+B
Store C
Load A
Load B
Op C=A-B
Store C
...

This stuff does "sorta OK" on x86 machines (within ~ 3x of optimized
code), but poorly on ARM machines (~ 5x-7x slower than optimized).

In another past test trying to do dynamic recompilation of BJX2 code to
32-bit ARM (on a RasPi3), it was a fair bit slower than the FPGA
implementation (despite the RasPi3 having a fairly significant
clock-frequency advantage).

>> The x86 machine seems to just sort of take whatever garbage one
>> throws at it and makes it "sorta fast-ish" (even if it is basically just
>> a big mess of memory loads and stores with a bunch of hidden function
>> calls and similar thrown in).
>
> I don't know what you mean with "hidden function calls and similar",
> but the stuff about "a big mess of memory loads and stores" sounds
> like the ancient (and wrong) myth that loads and stores are free on
> IA-32 and AMD64.

They are not "free", but their performance impact is a lot less obvious.

For the hidden function calls, say, assume that for a certain type,
operators are implemented with function calls, such that:
z=3*x+y;
Compiles as if it were, say:
t0=__convsixi(3);
t1=__mulsxi(t0, x);
t2=__addxi(t1, y);
z=t2;

Likewise, for a simpler compiler, the logic implemented by the
code-generator might be fairly minimal, with the generated code
consisting almost entirely of function calls.

In this case, the code-gen mostly consists of logic for looking up the
name of which function to call.

In BGBCC, runtime calls are still used a lot (BJX2):
  Int (and smaller):
    Handled directly: +, -, *, &, |, ^, <<, >>
    Runtime call: /, % (*1)
  Long / Long Long:
    Handled directly: +, -, &, |, ^, <<, >>
    Runtime call: *, /, %
  Int128:
    Handled directly: &, |, ^
    Handled directly (ALUX): +, -, <<, >>
    Runtime call: *, /, %
    Runtime call (Non-ALUX): +, -, <<, >>
  Float / Double:
    Handled directly: +, -, *
    Runtime call: /
  Float128 / Long Double:
    Runtime call: Everything.
  Variant:
    Runtime call: Everything.
  Vector (vec3f / vec4f):
    Handled directly: +, -, * (pairwise)
    Runtime call: /, %, ^ (cross and dot product)
  ...

*1: Constant division may be implemented as ("(x*C)>>32") in some cases.

Type conversion paths also may or may not involve runtime calls.

The main part of the codegen doesn't necessarily know when function
calls may occur, so is paranoid by default (doesn't use any of the
scratch registers). There is logic that detects if a basic-block is
"pure" (no function calls or complex operations), which enables the use
of scratch registers (by the register allocator) within this block.

Contrast with the SH and BJX1 backends:
  Int (and smaller):
    Handled directly: +, -, &, |, ^
    Runtime call: *, /, %, <<, >> (*2)
  Long Long:
    Runtime call: Everything.
  ...

*2: This was because SH kinda sucked in some ways.
Some of the SH variants also used shift-slides, so being able to do a
shift directly was more of a special case. Much of the FPU operations
were also implemented via function calls (or pretty much the entire FPU
for SoftFP cases).

Likewise, for a local array:
  int a[256];  //allocated directly
But:
  int a[4099]; //runtime call (__alloca)
  //__alloca is in turn built on top of "malloc()".

Likewise:
struct bigstruct_s {
int arr[1999];
};

struct bigstruct_s a, b; //implicit __alloca calls
...
b=a; //implicitly calls memcpy()

Prolog and epilog compression may or may not be counted here; the
called prolog/epilog functions are generated by the compiler. This was
done even for performance-optimized code, since the savings from fewer
I$ misses tended to outweigh the cost of the extra branch instructions
and call/return overhead.

The presence of arrays or similar on the stack may also involve the
addition of "security tokens" to try to detect stack smashing due to
buffer overruns (an idea kinda borrowed from MSVC).

Very little of the C library is handled with builtins, with the main
exception of "memcpy" and similar potentially being handled as a
special-case (if the size is a small constant value, it may be
transformed into memory loads and stores).

Note that some of the C library functions were rewritten to be a little
more efficient than they were originally.

E.g., the C library I am using originally implemented strcmp() kinda like:
while(*srca && *srca++==*srcb++);


Re: Squeezing Those Bits: Concertina II

Message-ID: <s9ghts$10i5$1@gioia.aioe.org>
https://www.novabbs.com/devel/article-flat.php?id=17446&group=comp.arch#17446
Newsgroups: comp.arch
From: terje.ma...@tmsw.no (Terje Mathisen)
Subject: Re: Squeezing Those Bits: Concertina II
 by: Terje Mathisen - Sat, 5 Jun 2021 19:05 UTC

Anssi Saari wrote:
> Quadibloc <jsavard@ecn.ab.ca> writes:
>
>> On Friday, June 4, 2021 at 3:07:37 AM UTC-6, Anton Ertl wrote:
>>
>>> Apple uses OoO for both their big cores and their little cores.
>>
>> And, indeed, while Intel's original small Atom cores were in-order,
>> they eventually switched over to even giving those a simple
>> out-of-order capability, since transistor densities had increased,
>> and the original Atom cores were percieved as having very poor
>> performance.
>
> I'm actually retiring an old Atom system. D510 CPU, Bonnell uarch, 45
> nm, dual cores, 1.67 GHz. Early last decade these sold for $60 and that
> included a motherboard.
>
> It has served as a little file server and for that it's fine. But things
> like a web browser, even starting one let alone trying to render any
> pages is pretty frustrating. Any crypto likewise. Even a remote desktop
> thing like x2go is bogged down when starting up, that's apparently
> because some parts of it are shell scripts or Perl.
>
> I replaced it with the cheapest recent Intel CPU thing I could find, a
> Celeron G5900 (Comet Lake, 14 nm, dual cores, 3.4 GHz). It runs rings
> around the old Atom.
>
>> And yet people didn't complain about the performance of the
>> 486 DX. So I would be inclined to blame software bloat.
>
> I don't know, I seem to recall decoding and showing jpegs was pretty
> slow on a 486. MP3 audio decoding in software took a Pentium or at least
> a fairly fast 486 and highly optimized software. Crappy MPEG-1 video
> needed a hardware decoder card... Word for Windows 2.0 ran fine.
>
I helped optimize the MMX asm in Zoran's SoftDVD, which was the first
pure-software DVD player that could handle 30 fps interlaced with zero
drops on a 200 MHz Pentium MMX. That CPU was effectively 6-12x faster
than the classic 33 MHz 486, but something like Blink video could
probably run well even on a 486.

Decoding MPEG-1 in software was far easier than the DVD MPEG-2 formats.
Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Squeezing Those Bits: Concertina II

Message-ID: <49a5427d-1d8a-4c8a-8b1e-30796df3da76n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=17449&group=comp.arch#17449
Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <IpNuI.714160$nn2.306746@fx48.iad>
 by: MitchAlsup - Sat, 5 Jun 2021 19:20 UTC

On Saturday, June 5, 2021 at 11:14:35 AM UTC-5, EricP wrote:
> [...]
>
> There are two design dimensions at play here.
> The original discussion was the degree to which adding stages to
> an in-order pipeline would improve performance. Another dimension
> is making stages wider by carrying multiple uOps per stage packet.
>
> Multiple cache ports might benefit with wider stages but I wonder how much.
<
1/3rd of all RISC instructions are LD/ST (not counting LD constant).
So a 3-wide machine can get by with 1 port, but a 4-wide machine needs 2 ports.
However, in x86 land, LD-Ops increase the mem-refs to 50%, and 2 ports
are desirable.
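The port arithmetic above can be sanity-checked with a trivial sketch (the mem-ref fractions are the thread's rules of thumb, not measurements):

```python
def ports_needed(width, mem_num, mem_den):
    """Cache ports needed to sustain `width` ops/cycle when mem_num/mem_den
    of the instruction mix is loads/stores: ceil(width * mem_num / mem_den),
    computed in exact integer arithmetic."""
    return -(-(width * mem_num) // mem_den)

print(ports_needed(3, 1, 3))  # -> 1: one port covers a 3-wide RISC mix
print(ports_needed(4, 1, 3))  # -> 2: a 4-wide machine needs two
print(ports_needed(4, 1, 2))  # -> 2: x86 LD-Ops push mem-refs to ~50%
```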
<
> To utilize the multiple ports the packet must contain multiple load ops
> (stores that cache miss can be saved in cache MSHR buffers if necessary).
> It could allow some optimizations, like load prefetching
> under miss for multiple load uOps, early address translation.
> However for a packet to move on from the load stage,
> all the load uOps in it must be finished.
>
> Wide stages are orthogonal to the original question of the
> beneficial effects of adding more stages.
> I don't see that multiple cache ports could be utilized by adding
> more stages (I suppose it allows overlap of translate and load
> if two memory uOps are sequential, but that's really it).
>
> > <
> > We worried a lot about the number of wires in the 1990s, but even GPUs
> > get 10 layers of metal, and IBM is using 17 layers in its modern mainframes.
> > With this wire resource, there is little reason NOT to bank the cache hierarchy.
> > <
> > {Aside: many GPUs run 1024 wires from and another 1024 wires to the L1 cache
> > (not including the addresses and control). And these busses pass data back
> > and forth in the same "beat" structure as the SIMT calculation beat structure.}
> > <
> >> Adding pipeline stages to increase the frequency means we somewhat
> >> decrease the unused cache access time between loads and stores.
> >> However if the D$ cache access saturates, stages should have minimal impact.
> > <
> > Add ports and AGEN width to eliminate saturation.
>
> Yes, but wouldn't additional full cache ports be prohibitively expensive,
> requiring a whole extra decoder, word lines, bit lines, sense amps?
> It's one thing for a register file, but for a large-ish cache?
<
Block-partition the data section (block[0]..block[k]; k=2^n), then use a bit of address
comparison to see how to amortize the Tag array and TLB. Back this up with a fully
associative, temporally organized L0 to clean up the common references, and you get
good performance on a 6-wide machine.
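A minimal sketch of the bank-selection idea (line size and bank count are illustrative assumptions, not My 66000 parameters): accesses to distinct banks proceed in parallel, and the most-contended bank sets the cycle count.

```python
LINE_BITS = 6    # 64-byte cache lines (assumption)
BANKS = 4        # 4-way banked data section (assumption)

def bank_of(addr):
    """Bank index taken from the low bits of the line address."""
    return (addr >> LINE_BITS) & (BANKS - 1)

def cycles_for(addrs):
    """Accesses that collide in a bank must serialize; the cycle count
    is the occupancy of the most-contended bank."""
    counts = {}
    for a in addrs:
        b = bank_of(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

print(cycles_for([0x000, 0x040, 0x080, 0x0C0]))  # -> 1: all four banks hit
print(cycles_for([0x000, 0x100, 0x080, 0x0C0]))  # -> 2: two refs collide in bank 0
```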
<
>
> >> It also depends on how one measures performance.
> > <
> > There is only one sane metric here: wall clock time for entire application.
> > <
>
> Which is why one should not assume that adding more stages to increase
> the clock frequency will necessarily decrease wall clock time.
<
The balance point is a lot more delicate than many people realize.
>
> >> More stages means higher frequency means higher potential issued MIPS.
> >> If instead we count retired MIPS, to take into account bubbles and
> > <
> > Only compiler and CPU architects should be able to see unretired statistics.
>
> I said this because the discussion at that point seemed to assume
> that adding pipeline stages to increase the clock frequency by, say,
> a factor of 1.47 would increase pipeline output by 1.47.
<
This is Mitch's first law: whatever first-order performance benefit you think
feature x will bring, it will actually bring no more than SQRT(x). First recognized
around 1988.
<
> Increasing the frequency potentially allows instructions to be stuffed
> into the pipeline faster. There are other considerations that limit
> the actual gains to < 1.47.
<
My guess is 1.21×
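The 1.21× estimate is just the law above applied to the 1.47× clock gain computed upthread; a quick check (a sketch, nothing more):

```python
import math

clock_gain = (20 + 5) / (12 + 5)   # gate-delay ratio from upthread: ~1.47x
realized = math.sqrt(clock_gain)   # Mitch's first law: realized gain ~ sqrt(x)
print(f"{clock_gain:.2f}x clock -> ~{realized:.2f}x realized")
```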

Re: Squeezing Those Bits: Concertina II

<631c11d8-566e-4aa4-98b6-f437ce98e0cbn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17475&group=comp.arch#17475

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 08:16:47 -0700 (PDT)
Message-ID: <631c11d8-566e-4aa4-98b6-f437ce98e0cbn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Sun, 6 Jun 2021 15:16 UTC

On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:

> Taking the light weight nature of VLIW and crushing it with added baggage.

As I've noted, just because the ISA spec includes _both_ VLIW _and_
added baggage, that doesn't mean that implementations are required to
include both; they can omit either one. So the architecture can have two
subsets that make sense, even if the whole thing doesn't.

However, on the page

http://www.quadibloc.com/arch/cp0102.htm

I have now begun sketching out the instruction formats for the added
baggage.

John Savard

Re: Squeezing Those Bits: Concertina II

<BCavI.41354$gZ.37433@fx44.iad>


https://www.novabbs.com/devel/article-flat.php?id=17479&group=comp.arch#17479

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Message-ID: <BCavI.41354$gZ.37433@fx44.iad>
Date: Sun, 06 Jun 2021 16:51:46 -0400
 by: EricP - Sun, 6 Jun 2021 20:51 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
>>> Instead, a data-flow instruction waits until its operands are
>>> available (and the functional unit is available). For simple ALU
>>> operations, this typically takes 1 cycle (exceptions:
>>> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
>>> made deep pipelines a win, until CPUs ran into power limits ~2005.
>>>
>>> - anton
>> The relevance of latency comes in, I think, when one considers the effect
>> of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
>> injects a bubble whose size is independent of the number of stages.
>
> A branch mispredict is (in the best case) feedback from the stage that
> recognizes the misprediction to the instruction fetch stage. Here the
> latency (in cycles and in ns) becomes longer with more pipeline
> stages. Fortunately branch mispredictions are rare.
>
> Caches these days seem to be clocked and pipelined, allowing a request
> per cycle or so (more for L1), with the shared L3 having its own
> clock. So maybe the latency also increases with the pipelining
> overhead. This could explain why Apple can access 128KB in the same
> ~1ns that Intel needs for accessing 48KB: Apple only divides the 1ns
> into 3 cycles, Intel into 5.
>
>> It also depends on how one measures performance.
>> More stages means higher frequency means higher potential issued MIPS.
>> If instead we count retired MIPS, to take into account bubbles and
>> any back pressure (stall) effects of D$ cache access,
>> I would expect to see much less actual benefit.
>
> Of course you measure the time to complete the program. Given the
> quality of branch prediction in the early 2000s, 52 stages seemed to
> be the optimal pipeline depth for the Pentium 4 [sprangle&carmean02],
> and both Intel (Tejas) and AMD (Mitch Alsup's K9) were on that path,
> until both canceled the projects in 2005. My guess is that they were
> both betting on a cooling technology that evaporated in 2005.
>
> Since then the sweet spot seems to have been the 14-19 stages or so
> that Intel and AMD have been using (wikichip claims 19 stages for
> Zen-Zen3 and 14-19 for Skylake and Ice Lake). But Apple's A14 shows
> us that you can do lower-clocked (and likely shorter-pipeline) cores
> that have so much more IPC that they have competitive performance.
> Makes me wonder whether there is an even sweeter spot in between.

I was referring to in-order pipelines because the analysis is simpler.
Pipelining in OoO modules like rename, scheduling or issue brings
a whole extra level of complexity.

> @InProceedings{sprangle&carmean02,
> author = {Eric Sprangle and Doug Carmean},
> title = {Increasing Processor Performance by Implementing
> Deeper Pipelines},
> crossref = {isca02},
> pages = {25--34},
> url = {http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/deep-pipelines-isca02.pdf},
> annote = {This paper starts with the Willamette (Pentium~4)
> pipeline and discusses and evaluates changes to the
> pipeline length. In particular, it gives numbers on
> how lengthening various latencies would affect IPC;
> on a per-cycle basis the ALU latency is most
> important, then L1 cache, then L2 cache, then branch
> misprediction; however, the total effect of
> lengthening the pipeline to double the clock rate
> gives the reverse order (because branch
> misprediction gains more cycles than the other
> latencies). The paper reports 52 pipeline stages
> with 1.96 times the original clock rate as optimal
> for the Pentium~4 microarchitecture, resulting in a
> reduction of 1.45 of core time and an overall
> speedup of about 1.29 (including waiting for
> memory). Various other topics are discussed, such as
> nonlinear effects when introducing bypasses, and
> varying cache sizes. Recommended reading.}
> }

Thanks. I haven't seen this before, I'll have a look.

> @InProceedings{hrishikesh+02,
> author = {M. S. Hrishikesh and Norman P. Jouppi and Keith
> I. Farkas and Doug Burger and Stephen W. Keckler and
> Premkishore Shivakumar},
> title = {The Optimal Logic Depth per Pipeline Stage is 6 to 8
> FO4 Inverter Delays},
> crossref = {isca02},
> pages = {14--24},
> annote = {This paper takes a low-level simulator of the 21264,
> varies the number of pipeline stages, uses this to
> run a number of workloads (actually only traces from
> them), and reports performance results for
> them. With a latch overhead of about 2 FO4
> inverters, the optimal pipeline stage length is
> about 8 FO4 inverters (with work-load-dependent
> variations). Discusses various issues involved in
> quite some depth. In particular, this paper
> discusses how to pipeline the instruction window
> design (which has been identified as a bottleneck in
> earlier papers).}
> }
>
> @Proceedings{isca02,
> title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
> booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
> year = "2002",
> key = "ISCA 29",
> }
>
> - anton

Thanks. I had read the second paper long ago and just reread it.
It says exactly what I was saying and puts some numbers to it
with SPEC benchmarks running on an in-order Alpha.

"This in-order pipeline is similar to the Alpha 21264 pipeline except
that it issues instructions in-order. It has seven stages —
fetch, decode, issue, register read, execute, write back and commit.
The issue stage of the processor is capable of issuing up to four
instructions in each cycle. The execution stage consists of four
integer units and two floating-point units. All functional units
are fully pipelined, so new instructions can be assigned to them
at every clock cycle."

"In this experiment, when (stage gates) is reduced from 10(+2) to 6(+2)
FO4 the improvement in performance is only about 9% compared
to a clock frequency improvement of 50%."

Figure 4b shows the plot for the integer benchmarks as the stage size varies,
taking into account an extra latch overhead of 1.8 FO4 per stage.
It has a slight hump, peaking at 6 FO4 gates per stage,
but the curve is really not that pronounced.
The difference across all stage sizes from 2 to 16 FO4 gates
looks to be under 30% max.
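The frequency side of the quoted experiment is simple arithmetic over the latch overhead; a toy model (the ~9% performance figure comes from the paper's simulations and is not reproduced here):

```python
def freq_gain(new_logic, old_logic=10.0, latch=2.0):
    """Relative clock gain from shrinking per-stage logic depth,
    with cycle time = (logic + latch) in FO4 delays, per the
    "10(+2) to 6(+2)" figures quoted above."""
    return (old_logic + latch) / (new_logic + latch)

print(freq_gain(6))   # -> 1.5, the 50% frequency improvement quoted
```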

Re: Squeezing Those Bits: Concertina II

<d23521a2-8ea8-46f9-8675-a8b1b3239c75n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17486&group=comp.arch#17486

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 18:51:27 -0700 (PDT)
Message-ID: <d23521a2-8ea8-46f9-8675-a8b1b3239c75n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 01:51 UTC

On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>
> > Taking the light weight nature of VLIW and crushing it with added baggage.
>
> As I've noted, just because the ISA spec includes _both_ VLIW _and_
> added baggage, that doesn't mean that implementations are required to
> include both; they can omit either one. So the architecture can have two
> subsets that make sense, even if the whole thing doesn't.

Also, most of the time, the VLIW as designed will indeed be useless. Real-world
code hardly _ever_ has an ILP of 8.

And yet I've designed things so that not even a single gate delay is spent determining
the length of an instruction, because if I did that, the number of gate delays would be
multiplied by 8 over the length of a block.

What am I up to?

The VLIW is intended, on larger implementations at least, for _occasional_ use, in
specially crafted subroutines which are designed to have a high ILP while doing a
specialized task. Not with the expectation that it is capable of providing much
general assistance, even if in lightweight implementations it might _approach_ the
benefits of OoO. (It would do a better job of _that_ had I encumbered the architecture,
as I did in some earlier attempts, with banks of 128 registers instead of 32, but you've
noted that would create issues in accessing them - and I don't have the opcode space
any more, at least in regular 32-bit long instructions.)

John Savard

Re: Squeezing Those Bits: Concertina II

<15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17487&group=comp.arch#17487

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 19:00:39 -0700 (PDT)
Message-ID: <15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Mon, 7 Jun 2021 02:00 UTC

On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
> On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
> > On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
> >
> > > Taking the light weight nature of VLIW and crushing it with added baggage.
> >
> > As I've noted, just because the ISA spec includes _both_ VLIW _and_
> > added baggage, that doesn't mean that implementations are required to
> > include both; they can omit either one. So the architecture can have two
> > subsets that make sense, even if the whole thing doesn't.
> Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> code hardly _ever_ has an ILP of 8.
>
> And yet I've designed things so that not even a single gate delay is spent determining
> the length of an instruction, because if I did that, the number of gate delays would be
> multiplied by 8 over the length of a block.
<
log2(8) = 3.
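The tree Mitch alludes to can be illustrated with a Kogge-Stone-style parallel prefix over hypothetical per-slot instruction lengths: 3 combining levels replace an 8-long serial chain of length decodes.

```python
def prefix_offsets(lengths):
    """Start offsets of each slot via a Kogge-Stone parallel prefix:
    log2(N) combining levels instead of an N-long serial chain."""
    n = len(lengths)
    acc = list(lengths)
    d = 1
    levels = 0
    while d < n:
        # each level combines with the partial sum d positions back
        acc = [acc[i] + (acc[i - d] if i >= d else 0) for i in range(n)]
        d *= 2
        levels += 1
    # offsets are the exclusive prefix sums
    return [0] + acc[:-1], levels

# hypothetical slot lengths in bytes, purely for illustration
offs, levels = prefix_offsets([4, 2, 4, 4, 2, 2, 4, 4])
print(offs)    # [0, 4, 6, 10, 14, 16, 18, 22]
print(levels)  # 3 levels for an 8-slot block
```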
>
> What am I up to?
<
{with only mild sarcasm::}
It appears you are wandering around in the dark without a candle.
>
> The VLIW is intended, on larger implementations at least, for _occasional_ use, in
> specially crafted subroutines which are designed to have a high ILP while doing a
> specialized task. Not with the expectation that it is capable of providing much
> general assistance, even if in lightweight implementations it might _approach_ the
> benefits of OoO. (It would do a better job of _that_ had I encumbered the architecture,
> as I did in some earlier attempts, with banks of 128 registers instead of 32, but you've
> noted that would create issues in accessing them - and I don't have the opcode space
> any more, at least in regular 32-bit long instructions.)
<
The way I vectorize loops, an implementation that is 1-wide In-Order can run the same
code as optimally as a machine that is 16-wide Out-of-Order. The 1-wide machine can
get 3-4 IPC in the C string and Mem library subroutines compiled as they are, yet the
GBOoO machine can get 48-64 I/C from the same code.
<
So, yes, we both let the implementations choose for themselves the proper balance
of memory-to-calculation-to-branch; My 66000 can run these optimally from a single
code base compiled for the 1-wide machine.
<
Also note, by being able to get 3-4 IPC out of 1-wide implementations in the small loops
found "lots of places" in code, one gets a majority of the speed of GBOoO machines
in significantly smaller packages (cores) !!
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<189552d4-41c6-4239-b0dc-bef993ed2841n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17489&group=comp.arch#17489

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 20:33:51 -0700 (PDT)
Message-ID: <189552d4-41c6-4239-b0dc-bef993ed2841n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 03:33 UTC

On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:

> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

I certainly don't have anything approaching your level of expertise.

The main bit of 'progress' this design has achieved is: now that
having two 16-bit instructions starting with 0 in a 32-bit word
uses up 1/4 instead of 1/2 of the opcode space of 32-bit instructions,
all the 32-bit memory-reference instructions can use the standard
memory model and the standard set of base registers.

Given that GPU-based floating-point accelerators are a thing, and
the Cray-like NEC SX-9 connected its processors to memory via
a 16-channel wide bus... and AMD's EPYC and Threadripper Pro
processors have an 8-channel memory bus... I still believe that
Cray-style vector processing is worthwhile.

The advantage is that it's more general and flexible than any
GPU-based solution.

Other than that point, though, I'm not inclined to differ much
with your criticisms. I'm trying to attain high code density,
and the option of VLIW, and your eminently positive ideas
about immediates, in ways that are simple enough for me
to understand... so naturally they're clumsy.

Throwing in everything but the kitchen sink doesn't mean that
every implementation has to include it all. But if an implementor
does want a particular feature - a standardized opcode for it is
already defined. I think that can be helpful.

John Savard

Re: Squeezing Those Bits: Concertina II

<9GgvI.515493$J_5.262004@fx46.iad>


https://www.novabbs.com/devel/article-flat.php?id=17492&group=comp.arch#17492

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Message-ID: <9GgvI.515493$J_5.262004@fx46.iad>
Date: Sun, 06 Jun 2021 23:48:08 -0400
 by: EricP - Mon, 7 Jun 2021 03:48 UTC

MitchAlsup wrote:
> On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
>> What am I up to?
> <
> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

You are likely to be eaten by a grue.

Re: Squeezing Those Bits: Concertina II

<s9kb78$39d$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17495&group=comp.arch#17495

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Sun, 6 Jun 2021 22:35:35 -0700
Message-ID: <s9kb78$39d$1@dont-email.me>
 by: Ivan Godard - Mon, 7 Jun 2021 05:35 UTC

On 6/6/2021 6:51 PM, Quadibloc wrote:
> On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
>> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>>
>>> Taking the light weight nature of VLIW and crushing it with added baggage.
>>
>> As I've noted, just because the ISA spec includes _both_ VLIW _and_
>> added baggage, that doesn't mean that implementations are required to
>> include both; they can omit either one. So the architecture can have two
>> subsets that make sense, even if the whole thing doesn't.
>
> Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> code hardly _ever_ has an ILP of 8.

True in open code, false in pipelined loops.

Re: Squeezing Those Bits: Concertina II

<s9kmba$2b9$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17498&group=comp.arch#17498

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Mon, 7 Jun 2021 10:45:29 +0200
Organization: A noiseless patient Spider
Lines: 232
Message-ID: <s9kmba$2b9$1@dont-email.me>
In-Reply-To: <s9e5r2$6j3$1@dont-email.me>
 by: Marcus - Mon, 7 Jun 2021 08:45 UTC

On 2021-06-04, BGB wrote:
> On 6/4/2021 2:10 PM, Marcus wrote:
>> On 2021-06-04, BGB wrote:
>>>

[snip]

>>>
>>
>> I would guess that part of what you're measuring is the compiler
>> maturity level. On my in-order CPU that does not even have I$ nor D$
>> (but a shared single cycle 32-bit BRAM bus), I got something like
>> 0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
>> C library functions (memcpy etc). Before optimizing the libc routines
>> I got 0.5 DMIPS/MHz.
>>
>> With a proper I$ (that I'm currently working on) I expect to get closer
>> to 1 DMIPS/MHz.
>>

BTW, my point here is that Dhrystone is very much a compiler and libc
test, not just a CPU test. AFAICT it's perfectly OK to hand-optimize
libc functions (memcpy, memset, strcmp, ...), and using MRISC32 vector
operations in the hot libc functions helped to bump the Dhrystone score
from ~0.5 to ~0.7 IIRC.
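The sort of libc hand-optimization being described can be sketched in portable C. This word-at-a-time memset is a generic illustration only, not the actual MRISC32 library code (which additionally uses vector operations in the bulk loop):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic sketch of a hand-optimized memset: fill one aligned 32-bit
 * word per iteration instead of one byte.  Not the MRISC32 code, which
 * uses vector stores for the bulk loop. */
void *memset_wide(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint32_t pat = 0x01010101u * (unsigned char)c;  /* replicate the byte */

    /* byte-at-a-time until p is 4-byte aligned */
    while (n && ((uintptr_t)p & 3)) { *p++ = (unsigned char)c; n--; }
    /* bulk fill, one 32-bit store per iteration */
    for (; n >= 4; n -= 4, p += 4)
        memcpy(p, &pat, 4);
    /* trailing bytes */
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}
```

The byte-wise head and tail keep the stores aligned, so the bulk loop moves four times as much data per store as the naive loop.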

>
> In terms of stats at present:
>   I have an L1 I$ and D$ (both 16K, direct-mapped)
>     Memcpy (L1): 250MB/s
>     Memset (L1): 320MB/s
>   L2 Cache is 128K, 2-way set-associative
>     Memcpy (L2): ~ 50MB/s
>     Memset (L2): ~ 90MB/s
>   RAM (DDR2, 50MHz):
>     Memcpy: ~ 9MB/s
>     Memset: ~ 17MB/s
>
> Memory access, if properly pipelined, is 1 cycle for an L1 hit.
> It is 2 or 3 cycles if an interlock stall occurs (eg: trying to use a
> value directly following a load).
>
> There is also a branch-predictor and similar, ...

My machine is similar (except for the caches).

>
> My compiler does have a few weaknesses:
> It isn't really able to use VLIW capabilities effectively, so most of
> what it produces is scalar code;
> It isn't super great at avoiding things like needless MOV instructions
> or Load/Store ops;
> It always creates stack frames, even for trivial leaf functions (could
> be changed but would add a lot of complexity to the C compiler, and
> would require the codegen to first prove that the function doesn't
> contain any hidden function calls or similar);
> Doesn't perform inlining, at all;
> A certain subset of operators are implemented effectively using
> "call-threading" (eg, rather than the compiler doing it itself, it spits
> out hidden calls into the C runtime, *1);
> ...
>
> *1: This generally happens for operators which don't exist natively in
> the ISA and which can't be implemented effectively within a short
> instruction sequence. Things like integer divide, modulo, large
> multiply, etc, generally fall into this category. Large arrays, VLAs,
> large struct variables, copying or returning a struct by value, ..., may
> also generate hidden runtime calls. Some vector ops also involve runtime
> calls, and some extensions (such as the __variant type, __float128, ...)
> are implemented almost entirely via runtime calls.
>
>
> As noted, its code generation tends to go through a stack model, which
> in turn uses temporary variables.
>
> So:
>   z=3*x+y;
> Might be compiled as (pseudocode):
>   PUSH 3
>   LOAD x
>   BINOP '*'
>   LOAD y
>   BINOP '+'
>   STORE z
> Or, in something closer to its original notation:
>   3 $x * $y + =z
>
> Which might become, effectively (C-like pseudocode):
>   _t0_1i = 3;
>   _t1_1i = x;
>   _t0_2i = _t0_1i * _t1_1i;
>   _t1_2i = y;
>   _t0_3i = _t0_2i + _t1_2i;
>   z = _t0_3i;
> But, then "optimized" back to:
>   _t0_2i = x * 3;
>   z = _t0_2i + y;
>
> But, not always. If the types don't match exactly, there might be
> left-over type-conversion ops, say:
>   _t0_1 = (int)3;
>   _t1_1 = (int)x;
>
> Or, say, the values are computed as "int" but the destination is "long":
>   _t0_3i = _t0_2i + _t1_2i;
>   _t0_4l = (long)_t0_3i;
>   z = _t0_4l;
>
> These cases may prevent the forwarding, but don't otherwise actually
> change the value. In this case, these result in the occasional needless
> MOV or EXTS.L instruction (and also increases register pressure).
>
>
> The codegen backend then does more or less a direct translation of this
> into machine-code instructions, with a register allocator and similar
> which maps variables temporarily onto CPU registers (in which case they
> are loaded on demand from memory, and written back to memory at the end
> of the current basic block). The register allocator may also evict (and
> write back) values for registers if it needs to access another variable
> and no unassigned registers are left.
>
> A variable may be statically assigned to a CPU register in which case no
> memory write-back occurs, and the same variable maps to the same
> register throughout the entire function. This only works for local
> variables within a certain range of primitive types, and up to a certain
> maximum number of variables in any given function. If the "register"
> keyword is used, it adds a fairly big weight to the variable being
> picked for this.
>
>
> For better or for worse, it also translates fairly directly from the 3AC
> IR to machine-code, though it "would be better" had it first gone into a
> sort of "high-level ASM" which was then emitted as machine code. This
> would give more room for things like reorganizing instructions or
> performing peephole optimization (the instruction shuffling and bundling
> done by the WEXifier is actually done on the machine-code after it had
> already been emitted, which is kinda, not exactly the most ideal way to
> do this).
>
>
> Similarly, one doesn't always know for certain when a temporary value
> goes out of scope, or whether its value may reappear later. In these cases the
> value may end up written back to memory even if it isn't going to be
> needed again (since a loss of efficiency due to needless stores is "less
> bad" than the program misbehaving because a needed value wasn't retained).
>
> ...
>
>
> I wouldn't exactly be surprised, though, that if GCC had a BJX2 backend, it
> could beat out BGBCC pretty solidly. However, as noted, BGBCC is being
> used partly because writing a new GCC backend looked like a pretty big
> project.

It _is_ a pretty big project (and the GCC code isn't always a joy to
work with either). If at some point you want to approach that project
you could have a look at the MRISC32 patches for inspiration (a handful
of commits on top of the upstream GCC Git repo):

https://github.com/mrisc32/gcc-mrisc32

It is not complete (e.g. some C++ features are missing, like
exceptions), but it can compile a lot of different things and generates
decent code.

A deficiency right now is that I have not implemented relaxation in
binutils, so I always get two-instruction sequences for calls/tail-calls
and PC-relative loads/stores, where usually a single instruction would
suffice.

>
>>> Though, can note that the benchmark does depend a bit on integer
>>> division and strcmp, neither of which are "particularly" fast in my
>>> case (there are not any specialized instructions for these cases).
>>>
>>>
>>>
>>> Then again, I had noted one time though, that when I tried to do a
>>> port of BGBCC to generate code for ARM32, its performance (relative
>>> to GCC or Clang) was pretty much atrocious...
>>>
>>> Though, its generated code isn't *that* awful, so it is unclear what
>>> the main factor is, apart from ARM's relative lack of register space
>>> meaning that the generated code consists mostly of LD/ST ops... (*)
>>>
>>> So, it is also possible that this could be a factor as well.
>>>
>>>
>>> *: I suspect a factor here is using registers for temporaries, where
>>> cases in which a variable's value goes through a temporary register
>>> rather than being used directly are not ideal for register
>>> pressure. Combined with a compiler which isn't really smart enough to
>>> realize when temporary values are no longer needed and can be
>>> discarded (so these intermediate values from temporaries tend to
>>> frequently end up being stored back to the stack frame in the off
>>> chance they are needed later, ...).
>>>
>>> With BJX2 having roughly 27 (generic/usable) GPRs, it is able to keep
>>> a lot more stuff in registers, vs ARM32 only having 11.
>>>
>>> But, there isn't really a good/easy way to fix some of this.
>>>
>
> Side note:
> My current backend design really does not deal well with not having a
> lot of spare registers available...
>
>
>>>
>>>>> And, it appears this is not entirely recent: these sorts of 2-wide
>>>>> superscalar cores seem to have been dominant in phones and consumer
>>>>> electronics for roughly the past 15-20 years or so.
>>>>
>>>> Not sure what you mean with dominant.  OoO cores have been used on
>>>> smartphones since the Cortex-A9, used in, e.g., the Apple A5 (2011).
>>>>
>>>> As for other consumer electronics: If you don't need much performance,
>>>> no need for an expensive OoO core.
>>>>
>>>
>>> Dominant, as-in, the vast majority are using 2-wide superscalar,
>>> rather than OoO cores. While OoO isn't exactly new, and presumably
>>> not that much more expensive (if it is competitive in terms of area,
>>> ...), only a relative minority of devices use it.
>>>
>>> It seems like in the late 90s, consumer electronics / phones / ...
>>> mostly went from single-issue cores to dual-issue, and then just sort
>>> of sat there...
>>>
>>
>


Re: Squeezing Those Bits: Concertina II

<08408032-5fac-41ba-b857-571da323b9fen@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17503&group=comp.arch#17503

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 07:32:06 -0700 (PDT)
In-Reply-To: <9GgvI.515493$J_5.262004@fx46.iad>
Message-ID: <08408032-5fac-41ba-b857-571da323b9fen@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
Lines: 7
 by: Quadibloc - Mon, 7 Jun 2021 14:32 UTC

On Sunday, June 6, 2021 at 9:48:24 PM UTC-6, EricP wrote:

> You are likely to be eaten by a grue.

I would be, if my efforts to design a new instruction set
were taking place within the Great Underground Empire.

John Savard

Re: Squeezing Those Bits: Concertina II

<70140cf3-9f7a-494c-b810-92105f9db4a8n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17504&group=comp.arch#17504

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:09:23 -0700 (PDT)
In-Reply-To: <s9kb78$39d$1@dont-email.me>
Message-ID: <70140cf3-9f7a-494c-b810-92105f9db4a8n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 17:09 UTC

On Sunday, June 6, 2021 at 11:35:38 PM UTC-6, Ivan Godard wrote:
> On 6/6/2021 6:51 PM, Quadibloc wrote:

> > Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> > code hardly _ever_ has an ILP of 8.

> True in open code, false in pipelined loops.

Which, of course, is exactly why I think it's worth having the VLIW capability
at all, as it is designed for that level of ILP.

John Savard

Re: Squeezing Those Bits: Concertina II

<f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17506&group=comp.arch#17506

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:22:44 -0700 (PDT)
In-Reply-To: <15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>
Message-ID: <f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 17:22 UTC

On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:

> > What am I up to?

> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

And why am I doing this, since I _have_ Computer Architecture: A
Quantitative Approach readily at hand? Surely that means I'm engaging
in _intentional_ perversity?

One way my design can be thought of is... as a *marketing-driven*
design.

You're used to programming an IBM 360 or one of its successors?

Well, just like a System/360, we have full base-index addressing in
only 32 bits, and memory-to-memory string and packed decimal
instructions in only 48 bits!

You're used to programming a Motorola 68000, or an x86, or
other recent microprocessors?

Our instructions have full 16-bit memory displacements!

You're used to programming a Cray I, or similar vector machine?

Our architecture includes a set of 64 vector registers, each containing
64 scalar values, offering operations similar to those of classic
Cray-style vector architectures.

You're used to programming a RISC architecture?

Our integer and floating-point register banks have 32 registers each.
Our register operate instructions include a bit to enable changing the
condition codes, so you can place instructions between an operate
instruction and a conditional branch based on its result, to reduce the
impact of this dependency.

You've made use of VLIW DSP processors, like the TI TMS320C6000?

We offer the ability to explicitly indicate when instructions can execute
in parallel, and which instructions depend on the results of which other
instructions.

So no matter what computer you had been working with before,
you'll find the features you were familiar with, and want to see in
the next computer you use here in our architecture!

John Savard

Re: Squeezing Those Bits: Concertina II

<0e09f9ec-5649-4051-899c-53bdf0e9247fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17508&group=comp.arch#17508

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:34:56 -0700 (PDT)
In-Reply-To: <f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>
Message-ID: <0e09f9ec-5649-4051-899c-53bdf0e9247fn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
Lines: 55
 by: MitchAlsup - Mon, 7 Jun 2021 17:34 UTC

On Monday, June 7, 2021 at 12:22:45 PM UTC-5, Quadibloc wrote:
> On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:
> > On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
>
> > > What am I up to?
>
> > {with only mild sarcasm::}
> > It appears you are wandering around in the dark without a candle.
> And why am I doing this, since I _have_ Computer Architecture: A
> Quantitative Approach readily at hand? Surely that means I'm engaging
> in _intentional_ perversity?
>
> One way my design can be thought of is... as a *marketing-driven*
> design.
>
> You're used to programming an IBM 360 or one of its successors?
>
> Well, just like a System/360, we have full base-index addressing in
> only 32 bits, and memory-to-memory string and packed decimal
> instructions in only 48 bits!
>
> You're used to programming a Motorola 68000, or an x86, or
> other recent microprocessors?
>
> Our instructions have full 16-bit memory displacements!
>
> You're used to programming a Cray I, or similar vector machine?
>
> Our architecture includes a set of 64 vector registers, each containing
> 64 scalar values, offering operations similar to those of classic
> Cray-style vector architectures.
>
> You're used to programming a RISC architecture?
>
> Our integer and floating-point register banks have 32 registers each.
> Our register operate instructions include a bit to enable changing the
> condition codes, so you can place instructions between an operate
> instruction and a conditional branch based on its result, to reduce the
> impact of this dependency.
>
> You've made use of VLIW DSP processors, like the TI TMS320C6000?
>
> We offer the ability to explicitly indicate when instructions can execute
> in parallel, and which instructions depend on the results of which other
> instructions.
>
> So no matter what computer you had been working with before,
> you'll find the features you were familiar with, and want to see in
> the next computer you use here in our architecture!
<
But the vast majority of computer users see the computer through a
high level language and its associated compiler and libraries.

Also note: My 66000 code density seems to be "on par" with x86-64.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<s9lna7$f6e$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17510&group=comp.arch#17510

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Mon, 7 Jun 2021 13:06:49 -0500
Organization: A noiseless patient Spider
Lines: 367
Message-ID: <s9lna7$f6e$1@dont-email.me>
In-Reply-To: <s9kmba$2b9$1@dont-email.me>
 by: BGB - Mon, 7 Jun 2021 18:06 UTC

On 6/7/2021 3:45 AM, Marcus wrote:
> On 2021-06-04, BGB wrote:
>> On 6/4/2021 2:10 PM, Marcus wrote:
>>> On 2021-06-04, BGB wrote:
>>>>
>
> [snip]
>
>>>>
>>>
>>> I would guess that part of what you're measuring is the compiler
>>> maturity level. On my in-order CPU that does not even have I$ nor D$
>>> (but a shared single cycle 32-bit BRAM bus), I got something like
>>> 0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
>>> C library functions (memcpy etc). Before optimizing the libc routines
>>> I got 0.5 DMIPS/MHz.
>>>
>>> With a proper I$ (that I'm currently working on) I expect to get closer
>>> to 1 DMIPS/MHz.
>>>
>
> BTW, my point here is that Dhrystone is very much a compiler and libc
> test, not just a CPU test. AFAICT it's perfectly OK to hand-optimize
> libc functions (memcpy, memset, strcmp, ...), and using MRISC32 vector
> operations in the hot libc functions helped to bump the Dhrystone score
> from ~0.5 to ~0.7 IIRC.
>

Yeah.

In my case, memcpy, memset, and strcmp are semi-optimized.

Memcpy and memset are transformed mostly into bigger operations, namely
mostly MOV.X operations (when properly aligned), or MOV.Q (otherwise).

The original code in the C library (*) mostly used loops which copied /
set data one byte at a time, which was not particularly good for
performance.

*: Mostly a fork of an old version of PDPCLIB, which was originally
written for some bizarre mix of IBM mainframes and an MS-DOS clone. Some
of the code in the C library was "not particularly high quality", but
not quite to the level to where "abandon all of it and start over"
seemed necessary. Had mostly been rewriting parts of it when stumbling
onto things that "particularly sucked".
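The rewrite described above (replacing byte-at-a-time loops with wider moves) looks roughly like this in portable C. This is a sketch only, not the actual BGBCC library code, which emits MOV.X/MOV.Q directly and handles alignment:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of widening a byte-copy loop: move 8 bytes per iteration (the
 * software analogue of emitting MOV.Q rather than MOV.B) and finish the
 * tail byte-wise.  memcpy() on the 8-byte chunks keeps the accesses
 * legal C; a compiler turns each one into a 64-bit load/store pair. */
void *memcpy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (; n >= 8; n -= 8, d += 8, s += 8)
        memcpy(d, s, 8);
    while (n--)
        *d++ = *s++;
    return dst;
}
```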

Strcmp is harder to optimize, mostly because one needs to
detect the presence of a zero byte and end the loop. I don't have a
dedicated instruction for this task (eg: "Set SR.T if QWord contains a
zero byte"), as it is fairly specialized.

Then again, such an instruction could help with strcpy/strcmp/strlen/...
so could maybe be justified.
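For reference, the zero-byte test that such an instruction would replace can be done in software with the well-known SWAR trick, at the cost of a few extra ALU operations per qword:

```c
#include <stdint.h>

/* Classic SWAR zero-byte detector: (v - 0x01..01) borrows into the high
 * bit of every byte that was 0x00; masking with ~v rejects bytes whose
 * own high bit was already set, so the result is nonzero iff v contains
 * a zero byte.  A "Set SR.T if QWord contains a zero byte" instruction
 * would collapse this into one operation. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ull)
            & ~v
            & 0x8080808080808080ull) != 0;
}
```

A strlen/strcpy-style loop can then scan eight bytes at a time and only drop to byte-wise scanning once the test fires.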

As noted, a few things I am doing on my CPU core seem to be beyond the
abilities of a vintage 486.

>>
>> In terms of stats at present:
>>    I have an L1 I$ and D$ (both 16K, direct-mapped)
>>      Memcpy (L1): 250MB/s
>>      Memset (L1): 320MB/s
>>    L2 Cache is 128K, 2-way set-associative
>>      Memcpy (L2): ~ 50MB/s
>>      Memset (L2): ~ 90MB/s
>>    RAM (DDR2, 50MHz):
>>      Memcpy: ~ 9MB/s
>>      Memset: ~ 17MB/s
>>
>> Memory access, if properly pipelined, is 1 cycle for an L1 hit.
>> It is 2 or 3 cycles if an interlock stall occurs (eg: trying to use a
>> value directly following a load).
>>
>> There is also a branch-predictor and similar, ...
>
> My machine is similar (except for the caches).
>

Caches are useful as the bare RAM isn't super fast.

Though, during this whole process, I do seem to lose a fair bit of the
potential bandwidth of the RAM due to overheads (seemingly mostly state
transitions within the L2 cache and similar). I was working on this, then
got distracted with other stuff.

I am mostly using 16-byte cache lines in this case for the ring-bus
(16-byte transfers, and sending the whole cache line in a single message).

It is possible to get some more speed here by using 32-byte cache lines
in the L2 cache (with 32B transfers to/from DRAM), but this is still
pretty buggy ATM.

For a single core, it is possible to boost the L1 caches up to 32K, which
improves performance slightly (say, going from a 95% hit rate to 98%).

At 2K or 4K, there are considerably more cache misses relative to 8K or
16K, so the smaller caches are a fair bit worse on this front.
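To put rough numbers on why those hit rates matter: average memory access time is hit_time + miss_rate × miss_penalty. A sketch (the 1-cycle hit time matches the figures above; the 20-cycle miss penalty is an assumed placeholder, not a measured value for this core):

```c
/* Average memory access time, in cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* With a 1-cycle hit and an assumed 20-cycle miss penalty:
 *   95% hits: amat(1.0, 0.05, 20.0) = 2.0 cycles
 *   98% hits: amat(1.0, 0.02, 20.0) = 1.4 cycles
 * so a 3-point hit-rate gain cuts the average access time by ~30%
 * under these assumptions. */
```

The exact penalty depends on whether the miss is served by L2 or DRAM, but the shape of the trade-off is the same.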

In the past, I had also experimented with large L1's but no L2, though
this tended to perform worse than L1+L2.

....

>>
>> My compiler does have a few weaknesses:
>> It isn't really able to use VLIW capabilities effectively, so most of
>> what it produces is scalar code;
>> It isn't super great at avoiding things like needless MOV instructions
>> or Load/Store ops;
>> It always creates stack frames, even for trivial leaf functions (could
>> be changed but would add a lot of complexity to the C compiler, and
>> would require the codegen to first prove that the function doesn't
>> contain any hidden function calls or similar);
>> Doesn't perform inlining, at all;
>> A certain subset of operators are implemented effectively using
>> "call-threading" (eg, rather than the compiler doing it itself, it
>> spits out hidden calls into the C runtime, *1);
>> ...
>>
>> *1: This generally happens for operators which don't exist natively in
>> the ISA and which can't be implemented effectively within a short
>> instruction sequence. Things like integer divide, modulo, large
>> multiply, etc, generally fall into this category. Large arrays, VLAs,
>> large struct variables, copying or returning a struct by value, ...,
>> may also generate hidden runtime calls. Some vector ops also involve
>> runtime calls, and some extensions (such as the __variant type,
>> __float128, ...) are implemented almost entirely via runtime calls.
>>
>>
>> As noted, its code generation tends to go through a stack model, which
>> in turn uses temporary variables.
>>
>> So:
>>    z=3*x+y;
>> Might be compiled as (pseudocode):
>>    PUSH 3
>>    LOAD x
>>    BINOP '*'
>>    LOAD y
>>    BINOP '+'
>>    STORE z
>> Or, in something closer to its original notation:
>>    3 $x * $y + =z
>>
>> Which might become, effectively (C-like pseudocode):
>>    _t0_1i = 3;
>>    _t1_1i = x;
>>    _t0_2i = _t0_1i * _t1_1i;
>>    _t1_2i = y;
>>    _t0_3i = _t0_2i + _t1_2i;
>>    z = _t0_3i;
>> But, then "optimized" back to:
>>    _t0_2i = x * 3;
>>    z = _t0_2i + y;
>>
>> But, not always. If the types don't match exactly, there might be
>> left-over type-conversion ops, say:
>>    _t0_1 = (int)3;
>>    _t1_1 = (int)x;
>>
>> Or, say, the values are computed as "int" but the destination is "long":
>>    _t0_3i = _t0_2i + _t1_2i;
>>    _t0_4l = (long)_t0_3i;
>>    z = _t0_4l;
>>
>> These cases may prevent the forwarding, but don't otherwise actually
>> change the value. In this case, these result in the occasional
>> needless MOV or EXTS.L instruction (and also increases register
>> pressure).
>>
>>
>> The codegen backend then does more or less a direct translation of
>> this into machine-code instructions, with a register allocator and
>> similar which maps variables temporarily onto CPU registers (in which
>> case they are loaded on demand from memory, and written back to memory
>> at the end of the current basic block). The register allocator may
>> also evict (and write back) values for registers if it needs to access
>> another variable and no unassigned registers are left.
>>
>> A variable may be statically assigned to a CPU register in which case
>> no memory write-back occurs, and the same variable maps to the same
>> register throughout the entire function. This only works for local
>> variables within a certain range of primitive types, and up to a
>> certain maximum number of variables in any given function. If the
>> "register" keyword is used, it adds a fairly big weight to the
>> variable being picked for this.
>>
>>
>> For better or for worse, it also translates fairly directly from the
>> 3AC IR to machine-code, though it "would be better" had it first gone
>> into a sort of "high-level ASM" which was then emitted as machine
>> code. This would give more room for things like reorganizing
>> instructions or performing peephole optimization (the instruction
>> shuffling and bundling done by the WEXifier is actually done on the
>> machine-code after it had already been emitted, which is kinda, not
>> exactly the most ideal way to do this).
>>
>>
>> Similarly, one doesn't always know for certain when a temporary value
>> goes out of scope, or whether its value may reappear later. In these cases the
>> value may end up written back to memory even if it isn't going to be
>> needed again (since a loss of efficiency due to needless stores is
>> "less bad" than the program misbehaving because a needed value wasn't
>> retained).
>>
>> ...
>>
>>
>> I wouldn't exactly be surprised though, if GCC had a BJX2 backend, if
>> it could beat out BGBCC pretty solidly. However, as noted, BGBCC is
>> being used partly because writing a new GCC backend looked like a
>> pretty big project.
>
> It _is_ a pretty big project (and the GCC code isn't always a joy to
> work with either). If at some point you want to approach that project
> you could have a look at the MRISC32 patches for inspiration (a handful
> of commits on top of the upstream GCC Git repo):
>
>   https://github.com/mrisc32/gcc-mrisc32
>
> It is not complete (e.g. some C++ features are missing, like
> exceptions), but it can compile a lot of different things and generates
> decent code.
>
> A deficiency right now is that I have not implemented relaxation in
> binutils, so I always get two-instruction sequences for calls/tail-calls
> and PC-relative loads/stores, where usually a single instruction would
> suffice.
>


Re: Squeezing Those Bits: Concertina II

<db879b20-33b6-4d2e-b775-86c4e8692181n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17512&group=comp.arch#17512
 by: MitchAlsup - Mon, 7 Jun 2021 19:07 UTC

On Monday, June 7, 2021 at 1:08:10 PM UTC-5, BGB wrote:
> On 6/7/2021 3:45 AM, Marcus wrote:
> > [snip]
> >
> [snip]
>
> Strcmp is harder to optimize mostly for the reason that one needs to
> detect the presence of a zero byte and end the loop. I don't have a
> dedicated instruction for this task (eg: "Set SR.T if QWord contains a
> zero byte"), as it is fairly specialized.
<
Strcmp is only hard if you attempt to run it wide with SIMD; it is
brain-dead easy to run it as wide as your cache port using virtual
vectors. Thus, SIMD is the problem, not C strings.
<
Secondarily, if instead of SET.?? you had a compare instruction that
delivered bit vectors of "just about anything you would want to know"
about the comparands, this would fall out for free. As far back
as the 88110, the compare instruction was augmented to contain
"any byte zero" and "any halfword zero" results. This is the added utility
of CMP delivering a bit vector rather than a single true or false.
<
But of course, you then need an efficient means to convert a bit in the
vector into a branch condition, or to convert a bit into a true or false.
The 88K and My 66000 both have these.
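A rough C model of a CMP that delivers a bit vector rather than a single boolean might look like the following (the flag names and bit positions here are hypothetical, not the actual 88110 or My 66000 encodings; the zero-byte tests use the standard SWAR trick):

```c
#include <stdint.h>

/* Hypothetical flag-bit assignments; real encodings differ. */
enum {
    CMP_EQ        = 1 << 0, /* a == b */
    CMP_LT        = 1 << 1, /* a < b (signed) */
    CMP_ANY_BYTE0 = 1 << 2, /* some byte of a is zero */
    CMP_ANY_HALF0 = 1 << 3, /* some halfword of a is zero */
};

/* Nonzero iff some 8-bit byte of x is zero. */
static int any_byte_zero(uint64_t x)
{
    return ((x - 0x0101010101010101ULL) &
            ~x & 0x8080808080808080ULL) != 0;
}

/* Nonzero iff some 16-bit halfword of x is zero. */
static int any_half_zero(uint64_t x)
{
    return ((x - 0x0001000100010001ULL) &
            ~x & 0x8000800080008000ULL) != 0;
}

/* A CMP that delivers a bit vector of results rather than a single
 * true/false; a conditional branch then just tests one bit. */
static uint32_t cmp_bits(int64_t a, int64_t b)
{
    uint32_t r = 0;
    if (a == b) r |= CMP_EQ;
    if (a < b)  r |= CMP_LT;
    if (any_byte_zero((uint64_t)a)) r |= CMP_ANY_BYTE0;
    if (any_half_zero((uint64_t)a)) r |= CMP_ANY_HALF0;
    return r;
}
```

A strcmp inner loop would then branch on CMP_ANY_BYTE0 to find the terminator without a separate zero-detection pass.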
>
> Then again, such an instruction could help with strcpy/strcmp/strlen/...
> so could maybe be justified.
<
I explicitly made the LOOP instruction in My 66000 deal with the loop
control of strncpy and strncmp, where there are both counted loop
terminations and data-conditional loop terminations.
<
Basically every leaf level subroutine in the str and mem libraries vectorizes
under VVM.

Re: Squeezing Those Bits: Concertina II

<19893fe5-a53f-4549-b9ef-4b95bab1bc04n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17514&group=comp.arch#17514
 by: Quadibloc - Mon, 7 Jun 2021 20:02 UTC

On Monday, June 7, 2021 at 11:34:58 AM UTC-6, MitchAlsup wrote:

> Also note: My 66000 code density seems to be "on par" with x86-64.

Oh, that's good. And, of course, comparing the lengths of _individual
instructions_, as I'm doing, is hardly the right way to assess code
density. If I *really* wanted to maximize code density, obviously I'd
use as the starting point the computer with the greatest code density
ever made, at least by some accounts: the PDP-8.

But a computer that does everything in memory would be very slow,
and no one would want to work with memory organized by 128-word
pages.

John Savard

Re: Squeezing Those Bits: Concertina II

<3612a40f-a638-43d2-b203-cf8d320d7eaan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17517&group=comp.arch#17517
 by: MitchAlsup - Mon, 7 Jun 2021 21:06 UTC

On Monday, June 7, 2021 at 3:02:48 PM UTC-5, Quadibloc wrote:
> On Monday, June 7, 2021 at 11:34:58 AM UTC-6, MitchAlsup wrote:
>
> > Also note: My 66000 code density seems to be "on par" with x86-64.
<
> Oh, that's good. And, of course, comparing the lengths of _individual
> instructions_, as I'm doing, is hardly the right way to assess code
> density. If I *really* wanted to maximize code density, obviously I'd
> use as the starting point the computer with the greatest code density
> ever made, at least by some accounts: the PDP-8.
<
You should only consider 64-bit machines, as the PDP-8 is at a serious
deficit: loading a value into a register requires clearing the register
and then adding memory to it (2 instructions), a 12×12 multiply is
24 instructions, ... and an indexed memory LD/ST is 3 instructions ...
<
I think PDP-11 had better code density.
>
> But a computer that does everything in memory would be very slow,
> and no one would want to work with memory organized by 128-word
> pages.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<c52dcbef-8079-4315-811c-b02589f23bf8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17518&group=comp.arch#17518
 by: Quadibloc - Mon, 7 Jun 2021 21:11 UTC

On Monday, June 7, 2021 at 3:06:29 PM UTC-6, MitchAlsup wrote:

> You should only consider 64-bit machines as the PDP-8 is at a serious
> deficit when you consider loading a value into a register requires clearing
> the registers and then adding memory to it: 2-instructions, 12×12 multiply
> is 24 instructions,......indexing memory LD/ST: 3 instructions......

That is, indeed, such a no-brainer that I had not failed to see that point.

So while the instructions might _look_ like PDP-8 instructions, they would still
perform 64-bit operations. However, it's still unworkable for many other reasons.

> I think PDP-11 had better code density.

I have a design for a 16-bit mode that looks a lot like the PDP-11. By shrinking
the mode bits to only two, as used in the 9900, but keeping only 8 registers, I
get an opcode field of six bits. Suddenly, I can include floating-point opcodes.

Only thing is, it breaks down if one wants to use more than 64K of memory with
it.

John Savard

Re: Squeezing Those Bits: Concertina II

<jwvv96pmj9v.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=17519&group=comp.arch#17519
 by: Stefan Monnier - Mon, 7 Jun 2021 21:56 UTC

> Only thing is, it breaks down if one wants to use more than 64K of memory with
> it.

That's OK. 64kB should be plenty for any reasonable use.

Stefan

Re: Squeezing Those Bits: Concertina II

<fddb3129-96e2-4c85-902a-46a6f14a9db8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17520&group=comp.arch#17520
 by: MitchAlsup - Mon, 7 Jun 2021 22:09 UTC

On Monday, June 7, 2021 at 4:56:38 PM UTC-5, Stefan Monnier wrote:
> > Only thing is, it breaks down if one wants to use more than 64K of memory with
> > it.
> That's OK. 64kB should be plenty for any reasonable use.
<
Why would anyone need more than 640KB ??
<
Well, we blew right through 4GB, and are within spitting distance of not needing
to run the swapper on many/most home computers !
>
>
> Stefan

Re: Squeezing Those Bits: Concertina II

<25438a23-3deb-4068-b9d6-ddabed54e4dbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17521&group=comp.arch#17521
 by: MitchAlsup - Mon, 7 Jun 2021 22:12 UTC

On Monday, June 7, 2021 at 4:11:45 PM UTC-5, Quadibloc wrote:
> On Monday, June 7, 2021 at 3:06:29 PM UTC-6, MitchAlsup wrote:
>
> > You should only consider 64-bit machines, as the PDP-8 is at a serious
> > deficit: loading a value into a register requires clearing the register
> > and then adding memory to it (2 instructions), a 12×12 multiply is
> > 24 instructions, ... and an indexed memory LD/ST is 3 instructions ...
> That is, indeed, such a no-brainer that I had not failed to see that point.
>
> So while the instructions might _look_ like PDP-8 instructions, they would still
> perform 64-bit operations. However, it's still unworkable for many other reasons.
> > I think PDP-11 had better code density.
> I have a design for a 16-bit mode that looks a lot like the PDP-11. By shrinking
> the mode bits to only two, as used in the 9900, but keeping only 8 registers, I
> get an opcode field of six bits. Suddenly, I can include floating-point opcodes.
>
> Only thing is, it breaks down if one wants to use more than 64K of memory with
> it.
<
Give a PDP-11 64-bit registers and you will be surprised at how good the
code density is, right up until you have to branch/call/jump farther than
±32KB. But I bet there is some kind of "accommodation" to get larger
constants that would work fairly well.
<
And then there is that "pipelining" problem.
>
> John Savard

Re: Squeezing Those Bits: Concertina II


https://www.novabbs.com/devel/article-flat.php?id=17522&group=comp.arch#17522

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 17:39:40 -0700 (PDT)
In-Reply-To: <25438a23-3deb-4068-b9d6-ddabed54e4dbn@googlegroups.com>
Message-ID: <5f9c04c7-0c4d-431c-8678-d112b7e1e51cn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Tue, 08 Jun 2021 00:39:41 +0000
 by: Quadibloc - Tue, 8 Jun 2021 00:39 UTC

On Monday, June 7, 2021 at 4:12:51 PM UTC-6, MitchAlsup wrote:
> On Monday, June 7, 2021 at 4:11:45 PM UTC-5, Quadibloc wrote:

> > Only thing is, it breaks down if one wants to use more than 64K of memory with
> > it.

> Give a PDP-11 64-bit registers and you will be surprised at how good the code density
> is, right up until you have to branch/call/jump farther than ±32KB. But I bet there
> is some kind of "accommodation" to get larger constants that would work fairly well.

Well, the solution to using a larger address space than 64K was sitting right in
front of me, as I'm already making use of it extensively in the Concertina II as it
is.

In addition to combining the PDP-11 (eight registers) with the TI 9900 (but only four
address modes), I just have to put the IBM System/360 Model 20 into the mix. So the
programmer's model includes eight base registers, which are mapped to the ones I
use in the regular instruction set.
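As a sketch of how base registers stretch a 16-bit offset beyond 64K (register widths and the address-formation details here are illustrative, not the actual Concertina II encoding):

```python
# Illustrative base+displacement address formation: a 16-bit displacement
# alone only spans 64K, but adding it to a wide base register reaches
# anywhere in a large address space.

def effective_address(base_regs, b, disp16):
    """b selects one of eight wide base registers; disp16 is an unsigned
    16-bit displacement taken from the short instruction format."""
    assert 0 <= b < 8 and 0 <= disp16 < (1 << 16)
    return (base_regs[b] + disp16) & ((1 << 64) - 1)

bases = [0] * 8
bases[3] = 0x0000_0001_0010_0000   # a segment placed well above 64K
print(hex(effective_address(bases, 3, 0x0042)))  # 0x100100042
```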

> And then there is that "pipelining" problem.

In my opinion, the cure for that is SMT. That is, yes, this particular dense code format
gives poor performance - if it's the only thing running on the machine. If it alternates with
other code, then even in the absence of OoO to fix things, there's no real problem.

So if you're happy to be one of sixteen concurrent threads, use the PDP-11-like format
for higher code density. If you want to hog the machine all to yourself, make efficient
use of it by using the VLIW block formats instead.
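The SMT argument can be illustrated with a toy barrel-processor model: if issue slots rotate round-robin among N threads, each thread's dependent instruction only stalls when its slot comes around before the result is ready, so a long result latency is hidden once N is large enough. (The formula and the latency figure below are my illustration, not a claim about any particular design.)

```python
# Toy barrel-processor model: round-robin issue from n_threads threads
# hides a result latency of `latency` cycles whenever n_threads >= latency.

def stall_cycles_per_instr(n_threads, latency):
    """Extra cycles a thread waits beyond its natural issue slot for a
    dependent result. A thread issues every n_threads cycles; the result
    arrives after `latency` cycles."""
    return max(0, latency - n_threads)

print(stall_cycles_per_instr(1, 10))   # 9: a single thread eats the latency
print(stall_cycles_per_instr(16, 10))  # 0: sixteen threads hide it entirely
```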

John Savard
