Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

Decaffeinated coffee? Just Say No.


computers / comp.arch / Re: Dense machine code from C++ code (compiler optimizations)

Re: Dense machine code from C++ code (compiler optimizations)

<snfohs$abd$1@dont-email.me>

  copy mid

https://www.novabbs.com/computers/article-flat.php?id=22093&group=comp.arch#22093

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Mon, 22 Nov 2021 03:40:07 -0600
Organization: A noiseless patient Spider
Lines: 273
Message-ID: <snfohs$abd$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 22 Nov 2021 09:40:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="673fabfc6e16066c237e505b2b47291f";
logging-data="10605"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/6ZclVW7bnngTVdDAhcGGw"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.1
Cancel-Lock: sha1:GYa5RPqTQ/DKfxqEwX2h1o/HJ/4=
In-Reply-To: <snffpt$o6p$1@dont-email.me>
Content-Language: en-US
 by: BGB - Mon, 22 Nov 2021 09:40 UTC

On 11/22/2021 1:10 AM, Marcus wrote:
> On 2021-11-21 kl. 23:14, BGB wrote:
>> On 11/21/2021 11:13 AM, Marcus wrote:
>>> Just wrote this short post. Maybe someone finds it interesting...
>>>
>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>
>>
>> I guess this points out one limitation of my compiler (relative to
>> GCC) is that for many cases it does a fairly direct translation from C
>> source to the machine code.
>>
>> It will not optimize any "high level" constructions, but instead sort
>> of depends on the programmer to write "reasonably efficient" C.
>
> I would suspect that. That was one of the points in my post: GCC and
> Clang have many, many, many man-years of work built in, and it's very
> hard to compete with them if you start fresh on a new compiler.
>
> I also have a feeling that the C++ language is at a level today that
> it's near impossible to write a new compiler from scratch. It's not only
> about the sheer amount of language features (classes, lambdas,
> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
> ...), but it's also about the expectations about how the code is
> optimized. C++ places a huge burden on the compiler to be able to
> resolve lots of things at compile time (e.g. constexpr essentially
> requires that the C++ code can be executed at compile time).
>

I have currently put full C++ support on the "not going to do it" shelf...

There is a subset along similar lines to EC++, but hardly any C++ code
(which actually uses the STL or C++ standard library) would actually
work with such a subset.

>> Such a case would not turn out nearly so nice in my compiler though
>> (if it supported C++), but alas.
>>
>> Trying to port GCC looks like a pain though, as its codebase is pretty
>> hairy and it takes a fairly long time to rebuild from source (compared
>> with my compiler; which rebuilds in a few seconds).
>
> Yes, it has taken me years, and the code base and the build system is
> not modern by a long shot. A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
> incremental build of GCC when some part of the machine description has
> changed (e.g. an insn description was added) takes about a minute.
>

Rebuilding GCC for RISC-V took around 20 minutes on my 2700X, but
granted, this was on a platter drive which was running low on available
disk space; and on WSL.

Rebuilding BGBCC with MSVC takes around 3 seconds.
Rebuilding BGBCC with GCC takes around 56 seconds.

> OTOH it would probably have taken me even longer to create my own
> compiler (especially as I'm not very versed in compiler architecture),
> so for me it was the less evil of options (I still kind of regret that
> I didn't try harder with LLVM/Clang, though, but I have no evidence
> that the grass is greener over there).
>

I fiddled around with compilers and language design for a long time
before I got into ISA design or FPGAs (for me, I started fiddling with
writing language interpreters back in high-school; and BGBCC started out
as a fork off a project I started working on directly after high-school,
namely writing a custom JavaScript knock-off with the intention of using
it as a scripting language in my other projects).

Initially, BGBCC was also intended as a script-interpreter, just parsing
a C variant rather than a JS variant, but at the time I had found that C
was much worse suited to the task, and debugging a C compiler was
considerably harder than a JS-like compiler (despite their superficial
syntactic similarity).

But, then it beat around for some-odd years being used mostly as an "FFI
glue" tool.

I had BGBCC laying around from some past projects of mine. It was pretty
awful, and wasn't really used much for anything "non-trivial" but it
sorta had most of the basics in place.

For my newer ISA projects, I was like, "well, I will dust off what I
have and just use that".
Actual 2nd place option at the time was LCC, but it didn't look like LCC
would have offered that much over what I had already.

Though, BGBCC has expanded to around 5x as much code as it was when I
started this project. Most of this is in the backend, and a lot of new
special cases in the frontend, ...

It does have a little big of funkiness in that it works very differently
from GCC. It does not use object files, but instead uses a
stack-oriented bytecode as an IR stage (and internally generates the 3AC
IR code from the stack-machine code).

....

>>
>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
>> seemingly takes ~ 20 seconds to recompile Doom.
>>
>
> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
> Doom is 1.3 s (10 s without parallel compilation). The build time for
> Quake is 1.6 s (13 s without parallel compilation).
>

I was invoking it via a similar approach to how one uses MSVC, namely:
gcc -o whatever ...all the source files... (options)

Generally, the compiler speed ranking tends to be:
MSVC, fastest;
BGBCC, 2nd fastest;
GCC and Clang, considerably slower.

Though, GCC and Clang are able to generate code that is typically a fair
bit faster than what MSVC can pull off.

BGBCC basically runs as a single thread in a single process.
It does cache loaded header files and similar in RAM, mostly because
opening/closing/reading files is fairly slow (and otherwise can become a
bottleneck).

> But I agree that a fast compiler is worth a lot. I work with a DSP
> compiler that can take ~5 minutes to compile a single object file that
> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>

Hmm...

I get annoyed waiting much more than a few seconds.

>>
>
> [snip]
>
>> And, all this was while trying to hunt down another bug which seems to
>> result in Doom demos sometimes desyncing (in a way different from the
>> x86 builds); also a few other behavioral anomalies which have shown up
>> as bugs in Quake, ...
>>
>
> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
> get the result off by a single LSB, the demos will start running off
> track very quickly ;-) I found this out back in the 1990s when I ported
> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>

I am handling these generally by using 64-bit integer math.

Previously, I had the Doom demos 1:1 between the x86 builds and the BJX2
builds (they would play out the same way), but recently something has
changed which is causing the 3rd demo in the demo-loop (for Ultimate
Doom, E3M5) to diverge near the end.

Namely, in the x86 builds (and previously on BJX2), the player would ram
into an imp and then get killed by a hell baron near the outside edge of
the central structure. Now, on BJX2, there is a divergence and the
player gets killed (by the baron) in a different location.

The main difference in this case seems to be that the imp behaves
differently and is in a different spot.

I had already been poking a bit at the division algo and similar, and
this seems to still be generating the correct results (things like
sign-extensions on shifts and similar are also tested via "sanity
checks", ...).

Behavior seems to still match up in Heretic and Hexen last I checked.

I have at least found and fixed some other bugs though.

The ROTT demos also desync pretty bad.

The behavior is also fairly stable in the face of switching around
codegen options, ..., implying it is probably not a low-level codegen issue.

In any case, this implies a behavioral / semantics issue somewhere.

Though, it does appear that this desync matches up with the behavior I
saw when I built this port of the Doom engine for the RasPi.

Or:
The first demo (E1M5) is desync'ed, but behavior is consistent between
all the targets;
The second demo (E2M2) remains in-sync and consistent on all targets;
The third demo (E3M5) diverges when the imp and baron are encountered,
with my BJX2 build switching from the x86 behavior to ARM/RasPi behavior
(for reasons I have not yet determined).

....

But, yeah, this stuff is super fiddly (and sensitive to very slight
disturbances).

For ROTT, there also seems to be a running RNG state, and differences in
enemy behavior will also cause the RNG to diverge, leading to a cascade
effect where everything goes to crap.

This is unlike Doom and friends where enemy behavior does not make use
of an RNG (ROTT deals with demos by initially reseeding the RNG with 0).

>>
>> Well, also the relatively naive register allocation strategy doesn't
>> help:
>> For each basic block, whenever a variable is referenced (that is not
>> part of the "statically reserved set"), it is loaded into a register
>> (and temporarily held there), and at the end of the basic-block,
>> everything is spilled back to the stack.
>>
>> There are, in theory, much better ways to do register allocation.
>>
>>
>> Though, one useful case is that the register space is large enough to
>> where a non-trivial number of functions can use a "statically reserve
>> everything" special case. This can completely avoid spills, but only
>> for functions within a strict limit for the number of variables
>> (limited by the number of callee save registers, or ~ 12 variables
>> with 32 GPRs).
>>
>> For most functions, this case still involves the creation of a stack
>> frame and similar though (mostly to save/restore registers).
>>
>> This case still excludes using a few features:
>>    Use of structs as value types (structs may only be used as pointers);
>>    Taking the address of any variable;
>>    Use of VLAs or alloca;
>>    ...
>>
>> But, this case does allow a few things:
>>    Calling functions;
>>    Operators which use scratch registers;
>>    Accessing global variables;
>>    ...
>>
>>
>> With the expanded GPR space, the scope of the "statically assign
>> everything" case could be be expanded, but still haven't gotten around
>> to making BGBCC able to use all 64 GPRs directly (and I still don't
>> consider them to be part of the baseline ISA, ...). This could (in
>> theory) expand the limit to ~ 28 or so.
>>
>>
>> If BGBCC supported C++, I don't have much confidence for how it would
>> deal with templates.
>>
>> But, otherwise, mostly more trying to find and fix bugs and similar at
>> the moment. But, often much of the effort is trying to actually find
>> the bugs (since "demo desyncs for some reason" isn't super obvious as
>> to why this is happening, or where in the codebase the bug is being
>> triggered, ...).
>>
>>
>>> /Marcus
>>
>

SubjectRepliesAuthor
o Dense machine code from C++ code (compiler optimizations)

By: Marcus on Sun, 21 Nov 2021

190Marcus
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor