Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.lang.c
Subject: Re: MCC Compiler
Date: Sun, 22 Oct 2023 16:16:45 -0500
Organization: A noiseless patient Spider
Lines: 355
Message-ID: <uh43i4$2lhk2$1@dont-email.me>
References: <uf3n7a$3lfik$1@dont-email.me> <867co9j4jf.fsf@linuxsc.com>
<uf51vi$3tg5h$1@dont-email.me> <86pm21h8es.fsf@linuxsc.com>
<uf6a2e$809b$1@dont-email.me> <864jir8fpw.fsf@linuxsc.com>
<ugjb0v$1dkhk$1@dont-email.me> <86jzrh5qk5.fsf@linuxsc.com>
<uh1bfu$1uo0h$1@dont-email.me> <8634y2672g.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 22 Oct 2023 21:17:57 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="5a73ac204590050bbdf8df78b2b2ce9f";
logging-data="2803330"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/XnKqWDTpblj2dqMjySh+I"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:120rWFeHrOyHR320ZkOqKyJNGxE=
In-Reply-To: <8634y2672g.fsf@linuxsc.com>
Content-Language: en-US

On 10/22/2023 9:54 AM, Tim Rentsch wrote:
> BGB <cr88192@gmail.com> writes:
>
>> On 10/20/2023 9:14 AM, Tim Rentsch wrote:
>>
>>> Bart <bc@freeuk.com> writes:
>>>
>>>> On 16/10/2023 03:26, Tim Rentsch wrote:
>>>>
>>>>> Bart <bc@freeuk.com> writes:
>>>>
>>>>
>>>>>> My product compiles a C 'subset' but does not formally
>>>>>> define what it is.
>>>>>
>>>>> An informal definition is a lot better than no description
>>>>> at all, and at least is something about the C language.
>>>>> So figure out what the discrepancies are (and only you can
>>>>> do that), and tell us about it. That's why we're here!
>>>>
>>>> OK, an informal definition is the subset of C that I personally
>>>> use, and that I saw being commonly used on open source projects,
>>>> at the start of 2017.
>>>>
>>>> It would loosely be C99 but missing Complex, VLAs, designated
>>>> initialisers, compound literals, but with _Generic from C11 (that
>>>> one could be trivially implemented in about 40 lines of code).
>>>
>>> Given that, a good target would be to aim for conformance with
>>> C11, where Complex and VLAs are optional. Compound literals
>>> and designated initializers shouldn't be that hard to add, and are
>>> both very useful. If you can get your compiler up to the level
>>> of C11 conformance, or at least close enough, I would consider
>>> trying it with my current project (which is written entirely
>>> in C). The code in that project uses both compound literals and
>>> designated initializers. Do you handle bitfields? I use those
>>> also.
>>
>> FWIW, in my C compiler (BGBCC, as-is):
>> _Complex exists, but isn't well tested;
>> VLAs sort of exist, but:
>> They only work for 1D arrays of primitive types (no structs).
>> Basically, C89 with parts of the newer standards glued on ad-hoc.
>> Features were added more "as useful".
>> Supports some parts of C23 as well.
>> Supports _BitInt, partial 'auto', ...
>> Also supports the proposed (C++ style) lambda syntax.
>> Though, semantics don't exactly match C++ lambdas (1).
>> Also the new attribute syntax and similar.
>> Along with UTF-8 strings and similar as well.
>> Also the 'stdbit.h' stuff, "0b" literals, digit separators, ...
>> ...
>>
>> Does not support '_Generic()' though as of yet.
>
> For a compiler to be one I might be able to use, it must
>
> * conform to either the C99 or C11 standard (some
> documented shortcomings might be acceptable, depending
> on what they are)
>

It mostly supports roughly the common subset of C99 and C11.
Though, most of the code I have ported to my ISA has been C89, so newer
functionality isn't well tested.

> * produce .o files in at least one environment I can use
> (right now that is only GNU/Linux), including support
> for a -fPIC option
>

At present, generating code for mainstream targets has not been a
priority, as MSVC/GCC/Clang already address those cases fairly effectively.

My previous small-scale attempts at generating code for ARM targets had
given horribly bad performance relative to GCC, so it didn't really seem
worthwhile.

As-is, it produces "RIL3" as its "object file" output, which is a
stack-oriented bytecode along vaguely similar lines to .NET bytecode
(the compiler does the preprocessing and parsing, then translates the
resulting AST into a stack-oriented bytecode, generally using implicit
types which are carried along with the stack, like .NET and unlike the
JVM, ...).
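
As a rough illustration of that stack-oriented style (the actual RIL3
opcodes aren't shown in this post, so the mnemonics below are invented),
a trivial function and a plausible lowering:

  /* Hypothetical example: the pseudo-bytecode in the comment is made
     up, illustrating the ".NET-like, types ride the stack" property. */
  int mad(int a, int b, int c)
  {
      return a*b + c;
      /* plausible stack-IR:
           LOAD a     ; push a          (stack: int)
           LOAD b     ; push b          (stack: int, int)
           MUL        ; no .i32 suffix; operand types ride the stack
           LOAD c
           ADD
           RET
      */
  }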

Though, there are larger-scale architectural differences (metadata is
managed differently; and my design assumes a C-like or "bare metal"
environment, rather than a "Managed VM", ...).
Actual .NET bytecode would have been a crappy way of expressing C and
similar (making it less desirable at the time). The RIL3 bytecode is
used mostly as a stand-in for object files and static libraries (with
the final output being native-code binaries). Though, one could argue
that .NET's bytecode would at least be "less bad" at representing C than
the JVM's ".class" file format (in that, at least, doing so is not
"comical").

Things "could be better", but I have not come up with any "clearly
better" option (native-code objects and a three-address-code IR each
come with their own tradeoffs); the other factors are mostly matters of
data packaging and how best to represent the metadata, ...

I had on-and-off considered using IR images and AOT-compiling them per
target. This would preferably need a lighter-weight backend though. The
current backend loads and unpacks all of the bytecode and metadata;
saving memory would require the ability to translate functions one at a
time, and to determine things like what is reachable, preferably
without needing to first translate everything into 3-address form, ...

All this would require a fair bit of redesign though (and a different
packaging scheme for the IR bytecode and metadata, as the current
structure is not well suited to random access; it was designed around
the assumption of a sequential loader). Things like the RAM footprint
of reading in and accessing the bytecode images would also need to be
considered, etc. Well, along with the battle over the relative merits of
a stack-machine vs three-address-code representation for the IR, etc.

Note that many aspects of the target machine can be glossed over in the
IR, though things like "sizeof(void *)" and similar remain thorny
issues (so, if relevant, one may still need to build versions of the
libraries for each potential pointer size).
As-is, it sort of works, but trying to mix and match here mostly tends
to result in a bunch of hairy bugs, typically revolving around "#ifdef"
and "typedef" and similar (which affect other code, even if the C
library itself can be written to be mostly pointer-size agnostic).
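
As a small example of the kind of leak meant here: once the
preprocessor picks a branch, the IR image is committed to one pointer
width, even if the IR itself glosses over "sizeof(void *)" (the
"addr_t"/"node" names are just for illustration):

  #include <stdint.h>

  #if UINTPTR_MAX == 0xFFFFFFFFu
  typedef uint32_t addr_t;    /* chosen when the headers are parsed... */
  #else
  typedef uint64_t addr_t;    /* ...so this IR image is 64-bit only */
  #endif

  struct node { addr_t next; };  /* layout now differs per target */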

Relatedly, there is another "__ifarch(feature)" extension that allows
enabling or disabling functions or blocks of code depending on
target-specific options; however (unlike "#ifdef") it requires the code
to be structurally and syntactically valid (and the function signatures
need to match across all ifarch paths, etc). It differs from "#ifdef"
mostly in that it is handled much later in the compiler pipeline.

An example of this would be to allow the same context-switch code to be
used on targets with 32 and 64 GPRs (where half of the registers don't
exist in the 32-register configuration); a sketch follows below. It
would be annoying to compile separate versions of the runtime libraries
based on things like whether the configuration is using 32 or 64 GPRs,
which variant of the C ABI is being used, etc.

This feature is generally also available in ASM code and inline ASM
(and can also enable or disable the use of ASM versions of functions vs
their C counterparts, etc).
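
A rough sketch of how this might look for the context-switch case (the
"__ifarch" spelling is as above, but the feature name and block syntax
here are guesses, and this is a BGBCC extension rather than standard C):

  /* One save routine serving both register configurations; both arms
     must parse, unlike "#ifdef", since __ifarch is resolved late. */
  void ctx_save_gprs(void *frame)
  {
      /* ...save R0..R31 (common to both configurations)... */

      __ifarch(has_gpr64)
      {
          /* ...additionally save R32..R63... */
      }
  }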

The backend then translates this to native code, at present emitting
binaries in a PE/COFF variant (loosely derived from the WinCE/SH
variant), but with some tweaks, and typically LZ4-compressed (the LZ4
compression is integrated with the PE/COFF loaders via the modified
format).

The LZ4 decompression is faster than reading more data from an SD card
running at 13MHz in SPI mode. I am using 13MHz mostly because I could
not get reliable operation much above this speed (and have not
implemented the logic to support UHS; even if I did, I couldn't get
much more bandwidth over the existing MMIO bus).
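
For reference, LZ4 block decompression is just a byte-copy loop with no
entropy coding, which is why it can outrun a slow SD card. A minimal
decoder for the standard public LZ4 block format (the loader-integrated
variant described above isn't shown in this post):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Decode one raw LZ4 block; returns bytes written, -1 on error. */
  ptrdiff_t lz4_block_decode(const uint8_t *src, size_t slen,
                             uint8_t *dst, size_t dcap)
  {
      const uint8_t *s = src, *se = src + slen;
      uint8_t *d = dst, *de = dst + dcap;
      while (s < se) {
          unsigned tok = *s++, b;
          size_t len = tok >> 4;                /* literal run length */
          if (len == 15)                        /* length continuation */
              do { if (s >= se) return -1; len += (b = *s++); }
              while (b == 255);
          if ((size_t)(se - s) < len || (size_t)(de - d) < len) return -1;
          memcpy(d, s, len); d += len; s += len;
          if (s == se) break;                   /* block ends with literals */
          if (se - s < 2) return -1;
          size_t off = s[0] | (s[1] << 8);      /* little-endian offset */
          s += 2;
          if (off == 0 || off > (size_t)(d - dst)) return -1;
          len = tok & 15;
          if (len == 15)                        /* match-length continuation */
              do { if (s >= se) return -1; len += (b = *s++); }
              while (b == 255);
          len += 4;                             /* minimum match length */
          if ((size_t)(de - d) < len) return -1;
          const uint8_t *m = d - off;
          while (len--) *d++ = *m++;            /* byte copy handles overlap */
      }
      return d - dst;
  }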

When operating in the LZ4-compressed mode, the PE/COFF checksums also
use a different (slightly stronger) algorithm, as the original PE/COFF
checksum algorithm was insufficient to detect many of the problems a
misbehaving LZ4 decoder could introduce.

The code produced is "mostly" position-independent, but does still rely
on "base relocations" for some things. The ABI and format were tweaked
slightly to allow running multiple logical instances of a given binary
within a single shared address space; effectively, the
'.text'/'.rdata'/etc sections are accessed separately from '.data'/'.bss'.

In this case, the modifiable sections are separately allocated and
accessed relative to the "global pointer", with the global pointer
pointing to a lookup table which allows each PE image (EXE or DLL) to
locate its corresponding version of the section (updating the global
pointer to point to its own data section). This global pointer is
callee-save, so on return the caller's global pointer is restored.

Typically, the base-relocations were also split up, with one part being
applied to the executable sections when initially loaded; and the other
to the data sections when a new process instance is created.

Note that saving/restoring/updating the global-pointer is skipped for:
Functions which are not callable as either function pointers or DLL
exports ("__declspec(dllexport)" or similar);
Leaf functions which don't access any global variables.

This strategy seemed to be lower overhead than the mechanism used in ELF
FDPIC (where this hackery was mostly handled on the caller side, by
using multiple entries in the GOT and by having function pointers be
pointers into the GOT).
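
As a conceptual sketch of the callee-side scheme (the table layout and
all names below are invented for illustration; the post doesn't spell
out the actual ABI):

  /* Hypothetical: GP reaches a per-process table mapping PE image
     number -> that process's copy of the image's '.data'/'.bss'. */
  typedef struct { void *data_base[64]; } gp_table;

  /* Conceptual prologue/epilogue of an exported or indirectly-called
     function (GP is callee-save, so the caller's value comes back):

       saved_gp = GP;
       GP       = table_of(GP)->data_base[MY_IMAGE_INDEX];
       ... body: globals are accessed GP-relative ...
       GP       = saved_gp;   // restored before returning
  */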

All of this runs on a sort of quick/dirty makeshift OS (vaguely
resembling a weird Linux/Windows/MS-DOS hybrid), which as-is treats the
use of virtual memory as an optional/experimental feature (all this
stuff was designed to not assume the use of virtual memory).

I had partly started work on another compiler built around the goal of
being lightweight (preferably less code and a smaller memory footprint;
the goal being to try to keep it under 50k lines).

This compiler would be closer to a more conventional design, but was
using a non-standard "WOFF" object format, which was effectively sort of
like COFF but more simplistic, built on a variant of the WAD format
(sort of like what was used in the Doom engine games; see the layout
sketch after this list):
Lumps that began with '.' being sections;
Such as ".text" / ".data"
Lumps starting with '$' being metadata;
Such as the symbol/reloc tables ("$symtab"/"$rlctab").
Typically, section lumps preceding metadata lumps.
Section lump index was used as section-number, just starting at 1.
Section 0 being a "NULL section" as far as relocs go.
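
For reference, the Doom-style WAD container being built on here is tiny
(this is the classic public layout; WOFF's exact variant presumably
differs in details such as the magic):

  #include <stdint.h>

  typedef struct {            /* 12-byte WAD header */
      char    magic[4];       /* "IWAD"/"PWAD" in the Doom engine */
      int32_t num_lumps;      /* number of directory entries */
      int32_t dir_ofs;        /* file offset of the lump directory */
  } wad_header;

  typedef struct {            /* 16-byte directory entry */
      int32_t ofs;            /* file offset of the lump's data */
      int32_t size;           /* lump size in bytes */
      char    name[8];        /* ".text", ".data", "$symtab", ... */
  } wad_dirent;               /* lump index doubles as section number */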

Otherwise, the metadata was "similar" to what existed in COFF (just
using a single pair of symbol and reloc tables, rather than giving each
section its own tables). Likewise, it used the same entry format for
both the symbol and relocation tables (one entry type defines where to
find a symbol, the other what to fix up with the address of a symbol).
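
A guess at what such a shared entry format might look like (field names
and widths are invented; only the "same format for both tables"
property comes from the description above):

  #include <stdint.h>

  /* Shared by "$symtab" and "$rlctab": a symtab entry says where a
     symbol's value lives; a rlctab entry says where to patch in the
     address of a symbol. */
  typedef struct {
      uint32_t name_ofs;    /* offset into a string table */
      uint16_t section;     /* 1-based section lump index; 0 = NULL */
      uint16_t kind;        /* symbol class, or relocation type */
      uint32_t offset;      /* position within the section */
  } woff_entry;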

A variant of WOFF was also considered as a possible option for bytecode
objects/images, but in this case its main "competitor" was the use of
RIFF with a file structure vaguely similar to the RIFF AVI format (plus
some past musings about possible WAD/RIFF hybrid formats, ...).

Though, for "whatever-ar", it would make sense to stick with the
traditional "!<arch>" format for static libraries and similar in any
case (regardless of the use of COFF/ELF/WOFF whatever for the object
files); though technically, "whatever-ar" could probably also use a WAD
variant or similar and nothing would likely notice (and sidesteps the
hair of the "arch" format using ASCII-based data fields for "who knows
what reason"); partly as each cross compiler provides its own versions
of all the binutils (but, either way).
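
For reference, the "ASCII-based data fields" being griped about are the
classic "!<arch>" member header (the standard layout from ar.h):

  /* Every field is printable text padded with spaces; numbers are
     stored as decimal or octal strings rather than binary. */
  struct ar_hdr {
      char ar_name[16];     /* member file name */
      char ar_date[12];     /* decimal modification time */
      char ar_uid[6];       /* decimal user id */
      char ar_gid[6];       /* decimal group id */
      char ar_mode[8];      /* octal file mode */
      char ar_size[10];     /* decimal size in bytes */
      char ar_fmag[2];      /* magic terminator: grave-accent + newline */
  };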

Though, partly, this was because COFF seemed to have some needless
complexity vs "just throw the assembler's output in a WAD file".
I also skipped ELF, mostly because it too seemed needlessly complex and
doesn't really match my ISA and ABI design all that well.

This compiler was also intended to more closely mimic GCC's interface
(hopefully close enough that it would be less of a stretch to use it as
a cross-compiler under autotools).

This effort kinda stalled though, with me going off and working on other
stuff.

And I didn't really get the compiler anywhere near complete, yet it was
already at 30 kLOC, so it doesn't look like it will achieve the "less
than 50 kLOC" goal.

For contrast, my existing compiler weighs in at closer to 250 kLOC.

Though, relatedly, there was still the goal of "compile a program using
less than 4MB of RAM or so", where my existing compiler needs
significantly more than this.

Like, needing 180MB to recompile something like GLQuake from source
effectively precludes running the compiler on the FPGA boards, at least
without using virtual memory (this being a big drawback of the "do all
the code generation for everything all at once" strategy; the footprint
could be kept lower with separate compilation and linking).

> * generate code of at least reasonable quality; not
> necessarily at the -O2 level of gcc or clang, but
> better than -O1
>

Code generation is still kinda hit-or-miss.

By default, it assumes some relatively conservative semantics:
Integers wrap on overflow (1);
No aliasing optimizations are performed (2).

*1: Some old programs, particularly my ROTT port, seem rather sensitive
to integer-overflow behavior (so, if built with GCC, they typically need
"-fwrapv -fno-strict-aliasing" and similar).

*2: It may cache memory loads, but treats any explicit memory store that
can't be proven not to alias as potentially aliasing.

Had experimented with TBAA, but I don't really consider it "sufficiently
safe" to be treated as a default option.

Generally, it also assumes that most pointers may be unaligned, with a
partial exception for things like "__m128 *" and similar, which assume
aligned access (there is a separate "__m128u" type for unaligned access).

There are "__aligned" and "__unaligned" modifiers for this, with
"__packed" also assumed as implying "__unaligned" (and is also implied
as a default it "#pragma pack(1)" or similar is used, ...).

Though, some of this would matter more for SuperH (which by default
assumed aligned memory access), whereas in my ISA most memory access
instructions allow unaligned access (with the main exception of the
128-bit load/store instructions). Performance may still be better if
alignment is preserved (also, packed structs carry a penalty in that the
fixed-displacement load/store ops only encode aligned cases, so
accessing a misaligned struct member requires a multi-instruction sequence).
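
A small sketch of these modifiers in use (the "__m128u" / "#pragma
pack(1)" spellings come from the text above; this is BGBCC-specific and
will not compile as standard C):

  __m128  *pa;          /* assumed 16-byte aligned: one load/store */
  __m128u *pu;          /* explicitly unaligned variant */

  #pragma pack(1)
  struct hdr {          /* pack(1) implies __packed / __unaligned */
      char     tag;
      uint32_t value;   /* misaligned: fixed-displacement load ops only
                           encode aligned cases, so accessing this takes
                           a multi-instruction sequence */
  };
  #pragma pack()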

Register allocation tends to be more naive than in GCC, as it generally
divides registers into one of two categories:
Statically assigned across the entire function;
Dynamically assigned within each basic block.
Dynamically assigned variables are spilled to memory at the end of the
block.

In many cases, this does increase the register pressure needed to get
"decent" performance, which is part of the incentive for my ISA design
having 64 GPRs (this often makes it possible to statically assign
*everything* into GPRs for many functions).

Though, I have noted that there seems to be a relative performance
advantage for code using large numbers of local variables, or large
amounts of state being updated inside of loops.

"Work per clock cycle" for my ISA seems to beat x86-64 for things like
functions with 100+ local variables that update "most of them" on each
loop iteration (as a single giant loop body).
Granted, this is a bit niche...

But, if you write something like:
  for(i=0; i<1000000; i++)
    dst[i]=src[i];
the relative performance of my ISA is crap...

Also, it doesn't currently perform optimizations such as loop unrolling
or function inlining.

But, the code generation kinda falls on its face for targets with a
smaller number of registers (such as 16).

Comparably, GCC is able to handle individual values flowing from one
basic block to another, which holds up better on targets that can't
throw a large number of GPRs at the problem (e.g., it doesn't turn into
excessive amounts of spill-and-fill).

> If you get to something roughly or possibly approximating the
> list above, please keep the group posted!

Not sure...

Trying to compete with GCC or friends on their "home turf" wasn't really
something I had considered.
