Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Minor idea for indirect target predictor
Date: Sun, 4 Jul 2021 22:08:27 -0500
Organization: A noiseless patient Spider
Lines: 309
Message-ID: <sbtt82$mdr$1@dont-email.me>
References: <bbbfd05b-e065-4d1a-85c9-4afdc0905722n@googlegroups.com>
<17f5bdce-25a9-4d68-8eca-c1554947b143n@googlegroups.com>
<sbe511$src$1@gioia.aioe.org> <sbfg9g$371$1@dont-email.me>
<sbfs7g$o15$1@dont-email.me> <sbg6bf$rrm$1@dont-email.me>
<sbhng6$17b$1@gioia.aioe.org> <sbhofu$o28$1@newsreader4.netcologne.de>
<sbhukl$4a3$1@dont-email.me> <sbi11k$m95$1@dont-email.me>
<sbjush$2gp$1@dont-email.me> <sbl8hi$q3f$1@dont-email.me>
<sbp4me$ruh$1@dont-email.me> <sbqrr5$880$1@dont-email.me>
<sbrv45$d0s$1@gioia.aioe.org> <sbsqiq$2mv$1@dont-email.me>
<sbsrmv$7rm$1@newsreader4.netcologne.de> <sbt9lt$bsb$1@dont-email.me>
<sbtaqn$iia$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 5 Jul 2021 03:10:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a4a69f89e77c2fbe2659f13a7c9f0746";
logging-data="22971"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19kjRbebDeKnDfoT5L63G2x"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:EUaGqksPEkyiER4IMrsEGksj5Jg=
In-Reply-To: <sbtaqn$iia$1@newsreader4.netcologne.de>
Content-Language: en-US

On 7/4/2021 4:56 PM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> schrieb:
>
>> One doesn't just have literals, but may also have:
>> * any symbols (variable / function name) within the program;
>> * the names of any visible function prototypes, typedefs, ...;
>> * Every signature string for every variable or function;
>> * ...
>
> Any of those should be ASCII, right?
>

Most are ASCII...

Actually, given that most symbols consist only of the characters:
A..Z
a..z
0..9
_

One could almost justify an "extra-compact" 6-bit encoding.

Say, for example, if one is using tagged pointers, one of the tags could
be interpreted as containing 9x 6-bit chars. In this case, strings
like "strncmp"/"_Complex"/... could be stored without using any
additional memory.
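
As a rough sketch of the idea (the tag value, layout, and names here are
made up for illustration, not BGBCC's actual encoding), packing up to 9
such characters into the low 54 bits of a 64-bit tagged value could look
like:

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* hypothetical 10-bit tag placed in bits 54..63 */
  #define TAG_STR6  0x3FCull

  static int enc6(char c) {
      if (c >= 'A' && c <= 'Z') return c - 'A' + 1;    /* 1..26  */
      if (c >= 'a' && c <= 'z') return c - 'a' + 27;   /* 27..52 */
      if (c >= '0' && c <= '9') return c - '0' + 53;   /* 53..62 */
      if (c == '_')             return 63;
      return -1;                                       /* not encodable */
  }

  /* Returns 0 on success, -1 if the string is too long or
     contains characters outside the 6-bit set. */
  int pack_str6(const char *s, uint64_t *out) {
      size_t n = strlen(s);
      if (n > 9) return -1;                   /* 9*6 = 54 payload bits */
      uint64_t v = 0;
      for (size_t i = 0; i < n; i++) {
          int c6 = enc6(s[i]);
          if (c6 < 0) return -1;
          v |= (uint64_t)c6 << (6 * i);       /* char i in bits 6i..6i+5 */
      }
      *out = (TAG_STR6 << 54) | v;            /* a 0 slot marks end-of-string */
      return 0;
  }
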

For normal strings, non-ASCII cases are a small minority; this is much
less true of binary data, though.

>> So, having a space-efficient representation for string data can be
>> relevant, as can interning any short/recurring strings.
>
> Hmm... ok...
>

If one stores every copy of every string, this costs a lot of extra
memory. Merging them is generally effective.

The main exception is large one-off strings, which are better if not
interned. Similarly, the longer a string is, the less likely it is to
repeat.
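
A minimal sketch of this sort of interning, with a length cutoff so that
big one-off strings are left alone (the names, cutoff, and table size
here are illustrative assumptions, not BGBCC's actual code):

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define INTERN_MAX_LEN  256   /* hypothetical cutoff; long strings rarely repeat */
  #define INTERN_BUCKETS  4096

  typedef struct InternNode {
      struct InternNode *next;
      char *str;
  } InternNode;

  static InternNode *intern_tab[INTERN_BUCKETS];

  static uint32_t hash_str(const char *s) {
      uint32_t h = 2166136261u;               /* FNV-1a */
      while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
      return h;
  }

  const char *intern_string(const char *s) {
      if (strlen(s) > INTERN_MAX_LEN)
          return strdup(s);                   /* large one-off: don't intern */
      uint32_t idx = hash_str(s) % INTERN_BUCKETS;
      for (InternNode *n = intern_tab[idx]; n; n = n->next)
          if (!strcmp(n->str, s))
              return n->str;                  /* reuse the existing copy */
      InternNode *n = malloc(sizeof *n);
      n->str = strdup(s);
      n->next = intern_tab[idx];
      intern_tab[idx] = n;
      return n->str;
  }
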

>>
>>
>> As noted, in my compiler, inline ASM blobs, and also entire ASM modules
>> (post preprocessor) may also be passed through the IR stage via string
>> literals, as well as some binary data blobs, such as "Resource WAD"
>> lumps, ...
>
> I would think that there is a difference between user strings
> (which may be UTF-8 or whatever) and internal representation of
> variable names. Those are two rather different things, which I
> would probably put in separate code paths.
>

One could do so, but in this case things developed along a path where it
ended up being easier to treat all of these as essentially special cases
of strings at the IL layer.

There are different opcodes for different literal types, but they are
built on top of the same underlying serialization mechanism.

>> Essentially, pretty much the entire front-end of the compiler is fed
>> through something that is essentially comparable to a binary-coded
>> version of PostScript, and stored temporarily as ".ril" (RIL3) files.
>
> Hm, interesting. Is there a particular reason why you use files instead
> passing things through data structures in memory?
>

Memory buffers are used if one is compiling code to a binary directly.
This isn't always the case though, such as with static libraries and
similar.

At some point, there is a need to be able to store partially-compiled
code to files, so that it can be "linked" into a program at some other
point.

When compiling a program, the compiler will fetch the ".ril" files
containing things like the C library and similar, and combine them with
the RIL code from the program currently being compiled.

And, if one is compiling a library, the frontend does its thing and then
the blob of RIL code is stored to a file.

In some sense, RIL is derived from an earlier form of the BGBScript
bytecode (along with the AST format).

Brief history:

Very long ago, I wrote some script-language interpreters...

My first script language was based on Scheme (this was back when I was
in high school), using a fairly naive interpreter design (recursive
expression rewriting); it kinda sucked... I did partly implement a 3D
engine based around it though, rendering stuff using OpenGL (with a lot
of code copy/pasted from the Quake engine and similar).

I started programming a bit earlier, but most of my code from middle
school and earlier was lost due to an HDD crash.

My second script language (BGBScript, VM-0) was essentially a crappy
JavaScript clone built on top of DOM and a hacked and extended version
of XML-RPC / Jabber-RPC, and its performance was abysmal even vs the
Scheme clone.

Not long after, I rewrote the BGBScript VM (VM-1, or 1.0), partly using
the core of the Scheme VM, but rewriting the front-end. This version
effectively used Scheme as its IR stage, with the parser rewriting the
JS-style code into Scheme. For some extra speed (by my standards at the
time), I replaced the expression-rewriting backend with a bytecode
interpreter (stack-machine based, inspired by PostScript).

Then, getting to BGBCC:
I copy/pasted much of the parser from VM-0, though replacing the
original DOM with my own re-implementation;
And, a lot of the rest of the bytecode compiler parts from VM-1;
I then glued these parts together, and modified the parser to deal with
C (sorta);
And, wrote a backend to translate the stack bytecode to x86 machine code.

I went with the older DOM-based parser mostly because I soon noted that
a big drawback of using a Scheme-based AST format was that it was rather
inflexible compared with XML.

At the time, BGBCC was basically unusable garbage, and I couldn't get
its reliability to anything near what could be considered "acceptable".

I then returned to working on BGBScript, back-porting some of the parts
I had written for BGBCC into BGBScript:
It gained a static typesystem (and a bunch of new syntax "borrowed" from
ActionScript3);
It also switched to a statically-typed VM core;
It also gained a JIT compiler.

I redubbed this new language "BGBScript-1.5", and around the same time I
also wrote a 3D engine I called "BGBTech", which initially started off
trying to be sort of like Quake with a Doom3 inspired renderer, but then
switched over to being a Minecraft clone (while still using a Doom3
inspired renderer).

Some videos for this engine are still on YouTube, eg:
https://www.youtube.com/watch?v=WTrfNAfWV8o

This 3D engine was basically a mix of C and BGBScript.
It suffered from crappy performance, excessive memory usage, and
frequent stalls due to the use of a conservative mark/sweep garbage
collector.

Not long after, I did a ground up rewrite of both the BGBScript VM and
BGBTech, creating BGBScript2 and BGBTech2.

The BGBScript2 VM (BS2) differed partly in that its compiler switched
over to essentially using JSON as its AST format, and aimed mostly to be
a cleaned up and simplified version of what the original BGBScript had
become.

The BGBTech2 engine was also a lot more focused on performance (simpler
renderer, and also got rid of the use of a garbage collector), ...
I also switched mostly from 3D models to sprite graphics.

Example here:
https://www.youtube.com/watch?v=LGB53L7f314

Most of the renderer and similar is written in C, whereas most of the
"game logic" and similar is written in BS2.

Not long after, my first ISA project (BJX1) got started (based on an
extended version of the SuperH / SH-4 ISA).

In this case, I sort of revived BGBCC, and in the process replaced the
DOM clone with an AST format partly derived from the BS2 compiler, but
with an API wrapper to mimic the prior DOM-style API. So, at this point,
it is essentially sort of like JSON pretending to be XML (and RIL3 is
still mostly derived from the original BGBScript-1.0 bytecode).

After not too long, issues with the BJX1 design became pretty obvious,
so I did a partial reboot of the ISA design, which I called BJX2, which
then gained a lot more features (and some amount of cruft).

Similarly, BGBCC also contains a partial re-implementation of
BGBScript2, just targeting BJX2 rather than a VM. This uses mostly the
same code as when compiling C (mostly enables/disables parser features).

Similarly, some amount of BS2 features are exposed in C (in BGBCC) as
language extensions.

Though, my BJX2 project did basically eat up all my time, so I ended up
not doing much more on the BGBTech2 engine after that. Kinda considered
resuming work on it on/off a few times, but never got back around to
working on it.

Well, and also I end up writing significantly more code in C for BJX2
than I do in BS2.

If anything, BS2 in this context can be compared to C++, which I also
tend not to use all that much. Though, I personally consider BS2's
design to be "a lot less evil" than C++.

Ironically, making a compromise between these languages would end up
looking (syntax wise) fairly similar to C#, so it would almost be sort
of like if one could have C# on top of C's memory model.

I made an effort in BS2 to try to limit the presence of "unreasonable"
runtime overheads compared with C, but doing this (while also allowing
for "high-level" features) is difficult.

Yeah... This is basically my life history up to this point...

Never really did much "professionally" though; it is pretty rare that I
can "actually get a job", and then almost invariably stuff comes along
that causes it to end after not that long (most recently, covid), and
then it's back to no one wanting to hire me to do anything, so alas...

I've pretty much never gotten any programming jobs, though.

> (This is one of the things I really do not like about standard UNIX
> compilation, writing out the code in assembler and then reading
> it back and parsing it again seems like a waste.)
>
>> When the final binary is to be compiled, these RIL3 files are
>> decoded/interpreted, which essentially builds all of the structures used
>> by the backend (and is also translated into 3AC and basic-blocks via
>> this interpretation process).
>
> OK.
>
>> It is possible that RIL3 files could be considered also as input to an
>> AOT compiler, but the main debate is whether the memory overhead would
>> be low enough to be "acceptable" (its design would basically depend on
>> the ability to be able to keep everything in RAM during the AOT
>> compilation process, though possible workarounds may exist).
>
> I can understand that you want your compiler to be able to run on
> your own hardware, but...
>
>
>> Ideally, one would want the ability to have a single-pass AOT compiler
>> which can fit in a memory footprint of, say, under 4MB or so, and where
>> its memory footprint doesn't increase drastically when compiling a
>> larger or more complex program (ignoring the space needed to store the
>> intermediate program sections or "object code").
>
> ...that seems to be a rather harsh requirement these days. I am
> all for conserving resources, but I personally would probably
> rather go for a cross-compilation than trying to fit a compiler
> into 4MB. (Then again, I work in the gcc framework, where such
> restrictions do not exists).
>

I probably can't make the full compiler work absent virtual memory or
similar, but an AOT could be nice.

I have thus far mostly been running stuff via compiling to native binaries.

Something like an AOT could allow some amount of abstraction between the
program and the underlying hardware, but also needs to "not suck" and be
lightweight enough to not kill the machine it runs on.

I don't necessarily want Dalvik or CIL though, due to both options
failing to really do an adequate job at being a good target for C code.

>>
>>
>>> The best way is probably to translate to that type on input, do
>>> all stuff that needs to be done on that simple data type and then
>>> write it to the desired format on output.
>>>
>>
>> Could be.
>>
>> As noted, convert-on-output was how it worked previously, but figured it
>> needed to be changed given doing it this way led to some ambiguities
>> regarding the parsing and handling of C string literals.
>>
>> Luckily, the literals were already capable of handling arbitrary binary
>> data, so this part wasn't a huge leap.
>
> Including null bytes? That's good design, then.
>

NUL bytes, yes, these are escape-coded as part of the M-UTF-8 encoding.

M-UTF-8 escapes a NUL byte as a 2-byte sequence, which allows it to be
passed through without it truncating the string. The actual non-escaped
NUL can still serve as an EOB marker.

Not exactly the most efficient way to encode binary data, but it
basically works for this purpose.
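
As a sketch of the escaping step, assuming the conventional
Modified-UTF-8 choice of encoding an embedded NUL as the overlong pair
0xC0 0x80 (the exact byte sequence in BGBCC may differ):

  #include <stdint.h>
  #include <stddef.h>

  /* Escape 'n' bytes of arbitrary binary data so the output contains
     no raw 0x00 bytes; a real NUL then still works as an EOB marker.
     'dst' must have room for up to 2*n+1 bytes. Returns bytes written
     (excluding the terminator). */
  size_t mutf8_escape_nul(const uint8_t *src, size_t n, uint8_t *dst) {
      size_t j = 0;
      for (size_t i = 0; i < n; i++) {
          if (src[i] == 0x00) {
              dst[j++] = 0xC0;    /* 2-byte overlong encoding of U+0000 */
              dst[j++] = 0x80;
          } else {
              dst[j++] = src[i];  /* other bytes pass through unchanged */
          }
      }
      dst[j] = 0x00;              /* the actual NUL terminates the buffer */
      return j;
  }
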
