devel / comp.arch / Dense machine code from C++ code (compiler optimizations)

Subject -- Author
* Dense machine code from C++ code (compiler optimizations) -- Marcus
+* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|`- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
+* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|+* Re: Dense machine code from C++ code (compiler optimizations) -- robf...@gmail.com
||`* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
|| `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||  +- Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
||  `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||   +* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
||   |`- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||   `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||    `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||     `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||      +* Re: Dense machine code from C++ code (compiler optimizations) -- robf...@gmail.com
||      |+- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||      |`* Re: Thor (was: Dense machine code...) -- Marcus
||      | `* Re: Thor (was: Dense machine code...) -- robf...@gmail.com
||      |  +- Re: Thor -- EricP
||      |  `* Re: Thor (was: Dense machine code...) -- Marcus
||      |   `- Re: Thor (was: Dense machine code...) -- robf...@gmail.com
||      `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||       `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||        `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||         `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||          `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||           `* Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            +* Re: Testing with open source games (was Dense machine code ...) -- Terje Mathisen
||            |`* Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            | +- Re: Testing with open source games (was Dense machine code ...) -- Terje Mathisen
||            | `* Re: Testing with open source games (was Dense machine code ...) -- James Van Buskirk
||            |  `- Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            `- Re: Testing with open source games (was Dense machine code ...) -- BGB
|`* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
| +* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
| |+- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
| |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| | `* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
| |  +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
| |  `- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| +* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| |`- Re: Dense machine code from C++ code (compiler optimizations) -- Paul A. Clayton
| `* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|  `* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|   +* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|   |`* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|   | `- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|   `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|    +* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|    |`- Re: Dense machine code from C++ code (compiler optimizations) -- George Neuner
|    `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
|     `* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|      `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|       +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
|       |`- Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|       +- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|       `- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
`* Re: Dense machine code from C++ code (compiler optimizations) -- Ir. Hj. Othman bin Hj. Ahmad
 +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
 `* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
  +* Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |`* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  | `* Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |  +* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |  |`- Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |  `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
  |   `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |    +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |    `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
  |     `- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  +* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  | `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |  `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  |   +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |   |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  |   | `- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |   `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |    `- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  `* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |`- Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations) -- James Van Buskirk
   |`* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   | +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |+- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
   | |`* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
   | | `* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   | |  `* Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | |   `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |    `* Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | |     `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |      `- Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | `* Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  +* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   |  |+* Re: Dense machine code from C++ code (compiler optimizations) -- Guillaume
   |  ||+* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |  |||+- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
   |  |||`* Re: Dense machine code from C++ code (compiler optimizations) -- Guillaume
   |  ||| +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |  ||| |`* Re: Dense machine code from C++ code (compiler optimizations) -- Andreas Eder
   |  ||| `- Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  ||`- Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  |`* Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  `* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
   `- Re: Dense machine code from C++ code (compiler optimizations) -- Andreas Eder

Dense machine code from C++ code (compiler optimizations)

<sndun6$q07$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22073&group=comp.arch#22073

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
 by: Marcus - Sun, 21 Nov 2021 17:13 UTC

Just wrote this short post. Maybe someone finds it interesting...

https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/

/Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<sneejq$o25$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=22075&group=comp.arch#22075

From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me>
 by: Terje Mathisen - Sun, 21 Nov 2021 21:44 UTC

Marcus wrote:
> Just wrote this short post. Maybe someone finds it interesting...
>
> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>
> /Marcus
I didn't know that your target machine has bit field insert/extract
opcodes, otherwise the code generated was exactly as I expected. :-)

Terje
PS. Mill would get pretty much identical results, typically inlined so
no RET opcode.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<snegcq$n03$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22076&group=comp.arch#22076

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me>
 by: BGB - Sun, 21 Nov 2021 22:14 UTC

On 11/21/2021 11:13 AM, Marcus wrote:
> Just wrote this short post. Maybe someone finds it interesting...
>
> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>

I guess this points out one limitation of my compiler (relative to GCC)
is that for many cases it does a fairly direct translation from C source
to the machine code.

It will not optimize any "high level" constructions, but instead sort of
depends on the programmer to write "reasonably efficient" C.

Such a case would not turn out nearly so nice in my compiler though (if
it supported C++), but alas.

Trying to port GCC looks like a pain though, as its codebase is pretty
hairy and it takes a fairly long time to rebuild from source (compared
with my compiler; which rebuilds in a few seconds).

Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
seemingly takes ~ 20 seconds to recompile Doom.

There are a few cases, like it will recognize a few special cases for
memcpy and convert them into inline sequences or register moves, but in
most areas it is still "pretty dumb".
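
For illustration (the exact patterns matched aren't spelled out here, so
this is just a plausible case), a small constant-size copy is the sort of
thing that can be inlined:

  #include <string.h>
  #include <stdint.h>

  void copy_u64(uint64_t *dst, const uint64_t *src)
  {
      /* constant, register-sized length: can lower to a load/store
         pair (or a register move) instead of a call to memcpy */
      memcpy(dst, src, sizeof(*dst));
  }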

Also doesn't perform function inlining or other high-level
transformations, and at its current defaults does not make use of strict
aliasing (default case is more conservative).

I have also noticed that its code-generation seems to be a little
overzealous with the application of "volatile", since it isn't always
obvious which exact aspect needs to be volatile. At present it turns
anything that touches a volatile variable into fairly obvious
"Ld, Op, St" or "Ld, Ld, Op, St" sequences (it effectively disables the
use of the register allocator).

Otherwise, one would have to possibly add some way to distinguish:
The value of this variable is volatile;
Vs, the memory at the address pointed to by this volatile pointer is
volatile.
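
(For reference, C itself already spells the two cases differently at the
declaration level; the question is whether the compiler tracks that
difference internally:)

  volatile int *p;           /* the pointed-to int is volatile; every *p
                                access must really touch memory          */
  int * volatile q;          /* the pointer variable itself is volatile;
                                what it points at is ordinary memory     */
  volatile int * volatile r; /* both */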

It does optimize a few minor things though, like:
unsigned int ui, uj;
...
uj = ui % 16;
Will get transformed into:
uj = ui & 15; //applies to power-of-2 and unsigned
...

Divide by constant may also be replaced with multiply-by-reciprocal and
similar (normally divide and modulo are implemented via function calls
into the C runtime).

So, internally:
c = a / b;
Maps to, say:
c = __sdivsi3(a, b); //no hardware divide operation
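
When the divisor is a known constant, the reciprocal form avoids that call
entirely. A worked example for one constant (the standard magic value for
an unsigned 32-bit divide by 10; a generic sketch, not BGBCC output):

  #include <stdint.h>

  uint32_t div10(uint32_t x)
  {
      /* 0xCCCCCCCD == ceil(2^35 / 10); the small excess never pushes the
         truncated product past the true quotient for any 32-bit x. */
      return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
  }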

Also recently added optimizations which transform, eg:
c = a + 0;
d = 0 | a;
if(a==a) { ... } /* if 'a' is not floating-point */
...
Into:
c = a;
d = a;
if(1) { ... }
...

Did also recently add an optimization to reduce the number of internal
type promotion-conversions used for operators like "==" and similar (in
cases where doing so was unnecessary).

...

There is an optimization which eliminates stack-frame creation for small
leaf functions:
Can't call anything (including implicit runtime calls);
Can't access global variables (ABI reasons);
Can't use any operations which allocate temporary registers;
Limited to 10 variables (arguments + locals + temporaries);
Can't take the address of any variable;
...

This basically means that these functions can use:
Basic pointers and simple arrays;
Pointers to structs (but not value-type structs);
A basic set of ALU operators and casts.
Safe: +, -, &, |, ^, <<, >>
Sometimes: * (small types only), % (unsigned with small power-of-2)
Many other cases: *, /, % may use temporaries and/or calls.
Basic integer and primitive pointer types.
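
A function that stays inside those limits might look like this
(illustrative only, not taken from real code):

  /* small leaf: no calls, no globals, no address-taken locals, only
     basic pointer/ALU operations -> no stack frame should be needed */
  int sum3(int *p)
  {
      return p[0] + p[1] + p[2];
  }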

But, then recently found/fixed a bug where it was sometimes still trying
to spill variables to the stack despite not having a stack-frame, which
was in turn corrupting the stack frame of the caller.

And, a bug where declaring an array like:
char *foo[] = { ... };
...
n=sizeof(foo);
Was giving the size as "sizeof(char *)", rather than the size the array
was initialized to (and currently still only works for global and static
arrays).
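
For reference, what the language requires there:

  #include <stddef.h>

  const char *foo[] = { "a", "b", "c" };
  size_t n     = sizeof(foo);                  /* 3 * sizeof(char *)      */
  size_t count = sizeof(foo) / sizeof(foo[0]); /* the usual element count */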

...

And, all this was while trying to hunt down another bug which seems to
result in Doom demos sometimes desyncing (in a way different from the
x86 builds); also a few other behavioral anomalies which have shown up
as bugs in Quake, ...

Well, also the relatively naive register allocation strategy doesn't help:
For each basic block, whenever a variable is referenced (that is not
part of the "statically reserved set"), it is loaded into a register
(and temporarily held there), and at the end of the basic-block,
everything is spilled back to the stack.

There are, in theory, much better ways to do register allocation.

Though, one useful case is that the register space is large enough to
where a non-trivial number of functions can use a "statically reserve
everything" special case. This can completely avoid spills, but only for
functions within a strict limit for the number of variables (limited by
the number of callee save registers, or ~ 12 variables with 32 GPRs).

For most functions, this case still involves the creation of a stack
frame and similar though (mostly to save/restore registers).

This case still excludes using a few features:
Use of structs as value types (structs may only be used as pointers);
Taking the address of any variable;
Use of VLAs or alloca;
...

But, this case does allow a few things:
Calling functions;
Operators which use scratch registers;
Accessing global variables;
...

With the expanded GPR space, the scope of the "statically assign
everything" case could be be expanded, but still haven't gotten around
to making BGBCC able to use all 64 GPRs directly (and I still don't
consider them to be part of the baseline ISA, ...). This could (in
theory) expand the limit to ~ 28 or so.

If BGBCC supported C++, I don't have much confidence for how it would
deal with templates.

But, otherwise, mostly more trying to find and fix bugs and similar at
the moment. But, often much of the effort is trying to actually find the
bugs (since "demo desyncs for some reason" isn't super obvious as to why
this is happening, or where in the codebase the bug is being triggered,
...).

> /Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22078&group=comp.arch#22078

From: robfi...@gmail.com (robf...@gmail.com)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 by: robf...@gmail.com - Sun, 21 Nov 2021 22:55 UTC

cc64 compiler cheats and has a bit-slice operator that allows the compiler
to see directly when bit-field operations are needed.
A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.

Re: Dense machine code from C++ code (compiler optimizations)

<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22079&group=comp.arch#22079

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 <06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
 by: MitchAlsup - Sun, 21 Nov 2021 23:27 UTC

On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
> cc64 compiler cheats and has a bit-slice operator that allows the compiler
> to see directly when bit-field operations are needed.
> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
<
That is not cheating !
The semantic of the program has been obeyed.
>
{Although in that case, a single left shift would have sufficed.}
>
Although::
struct { uint64_t a: 20,
                  filler: 19,
                  c: 18; } t;
>
t.a=t.c //does need an extract and an insert
>

Re: Dense machine code from C++ code (compiler optimizations)

<snf8ui$p8n$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22087&group=comp.arch#22087

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 <06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
 <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
 by: BGB - Mon, 22 Nov 2021 05:13 UTC

On 11/21/2021 5:27 PM, MitchAlsup wrote:
> On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
>> cc64 compiler cheats and has a bit-slice operator that allows the compiler
>> to see directly when bit-field operations are needed.
>> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
> <
> That is not cheating !
> The semantic of the program has been obeyed.
>>
> {Although in that case, a single left shift would have sufficed.}
>>
> Although::
> struct { uint64_t a: 20,
> filler: 19,
> c: 18; } t;
>>
> t.a=t.c //does need an extract and an insert
>>

A bit-slice operator could be useful, although BJX2 lacks explicit
bit-extract or bit-insert operations (they need to be built via shift
and mask).

It seems like both extract and insert could be generalized into a
combined "shift-and-mask" operator. Though, this would require a
mechanism to either supply or create the bit-mask.

Eg:
Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
Or (4R):
Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).

So, if the Mask is all ones, it behaves as a normal shift, but otherwise
one masks off the bits they want to keep from the destination register.

Could synthesize a mask from a shift and count, harder part is coming up
with a way to do so cheaply (when the shift unit is already in use this
cycle).

Could be done more simply via multiple ops:
SHLD.Q (left as-is)
BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))

So, extract is, say:
SHLD.Q R4, -24, R3
AND 511, R3
And, insert is, say:
MOV 511, R7
SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
BITSEL R6, R7, R8

Still need to think about it.

Or such...
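
For comparison, the same shapes spelled as plain shift-and-mask C (field
width/positions taken from the example above; helper names are made up):

  #include <stdint.h>

  /* BITSEL-style merge: bits from a where m is 1, from b where m is 0 */
  static inline uint64_t bitsel(uint64_t a, uint64_t m, uint64_t b)
  {
      return (a & m) | (b & ~m);
  }

  /* extract: 9-bit field starting at bit 24 */
  static inline uint64_t extract9(uint64_t x)
  {
      return (x >> 24) & 511;
  }

  /* insert: put a 9-bit value into bits 52..60 of dst */
  static inline uint64_t insert9(uint64_t dst, uint64_t v)
  {
      uint64_t mask = 511ull << 52;  /* mask synthesized from width+position */
      return bitsel(v << 52, mask, dst);
  }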

Re: Dense machine code from C++ code (compiler optimizations)

<snfc16$647$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22088&group=comp.arch#22088

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 <06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
 <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
 <snf8ui$p8n$1@dont-email.me>
 by: Ivan Godard - Mon, 22 Nov 2021 06:06 UTC

On 11/21/2021 9:13 PM, BGB wrote:
> On 11/21/2021 5:27 PM, MitchAlsup wrote:
>> On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com
>> wrote:
>>> cc64 compiler cheats and has a bit-slice operator that allows the
>>> compiler
>>> to see directly when bit-field operations are needed.
>>> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
>> <
>> That is not cheating !
>> The semantic of the program has been obeyed.
>>>
>> {Although in that case, a single left shift would have sufficed.}
>>>
>> Although::
>> struct { uint64_t     a: 20,
>>                              filler: 19,
>>                                   c: 18; } t;
>>>
>> t.a=t.c //does need an extract and an insert
>>>
>
> A bit-slice operator could be useful, although BJX2 lacks explicit
> bit-extract or bit-insert operations (they need to be built via shift
> and mask).
>
> It seems like both extract and insert could be generalized into a
> combined "shift-and-mask" operator. Though, this would require a
> mechanism to either supply or create the bit-mask.
>
> Eg:
>   Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
> Or (4R):
>   Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).
>
> So, if the Mask is all ones, it behaves as a normal shift, but otherwise
> one masks off the bits they want to keep from the destination register.
>
> Could synthesize a mask from a shift and count, harder part is coming up
> with a way to do so cheaply (when the shift unit is already in use this
> cycle).
>
>
> Could be done more simply via multiple ops:
>   SHLD.Q (left as-is)
>   BITSEL  Rm, Ro, Rn      // Rn=(Rm&Ro) | (Rn&(~Ro))
>   BITSEL  Rm, Ro, Rp, Rn  // Rn=(Rm&Ro) | (Rp&(~Ro))
>
>
> So, extract is, say:
>   SHLD.Q  R4, -24, R3
>   AND     511, R3
> And, insert is, say:
>   MOV     511, R7
>   SHLD.Q  R3, 52, R6 | SHLD.Q  R7, 52, R7
>   BITSEL  R6, R7, R8
>
>
> Still need to think about it.
>
>
> Or such...

Consider also signed extract, and overflow-checking insert.

Re: Dense machine code from C++ code (compiler optimizations)

<snffpt$o6p$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22089&group=comp.arch#22089

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 by: Marcus - Mon, 22 Nov 2021 07:10 UTC

On 2021-11-21 kl. 23:14, BGB wrote:
> On 11/21/2021 11:13 AM, Marcus wrote:
>> Just wrote this short post. Maybe someone finds it interesting...
>>
>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>
>
> I guess this points out one limitation of my compiler (relative to GCC)
> is that for many cases it does a fairly direct translation from C source
> to the machine code.
>
> It will not optimize any "high level" constructions, but instead sort of
> depends on the programmer to write "reasonably efficient" C.

I would suspect that. That was one of the points in my post: GCC and
Clang have many, many, many man-years of work built in, and it's very
hard to compete with them if you start fresh on a new compiler.

I also have a feeling that the C++ language is at a level today that
it's near impossible to write a new compiler from scratch. It's not only
about the sheer amount of language features (classes, lambdas,
templates, auto, constexpr, ...) and std library (STL, thread, chrono,
...), but it's also about the expectations about how the code is
optimized. C++ places a huge burden on the compiler to be able to
resolve lots of things at compile time (e.g. constexpr essentially
requires that the C++ code can be executed at compile time).
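
A small illustration of that burden (C++14-level, nothing exotic): even
this forces the front end to embed an evaluator for a sizeable chunk of
the language, because the assert has to be checked during compilation.

  constexpr unsigned popcount(unsigned x)
  {
      unsigned n = 0;
      for (; x != 0; x >>= 1)
          n += x & 1u;
      return n;
  }
  static_assert(popcount(0xA5u) == 4, "evaluated entirely at compile time");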

> Such a case would not turn out nearly so nice in my compiler though (if
> it supported C++), but alas.
>
> Trying to port GCC looks like a pain though, as its codebase is pretty
> hairy and it takes a fairly long time to rebuild from source (compared
> with my compiler; which rebuilds in a few seconds).

Yes, it has taken me years, and the code base and the build system is
not modern by a long shot. A complete rebuild of binutils +
bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
incremental build of GCC when some part of the machine description has
changed (e.g. an insn description was added) takes about a minute.

OTOH it would probably have taken me even longer to create my own
compiler (especially as I'm not very versed in compiler architecture),
so for me it was the less evil of options (I still kind of regret that
I didn't try harder with LLVM/Clang, though, but I have no evidence
that the grass is greener over there).

>
> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
> seemingly takes ~ 20 seconds to recompile Doom.
>

Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
Doom is 1.3 s (10 s without parallel compilation). The build time for
Quake is 1.6 s (13 s without parallel compilation).

But I agree that a fast compiler is worth a lot. I work with a DSP
compiler that can take ~5 minutes to compile a single object file that
takes ~10 seconds to compile with GCC. It's a real productivity killer.

>

[snip]

> And, all this was while trying to hunt down another bug which seems to
> result in Doom demos sometimes desyncing (in a way different from the
> x86 builds); also a few other behavioral anomalies which have shown up
> as bugs in Quake, ...
>

FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
16.16-bit fixed point operation (say, with 32-bit floating-point), or
get the result off by a single LSB, the demos will start running off
track very quickly ;-) I found this out back in the 1990s when I ported
Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
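
For reference, the exact-integer forms with 64-bit intermediates look
roughly like this (a sketch; the real engine additionally guards FixedDiv
against overflow and saturates):

  #include <stdint.h>

  typedef int32_t fixed_t;      /* 16.16 fixed point */
  #define FRACBITS 16

  static fixed_t FixedMul(fixed_t a, fixed_t b)
  {
      return (fixed_t)(((int64_t)a * b) >> FRACBITS);
  }

  static fixed_t FixedDiv(fixed_t a, fixed_t b)
  {
      return (fixed_t)(((int64_t)a << FRACBITS) / b);
  }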

>
> Well, also the relatively naive register allocation strategy doesn't help:
> For each basic block, whenever a variable is referenced (that is not
> part of the "statically reserved set"), it is loaded into a register
> (and temporarily held there), and at the end of the basic-block,
> everything is spilled back to the stack.
>
> There are, in theory, much better ways to do register allocation.
>
>
> Though, one useful case is that the register space is large enough to
> where a non-trivial number of functions can use a "statically reserve
> everything" special case. This can completely avoid spills, but only for
> functions within a strict limit for the number of variables (limited by
> the number of callee save registers, or ~ 12 variables with 32 GPRs).
>
> For most functions, this case still involves the creation of a stack
> frame and similar though (mostly to save/restore registers).
>
> This case still excludes using a few features:
>   Use of structs as value types (structs may only be used as pointers);
>   Taking the address of any variable;
>   Use of VLAs or alloca;
>   ...
>
> But, this case does allow a few things:
>   Calling functions;
>   Operators which use scratch registers;
>   Accessing global variables;
>   ...
>
>
> With the expanded GPR space, the scope of the "statically assign
> everything" case could be be expanded, but still haven't gotten around
> to making BGBCC able to use all 64 GPRs directly (and I still don't
> consider them to be part of the baseline ISA, ...). This could (in
> theory) expand the limit to ~ 28 or so.
>
>
> If BGBCC supported C++, I don't have much confidence for how it would
> deal with templates.
>
> But, otherwise, mostly more trying to find and fix bugs and similar at
> the moment. But, often much of the effort is trying to actually find the
> bugs (since "demo desyncs for some reason" isn't super obvious as to why
> this is happening, or where in the codebase the bug is being triggered,
> ...).
>
>
>> /Marcus
>

Re: Dense machine code from C++ code (compiler optimizations)

<snfgtc$st5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22090&group=comp.arch#22090

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <sneejq$o25$1@gioia.aioe.org>
 by: Marcus - Mon, 22 Nov 2021 07:29 UTC

On 2021-11-21 22:44, Terje Mathisen wrote:
> Marcus wrote:
>> Just wrote this short post. Maybe someone finds it interesting...
>>
>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>
>> /Marcus
> I didn't know that your target machine has bit field insert/extract
> opcodes,

It originally didn't, but when I found the sweet mc88k bit field trick I
replaced the bit shift instructions with bit field instructions. :-)

The IBF (Insert Bit Field) instruction is the latest addition (got
implemented three weeks ago). My VHDL implementation is kind of a mess
though (it has evolved organically from plain shift operations), and
eats too much FPGA logic resources ATM. I'll have to revisit that some
time.
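
Written out at the C level, the equivalence being leaned on is roughly
this (offsets/widths here are illustrative, not actual MRISC32 encodings):

  #include <stdint.h>

  /* unsigned bit-field extract: 'width' bits starting at 'offset' */
  static inline uint32_t ext_u(uint32_t x, unsigned offset, unsigned width)
  {
      uint32_t mask = (width < 32) ? ((1u << width) - 1u) : 0xFFFFFFFFu;
      return (x >> offset) & mask;
  }

  /* a plain logical shift right is just the extract that runs out to
     bit 31:  x >> n  ==  ext_u(x, n, 32 - n)  for 0 <= n < 32 */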

> otherwise the code generated was exactly as I expected. :-)
>
> Terje
> PS. Mill would get pretty much identical results, typically inlined so
> no RET opcode.

Well, for the article I decided to isolate the functions to make it
easier to follow, but otherwise you'd typically inline these kinds of
functions.

/Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<snfj23$7h1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22092&group=comp.arch#22092

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 <snffpt$o6p$1@dont-email.me>
 by: Ivan Godard - Mon, 22 Nov 2021 08:06 UTC

On 11/21/2021 11:10 PM, Marcus wrote:
> On 2021-11-21 kl. 23:14, BGB wrote:
>> On 11/21/2021 11:13 AM, Marcus wrote:
>>> Just wrote this short post. Maybe someone finds it interesting...
>>>
>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>
>>
>> I guess this points out one limitation of my compiler (relative to
>> GCC) is that for many cases it does a fairly direct translation from C
>> source to the machine code.
>>
>> It will not optimize any "high level" constructions, but instead sort
>> of depends on the programmer to write "reasonably efficient" C.
>
> I would suspect that. That was one of the points in my post: GCC and
> Clang have many, many, many man-years of work built in, and it's very
> hard to compete with them if you start fresh on a new compiler.
>
> I also have a feeling that the C++ language is at a level today that
> it's near impossible to write a new compiler from scratch. It's not only
> about the sheer amount of language features (classes, lambdas,
> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
> ...), but it's also about the expectations about how the code is
> optimized. C++ places a huge burden on the compiler to be able to
> resolve lots of things at compile time (e.g. constexpr essentially
> requires that the C++ code can be executed at compile time).
>
>> Such a case would not turn out nearly so nice in my compiler though
>> (if it supported C++), but alas.
>>
>> Trying to port GCC looks like a pain though, as its codebase is pretty
>> hairy and it takes a fairly long time to rebuild from source (compared
>> with my compiler; which rebuilds in a few seconds).
>
> Yes, it has taken me years, and the code base and the build system is
> not modern by a long shot. A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
> incremental build of GCC when some part of the machine description has
> changed (e.g. an insn description was added) takes about a minute.
>
> OTOH it would probably have taken me even longer to create my own
> compiler (especially as I'm not very versed in compiler architecture),
> so for me it was the less evil of options (I still kind of regret that
> I didn't try harder with LLVM/Clang, though, but I have no evidence
> that the grass is greener over there).
>
>>
>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
>> seemingly takes ~ 20 seconds to recompile Doom.
>>
>
> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
> Doom is 1.3 s (10 s without parallel compilation). The build time for
> Quake is 1.6 s (13 s without parallel compilation).
>
> But I agree that a fast compiler is worth a lot. I work with a DSP
> compiler that can take ~5 minutes to compile a single object file that
> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>
>>
>
> [snip]
>
>> And, all this was while trying to hunt down another bug which seems to
>> result in Doom demos sometimes desyncing (in a way different from the
>> x86 builds); also a few other behavioral anomalies which have shown up
>> as bugs in Quake, ...
>>
>
> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
> get the result off by a single LSB, the demos will start running off
> track very quickly ;-) I found this out back in the 1990s when I ported
> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>
>>
>> Well, also the relatively naive register allocation strategy doesn't
>> help:
>> For each basic block, whenever a variable is referenced (that is not
>> part of the "statically reserved set"), it is loaded into a register
>> (and temporarily held there), and at the end of the basic-block,
>> everything is spilled back to the stack.
>>
>> There are, in theory, much better ways to do register allocation.
>>
>>
>> Though, one useful case is that the register space is large enough to
>> where a non-trivial number of functions can use a "statically reserve
>> everything" special case. This can completely avoid spills, but only
>> for functions within a strict limit for the number of variables
>> (limited by the number of callee save registers, or ~ 12 variables
>> with 32 GPRs).
>>
>> For most functions, this case still involves the creation of a stack
>> frame and similar though (mostly to save/restore registers).
>>
>> This case still excludes using a few features:
>>    Use of structs as value types (structs may only be used as pointers);
>>    Taking the address of any variable;
>>    Use of VLAs or alloca;
>>    ...
>>
>> But, this case does allow a few things:
>>    Calling functions;
>>    Operators which use scratch registers;
>>    Accessing global variables;
>>    ...
>>
>>
>> With the expanded GPR space, the scope of the "statically assign
>> everything" case could be be expanded, but still haven't gotten around
>> to making BGBCC able to use all 64 GPRs directly (and I still don't
>> consider them to be part of the baseline ISA, ...). This could (in
>> theory) expand the limit to ~ 28 or so.
>>
>>
>> If BGBCC supported C++, I don't have much confidence for how it would
>> deal with templates.
>>
>> But, otherwise, mostly more trying to find and fix bugs and similar at
>> the moment. But, often much of the effort is trying to actually find
>> the bugs (since "demo desyncs for some reason" isn't super obvious as
>> to why this is happening, or where in the codebase the bug is being
>> triggered, ...).
>>
>>
>>> /Marcus
>>
>

The rule of thumb twenty years ago was that a new production-grade
compiler cost $100M and five years. I doubt the cost has gone down. The
Mill tool chain, even using clang for front and middle end and not
including linking, is ~30k lines of pretty tight C++. That ain't cheap.

Re: Dense machine code from C++ code (compiler optimizations)

<snfohs$abd$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22093&group=comp.arch#22093

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
 <snffpt$o6p$1@dont-email.me>
 by: BGB - Mon, 22 Nov 2021 09:40 UTC

On 11/22/2021 1:10 AM, Marcus wrote:
> On 2021-11-21 kl. 23:14, BGB wrote:
>> On 11/21/2021 11:13 AM, Marcus wrote:
>>> Just wrote this short post. Maybe someone finds it interesting...
>>>
>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>
>>
>> I guess this points out one limitation of my compiler (relative to
>> GCC) is that for many cases it does a fairly direct translation from C
>> source to the machine code.
>>
>> It will not optimize any "high level" constructions, but instead sort
>> of depends on the programmer to write "reasonably efficient" C.
>
> I would suspect that. That was one of the points in my post: GCC and
> Clang have many, many, many man-years of work built in, and it's very
> hard to compete with them if you start fresh on a new compiler.
>
> I also have a feeling that the C++ language is at a level today that
> it's near impossible to write a new compiler from scratch. It's not only
> about the sheer amount of language features (classes, lambdas,
> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
> ...), but it's also about the expectations about how the code is
> optimized. C++ places a huge burden on the compiler to be able to
> resolve lots of things at compile time (e.g. constexpr essentially
> requires that the C++ code can be executed at compile time).
>

I have currently put full C++ support on the "not going to do it" shelf...

There is a subset along similar lines to EC++, but hardly any C++ code
(which actually uses the STL or C++ standard library) would actually
work with such a subset.

>> Such a case would not turn out nearly so nice in my compiler though
>> (if it supported C++), but alas.
>>
>> Trying to port GCC looks like a pain though, as its codebase is pretty
>> hairy and it takes a fairly long time to rebuild from source (compared
>> with my compiler; which rebuilds in a few seconds).
>
> Yes, it has taken me years, and the code base and the build system is
> not modern by a long shot. A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
> incremental build of GCC when some part of the machine description has
> changed (e.g. an insn description was added) takes about a minute.
>

Rebuilding GCC for RISC-V took around 20 minutes on my 2700X, but
granted, this was on a platter drive which was running low on available
disk space; and on WSL.

Rebuilding BGBCC with MSVC takes around 3 seconds.
Rebuilding BGBCC with GCC takes around 56 seconds.

> OTOH it would probably have taken me even longer to create my own
> compiler (especially as I'm not very versed in compiler architecture),
> so for me it was the less evil of options (I still kind of regret that
> I didn't try harder with LLVM/Clang, though, but I have no evidence
> that the grass is greener over there).
>

I fiddled around with compilers and language design for a long time
before I got into ISA design or FPGAs (for me, I started fiddling with
writing language interpreters back in high-school; and BGBCC started out
as a fork off a project I started working on directly after high-school,
namely writing a custom JavaScript knock-off with the intention of using
it as a scripting language in my other projects).

Initially, BGBCC was also intended as a script-interpreter, just parsing
a C variant rather than a JS variant, but at the time I had found that C
was much worse suited to the task, and debugging a C compiler was
considerably harder than a JS-like compiler (despite their superficial
syntactic similarity).

But, then it beat around for some-odd years being used mostly as an "FFI
glue" tool.

I had BGBCC laying around from some past projects of mine. It was pretty
awful, and wasn't really used much for anything "non-trivial" but it
sorta had most of the basics in place.

For my newer ISA projects, I was like, "well, I will dust off what I
have and just use that".
Actual 2nd place option at the time was LCC, but it didn't look like LCC
would have offered that much over what I had already.

Though, BGBCC has expanded to around 5x as much code as it was when I
started this project. Most of this is in the backend, and a lot of new
special cases in the frontend, ...

It does have a little bit of funkiness in that it works very differently
from GCC. It does not use object files, but instead uses a
stack-oriented bytecode as an IR stage (and internally generates the 3AC
IR code from the stack-machine code).

...
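
Roughly the kind of translation involved, for a made-up statement (actual
BGBCC opcode names will differ):

  int demo(int a, int b)
  {
      int c;
      c = a + b * 2;
      /* stack-oriented IR:  LOAD a; LOAD b; LOADI 2; MUL; ADD; STORE c
         3AC derived from    t0 := b * 2
         the stack code:     c  := a + t0                               */
      return c;
  }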

>>
>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
>> seemingly takes ~ 20 seconds to recompile Doom.
>>
>
> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
> Doom is 1.3 s (10 s without parallel compilation). The build time for
> Quake is 1.6 s (13 s without parallel compilation).
>

I was invoking it via a similar approach to how one uses MSVC, namely:
gcc -o whatever ...all the source files... (options)

Generally, the compiler speed ranking tends to be:
MSVC, fastest;
BGBCC, 2nd fastest;
GCC and Clang, considerably slower.

Though, GCC and Clang are able to generate code that is typically a fair
bit faster than what MSVC can pull off.

BGBCC basically runs as a single thread in a single process.
It does cache loaded header files and similar in RAM, mostly because
opening/closing/reading files is fairly slow (and otherwise can become a
bottleneck).

> But I agree that a fast compiler is worth a lot. I work with a DSP
> compiler that can take ~5 minutes to compile a single object file that
> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>

Hmm...

I get annoyed waiting much more than a few seconds.

>>
>
> [snip]
>
>> And, all this was while trying to hunt down another bug which seems to
>> result in Doom demos sometimes desyncing (in a way different from the
>> x86 builds); also a few other behavioral anomalies which have shown up
>> as bugs in Quake, ...
>>
>
> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
> get the result off by a single LSB, the demos will start running off
> track very quickly ;-) I found this out back in the 1990s when I ported
> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>

I am handling these generally by using 64-bit integer math.

Previously, I had the Doom demos 1:1 between the x86 builds and the BJX2
builds (they would play out the same way), but recently something has
changed which is causing the 3rd demo in the demo-loop (for Ultimate
Doom, E3M5) to diverge near the end.

Namely, in the x86 builds (and previously on BJX2), the player would ram
into an imp and then get killed by a hell baron near the outside edge of
the central structure. Now, on BJX2, there is a divergence and the
player gets killed (by the baron) in a different location.

The main difference in this case seems to be that the imp behaves
differently and is in a different spot.

I had already been poking a bit at the division algo and similar, and
this seems to still be generating the correct results (things like
sign-extensions on shifts and similar are also tested via "sanity
checks", ...).

Behavior seems to still match up in Heretic and Hexen last I checked.

I have at least found and fixed some other bugs though.

The ROTT demos also desync pretty bad.

The behavior is also fairly stable in the face of switching around
codegen options, ..., implying it is probably not a low-level codegen issue.

In any case, this implies a behavioral / semantics issue somewhere.

Though, it does appear that this desync matches up with the behavior I
saw when I built this port of the Doom engine for the RasPi.

Or:
The first demo (E1M5) is desync'ed, but behavior is consistent between
all the targets;
The second demo (E2M2) remains in-sync and consistent on all targets;
The third demo (E3M5) diverges when the imp and baron are encountered,
with my BJX2 build switching from the x86 behavior to ARM/RasPi behavior
(for reasons I have not yet determined).

...

But, yeah, this stuff is super fiddly (and sensitive to very slight
disturbances).

For ROTT, there also seems to be a running RNG state, and differences in
enemy behavior will also cause the RNG to diverge, leading to a cascade
effect where everything goes to crap.

This is unlike Doom and friends where enemy behavior does not make use
of an RNG (ROTT deals with demos by initially reseeding the RNG with 0).

>>
>> Well, also the relatively naive register allocation strategy doesn't
>> help:
>> For each basic block, whenever a variable is referenced (that is not
>> part of the "statically reserved set"), it is loaded into a register
>> (and temporarily held there), and at the end of the basic-block,
>> everything is spilled back to the stack.
>>
>> There are, in theory, much better ways to do register allocation.
>>
>>
>> Though, one useful case is that the register space is large enough to
>> where a non-trivial number of functions can use a "statically reserve
>> everything" special case. This can completely avoid spills, but only
>> for functions within a strict limit for the number of variables
>> (limited by the number of callee save registers, or ~ 12 variables
>> with 32 GPRs).
>>
>> For most functions, this case still involves the creation of a stack
>> frame and similar though (mostly to save/restore registers).
>>
>> This case still excludes using a few features:
>>    Use of structs as value types (structs may only be used as pointers);
>>    Taking the address of any variable;
>>    Use of VLAs or alloca;
>>    ...
>>
>> But, this case does allow a few things:
>>    Calling functions;
>>    Operators which use scratch registers;
>>    Accessing global variables;
>>    ...
>>
>>
>> With the expanded GPR space, the scope of the "statically assign
>> everything" case could be be expanded, but still haven't gotten around
>> to making BGBCC able to use all 64 GPRs directly (and I still don't
>> consider them to be part of the baseline ISA, ...). This could (in
>> theory) expand the limit to ~ 28 or so.
>>
>>
>> If BGBCC supported C++, I don't have much confidence for how it would
>> deal with templates.
>>
>> But, otherwise, mostly more trying to find and fix bugs and similar at
>> the moment. But, often much of the effort is trying to actually find
>> the bugs (since "demo desyncs for some reason" isn't super obvious as
>> to why this is happening, or where in the codebase the bug is being
>> triggered, ...).
>>
>>
>>> /Marcus
>>
>


Re: Dense machine code from C++ code (compiler optimizations)

<eab2cd8b-a806-416c-a62f-247cb08825dbn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22097&group=comp.arch#22097
 by: Paul A. Clayton - Mon, 22 Nov 2021 14:50 UTC

On Monday, November 22, 2021 at 4:40:15 AM UTC-5, BGB wrote:
[snip]
> I get annoyed waiting much more than a few seconds.

No sword fighting?!
https://xkcd.com/303

Re: Dense machine code from C++ code (compiler optimizations)

<sngefb$j2l$1@newsreader4.netcologne.de>
https://www.novabbs.com/devel/article-flat.php?id=22098&group=comp.arch#22098
 by: Thomas Koenig - Mon, 22 Nov 2021 15:54 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:

> A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.

That is blindingly fast, it usually takes about half to 3/4 of an
hour with recent versions. Which version is your gcc port
based on?

Re: Dense machine code from C++ code (compiler optimizations)

<sngl2l$nsc$1@newsreader4.netcologne.de>
https://www.novabbs.com/devel/article-flat.php?id=22099&group=comp.arch#22099
 by: Thomas Koenig - Mon, 22 Nov 2021 17:47 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:

> The rule of thumb twenty years ago was that a new production-grade
> compiler cost $100M$ and five years. I doubt the cost has gone down.

That seems like a lot - 20 years ago, you could probably calculate with
a cost of less than 100 K$ per programmer per year. That would
mean a 200+ strong team working on this full time.

Re: Dense machine code from C++ code (compiler optimizations)

<sngl8p$laa$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22100&group=comp.arch#22100
 by: Marcus - Mon, 22 Nov 2021 17:50 UTC

On 2021-11-22 16:54, Thomas Koenig wrote:
> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>
>> A complete rebuild of binutils +
>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>
> That is blindingly fast, it usually takes about half to 3/4 on an
> hour with recent versions. Which version is your gcc port
> based on?
>

I'm on GCC trunk (12.0). Same with binutils and newlib.

I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
an NVMe drive (~3 GB/s read).

I build with "make -j28" (although the GNU toolchain build system is
poorly parallelizable, so most of the cores are idle most of the time).

BTW, my build script is here [1].

You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
always slower than Linux (esp. when it comes to build systems).

/Marcus

[1] https://github.com/mrisc32/mrisc32-gnu-toolchain
[2] https://www.bitsnbites.eu/benchmarking-os-primitives

Re: Dense machine code from C++ code (compiler optimizations)

<sngoqd$hgr$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22103&group=comp.arch#22103
 by: BGB - Mon, 22 Nov 2021 18:50 UTC

On 11/22/2021 2:06 AM, Ivan Godard wrote:
> On 11/21/2021 11:10 PM, Marcus wrote:
>> On 2021-11-21 kl. 23:14, BGB wrote:
>>> On 11/21/2021 11:13 AM, Marcus wrote:
>>>> Just wrote this short post. Maybe someone finds it interesting...
>>>>
>>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>>
>>>
>>> I guess this points out one limitation of my compiler (relative to
>>> GCC) is that for many cases it does a fairly direct translation from
>>> C source to the machine code.
>>>
>>> It will not optimize any "high level" constructions, but instead sort
>>> of depends on the programmer to write "reasonably efficient" C.
>>
>> I would suspect that. That was one of the points in my post: GCC and
>> Clang have many, many, many man-years of work built in, and it's very
>> hard to compete with them if you start fresh on a new compiler.
>>
>> I also have a feeling that the C++ language is at a level today that
>> it's near impossible to write a new compiler from scratch. It's not only
>> about the sheer amount of language features (classes, lambdas,
>> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
>> ...), but it's also about the expectations about how the code is
>> optimized. C++ places a huge burden on the compiler to be able to
>> resolve lots of things at compile time (e.g. constexpr essentially
>> requires that the C++ code can be executed at compile time).
>>
>>> Such a case would not turn out nearly so nice in my compiler though
>>> (if it supported C++), but alas.
>>>
>>> Trying to port GCC looks like a pain though, as its codebase is
>>> pretty hairy and it takes a fairly long time to rebuild from source
>>> (compared with my compiler; which rebuilds in a few seconds).
>>
>> Yes, it has taken me years, and the code base and the build system is
>> not modern by a long shot. A complete rebuild of binutils +
>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
>> incremental build of GCC when some part of the machine description has
>> changed (e.g. an insn description was added) takes about a minute.
>>
>> OTOH it would probably have taken me even longer to create my own
>> compiler (especially as I'm not very versed in compiler architecture),
>> so for me it was the less evil of options (I still kind of regret that
>> I didn't try harder with LLVM/Clang, though, but I have no evidence
>> that the grass is greener over there).
>>
>>>
>>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
>>> seemingly takes ~ 20 seconds to recompile Doom.
>>>
>>
>> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
>> Doom is 1.3 s (10 s without parallel compilation). The build time for
>> Quake is 1.6 s (13 s without parallel compilation).
>>
>> But I agree that a fast compiler is worth a lot. I work with a DSP
>> compiler that can take ~5 minutes to compile a single object file that
>> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>>
>>>
>>
>> [snip]
>>
>>> And, all this was while trying to hunt down another bug which seems
>>> to result in Doom demos sometimes desyncing (in a way different from
>>> the x86 builds); also a few other behavioral anomalies which have
>>> shown up as bugs in Quake, ...
>>>
>>
>> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
>> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
>> get the result off by a single LSB, the demos will start running off
>> track very quickly ;-) I found this out back in the 1990s when I ported
>> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>>
>>>
>>> Well, also the relatively naive register allocation strategy doesn't
>>> help:
>>> For each basic block, whenever a variable is referenced (that is not
>>> part of the "statically reserved set"), it is loaded into a register
>>> (and temporarily held there), and at the end of the basic-block,
>>> everything is spilled back to the stack.
>>>
>>> There are, in theory, much better ways to do register allocation.
>>>
>>>
>>> Though, one useful case is that the register space is large enough to
>>> where a non-trivial number of functions can use a "statically reserve
>>> everything" special case. This can completely avoid spills, but only
>>> for functions within a strict limit for the number of variables
>>> (limited by the number of callee save registers, or ~ 12 variables
>>> with 32 GPRs).
>>>
>>> For most functions, this case still involves the creation of a stack
>>> frame and similar though (mostly to save/restore registers).
>>>
>>> This case still excludes using a few features:
>>>    Use of structs as value types (structs may only be used as pointers);
>>>    Taking the address of any variable;
>>>    Use of VLAs or alloca;
>>>    ...
>>>
>>> But, this case does allow a few things:
>>>    Calling functions;
>>>    Operators which use scratch registers;
>>>    Accessing global variables;
>>>    ...
>>>
>>>
>>> With the expanded GPR space, the scope of the "statically assign
>>> everything" case could be be expanded, but still haven't gotten
>>> around to making BGBCC able to use all 64 GPRs directly (and I still
>>> don't consider them to be part of the baseline ISA, ...). This could
>>> (in theory) expand the limit to ~ 28 or so.
>>>
>>>
>>> If BGBCC supported C++, I don't have much confidence for how it would
>>> deal with templates.
>>>
>>> But, otherwise, mostly more trying to find and fix bugs and similar
>>> at the moment. But, often much of the effort is trying to actually
>>> find the bugs (since "demo desyncs for some reason" isn't super
>>> obvious as to why this is happening, or where in the codebase the bug
>>> is being triggered, ...).
>>>
>>>
>>>> /Marcus
>>>
>>
>
> The rule of thumb twenty years ago was that a new production-grade
> compiler cost $100M$ and five years. I doubt the cost has gone down. The
> Mill tool chain, even using clang for front and middle end and not
> including linking, is ~30k lines of pretty tight C++. That ain't cheap.

The current version of BGBCC is ~ 250k lines of C.
It was around 50k lines when I started.

Of this:
16k, C parser (includes preprocessor)
44k, middle stages (AST -> RIL, RIL -> 3AC, Typesystem, ...)
76k, BJX2 backend
20k, support code (memory manager, AST backend, ...)
40k, original SH4 / BJX1 backend;
30k, BSR1 backend
5k, Stuff for WAD2A and WAD4
...

ASTs (Parser, AST->RIL) use a system I had been calling BCCX, which is
organized in terms of nodes. In an abstract sense, a node contains:
A collection of key/value attributes;
A collection of zero or more child nodes.

BCCX-1 (Prior):
Supported 1-6 attributes directly;
Going beyond 6 attributes would use an array;
Nodes were organized via linked-lists.

BCCX-2 (Partial redesign):
Supports 1-8 attributes directly;
Going beyond 8 attributes causes the nodes to split B-Tree style;
Child nodes are organized via a radix-16 array.
Resembles a B-Tree being used as an array;
Children 0-3 may be folded into attributes.

Following the creation of BCCX-2, the nominal way to access child nodes
is to access them with an index (like an array). This interface was
back-ported to BCCX-1 (if needed).

BCCX-2 switches from the old clean up mechanism (manual destruction, and
clean-up via linked lists), to a mechanism inspired by Doom's Z_Malloc
system (sweeping away any short-lived nodes via Zone-tags, and
propagating tags along the tree if a node is moved into a longer-lived
zone).

Both versions of BCCX store attributes using 16-bit keys:
4 bits, identifies the type of the attribute
12 bits, gives a symbolic name for the attribute.

With most BCCX calls, one supplies both a name (as a string) and a
location to cache its index number (these need to match).

There are a few special node tags as well, namely:
$cdata / !CDATA, Represents a large blob of raw text.
$text / !TEXT , Represents a blob of bare text.
$list / !LIST , Represents a freestanding list of nodes.
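
As a rough C sketch of the general shape described above (field names,
widths, and layout are guesses for illustration, not BGBCC's actual
structures):

#include <stdint.h>
#include <stddef.h>

#define BCCX_KEY_TYPE(k)  (((k) >> 12) & 0xF)   /* 4 bits: attribute type */
#define BCCX_KEY_NAME(k)  ((k) & 0x0FFF)        /* 12 bits: symbolic name */

typedef struct BCCX_Attr_s {
    uint16_t key;                                /* 4-bit type + 12-bit name */
    union { int64_t i; double f; char *s; } val;
} BCCX_Attr;

typedef struct BCCX_Node_s BCCX_Node;
struct BCCX_Node_s {
    uint16_t   tag;        /* node tag, e.g. $cdata / $text / $list */
    uint16_t   zone;       /* Z_Malloc-style zone tag used for sweeping */
    int        nattr;      /* up to 8 attributes held directly */
    BCCX_Attr  attr[8];
    int        nchild;
    BCCX_Node *child[16];  /* radix-16 block; larger sets split B-Tree style */
};

/* look up an attribute by its 12-bit symbolic name */
BCCX_Attr *BCCX_FindAttr(BCCX_Node *node, uint16_t name) {
    for (int i = 0; i < node->nattr; i++)
        if (BCCX_KEY_NAME(node->attr[i].key) == name)
            return &(node->attr[i]);
    return NULL;
}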


Re: Dense machine code from C++ code (compiler optimizations)

<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22105&group=comp.arch#22105
 by: MitchAlsup - Mon, 22 Nov 2021 18:54 UTC

On Sunday, November 21, 2021 at 11:13:58 PM UTC-6, BGB wrote:
> On 11/21/2021 5:27 PM, MitchAlsup wrote:
> > On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
> >> cc64 compiler cheats and has a bit-slice operator that allows the compiler
> >> to see directly when bit-field operations are needed.
> >> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
> > <
> > That is not cheating !
> > The semantic of the program has been obeyed.
> >>
> > {Although in that case, a single left shift would have sufficed.}
> >>
> > Although::
> > struct { uint64_t a: 20,
> > filler: 19,
> > c: 18; } t;
> >>
> > t.a=t.c //does need an extract and an insert
> >>
> I bit-slice operator could be useful, although BJX2 lacks explicit
> bit-extract or bit-insert operations (they need to be built via shift
> and mask).
>
> It seems like both extract and insert could be generalized into a
> combined "shift-and-mask" operator. Though, this would require a
> mechanism to either supply or create the bit-mask.
>
> Eg:
> Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
> Or (4R):
> Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).
>
> So, if the Mask is all ones, it behaves as a normal shift, but otherwise
> one masks off the bits they want to keep from the destination register.
<
In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
the other defines the width of the field (0->64). For immediate shift amounts
there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
(The empty spaces are checked for significance)
<
Then there is the operand-by-operand select (CMOV) and the bit-by bit
select (Multiplex)
CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
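
In C terms, the two selects behave roughly like this (a sketch for
illustration, not actual simulator code):

#include <stdint.h>

uint64_t cmov(uint64_t s1, uint64_t s2, uint64_t s3) {
    return s1 ? s2 : s3;              /* operand-by-operand select */
}

uint64_t mux(uint64_t s1, uint64_t s2, uint64_t s3) {
    return (s1 & s2) | (~s1 & s3);    /* bit-by-bit select */
}
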
>
> Could synthesize a mask from a shift and count, harder part is coming up
> with a way to do so cheaply (when the shift unit is already in use this
> cycle).
<
It is a simple decoder........
>
>
> Could be done more simply via multiple ops:
> SHLD.Q (left as-is)
> BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
> BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))
>
>
> So, extract is, say:
> SHLD.Q R4, -24, R3
> AND 511, R3
> And, insert is, say:
> MOV 511, R7
> SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
> BITSEL R6, R7, R8
>
>
> Still need to think about it.
<
Don't forget both signed and unsigned versions.
>
>
> Or such...

Re: Dense machine code from C++ code (compiler optimizations)

<sngs88$tcd$1@newsreader4.netcologne.de>
https://www.novabbs.com/devel/article-flat.php?id=22107&group=comp.arch#22107
 by: Thomas Koenig - Mon, 22 Nov 2021 19:49 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:
> On 2021-11-22 16:54, Thomas Koenig wrote:
>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>
>>> A complete rebuild of binutils +
>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>
>> That is blindingly fast, it usually takes about half to 3/4 on an
>> hour with recent versions. Which version is your gcc port
>> based on?
>>
>
> I'm on GCC trunk (12.0). Same with binutils and newlib.
>
> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
> an NVMe drive (~3 GB/s read).
>
> I build with "make -j28" (although the GNU toolchain build system is
> poorly parallelizable, so most of the cores are idle most of the time).
>
> BTW, my build script is here [1].

On a POWER 9 with "make -j32", I get

real 38m30.253s
user 420m38.816s
sys 7m9.639s

with Fortran, C++ and C enabled (and checking).

Re: Dense machine code from C++ code (compiler optimizations)

<sngt7c$jbh$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22108&group=comp.arch#22108
 by: BGB - Mon, 22 Nov 2021 20:05 UTC

On 11/22/2021 11:50 AM, Marcus wrote:
> On 2021-11-22 16:54, Thomas Koenig wrote:
>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>
>>> A complete rebuild of binutils +
>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>
>> That is blindingly fast, it usually takes about half to 3/4 on an
>> hour with recent versions.  Which version is your gcc port
>> based on?
>>
>
> I'm on GCC trunk (12.0). Same with binutils and newlib.
>
> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
> an NVMe drive (~3 GB/s read).
>

In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
3.7 GHz).
OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).

My MOBO does not have M.2 or similar, but does have SATA connectors, so
the SSD is plugged in via SATA.

RAM Stats:
48GB of RAM, 1467MHz (DDR4-2933)
192GB of Pagefile space.

HDD speed is pretty variable, but generally falls into the range of
20-80MB/s (except for lots of small files, where it might drop down to ~
2MB/s or so).

Also kinda funny is that "file and folder compression" tends to actually
make directories full of source code or other small files somewhat
faster (so I tend to leave it on by default for drives which I primarily
use for projects).

Otherwise, it is like the whole Windows Filesystem is built around the
assumption that one is primarily working with small numbers of large
files, rather than large numbers of small files.

> I build with "make -j28" (although the GNU toolchain build system is
> poorly parallelizable, so most of the cores are idle most of the time).
>

Trying to do a parallel make eats up all of the RAM and swap space, so I
generally don't do so. This issue is especially bad with LLVM though.

> BTW, my build script is here [1].
>
> You may experiencing the "Windows tax" [2] - i.e. Windows is almost
> always slower than Linux (esp. when it comes to build systems).
>

I suspect it is because GCC and similar do excessive amounts of file
opening and spawn huge numbers of short-lived processes. These
operations are well into the millisecond range.

This was enough of an issue in BGBCC that I added a cache to keep any
previously loaded headers in-RAM, since when compiling a program it
tends to access the same headers multiple times.
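
A rough sketch of that kind of cache in C (illustrative only; the names
and structure are not BGBCC's actual code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct HdrCacheEnt_s {
    char *name;
    char *data;
    size_t size;
    struct HdrCacheEnt_s *next;
} HdrCacheEnt;

static HdrCacheEnt *hdr_cache;

/* return the header's contents, reading it from disk only once */
char *LoadHeaderCached(const char *name, size_t *rsize) {
    HdrCacheEnt *ent;
    for (ent = hdr_cache; ent; ent = ent->next)
        if (!strcmp(ent->name, name))
            { *rsize = ent->size; return ent->data; }

    FILE *fp = fopen(name, "rb");
    if (!fp) return NULL;
    fseek(fp, 0, SEEK_END);
    long sz = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    char *buf = malloc((size_t)sz + 1);
    size_t got = fread(buf, 1, (size_t)sz, fp);
    buf[got] = 0;
    fclose(fp);

    ent = malloc(sizeof(*ent));
    ent->name = strdup(name);
    ent->data = buf;
    ent->size = got;
    ent->next = hdr_cache;
    hdr_cache = ent;

    *rsize = ent->size;
    return ent->data;
}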

Well, and also it does everything in a single monolithic process.

I had considered possibly splitting the compiler into separately loaded
components (DLLs or SOs), but didn't do so mostly as it would add more
complexity than I felt was worthwhile (vs just recompiling the compiler
all at once).

As noted, in BGBCC, the frontends and backends are managed using
FOURCC's and an interface lookup system loosely inspired by how A/V
codecs worked in Windows (so it will look up a backend with support for a
given target architecture given as a pair of FOURCC's, a frontend to
parse a given source language, ...).
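
A rough C sketch of that lookup pattern (the FOURCC values, names, and
registry shape here are made up for illustration, not BGBCC's real
interfaces):

#include <stdint.h>
#include <stddef.h>

#define FOURCC(a,b,c,d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
                         ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

typedef struct Backend_s {
    uint32_t arch;       /* first FOURCC: target architecture */
    uint32_t subarch;    /* second FOURCC: variant / profile */
    int    (*compile)(void *ctx);
} Backend;

#define MAX_BACKENDS 16
static Backend *backends[MAX_BACKENDS];
static int      n_backends;

void RegisterBackend(Backend *be) {
    if (n_backends < MAX_BACKENDS)
        backends[n_backends++] = be;
}

Backend *LookupBackend(uint32_t arch, uint32_t subarch) {
    for (int i = 0; i < n_backends; i++)
        if (backends[i]->arch == arch && backends[i]->subarch == subarch)
            return backends[i];
    return NULL;   /* no backend registered for this target pair */
}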

Though, the division points are a little wonky:
The Frontend/Middle interface uses BCCX ASTs;
The Middle/Backend interface uses the 3AC IR;
...

RIL isn't used at this level because, at the time I designed this part,
I had considered RIL to be vestigial (part of the current compiler
structure here was due to there being a point where I expected it to get
dissolved and go away). However, I was left with an issue of needing a
way to have static-linked libraries, and it turns out that trying to
save and reload the 3AC IR "kinda really sucked" (much more hairy and
complicated than using the stack-machine IR for static libraries).

Though, compiling stuff in BGBCC with the "debug dump" option slows down
the compiler somewhat, as it is fairly slow to dump the preprocessor
output and ASTs for each translation unit. Most of this time is
seemingly spent on IO (opening files and writing out the output data).

Though, I am not sure how much of this is due to the overhead of the
antivirus software (which auto-scans every file as it is opened or written).

> /Marcus
>
> [1] https://github.com/mrisc32/mrisc32-gnu-toolchain
> [2] https://www.bitsnbites.eu/benchmarking-os-primitives

Re: Dense machine code from C++ code (compiler optimizations)

<snh0sm$f4h$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22109&group=comp.arch#22109
 by: Ivan Godard - Mon, 22 Nov 2021 21:08 UTC

On 11/22/2021 10:54 AM, MitchAlsup wrote:
> On Sunday, November 21, 2021 at 11:13:58 PM UTC-6, BGB wrote:
>> On 11/21/2021 5:27 PM, MitchAlsup wrote:
>>> On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
>>>> cc64 compiler cheats and has a bit-slice operator that allows the compiler
>>>> to see directly when bit-field operations are needed.
>>>> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
>>> <
>>> That is not cheating !
>>> The semantic of the program has been obeyed.
>>>>
>>> {Although in that case, a single left shift would have sufficed.}
>>>>
>>> Although::
>>> struct { uint64_t a: 20,
>>> filler: 19,
>>> c: 18; } t;
>>>>
>>> t.a=t.c //does need an extract and an insert
>>>>
>> I bit-slice operator could be useful, although BJX2 lacks explicit
>> bit-extract or bit-insert operations (they need to be built via shift
>> and mask).
>>
>> It seems like both extract and insert could be generalized into a
>> combined "shift-and-mask" operator. Though, this would require a
>> mechanism to either supply or create the bit-mask.
>>
>> Eg:
>> Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
>> Or (4R):
>> Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).
>>
>> So, if the Mask is all ones, it behaves as a normal shift, but otherwise
>> one masks off the bits they want to keep from the destination register.
> <
> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> the other defines the width of the field (0->64). For immediate shift amounts
> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> (The empty spaces are checked for significance)

The encoding doesn't leave any room for expansion to 128 bit data?

> <
> Then there is the operand-by-operand select (CMOV) and the bit-by bit
> select (Multiplex)
> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>>
>> Could synthesize a mask from a shift and count, harder part is coming up
>> with a way to do so cheaply (when the shift unit is already in use this
>> cycle).
> <
> It is a simple decoder........
>>
>>
>> Could be done more simply via multiple ops:
>> SHLD.Q (left as-is)
>> BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
>> BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))
>>
>>
>> So, extract is, say:
>> SHLD.Q R4, -24, R3
>> AND 511, R3
>> And, insert is, say:
>> MOV 511, R7
>> SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
>> BITSEL R6, R7, R8
>>
>>
>> Still need to think about it.
> <
> Don't forget both signed and unsigned versions.
>>
>>
>> Or such...

Re: Dense machine code from C++ code (compiler optimizations)

<ccc4ef68-7899-46aa-b8e6-c567547c4b28n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22110&group=comp.arch#22110
 by: MitchAlsup - Mon, 22 Nov 2021 21:35 UTC

On Monday, November 22, 2021 at 3:08:40 PM UTC-6, Ivan Godard wrote:
> On 11/22/2021 10:54 AM, MitchAlsup wrote:
> > On Sunday, November 21, 2021 at 11:13:58 PM UTC-6, BGB wrote:
> >> On 11/21/2021 5:27 PM, MitchAlsup wrote:
> >>> On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
> >>>> cc64 compiler cheats and has a bit-slice operator that allows the compiler
> >>>> to see directly when bit-field operations are needed.
> >>>> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
> >>> <
> >>> That is not cheating !
> >>> The semantic of the program has been obeyed.
> >>>>
> >>> {Although in that case, a single left shift would have sufficed.}
> >>>>
> >>> Although::
> >>> struct { uint64_t a: 20,
> >>> filler: 19,
> >>> c: 18; } t;
> >>>>
> >>> t.a=t.c //does need an extract and an insert
> >>>>
> >> I bit-slice operator could be useful, although BJX2 lacks explicit
> >> bit-extract or bit-insert operations (they need to be built via shift
> >> and mask).
> >>
> >> It seems like both extract and insert could be generalized into a
> >> combined "shift-and-mask" operator. Though, this would require a
> >> mechanism to either supply or create the bit-mask.
> >>
> >> Eg:
> >> Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
> >> Or (4R):
> >> Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).
> >>
> >> So, if the Mask is all ones, it behaves as a normal shift, but otherwise
> >> one masks off the bits they want to keep from the destination register..
> > <
> > In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> > the other defines the width of the field (0->64). For immediate shift amounts
> > there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> > shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> > (The empty spaces are checked for significance)
<
> The encoding doesn't leave any room for expansion to 128 bit data?
<
The 12-bit immediate encoding does not, the register encoding does. The register
encoding can supply a 64-bit immediate with two 6-32 bit fields.
Since I have no registers larger than 64-bits, the issue is moot.
<

Re: Dense machine code from C++ code (compiler optimizations)

<snh2ih$qmu$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22111&group=comp.arch#22111
 by: Ivan Godard - Mon, 22 Nov 2021 21:37 UTC

On 11/22/2021 10:50 AM, BGB wrote:
> On 11/22/2021 2:06 AM, Ivan Godard wrote:
>> On 11/21/2021 11:10 PM, Marcus wrote:
>>> On 2021-11-21 kl. 23:14, BGB wrote:
>>>> On 11/21/2021 11:13 AM, Marcus wrote:
>>>>> Just wrote this short post. Maybe someone finds it interesting...
>>>>>
>>>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>>>
>>>>
>>>> I guess this points out one limitation of my compiler (relative to
>>>> GCC) is that for many cases it does a fairly direct translation from
>>>> C source to the machine code.
>>>>
>>>> It will not optimize any "high level" constructions, but instead
>>>> sort of depends on the programmer to write "reasonably efficient" C.
>>>
>>> I would suspect that. That was one of the points in my post: GCC and
>>> Clang have many, many, many man-years of work built in, and it's very
>>> hard to compete with them if you start fresh on a new compiler.
>>>
>>> I also have a feeling that the C++ language is at a level today that
>>> it's near impossible to write a new compiler from scratch. It's not only
>>> about the sheer amount of language features (classes, lambdas,
>>> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
>>> ...), but it's also about the expectations about how the code is
>>> optimized. C++ places a huge burden on the compiler to be able to
>>> resolve lots of things at compile time (e.g. constexpr essentially
>>> requires that the C++ code can be executed at compile time).
>>>
>>>> Such a case would not turn out nearly so nice in my compiler though
>>>> (if it supported C++), but alas.
>>>>
>>>> Trying to port GCC looks like a pain though, as its codebase is
>>>> pretty hairy and it takes a fairly long time to rebuild from source
>>>> (compared with my compiler; which rebuilds in a few seconds).
>>>
>>> Yes, it has taken me years, and the code base and the build system is
>>> not modern by a long shot. A complete rebuild of binutils +
>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
>>> incremental build of GCC when some part of the machine description has
>>> changed (e.g. an insn description was added) takes about a minute.
>>>
>>> OTOH it would probably have taken me even longer to create my own
>>> compiler (especially as I'm not very versed in compiler architecture),
>>> so for me it was the less evil of options (I still kind of regret that
>>> I didn't try harder with LLVM/Clang, though, but I have no evidence
>>> that the grass is greener over there).
>>>
>>>>
>>>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas
>>>> GCC seemingly takes ~ 20 seconds to recompile Doom.
>>>>
>>>
>>> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
>>> Doom is 1.3 s (10 s without parallel compilation). The build time for
>>> Quake is 1.6 s (13 s without parallel compilation).
>>>
>>> But I agree that a fast compiler is worth a lot. I work with a DSP
>>> compiler that can take ~5 minutes to compile a single object file that
>>> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>>>
>>>>
>>>
>>> [snip]
>>>
>>>> And, all this was while trying to hunt down another bug which seems
>>>> to result in Doom demos sometimes desyncing (in a way different from
>>>> the x86 builds); also a few other behavioral anomalies which have
>>>> shown up as bugs in Quake, ...
>>>>
>>>
>>> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
>>> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
>>> get the result off by a single LSB, the demos will start running off
>>> track very quickly ;-) I found this out back in the 1990s when I ported
>>> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>>>
>>>>
>>>> Well, also the relatively naive register allocation strategy doesn't
>>>> help:
>>>> For each basic block, whenever a variable is referenced (that is not
>>>> part of the "statically reserved set"), it is loaded into a register
>>>> (and temporarily held there), and at the end of the basic-block,
>>>> everything is spilled back to the stack.
>>>>
>>>> There are, in theory, much better ways to do register allocation.
>>>>
>>>>
>>>> Though, one useful case is that the register space is large enough
>>>> to where a non-trivial number of functions can use a "statically
>>>> reserve everything" special case. This can completely avoid spills,
>>>> but only for functions within a strict limit for the number of
>>>> variables (limited by the number of callee save registers, or ~ 12
>>>> variables with 32 GPRs).
>>>>
>>>> For most functions, this case still involves the creation of a stack
>>>> frame and similar though (mostly to save/restore registers).
>>>>
>>>> This case still excludes using a few features:
>>>>    Use of structs as value types (structs may only be used as
>>>> pointers);
>>>>    Taking the address of any variable;
>>>>    Use of VLAs or alloca;
>>>>    ...
>>>>
>>>> But, this case does allow a few things:
>>>>    Calling functions;
>>>>    Operators which use scratch registers;
>>>>    Accessing global variables;
>>>>    ...
>>>>
>>>>
>>>> With the expanded GPR space, the scope of the "statically assign
>>>> everything" case could be be expanded, but still haven't gotten
>>>> around to making BGBCC able to use all 64 GPRs directly (and I still
>>>> don't consider them to be part of the baseline ISA, ...). This could
>>>> (in theory) expand the limit to ~ 28 or so.
>>>>
>>>>
>>>> If BGBCC supported C++, I don't have much confidence for how it
>>>> would deal with templates.
>>>>
>>>> But, otherwise, mostly more trying to find and fix bugs and similar
>>>> at the moment. But, often much of the effort is trying to actually
>>>> find the bugs (since "demo desyncs for some reason" isn't super
>>>> obvious as to why this is happening, or where in the codebase the
>>>> bug is being triggered, ...).
>>>>
>>>>
>>>>> /Marcus
>>>>
>>>
>>
>> The rule of thumb twenty years ago was that a new production-grade
>> compiler cost $100M$ and five years. I doubt the cost has gone down.
>> The Mill tool chain, even using clang for front and middle end and not
>> including linking, is ~30k lines of pretty tight C++. That ain't cheap.
>
> The current version of BGBCC is ~ 250k lines of C.
> It was around 50k lines when I started.
>
> Of this:
>   16k, C parser (includes preprocessor)
>   44k, middle stages (AST -> RIL, RIL -> 3AC, Typesystem, ...)
>   76k, BJX2 backend
>   20k, support code (memory manager, AST backend, ...)
>   40k, original SH4 / BJX1 backend;
>   30k, BSR1 backend
>    5k, Stuff for WAD2A and WAD4
>   ...

That's large by my standards; I suspect that a lot of that comes from use
of C instead of C++ as an implementation language - a judicious use of
templates *really* shrinks the source in things as regular as compilers.
Of course, the regularity of the target makes a large difference too -
and then there's commenting style. It makes it really hard to meaningfully
compare code sizes.

The tightest compiler I ever wrote was for Mary2 (an Algol68 variant).
The compiler could compile itself on a DG Nova 1200 in 64k memory - that
you shared with the OS. That compiler is how Mitch and I first met.

> ASTs (Parser, AST->RIL) use a system I had been calling BCCX, which is
> organized in terms of nodes. In an abstract sense, a node contains:
>   A collection of key/value attributes;
>   A collection of zero or more child nodes.


Re: Dense machine code from C++ code (compiler optimizations)

<snh6f2$m5a$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=22112&group=comp.arch#22112
 by: BGB - Mon, 22 Nov 2021 22:43 UTC

On 11/22/2021 12:54 PM, MitchAlsup wrote:
> On Sunday, November 21, 2021 at 11:13:58 PM UTC-6, BGB wrote:
>> On 11/21/2021 5:27 PM, MitchAlsup wrote:
>>> On Sunday, November 21, 2021 at 4:55:08 PM UTC-6, robf...@gmail.com wrote:
>>>> cc64 compiler cheats and has a bit-slice operator that allows the compiler
>>>> to see directly when bit-field operations are needed.
>>>> A line like: “a[63:40] = b[23:0];” compiles into an extract and insert.
>>> <
>>> That is not cheating !
>>> The semantic of the program has been obeyed.
>>>>
>>> {Although in that case, a single left shift would have sufficed.}
>>>>
>>> Although::
>>> struct { uint64_t a: 20,
>>> filler: 19,
>>> c: 18; } t;
>>>>
>>> t.a=t.c //does need an extract and an insert
>>>>
>> I bit-slice operator could be useful, although BJX2 lacks explicit
>> bit-extract or bit-insert operations (they need to be built via shift
>> and mask).
>>
>> It seems like both extract and insert could be generalized into a
>> combined "shift-and-mask" operator. Though, this would require a
>> mechanism to either supply or create the bit-mask.
>>
>> Eg:
>> Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
>> Or (4R):
>> Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).
>>
>> So, if the Mask is all ones, it behaves as a normal shift, but otherwise
>> one masks off the bits they want to keep from the destination register.
> <
> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> the other defines the width of the field (0->64). For immediate shift amounts
> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> (The empty spaces are checked for significance)
> <
> Then there is the operand-by-operand select (CMOV) and the bit-by bit
> select (Multiplex)
> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )

I did go and add a bit-select instruction (BITSEL / MUX).

Currently I have:
CSELT // Rn = SR.T ? Rm : Ro;
PCSELT.L // Packed select, 32-bit words (SR.ST)
PCSELT.W // Packed select, 16-bit words (SR.PQRO)
BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);

BITSEL didn't add much cost, but did initially result in the core
failing timing.

Though, disabling the "MOV.C" instruction saved ~ 2k LUTs and made timing
work again.

The MOV.C instruction has been demoted some, as:
It isn't entirely free;
Its advantage in terms of performance is fairly small;
It doesn't seem to play well with TLB Miss interrupts.

Though, a cheaper option would be "MOV.C that only works with GBR and LR
and similar" (similar effect in practice, but avoids most of the cost of
the additional signal routing).

I am also on the fence about disallowing bit-shift operations in Lane 3,
mostly as a possible way to reduce costs by not having a (rarely used)
Lane 3 shift unit.

Still on the fence though, as it does appear that shift operators in
Lane 3 aren't exactly unused either (so the change would break
compatibility with my existing binaries).

>>
>> Could synthesize a mask from a shift and count, harder part is coming up
>> with a way to do so cheaply (when the shift unit is already in use this
>> cycle).
> <
> It is a simple decoder........

In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
table isn't super cheap.

One could have it as min/max values, allowing the problem to be reduced
to decisions in terms of the output bits.

Still would likely cost an annoying number of LUTs though (~ 200 LUTs by
current estimates). Though, still a lot cheaper than a dedicated shift unit.

I guess the open question is partly a choice between:
A 32-bit encoding which can specify a wide range of simple bit-masks;
Not doing anything, just sticking with existing encodings (using a 96-bit
encoding if need be).

>>
>>
>> Could be done more simply via multiple ops:
>> SHLD.Q (left as-is)
>> BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
>> BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))
>>
>>
>> So, extract is, say:
>> SHLD.Q R4, -24, R3
>> AND 511, R3
>> And, insert is, say:
>> MOV 511, R7
>> SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
>> BITSEL R6, R7, R8
>>
>>
>> Still need to think about it.
> <
> Don't forget both signed and unsigned versions.

Signed extract can be done with 2 shifts.
For insert, there shouldn't be any practical difference between signed
and unsigned.
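
In C, the two-shift version looks roughly like this (a sketch, not BJX2
output; it assumes 1 <= width, pos + width <= 64, and an arithmetic
right shift for signed types):

#include <stdint.h>

uint64_t extract_u(uint64_t x, int pos, int width) {
    return (x << (64 - pos - width)) >> (64 - width);            /* zero-fill */
}

int64_t extract_s(uint64_t x, int pos, int width) {
    return ((int64_t)(x << (64 - pos - width))) >> (64 - width); /* sign-fill */
}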

>>
>>
>> Or such...

Re: Dense machine code from C++ code (compiler optimizations)

<f86a43b7-ef44-4ac4-a3a7-e9b627e0736an@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22113&group=comp.arch#22113
 by: MitchAlsup - Mon, 22 Nov 2021 22:54 UTC

On Monday, November 22, 2021 at 3:37:24 PM UTC-6, Ivan Godard wrote:
> On 11/22/2021 10:50 AM, BGB wrote:
> > On 11/22/2021 2:06 AM, Ivan Godard wrote:

> >
> > The logic for managing the stack frame (load/store stack variables,
> > prolog/epilog sequences, ...) has ended up being one of the largest and
> > most complicated parts of the backend, followed by things like the
> > register allocator and similar.
> When you co-develop the compiler and the architecture together then you
> can push a lot of this off onto the architecture. Our chain would be
> thrice the size if it were targeting a legacy architecture.
<
What to push into HW and what to push into SW is a careful balancing act.
<
One of the things I do in my simulators is that all of the executable instructions
are in a single file, and if you comment one (or more) instructions out, then the
simulator converts that instruction into an INVALID. This makes it easy to
test whether the compiler has used/not-used any given instruction.
<
I did succumb to putting code-density-improving instructions into My 66000 ISA,
particularly in the epilogue and prologue sections. Basically, if all you need is
a few temporary registers, the arguments, and a small local stack frame, then
you do this in SW as instructions. However, if you need registers saved/restored,
FP setup (or not), dynamic arrays, lists of structures constructed, ... then ENTER
and EXIT help out code density immensely.

Re: Dense machine code from C++ code (compiler optimizations)

<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=22114&group=comp.arch#22114
 by: MitchAlsup - Mon, 22 Nov 2021 23:04 UTC

On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
> On 11/22/2021 12:54 PM, MitchAlsup wrote:

> > In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> > the other defines the width of the field (0->64). For immediate shift amounts
> > there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> > shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> > (The empty spaces are checked for significance)
> > <
> > Then there is the operand-by-operand select (CMOV) and the bit-by bit
> > select (Multiplex)
> > CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
> > MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
> I did go and add a bit-select instruction (BITSEL / MUX).
>
> Currently I have:
> CSELT // Rn = SR.T ? Rm : Ro;
> PCSELT.L // Packed select, 32-bit words (SR.ST)
> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
>
> BITSEL didn't add much cost, but did initially result in the core
> failing timing.
>
> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
> work again.
>
>
> The MOV.C instruction has been demoted some, as:
> It isn't entirely free;
> It's advantage in terms of performance is fairly small;
> It doesn't seem to play well with TLB Miss interrupts.
>
> Though, a cheaper option would be "MOV.C that only works with GBR and LR
> and similar" (similar effect in practice, but avoids most of the cost of
> the additional signal routing).
>
>
> I am also on the fence and considering disallowing using bit-shift
> operations in Lane 3, mostly as a possible way to reduce costs by not
> having a (rarely used) Lane 3 shift unit.
<
In general purpose code, shifts are "not all that present" - in the 2%-5%
range (source code). In one 6-wide machine, we stuck an integer unit in each
of the 6 slots, but we borrowed the shifters in the LD-Align stage for
shifts--so only 3 slots could perform shifts, while all 6 could do +-&|^ .
MUL and DIV were done in the multiplier (slot[3]).
<
Shifters are on the order of integer adders in gate count, and less useful.
>
> Still on the fence though as it does appear that shifts operators in
> Lane 3 aren't exactly unused either (so the change breaks compatibility
> with my existing binaries).
<
You should see what the cost is if only lanes[0..1] can perform shifts.
> >>
> >> Could synthesize a mask from a shift and count, harder part is coming up
> >> with a way to do so cheaply (when the shift unit is already in use this
> >> cycle).
> > <
> > It is a simple decoder........
> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
> table isn't super cheap.
<
input.....|.............................output..................................
000000 | 0000000000000000000000000000000000000000000000000000000000000000
000001 | 0000000000000000000000000000000000000000000000000000000000000001
000010 | 0000000000000000000000000000000000000000000000000000000000000011
000011 | 0000000000000000000000000000000000000000000000000000000000000111
000100 | 0000000000000000000000000000000000000000000000000000000000001111
etc.
It is a straight "Greater Than" decoder.
>
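
In C, the decode in the table above amounts to something like this (a
sketch; as in the first table row, a width of 0 yields an all-zero mask
here):

#include <stdint.h>

/* output bit i is set when i < width6: a thermometer / "greater than" decode */
uint64_t width_mask(unsigned width6) {
    uint64_t m = 0;
    for (unsigned i = 0; i < 64; i++)
        if (i < width6)
            m |= 1ull << i;
    return m;
}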
