Rocksolid Light



devel / comp.arch / Re: Dense machine code from C++ code (compiler optimizations)

Subject / Author
* Dense machine code from C++ code (compiler optimizations)Marcus
+* Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
|`- Re: Dense machine code from C++ code (compiler optimizations)Marcus
+* Re: Dense machine code from C++ code (compiler optimizations)BGB
|+* Re: Dense machine code from C++ code (compiler optimizations)robf...@gmail.com
||`* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
|| `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||  +- Re: Dense machine code from C++ code (compiler optimizations)Ivan Godard
||  `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
||   +* Re: Dense machine code from C++ code (compiler optimizations)Ivan Godard
||   |`- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
||   `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||    `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
||     `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||      +* Re: Dense machine code from C++ code (compiler optimizations)robf...@gmail.com
||      |+- Re: Dense machine code from C++ code (compiler optimizations)BGB
||      |`* Re: Thor (was: Dense machine code...)Marcus
||      | `* Re: Thor (was: Dense machine code...)robf...@gmail.com
||      |  +- Re: ThorEricP
||      |  `* Re: Thor (was: Dense machine code...)Marcus
||      |   `- Re: Thor (was: Dense machine code...)robf...@gmail.com
||      `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
||       `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||        `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
||         `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||          `* Re: Dense machine code from C++ code (compiler optimizations)BGB
||           `* Re: Testing with open source games (was Dense machine code ...)Marcus
||            +* Re: Testing with open source games (was Dense machine code ...)Terje Mathisen
||            |`* Re: Testing with open source games (was Dense machine code ...)Marcus
||            | +- Re: Testing with open source games (was Dense machine code ...)Terje Mathisen
||            | `* Re: Testing with open source games (was Dense machine code ...)James Van Buskirk
||            |  `- Re: Testing with open source games (was Dense machine code ...)Marcus
||            `- Re: Testing with open source games (was Dense machine code ...)BGB
|`* Re: Dense machine code from C++ code (compiler optimizations)Marcus
| +* Re: Dense machine code from C++ code (compiler optimizations)Ivan Godard
| |+- Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
| |`* Re: Dense machine code from C++ code (compiler optimizations)BGB
| | `* Re: Dense machine code from C++ code (compiler optimizations)Ivan Godard
| |  +- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
| |  `- Re: Dense machine code from C++ code (compiler optimizations)BGB
| +* Re: Dense machine code from C++ code (compiler optimizations)BGB
| |`- Re: Dense machine code from C++ code (compiler optimizations)Paul A. Clayton
| `* Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
|  `* Re: Dense machine code from C++ code (compiler optimizations)Marcus
|   +* Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
|   |`* Re: Dense machine code from C++ code (compiler optimizations)Marcus
|   | `- Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
|   `* Re: Dense machine code from C++ code (compiler optimizations)BGB
|    +* Re: Dense machine code from C++ code (compiler optimizations)Marcus
|    |`- Re: Dense machine code from C++ code (compiler optimizations)George Neuner
|    `* Re: Dense machine code from C++ code (compiler optimizations)David Brown
|     `* Re: Dense machine code from C++ code (compiler optimizations)Marcus
|      `* Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
|       +* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
|       |`- Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
|       +- Re: Dense machine code from C++ code (compiler optimizations)BGB
|       `- Re: Dense machine code from C++ code (compiler optimizations)Marcus
`* Re: Dense machine code from C++ code (compiler optimizations)Ir. Hj. Othman bin Hj. Ahmad
 +- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
 `* Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
  +* Re: Dense machine code from C++ code (compiler optimizations)chris
  |`* Re: Dense machine code from C++ code (compiler optimizations)David Brown
  | `* Re: Dense machine code from C++ code (compiler optimizations)chris
  |  +* Re: Dense machine code from C++ code (compiler optimizations)David Brown
  |  |`- Re: Dense machine code from C++ code (compiler optimizations)chris
  |  `* Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
  |   `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
  |    +- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
  |    `* Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
  |     `- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
  +* Re: Dense machine code from C++ code (compiler optimizations)David Brown
  |`* Re: Dense machine code from C++ code (compiler optimizations)BGB
  | `* Re: Dense machine code from C++ code (compiler optimizations)David Brown
  |  `* Re: Dense machine code from C++ code (compiler optimizations)BGB
  |   +* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
  |   |`* Re: Dense machine code from C++ code (compiler optimizations)BGB
  |   | `- Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
  |   `* Re: Dense machine code from C++ code (compiler optimizations)David Brown
  |    `- Re: Dense machine code from C++ code (compiler optimizations)BGB
  `* Re: Dense machine code from C++ code (compiler optimizations)Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   |`- Re: Dense machine code from C++ code (compiler optimizations)Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations)James Van Buskirk
   |`* Re: Dense machine code from C++ code (compiler optimizations)Stephen Fuld
   | +* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   | |+- Re: Dense machine code from C++ code (compiler optimizations)Marcus
   | |`* Re: Dense machine code from C++ code (compiler optimizations)Terje Mathisen
   | | `* Re: Dense machine code from C++ code (compiler optimizations)Stephen Fuld
   | |  `* Re: Dense machine code from C++ code (compiler optimizations)EricP
   | |   `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   | |    `* Re: Dense machine code from C++ code (compiler optimizations)EricP
   | |     `* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   | |      `- Re: Dense machine code from C++ code (compiler optimizations)EricP
   | `* Re: Dense machine code from C++ code (compiler optimizations)Tim Rentsch
   |  +* Re: Dense machine code from C++ code (compiler optimizations)Stephen Fuld
   |  |+* Re: Dense machine code from C++ code (compiler optimizations)Guillaume
   |  ||+* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   |  |||+- Re: Dense machine code from C++ code (compiler optimizations)Thomas Koenig
   |  |||`* Re: Dense machine code from C++ code (compiler optimizations)Guillaume
   |  ||| +* Re: Dense machine code from C++ code (compiler optimizations)MitchAlsup
   |  ||| |`* Re: Dense machine code from C++ code (compiler optimizations)Andreas Eder
   |  ||| `- Re: Dense machine code from C++ code (compiler optimizations)Tim Rentsch
   |  ||`- Re: Dense machine code from C++ code (compiler optimizations)Tim Rentsch
   |  |`* Re: Dense machine code from C++ code (compiler optimizations)Tim Rentsch
   |  `* Re: Dense machine code from C++ code (compiler optimizations)Ivan Godard
   `- Re: Dense machine code from C++ code (compiler optimizations)Andreas Eder

Re: Dense machine code from C++ code (compiler optimizations)

<snhn49$rp4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22115&group=comp.arch#22115

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Mon, 22 Nov 2021 21:28:03 -0600
Organization: A noiseless patient Spider
Lines: 692
Message-ID: <snhn49$rp4$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <snfj23$7h1$1@dont-email.me>
<sngoqd$hgr$1@dont-email.me> <snh2ih$qmu$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 03:28:10 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f58ae11bce70c03b51364af3b8944eed";
logging-data="28452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX189rkhmoVYaWsgSi8WU9Wo5"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:FBUI1/pRkcOq9yQGtYlcrdvUgjI=
In-Reply-To: <snh2ih$qmu$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 23 Nov 2021 03:28 UTC

On 11/22/2021 3:37 PM, Ivan Godard wrote:
> On 11/22/2021 10:50 AM, BGB wrote:
>> On 11/22/2021 2:06 AM, Ivan Godard wrote:
>>> On 11/21/2021 11:10 PM, Marcus wrote:
>>>> On 2021-11-21 kl. 23:14, BGB wrote:
>>>>> On 11/21/2021 11:13 AM, Marcus wrote:
>>>>>> Just wrote this short post. Maybe someone finds it interesting...
>>>>>>
>>>>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>>>>
>>>>>
>>>>> I guess this points out one limitation of my compiler (relative to
>>>>> GCC) is that for many cases it does a fairly direct translation
>>>>> from C source to the machine code.
>>>>>
>>>>> It will not optimize any "high level" constructions, but instead
>>>>> sort of depends on the programmer to write "reasonably efficient" C.
>>>>
>>>> I would suspect that. That was one of the points in my post: GCC and
>>>> Clang have many, many, many man-years of work built in, and it's very
>>>> hard to compete with them if you start fresh on a new compiler.
>>>>
>>>> I also have a feeling that the C++ language is at a level today that
>>>> it's near impossible to write a new compiler from scratch. It's not
>>>> only
>>>> about the sheer amount of language features (classes, lambdas,
>>>> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
>>>> ...), but it's also about the expectations about how the code is
>>>> optimized. C++ places a huge burden on the compiler to be able to
>>>> resolve lots of things at compile time (e.g. constexpr essentially
>>>> requires that the C++ code can be executed at compile time).
>>>>
>>>>> Such a case would not turn out nearly so nice in my compiler though
>>>>> (if it supported C++), but alas.
>>>>>
>>>>> Trying to port GCC looks like a pain though, as its codebase is
>>>>> pretty hairy and it takes a fairly long time to rebuild from source
>>>>> (compared with my compiler; which rebuilds in a few seconds).
>>>>
>>>> Yes, it has taken me years, and the code base and the build system is
>>>> not modern by a long shot. A complete rebuild of binutils +
>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
>>>> incremental build of GCC when some part of the machine description has
>>>> changed (e.g. an insn description was added) takes about a minute.
>>>>
>>>> OTOH it would probably have taken me even longer to create my own
>>>> compiler (especially as I'm not very versed in compiler architecture),
>>>> so for me it was the less evil of options (I still kind of regret that
>>>> I didn't try harder with LLVM/Clang, though, but I have no evidence
>>>> that the grass is greener over there).
>>>>
>>>>>
>>>>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas
>>>>> GCC seemingly takes ~ 20 seconds to recompile Doom.
>>>>>
>>>>
>>>> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time
>>>> for
>>>> Doom is 1.3 s (10 s without parallel compilation). The build time
>>>> for Quake is 1.6 s (13 s without parallel compilation).
>>>>
>>>> But I agree that a fast compiler is worth a lot. I work with a DSP
>>>> compiler that can take ~5 minutes to compile a single object file that
>>>> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>>>>
>>>>>
>>>>
>>>> [snip]
>>>>
>>>>> And, all this was while trying to hunt down another bug which seems
>>>>> to result in Doom demos sometimes desyncing (in a way different
>>>>> from the x86 builds); also a few other behavioral anomalies which
>>>>> have shown up as bugs in Quake, ...
>>>>>
>>>>
>>>> FixedDiv and FixedMul are very sensitive in Doom. If you approximate
>>>> the
>>>> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
>>>> get the result off by a single LSB, the demos will start running off
>>>> track very quickly ;-) I found this out back in the 1990s when I ported
>>>> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
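The exactness requirement is easy to see in code. Here is a minimal C sketch of the 16.16 operations under discussion (the names follow Doom's m_fixed.c; the real FixedDiv also checks for overflow and saturates, which is omitted here for brevity):

```c
#include <stdint.h>

/* 16.16 fixed point, in the shape of Doom's m_fixed.c (sketch only). */
typedef int32_t fixed_t;
#define FRACBITS 16
#define FRACUNIT ((fixed_t)(1 << FRACBITS))

/* Exact product: widen to 64 bits, then shift down.  Doing this in
 * 32-bit floating point instead loses low bits, and a single-LSB
 * error is enough to desync a recorded demo. */
static fixed_t FixedMul(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}

/* Exact quotient: widen and pre-shift the dividend.  (The original
 * FixedDiv first tests for overflow and returns INT32_MAX/INT32_MIN;
 * that guard is omitted in this sketch.) */
static fixed_t FixedDiv(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a << FRACBITS) / b);
}
```

Any "optimized" replacement has to reproduce these results bit-for-bit, or the demo playback diverges.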
>>>>
>>>>>
>>>>> Well, also the relatively naive register allocation strategy
>>>>> doesn't help:
>>>>> For each basic block, whenever a variable is referenced (that is
>>>>> not part of the "statically reserved set"), it is loaded into a
>>>>> register (and temporarily held there), and at the end of the
>>>>> basic-block, everything is spilled back to the stack.
>>>>>
>>>>> There are, in theory, much better ways to do register allocation.
>>>>>
>>>>>
>>>>> Though, one useful case is that the register space is large enough
>>>>> to where a non-trivial number of functions can use a "statically
>>>>> reserve everything" special case. This can completely avoid spills,
>>>>> but only for functions within a strict limit for the number of
>>>>> variables (limited by the number of callee save registers, or ~ 12
>>>>> variables with 32 GPRs).
>>>>>
>>>>> For most functions, this case still involves the creation of a
>>>>> stack frame and similar though (mostly to save/restore registers).
>>>>>
>>>>> This case still excludes using a few features:
>>>>>    Use of structs as value types (structs may only be used as
>>>>> pointers);
>>>>>    Taking the address of any variable;
>>>>>    Use of VLAs or alloca;
>>>>>    ...
>>>>>
>>>>> But, this case does allow a few things:
>>>>>    Calling functions;
>>>>>    Operators which use scratch registers;
>>>>>    Accessing global variables;
>>>>>    ...
>>>>>
>>>>>
>>>>> With the expanded GPR space, the scope of the "statically assign
>>>>> everything" case could be expanded, but I still haven't gotten
>>>>> around to making BGBCC able to use all 64 GPRs directly (and I
>>>>> still don't consider them to be part of the baseline ISA, ...).
>>>>> This could (in theory) expand the limit to ~ 28 or so.
>>>>>
>>>>>
>>>>> If BGBCC supported C++, I don't have much confidence for how it
>>>>> would deal with templates.
>>>>>
>>>>> But, otherwise, mostly more trying to find and fix bugs and similar
>>>>> at the moment. But, often much of the effort is trying to actually
>>>>> find the bugs (since "demo desyncs for some reason" isn't super
>>>>> obvious as to why this is happening, or where in the codebase the
>>>>> bug is being triggered, ...).
>>>>>
>>>>>
>>>>>> /Marcus
>>>>>
>>>>
>>>
>>> The rule of thumb twenty years ago was that a new production-grade
> compiler cost $100M and five years. I doubt the cost has gone down.
>>> The Mill tool chain, even using clang for front and middle end and
>>> not including linking, is ~30k lines of pretty tight C++. That ain't
>>> cheap.
>>
>> The current version of BGBCC is ~ 250k lines of C.
>> It was around 50k lines when I started.
>>
>> Of this:
>>    16k, C parser (includes preprocessor)
>>    44k, middle stages (AST -> RIL, RIL -> 3AC, Typesystem, ...)
>>    76k, BJX2 backend
>>    20k, support code (memory manager, AST backend, ...)
>>    40k, original SH4 / BJX1 backend;
>>    30k, BSR1 backend
>>     5k, Stuff for WAD2A and WAD4
>>    ...
>
> That's large by my standard; I suspect that a lot of that comes from use
> of C instead of C++ as an implementation language - a judicious use of
> templates *really* shrinks the source in things as regular as compilers.
> Of course, the regularity of the target makes a large difference too -
> and then there's commenting style. It makes it real hard to meaningfully
> compare code sizes.
>
> The tightest compiler I ever wrote was for Mary2 (an Algol68 variant).
> The compiler could compile itself on a DG Nova 1200 in 64k memory - that
> you shared with the OS. That compiler is how Mitch and I first met.
>


Re: Dense machine code from C++ code (compiler optimizations)

<snhtp1$nkn$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22118&group=comp.arch#22118

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Mon, 22 Nov 2021 23:21:31 -0600
Organization: A noiseless patient Spider
Lines: 125
Message-ID: <snhtp1$nkn$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Nov 2021 05:21:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f58ae11bce70c03b51364af3b8944eed";
logging-data="24215"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+ceNHcPyKecfCXZwZtqWpl"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:5boY5sEzlNYArq/nfXSkvXoqOsk=
In-Reply-To: <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 23 Nov 2021 05:21 UTC

On 11/22/2021 5:04 PM, MitchAlsup wrote:
> On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
>> On 11/22/2021 12:54 PM, MitchAlsup wrote:
>
>>> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
>>> the other defines the width of the field (0->64). For immediate shift amounts
>>> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
>>> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
>>> (The empty spaces are checked for significance)
>>> <
>>> Then there is the operand-by-operand select (CMOV) and the bit-by bit
>>> select (Multiplex)
>>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
>>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>> I did go and add a bit-select instruction (BITSEL / MUX).
>>
>> Currently I have:
>> CSELT // Rn = SR.T ? Rm : Ro;
>> PCSELT.L // Packed select, 32-bit words (SR.ST)
>> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
>> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
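In plain C, the two kinds of select quoted above (operand-by-operand CMOV versus bit-by-bit MUX/BITSEL) come out as one-liners; this follows Mitch's definitions, and the function names are just for illustration:

```c
#include <stdint.h>

/* Operand-by-operand select (CMOV): the condition register is tested
 * as a whole; a nonzero rs1 selects rs2, zero selects rs3. */
static uint64_t cmov(uint64_t rs1, uint64_t rs2, uint64_t rs3) {
    return rs1 ? rs2 : rs3;
}

/* Bit-by-bit select (MUX / BITSEL): each set bit of the mask rs1
 * picks the corresponding bit of rs2, each clear bit picks rs3. */
static uint64_t mux(uint64_t rs1, uint64_t rs2, uint64_t rs3) {
    return (rs1 & rs2) | (~rs1 & rs3);
}
```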
>>
>> BITSEL didn't add much cost, but did initially result in the core
>> failing timing.
>>
>> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
>> work again.
>>
>>
>> The MOV.C instruction has been demoted some, as:
>> It isn't entirely free;
>> Its advantage in terms of performance is fairly small;
>> It doesn't seem to play well with TLB Miss interrupts.
>>
>> Though, a cheaper option would be "MOV.C that only works with GBR and LR
>> and similar" (similar effect in practice, but avoids most of the cost of
>> the additional signal routing).
>>
>>
>> I am also on the fence and considering disallowing using bit-shift
>> operations in Lane 3, mostly as a possible way to reduce costs by not
>> having a (rarely used) Lane 3 shift unit.
> <
> In general purpose codes, shifts are "not all that present" 2%-5% range (source
> code). In one 6-wide machine, we stuck an integer unit in each of the 6-slots
> but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
> could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
> multiplier (slot[3]).
> <
> Shifters are on the order of gate count as expensive as integer adders, and less useful.

All 3 lanes still have other ALU ops, including some type-conversion and
packed-integer SIMD ops.

>>
>> Still on the fence though as it does appear that shifts operators in
>> Lane 3 aren't exactly unused either (so the change breaks compatibility
>> with my existing binaries).
> <
> You should see what the cost is if only lanes[0..1] can perform shifts.

Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
It also makes timing a little better.

I guess the main factor is whether it justifies the break in binary
compatibility (since previously, the WEXifier assumed that running
shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
number of times in pretty much every binary I have looked at).

>>>>
>>>> Could synthesize a mask from a shift and count, harder part is coming up
>>>> with a way to do so cheaply (when the shift unit is already in use this
>>>> cycle).
>>> <
>>> It is a simple decoder........
>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
>> table isn't super cheap.
> <
> input.....|.............................output..................................
> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
> etc.
> It is a straight "Greater Than" decoder.

One needs a greater-than comparison, a less-than comparison, and a way
to combine them with a bit-wise operator (AND or OR), for each bit.

Say, min, max:
min<=max: Combine with AND
min >max: Combine with OR

This would allow composing masks with a range of patterns, say:
00000000000003FF
7FE0000000000000
000003FFFFC00000
7FFE000000003FFF
...

Granted, this isn't all that steep (vs the core as a whole), but I guess
it is more a question of whether it would be useful enough to make it
worthwhile.
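One way to read the scheme being discussed: each bit i of the mask is the combination of (i >= min) and (i <= max), using AND when min <= max and OR for the wrap-around case. A sketch in C, assuming 6-bit min/max indices (the helper names here are invented, not from either ISA):

```c
#include <stdint.h>

/* One "greater-than decoder": bit i is set iff i < n, for n in 0..64. */
static uint64_t lo_mask(unsigned n) {
    return (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
}

/* Compose a field mask from (min, max) bit positions, per the post:
 * min <= max selects the contiguous run min..max (AND of the two
 * half-planes); min > max selects the wrap-around pattern (OR). */
static uint64_t field_mask(unsigned min, unsigned max) {
    uint64_t ge_min = ~lo_mask(min);      /* bits >= min */
    uint64_t le_max = lo_mask(max + 1);   /* bits <= max */
    return (min <= max) ? (ge_min & le_max) : (ge_min | le_max);
}
```

E.g. field_mask(0, 9) gives 0x00000000000003FF and field_mask(53, 62) gives 0x7FE0000000000000, matching two of the patterns listed above; the wrap-around combine produces masks like 0xF00000000000000F.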

Though, would need to find an "odd corner" to shove it into.
Otherwise, BJX2 lacks any encodings for "OP Imm12, Rn".

Have spots for Imm10 (insufficient), and Imm16 (but no, not going to
spend an Imm16 spot on this; there are only a few of these left).

Have a few possible ideas (involving XGPR encodings, such as reclaiming
the encoding space for branches for Imm12 ops because XGPR+BRA does not
make sense), but they seem like ugly hacks.

Similarly, outside of being able to fit in a 32-bit encoding, it loses
any real advantage it might have had. Jumbo encodings can load pretty
much any possible constant.

>>
>

Re: Dense machine code from C++ code (compiler optimizations)

<25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22119&group=comp.arch#22119

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:411e:: with SMTP id kc30mr3255395qvb.38.1637650489628;
Mon, 22 Nov 2021 22:54:49 -0800 (PST)
X-Received: by 2002:a9d:764c:: with SMTP id o12mr2328540otl.129.1637650489332;
Mon, 22 Nov 2021 22:54:49 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 22 Nov 2021 22:54:49 -0800 (PST)
In-Reply-To: <snhtp1$nkn$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:603e:8cd7:4ca4:21b8;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:603e:8cd7:4ca4:21b8
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 23 Nov 2021 06:54:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 46
 by: robf...@gmail.com - Tue, 23 Nov 2021 06:54 UTC

I have put a significant portion of a man-year into my own compiler over the last 25+ years. Not
really production-quality code; hobby quality, with lots of ‘makes it work’ code. I think it is about
80k lines of code. A combination of C and C++. It started out as pure C. I have not yet used it for
anything but simple demos. The compiler has limited support for some C++ isms like classes
with methods and method overloading. But does not support operator overloading, templates and
many other things associated with C++. The compiler makes use of two types of parse trees
which would be analogous to an AST. Expression nodes and statement nodes. As it generates code
it builds a linked list of code nodes. Optimizations are applied to all different types of nodes.
I got started working on it for a 68000 co-processor board for the XT and it has mutated through
several different processors. I started work before the internet and before I found out about other
projects like gcc. I decided to put more work into my own compiler, which I already knew, rather than
learn the workings of gcc. The most recent addition to the compiler is support for co-routines. I
think I can come up with a way to support some recursion in co-routines, by invoking a co-routine
again instead of just using ‘yield’ but I must think about it some more. I keep toying with the idea
of trying to make it more mainstream, perhaps by adding support for other mainstream processors.
The compiler has several nifty features like bit-slicing and enumerator generators.

For field extracts Thor supports using either a register or a seven-bit constant to specify the field
offset and width. Most Thor instructions use a 48-bit encoding, allowing room for more options
when dealing with operands. One operand bit indicates a register or constant value. Virtually any
instruction can be a vector instruction as there is an opcode bit dedicated to indicating a vector
operation.
Thor should be extendable to 128-bits with a little bit of work. 128-bit constants can be formed
using the appropriate prefix instructions. The plan is to support 128-bit decimal floating-point.

Re: Dense machine code from C++ code (compiler optimizations)

<sni7hf$5o5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22120&group=comp.arch#22120

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 09:08:14 +0100
Organization: A noiseless patient Spider
Lines: 124
Message-ID: <sni7hf$5o5$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 08:08:15 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fa2a390485d48858ed6c0edccba8849b";
logging-data="5893"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LvLQMS9LiE7L739WEXFw7/+DhxqotL7w="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:lLbaCHqAsImlNonq+65dSa2RJdE=
In-Reply-To: <sngt7c$jbh$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Tue, 23 Nov 2021 08:08 UTC

On 2021-11-22 21:05, BGB wrote:
> On 11/22/2021 11:50 AM, Marcus wrote:
>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>
>>>> A complete rebuild of binutils +
>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>
>>> That is blindingly fast; it usually takes about half to 3/4 of an
>>> hour with recent versions.  Which version is your gcc port
>>> based on?
>>>
>>
>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>
>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>> an NVMe drive (~3 GB/s read).
>>
>
> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
> 3.7 GHz).
>   OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>   Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).
>
> My MOBO does not have M.2 or similar, but does have SATA connectors, so
> the SSD is plugged in via SATA.
>
> RAM Stats:
>   48GB of RAM, 1467MHz (DDR4-2933)
>   192GB of Pagefile space.
>
> HDD speed is pretty variable, but generally falls into the range of
> 20-80MB/s (except for lots of small files, where it might drop down to ~
> 2MB/s or so).

I believe that Windows is particularly sensitive to slow/spinning
drives. My impression is that Linux is generally better at caching disk
data in RAM, which hides some of the slowness (esp. when working with
small files).

If you have spare PCIe slots (preferably PCIe 3.0 x4), you could always
look into using a PCIe M.2 adapter. That together with a 1 TB 3000 MB/s
drive should set you back less than $100. It's a game changer for
productivity.

>
> Also kinda funny is that "file and folder compression" tends to actually
> make directories full of source code or other small files somewhat
> faster (so I tend to leave it on by default for drives which I primarily
> use for projects).
>
> Otherwise, it is like the whole Windows Filesystem is built around the
> assumption that one is primarily working with small numbers of large
> files, rather than large numbers of small files.
>

Yes. Actually, it seems to be built around the assumption that you should
not access the file system at all if possible. Where you'd typically use
files for inter-process data communication in a Unix environment you are
often better off using direct IPC and in-memory data stores in Windows,
just to avoid the penalty of going via all of the file system layers.
You typically also want to use long-running persistent processes on
Windows, whereas in Linux or macOS it's fine to start thousands of
processes per second.

I think this is why tools that originate in Unix-like environments (e.g.
GCC, Git, CMake, bash, autotools, Apache, PostgreSQL, ...) always run
faster on Linux than on Windows - they were designed under the
assumption that file accesses and process creation are "free".

>
>> I build with "make -j28" (although the GNU toolchain build system is
>> poorly parallelizable, so most of the cores are idle most of the time).
>>
>
> Trying to parallel make eats up all of the RAM and swap space, so I
> don't generally do so. This issue is especially bad with LLVM though.

Yeah, you have to have the RAM to support parallelism. You can always
try to use fewer processes though (e.g. -j2 or -j3). Also, avoid plain
"-j" as that starts as many processes as possible without regard to how
many cores you have.

Unlike the GCC build system, the LLVM build system is near perfectly
parallelizable.

>
>
>> BTW, my build script is here [1].
>>
>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>> always slower than Linux (esp. when it comes to build systems).
>>
>
> I suspect because GCC and similar do excessive amounts of file opening
> and spawning huge numbers of short-lived processes. These operations are
> well into the millisecond range.
>

Yep.

> This was enough of an issue in BGBCC that I added a cache to keep any
> previously loaded headers in-RAM, since when compiling a program it
> tends to access the same headers multiple times.
>
> Well, and also it does everything in a single monolithic process.
>

That's the Windows-first approach: Keep everything in one large
monolithic executable & process. It works, but I happen to like the more
modular approach of Unix-like systems, which makes it easy to leverage
existing tools (e.g. BuildCache [1]).

/Marcus

[1] https://github.com/mbitsnbites/buildcache

[snip]

Re: Dense machine code from C++ code (compiler optimizations)

<snibhu$t8b$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22121&group=comp.arch#22121

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 10:16:45 +0100
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <snibhu$t8b$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngs88$tcd$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Nov 2021 09:16:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fa2a390485d48858ed6c0edccba8849b";
logging-data="29963"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+733cWhq1GeBCkTBlJdlM5aQyREGu+ULg="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:f/VRjkLfJqoc09maN67mtSa5P6Q=
In-Reply-To: <sngs88$tcd$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Marcus - Tue, 23 Nov 2021 09:16 UTC

On 2021-11-22 20:49, Thomas Koenig wrote:
> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>
>>>> A complete rebuild of binutils +
>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>
>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>> hour with recent versions. Which version is your gcc port
>>> based on?
>>>
>>
>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>
>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>> an NVMe drive (~3 GB/s read).
>>
>> I build with "make -j28" (although the GNU toolchain build system is
>> poorly parallelizable, so most of the cores are idle most of the time).
>>
>> BTW, my build script is here [1].
>
> On a POWER 9 with "make -j32", I get
>
>
> real 38m30.253s
> user 420m38.816s
> sys 7m9.639s
>
> with Fortran, C++ and C enabled (and checking).
>

I once had the opportunity to benchmark a relatively pricey POWER 9
system against an AMD 3950X.

IBM POWER 9:
- 2 sockets x 12 cores/socket x 4 threads/core = 96 HW threads
- 2.9 GHz under load (pretty high for a server CPU)

AMD 3950X:
- 1 socket x 16 cores/socket x 2 threads/core = 32 HW threads
- 3.9 GHz under load

I built LLVM using GCC, which saturated all cores, and to my surprise
the 3950X was faster than the POWER 9:

POWER 9: Build time 7m40s
3950X: Build time 6m45s

So I no longer consider POWER to be a real alternative, especially
given the price delta (the 3950X system cost less than $1500, the
POWER 9 system didn't).

/Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<snicqq$581$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22122&group=comp.arch#22122

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 03:38:28 -0600
Organization: A noiseless patient Spider
Lines: 140
Message-ID: <snicqq$581$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 09:38:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f58ae11bce70c03b51364af3b8944eed";
logging-data="5377"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Bl8IYGYt8+3TBZKnSW71G"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:xtH4URFUNzdHkSoKyCSqJDaKzKw=
In-Reply-To: <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 23 Nov 2021 09:38 UTC

On 11/23/2021 12:54 AM, robf...@gmail.com wrote:
> I have put a significant portion of a man-year into my own compiler over the last 25+ years. Not
> really production quality code, hobby quality with lots of ‘makes it work’ code. I think it is about
> 80k lines of code. A combination of C and C++. It started out as pure C. I have not yet used it for
> anything but simple demos. The compiler has limited support for some C++ isms like classes
> with methods and method overloading. But does not support operator overloading, templates and
> many other things associated with C++. The compiler makes use of two types of parse trees
> which would be analogous to an AST. Expression nodes and statement nodes. As it generates code
> it builds a linked list of code nodes. Optimizations are applied to all different types of nodes.
> I got started working on it for a 68000 co-processor board for the XT and it has mutated through
> several different processors. I started work before the internet and before I found out about other
> projects like gcc. I decided to put more work into my own compiler that I already knew rather than
> learn the workings of gcc. The most recent addition to the compiler is support for co-routines. I
> think I can come up with a way to support some recursion in co-routines, by invoking a co-routine
> again instead of just using ‘yield’ but I must think about it some more. I keep toying with the idea
> of trying to make it more mainstream, perhaps by adding support for other mainstream processors.
> The compiler has several nifty features like bit-slicing and enumerator generators.
>

Yeah, BGBCC isn't quite that old.

My first Scheme interpreter was ~ 2002 or so, when I was in high school.

The first BGBScript VM was written ~ 2004, shortly after I graduated HS.
I did a partial rewrite of the VM ~ 2006 (where it was recombined with
the Scheme interpreter).

The BGBCC fork initially happened in the 2007/2008 timeframe.
I got it sorta able to compile C, but it was kind of a fail at its
original purpose, and was difficult to debug, ... My younger self had
sorta underestimated the jump in compiler complexity from a JavaScript
style language to something like C.

It does support a C++ subset:
classes (single inheritance);
operator overloading;
namespaces;
...
But, does not support templates or similar.

C is generally a higher priority in my case.

BGBCC was then mostly dormant until ~ 2017, when my BJX1 project got
started. One of the first non-trivial programs I tried compiling with it
was Quake.

I have some videos on YouTube from that era (several years ago), eg:
https://www.youtube.com/watch?v=xe5T5rhvAtU
(At this time, Quake was still pretty broken).

Note that the only reason Quake ran at playable speeds was because the
BJX1 emulator was not trying to be cycle accurate, and could emulate
processors much faster than what is viable on an FPGA. A faster BJX2
emulator could be made, but would need to abandon things like
cycle-counting and modeling cache misses and similar.

The second major program I faced off with was Doom, which ended up as
the primary test program, mostly because it was a lot easier to make
Doom run well at 50MHz than Quake (single-digit framerates are "teh
suck").

> For field extracts Thor supports using either a register or a seven-bit constant to specify the field
> offset and width. Most Thor instructions are using a 48-bit encoding, allowing room for more options
> when dealing with operands. One operand bit indicates a register or constant value. Virtually any
> instruction can be a vector instruction as there is an opcode bit dedicated to indicating a vector
> operation.
> Thor should be extendable to 128-bits with a little bit of work. 128-bit constants can be formed
> using the appropriate prefix instructions. The plan is to support 128-bit decimal floating-point.
>

OK. I am dealing with 64-bit registers, but any 128-bit operations
mostly exist by operating on registers in pairs.

No plans for Decimal FP as of yet.
Had tried to support a tweaked S.E15.M80 format, but it was "kinda
expensive" compared with an FPU that only does Double.

ADDX/SUBX: Technically two ADD operations in parallel with a carry flag
routed between the ALUs.

Had figured out a trick for Carry-Select adders which I will call
"handle the carry propagation first". If the adder is organized so that
the carry bits are determined first and then used to select the results
after the fact, then large adders can get a nice speedup.
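
As a cross-check of that idea, here is a minimal carry-select sketch in
C (my own illustrative model, not the actual Verilog; the 16-bit chunk
width is an arbitrary choice):

```c
#include <stdint.h>

/* Toy carry-select adder: a 64-bit add built from four 16-bit chunks.
 * Both candidate sums per chunk (carry-in 0 and carry-in 1) are
 * computed up front; the carry chain is resolved first, and the
 * resolved carries then select among the precomputed results. */
static uint64_t csel_add64(uint64_t a, uint64_t b, unsigned cin,
                           unsigned *cout)
{
    uint32_t s0[4], s1[4];  /* candidate sums, carry-in 0 / 1 */
    unsigned c0[4], c1[4];  /* candidate carry-outs */
    for (int i = 0; i < 4; i++) {
        uint32_t ai = (uint32_t)(a >> (16 * i)) & 0xFFFF;
        uint32_t bi = (uint32_t)(b >> (16 * i)) & 0xFFFF;
        uint32_t t0 = ai + bi;      /* assumes carry-in 0 */
        uint32_t t1 = ai + bi + 1;  /* assumes carry-in 1 */
        s0[i] = t0 & 0xFFFF;  c0[i] = t0 >> 16;
        s1[i] = t1 & 0xFFFF;  c1[i] = t1 >> 16;
    }
    /* Carry propagation handled first: a short select chain. */
    unsigned c = cin;
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        r |= (uint64_t)(c ? s1[i] : s0[i]) << (16 * i);
        c = c ? c1[i] : c0[i];
    }
    *cout = c;
    return r;
}
```

In hardware all eight chunk adds run in parallel, so the critical path
reduces to the short select chain plus a mux level, rather than a full
64-bit ripple.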

ANDX/ORX/XORX: Technically, just the same operation being run in
parallel on 2 lanes.

SHADX/SHLDX: This got a little interesting. I had replaced the original
barrel shift units with funnel-shift units. These units could emulate
the original barrel shift ops, perform rotates, and (via slight
trickery) be ganged up to perform a 128-bit shift (it looks like a
128-bit shift, but is actually two 64-bit shifts running in parallel).

Ironically, these funnel shifters were actually cheaper than the
original barrel shifters.

So, each shifter might select two inputs (as a "high" and a "low"
half, forming a 128-bit value), then produce an output using a sliding
64-bit window spanning these two halves.

So, for example, 64-bit ops:
A SHL B: High Bits = Input, Low bits = Zero , Offset = 64-B
A SHR B: High Bits = SExt , Low bits = Input, Offset = B
A ROL B: High Bits = Input, Low bits = Input, Offset = 64-B
A ROR B: High Bits = Input, Low bits = Input, Offset = B
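
Those four table rows can be modeled directly in C (a hypothetical
sketch; funnel() and the op names are mine, not taken from the actual
design):

```c
#include <stdint.h>

/* Funnel-shift model: treat {hi,lo} as a 128-bit value and return the
 * 64-bit window that starts 'off' bits up from the low end (off: 0..64). */
static uint64_t funnel(uint64_t hi, uint64_t lo, unsigned off)
{
    if (off == 0)  return lo;
    if (off >= 64) return hi;
    return (hi << (64 - off)) | (lo >> off);
}

/* The table rows, for shift amounts b in 0..63: */
static uint64_t shl(uint64_t a, unsigned b) { return funnel(a, 0, 64 - b); }
static uint64_t sar(uint64_t a, unsigned b)   /* SHR with sign extend */
{ return funnel((uint64_t)0 - (a >> 63), a, b); }
static uint64_t rol(uint64_t a, unsigned b) { return funnel(a, a, 64 - b); }
static uint64_t ror(uint64_t a, unsigned b) { return funnel(a, a, b); }
```

Note that the b = 0 case maps to off = 64, which is why the window
offset needs the full 0..64 range.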

For 128-bit shifts, each shifter also needs access to the other half of
the input value. The logic is a little more complicated (depends on both
whether we are Lane1 or Lane2, Bit 6 of the shift, ...).

A SHLX B:
B[6]==0:
Lane 1: High Bits = InputHi, Low bits = InputLo, Offset = 64-B[5:0]
Lane 2: High Bits = InputLo, Low bits = Zero , Offset = 64-B[5:0]
B[6]==1:
Lane 1: High Bits = InputLo, Low bits = Zero , Offset = 64-B[5:0]
Lane 2: High Bits = Zero , Low bits = Zero , Offset = 64-B[5:0]

A SHRX B:
B[6]==0:
Lane 1: High Bits = SExt , Low bits = InputHi, Offset = B[5:0]
Lane 2: High Bits = InputHi, Low bits = InputLo, Offset = B[5:0]
B[6]==1:
Lane 1: High Bits = SExt , Low bits = SExt , Offset = B[5:0]
Lane 2: High Bits = SExt , Low bits = InputHi, Offset = B[5:0]

ROLX / RORX can be left as an exercise for the reader.

The actual scenario in my case is a little more complicated, but it is a
similar idea.
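
To make the SHLX wiring concrete, here is a self-contained hypothetical
C sketch (fun64 models one 64-bit funnel lane; the lane inputs follow
the B[6]==0 and B[6]==1 cases above):

```c
#include <stdint.h>

/* One 64-bit funnel lane: the 64-bit window 'off' bits up in {hi,lo}. */
static uint64_t fun64(uint64_t hi, uint64_t lo, unsigned off)
{
    if (off == 0)  return lo;
    if (off >= 64) return hi;
    return (hi << (64 - off)) | (lo >> off);
}

/* 128-bit left shift built from two funnel lanes, b in 0..127. */
static void shlx128(uint64_t hi, uint64_t lo, unsigned b,
                    uint64_t *rhi, uint64_t *rlo)
{
    unsigned off = 64 - (b & 63);   /* Offset = 64 - B[5:0] */
    if (!(b & 64)) {                /* B[6] == 0 */
        *rhi = fun64(hi, lo, off);  /* Lane 1 */
        *rlo = fun64(lo, 0, off);   /* Lane 2 */
    } else {                        /* B[6] == 1 */
        *rhi = fun64(lo, 0, off);   /* Lane 1 */
        *rlo = 0;                   /* Lane 2 */
    }
}
```

Both lanes use the same offset, so in hardware only the input muxing
differs between them.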

....

Re: Thor (was: Dense machine code...)

<snigd0$uhs$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22123&group=comp.arch#22123

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Thor (was: Dense machine code...)
Date: Tue, 23 Nov 2021 11:39:27 +0100
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <snigd0$uhs$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 10:39:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fa2a390485d48858ed6c0edccba8849b";
logging-data="31292"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/XbQOQJwMoWfe3ohzgB7ofKcScetJOFSE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:ZoizDu4+JjmaSD/BrSUVECwCXsw=
In-Reply-To: <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
Content-Language: en-US
 by: Marcus - Tue, 23 Nov 2021 10:39 UTC

On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
> I have put a significant portion of a man-year into my own compiler over the last 25+ years. Not
> really production quality code, hobby quality with lots of ‘makes it work’ code. I think it is about
> 80k lines of code. A combination of C and C++. It started out as pure C. I have not yet used it for
> anything but simple demos. The compiler has limited support for some C++ isms like classes
> with methods and method overloading. But does not support operator overloading, templates and
> many other things associated with C++. The compiler makes use of two types of parse trees
> which would be analogous to an AST. Expression nodes and statement nodes. As it generates code
> it builds a linked list of code nodes. Optimizations are applied to all different types of nodes.
> I got started working on it for a 68000 co-processor board for the XT and it has mutated through
> several different processors. I started work before the internet and before I found out about other
> projects like gcc. I decided to put more work into my own compiler that I already knew rather than
> learn the workings of gcc. The most recent addition to the compiler is support for co-routines. I
> think I can come up with a way to support some recursion in co-routines, by invoking a co-routine
> again instead of just using ‘yield’ but I must think about it some more. I keep toying with the idea
> of trying to make it more mainstream, perhaps by adding support for other mainstream processors.
> The compiler has several nifty features like bit-slicing and enumerator generators.
>
> For field extracts Thor supports using either a register or a seven-bit constant to specify the field
> offset and width. Most Thor instructions are using a 48-bit encoding, allowing room for more options
> when dealing with operands. One operand bit indicates a register or constant value. Virtually any
> instruction can be a vector instruction as there is an opcode bit dedicated to indicating a vector
> operation.
> Thor should be extendable to 128-bits with a little bit of work. 128-bit constants can be formed
> using the appropriate prefix instructions. The plan is to support 128-bit decimal floating-point.
>

I'm curious: Do you have any public documentation of the Thor project?

BTW, in the 1990s the Swedish company SAAB Ericsson Space (later RUAG
Space) had a stack-oriented RISC microprocessor called Thor. IIRC the
stack-based approach aimed to minimize the use of on-chip memory cells
(e.g. as would be required by a register file), in order to reduce the
risk of bit flips due to ion radiation etc., which is common in space
environments (and I think that radiation-hardened external RAM was a
thing).

You can still find some public documents, e.g.:

- https://www.cse.chalmers.se/~johan/publications/Folkesson_FTCS28.pdf
- https://www.cse.chalmers.se/~johan/publications/ewdc-8.frm.pdf
- http://www.artes.uu.se/project/9811-4/nlft-statusrapport.990817.pdf
- https://dl.acm.org/doi/10.1145/126551.126564

I assume that the name Thor was chosen to reflect the Nordic heritage of
the company.

It was probably superseded by ERC32/LEON (radiation hardened SPARC).

/Marcus

Re: Thor (was: Dense machine code...)

<48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22124&group=comp.arch#22124

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:29eb:: with SMTP id jv11mr6340226qvb.13.1637672826277;
Tue, 23 Nov 2021 05:07:06 -0800 (PST)
X-Received: by 2002:a05:6808:2396:: with SMTP id bp22mr2255727oib.78.1637672826022;
Tue, 23 Nov 2021 05:07:06 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 23 Nov 2021 05:07:05 -0800 (PST)
In-Reply-To: <snigd0$uhs$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:88c1:521f:f3f4:9506;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:88c1:521f:f3f4:9506
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me> <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
<snigd0$uhs$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
Subject: Re: Thor (was: Dense machine code...)
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 23 Nov 2021 13:07:06 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 8
 by: robf...@gmail.com - Tue, 23 Nov 2021 13:07 UTC

On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
> I'm curious: Do you have any public documentation of the Thor project?
>
> /Marcus

The most recent version of the project (Thor2021) is publicly available at:
https://github.com/robfinch/Thor/tree/main/Thor2021

Re: Thor

<oD6nJ.14706$G996.845@fx31.iad>

https://www.novabbs.com/devel/article-flat.php?id=22125&group=comp.arch#22125

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx31.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Thor
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me> <06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com> <snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com> <snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com> <snhtp1$nkn$1@dont-email.me> <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com> <snigd0$uhs$1@dont-email.me> <48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
In-Reply-To: <48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 13
Message-ID: <oD6nJ.14706$G996.845@fx31.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 23 Nov 2021 14:10:28 UTC
Date: Tue, 23 Nov 2021 09:10:29 -0500
X-Received-Bytes: 1609
 by: EricP - Tue, 23 Nov 2021 14:10 UTC

robf...@gmail.com wrote:
> On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
>> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
>> I'm curious: Do you have any public documentation of the Thor project?
>>
>> /Marcus
>
> The most recent version of the project (Thor2021) is publicly available at:
> https://github.com/robfinch/Thor/tree/main/Thor2021
>

Which SystemVerilog compiler do you use?

Re: Dense machine code from C++ code (compiler optimizations)

<snitic$rbq$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22126&group=comp.arch#22126

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 15:24:11 +0100
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <snitic$rbq$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 14:24:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b5415687160ff85422e6b952e3ce5f77";
logging-data="28026"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/q2hRKrASA/FmIuw3d8ReyiHjCukMM2ac="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:iWcJDsQ7kGIMXuBwWhOJar6rI08=
In-Reply-To: <sngt7c$jbh$1@dont-email.me>
Content-Language: en-GB
 by: David Brown - Tue, 23 Nov 2021 14:24 UTC

On 22/11/2021 21:05, BGB wrote:
> On 11/22/2021 11:50 AM, Marcus wrote:
>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>
>>>> A complete rebuild of binutils +
>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>
>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>> hour with recent versions.  Which version is your gcc port
>>> based on?
>>>
>>
>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>
>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>> an NVMe drive (~3 GB/s read).
>>
>
> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
> 3.7 GHz).
>   OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>   Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).
>
> My MOBO does not have M.2 or similar, but does have SATA connectors, so
> the SSD is plugged in via SATA.
>
> RAM Stats:
>   48GB of RAM, 1467MHz (DDR4-2933)
>   192GB of Pagefile space.
>
> HDD speed is pretty variable, but generally falls into the range of
> 20-80MB/s (except for lots of small files, where it might drop down to ~
> 2MB/s or so).
>
>
> Also kinda funny is that "file and folder compression" tends to actually
> make directories full of source code or other small files somewhat
> faster (so I tend to leave it on by default for drives which I primarily
> use for projects).
>
> Otherwise, it is like the whole Windows Filesystem is built around the
> assumption that one is primarily working with small numbers of large
> files, rather than large numbers of small files.
>
>
>> I build with "make -j28" (although the GNU toolchain build system is
>> poorly parallelizable, so most of the cores are idle most of the time).
>>
>
> Trying to parallel make eats up all of the RAM and swap space, so I
> don't generally do so. This issue is especially bad with LLVM though.
>
>
>> BTW, my build script is here [1].
>>
> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>> always slower than Linux (esp. when it comes to build systems).
>>
>
> I suspect because GCC and similar do excessive amounts of file opening
> and spawning huge numbers of short-lived processes. These operations are
> well into the millisecond range.
>

Large gcc builds can easily be two or three times faster on Linux than
on Windows, in my experience. gcc comes from a *nix heritage, where
spawning new processes is cheap and fast so a single run of "gcc" may
involve starting dozens of processes connected via pipes or temporary
files. On Windows, starting a new process is a slow matter, and using
temporary files often means the file has to actually be written out to
disk - you don't have the same tricks for telling the OS that the file
does not have to be created on disk unless you run out of spare memory.
So a Windows-friendly compiler will be a monolithic program that does
everything, while gcc is much more efficient on a *nix system. In
addition, handling lots of small files is vastly faster on *nix than
Windows. (Though I believe Win10 was an improvement here compared to Win7.)

Re: Thor (was: Dense machine code...)

<snitkl$q5j$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22127&group=comp.arch#22127

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Thor (was: Dense machine code...)
Date: Tue, 23 Nov 2021 15:25:25 +0100
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <snitkl$q5j$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
<snigd0$uhs$1@dont-email.me>
<48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Nov 2021 14:25:25 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fa2a390485d48858ed6c0edccba8849b";
logging-data="26803"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX185vlyNeXaHojhSn+xW1hoTurzHqBvOwTU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:I6bnbxsyJGsDg6l36DGbLL65R0w=
In-Reply-To: <48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
Content-Language: en-US
 by: Marcus - Tue, 23 Nov 2021 14:25 UTC

On 2021-11-23 14:07, robf...@gmail.com wrote:
> On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
>> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
>> I'm curious: Do you have any public documentation of the Thor project?
>>
>> /Marcus
>
> The most recent version of the project (Thor2021) is publicly available at:
> https://github.com/robfinch/Thor/tree/main/Thor2021
>

Thanks!

That's impressive work. Great!

/Marcus

Re: Thor (was: Dense machine code...)

<641c6829-83c7-4e3e-9290-461c3b301fffn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22128&group=comp.arch#22128

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:8d4:: with SMTP id z20mr5279719qkz.526.1637678913180;
Tue, 23 Nov 2021 06:48:33 -0800 (PST)
X-Received: by 2002:a9d:1b0f:: with SMTP id l15mr4828618otl.38.1637678913045;
Tue, 23 Nov 2021 06:48:33 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 23 Nov 2021 06:48:32 -0800 (PST)
In-Reply-To: <snitkl$q5j$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:88c1:521f:f3f4:9506;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:88c1:521f:f3f4:9506
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me> <25358d47-5b27-4dc5-bca7-09c987aace54n@googlegroups.com>
<snigd0$uhs$1@dont-email.me> <48e3349f-f0b9-467a-8173-0e5961844bbbn@googlegroups.com>
<snitkl$q5j$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <641c6829-83c7-4e3e-9290-461c3b301fffn@googlegroups.com>
Subject: Re: Thor (was: Dense machine code...)
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 23 Nov 2021 14:48:33 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 17
 by: robf...@gmail.com - Tue, 23 Nov 2021 14:48 UTC

robf...@gmail.com wrote:
>> On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
>>> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
>>> I'm curious: Do you have any public documentation of the Thor project?
>>>
>>> /Marcus
>>
>> The most recent version of the project (Thor2021) is publicly available at:
>> https://github.com/robfinch/Thor/tree/main/Thor2021
>>

> Which SystemVerilog compiler do you use?

The one provided in the Xilinx Vivado toolset.

Note: to build the system there are a number of IP Catalog cores that need to
be set up. These would be provided in .XCI files, but I've not transferred them
to the repository (yet).

Re: Dense machine code from C++ code (compiler optimizations)

<snj1kq$qql$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22129&group=comp.arch#22129

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 16:33:46 +0100
Organization: A noiseless patient Spider
Lines: 88
Message-ID: <snj1kq$qql$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me>
<snitic$rbq$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 15:33:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fa2a390485d48858ed6c0edccba8849b";
logging-data="27477"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/o5jIc/HG80ZW70TjwRt6sGLvf2P4uAsc="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:YcQzrSI17xsHErB+Aim/l++bYnw=
In-Reply-To: <snitic$rbq$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Tue, 23 Nov 2021 15:33 UTC

On 2021-11-23 15:24, David Brown wrote:
> On 22/11/2021 21:05, BGB wrote:
>> On 11/22/2021 11:50 AM, Marcus wrote:
>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>>
>>>>> A complete rebuild of binutils +
>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>
>>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>>> hour with recent versions.  Which version is your gcc port
>>>> based on?
>>>>
>>>
>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>
>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>>> an NVMe drive (~3 GB/s read).
>>>
>>
>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
>> 3.7 GHz).
>>   OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>>   Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).
>>
>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
>> the SSD is plugged in via SATA.
>>
>> RAM Stats:
>>   48GB of RAM, 1467MHz (DDR4-2933)
>>   192GB of Pagefile space.
>>
>> HDD speed is pretty variable, but generally falls into the range of
>> 20-80MB/s (except for lots of small files, where it might drop down to ~
>> 2MB/s or so).
>>
>>
>> Also kinda funny is that "file and folder compression" tends to actually
>> make directories full of source code or other small files somewhat
>> faster (so I tend to leave it on by default for drives which I primarily
>> use for projects).
>>
>> Otherwise, it is like the whole Windows Filesystem is built around the
>> assumption that one is primarily working with small numbers of large
>> files, rather than large numbers of small files.
>>
>>
>>> I build with "make -j28" (although the GNU toolchain build system is
>>> poorly parallelizable, so most of the cores are idle most of the time).
>>>
>>
>> Trying to parallel make eats up all of the RAM and swap space, so I
>> don't generally do so. This issue is especially bad with LLVM though.
>>
>>
>>> BTW, my build script is here [1].
>>>
>>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>>> always slower than Linux (esp. when it comes to build systems).
>>>
>>
>> I suspect it is because GCC and similar do excessive amounts of file opening
>> and spawning huge numbers of short-lived processes. These operations are
>> well into the millisecond range.
>>
>
> Large gcc builds can easily be two or three times faster on Linux than
> on Windows, in my experience. gcc comes from a *nix heritage, where
> spawning new processes is cheap and fast so a single run of "gcc" may
> involve starting dozens of processes connected via pipes or temporary
> files. On Windows, starting a new process is a slow matter, and using
> temporary files often means the file has to actually be written out to
> disk - you don't have the same tricks for telling the OS that the file
> does not have to be created on disk unless you run out of spare memory.
> So a Windows-friendly compiler will be a monolithic program that does
> everything, while gcc is much more efficient on a *nix system. In
> addition, handling lots of small files is vastly faster on *nix than
> Windows. (Though I believe Win10 was an improvement here compared to Win7.)
>

Possibly (Win10 vs Win7), but it's still not even in the same ballpark
as Linux:

https://www.bitsnbites.eu/benchmarking-os-primitives/

File creation is ~60x slower on Windows 10 compared to Linux (highly
dependent on things like Defender and indexing services though), and
launching processes is ~20x slower (in the tests in that article).

Re: Dense machine code from C++ code (compiler optimizations)

<snj7od$f3i$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=22133&group=comp.arch#22133
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-19ac-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 17:18:05 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <snj7od$f3i$1@newsreader4.netcologne.de>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngs88$tcd$1@newsreader4.netcologne.de>
<snibhu$t8b$1@dont-email.me>
Injection-Date: Tue, 23 Nov 2021 17:18:05 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-19ac-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:19ac:0:7285:c2ff:fe6c:992d";
logging-data="15474"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 23 Nov 2021 17:18 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:
> On 2021-11-22 20:49, Thomas Koenig wrote:
>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>>
>>>>> A complete rebuild of binutils +
>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>
>>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>>> hour with recent versions. Which version is your gcc port
>>>> based on?
>>>>
>>>
>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>
>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>>> an NVMe drive (~3 GB/s read).
>>>
>>> I build with "make -j28" (although the GNU toolchain build system is
>>> poorly parallelizable, so most of the cores are idle most of the time).
>>>
>>> BTW, my build script is here [1].
>>
>> On a POWER 9 with "make -j32", I get
>>
>>
>> real 38m30.253s
>> user 420m38.816s
>> sys 7m9.639s
>>
>> with Fortran, C++ and C enabled (and checking).
>>
>
> I once had the opportunity to benchmark a relatively pricey POWER 9
> system against an AMD 3950X.
>
> IBM POWER 9:
> - 2 sockets x 12 cores/socket x 4 threads/core = 96 HW threads
> - 2.9 GHz under load (pretty high for a server CPU)
>
> AMD 3950X:
> - 1 socket x 16 cores/socket x 2 threads/core = 32 HW threads
> - 3.9 GHz under load
>
> I built LLVM using GCC, which saturated all cores, and to my surprise
> the 3950X was faster than the POWER 9:
>
> POWER 9: Build time 7m40s
> 3950X: Build time 6m45s

An AMD 3950X is one generation ahead of a POWER9, so it stands to
reason that it is faster.

For CFD, there was a time window a couple of years ago when the
Talos II machines actually beat the performance of x86 chips,
so a colleague bought a POWER9 system for that application.
The program in question was OpenFOAM.

Then AMD added memory channels, and the performance advantage relative
to the newest generation was gone :-)

(POWER seems to be very good at attaching GPUs, but OpenFOAM doesn't
use that, and anyway GPUs are of little help for CFD if the memory and
memory bandwidth requirements are high.)

Re: Dense machine code from C++ code (compiler optimizations)

<351qpgpisnmpd7b0fov09d8rhqa86vacom@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=22134&group=comp.arch#22134
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 12:38:46 -0500
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <351qpgpisnmpd7b0fov09d8rhqa86vacom@4ax.com>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me> <snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de> <sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me> <sni7hf$5o5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="26eba9b208e03dd446f2def9453424f1";
logging-data="20288"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18T3+D+l27YggaDZDf/p9lpf7fVQ3d2z2k="
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:XoclZV3ABcQSBM8uQZTzdZpjEmQ=
 by: George Neuner - Tue, 23 Nov 2021 17:38 UTC

On Tue, 23 Nov 2021 09:08:14 +0100, Marcus
<m.delete@this.bitsnbites.eu> wrote:

>I believe that Windows is particularly sensitive to slow/spinning
>drives. My impression is that Linux is generally better at caching disk
>data in RAM, which hides some of the slowness (esp. when working with
>small files).

The server edition of Windows behaves more like Linux.

The workstation edition by default has a /per-application/ limit on
the amount of data that can be cached. Even if there is more space
available, once an application reaches the limit it will start to
cache thrash. This is independent of the program file handle limit.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session
Manager\Memory Management]
"LargeSystemCache"=dword:00000000

Set to 1 to enable, 0 to disable. Note that LargeSystemCache disables
per-app data limits, so an application that uses huge amounts of file
data potentially can take up to the whole cache.
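For reference, the same value can be set from an elevated command
prompt rather than a .reg file (a sketch of the equivalent `reg add`
invocation; a reboot is needed for it to take effect):

```shell
:: Enable LargeSystemCache (run from an elevated command prompt).
:: Mirrors the registry fragment above; set /d to 0 to disable again.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" ^
    /v LargeSystemCache /t REG_DWORD /d 1 /f
```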

This little app can set/limit the total size of the file cache (not
the per-app limits). By default Windows will use up to ~90% of free
memory.
https://www.microimages.com/downloads/SetFileCacheSize.htm

Note: this needs Admin privileges, and although you can change the
cache size on the fly, the settings will NOT survive an OS restart. If
you find it useful, you should schedule it as a task to set your
chosen cache size at boot.

I'm sure there is a way to change the per-app data limits, but
unfortunately I don't know the magic incantations to search for.

HP used to maintain a comprehensive listing of Windows registry
settings, but unfortunately I lost the link and Google is not being
cooperative just now. Sigh!

George

Re: Dense machine code from C++ code (compiler optimizations)

<9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22136&group=comp.arch#22136
X-Received: by 2002:a37:b6c1:: with SMTP id g184mr7135251qkf.270.1637693072206;
Tue, 23 Nov 2021 10:44:32 -0800 (PST)
X-Received: by 2002:a05:6830:2b0f:: with SMTP id l15mr6658476otv.333.1637693071910;
Tue, 23 Nov 2021 10:44:31 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 23 Nov 2021 10:44:31 -0800 (PST)
In-Reply-To: <snhtp1$nkn$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 23 Nov 2021 18:44:32 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 123
 by: MitchAlsup - Tue, 23 Nov 2021 18:44 UTC

On Monday, November 22, 2021 at 11:21:41 PM UTC-6, BGB wrote:
> On 11/22/2021 5:04 PM, MitchAlsup wrote:
> > On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
> >> On 11/22/2021 12:54 PM, MitchAlsup wrote:
> >
> >>> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> >>> the other defines the width of the field (0->64). For immediate shift amounts
> >>> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> >>> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> >>> (The empty spaces are checked for significance)
> >>> <
> >>> Then there is the operand-by-operand select (CMOV) and the bit-by bit
> >>> select (Multiplex)
> >>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
> >>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
> >> I did go and add a bit-select instruction (BITSEL / MUX).
> >>
> >> Currently I have:
> >> CSELT // Rn = SR.T ? Rm : Ro;
> >> PCSELT.L // Packed select, 32-bit words (SR.ST)
> >> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
> >> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
> >>
> >> BITSEL didn't add much cost, but did initially result in the core
> >> failing timing.
> >>
> >> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
> >> work again.
> >>
> >>
> >> The MOV.C instruction has been demoted some, as:
> >> It isn't entirely free;
>>>> Its advantage in terms of performance is fairly small;
> >> It doesn't seem to play well with TLB Miss interrupts.
> >>
> >> Though, a cheaper option would be "MOV.C that only works with GBR and LR
> >> and similar" (similar effect in practice, but avoids most of the cost of
> >> the additional signal routing).
> >>
> >>
> >> I am also on the fence and considering disallowing using bit-shift
> >> operations in Lane 3, mostly as a possible way to reduce costs by not
> >> having a (rarely used) Lane 3 shift unit.
> > <
> > In general purpose codes, shifts are "not all that present" 2%-5% range (source
> > code). In one 6-wide machine, we stuck an integer unit in each of the 6-slots
> > but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
> > could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
> > multiplier (slot[3]).
> > <
> > Shifters are on the order of integer adders in gate count, and less useful.
> All 3 lanes still have other ALU ops, including some type-conversion and
> packed-integer SIMD ops.
> >>
> >> Still on the fence though as it does appear that shifts operators in
> >> Lane 3 aren't exactly unused either (so the change breaks compatibility
> >> with my existing binaries).
> > <
> > You should see what the cost is if only lanes[0..1] can perform shifts.
> Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
> It also makes timing a little better.
>
> I guess the main factor is whether it justifies the break in binary
> compatibility (since previously, the WEXifier assumed that running
> shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
> number of times in pretty much every binary I have looked at).
> >>>>
> >>>> Could synthesize a mask from a shift and count, harder part is coming up
> >>>> with a way to do so cheaply (when the shift unit is already in use this
> >>>> cycle).
> >>> <
> >>> It is a simple decoder........
> >> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
> >> table isn't super cheap.
> > <
> > input.....|.............................output..................................
> > 000000 | 0000000000000000000000000000000000000000000000000000000000000000
> > 000001 | 0000000000000000000000000000000000000000000000000000000000000001
> > 000010 | 0000000000000000000000000000000000000000000000000000000000000011
> > 000011 | 0000000000000000000000000000000000000000000000000000000000000111
> > 000100 | 0000000000000000000000000000000000000000000000000000000000001111
> > etc.
> > It is a straight "Greater Than" decoder.
> One needs a greater-than comparison, a less-than comparison, and a way
> to combine them with a bit-wise operator (AND or OR), for each bit.
>
> Say, min, max:
> min<=max: Combine with AND
> min >max: Combine with OR
>
> This would allow composing masks with a range of patterns, say:
> 00000000000003FF
> 7FE0000000000000
> 000003FFFFC00000
<
Up to this point you need a Greater than decoder and a less than decoder
and an AND gate.
<
> 7FFE000000003FFF
> ...
Why split the field around the container boundary ??
>
>
> Granted, this isn't all that steep (vs the core as a whole), but I guess
> it is more a question of if it would be useful enough to make it worthwhile.
>
>
> Though, would need to find an "odd corner" to shove it into.
> Otherwise, BJX2 lacks any encodings for "OP Imm12, Rn".
>
> Have spots for Imm10 (insufficient), and Imm16 (but no, not going to
> spend an Imm16 spot on this; there are only a few of these left).
>
> Have a few possible ideas (involving XGPR encodings, such as reclaiming
> the encoding space for branches for Imm12 ops because XGPR+BRA does not
> make sense), but they seem like ugly hacks.
>
> Similarly, outside of being able to fit in a 32-bit encoding, it loses
> any real advantage it might have had. Jumbo encodings can load pretty
> much any possible constant.
>
>
> >>
> >
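Pulling the pieces above together: the per-bit "greater than" decoder,
the AND-combined field mask, and the CMOV/MUX selects can be sketched
in C (function names here are illustrative, not from either ISA):

```c
#include <stdint.h>

/* Per-bit "greater than" decoder from the table above:
   bits [0, n) are set, n in 0..64 (e.g. make_mask(4) == 0x0F). */
static uint64_t make_mask(unsigned n) {
    return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}

/* Contiguous field of 'width' bits starting at 'shift': two such
   decoders combined with AND, as described above. */
static uint64_t field_mask(unsigned shift, unsigned width) {
    return make_mask(shift + width) & ~make_mask(shift);
}

/* Bit-by-bit select (MUX): Rd = (Rs1 & Rs2) | (~Rs1 & Rs3) */
static uint64_t mux64(uint64_t s1, uint64_t s2, uint64_t s3) {
    return (s1 & s2) | (~s1 & s3);
}

/* Operand-by-operand select (CMOV): whole-register choice on s1. */
static uint64_t cmov64(uint64_t s1, uint64_t s2, uint64_t s3) {
    return s1 ? s2 : s3;
}
```

Combining masks with OR instead of AND (the min > max case) yields a
pattern that wraps around the container boundary, e.g.
`field_mask(49, 14) | make_mask(14)` gives `0x7FFE000000003FFF`, one
of the example patterns quoted above.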

Re: Dense machine code from C++ code (compiler optimizations)

<snjigc$skp$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=22139&group=comp.arch#22139
Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 21:21:32 +0100
Organization: Aioe.org NNTP Server
Message-ID: <snjigc$skp$1@gioia.aioe.org>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me>
<snitic$rbq$1@dont-email.me> <snj1kq$qql$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="29337"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 23 Nov 2021 20:21 UTC

Marcus wrote:
> On 2021-11-23 15:24, David Brown wrote:
>> On 22/11/2021 21:05, BGB wrote:
>>> On 11/22/2021 11:50 AM, Marcus wrote:
>>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>>>
>>>>>> A complete rebuild of binutils +
>>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>>
>>>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>>>> hour with recent versions.  Which version is your gcc port
>>>>> based on?
>>>>>
>>>>
>>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>>
>>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>>>> an NVMe drive (~3 GB/s read).
>>>>
>>>
>>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
>>> 3.7 GHz).
>>>    OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>>>    Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD
>>> Red).
>>>
>>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
>>> the SSD is plugged in via SATA.
>>>
>>> RAM Stats:
>>>    48GB of RAM, 1467MHz (DDR4-2933)
>>>    192GB of Pagefile space.
>>>
>>> HDD speed is pretty variable, but generally falls into the range of
>>> 20-80MB/s (except for lots of small files, where it might drop down to ~
>>> 2MB/s or so).
>>>
>>>
>>> Also kinda funny is that "file and folder compression" tends to actually
>>> make directories full of source code or other small files somewhat
>>> faster (so I tend to leave it on by default for drives which I primarily
>>> use for projects).
>>>
>>> Otherwise, it is like the whole Windows Filesystem is built around the
>>> assumption that one is primarily working with small numbers of large
>>> files, rather than large numbers of small files.
>>>
>>>
>>>> I build with "make -j28" (although the GNU toolchain build system is
>>>> poorly parallelizable, so most of the cores are idle most of the time).
>>>>
>>>
>>> Trying to parallel make eats up all of the RAM and swap space, so I
>>> don't generally do so. This issue is especially bad with LLVM though.
>>>
>>>
>>>> BTW, my build script is here [1].
>>>>
>>>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>>>> always slower than Linux (esp. when it comes to build systems).
>>>>
>>>
>>> I suspect because GCC and similar do excessive amounts of file opening
>>> and spawning huge numbers of short-lived processes. These operations are
>>> well into the millisecond range.
>>>
>>
>> Large gcc builds can easily be two or three times faster on Linux than
>> on Windows, in my experience.  gcc comes from a *nix heritage, where
>> spawning new processes is cheap and fast so a single run of "gcc" may
>> involve starting dozens of processes connected via pipes or temporary
>> files.  On Windows, starting a new process is a slow matter, and using
>> temporary files often means the file has to actually be written out to
>> disk - you don't have the same tricks for telling the OS that the file
>> does not have to be created on disk unless you run out of spare memory.
>>   So a Windows-friendly compiler will be a monolithic program that does
>> everything, while gcc is much more efficient on a *nix system.  In
>> addition, handling lots of small files is vastly faster on *nix than
>> Windows.  (Though I believe Win10 was an improvement here compared to
>> Win7.)
>>
>
> Possibly (Win10 vs Win7), but it's still not even in the same ballpark
> as Linux:
>
> https://www.bitsnbites.eu/benchmarking-os-primitives/
>
> File creation is ~60x slower on Windows 10 compared to Linux (highly
> dependent on things like Defender and indexing services though), and
> launching processes is ~20x slower (in the tests in that article).

It turns out that a group of people, including Robert Collins, a
co-worker (at Cognite), has done a lot of work on speeding up Rust
installations on both Unix/Linux and Windows platforms. They found
that Windows can in fact reach parity with Linux, but only with
serious effort:

https://www.youtube.com/watch?v=qbKGw8MQ0i8

It turns out that the actual throughput is approximately the same for
Windows/NTFS and Linux, but as you noted, Defender & co. make the
latency orders of magnitude worse. By overlapping pretty much all of
the file metadata work (mainly file creations), those latency bubbles
can in fact be filled in. AFAIR they reduced the installation time
from several minutes to ~10 seconds.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<f331900a-630f-49a2-ab48-89441ed984d1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22141&group=comp.arch#22141
X-Received: by 2002:ac8:7f86:: with SMTP id z6mr1457630qtj.162.1637709027937;
Tue, 23 Nov 2021 15:10:27 -0800 (PST)
X-Received: by 2002:a9d:206a:: with SMTP id n97mr8325777ota.142.1637709025787;
Tue, 23 Nov 2021 15:10:25 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 23 Nov 2021 15:10:25 -0800 (PST)
In-Reply-To: <snjigc$skp$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me> <snitic$rbq$1@dont-email.me>
<snj1kq$qql$1@dont-email.me> <snjigc$skp$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f331900a-630f-49a2-ab48-89441ed984d1n@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 23 Nov 2021 23:10:27 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 147
 by: MitchAlsup - Tue, 23 Nov 2021 23:10 UTC

On Tuesday, November 23, 2021 at 2:21:37 PM UTC-6, Terje Mathisen wrote:
> Marcus wrote:
> > On 2021-11-23 15:24, David Brown wrote:
> >> On 22/11/2021 21:05, BGB wrote:
> >>> On 11/22/2021 11:50 AM, Marcus wrote:
> >>>> On 2021-11-22 16:54, Thomas Koenig wrote:
> >>>>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> >>>>>
> >>>>>> A complete rebuild of binutils +
> >>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
> >>>>>
> >>>>> That is blindingly fast, it usually takes about half to 3/4 of an
> >>>>> hour with recent versions. Which version is your gcc port
> >>>>> based on?
> >>>>>
> >>>>
> >>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
> >>>>
> >>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
> >>>> an NVMe drive (~3 GB/s read).
> >>>>
> >>>
> >>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
> >>> 3.7 GHz).
> >>> OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
> >>> Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD
> >>> Red).
> >>>
> >>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
> >>> the SSD is plugged in via SATA.
> >>>
> >>> RAM Stats:
> >>> 48GB of RAM, 1467MHz (DDR4-2933)
> >>> 192GB of Pagefile space.
> >>>
> >>> HDD speed is pretty variable, but generally falls into the range of
> >>> 20-80MB/s (except for lots of small files, where it might drop down to ~
> >>> 2MB/s or so).
> >>>
> >>>
> >>> Also kinda funny is that "file and folder compression" tends to actually
> >>> make directories full of source code or other small files somewhat
> >>> faster (so I tend to leave it on by default for drives which I primarily
> >>> use for projects).
> >>>
> >>> Otherwise, it is like the whole Windows Filesystem is built around the
> >>> assumption that one is primarily working with small numbers of large
> >>> files, rather than large numbers of small files.
> >>>
> >>>
> >>>> I build with "make -j28" (although the GNU toolchain build system is
> >>>> poorly parallelizable, so most of the cores are idle most of the time).
> >>>>
> >>>
> >>> Trying to parallel make eats up all of the RAM and swap space, so I
> >>> don't generally do so. This issue is especially bad with LLVM though.
> >>>
> >>>
> >>>> BTW, my build script is here [1].
> >>>>
> >>>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
> >>>> always slower than Linux (esp. when it comes to build systems).
> >>>>
> >>>
> >>> I suspect because GCC and similar do excessive amounts of file opening
> >>> and spawning huge numbers of short-lived processes. These operations are
> >>> well into the millisecond range.
> >>>
> >>
> >> Large gcc builds can easily be two or three times faster on Linux than
> >> on Windows, in my experience. gcc comes from a *nix heritage, where
> >> spawning new processes is cheap and fast so a single run of "gcc" may
> >> involve starting dozens of processes connected via pipes or temporary
> >> files. On Windows, starting a new process is a slow matter, and using
> >> temporary files often means the file has to actually be written out to
> >> disk - you don't have the same tricks for telling the OS that the file
> >> does not have to be created on disk unless you run out of spare memory.
> >> So a Windows-friendly compiler will be a monolithic program that does
> >> everything, while gcc is much more efficient on a *nix system. In
> >> addition, handling lots of small files is vastly faster on *nix than
> >> Windows. (Though I believe Win10 was an improvement here compared to
> >> Win7.)
> >>
> >
> > Possibly (Win10 vs Win7), but it's still not even in the same ballpark
> > as Linux:
> >
> > https://www.bitsnbites.eu/benchmarking-os-primitives/
> >
> > File creation is ~60x slower on Windows 10 compared to Linux (highly
> > dependent on things like Defender and indexing services though), and
> > launching processes is ~20x slower (in the tests in that article).
> It turns out that a group of people which include Robert Collins, a
> co-worker (at Cognite), have done a lot of work on speeding up Rust
> installations on both unix/linux and Microsoft platforms, they found
> that Microsoft can in fact reach parity with Linux, but with serious effort:
>
> https://www.youtube.com/watch?v=qbKGw8MQ0i8
>
> It turns out that the actual throughput is approximately the same for
> Windows/NTFS and Linux, but as you noted Defender & co makes the latency
> orders of magnitude worse. By overlapping pretty much all the file
> metadata work (mainly file creations) those latency bubbles can in fact
> be filled in. Afair they reduced the installation time from several
> minutes to ~10 seconds.
<
Now if MS would simply allow one to DD <current image> -> <new image>
so that recovery was 6 minutes instead of 6 hours we might be getting
somewhere...
<
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<snjsvc$48l$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22142&group=comp.arch#22142
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 17:20:05 -0600
Organization: A noiseless patient Spider
Lines: 172
Message-ID: <snjsvc$48l$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Nov 2021 23:20:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6d41b7aef672a37ebb8e3597c0f9a646";
logging-data="4373"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18I0R3ZXoqc2ikyRafPUSp/"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:5FSJxREUCGq0j5DUwLWeBAShRHg=
In-Reply-To: <9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 23 Nov 2021 23:20 UTC

On 11/23/2021 12:44 PM, MitchAlsup wrote:
> On Monday, November 22, 2021 at 11:21:41 PM UTC-6, BGB wrote:
>> On 11/22/2021 5:04 PM, MitchAlsup wrote:
>>> On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
>>>> On 11/22/2021 12:54 PM, MitchAlsup wrote:
>>>
>>>>> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
>>>>> the other defines the width of the field (0->64). For immediate shift amounts
>>>>> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
>>>>> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
>>>>> (The empty spaces are checked for significance)
>>>>> <
>>>>> Then there is the operand-by-operand select (CMOV) and the bit-by bit
>>>>> select (Multiplex)
>>>>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
>>>>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>>>> I did go and add a bit-select instruction (BITSEL / MUX).
>>>>
>>>> Currently I have:
>>>> CSELT // Rn = SR.T ? Rm : Ro;
>>>> PCSELT.L // Packed select, 32-bit words (SR.ST)
>>>> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
>>>> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
>>>>
>>>> BITSEL didn't add much cost, but did initially result in the core
>>>> failing timing.
>>>>
>>>> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
>>>> work again.
>>>>
>>>>
>>>> The MOV.C instruction has been demoted some, as:
>>>> It isn't entirely free;
>>>> Its advantage in terms of performance is fairly small;
>>>> It doesn't seem to play well with TLB Miss interrupts.
>>>>
>>>> Though, a cheaper option would be "MOV.C that only works with GBR and LR
>>>> and similar" (similar effect in practice, but avoids most of the cost of
>>>> the additional signal routing).
>>>>
>>>>
>>>> I am also on the fence and considering disallowing using bit-shift
>>>> operations in Lane 3, mostly as a possible way to reduce costs by not
>>>> having a (rarely used) Lane 3 shift unit.
>>> <
>>> In general purpose codes, shifts are "not all that present" 2%-5% range (source
>>> code). In one 6-wide machine, we stuck an integer unit in each of the 6-slots
>>> but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
>>> could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
>>> multiplier (slot[3]).
>>> <
>>> Shifters are on the order of integer adders in gate count, and less useful.
>> All 3 lanes still have other ALU ops, including some type-conversion and
>> packed-integer SIMD ops.
>>>>
>>>> Still on the fence though as it does appear that shift operators in
>>>> Lane 3 aren't exactly unused either (so the change breaks compatibility
>>>> with my existing binaries).
>>> <
>>> You should see what the cost is if only lanes[0..1] can perform shifts.
>> Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
>> It also makes timing a little better.
>>
>> I guess the main factor is whether it justifies the break in binary
>> compatibility (since previously, the WEXifier assumed that running
>> shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
>> number of times in pretty much every binary I have looked at).
>>>>>>
>>>>>> Could synthesize a mask from a shift and count, harder part is coming up
>>>>>> with a way to do so cheaply (when the shift unit is already in use this
>>>>>> cycle).
>>>>> <
>>>>> It is a simple decoder........
>>>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
>>>> table isn't super cheap.
>>> <
>>> input.....|.............................output..................................
>>> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
>>> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
>>> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
>>> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
>>> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
>>> etc.
>>> It is a straight "Greater Than" decoder.
>> One needs a greater-than comparison, a less-than comparison, and a way
>> to combine them with a bit-wise operator (AND or OR), for each bit.
>>
>> Say, min, max:
>> min<=max: Combine with AND
>> min >max: Combine with OR
>>
>> This would allow composing masks with a range of patterns, say:
>> 00000000000003FF
>> 7FE0000000000000
>> 000003FFFFC00000
> <
> Up to this point you need a Greater than decoder and a less than decoder
> and an AND gate.
> <
>> 7FFE000000003FFF
>> ...
> Why split the field around the container boundary ??

Masks like this happen sometimes, and supporting both an AND gate and an
OR gate would not likely cost that much more than the AND gate by itself.
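A sketch of that decoder pair in C (the function name is hypothetical; this assumes the AND case builds one contiguous run of bits lo..hi, while the OR case, used when lo > hi, builds a run that wraps around the container boundary):

```c
#include <stdint.h>

/* Hypothetical model of the mask decoder: two 6-bit fields give the low
   and high set-bit positions. A greater-than-or-equal decoder and a
   less-than-or-equal decoder are combined per bit with AND (lo <= hi,
   one contiguous run) or OR (lo > hi, a run wrapping around bit 63/0). */
uint64_t make_mask(unsigned lo, unsigned hi) {
    uint64_t mask = 0;
    for (unsigned i = 0; i < 64; i++) {
        int ge  = (i >= lo);               /* "greater than" decoder */
        int le  = (i <= hi);               /* "less than" decoder    */
        int bit = (lo <= hi) ? (ge & le)   /* combine with AND       */
                             : (ge | le);  /* combine with OR        */
        if (bit)
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

Here make_mask(0, 9) gives 0x00000000000003FF and make_mask(22, 41) gives 0x000003FFFFC00000, matching the patterns above; the wrapped case produces masks like make_mask(49, 13) == 0xFFFE000000003FFF.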

Goes and runs some stats:
Constants that fit these patterns: ~ 30% of the total constants;
Percentage of masks which end up using jumbo 64 or 96: ~ 6%
Percentage of total constant loads, masks as jumbo: ~ 2%
Percentage of total instructions, masks as jumbo: ~ 0.15%

For something the size of Doom, it would save ~ 670 bytes.

Of the total constants (for Doom):
~ 75% LDI Imm16, Rn
~ 15% LDI Imm8, Rn (16-bit encoding)
~ 2% LDIHI Imm10, Rn (Load Imm10 into high-order bits)
~ 0.5% FLDCH Imm16, Rn (Load Half-Float as Double, *1)
~ 6% Jumbo64 (Imm33s)
~ 0.6% Jumbo96 (Imm64)

*1: This stat is somewhat bigger in Quake, but there are very few FP
constants in Doom.

In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
encodings (seems to be mostly Binary64 constants which can't be
expressed exactly as either Binary16 or Binary32).

The ratio of bit-mask patterns to other constants is also a little lower
in Quake vs Doom.

It looks like, for FP constants (~ 14% of the total):
~ 60% can be expressed exactly as Binary16
~ 20% can be expressed exactly as Binary32
~ 20% fall through to Binary64.

May need to look at it; a larger percentage should theoretically fit the
Binary32 pattern, given that Quake primarily uses "float" operations.

Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
which will forcibly round the result. The number of seemingly
double-precision constants seems mysterious.
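One cheap way a compiler can classify such constants is a round-trip test: a double is exactly representable in a narrower format iff narrowing and widening again is lossless. A minimal sketch for the Binary32 case (the Binary16 case is analogous, given a half-float conversion helper):

```c
#include <stdbool.h>

/* A double fits Binary32 exactly iff narrowing to float loses nothing. */
bool fits_binary32(double d) {
    float f = (float)d;
    return (double)f == d;
}
```

For example, 0.5 and 1.5 round-trip exactly, while 0.1 does not (it is already inexact as a double, and rounds again when narrowed to float).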

....

>>
>>
>> Granted, this isn't all that steep (vs the core as a whole), but I guess
>> it is more a question of if it would be useful enough to make it worthwhile.
>>
>>
>> Though, would need to find an "odd corner" to shove it into.
>> Otherwise, BJX2 lacks any encodings for "OP Imm12, Rn".
>>
>> Have spots for Imm10 (insufficient), and Imm16 (but no, not going to
>> spend an Imm16 spot on this; there are only a few of these left).
>>
>> Have a few possible ideas (involving XGPR encodings, such as reclaiming
>> the encoding space for branches for Imm12 ops because XGPR+BRA does not
>> make sense), but they seem like ugly hacks.
>>
>> Similarly, outside of being able to fit in a 32-bit encoding, it loses
>> any real advantage it might have had. Jumbo encodings can load pretty
>> much any possible constant.
>>
>>
>>>>
>>>

Re: Dense machine code from C++ code (compiler optimizations)

<snjur8$enb$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22143&group=comp.arch#22143

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 17:52:02 -0600
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <snjur8$enb$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<snffpt$o6p$1@dont-email.me> <sngefb$j2l$1@newsreader4.netcologne.de>
<sngl8p$laa$1@dont-email.me> <sngt7c$jbh$1@dont-email.me>
<snitic$rbq$1@dont-email.me> <snj1kq$qql$1@dont-email.me>
<snjigc$skp$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Nov 2021 23:52:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6d41b7aef672a37ebb8e3597c0f9a646";
logging-data="15083"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+iBSDstat7q/yO5ePIqFZ8"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:AqRFMkIMt7E2pvR5CMFm+oFq0nQ=
In-Reply-To: <snjigc$skp$1@gioia.aioe.org>
Content-Language: en-US
 by: BGB - Tue, 23 Nov 2021 23:52 UTC

On 11/23/2021 2:21 PM, Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-11-23 15:24, David Brown wrote:
>>> On 22/11/2021 21:05, BGB wrote:
>>>> On 11/22/2021 11:50 AM, Marcus wrote:
>>>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>>>>
>>>>>>> A complete rebuild of binutils +
>>>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>>>
>>>>>> That is blindingly fast; it usually takes about half to 3/4 of an
>>>>>> hour with recent versions.  Which version is your gcc port
>>>>>> based on?
>>>>>>
>>>>>
>>>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>>>
>>>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU,
>>>>> with
>>>>> an NVMe drive (~3 GB/s read).
>>>>>
>>>>
>>>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core,
>>>> 16-thread,
>>>> 3.7 GHz).
>>>>    OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>>>>    Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD
>>>> Red).
>>>>
>>>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
>>>> the SSD is plugged in via SATA.
>>>>
>>>> RAM Stats:
>>>>    48GB of RAM, 1467MHz (DDR4-2933)
>>>>    192GB of Pagefile space.
>>>>
>>>> HDD speed is pretty variable, but generally falls into the range of
>>>> 20-80MB/s (except for lots of small files, where it might drop down
>>>> to ~
>>>> 2MB/s or so).
>>>>
>>>>
>>>> Also kinda funny is that "file and folder compression" tends to
>>>> actually
>>>> make directories full of source code or other small files somewhat
>>>> faster (so I tend to leave it on by default for drives which I
>>>> primarily
>>>> use for projects).
>>>>
>>>> Otherwise, it is like the whole Windows Filesystem is built around the
>>>> assumption that one is primarily working with small numbers of large
>>>> files, rather than large numbers of small files.
>>>>
>>>>
>>>>> I build with "make -j28" (although the GNU toolchain build system is
>>>>> poorly parallelizable, so most of the cores are idle most of the
>>>>> time).
>>>>>
>>>>
>>>> Trying to parallel make eats up all of the RAM and swap space, so I
>>>> don't generally do so. This issue is especially bad with LLVM though.
>>>>
>>>>
>>>>> BTW, my build script is here [1].
>>>>>
>>>>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>>>>> always slower than Linux (esp. when it comes to build systems).
>>>>>
>>>>
>>>> I suspect because GCC and similar do excessive amounts of file opening
>>>> and spawning huge numbers of short-lived processes. These operations
>>>> are
>>>> well into the millisecond range.
>>>>
>>>
>>> Large gcc builds can easily be two or three times faster on Linux than
>>> on Windows, in my experience.  gcc comes from a *nix heritage, where
>>> spawning new processes is cheap and fast so a single run of "gcc" may
>>> involve starting dozens of processes connected via pipes or temporary
>>> files.  On Windows, starting a new process is a slow matter, and using
>>> temporary files often means the file has to actually be written out to
>>> disk - you don't have the same tricks for telling the OS that the file
>>> does not have to be created on disk unless you run out of spare memory.
>>>   So a Windows-friendly compiler will be a monolithic program that does
>>> everything, while gcc is much more efficient on a *nix system.  In
>>> addition, handling lots of small files is vastly faster on *nix than
>>> Windows.  (Though I believe Win10 was an improvement here compared to
>>> Win7.)
>>>
>>
>> Possibly (Win10 vs Win7), but it's still not even in the same ballpark
>> as Linux:
>>
>> https://www.bitsnbites.eu/benchmarking-os-primitives/
>>
>> File creation is ~60x slower on Windows 10 compared to Linux (highly
>> dependent on things like Defender and indexing services though), and
>> launching processes is ~20x slower (in the tests in that article).
>
> It turns out that a group of people, including Robert Collins, a
> co-worker (at Cognite), have done a lot of work on speeding up Rust
> installations on both unix/linux and Microsoft platforms. They found
> that Windows can in fact reach parity with Linux, but it takes serious
> effort:
>
> https://www.youtube.com/watch?v=qbKGw8MQ0i8
>
> It turns out that the actual throughput is approximately the same for
> Windows/NTFS and Linux, but as you noted Defender & co makes the latency
> orders of magnitude worse. By overlapping pretty much all the file
> metadata work (mainly file creations) those latency bubbles can in fact
> be filled in. Afair they reduced the installation time from several
> minutes to ~10 seconds.
>

One thing I had considered in the past was potentially moving a lot of
things like C library headers into a VFS (so the compiler would try to
fetch header files from the VFS rather than the OS filesystem).

In such a system, every major installed library would likely be an image
file which is then mounted into the compiler's VFS (likely read-only).

I didn't do so for BGBCC, mostly because this seemed like it would be a
bit of a hassle to work with, and my existing strategies (such as
keeping previously loaded header files cached in RAM between translation
units) had mostly already gotten these issues under control.
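The latency-hiding trick described above can be sketched with plain threads: issue many file creations concurrently, so per-file latency (Defender scans, metadata flushes) overlaps instead of summing. A minimal POSIX sketch (function names and file counts are hypothetical; compile with -pthread):

```c
#define _DEFAULT_SOURCE  /* for mkdtemp() on glibc */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NFILES   64
#define NTHREADS 8

static const char *g_dir;

/* Each worker creates its slice of the files; the creations overlap
   across threads, so per-file latency is hidden rather than serialized. */
static void *worker(void *arg) {
    long id = (long)arg, created = 0;
    char path[256];
    for (long i = id; i < NFILES; i += NTHREADS) {
        snprintf(path, sizeof path, "%s/f%ld", g_dir, i);
        FILE *f = fopen(path, "w");
        if (f) { fputs("x\n", f); fclose(f); created++; }
    }
    return (void *)created;
}

/* Create NFILES files under dir with NTHREADS overlapped workers;
   returns how many were successfully created. */
long create_files_overlapped(const char *dir) {
    pthread_t t[NTHREADS];
    long total = 0;
    g_dir = dir;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++) {
        void *r;
        pthread_join(t[i], &r);
        total += (long)r;
    }
    return total;
}

/* Make a scratch directory and run the overlapped creation once. */
long demo(void) {
    char tmpl[] = "/tmp/ovlXXXXXX";
    if (mkdtemp(tmpl) == NULL) return -1;
    return create_files_overlapped(tmpl);
}
```

On a filesystem where creation latency dominates (as in the Windows measurements above), this kind of overlap is what turns minutes into seconds; on Linux the same code mostly just saturates the metadata path.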

> Terje
>

Re: Dense machine code from C++ code (compiler optimizations)

<f6e33b74-7e0e-4a06-9282-6aba9bb2d308n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22144&group=comp.arch#22144

Newsgroups: comp.arch
X-Received: by 2002:a05:622a:1828:: with SMTP id t40mr2513847qtc.0.1637717532261;
Tue, 23 Nov 2021 17:32:12 -0800 (PST)
X-Received: by 2002:a9d:749a:: with SMTP id t26mr9436004otk.96.1637717532040;
Tue, 23 Nov 2021 17:32:12 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 23 Nov 2021 17:32:11 -0800 (PST)
In-Reply-To: <snjsvc$48l$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com> <a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me> <bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me> <6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me> <9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
<snjsvc$48l$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f6e33b74-7e0e-4a06-9282-6aba9bb2d308n@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 24 Nov 2021 01:32:12 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 188
 by: MitchAlsup - Wed, 24 Nov 2021 01:32 UTC

On Tuesday, November 23, 2021 at 5:20:15 PM UTC-6, BGB wrote:
> On 11/23/2021 12:44 PM, MitchAlsup wrote:
> > On Monday, November 22, 2021 at 11:21:41 PM UTC-6, BGB wrote:
> >> On 11/22/2021 5:04 PM, MitchAlsup wrote:
> >>> On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
> >>>> On 11/22/2021 12:54 PM, MitchAlsup wrote:
> >>>
> >>>>> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
> >>>>> the other defines the width of the field (0->64). For immediate shift amounts
> >>>>> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
> >>>>> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
> >>>>> (The empty spaces are checked for significance)
> >>>>> <
> >>>>> Then there is the operand-by-operand select (CMOV) and the bit-by bit
> >>>>> select (Multiplex)
> >>>>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
> >>>>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
> >>>> I did go and add a bit-select instruction (BITSEL / MUX).
> >>>>
> >>>> Currently I have:
> >>>> CSELT // Rn = SR.T ? Rm : Ro;
> >>>> PCSELT.L // Packed select, 32-bit words (SR.ST)
> >>>> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
> >>>> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
> >>>>
> >>>> BITSEL didn't add much cost, but did initially result in the core
> >>>> failing timing.
> >>>>
> >>>> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
> >>>> work again.
> >>>>
> >>>>
> >>>> The MOV.C instruction has been demoted some, as:
> >>>> It isn't entirely free;
> >>>> Its advantage in terms of performance is fairly small;
> >>>> It doesn't seem to play well with TLB Miss interrupts.
> >>>>
> >>>> Though, a cheaper option would be "MOV.C that only works with GBR and LR
> >>>> and similar" (similar effect in practice, but avoids most of the cost of
> >>>> the additional signal routing).
> >>>>
> >>>>
> >>>> I am also on the fence and considering disallowing using bit-shift
> >>>> operations in Lane 3, mostly as a possible way to reduce costs by not
> >>>> having a (rarely used) Lane 3 shift unit.
> >>> <
> >>> In general purpose codes, shifts are "not all that present" 2%-5% range (source
> >>> code). In one 6-wide machine, we stuck an integer unit in each of the 6-slots
> >>> but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
> >>> could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
> >>> multiplier (slot[3]).
> >>> <
> >>> Shifters are on the order of integer adders in gate count, and less useful.
> >> All 3 lanes still have other ALU ops, including some type-conversion and
> >> packed-integer SIMD ops.
> >>>>
> >>>> Still on the fence though as it does appear that shift operators in
> >>>> Lane 3 aren't exactly unused either (so the change breaks compatibility
> >>>> with my existing binaries).
> >>> <
> >>> You should see what the cost is if only lanes[0..1] can perform shifts.
> >> Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
> >> It also makes timing a little better.
> >>
> >> I guess the main factor is whether it justifies the break in binary
> >> compatibility (since previously, the WEXifier assumed that running
> >> shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
> >> number of times in pretty much every binary I have looked at).
> >>>>>>
> >>>>>> Could synthesize a mask from a shift and count, harder part is coming up
> >>>>>> with a way to do so cheaply (when the shift unit is already in use this
> >>>>>> cycle).
> >>>>> <
> >>>>> It is a simple decoder........
> >>>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
> >>>> table isn't super cheap.
> >>> <
> >>> input.....|.............................output..................................
> >>> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
> >>> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
> >>> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
> >>> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
> >>> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
> >>> etc.
> >>> It is a straight "Greater Than" decoder.
> >> One needs a greater-than comparison, a less-than comparison, and a way
> >> to combine them with a bit-wise operator (AND or OR), for each bit.
> >>
> >> Say, min, max:
> >> min<=max: Combine with AND
> >> min >max: Combine with OR
> >>
> >> This would allow composing masks with a range of patterns, say:
> >> 00000000000003FF
> >> 7FE0000000000000
> >> 000003FFFFC00000
> > <
> > Up to this point you need a Greater than decoder and a less than decoder
> > and an AND gate.
> > <
> >> 7FFE000000003FFF
> >> ...
> > Why split the field around the container boundary ??
> Masks like this happen sometimes, and supporting both an AND gate and an
> OR gate would not likely cost that much more than the AND gate by itself.
>
>
> Goes and runs some stats:
> Constants that fit these patterns: ~ 30% of the total constants;
> Percentage of masks which end up using jumbo 64 or 96: ~ 6%
> Percentage of total constant loads, masks as jumbo: ~ 2%
> Percentage of total instructions, masks as jumbo: ~ 0.15%
>
> For something the size of Doom, it would save ~ 670 bytes.
>
> Of the total constants (for Doom):
> ~ 75% LDI Imm16, Rn
> ~ 15% LDI Imm8, Rn (16-bit encoding)
> ~ 2% LDIHI Imm10, Rn (Load Imm10 into high-order bits)
> ~ 0.5% FLDCH Imm16, Rn (Load Half-Float as Double, *1)
> ~ 6% Jumbo64 (Imm33s)
> ~ 0.6% Jumbo96 (Imm64)
<
So, translating the above into My 66000
<
90% IMM16
either 8% IMM32 and 2% IMM64
or 6% IMM32 and 4% IMM64
depending on whether the 2% LDHI Imm10 goes to the high-order bits of
32 bits or of 64 bits.
<
Both are within spitting distance of what I see in ASM code.
<
Now, I do have IMM12 which is used strictly for shifts.
>
> *1: This stat is somewhat bigger in Quake, but there are very few FP
> constants in Doom.
>
>
> In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
> encodings (seems to be mostly Binary64 constants which can't be
> expressed exactly as either Binary16 or Binary32).
>
> The ratio of bit-mask patterns to other constants is also a little lower
> in Quake vs Doom.
>
> It looks like, for FP constants (~ 14% of the total):
> ~ 60% can be expressed exactly as Binary16
> ~ 20% can be expressed exactly as Binary32
> ~ 20% fall through to Binary64.
<
This corroborates the VAX-11/780 FP constants.
>
> May need to look at it; a larger percentage should theoretically fit the
> Binary32 pattern, given that Quake primarily uses "float" operations.
>
> Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
> which will forcibly round the result. The number of seemingly
> double-precision constants seems mysterious.
<
Have you "too easily" converted FP constants to 64-bits when they could have
been smaller?
>
> ...
> >>
> >>
> >> Granted, this isn't all that steep (vs the core as a whole), but I guess
> >> it is more a question of if it would be useful enough to make it worthwhile.
> >>
> >>
> >> Though, would need to find an "odd corner" to shove it into.
> >> Otherwise, BJX2 lacks any encodings for "OP Imm12, Rn".
> >>
> >> Have spots for Imm10 (insufficient), and Imm16 (but no, not going to
> >> spend an Imm16 spot on this; there are only a few of these left).
> >>
> >> Have a few possible ideas (involving XGPR encodings, such as reclaiming
> >> the encoding space for branches for Imm12 ops because XGPR+BRA does not
> >> make sense), but they seem like ugly hacks.
> >>
> >> Similarly, outside of being able to fit in a 32-bit encoding, it loses
> >> any real advantage it might have had. Jumbo encodings can load pretty
> >> much any possible constant.
> >>
> >>
> >>>>
> >>>


(article truncated)
Re: Dense machine code from C++ code (compiler optimizations)

<snkcp3$lmc$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22145&group=comp.arch#22145

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Tue, 23 Nov 2021 21:49:49 -0600
Organization: A noiseless patient Spider
Lines: 286
Message-ID: <snkcp3$lmc$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
<snjsvc$48l$1@dont-email.me>
<f6e33b74-7e0e-4a06-9282-6aba9bb2d308n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Nov 2021 03:49:55 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6d41b7aef672a37ebb8e3597c0f9a646";
logging-data="22220"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/VW6AbRHk0FwCOLNSFgbyF"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Cancel-Lock: sha1:gnAWlDeHTcd6eGuZym3gHJTX+xs=
In-Reply-To: <f6e33b74-7e0e-4a06-9282-6aba9bb2d308n@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 24 Nov 2021 03:49 UTC

On 11/23/2021 7:32 PM, MitchAlsup wrote:
> On Tuesday, November 23, 2021 at 5:20:15 PM UTC-6, BGB wrote:
>> On 11/23/2021 12:44 PM, MitchAlsup wrote:
>>> On Monday, November 22, 2021 at 11:21:41 PM UTC-6, BGB wrote:
>>>> On 11/22/2021 5:04 PM, MitchAlsup wrote:
>>>>> On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
>>>>>> On 11/22/2021 12:54 PM, MitchAlsup wrote:
>>>>>
>>>>>>> In My 66000 ISA there are two 6-bit field codes, one defines the shift amount
>>>>>>> the other defines the width of the field (0->64). For immediate shift amounts
>>>>>>> there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
>>>>>>> shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
>>>>>>> (The empty spaces are checked for significance)
>>>>>>> <
>>>>>>> Then there is the operand-by-operand select (CMOV) and the bit-by bit
>>>>>>> select (Multiplex)
>>>>>>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
>>>>>>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>>>>>> I did go and add a bit-select instruction (BITSEL / MUX).
>>>>>>
>>>>>> Currently I have:
>>>>>> CSELT // Rn = SR.T ? Rm : Ro;
>>>>>> PCSELT.L // Packed select, 32-bit words (SR.ST)
>>>>>> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
>>>>>> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
>>>>>>
>>>>>> BITSEL didn't add much cost, but did initially result in the core
>>>>>> failing timing.
>>>>>>
>>>>>> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made timing
>>>>>> work again.
>>>>>>
>>>>>>
>>>>>> The MOV.C instruction has been demoted some, as:
>>>>>> It isn't entirely free;
>>>>>> Its advantage in terms of performance is fairly small;
>>>>>> It doesn't seem to play well with TLB Miss interrupts.
>>>>>>
>>>>>> Though, a cheaper option would be "MOV.C that only works with GBR and LR
>>>>>> and similar" (similar effect in practice, but avoids most of the cost of
>>>>>> the additional signal routing).
>>>>>>
>>>>>>
>>>>>> I am also on the fence and considering disallowing using bit-shift
>>>>>> operations in Lane 3, mostly as a possible way to reduce costs by not
>>>>>> having a (rarely used) Lane 3 shift unit.
>>>>> <
>>>>> In general purpose codes, shifts are "not all that present" 2%-5% range (source
>>>>> code). In one 6-wide machine, we stuck an integer unit in each of the 6-slots
>>>>> but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
>>>>> could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
>>>>> multiplier (slot[3]).
>>>>> <
>>>>> Shifters are on the order of integer adders in gate count, and less useful.
>>>> All 3 lanes still have other ALU ops, including some type-conversion and
>>>> packed-integer SIMD ops.
>>>>>>
>>>>>> Still on the fence though as it does appear that shift operators in
>>>>>> Lane 3 aren't exactly unused either (so the change breaks compatibility
>>>>>> with my existing binaries).
>>>>> <
>>>>> You should see what the cost is if only lanes[0..1] can perform shifts.
>>>> Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
>>>> It also makes timing a little better.
>>>>
>>>> I guess the main factor is whether it justifies the break in binary
>>>> compatibility (since previously, the WEXifier assumed that running
>>>> shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
>>>> number of times in pretty much every binary I have looked at).
>>>>>>>>
>>>>>>>> Could synthesize a mask from a shift and count, harder part is coming up
>>>>>>>> with a way to do so cheaply (when the shift unit is already in use this
>>>>>>>> cycle).
>>>>>>> <
>>>>>>> It is a simple decoder........
>>>>>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
>>>>>> table isn't super cheap.
>>>>> <
>>>>> input.....|.............................output..................................
>>>>> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
>>>>> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
>>>>> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
>>>>> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
>>>>> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
>>>>> etc.
>>>>> It is a straight "Greater Than" decoder.
>>>> One needs a greater-than comparison, a less-than comparison, and a way
>>>> to combine them with a bit-wise operator (AND or OR), for each bit.
>>>>
>>>> Say, min, max:
>>>> min<=max: Combine with AND
>>>> min >max: Combine with OR
>>>>
>>>> This would allow composing masks with a range of patterns, say:
>>>> 00000000000003FF
>>>> 7FE0000000000000
>>>> 000003FFFFC00000
>>> <
>>> Up to this point you need a Greater than decoder and a less than decoder
>>> and an AND gate.
>>> <
>>>> 7FFE000000003FFF
>>>> ...
>>> Why split the field around the container boundary ??
>> Masks like this happen sometimes, and supporting both an AND gate and an
>> OR gate would not likely cost that much more than the AND gate by itself.
>>
>>
>> Goes and runs some stats:
>> Constants that fit these patterns: ~ 30% of the total constants;
>> Percentage of masks which end up using jumbo 64 or 96: ~ 6%
>> Percentage of total constant loads, masks as jumbo: ~ 2%
>> Percentage of total instructions, masks as jumbo: ~ 0.15%
>>
>> For something the size of Doom, it would save ~ 670 bytes.
>>
>> Of the total constants (for Doom):
>> ~ 75% LDI Imm16, Rn
>> ~ 15% LDI Imm8, Rn (16-bit encoding)
>> ~ 2% LDIHI Imm10, Rn (Load Imm10 into high-order bits)
>> ~ 0.5% FLDCH Imm16, Rn (Load Half-Float as Double, *1)
>> ~ 6% Jumbo64 (Imm33s)
>> ~ 0.6% Jumbo96 (Imm64)
> <
> So, translating the above into My 66000
> <
> 90% IMM16
> either 8% IMM32 and 2% IMM64
> or 6% IMM32 and 4% IMM64
> depending on whether the 2% LDHI Imm10 goes to the high-order bits of
> 32 bits or of 64 bits.

The LDHI instruction can encode either case (63:54) or (31:22) depending
on the E.Q bit, but my stat counts don't distinguish between these cases.
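A sketch of that placement rule (function name hypothetical; this assumes, per the description, that a set E.Q bit selects the 64-bit high-order position):

```c
#include <stdint.h>

/* Hypothetical model of LDHI: place a 10-bit immediate into the
   high-order bits of either a 64-bit or a 32-bit container. */
uint64_t ldhi(unsigned imm10, int eq) {
    imm10 &= 0x3FF;                    /* 10-bit immediate  */
    return eq ? (uint64_t)imm10 << 54  /* into bits (63:54) */
              : (uint64_t)imm10 << 22; /* into bits (31:22) */
}
```

So ldhi(0x3FF, 1) yields 0xFFC0000000000000, while ldhi(0x3FF, 0) yields 0x00000000FFC00000.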

The ratios were fairly similar between the programs I looked at.

Main difference being that Doom and ROTT have significantly fewer
floating point constants than Quake, but this shouldn't really come as
any big surprise.

> <
> Both are within spitting distance of what I see in ASM code.
> <
> Now, I do have IMM12 which is used strictly for shifts.

I don't have an Imm12 case, and my shift instructions generally use
Imm8. This is treated as an 8-bit sign-extended quantity (positive
encodes left shift, negative encodes right shift).

The exact interpretation depends on the type size:
32 bits(SHAD /SHLD ): Treated as Mod-32
64 bits(SHADQ/SHLDQ): Treated as Mod-64
128 bits(SHADX/SHLDX): Valid range is ± 127.

There are also some SHLR/SHAR/... variants which encode a right-shift by
internally flipping the shift direction (register only). This was mostly
to avoid needing to negate the input to encode variable right shifts.
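The signed-amount convention can be sketched for the 64-bit logical case (a model only, assuming the magnitude is reduced mod 64 as described; the arithmetic and 32-bit variants follow the same pattern with their own widths):

```c
#include <stdint.h>

/* Hypothetical model of SHLDQ: an 8-bit signed shift amount where a
   positive value shifts left and a negative value shifts logically
   right, the magnitude taken mod 64. */
uint64_t shldq(uint64_t x, int8_t amt) {
    if (amt >= 0)
        return x << (amt & 63);   /* positive: left shift  */
    return x >> ((-amt) & 63);    /* negative: right shift */
}
```

For example, shldq(1, 4) gives 16 and shldq(0x100, -4) gives 0x10, so one opcode covers both directions without a separate right-shift encoding.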

>>
>> *1: This stat is somewhat bigger in Quake, but there are very few FP
>> constants in Doom.
>>
>>
>> In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
>> encodings (seems to be mostly Binary64 constants which can't be
>> expressed exactly as either Binary16 or Binary32).
>>
>> The ratio of bit-mask patterns to other constants is also a little lower
>> in Quake vs Doom.
>>
>> It looks like, for FP constants (~ 14% of the total):
>> ~ 60% can be expressed exactly as Binary16
>> ~ 20% can be expressed exactly as Binary32
>> ~ 20% fall through to Binary64.
> <
> This corroborates the VAX-11/780 FP constants.


Re: Dense machine code from C++ code (compiler optimizations)

<snko83$1g3u$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=22146&group=comp.arch#22146
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Wed, 24 Nov 2021 08:05:38 +0100
Organization: Aioe.org NNTP Server
 by: Terje Mathisen - Wed, 24 Nov 2021 07:05 UTC

MitchAlsup wrote:
> On Tuesday, November 23, 2021 at 2:21:37 PM UTC-6, Terje Mathisen wrote:
>> It turns out that a group of people, including Robert Collins, a
>> co-worker (at Cognite), has done a lot of work on speeding up Rust
>> installations on both unix/linux and Microsoft platforms. They found
>> that Microsoft can in fact reach parity with Linux, but with serious effort:
>>
>> https://www.youtube.com/watch?v=qbKGw8MQ0i8
>>
>> It turns out that the actual throughput is approximately the same for
>> Windows/NTFS and Linux, but as you noted Defender & co makes the latency
>> orders of magnitude worse. By overlapping pretty much all the file
>> metadata work (mainly file creations) those latency bubbles can in fact

Also file close, i.e. anything that can trigger the virus scanner(s) to
take a look before releasing/completing the IO.

>> be filled in. Afair they reduced the installation time from several
>> minutes to ~10 seconds.
> <
> Now if MS would simply allow one to DD <current image> -> <new image>
> so that recovery was 6 minutes instead of 6 hours we might be getting
> somewhere...

That _has_ to be possible, since hibernation needs exactly that kind of
infrastructure, right?

For some probably good reasons, they don't want to expose the kernel
interfaces which allow this to work?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
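The latency-hiding idea summarized above can be approximated in portable code: issue many independent create/write/close operations from worker threads so that per-file stalls (e.g. a scanner hooking file close) overlap rather than serialize. This is only an illustrative sketch; the actual rustup work described in the talk uses platform-specific asynchronous APIs.

```c
#include <pthread.h>
#include <stdio.h>

/* Worker: create, write, close, and remove one scratch file.
   The close is where a virus scanner would typically add latency. */
static void *create_one(void *arg) {
    char name[64];
    snprintf(name, sizeof name, "tmp_overlap_%d.txt", (int)(long)arg);
    FILE *f = fopen(name, "w");          /* metadata creation */
    if (!f) return (void *)1;
    fputs("x\n", f);
    fclose(f);                           /* potential scanner hook point */
    remove(name);                        /* clean up the demo file */
    return (void *)0;
}

/* Create n files concurrently so their latencies overlap; returns the
   number of files successfully created (capped at 64 for the demo). */
static int create_files_overlapped(int n) {
    pthread_t th[64];
    int ok = 0;
    if (n > 64) n = 64;
    for (int i = 0; i < n; i++)
        pthread_create(&th[i], NULL, create_one, (void *)(long)i);
    for (int i = 0; i < n; i++) {
        void *r;
        pthread_join(th[i], &r);
        if (r == (void *)0) ok++;
    }
    return ok;
}
```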

Re: Dense machine code from C++ code (compiler optimizations)

<snlevu$sm9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22148&group=comp.arch#22148
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Wed, 24 Nov 2021 14:33:50 +0100
Organization: A noiseless patient Spider
 by: Marcus - Wed, 24 Nov 2021 13:33 UTC

On 2021-11-23 21:21, Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-11-23 15:24, David Brown wrote:
>>> On 22/11/2021 21:05, BGB wrote:
>>>> On 11/22/2021 11:50 AM, Marcus wrote:
>>>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>>>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>>>>>
>>>>>>> A complete rebuild of binutils +
>>>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>>>
>>>>>> That is blindingly fast, it usually takes about half to 3/4 of an
>>>>>> hour with recent versions.  Which version is your gcc port
>>>>>> based on?
>>>>>>
>>>>>
>>>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>>>
>>>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU,
>>>>> with
>>>>> an NVMe drive (~3 GB/s read).
>>>>>
>>>>
>>>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core,
>>>> 16-thread,
>>>> 3.7 GHz).
>>>>    OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>>>>    Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD
>>>> Red).
>>>>
>>>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
>>>> the SSD is plugged in via SATA.
>>>>
>>>> RAM Stats:
>>>>    48GB of RAM, 1467MHz (DDR4-2933)
>>>>    192GB of Pagefile space.
>>>>
>>>> HDD speed is pretty variable, but generally falls into the range of
>>>> 20-80MB/s (except for lots of small files, where it might drop down
>>>> to ~
>>>> 2MB/s or so).
>>>>
>>>>
>>>> Also kinda funny is that "file and folder compression" tends to
>>>> actually
>>>> make directories full of source code or other small files somewhat
>>>> faster (so I tend to leave it on by default for drives which I
>>>> primarily
>>>> use for projects).
>>>>
>>>> Otherwise, it is like the whole Windows Filesystem is built around the
>>>> assumption that one is primarily working with small numbers of large
>>>> files, rather than large numbers of small files.
>>>>
>>>>
>>>>> I build with "make -j28" (although the GNU toolchain build system is
>>>>> poorly parallelizable, so most of the cores are idle most of the
>>>>> time).
>>>>>
>>>>
>>>> Trying to parallel make eats up all of the RAM and swap space, so I
>>>> don't generally do so. This issue is especially bad with LLVM though.
>>>>
>>>>
>>>>> BTW, my build script is here [1].
>>>>>
>>>>> You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
>>>>> always slower than Linux (esp. when it comes to build systems).
>>>>>
>>>>
>>>> I suspect it's because GCC and similar do excessive amounts of file opening
>>>> and spawning huge numbers of short-lived processes. These operations
>>>> are
>>>> well into the millisecond range.
>>>>
>>>
>>> Large gcc builds can easily be two or three times faster on Linux than
>>> on Windows, in my experience.  gcc comes from a *nix heritage, where
>>> spawning new processes is cheap and fast so a single run of "gcc" may
>>> involve starting dozens of processes connected via pipes or temporary
>>> files.  On Windows, starting a new process is a slow matter, and using
>>> temporary files often means the file has to actually be written out to
>>> disk - you don't have the same tricks for telling the OS that the file
>>> does not have to be created on disk unless you run out of spare memory.
>>>   So a Windows-friendly compiler will be a monolithic program that does
>>> everything, while gcc is much more efficient on a *nix system.  In
>>> addition, handling lots of small files is vastly faster on *nix than
>>> Windows.  (Though I believe Win10 was an improvement here compared to
>>> Win7.)
>>>
>>
>> Possibly (Win10 vs Win7), but it's still not even in the same ballpark
>> as Linux:
>>
>> https://www.bitsnbites.eu/benchmarking-os-primitives/
>>
>> File creation is ~60x slower on Windows 10 compared to Linux (highly
>> dependent on things like Defender and indexing services though), and
>> launching processes is ~20x slower (in the tests in that article).
>
> It turns out that a group of people, including Robert Collins, a
> co-worker (at Cognite), has done a lot of work on speeding up Rust
> installations on both unix/linux and Microsoft platforms. They found
> that Microsoft can in fact reach parity with Linux, but with serious
> effort:
>
> https://www.youtube.com/watch?v=qbKGw8MQ0i8
>
> It turns out that the actual throughput is approximately the same for
> Windows/NTFS and Linux, but as you noted Defender & co makes the latency
> orders of magnitude worse. By overlapping pretty much all the file
> metadata work (mainly file creations) those latency bubbles can in fact
> be filled in. Afair they reduced the installation time from several
> minutes to ~10 seconds.
>

Thanks! That's a very interesting talk (I'm halfway through and got to
the non-blocking CloseHandle() part). I will most likely investigate
some of those tricks for BuildCache on Windows.

/Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<snnfqj$vvu$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22149&group=comp.arch#22149
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Thu, 25 Nov 2021 02:00:12 -0600
Organization: A noiseless patient Spider
 by: BGB - Thu, 25 Nov 2021 08:00 UTC

On 11/23/2021 9:49 PM, BGB wrote:
> On 11/23/2021 7:32 PM, MitchAlsup wrote:
>> On Tuesday, November 23, 2021 at 5:20:15 PM UTC-6, BGB wrote:
>>> On 11/23/2021 12:44 PM, MitchAlsup wrote:
>>>> On Monday, November 22, 2021 at 11:21:41 PM UTC-6, BGB wrote:
>>>>> On 11/22/2021 5:04 PM, MitchAlsup wrote:
>>>>>> On Monday, November 22, 2021 at 4:43:48 PM UTC-6, BGB wrote:
>>>>>>> On 11/22/2021 12:54 PM, MitchAlsup wrote:
>>>>>>
>>>>>>>> In My 66000 ISA there are two 6-bit field codes, one defines the
>>>>>>>> shift amount
>>>>>>>> the other defines the width of the field (0->64). For immediate
>>>>>>>> shift amounts
>>>>>>>> there is a 12-bit immediate that supplies the pair of 6-bit
>>>>>>>> specifiers; for register
>>>>>>>> shift amounts R<5:0> is the shift amount while R<37:32> is the
>>>>>>>> field width.
>>>>>>>> (The empty spaces are checked for significance)
>>>>>>>> <
>>>>>>>> Then there is the operand-by-operand select (CMOV) and the
>>>>>>>> bit-by bit
>>>>>>>> select (Multiplex)
>>>>>>>> CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
>>>>>>>> MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>>>>>>> I did go and add a bit-select instruction (BITSEL / MUX).
>>>>>>>
>>>>>>> Currently I have:
>>>>>>> CSELT // Rn = SR.T ? Rm : Ro;
>>>>>>> PCSELT.L // Packed select, 32-bit words (SR.ST)
>>>>>>> PCSELT.W // Packed select, 16-bit words (SR.PQRO)
>>>>>>> BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);
>>>>>>>
>>>>>>> BITSEL didn't add much cost, but did initially result in the core
>>>>>>> failing timing.
>>>>>>>
>>>>>>> Though disabling the "MOV.C" instruction saves ~ 2k LUT and made
>>>>>>> timing
>>>>>>> work again.
>>>>>>>
>>>>>>>
>>>>>>> The MOV.C instruction has been demoted some, as:
>>>>>>> It isn't entirely free;
>>>>>>> Its advantage in terms of performance is fairly small;
>>>>>>> It doesn't seem to play well with TLB Miss interrupts.
>>>>>>>
>>>>>>> Though, a cheaper option would be "MOV.C that only works with GBR
>>>>>>> and LR
>>>>>>> and similar" (similar effect in practice, but avoids most of the
>>>>>>> cost of
>>>>>>> the additional signal routing).
>>>>>>>
>>>>>>>
>>>>>>> I am also on the fence and considering disallowing using bit-shift
>>>>>>> operations in Lane 3, mostly as a possible way to reduce costs by
>>>>>>> not
>>>>>>> having a (rarely used) Lane 3 shift unit.
>>>>>> <
>>>>>> In general purpose codes, shifts are "not all that present" 2%-5%
>>>>>> range (source
>>>>>> code). In one 6-wide machine, we stuck an integer unit in each of
>>>>>> the 6-slots
>>>>>> but we borrowed the shifters in the LD-Align stage for shifts--so
>>>>>> only 3 slots
>>>>>> could perform shifts, while all 6 could do +-&|^ . MUL and DIV
>>>>>> were done in the
>>>>>> multiplier (slot[3]).
>>>>>> <
>>>>>> Shifters are on the order of integer adders in gate-count expense,
>>>>>> and less useful.
>>>>> All 3 lanes still have other ALU ops, including some
>>>>> type-conversion and
>>>>> packed-integer SIMD ops.
>>>>>>>
>>>>>>> Still on the fence though as it does appear that shifts operators in
>>>>>>> Lane 3 aren't exactly unused either (so the change breaks
>>>>>>> compatibility
>>>>>>> with my existing binaries).
>>>>>> <
>>>>>> You should see what the cost is if only lanes[0..1] can perform
>>>>>> shifts.
>>>>> Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
>>>>> It also makes timing a little better.
>>>>>
>>>>> I guess the main factor is whether it justifies the break in binary
>>>>> compatibility (since previously, the WEXifier assumed that running
>>>>> shifts in Lane 3 was allowed; and this seems to have occurred a
>>>>> non-zero
>>>>> number of times in pretty much every binary I have looked at).
>>>>>>>>>
>>>>>>>>> Could synthesize a mask from a shift and count, harder part is
>>>>>>>>> coming up
>>>>>>>>> with a way to do so cheaply (when the shift unit is already in
>>>>>>>>> use this
>>>>>>>>> cycle).
>>>>>>>> <
>>>>>>>> It is a simple decoder........
>>>>>>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit
>>>>>>> lookup
>>>>>>> table isn't super cheap.
>>>>>> <
>>>>>> input  | output
>>>>>> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
>>>>>> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
>>>>>> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
>>>>>> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
>>>>>> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
>>>>>> etc.
>>>>>> It is a straight "Greater Than" decoder.
>>>>> One needs a greater-than comparison, a less-than comparison, and a way
>>>>> to combine them with a bit-wise operator (AND or OR), for each bit.
>>>>>
>>>>> Say, min, max:
>>>>> min<=max: Combine with AND
>>>>> min >max: Combine with OR
>>>>>
>>>>> This would allow composing masks with a range of patterns, say:
>>>>> 00000000000003FF
>>>>> 7FE0000000000000
>>>>> 000003FFFFC00000
>>>> <
>>>> Up to this point you need a Greater than decoder and a less than
>>>> decoder
>>>> and an AND gate.
>>>> <
>>>>> 7FFE000000003FFF
>>>>> ...
>>>> Why split the field around the container boundary ??
>>> Masks like this happen sometimes, and supporting both an AND gate and an
>>> OR gate would not likely cost that much more than the AND gate by
>>> itself.
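A software model of the min/max mask scheme being discussed (a hypothetical function, not the actual decoder logic): each bit position compares its index against both 6-bit fields, and the two compare results combine with AND when min <= max (a contiguous run of ones) or with OR when min > max (a run wrapping the word boundary).

```c
#include <stdint.h>

/* Bit i of the result is set when (i >= min) AND (i <= max) for
   min <= max, or (i >= min) OR (i <= max) for min > max. */
static uint64_t range_mask(unsigned min, unsigned max) {
    uint64_t m = 0;
    for (unsigned i = 0; i < 64; i++) {
        int ge = (i >= min), le = (i <= max);
        int bit = (min <= max) ? (ge && le) : (ge || le);
        m |= (uint64_t)bit << i;
    }
    return m;
}
```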
>>>
>>>
>>> Goes and runs some stats:
>>> Constants that fit these patterns: ~ 30% of the total constants;
>>> Percentage of masks which end up using jumbo 64 or 96: ~ 6%
>>> Percentage of total constant loads, masks as jumbo: ~ 2%
>>> Percentage of total instructions, masks as jumbo: ~ 0.15%
>>>
>>> For something the size of Doom, it would save ~ 670 bytes.
>>>
>>> Of the total constants (for Doom):
>>> ~ 75% LDI Imm16, Rn
>>> ~ 15% LDI Imm8, Rn (16-bit encoding)
>>> ~ 2% LDIHI Imm10, Rn (Load Imm10 into high-order bits)
>>> ~ 0.5% FLDCH Imm16, Rn (Load Half-Float as Double, *1)
>>> ~ 6% Jumbo64 (Imm33s)
>>> ~ 0.6% Jumbo96 (Imm64)
>> <
>> So, translating the above into My 66000
>> <
>> 90% IMM16
>> either
>> 8% IMM32 and 2% IMM64
>> or
>> 6% IMM32 and 4% IMM64
>> depending
>> on whether the 2% LDHI Imm10 goes to the HoBs of 32-bits or of 64-bits.
>
> The LDHI instruction can encode either case (63:54) or (31:22) depending
> on the E.Q bit, but my stat counts don't distinguish between these cases.
>
> The ratios were fairly similar between the programs I looked at.
>
> Main difference being that Doom and ROTT have significantly fewer
> floating point constants than Quake, but this shouldn't really come as
> any big surprise.
>
>
>> <
>> Both are within spitting distance of what I see in ASM code.
>> <
>> Now, I do have IMM12 which is used strictly for shifts.
>
> I don't have an Imm12 case, and my shift instructions generally use
> Imm8. This is treated as an 8-bit sign-extended quantity (positive
> encodes left shift, negative encodes right shift).
>
> The exact interpretation depends on the type size:
>    32 bits(SHAD /SHLD ): Treated as Mod-32
>    64 bits(SHADQ/SHLDQ): Treated as Mod-64
>   128 bits(SHADX/SHLDX): Valid range is ± 127.
>
> There are also some SHLR/SHAR/... variants which encode a right-shift by
> internally flipping the shift direction (register only). This was mostly
> to avoid needing to negate the input to encode variable right shifts.
>
>
>>>
>>> *1: This stat is somewhat bigger in Quake, but there are very few FP
>>> constants in Doom.
>>>
>>>
>>> In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
>>> encodings (seems to be mostly Binary64 constants which can't be
>>> expressed exactly as either Binary16 or Binary32).
>>>
>>> The ratio of bit-mask patterns to other constants is also a little lower
>>> in Quake vs Doom.
>>>
>>> It looks like, for FP constants (~ 14% of the total):
>>> ~ 60% can be expressed exactly as Binary16
>>> ~ 20% can be expressed exactly as Binary32
>>> ~ 20% fall through to Binary64.
>> <
>> This corroborates the VAX-11/780 FP constants.
>
> Could be. I had noted previously that a fair number of "typical"
> constants in C code could be expressed directly as Binary16.
>
> My ISA already had a Binary16 to Binary64 converter instruction, so I
> added an encoding to feed it a constant.
>
>
>>>
>>> May need to look at it; a larger percentage should theoretically fit the
>>> Binary32 pattern, given Quake is primarily using "float" operations.
>>>
>>> Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
>>> which will forcibly round the result. The number of seemingly
>>> double-precision constants seems mysterious.
>> <
>> Have you "too easily" converted FP constants to 64-bits when they
>> could have
>> been smaller ?
>
> Within BGBCC, there are a few ways to store floating-point literals:
> Float Literal, which stores it inline as a Binary32 value;
> Double Literal, which stores it as an index into a "literal value table"
> (same table is also used for Long constants, and pairs for Int128
> constants).
>
> When an expression is type Float, it coerces the immediate to use the
> Float Literal format. Expressions in C tend to not use any suffixes on
> floating-point constants, so the idea is the compiler will instead use
> the type from the non-constant operand for the type of the expression in
> this case.
>
>
> Within registers, scalar values are always operated on as Binary64
> (regardless of the type of the variable).
>
> The constant-load is thus given a value in Binary64 format, but the
> compiler may use a more compact format for storage (generally by trying
> to convert the value to the smaller format and then convert it back and
> seeing if the values match).
>
> If the value had been converted to Binary32 previously, then this
> round-trip check should pass.
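The round-trip check described above is straightforward to express in C (an illustrative helper, not BGBCC's actual code): narrowing to Binary32 and widening back must reproduce the Binary64 value exactly (NaNs would need separate handling):

```c
/* A Binary64 constant can use the compact Binary32 encoding only when
   converting down and back up is exact. */
static int fits_binary32(double v) {
    float f = (float)v;        /* narrow to Binary32 (may round) */
    return (double)f == v;     /* exact round trip => safe to compact */
}
```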
>
>
> As is, there are several ways to encode a floating-point literal (scalar):
>   Binary16, "FLDCH Imm16, Rn" (Convert Binary16 Immed to Binary64)
>   Binary32, "FLDCF Imm32, Rn" (Convert Binary32 Immed to Binary64)
>   Truncated Binary64, "LDIQHI Imm32, Rn"
>   Raw Binary64, "LDI Imm64, Rn"
>
> The issue is mostly that it appears to be using a full 64-bit load more
> often than would be expected here.
>

