Rocksolid Light





Subject -- Author
* Dense machine code from C++ code (compiler optimizations) -- Marcus
+* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|`- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
+* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|+* Re: Dense machine code from C++ code (compiler optimizations) -- robf...@gmail.com
||`* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
|| `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||  +- Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
||  `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||   +* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
||   |`- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||   `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||    `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||     `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||      +* Re: Dense machine code from C++ code (compiler optimizations) -- robf...@gmail.com
||      |+- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||      |`* Re: Thor (was: Dense machine code...) -- Marcus
||      | `* Re: Thor (was: Dense machine code...) -- robf...@gmail.com
||      |  +- Re: Thor -- EricP
||      |  `* Re: Thor (was: Dense machine code...) -- Marcus
||      |   `- Re: Thor (was: Dense machine code...) -- robf...@gmail.com
||      `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||       `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||        `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
||         `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||          `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
||           `* Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            +* Re: Testing with open source games (was Dense machine code ...) -- Terje Mathisen
||            |`* Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            | +- Re: Testing with open source games (was Dense machine code ...) -- Terje Mathisen
||            | `* Re: Testing with open source games (was Dense machine code ...) -- James Van Buskirk
||            |  `- Re: Testing with open source games (was Dense machine code ...) -- Marcus
||            `- Re: Testing with open source games (was Dense machine code ...) -- BGB
|`* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
| +* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
| |+- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
| |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| | `* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
| |  +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
| |  `- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| +* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
| |`- Re: Dense machine code from C++ code (compiler optimizations) -- Paul A. Clayton
| `* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|  `* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|   +* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|   |`* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|   | `- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
|   `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|    +* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|    |`- Re: Dense machine code from C++ code (compiler optimizations) -- George Neuner
|    `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
|     `* Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
|      `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|       +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
|       |`- Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
|       +- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
|       `- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
`* Re: Dense machine code from C++ code (compiler optimizations) -- Ir. Hj. Othman bin Hj. Ahmad
 +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
 `* Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
  +* Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |`* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  | `* Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |  +* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |  |`- Re: Dense machine code from C++ code (compiler optimizations) -- chris
  |  `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
  |   `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |    +- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |    `* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
  |     `- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  +* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  | `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |  `* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  |   +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |   |`* Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  |   | `- Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
  |   `* Re: Dense machine code from C++ code (compiler optimizations) -- David Brown
  |    `- Re: Dense machine code from C++ code (compiler optimizations) -- BGB
  `* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |`- Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   +* Re: Dense machine code from C++ code (compiler optimizations) -- James Van Buskirk
   |`* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   | +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |+- Re: Dense machine code from C++ code (compiler optimizations) -- Marcus
   | |`* Re: Dense machine code from C++ code (compiler optimizations) -- Terje Mathisen
   | | `* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   | |  `* Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | |   `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |    `* Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | |     `* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   | |      `- Re: Dense machine code from C++ code (compiler optimizations) -- EricP
   | `* Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  +* Re: Dense machine code from C++ code (compiler optimizations) -- Stephen Fuld
   |  |+* Re: Dense machine code from C++ code (compiler optimizations) -- Guillaume
   |  ||+* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |  |||+- Re: Dense machine code from C++ code (compiler optimizations) -- Thomas Koenig
   |  |||`* Re: Dense machine code from C++ code (compiler optimizations) -- Guillaume
   |  ||| +* Re: Dense machine code from C++ code (compiler optimizations) -- MitchAlsup
   |  ||| |`* Re: Dense machine code from C++ code (compiler optimizations) -- Andreas Eder
   |  ||| `- Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  ||`- Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  |`* Re: Dense machine code from C++ code (compiler optimizations) -- Tim Rentsch
   |  `* Re: Dense machine code from C++ code (compiler optimizations) -- Ivan Godard
   `- Re: Dense machine code from C++ code (compiler optimizations) -- Andreas Eder

Re: Testing with open source games (was Dense machine code ...)

<snni99$f2n$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22150&group=comp.arch#22150

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Testing with open source games (was Dense machine code ...)
Date: Thu, 25 Nov 2021 09:42:16 +0100
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <snni99$f2n$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me> <snegcq$n03$1@dont-email.me>
<06b3cd2c-b51b-44f8-a050-b441a67458abn@googlegroups.com>
<a40c3b5f-8118-46a5-9072-c8725156ef6dn@googlegroups.com>
<snf8ui$p8n$1@dont-email.me>
<bebbe060-cfc2-4be9-b36d-450c9017f2cdn@googlegroups.com>
<snh6f2$m5a$1@dont-email.me>
<6eee4227-ec6e-40a1-831f-08dd2e3fc240n@googlegroups.com>
<snhtp1$nkn$1@dont-email.me>
<9cbe608f-f62a-4f2b-86da-6364e0760d45n@googlegroups.com>
<snjsvc$48l$1@dont-email.me>
<f6e33b74-7e0e-4a06-9282-6aba9bb2d308n@googlegroups.com>
<snkcp3$lmc$1@dont-email.me> <snnfqj$vvu$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 25 Nov 2021 08:42:17 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7cb3e46b8e9272dff824e577fed6af2b";
logging-data="15447"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19KOZb/1uxj5TVd+WCQ0SDBTTQLb33rpmE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:jVewOQOYjf5WU/c5/mx19pViJlk=
In-Reply-To: <snnfqj$vvu$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Thu, 25 Nov 2021 08:42 UTC

On 2021-11-25 09:00, BGB wrote:

[snip]

>
> Status update:
> Fixing this issue, and also discovering and fixing another issue which
> was resulting in incorrect type conversions (it was coercing constant
> values to a smaller type in some cases where it should have
> type-promoted instead), fixed some of the bugs I was encountering in Quake.
>
> There are still a lot of cases where expressions are promoting to
> 'double' only to then be forced back to 'float'. Not really a quick/easy
> fix though (and in these cases, the promotions are following C's
> type-promotion rules; it is more a case of it being slightly less
> efficient than it could be).
>
>
> Still no effect on the results of the E3M5 desync though.
>
> Did start wondering though if there is a way to make the other demos not
> desync, but looking into it, it appears every version of the Doom engine
> had slightly different behavior, and it appears some of the
> newer/fancier multi-game engines try to infer the Doom engine version
> from the IWAD and change around a bunch of settings to try to match the
> behavior of the particular engine for the particular IWAD (eg, minor
> changes from one version of the Doom engine breaking demos originally
> recorded on another version of the Doom engine).
>
> Or, otherwise, it is likely getting "correct" demo playback from an
> ad-hoc port of the Linuxdoom source might be asking a bit much...
>
>
> Heretic and Hexen still behave as before.
>
> Though, as noted:
>   Heretic seems to remain in-sync for the Shareware IWAD;
>   It desyncs for two of the demos in the release IWAD.
>
> This is the same between builds, though there does appear to be a
> difference in that in one of the demos, in some cases an enemy drops a
> "morph ovum" and in other cases they do not (implying there is some
> non-deterministic property somewhere).
>
>
> Previously, in a case where demo playback was diverging between targets
> in Doom, it was due to an out-of-bounds memory access and an apparent
> divergence in the contents of the out-of-bound memory contents between
> targets (possibly a pointer within the Z_Malloc headers or similar).
>
> It is possible I could be looking at something similar in this case,
> rather than necessarily a difference in compiler behavior.
>
>
> No visible change in ROTT behavior, though it is nowhere near close to
> matching the behavior of the x86 build (in that an x86 build with MSVC
> does actually manage to play the demos correctly, but builds for other
> targets are prone to desync).
>
> Though, if anything, a lot of this is probably evidence for why it is
> probably better, if implementing a game engine, to implement demos in
> terms of recording game events and similar rather than by recording user
> keypresses.
>
> ...

You seem to be using several of the old C code classics (Doom, Quake,
ROTT, Heretic, ...?). Have you seen
https://github.com/videogamepreservation ? A gold mine of old games (it
even has both C and Fortran versions of Zork!).

I wonder, which code bases have been most useful to work with?

I have ported Doom and Quake (obvious candidates, and I have ported them
to other architectures before so it was a fairly low barrier). I
personally use Quake a lot since it is a fairly large C code base with
near zero OS/platform dependencies, and it also has floating-point.
It's far from perfect, though (it's showing its age).

/Marcus

Re: Testing with open source games (was Dense machine code ...)

<snnlgb$15m1$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=22151&group=comp.arch#22151

Newsgroups: comp.arch
From: terje.ma...@tmsw.no (Terje Mathisen)
Subject: Re: Testing with open source games (was Dense machine code ...)
Date: Thu, 25 Nov 2021 10:37:15 +0100
Organization: Aioe.org NNTP Server
Message-ID: <snnlgb$15m1$1@gioia.aioe.org>
 by: Terje Mathisen - Thu, 25 Nov 2021 09:37 UTC

Marcus wrote:
> I have ported Doom and Quake (obvious candidates, and I have ported them
> to other architectures before so it was a fairly low barrier). I
> personally use Quake a lot since it is a fairly large C code base with
> near zero OS/platform dependencies, and it also has floating-point.
> It's far from perfect, though (it's showing its age).

Quake showing its age? Please tell me it ain't so!

It was only 25+ years ago that I tried to grok the original C and Mike
Abrash' asm version, so that I could look for possible speedups. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Testing with open source games (was Dense machine code ...)

<snnrk8$dgl$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22152&group=comp.arch#22152

Newsgroups: comp.arch
From: m.del...@this.bitsnbites.eu (Marcus)
Subject: Re: Testing with open source games (was Dense machine code ...)
Date: Thu, 25 Nov 2021 12:21:44 +0100
Organization: A noiseless patient Spider
Message-ID: <snnrk8$dgl$1@dont-email.me>
In-Reply-To: <snnlgb$15m1$1@gioia.aioe.org>
 by: Marcus - Thu, 25 Nov 2021 11:21 UTC

On 2021-11-25 10:37, Terje Mathisen wrote:
> Marcus wrote:
>> I have ported Doom and Quake (obvious candidates, and I have ported them
>> to other architectures before so it was a fairly low barrier). I
>> personally use Quake a lot since it is a fairly large C code base with
>> near zero OS/platform dependencies, and it also has floating-point.
>> It's far from perfect, though (it's showing its age).
>
> Quake showing its age? Please tell me it ain't so!

Yeah :-) The things that I wouldn't expect to see in a more modern code
base are its excessive use of global variables (both for passing
information between functions, and just because they didn't bother with
using the static keyword), and its excessive use of shorts and bytes
(some of it is motivated, e.g. in packed data structures, but some of it
just seems to be because "shorts are faster & better than ints!").

These things may have been good/acceptable on x86-32, where you hardly
had any GPRs anyway (and the ones you had could be used as 8-bit or
16-bit registers), but it's really not a good design for modern
load/store machines (nor a good SW architecture).

>
> It was only 25+ years ago that I tried to grok the original C and Mike
> Abrash' asm version, so that I could look for possible speedups. :-)
>

I think I dabbled with it just before the source was made public (found
it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
64-bit arch was "fun" back then).

/Marcus

Re: Testing with open source games (was Dense machine code ...)

<sno68u$1dn2$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=22153&group=comp.arch#22153

Newsgroups: comp.arch
From: terje.ma...@tmsw.no (Terje Mathisen)
Subject: Re: Testing with open source games (was Dense machine code ...)
Date: Thu, 25 Nov 2021 15:23:25 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sno68u$1dn2$1@gioia.aioe.org>
 by: Terje Mathisen - Thu, 25 Nov 2021 14:23 UTC

Marcus wrote:
> On 2021-11-25 10:37, Terje Mathisen wrote:
>> Marcus wrote:
>>> I have ported Doom and Quake (obvious candidates, and I have ported them
>>> to other architectures before so it was a fairly low barrier). I
>>> personally use Quake a lot since it is a fairly large C code base with
>>> near zero OS/platform dependencies, and it also has floating-point.
>>> It's far from perfect, though (it's showing its age).
>>
>> Quake showing its age? Please tell me it ain't so!
>
> Yeah :-) The things that I wouldn't expect to see in a more modern code
> base are its excessive use of global variables (both for passing
> information between functions, and just because they didn't bother with
> using the static keyword), and its excessive use of shorts and bytes
> (some of it is motivated, e.g. in packed data structures, but some of it
> just seems to be because "shorts are faster & better than ints!").

I didn't look at those parts, had enough to do with helping Mike with
the asm code which was already 2.5x faster than the original C.
>
> These things may have been good/acceptable on x86-32 where you didn't
> have any GPR:s anyway (and the ones you had could be used as 8-bit or
> 16-bit registers), but it's really not a good design for modern
> load/store machines (nor a good SW architecture).

You've never really seen register pressure if you haven't tried to
optimize something like affine texture mapping on a 486 (or earlier)
running in 16-bit mode. I wrote lots of code in those days that
absolutely had to use both the low and high half of AX/BX/CX or DX as
separate 8-bit registers.

Said affine mapping could take advantage of 32-bit regs with 16:16
fixed-point math, that was the fastest way to do it until I discovered
how to cheat. :-)

>
>>
>> It was only 25+ years ago that I tried to grok the original C and Mike
>> Abrash' asm version, so that I could look for possible speedups. :-)
>>
>
> I think I dabbled with it just before the source was made public (found
> it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
> 64-bit arch was "fun" back then).

I believe you!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Testing with open source games (was Dense machine code ...)

<snolsl$ld2$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22155&group=comp.arch#22155

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: Testing with open source games (was Dense machine code ...)
Date: Thu, 25 Nov 2021 12:49:49 -0600
Organization: A noiseless patient Spider
Message-ID: <snolsl$ld2$1@dont-email.me>
In-Reply-To: <snni99$f2n$1@dont-email.me>
 by: BGB - Thu, 25 Nov 2021 18:49 UTC

On 11/25/2021 2:42 AM, Marcus wrote:
> On 2021-11-25 09:00, BGB wrote:
>
> [snip]
>
>>
>> Status update:
>> Fixing this issue, and also discovering and fixing another issue which
>> was resulting in incorrect type conversions (it was coercing constant
>> values to a smaller type in some cases where it should have
>> type-promoted instead), fixed some of the bugs I was encountering in
>> Quake.
>>
>> There are still a lot of cases where expressions are promoting to
>> 'double' only to then be forced back to 'float'. Not really a
>> quick/easy fix though (and in these cases, the promotions are
>> following C's type-promotion rules; it is more a case of it being
>> slightly less efficient than it could be).
>>
>>
>> Still no effect on the results of the E3M5 desync though.
>>
>> Did start wondering though if there is a way to make the other demos
>> not desync, but looking into it, it appears every version of the Doom
>> engine had slightly different behavior, and it appears some of the
>> newer/fancier multi-game engines try to infer the Doom engine version
>> from the IWAD and change around a bunch of settings to try to match
>> the behavior of the particular engine for the particular IWAD (eg,
>> minor changes from one version of the Doom engine breaking demos
>> originally recorded on another version of the Doom engine).
>>
>> Or, otherwise, it is likely getting "correct" demo playback from an
>> ad-hoc port of the Linuxdoom source might be asking a bit much...
>>
>>
>> Heretic and Hexen still behave as before.
>>
>> Though, as noted:
>>    Heretic seems to remain in-sync for the Shareware IWAD;
>>    It desyncs for two of the demos in the release IWAD.
>>
>> This is the same between builds, though there does appear to be a
>> difference in that in one of the demos, in some cases an enemy drops a
>> "morph ovum" and in other cases they do not (implying there is some
>> non-deterministic property somewhere).
>>
>>
>> Previously, in a case where demo playback was diverging between
>> targets in Doom, it was due to an out-of-bounds memory access and an
>> apparent divergence in the contents of the out-of-bound memory
>> contents between targets (possibly a pointer within the Z_Malloc
>> headers or similar).
>>
>> It is possible I could be looking at something similar in this case,
>> rather than necessarily a difference in compiler behavior.
>>
>>
>> No visible change in ROTT behavior, though it is nowhere near close to
>> matching the behavior of the x86 build (in that an x86 build with MSVC
>> does actually manage to play the demos correctly, but builds for other
>> targets are prone to desync).
>>
>> Though, if anything, a lot of this is probably evidence for why it is
>> probably better, if implementing a game engine, to implement demos in
>> terms of recording game events and similar rather than by recording
>> user keypresses.
>>
>> ...
>
>
> You seem to be using several of the old C code classics (Doom, Quake,
> ROTT, Heretic, ...?).

Yeah.

Ports, actively used (mostly based on the original source releases):
Quake (both software and GLQuake)
Doom
Heretic
Hexen
ROTT

Partial (not uploaded as of yet):
Quake 3 Arena (incomplete, OS support, *1)
Wolfenstein 3D (incomplete, license issues, ...)

*1: To work effectively, Quake 3 will require both virtual memory and
DLL loading. Neither of these really "has the kinks worked out" yet. It
is also very likely to have considerably worse performance than Quake 1
(so, most likely, basically a slide-show).

A lot of work also went into trying to reduce Quake 3's memory footprint
to something more reasonable for running on an FPGA, partly by allowing
the ZIP-based PK3 format to be replaced with my own 'WAD4' format, which
needs less RAM and is faster (though RP2 gets worse compression than
Deflate). Another change was allowing the JPG and TGA textures to be
replaced with DDS textures (DXT1 or DXT5).

In vertex-lighting mode, Quake 3 seems to draw less geometry than Quake
1 or 2, but one concern is that it has somewhat larger and more
complicated BSPs, which were already a big source of slowdown in Quake 1
(GLQuake was still limited to single-digit frame-rates even if one
doesn't draw any geometry).

Though, it does appear that if one could redo the Quake 1 maps with
Quake 3's BSP format and tools (say, qbsp3 modified to accept Quake 1
map files), it is possible they could be faster than the original Quake
1 BSPs.

Some small experiments:
Wrote a video player (several codecs, AVI format, *2);
Wrote a miniature raycast Minecraft-style engine;
Wrote a MOD player (still unreleased IIRC, *3)

*2:
Video:
RPZA: "QuickTime Video"
CRAM: "MS Video 1"
BT4B: one of my own codecs
BT5A: Another of my own codecs, simple indexed-color color-cell.
It is similar in concept to CRAM, but gets a lower bitrate.
Faster / simpler to decode than BT4B.
Audio:
IMA ADPCM
BTAC1C (extended form of IMA ADPCM)
Decoder is backwards compatible with IMA ADPCM.

No MPEG style codecs; I doubt BJX2 (at 50MHz) has the go power to make
MPEG style codecs "not suck".

While RPZA and CRAM are simple to decode, they ran into the problem that
their bitrates are bad enough to run into the limits of the IO bandwidth
from the SDcard (if running it at 5MHz, or ~600kB/s). One either has to
use very low resolutions or very poor quality settings, which did not
look good (I wanted, say, 320x200 video, not 160x100).

My BT4B codec did OK; I was able to push a color-cell based codec into
being bitrate-competitive with MPEG-1, and it decoded very fast by PC
standards, but it was entropy coded and still fairly computationally
demanding to decode on BJX2 (as well as being a fairly complicated
format in general).

So, I came up with BT5A as a compromise, intended to reduce the bitrate
while keeping decoding complexity more in line with something like RPZA
or CRAM, and using a dynamically modified index-color palette rather
than RGB555, ...

Structurally, it resembled a hybrid of CRAM, RPZA, and RP2. Like these
formats, it was also built around a raw byte stream (no entropy coding);
with a baseline color-cell format of 4x4x1 pixels (16 bits) and two
8-bit index-color endpoints.

*3: My initial attempt at MOD playback had performance issues with the
sound mixing. In this case, it is doing the mixing in software (no
dedicated hardware mixer).

Some of the MODs tested had sample data too large to fit effectively
into the L2 cache; and they were using mostly premixed PCM audio and
effectively using the MOD format like a chunked WAV file, which seems to
sorta defeat the point of the MOD format.

A lot of the other examples I could find were either "kinda awful", or
were built using more complicated formats (S3M/XM/IT/...).

> Have you seen
> https://github.com/videogamepreservation ? A gold mine of old games (it
> even has both C and Fortran versions of Zork!).
>

I haven't really dug around there.

I mostly have games which I was aware of the source for, and for which
the porting effort didn't look too absurd.

The games mostly tend to provide quick visual feedback if something has
broken somewhere (and tend to have better coverage than what I have been
able to pull off with "sanity tests").

> I wonder, which code bases have been most useful to work with?
>

In terms of testing C compiler features, Doom and Quake work well.

Hexen and Heretic are basically Doom variants, and my ports of these
engines mostly reused code which was copy/pasted from my Doom port. The
main difference (at the low level) was that Hexen's sound system was
somewhat redesigned from Doom, so required a little more work.

The ROTT port involved a whole lot of stripping out and rewriting a
bunch of x86 specific stuff (into C), and sorta awkwardly emulating the
VGA hardware via function calls because otherwise it would have required
a more significant rewrite of the renderer.

The engine was also very prone to out-of-bounds memory accesses.

The codebase is basically sorta like Wolf3D + parts of Doom (but, with
many of these parts used poorly; like they seemingly didn't quite get
the point of the WAD format and implemented a whole bunch of stuff with
computed lump indices).

Some parts of my porting effort also involved rewriting parts of the
engine to access lumps by name, allowing the engine to be modded Doom
style by loading PWADs. For the sake of my "Wolf3D in the ROTT engine"
effort, I had also allowed the maps to be put into the WAD file (like in
Doom), and allowed the RLE compression to be replaced with LZ
compression, ...


Re: Testing with open source games (was Dense machine code ...)

<soham2$5uf$1@dont-email.me>

 by: James Van Buskirk - Sun, 5 Dec 2021 03:11 UTC

"Marcus" wrote in message news:snnrk8$dgl$1@dont-email.me...

> On 2021-11-25 10:37, Terje Mathisen wrote:

> > Quake showing its age? Please tell me it ain't so!

> I think I dabbled with it just before the source was made public (found
> it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
> 64-bit arch was "fun" back then).

Are you the one responsible for that port? It always crashed when you
fired the BFG9000 :(

Re: Testing with open source games (was Dense machine code ...)

<sor83p$tld$1@dont-email.me>

 by: Marcus - Wed, 8 Dec 2021 21:29 UTC

On 2021-12-05 04:11, James Van Buskirk wrote:
> "Marcus"  wrote in message news:snnrk8$dgl$1@dont-email.me...
>> On 2021-11-25 10:37, Terje Mathisen wrote:
>
>> > Quake showing its age? Please tell me it ain't so!
>
>> I think I dabbled with it just before the source was made public (found
>> it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
>> 64-bit arch was "fun" back then).
>
> Are you the one responsible for that port? It always crashed when you
> fired the BFG9000 :(
>

I doubt that the port ever left the Swedish university that I studied at
back then, or even the few accounts that had the copy (but what do I
know?).

/Marcus

Re: Dense machine code from C++ code (compiler optimizations)

<dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>

 by: Ir. Hj. Othman bin H - Thu, 16 Dec 2021 21:55 UTC

On Monday, 22 November 2021 at 01:13:13 UTC+8, Marcus wrote:
> Just wrote this short post. Maybe someone finds it interesting...
>
> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>
> /Marcus

I do not regularly see the assembly output of compilers, but I suspect something like this would happen.
Thank you for sharing. This is an extreme example.

HLL is very poor in bit manipulations. Better use a library for these bit manipulations, so less need for optimization.

I used to haunt comp.arch 30 years ago, in the 1990s. Now, I do not know whether I need to subscribe to this group.
I did use a different email. But more than 10 years ago, I used this email to post to comp.arch also. I was able to see my post from 30 years ago. Not sure now.

Re: Dense machine code from C++ code (compiler optimizations)

<3a8bc1be-b972-482d-9d83-c4512e92fd01n@googlegroups.com>

 by: MitchAlsup - Thu, 16 Dec 2021 23:22 UTC

On Thursday, December 16, 2021 at 3:55:12 PM UTC-6, Ir. Hj. Othman bin Hj. Ahmad wrote:
> On Monday, 22 November 2021 at 01:13:13 UTC+8, Marcus wrote:
> > Just wrote this short post. Maybe someone finds it interesting...
> >
> > https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
> >
> > /Marcus
>
> I do not regularly see assembly output of compilers but I suspect something like this should happen.
> Thank you for sharing. This is an extreme example.
>
> HLL is very poor in bit manipulations.
<
HDLs are very good in bit manipulation::
Verilog: variable[first:last] just wires the bits as is.
Verilog: variable[last:first] wires the bits up backwards.
<
> Better use a library for these bit manipulations, so less need for optimization.
<
Bit manipulations (for fields less than 64 bits in size) are available natively in the My 66000 ISA via
<
# define fieldspec(a,b) (((a)<<32)|(b))
<
extracted = container >> fieldspec(width,offset);
//
// The HW checks :: 1 <= bits[37..32] <= 64 and 0 <= bits[5..0] <= 63
// The HW allows bits[37..32]==0 as a representation of 64-bits.
//
No need for any kind of library; << and >> can directly access bit fields.
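The described semantics can be emulated in plain C along these lines; a
sketch based only on the encoding given above (width in bits 37..32,
offset in bits 5..0, width 0 meaning 64), with `FIELDSPEC` and
`extract_field` as illustrative names:

```c
#include <stdint.h>

#define FIELDSPEC(width, offset) \
    (((uint64_t)(width) << 32) | (uint64_t)(offset))

/* Software emulation of the described extract: a >> with a fieldspec
   operand pulls 'width' bits starting at 'offset'; a width field of 0
   encodes a full 64-bit extract. */
uint64_t extract_field(uint64_t container, uint64_t spec)
{
    unsigned width  = (spec >> 32) & 0x3F;  /* bits 37..32 */
    unsigned offset = spec & 0x3F;          /* bits  5..0  */
    uint64_t mask = (width == 0) ? ~0ULL : ((1ULL << width) - 1);
    return (container >> offset) & mask;
}
```

In the actual ISA this is a single shift instruction; the point of the
sketch is only to pin down the field encoding.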
>
> I used to hauant comp.arch 30 years ago, in the 1990s. Now, I do nnot know that I need to subscribe to this group.
> I did use a differnt email. But more than 10 yeaars ago, I used this email to post to comp.arch also. I was able to see my post 30 years ago. Not sure now.

Re: Dense machine code from C++ code (compiler optimizations)

<sphe6s$5a3$1@newsreader4.netcologne.de>

 by: Thomas Koenig - Fri, 17 Dec 2021 07:28 UTC

Ir. Hj. Othman bin Hj. Ahmad <othmana@gmail.com> schrieb:

> HLL is very poor in bit manipulations. Better use a library for
> these bit manipulations, so less need for optimization.

Not really - calling a library function has a _lot_ of overhead,
and what goes on in the library is hidden from the compiler. It is
usually preferred for the compiler to emit highly optimized code.

These days, compilers might even recognize some standard
idioms for certain bit manipulations and use a corresponding
machine instruction, if available.
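The classic example of such an idiom is the portable rotate: GCC and
Clang recognize this shift-and-or pattern and emit a single rotate
instruction (e.g. x86 ROL) on targets that have one, with no library
call involved:

```c
#include <stdint.h>

/* Portable rotate-left idiom; the (32 - n) & 31 form avoids undefined
   behavior when n is 0, which is what lets compilers match it safely. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```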

Re: Dense machine code from C++ code (compiler optimizations)

<sphs92$8lg$1@gioia.aioe.org>

 by: chris - Fri, 17 Dec 2021 11:28 UTC

On 12/17/21 07:28, Thomas Koenig wrote:
> Ir. Hj. Othman bin Hj. Ahmad<othmana@gmail.com> schrieb:
>
>> HLL is very poor in bit manipulations. Better use a library for
>> these bit manipulations, so less need for optimization.
>
> Not really - calling a library function has a _lot_ of overhead,
> and what goes on in the library is hidden from the compiler. It is
> usually preferred for the compiler to emit highly optimized code.
>
> These days, compilers will might even recognize some standard
> idioms for certain bit manipulations and use a corresponding
> machine instruction, if available.

Working primarily in embedded space, I would never use bitfields in
C, as they are not guaranteed to be portable. I tend to use inline mask
tables for that functionality and stick to a safe subset of C.
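A minimal sketch of that mask-table style (the register and bit names
here are made up purely for illustration): bit positions live in one
shared header, and modules index small const tables rather than relying
on compiler-defined bitfield layout.

```c
#include <stdint.h>

/* Shared header would define the bit positions once: */
#define CTRL_ENABLE_BIT  0
#define CTRL_IRQ_BIT     1
#define CTRL_DMA_BIT     2

/* Module-local const mask table, indexed by the bit definitions. */
static const uint32_t ctrl_mask[] = {
    1u << CTRL_ENABLE_BIT,
    1u << CTRL_IRQ_BIT,
    1u << CTRL_DMA_BIT,
};

static inline uint32_t ctrl_set(uint32_t reg, int bit)
{
    return reg | ctrl_mask[bit];
}

static inline int ctrl_test(uint32_t reg, int bit)
{
    return (reg & ctrl_mask[bit]) != 0;
}
```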

Chris

Re: Dense machine code from C++ code (compiler optimizations)

<spi43n$h7h$1@dont-email.me>

 by: David Brown - Fri, 17 Dec 2021 13:42 UTC

On 17/12/2021 12:28, chris wrote:
> On 12/17/21 07:28, Thomas Koenig wrote:
>> Ir. Hj. Othman bin Hj. Ahmad<othmana@gmail.com>  schrieb:
>>
>>> HLL is very poor in bit manipulations. Better use a library for
>>> these bit manipulations, so less need for optimization.
>>
>> Not really - calling a library function has a _lot_ of overhead,
>> and what goes on in the library is hidden from the compiler.  It is
>> usually preferred for the compiler to emit highly optimized code.
>>
>> These days, compilers will might even recognize some standard
>> idioms for certain bit manipulations and use a corresponding
>> machine instruction, if available.
>
> Working primarily in embedded space, would never use bitfields in
> C, as not guaranteed to be portable. Tend to use inline mask tables
> for that that functionality and stick to  safe subset of C.
>

Working primarily in embedded development, I quite happily use bitfields
because you generally know exactly what hardware you are working with,
and bitfields can significantly reduce some kinds of error.

Not being guaranteed to be portable is rarely as big an issue as people
often think. If you are writing a driver for a peripheral in a
microcontroller, portability is already limited by the specifics of the
hardware.

There are a few things about bitfields that are not tightly specified.
Ordering of bits, padding and alignment, and the types of bitfields
supported are the usual ones. But a lot of that is either easily
checked (such as a static assertion on the size of the struct),
consistent for the toolchain (gcc supports the same types across all
targets), or consistent for the architecture (most processors have an
ABI which specifies bitfield details).
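A sketch of what that static assertion might look like in C11 (the
register and field names are illustrative; whether the struct really
occupies 32 bits is compiler- and ABI-dependent, which is exactly what
the assertion guards against):

```c
#include <stdint.h>
#include <assert.h>   /* static_assert, C11 */

/* A hardware register described as bitfields, with a compile-time
   check that the struct is exactly the size the hardware expects. */
typedef struct {
    uint32_t enable   : 1;
    uint32_t irq_mask : 3;
    uint32_t divider  : 8;
    uint32_t          : 20;  /* padding up to 32 bits */
} timer_ctrl_t;

static_assert(sizeof(timer_ctrl_t) == 4,
              "timer_ctrl_t must map onto a 32-bit register");
```

Bit ordering within the word is still ABI-defined, so a check like this
catches size/padding surprises but not endianness of the fields.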

The key place to avoid (or be very careful with) bitfields due to
portability issues is when you have data exchange back and forth from
the system you are using - network protocols, file formats, and the like.

Re: Dense machine code from C++ code (compiler optimizations)

<spi4l7$sdj$1@dont-email.me>

 by: David Brown - Fri, 17 Dec 2021 13:51 UTC

On 17/12/2021 08:28, Thomas Koenig wrote:
> Ir. Hj. Othman bin Hj. Ahmad <othmana@gmail.com> schrieb:
>
>> HLL is very poor in bit manipulations. Better use a library for
>> these bit manipulations, so less need for optimization.
>
> Not really - calling a library function has a _lot_ of overhead,
> and what goes on in the library is hidden from the compiler. It is
> usually preferred for the compiler to emit highly optimized code.
>

The best choice here is a header library (or equivalent for languages
other than C, C++ and friends). You can write your static inline
functions or templates in whatever you feel is the nicest way for your
needs, and re-use these functions with no overhead.
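A minimal sketch of that header-library style in C (names are
illustrative; `width` is assumed to be between 1 and 31 so the mask
computation stays defined). Because the helpers are `static inline`,
the compiler folds them into the caller and the "library" costs nothing
at run time:

```c
#include <stdint.h>

/* Extract 'width' bits starting at 'offset' (1 <= width <= 31). */
static inline uint32_t bits_extract(uint32_t word, unsigned offset,
                                    unsigned width)
{
    return (word >> offset) & ((1u << width) - 1u);
}

/* Replace 'width' bits starting at 'offset' with 'value'. */
static inline uint32_t bits_insert(uint32_t word, unsigned offset,
                                   unsigned width, uint32_t value)
{
    uint32_t mask = ((1u << width) - 1u) << offset;
    return (word & ~mask) | ((value << offset) & mask);
}
```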

> These days, compilers will might even recognize some standard
> idioms for certain bit manipulations and use a corresponding
> machine instruction, if available.
>

Compilers have been doing that for decades. Details vary, of course,
and sometimes they are not quite optimal depending on exactly how you
write the bit manipulation. But they are usually pretty solid.

Re: Dense machine code from C++ code (compiler optimizations)

<spif98$1lkm$1@gioia.aioe.org>

 by: chris - Fri, 17 Dec 2021 16:52 UTC

On 12/17/21 13:42, David Brown wrote:
> On 17/12/2021 12:28, chris wrote:
>> On 12/17/21 07:28, Thomas Koenig wrote:
>>> Ir. Hj. Othman bin Hj. Ahmad<othmana@gmail.com> schrieb:
>>>
>>>> HLL is very poor in bit manipulations. Better use a library for
>>>> these bit manipulations, so less need for optimization.
>>>
>>> Not really - calling a library function has a _lot_ of overhead,
>>> and what goes on in the library is hidden from the compiler. It is
>>> usually preferred for the compiler to emit highly optimized code.
>>>
>>> These days, compilers will might even recognize some standard
>>> idioms for certain bit manipulations and use a corresponding
>>> machine instruction, if available.
>>
>> Working primarily in embedded space, would never use bitfields in
>> C, as not guaranteed to be portable. Tend to use inline mask tables
>> for that that functionality and stick to safe subset of C.
>>
>
> Working primarily in embedded development, I quite happily use bitfields
> because you generally know exactly what hardware you are working with,
> and bitfields can significantly reduce some kinds of error.
>
> Not being guaranteed to be portable is rarely as big an issue as people
> often think. If you are writing a driver for a peripheral in a
> microcontroller, portability is already limited by the specifics of the
> hardware.
>
> There are a few things about bitfields that are not tightly specified.
> Ordering of bits, padding and alignment, and the types of bitfields
> supported are the usual ones. But a lot of that is either easily
> checked (such as a static assertion on the size of the struct),
> consistent for the toolchain (gcc supports the same types across all
> targets), or consistent for the architecture (most processor's have an
> ABI which specifies bitfield details).
>
> The key place to avoid (or be very careful with) bitfields due to
> portability issues is when you have data exchange back and forth from
> the system you are using - network protocols, file formats, and the like.
>

To each his own, but I take a slightly different approach. I have a
standard header file included at the top of all modules, with
all the bit definitions. Within the modules, short
const tables to use as masks. The whole idea is to abstract away
anything that could be affected by the compiler. Done right, the
code can be expected to compile with any modern compiler. All
code here is expected to be reused, so some effort is made to
help that along. If that means using a subset of C, then
fair enough.

Bitfields may have been a novel idea given the code storage limitations
of the past, but they are one of those things in C that seem just a
bit dodgy. There were so many horror stories about bitfields with early
compilers that they are best avoided. The early MISRA standard thought so
as well. Programming defensively seems like common sense to me,
and far more productive in the long run...

Chris

Re: Dense machine code from C++ code (compiler optimizations)

<spih96$mk4$1@dont-email.me>

 by: David Brown - Fri, 17 Dec 2021 17:27 UTC

On 17/12/2021 17:52, chris wrote:
> On 12/17/21 13:42, David Brown wrote:
>> On 17/12/2021 12:28, chris wrote:
>>> On 12/17/21 07:28, Thomas Koenig wrote:
>>>> Ir. Hj. Othman bin Hj. Ahmad<othmana@gmail.com>   schrieb:
>>>>
>>>>> HLL is very poor in bit manipulations. Better use a library for
>>>>> these bit manipulations, so less need for optimization.
>>>>
>>>> Not really - calling a library function has a _lot_ of overhead,
>>>> and what goes on in the library is hidden from the compiler.  It is
>>>> usually preferred for the compiler to emit highly optimized code.
>>>>
>>>> These days, compilers will might even recognize some standard
>>>> idioms for certain bit manipulations and use a corresponding
>>>> machine instruction, if available.
>>>
>>> Working primarily in embedded space, would never use bitfields in
>>> C, as not guaranteed to be portable. Tend to use inline mask tables
>>> for that that functionality and stick to  safe subset of C.
>>>
>>
>> Working primarily in embedded development, I quite happily use bitfields
>> because you generally know exactly what hardware you are working with,
>> and bitfields can significantly reduce some kinds of error.
>>
>> Not being guaranteed to be portable is rarely as big an issue as people
>> often think.  If you are writing a driver for a peripheral in a
>> microcontroller, portability is already limited by the specifics of the
>> hardware.
>>
>> There are a few things about bitfields that are not tightly specified.
>> Ordering of bits, padding and alignment, and the types of bitfields
>> supported are the usual ones.  But a lot of that is either easily
>> checked (such as a static assertion on the size of the struct),
>> consistent for the toolchain (gcc supports the same types across all
>> targets), or consistent for the architecture (most processor's have an
>> ABI which specifies bitfield details).
>>
>> The key place to avoid (or be very careful with) bitfields due to
>> portability issues is when you have data exchange back and forth from
>> the system you are using - network protocols, file formats, and the like.
>>
>
> o each his own, but take a slightly different approach.

Of course - there is no "right" answer here!

> Have a
> standard header file included at the top of all modules, with
> all the bit definitions.  Within the modules, short
> const tables to use as masks. The whole idea is to abstract away
> anything that could be affected by the compiler. Done right, the
> code can be expected to compile with any modern compiler.

Done right and used appropriately, bitfields are perfectly suitable for
use in portable code and any modern compiler. That doesn't mean they
are necessarily /better/ for any particular use, just that they are not
something that you have to pass on if you want safe, reliable, efficient
and portable code.

> All
> code here is expected to be reused, so some effort is made to
> help that along. If that means using a subset of C, then
> fair enough.

Everyone writes in a subset of C - we just have different choices as to
what those subsets should be :-)

>
> Bitfields may have been a novel idea with code storage limitations
> of the past but just one of those things in C that seem just a
> bit dodgy. So many horror stories about bitfields with early
> compilers, so best avoided. 

No - the answer is to avoid horrible early compilers. It is silly to
restrict how you can work purely because of outdated problems.

> The early MISRA standard thought so
> as well. Program defensively seems like common sense to me
> and far more productive in the long run...
>

MISRA has always seemed to me to be a mixture of a few good rules, a
number of directly bad and counter-productive rules, and a whole lot of
things that could be summed up by "Learn the C language and use it
correctly" and "Don't be stupid". There is a difference between
programming defensively and limiting good programming techniques because
of bad tools. It is far more productive to use good tools.

Re: Dense machine code from C++ code (compiler optimizations)

<spiis0$1i0g$1@gioia.aioe.org>

 by: chris - Fri, 17 Dec 2021 17:54 UTC

On 12/17/21 17:27, David Brown wrote:
> On 17/12/2021 17:52, chris wrote:
>> On 12/17/21 13:42, David Brown wrote:
>>> On 17/12/2021 12:28, chris wrote:
>>>> On 12/17/21 07:28, Thomas Koenig wrote:
>>>>> Ir. Hj. Othman bin Hj. Ahmad<othmana@gmail.com> schrieb:
>>>>>
>>>>>> HLL is very poor in bit manipulations. Better use a library for
>>>>>> these bit manipulations, so less need for optimization.
>>>>>
>>>>> Not really - calling a library function has a _lot_ of overhead,
>>>>> and what goes on in the library is hidden from the compiler. It is
>>>>> usually preferred for the compiler to emit highly optimized code.
>>>>>
>>>>> These days, compilers might even recognize some standard
>>>>> idioms for certain bit manipulations and use a corresponding
>>>>> machine instruction, if available.
>>>>
>>>> Working primarily in embedded space, would never use bitfields in
>>>> C, as not guaranteed to be portable. Tend to use inline mask tables
>>>> for that functionality and stick to a safe subset of C.
>>>>
>>>
>>> Working primarily in embedded development, I quite happily use bitfields
>>> because you generally know exactly what hardware you are working with,
>>> and bitfields can significantly reduce some kinds of error.
>>>
>>> Not being guaranteed to be portable is rarely as big an issue as people
>>> often think. If you are writing a driver for a peripheral in a
>>> microcontroller, portability is already limited by the specifics of the
>>> hardware.
>>>
>>> There are a few things about bitfields that are not tightly specified.
>>> Ordering of bits, padding and alignment, and the types of bitfields
>>> supported are the usual ones. But a lot of that is either easily
>>> checked (such as a static assertion on the size of the struct),
>>> consistent for the toolchain (gcc supports the same types across all
>>> targets), or consistent for the architecture (most processors have an
>>> ABI which specifies bitfield details).
>>>
>>> The key place to avoid (or be very careful with) bitfields due to
>>> portability issues is when you have data exchange back and forth from
>>> the system you are using - network protocols, file formats, and the like.
>>>
>>
>> To each his own, but take a slightly different approach.
>
> Of course - there is no "right" answer here!
>
>> Have a
>> standard header file included at the top of all modules, with
>> all the bit definitions. Within the modules, short
>> const tables to use as masks. The whole idea is to abstract away
>> anything that could be affected by the compiler. Done right, the
>> code can be expected to compile with any modern compiler.
>
> Done right and used appropriately, bitfields are perfectly suitable for
> use in portable code and any modern compiler. That doesn't mean they
> are necessarily /better/ for any particular use, just that they are not
> something that you have to pass on if you want safe, reliable, efficient
> and portable code.
>
>> All
>> code here is expected to be reused, so some effort is made to
>> help that along. If that means using a subset of C, then
>> fair enough.
>
> Everyone writes in a subset of C - we just have different choices as to
> what those subsets should be :-)
>
>>
>> Bitfields may have been a novel idea with code storage limitations
>> of the past but just one of those things in C that seem just a
>> bit dodgy. So many horror stories about bitfields with early
>> compilers, so best avoided.
>
> No - the answer is to avoid horrible early compilers. It is silly to
> restrict how you can work purely because of outdated problems.
>
>> The early MISRA standard thought so
>> as well. Program defensively seems like common sense to me
>> and far more productive in the long run...
>>
>
> MISRA has always seemed to me to be a mixture of a few good rules, a
> number of directly bad and counter-productive rules, and a whole lot of
> things that could be summed up by "Learn the C language and use it
> correctly" and "Don't be stupid". There is a difference between
> programming defensively and limiting good programming techniques because
> of bad tools. It is far more productive to use good tools.

Oh, agree with much of that, but we are all conditioned by
experience and I think the bitfield thing was so ingrained and
have used other methods for so long, unlikely to change now.

Right about some of the early compilers as well. We have a lot to be
thankful for with the gnu tools, which put to rest for good, expensive
tool chains and license manager dongles...

Chris

Re: Dense machine code from C++ code (compiler optimizations)

<spisnd$dq2$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=22322&group=comp.arch#22322

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Fri, 17 Dec 2021 21:42:21 +0100
Organization: Aioe.org NNTP Server
Message-ID: <spisnd$dq2$1@gioia.aioe.org>
References: <sndun6$q07$1@dont-email.me>
<dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <sphs92$8lg$1@gioia.aioe.org>
<spi43n$h7h$1@dont-email.me> <spif98$1lkm$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="14146"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 17 Dec 2021 20:42 UTC

chris wrote:
> On 12/17/21 13:42, David Brown wrote:
>> The key place to avoid (or be very careful with) bitfields due to
>> portability issues is when you have data exchange back and forth from
>> the system you are using - network protocols, file formats, and the like.
>>
>
> To each his own, but take a slightly different approach. Have a
> standard header file included at the top of all modules, with
> all the bit definitions.  Within the modules, short
> const tables to use as masks. The whole idea is to abstract away
> anything that could be affected by the compiler. Done right, the
> code can be expected to compile with any modern compiler. All
> code here is expected to be reused, so some effort is made to
> help that along. If that means using a subset of C, then
> fair enough.

I don't use that header setup, I'm closer to David here:

I'm guessing that I have written more low-level bit-twiddling code than
most here (graphics, io drivers, codecs, crypto, network protocols, fp
emulation), and I have _never_ used bitfields to implement any of this.

I'll typically load 32 or 64 bits surrounding the target area, swap the
bytes if there's an endian change, then use regular shift & mask ops to
get at what I need.
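A minimal C sketch of that pattern (helper names are mine, not Terje's; it assumes a little-endian host and field widths below 32 bits):

```c
#include <stdint.h>
#include <string.h>

/* Load the 32 bits surrounding the target field, then shift and mask.
   memcpy makes the possibly-unaligned load well-defined; on a
   big-endian host, this is where the byte swap would go. */
static uint32_t load_u32le(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Extract 'width' bits starting at absolute bit position 'pos'. */
static uint32_t get_bits(const unsigned char *p, unsigned pos, unsigned width)
{
    uint32_t v = load_u32le(p + pos / 8);
    return (v >> (pos % 8)) & ((1u << width) - 1u);
}
```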

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<555424d7-7e86-4d10-ad66-398c2035797dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22323&group=comp.arch#22323

Newsgroups: comp.arch
X-Received: by 2002:ac8:7dd1:: with SMTP id c17mr4492234qte.546.1639781319299;
Fri, 17 Dec 2021 14:48:39 -0800 (PST)
X-Received: by 2002:a05:6830:1445:: with SMTP id w5mr3772029otp.112.1639781319043;
Fri, 17 Dec 2021 14:48:39 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 17 Dec 2021 14:48:38 -0800 (PST)
In-Reply-To: <spisnd$dq2$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4528:60b:ee17:ff38;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4528:60b:ee17:ff38
References: <sndun6$q07$1@dont-email.me> <dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <sphs92$8lg$1@gioia.aioe.org>
<spi43n$h7h$1@dont-email.me> <spif98$1lkm$1@gioia.aioe.org> <spisnd$dq2$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <555424d7-7e86-4d10-ad66-398c2035797dn@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 17 Dec 2021 22:48:39 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 36
 by: MitchAlsup - Fri, 17 Dec 2021 22:48 UTC

On Friday, December 17, 2021 at 2:42:25 PM UTC-6, Terje Mathisen wrote:
> chris wrote:
> > On 12/17/21 13:42, David Brown wrote:
> >> The key place to avoid (or be very careful with) bitfields due to
> >> portability issues is when you have data exchange back and forth from
> >> the system you are using - network protocols, file formats, and the like.
> >>
> >
> > To each his own, but take a slightly different approach. Have a
> > standard header file included at the top of all modules, with
> > all the bit definitions. Within the modules, short
> > const tables to use as masks. The whole idea is to abstract away
> > anything that could be affected by the compiler. Done right, the
> > code can be expected to compile with any modern compiler. All
> > code here is expected to be reused, so some effort is made to
> > help that along. If that means using a subset of C, then
> > fair enough.
> I don't use that header setup, I'm closer to David here:
>
> I'm guessing that I have written more low-level bit-twiddling code than
> most here (graphics, io drivers, codecs, crypto, network protocols, fp
> emulation), and I have _never_ used bitfields to implement any of this.
>
> > I'll typically load 32 or 64 bits surrounding the target area, swap the
> bytes if there's an endian change, then use regular shift & mask ops to
> get at what I need.
<
We teach compilers to recognize << followed by >> and convert it into
EXT instructions.
We teach compilers to recognize & followed by << followed by &~
followed by | and convert it into INS instructions (when appropriate).
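For reference, the two idioms look roughly like this in C (function names are illustrative, and both assume pos + width <= 32 with width < 32); a pattern-matching compiler can collapse each body into a single extract or insert instruction:

```c
#include <stdint.h>

/* Extraction idiom: << followed by >> isolates a 'width'-bit field
   starting at bit 'pos'. */
static uint32_t ext_field(uint32_t x, unsigned pos, unsigned width)
{
    return (x << (32u - pos - width)) >> (32u - width);
}

/* Insertion idiom: the &, <<, &~, | sequence described above --
   clear the field, then OR in the shifted new value. */
static uint32_t ins_field(uint32_t x, uint32_t f, unsigned pos, unsigned width)
{
    uint32_t mask = ((1u << width) - 1u) << pos;
    return (x & ~mask) | ((f << pos) & mask);
}
```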
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<8acc16a1-246d-4c11-8cea-28da7438a28an@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22324&group=comp.arch#22324

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:2aa3:: with SMTP id js3mr54115qvb.69.1639796082845;
Fri, 17 Dec 2021 18:54:42 -0800 (PST)
X-Received: by 2002:a05:6830:1445:: with SMTP id w5mr4270939otp.112.1639796082597;
Fri, 17 Dec 2021 18:54:42 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 17 Dec 2021 18:54:42 -0800 (PST)
In-Reply-To: <555424d7-7e86-4d10-ad66-398c2035797dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4528:60b:ee17:ff38;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4528:60b:ee17:ff38
References: <sndun6$q07$1@dont-email.me> <dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <sphs92$8lg$1@gioia.aioe.org>
<spi43n$h7h$1@dont-email.me> <spif98$1lkm$1@gioia.aioe.org>
<spisnd$dq2$1@gioia.aioe.org> <555424d7-7e86-4d10-ad66-398c2035797dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8acc16a1-246d-4c11-8cea-28da7438a28an@googlegroups.com>
Subject: Re: Dense machine code from C++ code (compiler optimizations)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 18 Dec 2021 02:54:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 50
 by: MitchAlsup - Sat, 18 Dec 2021 02:54 UTC

On Friday, December 17, 2021 at 4:48:40 PM UTC-6, MitchAlsup wrote:
> On Friday, December 17, 2021 at 2:42:25 PM UTC-6, Terje Mathisen wrote:
> > chris wrote:
> > > On 12/17/21 13:42, David Brown wrote:
> > >> The key place to avoid (or be very careful with) bitfields due to
> > >> portability issues is when you have data exchange back and forth from
> > >> the system you are using - network protocols, file formats, and the like.
> > >>
> > >
> > > To each his own, but take a slightly different approach. Have a
> > > standard header file included at the top of all modules, with
> > > all the bit definitions. Within the modules, short
> > > const tables to use as masks. The whole idea is to abstract away
> > > anything that could be affected by the compiler. Done right, the
> > > code can be expected to compile with any modern compiler. All
> > > code here is expected to be reused, so some effort is made to
> > > help that along. If that means using a subset of C, then
> > > fair enough.
> > I don't use that header setup, I'm closer to David here:
> >
> > I'm guessing that I have written more low-level bit-twiddling code than
> > most here (graphics, io drivers, codecs, crypto, network protocols, fp
> > emulation), and I have _never_ used bitfields to implement any of this.
> >
> > > I'll typically load 32 or 64 bits surrounding the target area, swap the
> > bytes if there's an endian change, then use regular shift & mask ops to
> > get at what I need.
> <
> We teach compilers to recognize << followed by >> and convert it into
> EXT instructions.
> We teach compilers to recognize & followed by << followed by &~
> followed by | and convert it into INS instructions (when appropriate).
<
Then there were other bit-twiddling patterns that generally read something
like:
<
if( p->state & SOMESTATE )
{
    p->state ^= SOMESTATE | SOMEOTHERSTATE;
    .....
}
<
Using XORs to get rid of the current state and change it to some new state
while leaving the rest of the container alone. This is often faster than actually
using bit fields.
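Fleshed out into a compilable toy (the flag values are made up), the trick is:

```c
#include <stdint.h>

#define SOMESTATE      0x01u
#define SOMEOTHERSTATE 0x02u

/* Because the XOR constant covers both flags, a single operation clears
   SOMESTATE and sets SOMEOTHERSTATE while every other bit of the
   container passes through untouched. */
static uint32_t advance_state(uint32_t state)
{
    if (state & SOMESTATE)
        state ^= SOMESTATE | SOMEOTHERSTATE;
    return state;
}
```

Note the implicit invariant: SOMEOTHERSTATE must be clear whenever SOMESTATE is set, since the XOR would otherwise clear it rather than set it.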
> <
> > Terje
> >
> > --
> > - <Terje.Mathisen at tmsw.no>
> > "almost all programming can be viewed as an exercise in caching"

Re: Dense machine code from C++ code (compiler optimizations)

<spjmcf$hhs$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22325&group=comp.arch#22325

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Fri, 17 Dec 2021 22:00:14 -0600
Organization: A noiseless patient Spider
Lines: 158
Message-ID: <spjmcf$hhs$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me>
<dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <spi4l7$sdj$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 18 Dec 2021 04:00:16 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fcf15497748725c5a6b14761bb7bf743";
logging-data="17980"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+fGFa/BuP78WbnVsQgZvfM"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:EuwdLRxM8N0z4yQ9Q60Ewzx4Z5w=
In-Reply-To: <spi4l7$sdj$1@dont-email.me>
Content-Language: en-US
 by: BGB - Sat, 18 Dec 2021 04:00 UTC

On 12/17/2021 7:51 AM, David Brown wrote:
> On 17/12/2021 08:28, Thomas Koenig wrote:
>> Ir. Hj. Othman bin Hj. Ahmad <othmana@gmail.com> schrieb:
>>
>>> HLL is very poor in bit manipulations. Better use a library for
>>> these bit manipulations, so less need for optimization.
>>
>> Not really - calling a library function has a _lot_ of overhead,
>> and what goes on in the library is hidden from the compiler. It is
>> usually preferred for the compiler to emit highly optimized code.
>>
>
> The best choice here is a header library (or equivalent for languages
> other than C, C++ and friends). You can write your static inline
> functions or templates in whatever you feel is the nicest way for your
> needs, and re-use these functions with no overhead.
>

It depends some.

One might end up also having a header which serves mostly to detect some
compiler and machine specific defines and tune the logic for performance.

Say, for GCC, one might want to use inline functions, but for MSVC it
might be better to do everything with preprocessor macros, ...

>> These days, compilers might even recognize some standard
>> idioms for certain bit manipulations and use a corresponding
>> machine instruction, if available.
>>
>
> Compilers have been doing that for decades. Details vary, of course,
> and sometimes they are not quite optimal depending on exactly how you
> write the bit manipulation. But they are usually pretty solid.

Recently was messing with an LZ decoder, and comparing performance
results between several ways of moving data/values around:
using volatile and pointer casts;
using may_alias and pointer casts;
using memcpy;
using byte-oriented access.

This was via a wrapper library for various "fundamental" operators, such
as getting/setting values, or copying 8 or 16 bytes from a source
location to a destination (may overlap).
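As a rough illustration (names are my own, not the actual wrapper API), the memcpy variant of such fundamental operators looks like:

```c
#include <stdint.h>
#include <string.h>

/* memcpy-based get/set: compilers typically lower these to plain loads
   and stores while staying within the strict-aliasing rules. */
static uint64_t get_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void set_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}

/* 8-byte block copy.  Caveat: load-before-store only matches a
   byte-by-byte copy when the regions don't overlap within 8 bytes,
   which matters for short-distance LZ match copies. */
static void copy_8(void *dst, const void *src)
{
    set_u64(dst, get_u64(src));
}
```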

On GCC and Clang:
Volatile was ~ 20% faster than memcpy (~ 2.0GB/s vs 1.8GB/s);
Could not get may_alias to work correctly on GCC;
On clang, may_alias gave the same behavior as volatile;
Byte-oriented patterns were slower than memcpy (~ 1.5 GB/s).

On MSVC:
Volatile was fastest (~ 2.2 GB/s)
No obvious difference between volatile and a bare pointer.
Memcpy was again, slightly slower ( ~ 2.0 GB/s);
Byte-oriented access was somewhat worse ( ~ 800 MB/s ).

Bare pointer casts, with no 'volatile' or similar, in both GCC and
Clang, caused corrupted output data.

Interestingly, in GCC, "-fno-strict-aliasing" did not seem to have any
visible effect in this case, however optimization level did (-O1 and -Os
worked, -O2 and -O3 failed). It appears as if GCC was changing the
relative order of the memory loads and stores unless 'volatile' were used.

This was with my RP2 format, which (on x86) seems to be a little slower
than LZ4. The LZ4 command-line tool (written by Yann Collet) gave ~ 2.7
GB/s for the file I was testing (an arbitrary EXE file); though, for my
own LZ4 decoders, it tends to be a little closer.

I was also messing around with trying to get more speed from a 'ULZ'
codec, which was sort of like a hybrid of LZ4 and Deflate.

It uses Huffman coding with a similar structure for the encoded stream
to that of LZ4 (alternating runs of literal bytes and LZ matches).
Things like match lengths and distances use a similar encoding to the
distance encoding used in Deflate.

Unlike Deflate, Huffman symbols were limited to 12 bits (*).

*: With a 12-bit lookup, the lookup table is ~ 8K, and allows for the
fastest-case Huffman decoder to be a single table lookup. A 15-bit limit
(as in Deflate) gives slightly better compression, but makes the lookup
table considerably slower (the faster approach then being to use a
2-stage lookup). However, making the symbol length shorter than 12-bits
has a significant adverse effect on its general effectiveness.
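A sketch of that single-lookup decode step (canonical table fill included; the layout and names are my guesses, not the actual ULZ code): with codes capped at 12 bits, one 4096-entry table maps the next 12 bits of the stream directly to a (symbol, length) pair.

```c
#include <stdint.h>

#define MAXLEN 12  /* 12-bit cap -> 4096-entry table, ~8K or less */

typedef struct { uint16_t sym; uint8_t len; } HuffEnt;

static HuffEnt table[1 << MAXLEN];

/* Fill every slot whose top 'len' bits equal 'code' -- the standard
   canonical-Huffman table construction. */
static void add_code(uint16_t sym, uint16_t code, uint8_t len)
{
    uint32_t lo = (uint32_t)code << (MAXLEN - len);
    uint32_t n  = 1u << (MAXLEN - len);
    for (uint32_t i = 0; i < n; i++) {
        table[lo + i].sym = sym;
        table[lo + i].len = len;
    }
}

/* Decode one symbol from an MSB-first 32-bit bit buffer: a single
   table lookup yields both the symbol and the bits to consume. */
static uint16_t decode_one(uint32_t bitbuf, unsigned *consumed)
{
    HuffEnt e = table[bitbuf >> (32 - MAXLEN)];
    *consumed = e.len;
    return e.sym;
}
```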

With ULZ, was getting a decode speed of around 460 MB/s for the file in
the previous tests (though, working on making it faster did make the
code somewhat larger; a minimal decoder being around 300 lines, but the
"fast" decoder was more around 800 lines; most of this related to
bulkier code for the bitstream handling and symbol decoding).

In my tests, ULZ compression was in a similar area to Deflate.

The current ULZ design uses 3 Huffman tables, but I could consider
experimenting with a variant which uses 2 Huffman tables (could fit in
L1 a little easier).

....

These values were for my native PC; decoding speeds on BJX2 are
considerably slower (~ 7 MB/s @ 50 MHz). Nonetheless, it is still
fast enough to be usable for IO acceleration from the SDcard, though
doesn't usually offer a significant compression advantage over RP2
(which is somewhat faster).

Not much new ISA-wise; I suspect BJX2 is starting to reach a certain
degree of stability. Most recent change was adding a TEAH register,
which is mostly useful for dealing with XTLB misses (allowing the TLB to
send over the entire 96-bit address).

It could potentially be useful for extending the capabilities of
inter-processor interrupts (including significantly expanding the
encoding space for the CoreID for inter-core interrupts).

I am also having idle thoughts as to whether or not to resume work on an
x86 emulator for BJX2 (running x86 on BJX2 via a threaded-code VM / JIT).

Other thoughts are for whether or not to continue working on rewriting
the C library. Moving forward (eg: giving TestKern capabilities more
like a real OS) is likely to require some level of structural redesign
both to the C library and to TestKern as a whole. This would involve
some amount of "driving wedges", namely separating "front-end" and
"back-end" parts of the C library to be able to play nicer with things
like dynamic linking, as well as dividing the C library and TestKern
kernel into two independent entities (eg: not linking a copy of the
kernel with every binary).

Though loading up binaries bare-metal is nice for debugging, I am
likely to need to move away from this practice.

In the re-imagined C library, a lot of the library will remain
statically linked with the binaries, however the library would be
divided into two major halves connected via internal VTables, with
another VTable providing OS level interfaces. When a DLL is loaded, it
will inherit its back-end VTables from the main binaries instance of the
C library. The OS VTable would mostly consist of syscall wrappers (would
partly replace the current use of wrapper functions and function pointers).

The change would be drastic enough that it couldn't be easily glued
onto the existing C library (a modified version of PDPCLIB), and for a
while I had been tempted to rewrite the C library anyways mostly to
reduce the amount of "cruft".

....

Re: Dense machine code from C++ code (compiler optimizations)

<spl5kc$mja$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22330&group=comp.arch#22330

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Sat, 18 Dec 2021 18:26:35 +0100
Organization: A noiseless patient Spider
Lines: 92
Message-ID: <spl5kc$mja$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me>
<dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <spi4l7$sdj$1@dont-email.me>
<spjmcf$hhs$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 18 Dec 2021 17:26:36 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="cda1128b60b8075ff749aba4d5123070";
logging-data="23146"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FTcYN9GRVnlUEC0k367oyzf4T3UVGIec="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:UNznBYBBl9nERrIftZ1hfFkGVy4=
In-Reply-To: <spjmcf$hhs$1@dont-email.me>
Content-Language: en-GB
 by: David Brown - Sat, 18 Dec 2021 17:26 UTC

On 18/12/2021 05:00, BGB wrote:
> On 12/17/2021 7:51 AM, David Brown wrote:
>> On 17/12/2021 08:28, Thomas Koenig wrote:
>>> Ir. Hj. Othman bin Hj. Ahmad <othmana@gmail.com> schrieb:
>>>
>>>> HLL is very poor in bit manipulations. Better use a library for
>>>> these bit manipulations, so less need for optimization.
>>>
>>> Not really - calling a library function has a _lot_ of overhead,
>>> and what goes on in the library is hidden from the compiler.  It is
>>> usually preferred for the compiler to emit highly optimized code.
>>>
>>
>> The best choice here is a header library (or equivalent for languages
>> other than C, C++ and friends).  You can write your static inline
>> functions or templates in whatever you feel is the nicest way for your
>> needs, and re-use these functions with no overhead.
>>
>
> It depends some.
>
> One might end up also having a header which serves mostly to detect some
> compiler and machine specific defines and tune the logic for performance.
>
> Say, for GCC, one might want to use inline functions, but for MSVC it
> might be better to do everything with preprocessor macros, ...

Of course. If you expect to be using the header for multiple different
compilers, architectures, standards, optimisations, etc., then it is not
at all unreasonable to have some compiler detection and give specialised
versions for certain combinations and generic fall-backs for others.

It would be unexpected for a compiler to support inline functions but
not be able to handle them as efficiently as a macro. But some
compilers will handle particular ways of expressing the code more
efficiently, and in some cases you might have processor-specific
intrinsics, compiler extensions, or even inline assembly to get the
optimal code.

>
>
>>> These days, compilers might even recognize some standard
>>> idioms for certain bit manipulations and use a corresponding
>>> machine instruction, if available.
>>>
>>
>> Compilers have been doing that for decades.  Details vary, of course,
>> and sometimes they are not quite optimal depending on exactly how you
>> write the bit manipulation.  But they are usually pretty solid.
>
> Recently was messing with an LZ decoder, and comparing performance
> results between several ways of moving data/values around:
>   using volatile and pointer casts;
>   using may_alias and pointer casts;
>   using memcpy;
>   using byte-oriented access.
>
> This was via a wrapper library for various "fundamental" operators, such
> as getting/setting values, or copying 8 or 16 bytes from a source
> location to a destination (may overlap).
>
>
> On GCC and Clang:
>   Volatile was ~ 20% faster than memcpy (~ 2.0GB/s vs 1.8GB/s);
>   Could not get may_alias to work correctly on GCC;
>   On clang, may_alias gave the same behavior as volatile;
>   Byte-oriented patterns were slower than memcpy (~ 1.5 GB/s).
>
> On MSVC:
>   Volatile was fastest (~ 2.2 GB/s)
>     No obvious difference between volatile and a bare pointer.
>   Memcpy was again, slightly slower ( ~ 2.0 GB/s);
>   Byte-oriented access was somewhat worse ( ~ 800 MB/s ).
>
> Bare pointer casts, with no 'volatile' or similar, in both GCC and
> Clang, caused corrupted output data.
>
> Interestingly, in GCC, "-fno-strict-aliasing" did not seem to have any
> visible effect in this case, however optimization level did (-O1 and -Os
> worked, -O2 and -O3 failed). It appears as if GCC was changing the
> relative order of the memory loads and stores unless 'volatile' were used.
>

Compilers are free to change load and store orders on the assumption
that the current thread of execution is the only thing that ever sees
the accesses, unless you specifically say otherwise (with "volatile",
atomics, or other methods). More extreme movements are usually only
done at higher levels of optimisation. But if the code does not work
correctly at -O2, you can be pretty confident that your code is wrong.
Compiler bugs are not unheard of, of course, but user-code bugs are a
/lot/ more common!

Re: Dense machine code from C++ code (compiler optimizations)

<splek0$j57$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22335&group=comp.arch#22335

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Dense machine code from C++ code (compiler optimizations)
Date: Sat, 18 Dec 2021 13:59:58 -0600
Organization: A noiseless patient Spider
Lines: 249
Message-ID: <splek0$j57$1@dont-email.me>
References: <sndun6$q07$1@dont-email.me>
<dc10dfff-b48e-40ec-8c24-0a6c2e1790d2n@googlegroups.com>
<sphe6s$5a3$1@newsreader4.netcologne.de> <spi4l7$sdj$1@dont-email.me>
<spjmcf$hhs$1@dont-email.me> <spl5kc$mja$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 18 Dec 2021 20:00:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fcf15497748725c5a6b14761bb7bf743";
logging-data="19623"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19k5tOUK8I6H1NbZsZVqVL8"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:ScLs8ka8jlBqAaQofB8f2BXo++4=
In-Reply-To: <spl5kc$mja$1@dont-email.me>
Content-Language: en-US
 by: BGB - Sat, 18 Dec 2021 19:59 UTC

On 12/18/2021 11:26 AM, David Brown wrote:
> On 18/12/2021 05:00, BGB wrote:
>> On 12/17/2021 7:51 AM, David Brown wrote:
>>> On 17/12/2021 08:28, Thomas Koenig wrote:
>>>> Ir. Hj. Othman bin Hj. Ahmad <othmana@gmail.com> schrieb:
>>>>
>>>>> HLL is very poor in bit manipulations. Better use a library for
>>>>> these bit manipulations, so less need for optimization.
>>>>
>>>> Not really - calling a library function has a _lot_ of overhead,
>>>> and what goes on in the library is hidden from the compiler.  It is
>>>> usually preferred for the compiler to emit highly optimized code.
>>>>
>>>
>>> The best choice here is a header library (or equivalent for languages
>>> other than C, C++ and friends).  You can write your static inline
>>> functions or templates in whatever you feel is the nicest way for your
>>> needs, and re-use these functions with no overhead.
>>>
>>
>> It depends some.
>>
>> One might end up also having a header which serves mostly to detect some
>> compiler and machine specific defines and tune the logic for performance.
>>
>> Say, for GCC, one might want to use inline functions, but for MSVC it
>> might be better to do everything with preprocessor macros, ...
>
> Of course. If you expect to be using the header for multiple different
> compilers, architectures, standards, optimisations, etc., then it is not
> at all unreasonable to have some compiler detection and give specialised
> versions for certain combinations and generic fall-backs for others.
>

Yeah. It is possible to have the code run on generic / unknown
architectures, albeit at a speed penalty.

There is a tradeoff though in that this can add a lot of cruft.

> It would be unexpected for a compiler to support inline functions but
> not be able to handle them as efficiently as a macro. But some
> compilers will handle particular ways of expressing the code more
> efficiently, and in some cases you might have processor-specific
> intrinsics, compiler extensions, or even inline assembly to get the
> optimal code.
>

In some cases, one may need local variables or to not have an argument
be evaluated multiple times, ... These cases favor inline functions.

But, if it is a wrapper over a pointer de-reference or similar, then a
macro typically works better.
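The classic multiple-evaluation hazard, as a toy example:

```c
/* A macro expands its argument textually, so any side effects run once
   per textual use; an inline function evaluates the argument exactly
   once before the body runs. */
#define SQR_MACRO(x) ((x) * (x))

static inline int sqr_inline(int x) { return x * x; }

/* Instrumented argument: counts how many times it gets evaluated. */
static int calls;
static int next_val(void) { return ++calls; }
```

Starting from calls == 0, SQR_MACRO(next_val()) invokes next_val twice (the two calls return 1 and 2, in either order), while sqr_inline(next_val()) invokes it once.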

>>
>>
>>>> These days, compilers might even recognize some standard
>>>> idioms for certain bit manipulations and use a corresponding
>>>> machine instruction, if available.
>>>>
>>>
>>> Compilers have been doing that for decades.  Details vary, of course,
>>> and sometimes they are not quite optimal depending on exactly how you
>>> write the bit manipulation.  But they are usually pretty solid.
>>
>> Recently was messing with an LZ decoder, and comparing performance
>> results between several ways of moving data/values around:
>>   using volatile and pointer casts;
>>   using may_alias and pointer casts;
>>   using memcpy;
>>   using byte-oriented access.
>>
>> This was via a wrapper library for various "fundamental" operators, such
>> as getting/setting values, or copying 8 or 16 bytes from a source
>> location to a destination (may overlap).
>>
>>
>> On GCC and Clang:
>>   Volatile was ~ 20% faster than memcpy (~ 2.0GB/s vs 1.8GB/s);
>>   Could not get may_alias to work correctly on GCC;
>>   On clang, may_alias gave the same behavior as volatile;
>>   Byte-oriented patterns were slower than memcpy (~ 1.5 GB/s).
>>
>> On MSVC:
>>   Volatile was fastest (~ 2.2 GB/s)
>>     No obvious difference between volatile and a bare pointer.
>>   Memcpy was again, slightly slower ( ~ 2.0 GB/s);
>>   Byte-oriented access was somewhat worse ( ~ 800 MB/s ).
>>
>> Bare pointer casts, with no 'volatile' or similar, in both GCC and
>> Clang, caused corrupted output data.
>>
>> Interestingly, in GCC, "-fno-strict-aliasing" did not seem to have any
>> visible effect in this case, however optimization level did (-O1 and -Os
>> worked, -O2 and -O3 failed). It appears as if GCC was changing the
>> relative order of the memory loads and stores unless 'volatile' were used.
>>
>
> Compilers are free to change load and store orders on the assumption
> that the current thread of execution is the only thing that ever sees
> the accesses, unless you specifically say otherwise (with "volatile",
> atomics, or other methods). More extreme movements are usually only
> done at higher levels of optimisation. But if the code does not work
> correctly at -O2, you can be pretty confident that your code is wrong.
> Compiler bugs are not unheard of, of course, but user-code bugs are a
> /lot/ more common!
>

This code is single threaded.

The code in question depends on the use of misaligned and overlapping
loads and stores. As a result, changing the relative load/store order
will change the visible results.

I was trying to use pointers declared as:
__attribute__((__may_alias__)) __attribute__((aligned(1)))

But, this did not work in GCC.

However, 'volatile' did give the expected behavior.

The above did work in Clang, however as far as I can tell it is behaving
the same in this case as it did for volatile.

One can also use a bunch of 8 and 16 byte memcpy operations, but as
noted, this is slower than using 'volatile'.

One of the major operations is basically to mimic the behavior of:
unsigned char *cs, *ct, *cte;
ct=dst; cs=dst-dist; cte=dst+len;
while(ct<cte)
{ *ct++=*cs++; }

However, doing it this way is slow enough to have a noticeable adverse
effect on the performance of an LZ decoder. Hence the use of a bunch of
misaligned load/store hackery to try to work around this.

So, the match copying tends to get a little more convoluted, pseudocode:
if(dist>=8)
  { copy data 16B at a time via 8B elements. }
else
{
  if(!(dist&(dist-1)))
  {
    if(dist==1)
      { fill 8B with a single byte; }
    else if(dist==2)
      { fill 8B with a 2B pattern; }
    else if(dist==4)
      { fill 8B with a 4B pattern; }
    Flood-fill target with pattern (power-of-2)
  }else
  {
    if(dist==3)
      { fill 6B with a 3B pattern; step=6; }
    else if(dist==5)
      { fill 5B with a 5B pattern; step=5; }
    else ...
    Flood-fill target with pattern (non-power-of-2)
  }
}

And, typically for speed reasons, the match copying is allowed to stomp
memory slightly past the end of the output.

But, with raw pointer dereferencing (and no "volatile"), something was
going wrong here: the actual LZ output would differ from the expected
output.

Without either "volatile" or "__may_alias__", both GCC and Clang were
messing up on this stuff, but this was to be expected.

On MSVC on x64, one may need "__unaligned", as otherwise (at "/O2") it
may attempt to vectorize the copy loop.

Both "volatile" and "__unaligned" seem able to ward off the auto
vectorization, though using both together is what makes it "valid"
according to MSDN rules. Either keyword by itself seems able to give the
intended effect, and there is little obvious difference between using
one of them and using both.

For bitstream manipulation, it is similar:
b=(ctx->win>>ctx->pos)&((1<<bits)-1);
ctx->pos+=bits;
while(ctx->pos>=8)
{
ctx->win=(ctx->win>>8)|((*ctx->cs++)<<24);
ctx->pos-=8;
}
Is a fair bit slower than, say:
cs=ctx->cs;
p=ctx->pos;
w=get_u32le(cs);
b=(w>>p)&((1<<bits)-1);
k=p+bits;
ctx->pos=k&7;
ctx->cs=cs+(k>>3);

Or, relatedly, for decoding 4 Huffman symbols:
cs=ctx->cs;
p=ctx->pos;
w=get_u64le(cs);
b=w>>p; c0=htab[b&4095]; p+=c0>>8; c0=c0&255;
b=w>>p; c1=htab[b&4095]; p+=c1>>8; c1=c1&255;
b=w>>p; c2=htab[b&4095]; p+=c2>>8; c2=c2&255;
b=w>>p; c3=htab[b&4095]; p+=c3>>8; c3=c3&255;
c0=c0|(c1<<8); c2=c2|(c3<<8);
ctx->pos=p&7; ctx->cs=cs+(p>>3);
c=c0|(c2<<16);

Where get_u32le(ptr) might be, on MSVC:
(*(volatile __unaligned uint32_t *)(ptr))

Or, what seemed to work on Clang (but not GCC):
(*(uint32_t __attribute__((__may_alias__))
__attribute__((aligned(1))) *)(ptr))


Re: Dense machine code from C++ code (compiler optimizations)
Message-ID: <309f4a07-109e-4589-b2f1-fe2d906c5787n@googlegroups.com>
Newsgroups: comp.arch
by: MitchAlsup - Sat, 18 Dec 2021 21:33 UTC

On Saturday, December 18, 2021 at 2:00:03 PM UTC-6, BGB wrote:
> On 12/18/2021 11:26 AM, David Brown wrote:
[... earlier discussion quoted in full above; snipped here ...]
> One of the major operations is basically to mimic the behavior of:
> unsigned char *cs, *ct, *cte;
> ct=dst; cs=dst-dist; cte=dst+len;
> while(ct<cte)
> { *ct++=*cs++; }
>
> However, doing it this way is slow enough to have a noticeable adverse
> effect on the performance of an LZ decoder. Hence the use of a bunch of
> misaligned load/store hackery to try to work around this.
<
Perhaps you should use the Virtual Vector Method, where your copy loop
can execute 8-to-16 iterations per clock. This should get rid of the lazy
performance problem.
>
> So, the match copying tends to get a little more convoluted, pseudocode:
> [... match-copy pseudocode snipped ...]
<
And it obviates the programmer's need to write time-wasting stuff like
the above.
>
> [...]
> On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
> attempt to vectorize the copy loop.
<
You can (CAN) vectorize the loop using VVM without having semantic errors
during execution.
>
> Both "volatile" and "__unaligned" seem able to ward off the auto
> vectorization [...]
>
What you want is vectorization that retains semantic content. CRAY-style
vectors fail at this, VVM succeeds.
>
> [... bitstream and Huffman-decoding examples snipped ...]
> Where get_u32le(ptr) might be, on MSVC:
> (*(volatile __unaligned uint32_t *)(ptr))
>
> Or, what seemed to work on Clang (but not GCC):
> (*(uint32_t __attribute__((__may_alias__))
> __attribute__((aligned(1))) *)(ptr))
>
> Or, seemed to work on both Clang and GCC:
> (*(volatile uint32_t __attribute__((aligned(1))) *)(ptr))
>
>
> Where, if one puts the attributes before the type-name, Clang gives a
> warning that the attributes are ignored; but not if one puts them after
> the type name.
>
>
> For the "generic" fallback, "get_u32le()" and friends can fall back to
> something like:
> ((ptr)[0]|((ptr)[1]<<8)|((ptr)[2]<<16)|((ptr)[3]<<24))
>
> But, this results in a fairly massive drop in performance with MSVC.
>
> However, interestingly, GCC sees much less of a performance impact with
> this case.
>
>
>
> This sort of hackery may make a pretty big difference in the speed of an
> LZ decoder...


Re: Dense machine code from C++ code (compiler optimizations)
Message-ID: <splpnn$rqd$1@dont-email.me>
Newsgroups: comp.arch
by: BGB - Sat, 18 Dec 2021 23:09 UTC

On 12/18/2021 3:33 PM, MitchAlsup wrote:
> On Saturday, December 18, 2021 at 2:00:03 PM UTC-6, BGB wrote:
[... deeply nested quoting snipped ...]
>> However, doing it this way is slow enough to have a noticeable adverse
>> effect on the performance of an LZ decoder. Hence the use of a bunch of
>> misaligned load/store hackery to try to work around this.
> <
> Perhaps you should use the Virtual Vector Method where your copy loop
> can execute 8-to-16 iterations per clock. This should get rid of the lazy
> performance problem.

Possible, but N/A for x86-64...

On BJX2, BGBCC basically behaves similarly to MSVC in these areas.

Assuming an unrolled byte-at-a-time copy loop, the fastest possible on
BJX2 would still only be ~ 25MB/s.

Copying a byte at a time via a "while()" loop is closer to 5MB/s.

It is considerably faster on a desktop PC running x86-64, but still a
lot slower than options which copy 8 or 16 bytes at a time.

>>
>> So, the match copying tends to get a little more convoluted, pseudocode:
>> [... match-copy pseudocode snipped ...]
> <
> And obviating the programmer from having to do time-wasting stuff like
> the above.

It would be nicer if the underlying operation were supported by the C
library. As noted, neither "memcpy()" nor "memmove()" implements this
behavior.

I did previously add a "_memlzcpy()" function, which has an interface
like a normal "memcpy()" but behavior closer to what is needed for a
typical LZ77 match copy. I am not aware of any other C libraries doing
this, though.


Re: Dense machine code from C++ code (compiler optimizations)
Message-ID: <caf13a6e-813c-43ab-9eb6-451aec304e9an@googlegroups.com>
Newsgroups: comp.arch
by: MitchAlsup - Sat, 18 Dec 2021 23:50 UTC

On Saturday, December 18, 2021 at 5:09:46 PM UTC-6, BGB wrote:
> On 12/18/2021 3:33 PM, MitchAlsup wrote:
> > On Saturday, December 18, 2021 at 2:00:03 PM UTC-6, BGB wrote:
> >> On 12/18/2021 11:26 AM, David Brown wrote:
> >>> On 18/12/2021 05:00, BGB wrote:
> >>>> On 12/17/2021 7:51 AM, David Brown wrote:
> >>>>> On 17/12/2021 08:28, Thomas Koenig wrote:
> >>>>>> Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:
> >>>>>>
> >>>>>>> HLL is very poor in bit manipulations. Better use a library for
> >>>>>>> these bit manipulations, so less need for optimization.
> >>>>>>
> >>>>>> Not really - calling a library function has a _lot_ of overhead,
> >>>>>> and what goes on in the library is hidden from the compiler. It is
> >>>>>> usually preferred for the compiler to emit highly optimized code.
> >>>>>>
> >>>>>
> >>>>> The best choice here is a header library (or equivalent for languages
> >>>>> other than C, C++ and friends). You can write your static inline
> >>>>> functions or templates in whatever you feel is the nicest way for your
> >>>>> needs, and re-use these functions with no overhead.
> >>>>>
> >>>>
> >>>> It depends some.
> >>>>
> >>>> One might end up also having a header which serves mostly to detect some
> >>>> compiler and machine specific defines and tune the logic for performance.
> >>>>
> >>>> Say, for GCC, one might want to use inline functions, but for MSVC it
> >>>> might be better to do everything with preprocessor macros, ...
> >>>
> >>> Of course. If you expect to be using the header for multiple different
> >>> compilers, architectures, standards, optimisations, etc., then it is not
> >>> at all unreasonable to have some compiler detection and give specialised
> >>> versions for certain combinations and generic fall-backs for others.
> >>>
> >> Yeah. It is possible to have the code run on generic / unknown
> >> architectures, if albeit at a speed penalty.
> >>
> >> There is a tradeoff though in that this can add a lot of cruft.
> >>> It would be unexpected for a compiler to support inline functions but
> >>> not be able to handle them as efficiently as a macro. But some
> >>> compilers will handle particular ways of expressing the code more
> >>> efficiently, and in some cases you might have processor-specific
> >>> intrinsics, compiler extensions, or even inline assembly to get the
> >>> optimal code.
> >>>
> >> In some cases, one may need local variables or to not have an argument
> >> be evaluated multiple times, ... These cases favor inline functions.
> >>
> >> But, if it is a wrapper over a pointer de-reference or similar, then a
> >> macro typically works better.
> >>>>
> >>>>
> >>>>>> These days, compilers will might even recognize some standard
> >>>>>> idioms for certain bit manipulations and use a corresponding
> >>>>>> machine instruction, if available.
> >>>>>>
> >>>>>
> >>>>> Compilers have been doing that for decades. Details vary, of course,
> >>>>> and sometimes they are not quite optimal depending on exactly how you
> >>>>> write the bit manipulation. But they are usually pretty solid.
> >>>>
> >>>> Recently was messing with an LZ decoder, and comparing performance
> >>>> results between several ways of moving data/values around:
> >>>> using volatile and pointer casts;
> >>>> using may_alias and pointer casts;
> >>>> using memcpy;
> >>>> using byte-oriented access.
> >>>>
> >>>> This was via a wrapper library for various "fundamental" operators, such
> >>>> as getting/setting values, or copying 8 or 16 bytes from a source
> >>>> location to a destination (may overlap).
> >>>>
> >>>>
> >>>> On GCC and Clang:
> >>>> Volatile was ~ 20% faster than memcpy (~ 2.0GB/s vs 1.8GB/s);
> >>>> Could not get may_alias to work correctly on GCC;
> >>>> On clang, may_alias gave the same behavior as volatile;
> >>>> Byte-oriented patterns were slower than memcpy (~ 1.5 GB/s).
> >>>>
> >>>> On MSVC:
> >>>> Volatile was fastest (~ 2.2 GB/s)
> >>>> No obvious difference between volatile and a bare pointer.
> >>>> Memcpy was again, slightly slower ( ~ 2.0 GB/s);
> >>>> Byte-oriented access was somewhat worse ( ~ 800 MB/s ).
> >>>>
> >>>> Bare pointer casts, with no 'volatile' or similar, in both GCC and
> >>>> Clang, caused corrupted output data.
> >>>>
> >>>> Interestingly, in GCC, "-fno-strict-aliasing" did not seem to have any
> >>>> visible effect in this case, however optimization level did (-O1 and -Os
> >>>> worked, -O2 and -O3 failed). It appears as if GCC was changing the
> >>>> relative order of the memory loads and stores unless 'volatile' were used.
> >>>>
> >>>
> >>> Compilers are free to change load and store orders on the assumption
> >>> that the current thread of execution is the only thing that ever sees
> >>> the accesses, unless you specifically say otherwise (with "volatile",
> >>> atomics, or other methods). More extreme movements are usually only
> >>> done at higher levels of optimisation. But if the code does not work
> >>> correctly at -O2, you can be pretty confident that your code is wrong.
> >>> Compiler bugs are not unheard of, of course, but user-code bugs are a
> >>> /lot/ more common!
> >>>
> >> This code is single threaded.
> >>
> >> The code in question depends on the use of misaligned and overlapping
> >> loads and stores. As a result, changing the relative load/store order
> >> will change the visible results.
> >>
> >>
> >> I was trying to use pointers declared as:
> >> __attribute__((__may_alias__)) __attribute__((aligned(1)))
> >>
> >> But, this did not work in GCC.
> >>
> >> However, 'volatile' did give the expected behavior.
> >>
> >> The above did work in Clang, however as far as I can tell it is behaving
> >> the same in this case as it did for volatile.
> >>
> >>
> >> One can also use a bunch of 8 and 16 byte memcpy operations, but as
> >> noted, this is slower than using 'volatile'.
> >>
> >>
> >>
> >> One of the major operations is basically to mimic the behavior of:
> >> unsigned char *cs, *ct, *cte;
> >> ct=dst; cs=dst-dist; cte=dst+len;
> >> while(ct<cte)
> >> { *ct++=*cs++; }
> >>
> >> However, doing it this way is slow enough to have a noticeable adverse
> >> effect on the performance of an LZ decoder. Hence the use of a bunch of
> >> misaligned load/store hackery to try to work around this.
> > <
> > Perhaps you should use the Virtual Vector Method where your copy loop
> can execute 8-to-16 iterations per clock. This should get rid of the lazy
> > performance problem.
> Possible, but N/A for x86-64...
>
>
> On BJX2, BGBCC basically behaves similar to MSVC in these areas.
>
>
> Assuming an unrolled byte-at-a-time copy loop, the fastest possible on
> BJX2 would still only be ~ 25MB/s.
>
> Copying a byte at a time via a "while()" loop is closer to 5MB/s.
<
Using VVM the copy loop should be able to perform at cache-line-access-width
per cycle even when the loop is constructed with LDB and STB (and without
even having to do alias analysis between to and from.)
<
>
> It is considerably faster on a desktop PC running x86-64, but still a
> lot slower than options which copy 8 or 16 bytes at a time.
<
Why not copy 64-bytes per cycle ?
Answer: I don't have a 64-byte wide register!
Retort: You don't need a 64-byte wide register to copy 64-bytes per cycle
using VVM.
<
> >>
> >> So, the match copying tends to get a little more convoluted, pseudocode:
> >> if(dist>=8)
> >> { copy data 16B at a time via 8B elements. }
> >> else
> >> {
> >> if(!(dist&(dist-1)))
> >> {
> >> if(dist==1)
> >> { fill 8B with a single byte; }
> >> else if(dist==2)
> >> { fill 8B with a 2B pattern; }
> >> else if(dist==4)
> >> { fill 8B with a 4B pattern; }
> >> Flood-fill target with pattern (power-of-2)
> >> }else
> >> {
> >> if(dist==3)
> >> { fill 6B with a 3B pattern; step=6; }
> >> else if(dist==5)
> >> { fill 5B with a 5B pattern; step=5; }
> >> else ...
> >> Flood-fill target with pattern (non-power-of-2)
> >> }
> >> }
> > <
> > And obviating the programmer from having to do time-wasting stuff like
> > the above.
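For reference, the flood-fill scheme in the quoted pseudocode can be fleshed out as a C sketch (helper name hypothetical; like the original, it may write up to 7 bytes past dst+len, so the caller must leave slack in the output buffer, and dist must be at least 1):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Byte-serial LZ match copy semantics: dst[i] = dst[i - dist], so a
 * short dist replicates a pattern. For dist >= 8 this copies plain
 * 8-byte chunks; for shorter distances it widens the repeating pattern
 * to an 8-byte store and advances by a multiple of dist. */
static void lz_match_copy(unsigned char *dst, size_t dist, size_t len)
{
    unsigned char *cs = dst - dist, *ct = dst, *cte = dst + len;
    unsigned char pat[16];
    uint64_t v;
    size_t i, step;

    if (dist >= 8) {
        /* Source and destination chunks never overlap here. */
        while (ct < cte) {
            memcpy(&v, cs, 8);
            memcpy(ct, &v, 8);
            cs += 8; ct += 8;
        }
        return;
    }
    step = (8 / dist) * dist;       /* largest multiple of dist <= 8   */
    for (i = 0; i < 16; i++)        /* replicate the dist-byte pattern */
        pat[i] = cs[i % dist];
    while (ct < cte) {
        memcpy(ct, pat, 8);         /* 8-byte store, step-byte advance */
        ct += step;
    }
}
```

This collapses the whole if/else ladder in the pseudocode into one pattern-replication step, at the cost of redundantly overwriting a few bytes per iteration when step < 8.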
<
> It would be nicer if the underlying operation were supported by the C
> library. As noted, neither "memcpy()" nor "memmove()" implements this
> behavior.
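A small demonstration of why memmove() cannot substitute for an LZ match copy (sketch; function name hypothetical): memmove behaves as if copying through a temporary buffer, while LZ semantics require byte-serial copying so that a short distance replicates a pattern.

```c
#include <string.h>

/* Returns 1 if both copies behave as expected. */
static int lz_vs_memmove(void)
{
    char a[17] = "ab..............";
    char b[17] = "ab..............";
    int i;

    for (i = 0; i < 6; i++)     /* LZ byte-serial copy, dist=2, len=6 */
        a[2 + i] = a[i];        /* replicates "ab"                    */

    memmove(b + 2, b, 6);       /* copies the ORIGINAL 6 bytes        */

    return memcmp(a, "abababab", 8) == 0
        && memcmp(b, "abab....", 8) == 0;
}
```

With dist=2 the byte-serial loop yields "abababab", while memmove on the same overlapping range yields "abab...." because it preserves the source bytes as they were before the call.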
<
My 66000 compiler (Brian's) directly converts byte-wide copies into VVM code
and gets as much perf as the HW can muster (typically ½ cache line width
per cycle for low end machines, and at least cache line width on large machines.
<
Look, you have your own architecture and are not restricted to x86. Do better
than you are doing, copy VVM if you like, but do something better--something
that scales from the smallest implementation you can imagine to the largest
and be source code compatible throughout the range.
>
> I did before add a "_memlzcpy()" function which had an interface like a
> normal "memcpy()" but which had behavior more like what was needed for a
> typical LZ77 match copy. I am not aware of any other C libraries doing
> this though.
<
My guess is that if you did VVM, it would give you the vast majority of what
you want.
> >>
> >> And, typically for speed reasons, the match copying is allowed to stomp
> >> memory slightly past the end of the output.
> >>
> >> But, with raw pointer dereferencing (and no "volatile") something was
> >> messing up here (namely, the expected LZ output would differ from what
> >> was the expected output).
> >>
> >>
> >> Without either "volatile" or "__may_alias__", both GCC and Clang were
> >> messing up on this stuff, but this was to be expected.
> >>
> >> On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
> >> attempt to vectorize the copy loop.
> > <
> > You can (CAN) vectorize the loop using VVM without having semantic errors
> > during execution.
> >>
> >> Both "volatile" and "__unaligned" seem able to ward off the auto
> >> vectorization, but using both does seem to make it "valid" according to
> >> MSDN rules. Either keyword by itself seems able to give the intended
> >> effect, and there is little obvious difference between using one of them
> >> vs using both of them.
> >>
> > What you want is vectorization that retains semantic content. CRAY-style
> > vectors fail at this, VVM succeeds.
<
> MSVC seemingly does surprisingly OK at transforming arbitrary code into
> a horrible mess of SSE instructions without breaking semantics (it seems
> to often put in guard-checks when it does stuff like this).
<
Let's see MSVC vectorize this::
<
for( i = 0; i < MAX; i++ )
to[i] = from[MAX-i-1];
<
!