novaBBS - comp.arch - Re: Misc: Relative benchmark, closer than expected...

Misc: Relative benchmark, closer than expected...

<tasuaa$36141$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26650&group=comp.arch#26650

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: Relative benchmark, closer than expected...
Date: Fri, 15 Jul 2022 18:47:13 -0500
Message-ID: <tasuaa$36141$1@dont-email.me>

by: BGB - Fri, 15 Jul 2022 23:47 UTC

I decided to set up another benchmark between my PC and BJX2 core,
mostly as Dhrystone is a little suspect at this point.

So, I grabbed an old JPEG decoder of mine and set it up as a simple
benchmark:
Decode the JPEG in a loop;
Count how many times the JPEG can be decoded in 30 seconds;
Work out the megapixels/second from that...
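
A rough sketch of what such a timing loop could look like in C (not the actual harness). `decode_jpeg_stub` is a hypothetical placeholder standing in for the real decoder, and `run_benchmark` is likewise just an illustrative name:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the real decoder: it only fills the output
 * buffer so the loop has something to time. Not the actual JPEG decoder. */
static void decode_jpeg_stub(unsigned *out, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        out[i] = 0xFF000000u | (unsigned)i;
}

/* Decode repeatedly until `seconds` have elapsed according to clock(),
 * then return the throughput in megapixels/second.
 * Note: clock() reports CPU time on POSIX systems and wall time with MSVC. */
static double run_benchmark(unsigned *out, int width, int height, double seconds)
{
    assert(seconds > 0.0 && width > 0 && height > 0);

    clock_t start = clock();
    long frames = 0;
    double elapsed;

    do {
        decode_jpeg_stub(out, width, height);
        frames++;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < seconds);

    return ((double)frames * width * height) / (elapsed * 1.0e6);
}
```

Calling `run_benchmark(image, 320, 200, 30.0)` on a 320x200 buffer would correspond to the 30 second run described above.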

The JPEG decoder is plain C in this case (no SIMD or ASM or similar on
either target).

So, my PC (3.7 GHz Ryzen 2700X):
44 megapixels/second (MSVC X64, "/Os").

BJX2 (50MHz, testing via emulator):
0.4 megapixels/second
(Roughly 6 fps with a 320x200 JPEG)

So, around a 110x throughput ratio, vs a 74x clock-speed ratio.
Per-clock ratio: 1.49x in the PC's favor (normalized for baseline clock speed).
Drops to 1.375x if normalized for the "turboed" clock shown in Task Manager.
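
For reference, the arithmetic behind these figures can be written out as a small C sketch. The function names are only illustrative. The 4000 MHz turbo clock used in the comment is an assumption inferred from the 1.375x figure (110 / 1.375 = 80x, and 80 x 50 MHz = 4000 MHz); it is not stated directly.

```c
#include <assert.h>

/* Throughput ratio divided by clock ratio: how much more work per clock
 * machine A gets than machine B (1.0 would be per-clock parity). */
static double per_clock_ratio(double mpix_a, double mhz_a,
                              double mpix_b, double mhz_b)
{
    assert(mpix_b > 0.0 && mhz_a > 0.0 && mhz_b > 0.0);
    return (mpix_a / mpix_b) / (mhz_a / mhz_b);
}

/* Frames per second for a given throughput and image size. */
static double frames_per_second(double mpix_per_sec, int width, int height)
{
    assert(width > 0 && height > 0);
    return (mpix_per_sec * 1.0e6) / ((double)width * height);
}

/* per_clock_ratio(44.0, 3700.0, 0.4, 50.0)  -> 110 / 74, about 1.49
 * per_clock_ratio(44.0, 4000.0, 0.4, 50.0)  -> 110 / 80 = 1.375 (assumed turbo clock)
 * frames_per_second(0.4, 320, 200)          -> 400000 / 64000 = 6.25 */
```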

Ironically, this is a lot closer to parity than I would have expected...

With "/Os", MSVC doesn't inline stuff or auto-vectorize, so this is "more
fair" (granted, the PC gets closer to 100 megapixels/second with "/O2",
but that is with the compiler inlining and vectorizing a bunch of stuff).

BGBCC doesn't do any of this. It does have the WEXifier, but the Ryzen
basically does this job itself in this case.

Likewise, the normalized ratio is around 4x (121 Mpix/sec) if comparing
against "gcc -O3" (also seems to inline and vectorize stuff and similar).

GCC gives around 27 Mpix/sec if I build with "gcc -g":
Normalized ratio = 0.91 (0.84 accounting for turbo).
So, BJX2 gets more pixels-per-clock vs gcc with a debug build.
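
Another way to look at the same numbers is clock cycles per decoded pixel. This is only a sketch: the struct and function names are made up for illustration, all PC rows use the 3.7 GHz baseline clock, and the "/O2" row uses the rough "around 100" figure.

```c
#include <assert.h>
#include <stdio.h>

/* Clock cycles spent per decoded pixel: clock rate divided by pixel rate
 * (MHz / (Mpix/sec) gives cycles/pixel directly). */
static double cycles_per_pixel(double clock_mhz, double mpix_per_sec)
{
    assert(clock_mhz > 0.0 && mpix_per_sec > 0.0);
    return clock_mhz / mpix_per_sec;
}

typedef struct {
    const char *name;
    double clock_mhz;
    double mpix_per_sec;
} BuildResult;

/* Figures as quoted in this post. */
static const BuildResult results[] = {
    { "BJX2 (BGBCC)",  50.0,   0.4   },
    { "PC, MSVC /Os",  3700.0, 44.0  },
    { "PC, MSVC /O2",  3700.0, 100.0 },
    { "PC, gcc -O3",   3700.0, 121.0 },
    { "PC, gcc -g",    3700.0, 27.0  },
};

static void print_cycles_table(void)
{
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++)
        printf("%-14s %7.1f cycles/pixel\n", results[i].name,
               cycles_per_pixel(results[i].clock_mhz, results[i].mpix_per_sec));
}
```

BJX2 works out to 125 cycles/pixel, which is fewer than the roughly 137 for the "gcc -g" build but more than the roughly 84 for "/Os".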

But, not entirely sure what I should expect to be "reasonable" in this
case...

The JPEG in this case is 320x200; final output would be 32-bit RGBA, but
I am mostly ignoring the decoded output image in this case.

The decoder appears to fit reasonably well in both the L1 and L2 caches
with this image (with a bigger image, it would likely see more cache
misses).

Ironically, this seems like a good enough result that something like an
MPEG style decoder would not seem to be entirely out of the question.

Top activity areas in the profile:
  The block-transfer function
    (transfers decoded blocks to the output image buffer);
  IDCT function;
  Huffman symbol and block decoding.
(Nothing looks terribly surprising here.)
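
To give an idea of what a block-transfer step involves, here is a minimal sketch in C. It is not the decoder's actual function: `transfer_block` and its parameters are hypothetical, and it assumes the 8x8 block has already been converted to 32-bit RGBA (a real decoder may well fold color conversion into this step).

```c
#include <assert.h>
#include <stdint.h>

/* Copy one decoded 8x8 block of 32-bit RGBA pixels into the output image
 * at pixel position (x0, y0), clipping at the right and bottom edges.
 * `stride` is the length of an output row in pixels. */
static void transfer_block(const uint32_t block[64], uint32_t *image,
                           int stride, int width, int height,
                           int x0, int y0)
{
    assert(x0 >= 0 && y0 >= 0 && stride >= width);

    for (int y = 0; y < 8 && y0 + y < height; y++) {
        uint32_t *dst = image + (y0 + y) * stride + x0;
        for (int x = 0; x < 8 && x0 + x < width; x++)
            dst[x] = block[y * 8 + x];
    }
}
```

Even in this simple form the inner loop is almost entirely loads and stores, which fits with it showing up near the top of a profile.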

....

Considering the limitations, things mostly seem to be "within reason".

I am not sure, does this seem like OK performance, or terrible
performance?...

Well, along with uncertainty as to whether all this is "potentially
useful" or "completely pointless", but alas...

Any thoughts?...

Re: Misc: Relative benchmark, closer than expected...

<tav30i$3eqj0$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26671&group=comp.arch#26671


From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Relative benchmark, closer than expected...
Date: Sat, 16 Jul 2022 14:19:44 -0500
Message-ID: <tav30i$3eqj0$1@dont-email.me>
References: <tasuaa$36141$1@dont-email.me>

by: BGB - Sat, 16 Jul 2022 19:19 UTC

On 7/15/2022 6:47 PM, BGB wrote:
> I decided to set up another benchmark between my PC and BJX2 core,
> mostly as Dhrystone is a little suspect at this point.
>
>
>
> So, basically grabbed an old JPEG decoder of mine, and basically set up
> it up as a simple benchmark:
> Decode the JPEG in a loop;
> Count how many times JPEG can be decoded in 30 seconds;
> Figure out the megapixels...
>
> The JPEG decoder is plain C in this case (no SIMD or ASM or similar on
> either target).
>
>
> So, my PC (3.7 GHz Ryzen 2700X):
> 44 megapixels/second (MSVC X64, "/Os").
>
> BJX2 (50MHz, testing via emulator):
> 0.4 megapixels/second
> (Roughly 6 fps with a 320x200 JPEG)
>
>
> So, around a 110x ratio delta, vs a 74x clock-speed delta.
> Combined ratio: 1.49x (normalized for baseline clock speed).
> Drops to 1.375x if normalized for "turboed" clock in Task Manager.
>
> This is ironically, a lot closer to parity than I would have expected...
>

Also, something I didn't notice until after the prior post:
For the JPEG decoder, the top spots for clock-cycle usage had shifted
over to MULS and ADDS.L rather than Load/Store ops, so in effect the
JPEG decoder was (possibly unsurprisingly) being limited mostly by how
quickly it can multiply and add integer values.

I didn't test this with the DMAC instruction (combined multiply and
ADD), but this looks like a scenario where DMAC could be useful.

Ideally, could use something like:
DMACS.L Rs, Rt, Imm, Rn //Rn=Rs+(Rt*Imm) (or similar)

But, don't currently have anything like this...
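
As a sketch of the idea, the operation can be modeled in C as below. `dmacs_l` is a hypothetical model of the wished-for instruction, not an existing BJX2 op, and the model simply asserts that the result fits in 32 bits since overflow behavior is left unspecified here. `dot_fixed` is an illustrative multiply-add chain of the sort found in an integer IDCT; in a real unrolled IDCT the coefficients would be constants rather than an array.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the proposed operation: Rn = Rs + (Rt * Imm).
 * Overflow handling is not modeled; the result must fit in 32 bits. */
static int32_t dmacs_l(int32_t rs, int32_t rt, int32_t imm)
{
    int64_t wide = (int64_t)rs + (int64_t)rt * (int64_t)imm;
    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return (int32_t)wide;
}

/* Fixed-point dot product: each step is one multiply followed by one add,
 * which is what a single DMAC-style op would cover. */
static int32_t dot_fixed(const int32_t *x, const int32_t *k, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = dmacs_l(acc, x[i], k[i]);
    return acc;
}
```

With separate MULS and ADDS.L this is two ops per term; a combined form would cover each term with one.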

Have otherwise started work on support for Dual-Lane Loads:
Load | Load
Load | Store

Currently, it is likely that using a Load in Lane 2 will also require a
memory operation in Lane 1, and the Lane 2 operation may only be a Load.
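
The two constraints above can be written as a small legality check. This is only a sketch: `OpKind` and `lane2_mem_pairing_ok` are made-up names, only the memory-op rules are modeled, and the lanes are passed explicitly rather than assuming which side of the "|" is which lane.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_OTHER, OP_LOAD, OP_STORE } OpKind;

static bool is_mem_op(OpKind k)
{
    return k == OP_LOAD || k == OP_STORE;
}

/* Returns true if the (lane1, lane2) pairing satisfies the two stated rules:
 *   1. a memory op in Lane 2 may only be a Load;
 *   2. a Load in Lane 2 requires a memory op (Load or Store) in Lane 1. */
static bool lane2_mem_pairing_ok(OpKind lane1, OpKind lane2)
{
    assert(lane1 <= OP_STORE && lane2 <= OP_STORE);

    if (!is_mem_op(lane2))
        return true;            /* no memory op in Lane 2: nothing to check */
    if (lane2 != OP_LOAD)
        return false;           /* rule 1: Lane 2 cannot Store */
    return is_mem_op(lane1);    /* rule 2: Lane 1 must also be a memory op */
}
```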

This should hopefully help in the common "wall of memory ops" cases,
but cost and memory consistency remain concerns.
Still working on it.

The Lane 2 port effectively has a much smaller L1 cache, which functions
as a read-only mirror of the contents of the main L1 cache. The main
"complex case" at the moment is dealing with cache misses in this
smaller cache (without causing issues with memory consistency).

Currently, the small cache "hijacks" the miss-handling for the main
cache, but I still need to make sure dirty cache lines get written
back correctly.

A direct 1:1 mirror of the cache arrays would have been simpler, but
would effectively double the Block-RAM requirements of the L1 D$ (vs my
current approach of effectively trying to throw ~ 2K of LUTRAM at the
problem).
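
A toy behavioral model of the read-only mirror idea, in C, may help make this concrete. Everything here is an assumption for illustration and not the actual design: the 16-byte line size, the direct-mapped organization, invalidate-on-store, and all of the names. The main L1 is reduced to a flat array, so dirty-line writeback (the hard part) is not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 16    /* assumed line size */
#define MINI_LINES 128   /* 128 x 16 B = 2 KB, in line with the "~2K" figure */
#define MEM_BYTES  4096  /* toy array standing in for the main L1 contents */

typedef struct {
    uint8_t  main_data[MEM_BYTES];
    uint8_t  mini_data[MINI_LINES][LINE_BYTES];
    uint32_t mini_tag[MINI_LINES];
    bool     mini_valid[MINI_LINES];
    unsigned mini_hits, mini_misses;
} CacheModel;

static void cache_init(CacheModel *c)
{
    memset(c, 0, sizeof *c);
}

/* Lane 1 store: writes the main copy and drops any mirrored copy of that
 * line, so the read-only mirror never returns stale data. */
static void lane1_store(CacheModel *c, uint32_t addr, uint8_t value)
{
    assert(addr < MEM_BYTES);
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx = line % MINI_LINES;

    c->main_data[addr] = value;
    if (c->mini_valid[idx] && c->mini_tag[idx] == line)
        c->mini_valid[idx] = false;
}

/* Lane 2 load: served from the mirror on a hit; on a miss the line is
 * refilled from the main copy (standing in for reusing the main cache's
 * miss handling). */
static uint8_t lane2_load(CacheModel *c, uint32_t addr)
{
    assert(addr < MEM_BYTES);
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx = line % MINI_LINES;

    if (c->mini_valid[idx] && c->mini_tag[idx] == line) {
        c->mini_hits++;
    } else {
        c->mini_misses++;
        memcpy(c->mini_data[idx], &c->main_data[line * LINE_BYTES], LINE_BYTES);
        c->mini_tag[idx] = line;
        c->mini_valid[idx] = true;
    }
    return c->mini_data[idx][addr % LINE_BYTES];
}
```

The tag and data arrays here cover only the 2 KB mirror, which is the trade being made against duplicating the full L1 arrays in Block-RAM.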

....

>
> Using "/Os" doesn't inline stuff or use auto-vectorization, so is "more
> fair" (granted, it is closer to around 100 megapixels/second with "/O2",
> but this is with the compiler inlining and vectorizing a bunch of stuff).
>
> BGBCC doesn't do any of this. It does have the WEXifier, but the Ryzen
> basically does this job itself in this case.
>
>
> Likewise, the normalized ratio is around 4x (121 Mpix/sec) if comparing
> against "gcc -O3" (also seems to inline and vectorize stuff and similar).
>
> GCC gives around 27 Mpix/sec if I build with "gcc -g":
> Normalized ratio = 0.91 (0.84 accounting for turbo).
> So, BJX2 gets more pixels-per-clock vs gcc with a debug build.
>
> But, not entirely sure what I should expect to be "reasonable" in this
> case...
>
>
>
> The JPEG in this case is 320x200; final output would be 32-bit RGBA, but
> I am mostly ignoring the decoded output image in this case.
>
> The decoder appears to fit reasonably well in both the L1 and L2 caches
> with this image (if the image were bigger, it would likely be having a
> larger amount of cache misses).
>
> Ironically, this seems like a good enough result that something like an
> MPEG style decoder would not seem to be entirely out of the question.
>
>
> Top activity areas in the profile:
> The block-transfer function;
> Transfers decoded blocks to output image buffer;
> IDCT function;
> Huffman Symbol and Block decoding.
> ( Nothing looks terribly surprising here )
>
> ...
>
>
> Considering the limitations, things mostly seem to be "within reason".
>
>
> I am not sure, does this seem like OK performance, or terrible
> performance?...
>
> Well, along with uncertainty as to whether all this is "potentially
> useful" or "completely pointless", but alas...
>
>
> Any thoughts?...
