Re: My experience with Apple M1 chip

<6_IFI.1132$tL2.607@fx43.iad>

https://www.novabbs.com/computers/article-flat.php?id=18525&group=comp.arch#18525

Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
Newsgroups: comp.arch
From: branimir...@gmail.com (Branimir Maksimovic)
Subject: Re: My experience with Apple M1 chip
References: <T6VEI.159$VU3.17@fx46.iad>
<187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com>
<c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com>
<teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com>
<pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com>
<8109bcc6-706f-4370-a4c2-e5119481f19cn@googlegroups.com>
<U1HFI.336$nj3.297@fx15.iad>
<3fb4829b-cbb3-434b-a26f-d94fd5662713n@googlegroups.com>
User-Agent: slrn/1.0.3 (Darwin)
Lines: 191
Message-ID: <6_IFI.1132$tL2.607@fx43.iad>
X-Complaints-To: abuse@usenet-news.net
NNTP-Posting-Date: Thu, 08 Jul 2021 20:11:46 UTC
Organization: usenet-news.net
Date: Thu, 08 Jul 2021 20:11:46 GMT
X-Received-Bytes: 7696

by: Branimir Maksimovic - Thu, 8 Jul 2021 20:11 UTC

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 8:59:20 PM UTC+3, Branimir Maksimovic wrote:
>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> > On Thursday, July 8, 2021 at 7:36:04 PM UTC+3, Kent Dickey wrote:
>> >> In article <teidncg9kKcBvnr9...@giganews.com>,
>> >> Kent Dickey <ke...@provalid.com> wrote:
>> >> >In article <c5a7429b-b7e6-4fbc...@googlegroups.com>,
>> >> >Michael S <already...@yahoo.com> wrote:
>> >> >>In the meantime I simplified and improved this program.
>> >> >>With the new variant nDigits=11 is no longer interesting as a benchmark (too
>> >> >>fast), but nDigits=12 and nDigits=13 are now well suited.
>> >> >>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
>> >> >>Code:
>> >> >>
>> >> >>//-- beg
>> >> >>#include <stdint.h>
>> >> >>#include <stdio.h>
>> >> >>#include <stdlib.h>
>> >> >>#include <string.h>
>> >> >>
>> >> >>static unsigned long long oneChildsInRange(int nDigits);
>> >> >>int main(int argz, char** argv)
>> >> >>{
>> >> >>  if (argz < 2) {
>> >> >>    fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
>> >> >>    return 1;
>> >> >>  }
>> >> >>
>> >> >>  char* endp;
>> >> >>  int nDigits = strtol(argv[1], &endp, 0);
>> >> >>  if (endp == argv[1]) {
>> >> >>    fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
>> >> >>    return 1;
>> >> >>  }
>> >> >>
>> >> >>  if (nDigits < 11 || nDigits > 19) {
>> >> >>    fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
>> >> >>    return 1;
>> >> >>  }
>> >> >>
>> >> >>  printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
>> >> >>  return 0;
>> >> >>}
>> >> >>
>> >> >>typedef struct {
>> >> >>  uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
>> >> >>  uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
>> >> >>} tabs_t;
>> >> >>
>> >> >>static unsigned long long countChildsRecursive(
>> >> >>  int prefix_nChilds, // 0 or 1
>> >> >>  const uint8_t prefixRem[],
>> >> >>  int prefixlen,
>> >> >>  const tabs_t* tabs)
>> >> >>{
>> >> >>  unsigned long long cnt = 0;
>> >> >>  for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
>> >> >>    int nChilds = suffix ? prefix_nChilds : 1;
>> >> >>    const uint8_t *isChild = &tabs->isChildTab[suffix];
>> >> >>    for (int i = 0; i < prefixlen; ++i)
>> >> >>      nChilds += isChild[prefixRem[i]];
>> >> >>
>> >> >>    if (nChilds < 2) {
>> >> >>      if (tabs->isChildTab[prefixlen+1]) { // all digits processed
>> >> >>        cnt += nChilds;
>> >> >>      } else {
>> >> >>        // extend prefix
>> >> >>        uint8_t prefixRemEx[20];
>> >> >>        for (int i = 0; i < prefixlen; ++i)
>> >> >>          prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
>> >> >>        prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
>> >> >>        cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
>> >> >>      }
>> >> >>    }
>> >> >>  }
>> >> >>  return cnt;
>> >> >>}
>> >> >>
>> >> >>static unsigned long long oneChildsInRange(int nDigits)
>> >> >>{
>> >> >>  // initialize look-up tables
>> >> >>  tabs_t tabs;
>> >> >>  for (int i = 0; i < 19+10; ++i) {
>> >> >>    tabs.isChildTab[i] = i % nDigits == 0;
>> >> >>    tabs.x10remTab [i] = (i*10) % nDigits;
>> >> >>  }
>> >> >>
>> >> >>  unsigned long long cnt = 0;
>> >> >>  for (int pref = 1; pref < 10; ++pref) {
>> >> >>    uint8_t prefixRem[1];
>> >> >>    prefixRem[0] = tabs.x10remTab[pref];
>> >> >>    cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
>> >> >>  }
>> >> >>
>> >> >>  return cnt;
>> >> >>}
>> >> >>
>> >> >>//-- end
>> >> >>
>> >> >>Unlike the previous code, this variant on x86-64 is faster when compiled
>> >> >>with 'clang -march=native -O2'.
>> >> >>gcc is significantly slower.
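
Since the quoted program only accepts nDigits in [11:19], a brute-force counter for small digit counts may be handy as a cross-check when comparing compilers. The sketch below assumes the quantity being counted is the number of nDigits-digit integers with exactly one substring divisible by nDigits, which is what the recursion above appears to compute. The name oneChildsBrute is made up for illustration and is not part of the quoted code.

```c
#include <assert.h>
#include <stdio.h>

/* Brute-force reference (assumed definition, see above): count the
 * nDigits-digit numbers that have exactly one substring divisible by
 * nDigits. Visits every number, so only practical for small nDigits. */
unsigned long long oneChildsBrute(int nDigits)
{
    assert(nDigits >= 1 && nDigits <= 19);

    /* lo = smallest nDigits-digit number, hi = one past the largest */
    unsigned long long lo = 1;
    for (int i = 1; i < nDigits; ++i)
        lo *= 10;
    unsigned long long hi = lo * 10;

    unsigned long long cnt = 0;
    for (unsigned long long n = lo; n < hi; ++n) {
        /* split n into decimal digits, most significant first */
        int digits[19];
        unsigned long long t = n;
        for (int i = nDigits - 1; i >= 0; --i) {
            digits[i] = (int)(t % 10);
            t /= 10;
        }

        /* count substrings whose value is divisible by nDigits;
         * stop scanning new start positions once two are found */
        int nChilds = 0;
        for (int beg = 0; beg < nDigits && nChilds < 2; ++beg) {
            int rem = 0;
            for (int end = beg; end < nDigits; ++end) {
                rem = (rem * 10 + digits[end]) % nDigits;
                nChilds += (rem == 0);
            }
        }
        cnt += (nChilds == 1);
    }
    return cnt;
}
```

Called from a small main with 1, 2 and 3 digits, this should give 9, 20 and 360. Beyond eight digits or so it gets slow, which is the reason the quoted version prunes prefixes that already have two children.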
>> >> >
>> >> >OK, on my Mac Mini M1 (all Apple compilers are clang):
>> >> >
>> >> >---
>> >> >m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
>> >> >m1-mini-bash$ time ./michaels2 13
>> >> >13 1057516028
>> >> >
>> >> >real 1m32.648s
>> >> >user 1m32.311s
>> >> >sys 0m0.098s
>> >> >m1-mini-bash$ time ./michaels2 12
>> >> >12 55121700430
>> >> >
>> >> >real 7m30.212s
>> >> >user 7m29.783s
>> >> >sys 0m0.432s
>> >> >---
>> >> And I copied over the x86 executable (I do this since then I'm sure Apple can
>> >> in no way "cheat") and ran it:
>> >>
>> >> --
>> >> m1-mini-bash$ time ./michaels2.x86 13
>> >> 13 1057516028
>> >>
>> >> real 1m26.825s
>> >> user 1m26.541s
>> >> sys 0m0.108s
>> >> m1-mini-bash$ time ./michaels2.x86 12
>> >> 12 55121700430
>> >>
>> >> real 8m5.908s
>> >> user 8m5.452s
>> >> sys 0m0.465s
>> >> --
>> >>
>> >> Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
>> >> FASTER than the native compiled code. This did not happen for the "12"
>> >> version (which runs much longer).
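
To put numbers on that, the "real" times quoted above can be turned into seconds and divided. This is only a sketch: timeToSeconds and nativeOverTranslated are hypothetical helper names, and the parser assumes the bash "NmS.SSSs" format shown in the transcripts.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a bash `time` value such as "1m32.648s" to seconds.
 * Returns -1.0 if the string does not match the assumed format. */
double timeToSeconds(const char *s)
{
    int minutes;
    double seconds;
    if (sscanf(s, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Ratio of native run time to translated x86 run time.
 * A value above 1 means the native build took longer. */
double nativeOverTranslated(const char *native, const char *translated)
{
    double n = timeToSeconds(native);
    double t = timeToSeconds(translated);
    assert(n > 0.0 && t > 0.0);
    return n / t;
}
```

With the figures above, nativeOverTranslated("1m32.648s", "1m26.825s") comes out near 1.07 for nDigits=13 (native about 7% slower), while nativeOverTranslated("7m30.212s", "8m5.908s") is near 0.93 for nDigits=12 (native about 7% faster).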
>> >> >
>> >> >And for comparison, my iMac Pro:
>> >> >
>> >> >---
>> >> >imacpro-bash$ cc -O2 -o michaels2 michaels2.c
>> >> >imacpro-bash$ time michaels2 13
>> >> >13 1057516028
>> >> >
>> >> >real 1m50.732s
>> >> >user 1m50.517s
>> >> >sys 0m0.055s
>> >> >imacpro-bash$ time michaels2 12
>> >> >12 55121700430
>> >> >
>> >> >real 9m45.505s
>> >> >user 9m45.260s
>> >> >sys 0m0.181s
>> >> >---
>> >>
>> >> Kent
>> >
>> > Diz JIT boy is a BAD mazafaka!
>> >
>> It's not JIT, it's the M1; native code is faster than x86 :P
>> maxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i -target x86_64-apple-darwin
>
> In order to see the effect observed by Kent, you should compile the x86-64 binary with "clang -O2 -march=skylake".
> As I said above, in this particular program clang-x64 is measurably faster than gcc-x64.
> That's assuming that your gcc is a real gcc rather than an alias for clang, as is typical on macOS.

I have a real gcc for ARM, gcc-11, that is. Yes, in this particular case it is slower,
but it produces significantly faster code than Apple's clang most of the time...
Yes, gcc is an alias for clang on macOS, so when I want the real gcc I use the gcc-11 command.
Here is the result with your parameters passed to clang:
maxa@Branimirs-Air euler % gcc -O2 euler413i.c -o euler413i -march=skylake -target x86_64-apple-darwin
bmaxa@Branimirs-Air euler % time ./euler413i 13
13 1057516028
./euler413i 13 96.14s user 0.44s system 99% cpu 1:36.94 total
bmaxa@Branimirs-Air euler % time ./euler413i 12
12 55121700430
./euler413i 12 487.02s user 2.13s system 99% cpu 8:09.21 total

maxa@Branimirs-Air euler % file euler413
euler413: Mach-O 64-bit executable x86_64
Native code is still faster. The only difference is that I use the clang that came
with the M1, probably the latest.

>
--
something dumb
