novaBBS - comp.arch - Re: My experience with Apple M1 chip

Re: My experience with Apple M1 chip

<2021Jul8.164728@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=18502&group=comp.arch#18502

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: My experience with Apple M1 chip
Date: Thu, 08 Jul 2021 14:47:28 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 17
Message-ID: <2021Jul8.164728@mips.complang.tuwien.ac.at>
References: <T6VEI.159$VU3.17@fx46.iad> <19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com> <2l%EI.8$gE.2@fx21.iad> <80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com> <SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com> <zQ5FI.779$VU3.610@fx46.iad> <2021Jul7.122809@mips.complang.tuwien.ac.at> <YigFI.3408$Nq7.1210@fx33.iad> <2021Jul8.084725@mips.complang.tuwien.ac.at> <_9CFI.2808$Nz.1791@fx22.iad>
Injection-Info: reader02.eternal-september.org; posting-host="4046eb382dfb208213332e8b4156df24";
logging-data="19782"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+4XDmTosWXRtdt9qmnsw4S"
Cancel-Lock: sha1:gIbupF9duVsacfWuJz+wzt/MJ0M=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Thu, 8 Jul 2021 14:47 UTC

Branimir Maksimovic <branimir.maksimovic@gmail.com> writes:
>> The M1 is pretty good. Too bad it's only available in Apple products.
>
>Worse is that I can't install Linux at this time

Yes, because it's only available in Apple products.

I admit to getting an iBook G4 in 2004, but at that time you could
install Linux on it. Mine never had MacOS installed.

But these days, Apple can afford not to get business from people like
me, or at least they think so.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
>> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
>> >> > Michael S <already...@yahoo.com> wrote:
>> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
>> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
>> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
>> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
>> >> >>either in single thread or in multithread throughput. Unless, of course,
>> >> >>the absolute majority of run time is spent in native libraries.
>> >> >>> You don't count that M1 is ~25-33% faster single core than any x86 :P
>> >> >>
>> >> >>I took it into account.
>> >> >>
>> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
>> >> >>for *any* x86.
>> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
>> >> >>than top Comet Lake.
>> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
>> >> >>rather close.
>> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
>> >> >>Rocket Lake closely.
>> >> >
>> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
>> >> > predict branches,
>> >> Yes, my assembler doesn't work as well as on x86 ;p
>> >> likes loop unrolling very much :p
>> >> Got 10% just converting a subroutine into a macro and calling it several times :P
>> >> > or working set in the 100-200KB range). It is not the
>> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
>> >> Sorry, can't agree. Blows away x86 alright.
>> >> > At < 10W
>> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
>> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
>> >> > command line, doesn't require me to install anything else, should run in less
>> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
>> >> > those results to any system you like. I don't think comparing optimized AVX
>> >> > is going to be useful, but simple integer or floating point algorithms would
>> >> > be best.
>> >> It can compare with optimised AVX alright :P
>> >>
>> >
>> > Do you want to test it?
>> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrinsics) and the same algorithm in
>> > plain C++ that can be, in theory, vectorized by a good compiler.
>> Of course, it would be interesting to see how M1 does that.
>
> It's in my public github repo already5chosen/others under directory cholesky_solver.
> You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
> for the same algorithm in plain c++.
> For speed measurement, I was mostly concerned with N=85.
>
> Unfortunately, there is a big chance that without additional explanations you will not be able to proceed.
> The repo was intended for myself, so it doesn't contain a comprehensive readme.
> Tomorrow, or much later today, I could answer a few questions, but right now I have to work. :(

_stricmp not present on macOS, can't compile.

--
something dumb

Re: My experience with Apple M1 chip

<bef8975c-2c59-4530-bc98-653a77464ce5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18504&group=comp.arch#18504

X-Received: by 2002:a0c:ee2a:: with SMTP id l10mr30320355qvs.22.1625758822470; Thu, 08 Jul 2021 08:40:22 -0700 (PDT)
X-Received: by 2002:a9d:5603:: with SMTP id e3mr17937933oti.178.1625758822243; Thu, 08 Jul 2021 08:40:22 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 08:40:22 -0700 (PDT)
In-Reply-To: <JQEFI.964$6j.113@fx04.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <T6VEI.159$VU3.17@fx46.iad> <19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com> <2l%EI.8$gE.2@fx21.iad> <80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com> <SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <A74FI.172$Yv3.30@fx41.iad> <d5a8db38-b2b8-403d-9692-b4bdbd4388f0n@googlegroups.com> <xmDFI.1310$6U5.249@fx02.iad> <84dc1c7c-4db9-4699-9fb0-a97a6cc3cf27n@googlegroups.com> <JQEFI.964$6j.113@fx04.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bef8975c-2c59-4530-bc98-653a77464ce5n@googlegroups.com>
Subject: Re: My experience with Apple M1 chip
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 08 Jul 2021 15:40:22 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 66

by: Michael S - Thu, 8 Jul 2021 15:40 UTC

On Thursday, July 8, 2021 at 6:28:46 PM UTC+3, Branimir Maksimovic wrote:
> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> > On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> >> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
> >> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
> >> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
> >> >> > Michael S <already...@yahoo.com> wrote:
> >> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
> >> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
> >> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
> >> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
> >> >> >>either in single thread or in multithread throughput. Unless, of course,
> >> >> >>the absolute majority of run time is spent in native libraries.
> >> >> >>> You don't count that M1 is ~25-33% faster single core than any x86 :P
> >> >> >>
> >> >> >>I took it into account.
> >> >> >>
> >> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
> >> >> >>for *any* x86.
> >> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
> >> >> >>than top Comet Lake.
> >> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
> >> >> >>rather close.
> >> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
> >> >> >>Rocket Lake closely.
> >> >> >
> >> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
> >> >> > predict branches,
> >> >> Yes, my assembler doesn't work as well as on x86 ;p
> >> >> likes loop unrolling very much :p
> >> >> Got 10% just converting a subroutine into a macro and calling it several times :P
> >> >> > or working set in the 100-200KB range). It is not the
> >> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
> >> >> Sorry, can't agree. Blows away x86 alright.
> >> >> > At < 10W
> >> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
> >> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
> >> >> > command line, doesn't require me to install anything else, should run in less
> >> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
> >> >> > those results to any system you like. I don't think comparing optimized AVX
> >> >> > is going to be useful, but simple integer or floating point algorithms would
> >> >> > be best.
> >> >> It can compare with optimised AVX alright :P
> >> >>
> >> >
> >> > Do you want to test it?
> >> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrinsics) and the same algorithm in
> >> > plain C++ that can be, in theory, vectorized by a good compiler.
> >> Of course, it would be interesting to see how M1 does that.
> >
> > It's in my public github repo already5chosen/others under directory cholesky_solver.
> > You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
> > for the same algorithm in plain c++.
> > For speed measurement, I was mostly concerned with N=85.
> >
> > Unfortunately, there is a big chance that without additional explanations you will not be able to proceed.
> > The repo was intended for myself, so it doesn't contain a comprehensive readme.
> > Tomorrow, or much later today, I could answer a few questions, but right now I have to work. :(
> _stricmp not present on macOS, can't compile.
>
>
> --
> something dumb

You can replace it with strcasecmp(). If that does not work either, use strcmp().
In the latter case command line arguments will become case-sensitive.
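
The substitution suggested above can be wrapped in a small portability shim. This is only a sketch: `ci_compare` and `option_matches` are hypothetical names that do not come from the cholesky_solver repo, and it assumes a POSIX system (macOS, Linux) provides strcasecmp() in <strings.h> while MSVC provides _stricmp() in <string.h>.

```c
#include <string.h>

#if defined(_WIN32)
/* MSVC runtime: _stricmp is declared in <string.h> */
#define ci_compare _stricmp
#else
/* POSIX (macOS, Linux): strcasecmp is declared in <strings.h> */
#include <strings.h>
#define ci_compare strcasecmp
#endif

/* Hypothetical helper: returns nonzero when a command-line argument
   equals an option name, ignoring letter case. */
int option_matches(const char *arg, const char *name)
{
    return ci_compare(arg, name) == 0;
}
```

With this in place, `option_matches("Layout", "LAYOUT")` is true on both platforms, so command line arguments stay case-insensitive instead of falling back to strcmp().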

Re: My experience with Apple M1 chip

<teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com>

https://www.novabbs.com/devel/article-flat.php?id=18508&group=comp.arch#18508

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!buffer2.nntp.dca1.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 08 Jul 2021 11:11:08 -0500
Newsgroups: comp.arch
Subject: Re: My experience with Apple M1 chip
References: <T6VEI.159$VU3.17@fx46.iad> <SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com> <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com>
Organization: provalid.com
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
From: keg...@provalid.com (Kent Dickey)
Originator: kegs@provalid.com (Kent Dickey)
Message-ID: <teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com>
Date: Thu, 08 Jul 2021 11:11:08 -0500
Lines: 134
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-WmfTthHp/BMCB204yVgqt9aPKuAqjKNzACEJe8dWYsBTBLfIHbpRgJ34tO4Xz9EsdGL526UV3jKhsER!dYmP3UyIE8DNTQby6VvkZTAEzXnSiuSaKdW7F8Q+MFLvvem6FwNymZX+W2S8l9Q2PIDPm7ZlrIM=
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 4675

by: Kent Dickey - Thu, 8 Jul 2021 16:11 UTC

In article <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com>,
Michael S <already5chosen@yahoo.com> wrote:
>In the meantime I simplified and improved this program.
>With the new variant nDigits=11 is no longer interesting as a benchmark (too
>fast), but nDigits=12 and nDigits=13 are now well suited.
>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
>Code:
>
>//-- beg
>#include <stdint.h>
>#include <stdio.h>
>#include <stdlib.h>
>#include <string.h>
>
>static unsigned long long oneChildsInRange(int nDigits);
>int main(int argz, char** argv)
>{
>  if (argz < 2) {
>    fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
>    return 1;
>  }
>
>  char* endp;
>  int nDigits = strtol(argv[1], &endp, 0);
>  if (endp == argv[1]) {
>    fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
>    return 1;
>  }
>
>  if (nDigits < 11 || nDigits > 19) {
>    fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
>    return 1;
>  }
>
>  printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
>  return 0;
>}
>
>typedef struct {
>  uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
>  uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
>} tabs_t;
>
>static unsigned long long countChildsRecursive(
>  int prefix_nChilds, // 0 or 1
>  const uint8_t prefixRem[],
>  int prefixlen,
>  const tabs_t* tabs)
>{
>  unsigned long long cnt = 0;
>  for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
>    int nChilds = suffix ? prefix_nChilds : 1;
>    const uint8_t *isChild = &tabs->isChildTab[suffix];
>    for (int i = 0; i < prefixlen; ++i)
>      nChilds += isChild[prefixRem[i]];
>
>    if (nChilds < 2) {
>      if (tabs->isChildTab[prefixlen+1]) { // all digits processed
>        cnt += nChilds;
>      } else {
>        // extend prefix
>        uint8_t prefixRemEx[20];
>        for (int i = 0; i < prefixlen; ++i)
>          prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
>        prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
>        cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
>      }
>    }
>  }
>  return cnt;
>}
>
>static unsigned long long oneChildsInRange(int nDigits)
>{
>  // initialize look-up tables
>  tabs_t tabs;
>  for (int i = 0; i < 19+10; ++i) {
>    tabs.isChildTab[i] = i % nDigits == 0;
>    tabs.x10remTab [i] = (i*10) % nDigits;
>  }
>
>  unsigned long long cnt = 0;
>  for (int pref = 1; pref < 10; ++pref) {
>    uint8_t prefixRem[1];
>    prefixRem[0] = tabs.x10remTab[pref];
>    cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
>  }
>
>  return cnt;
>}
>
>//-- end
>
>Unlike the previous code, this variant on x86-64 is faster when compiled
>with 'clang -march=native -O2'.
>gcc is significantly slower.

OK, on my Mac Mini M1 (all Apple compilers are clang):

---
m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
m1-mini-bash$ time ./michaels2 13
13 1057516028

real 1m32.648s
user 1m32.311s
sys 0m0.098s
m1-mini-bash$ time ./michaels2 12
12 55121700430

real 7m30.212s
user 7m29.783s
sys 0m0.432s
---

And for comparison, my iMac Pro:

---
imacpro-bash$ cc -O2 -o michaels2 michaels2.c
imacpro-bash$ time michaels2 13
13 1057516028

real 1m50.732s
user 1m50.517s
sys 0m0.055s
imacpro-bash$ time michaels2 12
12 55121700430

real 9m45.505s
user 9m45.260s
sys 0m0.181s
---

Kent
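
For readers who want to sanity-check what the benchmark counts, here is a brute-force reference sketch. It rests on an assumption inferred from the function names and table comments in the quoted code, not on anything stated in the thread: a d-digit number is a "one-child" number when exactly one of its contiguous digit substrings is divisible by d. The names `divisible_substrings` and `one_childs_brute` are hypothetical, and the loop is only practical for small d, far below the 11..19 range the benchmark accepts.

```c
#include <stdint.h>

/* Count the contiguous digit substrings of n (a d-digit number)
   whose value is divisible by d. Assumed definition, see lead-in. */
static int divisible_substrings(uint64_t n, int d)
{
    int digits[20];
    for (int i = d - 1; i >= 0; --i) {
        digits[i] = (int)(n % 10);
        n /= 10;
    }
    int count = 0;
    for (int start = 0; start < d; ++start) {
        int rem = 0;
        for (int end = start; end < d; ++end) {
            rem = (rem * 10 + digits[end]) % d;  /* remainder of digits[start..end] */
            if (rem == 0)
                ++count;
        }
    }
    return count;
}

/* Count d-digit numbers (no leading zero) with exactly one such substring. */
unsigned long long one_childs_brute(int d)
{
    uint64_t lo = 1;
    for (int i = 1; i < d; ++i)
        lo *= 10;
    unsigned long long cnt = 0;
    for (uint64_t n = lo; n < lo * 10; ++n)
        if (divisible_substrings(n, d) == 1)
            ++cnt;
    return cnt;
}
```

Under that definition the counts for d = 1, 2 and 3 are 9, 20 and 360. The quoted program reaches large d by carrying only the remainders of the current suffixes through the recursion instead of enumerating every number.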

So it seems gcc is much faster than Apple's clang in this case...

--
something dumb

Re: My experience with Apple M1 chip

<fwFFI.332$nj3.163@fx15.iad>

https://www.novabbs.com/devel/article-flat.php?id=18510&group=comp.arch#18510

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx15.iad.POSTED!not-for-mail
Newsgroups: comp.arch
From: branimir...@gmail.com (Branimir Maksimovic)
Subject: Re: My experience with Apple M1 chip
References: <T6VEI.159$VU3.17@fx46.iad>
<19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad>
<80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <A74FI.172$Yv3.30@fx41.iad>
<d5a8db38-b2b8-403d-9692-b4bdbd4388f0n@googlegroups.com>
<xmDFI.1310$6U5.249@fx02.iad>
<84dc1c7c-4db9-4699-9fb0-a97a6cc3cf27n@googlegroups.com>
<JQEFI.964$6j.113@fx04.iad>
<bef8975c-2c59-4530-bc98-653a77464ce5n@googlegroups.com>
User-Agent: slrn/1.0.3 (Darwin)
Lines: 72
Message-ID: <fwFFI.332$nj3.163@fx15.iad>
X-Complaints-To: abuse@usenet-news.net
NNTP-Posting-Date: Thu, 08 Jul 2021 16:15:07 UTC
Organization: usenet-news.net
Date: Thu, 08 Jul 2021 16:15:07 GMT
X-Received-Bytes: 5132

by: Branimir Maksimovic - Thu, 8 Jul 2021 16:15 UTC

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 6:28:46 PM UTC+3, Branimir Maksimovic wrote:
>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> > On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
>> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> >> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
>> >> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>> >> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
>> >> >> > Michael S <already...@yahoo.com> wrote:
>> >> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
>> >> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
>> >> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
>> >> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
>> >> >> >>either in single thread or in multithread throughput. Unless, of course,
>> >> >> >>the absolute majority of run time is spent in native libraries.
>> >> >> >>> You don't count that M1 is ~25-33% faster single core than any x86 :P
>> >> >> >>
>> >> >> >>I took it into account.
>> >> >> >>
>> >> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
>> >> >> >>for *any* x86.
>> >> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
>> >> >> >>than top Comet Lake.
>> >> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
>> >> >> >>rather close.
>> >> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
>> >> >> >>Rocket Lake closely.
>> >> >> >
>> >> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
>> >> >> > predict branches,
>> >> >> Yes, my assembler doesn't work as well as on x86 ;p
>> >> >> likes loop unrolling very much :p
>> >> >> Got 10% just converting a subroutine into a macro and calling it several times :P
>> >> >> > or working set in the 100-200KB range). It is not the
>> >> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
>> >> >> Sorry, can't agree. Blows away x86 alright.
>> >> >> > At < 10W
>> >> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
>> >> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
>> >> >> > command line, doesn't require me to install anything else, should run in less
>> >> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
>> >> >> > those results to any system you like. I don't think comparing optimized AVX
>> >> >> > is going to be useful, but simple integer or floating point algorithms would
>> >> >> > be best.
>> >> >> It can compare with optimised AVX alright :P
>> >> >>
>> >> >
>> >> > Do you want to test it?
>> >> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrinsics) and the same algorithm in
>> >> > plain C++ that can be, in theory, vectorized by a good compiler.
>> >> Of course, it would be interesting to see how M1 does that.
>> >
>> > It's in my public github repo already5chosen/others under directory cholesky_solver.
>> > You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
>> > for the same algorithm in plain c++.
>> > For speed measurement, I was mostly concerned with N=85.
>> >
>> > Unfortunately, there is a big chance that without additional explanations you will not be able to proceed.
>> > The repo was intended for myself, so it doesn't contain a comprehensive readme.
>> > Tomorrow, or much later today, I could answer a few questions, but right now I have to work. :(
>> _stricmp not present on macOS, can't compile.
>>
>>
>> --
>> something dumb
>
> You can replace it with strcasecmp(). If that does not work either, use strcmp().
> In the latter case command line arguments will become case-sensitive.
I have waited for you, that's one small change :P

--
something dumb

On 2021-07-08, Branimir Maksimovic <branimir.maksimovic@gmail.com> wrote:
> On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
>> On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
>>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>>> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
>>> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>>> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
>>> >> > Michael S <already...@yahoo.com> wrote:
>>> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
>>> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
>>> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
>>> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
>>> >> >>either in single thread or in multithread throughput. Unless, of course,
>>> >> >>the absolute majority of run time is spent in native libraries.
>>> >> >>> You don't count that M1 is ~25-33% faster single core than any x86 :P
>>> >> >>
>>> >> >>I took it into account.
>>> >> >>
>>> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
>>> >> >>for *any* x86.
>>> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
>>> >> >>than top Comet Lake.
>>> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
>>> >> >>rather close.
>>> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
>>> >> >>Rocket Lake closely.
>>> >> >
>>> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
>>> >> > predict branches,
>>> >> Yes, my assembler doesn't work as well as on x86 ;p
>>> >> likes loop unrolling very much :p
>>> >> Got 10% just converting a subroutine into a macro and calling it several times :P
>>> >> > or working set in the 100-200KB range). It is not the
>>> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
>>> >> Sorry, can't agree. Blows away x86 alright.
>>> >> > At < 10W
>>> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
>>> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
>>> >> > command line, doesn't require me to install anything else, should run in less
>>> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
>>> >> > those results to any system you like. I don't think comparing optimized AVX
>>> >> > is going to be useful, but simple integer or floating point algorithms would
>>> >> > be best.
>>> >> It can compare with optimised AVX alright :P
>>> >>
>>> >
>>> > Do you want to test it?
>>> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrinsics) and the same algorithm in
>>> > plain C++ that can be, in theory, vectorized by a good compiler.
>>> Of course, it would be interesting to see how M1 does that.
>>
>> It's in my public github repo already5chosen/others under directory cholesky_solver.
>> You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
>> for the same algorithm in plain c++.
>> For speed measurement, I was mostly concerned with N=85.
>>
>> Unfortunately, there is a big chance that without additional explanations you will not be able to proceed.
>> The repo was intended for myself, so it doesn't contain a comprehensive readme.
>> Tomorrow, or much later today, I could answer a few questions, but right now I have to work. :(
>>
>>> >
>>> > The most interesting would be compiling for an Intel Mac and then running the binary through Rosetta, as Kent did for my first test.
>>> > I don't know if you can do it without an Intel Mac.
>>> Apple's gcc supports cross-compilation. I build x86 binaries with no problem. The only thing is that if they contain AVX code they won't work,
>>> as Rosetta does not support AVX...
>>
>> That's a pity.
>> I heard about it a year ago, but completely forgot.
>>
>>> >
>>> >
>>> >
>>> >> >
>>> >> > Kent
>>> >>
>>> >>
>>> >> --
>>> >> something dumb
>>>
>>>
>>> --
>>> something dumb
> bmaxa@Branimirs-Air cholesky_solver % g++-11 -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 9.845e-12. N= 128. 84.79 usec. 17.657 GMADD/s
> ./chol 128 1.48s user 0.02s system 85% cpu 1.752 total
> bmaxa@Branimirs-Air cholesky_solver % g++ -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 1.001e-11. N= 128. 258.35 usec. 5.795 GMADD/s
> ./chol 128 4.16s user 0.02s system 95% cpu 4.358 total
>
> So it seems gcc is much faster than Apple's clang in this case...
>
>
And through Rosetta:
bmaxa@Branimirs-Air cholesky_solver % g++ -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11 -target x86_64-apple-darwin
bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
Layout='R'. N= 128. max. err: decomposition 5.684e-13, solver 1.001e-11. N= 128. 274.58 usec. 5.452 GMADD/s
./chol 128 4.73s user 0.03s system 92% cpu 5.160 total
bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
Layout='R'. N= 128. max. err: decomposition 5.684e-13, solver 1.001e-11. N= 128. 274.22 usec. 5.460 GMADD/s
./chol 128 4.74s user 0.03s system 99% cpu 4.801 total
For gcc I don't have a cross-compiler...
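
When comparing native and translated runs like these, it can help to have the binary report which mode it is actually running in. The sketch below follows the `sysctl.proc_translated` query that Apple documents for the Rosetta translation environment. The function name `process_is_translated` is an assumption for illustration, and on non-Apple systems the function simply reports "not translated".

```c
#include <errno.h>
#include <stddef.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 when the process runs under Rosetta translation,
   0 when it runs natively (or not on macOS), -1 on an unexpected error. */
int process_is_translated(void)
{
#ifdef __APPLE__
    int ret = 0;
    size_t size = sizeof(ret);
    if (sysctlbyname("sysctl.proc_translated", &ret, &size, NULL, 0) == -1) {
        /* ENOENT: the key does not exist, so there is no translation layer */
        return errno == ENOENT ? 0 : -1;
    }
    return ret;
#else
    return 0;
#endif
}
```

Printing this value next to the timing line would make it harder to mix up an arm64 build with an x86_64 build of the same `chol` binary.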

--
something dumb

Re: My experience with Apple M1 chip

<pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com>

https://www.novabbs.com/devel/article-flat.php?id=18512&group=comp.arch#18512

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!buffer2.nntp.dca1.giganews.com!buffer1.nntp.dca1.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 08 Jul 2021 11:35:57 -0500
Newsgroups: comp.arch
Subject: Re: My experience with Apple M1 chip
References: <T6VEI.159$VU3.17@fx46.iad> <187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com> <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com> <teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com>
Organization: provalid.com
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
From: keg...@provalid.com (Kent Dickey)
Originator: kegs@provalid.com (Kent Dickey)
Message-ID: <pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com>
Date: Thu, 08 Jul 2021 11:35:57 -0500
Lines: 159
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-ALT2f4HqOCMrtPDN6RDbV6luzn43znir8K4uzrKYK4nBKNQt90X6fLZGflzZCskWP28xQTZ8FW1TJcT!KgXbxPEnOMN1oQtHZx5TUk/0jlJ0FVKX8wbGFYitusdNI/bp4Eyvo0O6N5xQJyts9Tm2UAHFkMY=
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 5441

by: Kent Dickey - Thu, 8 Jul 2021 16:35 UTC

In article <teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com>,
Kent Dickey <kegs@provalid.com> wrote:
>In article <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com>,
>Michael S <already5chosen@yahoo.com> wrote:
>>In the meantime I simplified and improved this program.
>>With the new variant nDigits=11 is no longer interesting as a benchmark (too
>>fast), but nDigits=12 and nDigits=13 are now well suited.
>>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
>>Code:
>>
>>//-- beg
>>#include <stdint.h>
>>#include <stdio.h>
>>#include <stdlib.h>
>>#include <string.h>
>>
>>static unsigned long long oneChildsInRange(int nDigits);
>>int main(int argz, char** argv)
>>{
>>  if (argz < 2) {
>>    fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
>>    return 1;
>>  }
>>
>>  char* endp;
>>  int nDigits = strtol(argv[1], &endp, 0);
>>  if (endp == argv[1]) {
>>    fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
>>    return 1;
>>  }
>>
>>  if (nDigits < 11 || nDigits > 19) {
>>    fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
>>    return 1;
>>  }
>>
>>  printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
>>  return 0;
>>}
>>
>>typedef struct {
>>  uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
>>  uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
>>} tabs_t;
>>
>>static unsigned long long countChildsRecursive(
>>  int prefix_nChilds, // 0 or 1
>>  const uint8_t prefixRem[],
>>  int prefixlen,
>>  const tabs_t* tabs)
>>{
>>  unsigned long long cnt = 0;
>>  for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
>>    int nChilds = suffix ? prefix_nChilds : 1;
>>    const uint8_t *isChild = &tabs->isChildTab[suffix];
>>    for (int i = 0; i < prefixlen; ++i)
>>      nChilds += isChild[prefixRem[i]];
>>
>>    if (nChilds < 2) {
>>      if (tabs->isChildTab[prefixlen+1]) { // all digits processed
>>        cnt += nChilds;
>>      } else {
>>        // extend prefix
>>        uint8_t prefixRemEx[20];
>>        for (int i = 0; i < prefixlen; ++i)
>>          prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
>>        prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
>>        cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
>>      }
>>    }
>>  }
>>  return cnt;
>>}
>>
>>static unsigned long long oneChildsInRange(int nDigits)
>>{
>>  // initialize look-up tables
>>  tabs_t tabs;
>>  for (int i = 0; i < 19+10; ++i) {
>>    tabs.isChildTab[i] = i % nDigits == 0;
>>    tabs.x10remTab [i] = (i*10) % nDigits;
>>  }
>>
>>  unsigned long long cnt = 0;
>>  for (int pref = 1; pref < 10; ++pref) {
>>    uint8_t prefixRem[1];
>>    prefixRem[0] = tabs.x10remTab[pref];
>>    cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
>>  }
>>
>>  return cnt;
>>}
>>
>>//-- end
>>
>>Unlike the previous code, this variant on x86-64 is faster when compiled
>>with 'clang -march=native -O2'.
>>gcc is significantly slower.
>
>OK, on my Mac Mini M1 (all Apple compilers are clang):
>
>---
>m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
>m1-mini-bash$ time ./michaels2 13
>13 1057516028
>
>real 1m32.648s
>user 1m32.311s
>sys 0m0.098s
>m1-mini-bash$ time ./michaels2 12
>12 55121700430
>
>real 7m30.212s
>user 7m29.783s
>sys 0m0.432s
>---

And I copied over the x86 executable (I do this since then I'm sure Apple can
in no way "cheat") and ran it:

--
m1-mini-bash$ time ./michaels2.x86 13
13 1057516028

real 1m26.825s
user 1m26.541s
sys 0m0.108s
m1-mini-bash$ time ./michaels2.x86 12
12 55121700430

real 8m5.908s
user 8m5.452s
sys 0m0.465s
--

Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
FASTER than the native compiled code. This did not happen for the "12"
version (which runs much longer).
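
To put numbers on that comparison, the `real` times can be converted to seconds and divided. The helper names below (`parse_elapsed`, `translated_over_native`) are made up for this sketch, and it assumes the `XmY.YYYs` format that bash's `time` prints in the transcripts above.

```c
#include <stdio.h>

/* Convert a time string such as "1m32.648s" to seconds.
   Returns a negative value when the string does not match that format. */
double parse_elapsed(const char *text)
{
    int minutes = 0;
    double seconds = 0.0;
    if (sscanf(text, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Ratio of translated (x86 under Rosetta) to native wall time.
   A value below 1.0 means the translated binary finished sooner. */
double translated_over_native(const char *translated, const char *native)
{
    return parse_elapsed(translated) / parse_elapsed(native);
}
```

Fed with the figures from the two posts, `translated_over_native("1m26.825s", "1m32.648s")` comes out at roughly 0.94 for the "13" run and `translated_over_native("8m5.908s", "7m30.212s")` at roughly 1.08 for the "12" run, so the translated binary is about 6% faster in one case and about 8% slower in the other.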

>
>And for comparison, my iMac Pro:
>
>---
>imacpro-bash$ cc -O2 -o michaels2 michaels2.c
>imacpro-bash$ time michaels2 13
>13 1057516028
>
>real 1m50.732s
>user 1m50.517s
>sys 0m0.055s
>imacpro-bash$ time michaels2 12
>12 55121700430
>
>real 9m45.505s
>user 9m45.260s
>sys 0m0.181s
>---

Kent

Re: My experience with Apple M1 chip

<8109bcc6-706f-4370-a4c2-e5119481f19cn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18513&group=comp.arch#18513


X-Received: by 2002:a0c:e54e:: with SMTP id n14mr1459889qvm.41.1625764824463; Thu, 08 Jul 2021 10:20:24 -0700 (PDT)
X-Received: by 2002:a05:6830:2470:: with SMTP id x48mr6826220otr.81.1625764824277; Thu, 08 Jul 2021 10:20:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!tr2.eu1.usenetexpress.com!feeder.usenetexpress.com!tr1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 10:20:24 -0700 (PDT)
In-Reply-To: <pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <T6VEI.159$VU3.17@fx46.iad> <187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com> <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com> <teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com> <pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8109bcc6-706f-4370-a4c2-e5119481f19cn@googlegroups.com>
Subject: Re: My experience with Apple M1 chip
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 08 Jul 2021 17:20:24 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 160

by: Michael S - Thu, 8 Jul 2021 17:20 UTC

On Thursday, July 8, 2021 at 7:36:04 PM UTC+3, Kent Dickey wrote:
> In article <teidncg9kKcBvnr9...@giganews.com>,
> Kent Dickey <ke...@provalid.com> wrote:
> >In article <c5a7429b-b7e6-4fbc...@googlegroups.com>,
> >Michael S <already...@yahoo.com> wrote:
> >>In the meantime I simplified and improved this program.
> >>With new variant nDigits=11 is no longer interesting as a benchmark (too
> >>fast), but nDigits=12 and nDigits=13 are now well suited.
> >>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
> >>Code:
> >>
> >>//-- beg
> >>#include <stdint.h>
> >>#include <stdio.h>
> >>#include <stdlib.h>
> >>#include <string.h>
> >>
> >>static unsigned long long oneChildsInRange(int nDigits);
> >>int main(int argz, char** argv)
> >>{
> >>  if (argz < 2) {
> >>    fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
> >>    return 1;
> >>  }
> >>
> >>  char* endp;
> >>  int nDigits = strtol(argv[1], &endp, 0);
> >>  if (endp == argv[1]) {
> >>    fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
> >>    return 1;
> >>  }
> >>
> >>  if (nDigits < 11 || nDigits > 19) {
> >>    fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
> >>    return 1;
> >>  }
> >>
> >>  printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
> >>  return 0;
> >>}
> >>
> >>typedef struct {
> >>  uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
> >>  uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
> >>} tabs_t;
> >>
> >>static unsigned long long countChildsRecursive(
> >>  int prefix_nChilds, // 0 or 1
> >>  const uint8_t prefixRem[],
> >>  int prefixlen,
> >>  const tabs_t* tabs)
> >>{
> >>  unsigned long long cnt = 0;
> >>  for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
> >>    int nChilds = suffix ? prefix_nChilds : 1;
> >>    const uint8_t *isChild = &tabs->isChildTab[suffix];
> >>    for (int i = 0; i < prefixlen; ++i)
> >>      nChilds += isChild[prefixRem[i]];
> >>
> >>    if (nChilds < 2) {
> >>      if (tabs->isChildTab[prefixlen+1]) { // all digits processed
> >>        cnt += nChilds;
> >>      } else {
> >>        // extend prefix
> >>        uint8_t prefixRemEx[20];
> >>        for (int i = 0; i < prefixlen; ++i)
> >>          prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
> >>        prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
> >>        cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
> >>      }
> >>    }
> >>  }
> >>  return cnt;
> >>}
> >>
> >>static unsigned long long oneChildsInRange(int nDigits)
> >>{
> >>  // initialize look-up tables
> >>  tabs_t tabs;
> >>  for (int i = 0; i < 19+10; ++i) {
> >>    tabs.isChildTab[i] = i % nDigits == 0;
> >>    tabs.x10remTab [i] = (i*10) % nDigits;
> >>  }
> >>
> >>  unsigned long long cnt = 0;
> >>  for (int pref = 1; pref < 10; ++pref) {
> >>    uint8_t prefixRem[1];
> >>    prefixRem[0] = tabs.x10remTab[pref];
> >>    cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
> >>  }
> >>
> >>  return cnt;
> >>}
> >>
> >>//-- end
> >>
> >>Unlike the previous code, this variant on x86-64 is faster when compiled
> >>with 'clang -march=native -O2'.
> >>gcc is significantly slower.
> >
> >OK, on my Mac Mini M1 (all Apple compilers are clang):
> >
> >---
> >m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
> >m1-mini-bash$ time ./michaels2 13
> >13 1057516028
> >
> >real 1m32.648s
> >user 1m32.311s
> >sys 0m0.098s
> >m1-mini-bash$ time ./michaels2 12
> >12 55121700430
> >
> >real 7m30.212s
> >user 7m29.783s
> >sys 0m0.432s
> >---
> And I copied over the x86 executable (I do this since then I'm sure Apple can
> in no way "cheat") and ran it:
>
> --
> m1-mini-bash$ time ./michaels2.x86 13
> 13 1057516028
>
> real 1m26.825s
> user 1m26.541s
> sys 0m0.108s
> m1-mini-bash$ time ./michaels2.x86 12
> 12 55121700430
>
> real 8m5.908s
> user 8m5.452s
> sys 0m0.465s
> --
>
> Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
> FASTER than the native compiled code. This did not happen for the "12"
> version (which runs much longer).
> >
> >And for comparison, my iMac Pro:
> >
> >---
> >imacpro-bash$ cc -O2 -o michaels2 michaels2.c
> >imacpro-bash$ time michaels2 13
> >13 1057516028
> >
> >real 1m50.732s
> >user 1m50.517s
> >sys 0m0.055s
> >imacpro-bash$ time michaels2 12
> >12 55121700430
> >
> >real 9m45.505s
> >user 9m45.260s
> >sys 0m0.181s
> >---
>
> Kent

Diz JIT boy is a BAD mazafaka!

In article <2021Jul8.164728@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Branimir Maksimovic <branimir.maksimovic@gmail.com> writes:
> >Worse is that I can't install Linux at this time
>
> Yes, because it's only available in Apple products.
>
> I admit to getting an iBook G4 in 2004, but at that time you could
> install Linux on it. Mine never had MacOS installed.

There is a crowdfunded Linux project underway for the M1, and it seems to
be making reasonable progress. While some of Apple's decisions have not
been especially helpful for the Linux work, they have not locked down the
bootloader, so there's no need for jailbreaking.

https://www.patreon.com/marcan

John

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 7:36:04 PM UTC+3, Kent Dickey wrote:
>> In article <teidncg9kKcBvnr9...@giganews.com>,
>> Kent Dickey <ke...@provalid.com> wrote:
>> >In article <c5a7429b-b7e6-4fbc...@googlegroups.com>,
>> >Michael S <already...@yahoo.com> wrote:
>> >>In the meantime I simplified and improved this program.
>> >>With new variant nDigits=11 is no longer interesting as a benchmark (too
>> >>fast), but nDigits=12 and nDigits=13 are now well suited.
>> >>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
>> >>Code:
>> >>
>> >>//-- beg
>> >>#include <stdint.h>
>> >>#include <stdio.h>
>> >>#include <stdlib.h>
>> >>#include <string.h>
>> >>
>> >>static unsigned long long oneChildsInRange(int nDigits);
>> >>int main(int argz, char** argv)
>> >>{
>> >> if (argz < 2) {
>> >> fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
>> >> return 1;
>> >> }
>> >>
>> >> char* endp;
>> >> int nDigits = strtol(argv[1], &endp, 0);
>> >> if (endp == argv[1]) {
>> >> fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
>> >> return 1;
>> >> }
>> >>
>> >> if (nDigits < 11 || nDigits > 19) {
>> >> fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
>> >> return 1;
>> >> }
>> >>
>> >> printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
>> >> return 0;
>> >>}
>> >>
>> >>typedef struct {
>> >> uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
>> >> uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
>> >>} tabs_t;
>> >>
>> >>static unsigned long long countChildsRecursive(
>> >> int prefix_nChilds, // 0 or 1
>> >> const uint8_t prefixRem[],
>> >> int prefixlen,
>> >> const tabs_t* tabs)
>> >>{
>> >> unsigned long long cnt = 0;
>> >> for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
>> >> int nChilds = suffix ? prefix_nChilds : 1;
>> >> const uint8_t *isChild = &tabs->isChildTab[suffix];
>> >> for (int i = 0; i < prefixlen; ++i)
>> >> nChilds += isChild[prefixRem[i]];
>> >>
>> >> if (nChilds < 2) {
>> >> if (tabs->isChildTab[prefixlen+1]) { // all digits processed
>> >> cnt += nChilds;
>> >> } else {
>> >> // extend prefix
>> >> uint8_t prefixRemEx[20];
>> >> for (int i = 0; i < prefixlen; ++i)
>> >> prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
>> >> prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
>> >> cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
>> >> }
>> >> }
>> >> }
>> >> return cnt;
>> >>}
>> >>
>> >>static unsigned long long oneChildsInRange(int nDigits)
>> >>{
>> >> // initialize look-up tables
>> >> tabs_t tabs;
>> >> for (int i = 0; i < 19+10; ++i) {
>> >> tabs.isChildTab[i] = i % nDigits == 0;
>> >> tabs.x10remTab [i] = (i*10) % nDigits;
>> >> }
>> >>
>> >> unsigned long long cnt = 0;
>> >> for (int pref = 1; pref < 10; ++pref) {
>> >> uint8_t prefixRem[1];
>> >> prefixRem[0] = tabs.x10remTab[pref];
>> >> cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
>> >> }
>> >>
>> >> return cnt;
>> >>}
>> >>
>> >>//-- end
>> >>
>> >>Unlike the previous code, this variant on x86-64 is faster when compiled
>> >>with 'clang -march=native -O2'.
>> >>gcc is significantly slower.
>> >
>> >OK, on my Mac Mini M1 (all Apple compilers are clang):
>> >
>> >---
>> >m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
>> >m1-mini-bash$ time ./michaels2 13
>> >13 1057516028
>> >
>> >real 1m32.648s
>> >user 1m32.311s
>> >sys 0m0.098s
>> >m1-mini-bash$ time ./michaels2 12
>> >12 55121700430
>> >
>> >real 7m30.212s
>> >user 7m29.783s
>> >sys 0m0.432s
>> >---
>> And I copied over the x86 executable (I do this since then I'm sure Apple can
>> in no way "cheat") and ran it:
>>
>> --
>> m1-mini-bash$ time ./michaels2.x86 13
>> 13 1057516028
>>
>> real 1m26.825s
>> user 1m26.541s
>> sys 0m0.108s
>> m1-mini-bash$ time ./michaels2.x86 12
>> 12 55121700430
>>
>> real 8m5.908s
>> user 8m5.452s
>> sys 0m0.465s
>> --
>>
>> Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
>> FASTER than the native compiled code. This did not happen for the "12"
>> version (which runs much longer).
>> >
>> >And for comparison, my iMac Pro:
>> >
>> >---
>> >imacpro-bash$ cc -O2 -o michaels2 michaels2.c
>> >imacpro-bash$ time michaels2 13
>> >13 1057516028
>> >
>> >real 1m50.732s
>> >user 1m50.517s
>> >sys 0m0.055s
>> >imacpro-bash$ time michaels2 12
>> >12 55121700430
>> >
>> >real 9m45.505s
>> >user 9m45.260s
>> >sys 0m0.181s
>> >---
>>
>> Kent
>
> Diz JIT boy is a BAD mazafaka!
>
It's not JIT, it's M1; native code is faster than x86 :P
bmaxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i -target x86_64-apple-darwin
bmaxa@Branimirs-Air euler % time ./euler413i 13
13 1057516028
./euler413i 13 105.50s user 0.35s system 99% cpu 1:46.18 total
bmaxa@Branimirs-Air euler % time ./euler413i 12
12 55121700430
./euler413i 12 567.55s user 1.88s system 99% cpu 9:29.62 total
bmaxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i
bmaxa@Branimirs-Air euler % time ./euler413i 13
13 1057516028
./euler413i 13 93.72s user 0.31s system 99% cpu 1:34.76 total
bmaxa@Branimirs-Air euler % time ./euler413i 12
12 55121700430
./euler413i 12 458.93s user 1.39s system 99% cpu 7:40.35 total

--
something dumb

On 2021-07-08, John Dallman <jgd@cix.co.uk> wrote:
> In article <2021Jul8.164728@mips.complang.tuwien.ac.at>,
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Branimir Maksimovic <branimir.maksimovic@gmail.com> writes:
>> >Worse is that I can't install Linux at this time
>>
>> Yes, because it's only available in Apple products.
>>
>> I admit to getting an iBook G4 in 2004, but at that time you could
>> install Linux on it. Mine never had MacOS installed.
>
> There is a crowdfunded Linux project underway for the M1, and it seems to
> be making reasonable progress. While some of Apple's decisions have not
> been especially helpful for the Linux work, they have not locked down the
> bootloader, so there's no need for jailbreaking.
+1

--
something dumb

Re: My experience with Apple M1 chip

<jwva6mwprzp.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=18520&group=comp.arch#18520


Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: My experience with Apple M1 chip
Date: Thu, 08 Jul 2021 14:45:25 -0400
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <jwva6mwprzp.fsf-monnier+comp.arch@gnu.org>
References: <T6VEI.159$VU3.17@fx46.iad>
<19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad>
<80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com>
<187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com>
<zQ5FI.779$VU3.610@fx46.iad>
<2021Jul7.122809@mips.complang.tuwien.ac.at>
<YigFI.3408$Nq7.1210@fx33.iad>
<2021Jul8.084725@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="b761c3bf9721215b7e47f34b4681af9a";
logging-data="9041"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19n8Jao4P1aC9FntjgtO8xB"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:15z2pDAwOCyClNVfVsIHu6hxKdg=
sha1:vJ2Hkfadznv9ZfWpwhwopaxdCoA=

by: Stefan Monnier - Thu, 8 Jul 2021 18:45 UTC

> The M1 is pretty good. Too bad it's only available in Apple products.

Yes, this raises some philosophical/political issues for me, but the
more real problems are technical: AFAIU you can't buy a machine with
such a CPU where the RAM and SSD can be later upgraded.
That's a kind of planned obsolescence that I find very problematic, also
from an environmental point of view.

Stefan

Re: My experience with Apple M1 chip

<3fb4829b-cbb3-434b-a26f-d94fd5662713n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18523&group=comp.arch#18523


X-Received: by 2002:a05:620a:20d1:: with SMTP id f17mr33480134qka.370.1625773824935; Thu, 08 Jul 2021 12:50:24 -0700 (PDT)
X-Received: by 2002:aca:b18a:: with SMTP id a132mr4710436oif.30.1625773824667; Thu, 08 Jul 2021 12:50:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.swapon.de!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!tr2.eu1.usenetexpress.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 12:50:24 -0700 (PDT)
In-Reply-To: <U1HFI.336$nj3.297@fx15.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.191; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.191
References: <T6VEI.159$VU3.17@fx46.iad> <187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com> <c5a7429b-b7e6-4fbc-ac98-bf5160c3a87dn@googlegroups.com> <teidncg9kKcBvnr9nZ2dnUU7-Y_NnZ2d@giganews.com> <pcadnRiE6PbwtHr9nZ2dnUU7-R3NnZ2d@giganews.com> <8109bcc6-706f-4370-a4c2-e5119481f19cn@googlegroups.com> <U1HFI.336$nj3.297@fx15.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3fb4829b-cbb3-434b-a26f-d94fd5662713n@googlegroups.com>
Subject: Re: My experience with Apple M1 chip
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 08 Jul 2021 19:50:24 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 186

by: Michael S - Thu, 8 Jul 2021 19:50 UTC

On Thursday, July 8, 2021 at 8:59:20 PM UTC+3, Branimir Maksimovic wrote:
> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> > On Thursday, July 8, 2021 at 7:36:04 PM UTC+3, Kent Dickey wrote:
> >> In article <teidncg9kKcBvnr9...@giganews.com>,
> >> Kent Dickey <ke...@provalid.com> wrote:
> >> >In article <c5a7429b-b7e6-4fbc...@googlegroups.com>,
> >> >Michael S <already...@yahoo.com> wrote:
> >> >>In the meantime I simplified and improved this program.
> >> >>With new variant nDigits=11 is no longer interesting as a benchmark (too
> >> >>fast), but nDigits=12 and nDigits=13 are now well suited.
> >> >>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
> >> >>Code:
> >> >>
> >> >>//-- beg
> >> >>#include <stdint.h>
> >> >>#include <stdio.h>
> >> >>#include <stdlib.h>
> >> >>#include <string.h>
> >> >>
> >> >>static unsigned long long oneChildsInRange(int nDigits);
> >> >>int main(int argz, char** argv)
> >> >>{
> >> >> if (argz < 2) {
> >> >> fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
> >> >> return 1;
> >> >> }
> >> >>
> >> >> char* endp;
> >> >> int nDigits = strtol(argv[1], &endp, 0);
> >> >> if (endp == argv[1]) {
> >> >> fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
> >> >> return 1;
> >> >> }
> >> >>
> >> >> if (nDigits < 11 || nDigits > 19) {
> >> >> fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
> >> >> return 1;
> >> >> }
> >> >>
> >> >> printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
> >> >> return 0;
> >> >>}
> >> >>
> >> >>typedef struct {
> >> >> uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
> >> >> uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
> >> >>} tabs_t;
> >> >>
> >> >>static unsigned long long countChildsRecursive(
> >> >> int prefix_nChilds, // 0 or 1
> >> >> const uint8_t prefixRem[],
> >> >> int prefixlen,
> >> >> const tabs_t* tabs)
> >> >>{
> >> >> unsigned long long cnt = 0;
> >> >> for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
> >> >> int nChilds = suffix ? prefix_nChilds : 1;
> >> >> const uint8_t *isChild = &tabs->isChildTab[suffix];
> >> >> for (int i = 0; i < prefixlen; ++i)
> >> >> nChilds += isChild[prefixRem[i]];
> >> >>
> >> >> if (nChilds < 2) {
> >> >> if (tabs->isChildTab[prefixlen+1]) { // all digits processed
> >> >> cnt += nChilds;
> >> >> } else {
> >> >> // extend prefix
> >> >> uint8_t prefixRemEx[20];
> >> >> for (int i = 0; i < prefixlen; ++i)
> >> >> prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
> >> >> prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
> >> >> cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
> >> >> }
> >> >> }
> >> >> }
> >> >> return cnt;
> >> >>}
> >> >>
> >> >>static unsigned long long oneChildsInRange(int nDigits)
> >> >>{
> >> >> // initialize look-up tables
> >> >> tabs_t tabs;
> >> >> for (int i = 0; i < 19+10; ++i) {
> >> >> tabs.isChildTab[i] = i % nDigits == 0;
> >> >> tabs.x10remTab [i] = (i*10) % nDigits;
> >> >> }
> >> >>
> >> >> unsigned long long cnt = 0;
> >> >> for (int pref = 1; pref < 10; ++pref) {
> >> >> uint8_t prefixRem[1];
> >> >> prefixRem[0] = tabs.x10remTab[pref];
> >> >> cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
> >> >> }
> >> >>
> >> >> return cnt;
> >> >>}
> >> >>
> >> >>//-- end
> >> >>
> >> >>Unlike the previous code, this variant on x86-64 is faster when compiled
> >> >>with 'clang -march=native -O2'.
> >> >>gcc is significantly slower.
> >> >
> >> >OK, on my Mac Mini M1 (all Apple compilers are clang):
> >> >
> >> >---
> >> >m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
> >> >m1-mini-bash$ time ./michaels2 13
> >> >13 1057516028
> >> >
> >> >real 1m32.648s
> >> >user 1m32.311s
> >> >sys 0m0.098s
> >> >m1-mini-bash$ time ./michaels2 12
> >> >12 55121700430
> >> >
> >> >real 7m30.212s
> >> >user 7m29.783s
> >> >sys 0m0.432s
> >> >---
> >> And I copied over the x86 executable (I do this since then I'm sure Apple can
> >> in no way "cheat") and ran it:
> >>
> >> --
> >> m1-mini-bash$ time ./michaels2.x86 13
> >> 13 1057516028
> >>
> >> real 1m26.825s
> >> user 1m26.541s
> >> sys 0m0.108s
> >> m1-mini-bash$ time ./michaels2.x86 12
> >> 12 55121700430
> >>
> >> real 8m5.908s
> >> user 8m5.452s
> >> sys 0m0.465s
> >> --
> >>
> >> Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
> >> FASTER than the native compiled code. This did not happen for the "12"
> >> version (which runs much longer).
> >> >
> >> >And for comparison, my iMac Pro:
> >> >
> >> >---
> >> >imacpro-bash$ cc -O2 -o michaels2 michaels2.c
> >> >imacpro-bash$ time michaels2 13
> >> >13 1057516028
> >> >
> >> >real 1m50.732s
> >> >user 1m50.517s
> >> >sys 0m0.055s
> >> >imacpro-bash$ time michaels2 12
> >> >12 55121700430
> >> >
> >> >real 9m45.505s
> >> >user 9m45.260s
> >> >sys 0m0.181s
> >> >---
> >>
> >> Kent
> >
> > Diz JIT boy is a BAD mazafaka!
> >
> It's not JIT, it's M1; native code is faster than x86 :P
> bmaxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i -target x86_64-apple-darwin

To see the effect Kent observed, you should compile the x86-64 binary with "clang -O2 -march=skylake".
As I said above, for this particular program clang-x64 is measurably faster than gcc-x64.
That's assuming that your gcc is a real gcc rather than an alias for clang, as is typical on macOS.

> bmaxa@Branimirs-Air euler % time ./euler413i 13
> 13 1057516028
> ./euler413i 13 105.50s user 0.35s system 99% cpu 1:46.18 total
> bmaxa@Branimirs-Air euler % time ./euler413i 12
> 12 55121700430
> ./euler413i 12 567.55s user 1.88s system 99% cpu 9:29.62 total
> bmaxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i
> bmaxa@Branimirs-Air euler % time ./euler413i 13
> 13 1057516028
> ./euler413i 13 93.72s user 0.31s system 99% cpu 1:34.76 total
> bmaxa@Branimirs-Air euler % time ./euler413i 12
> 12 55121700430
> ./euler413i 12 458.93s user 1.39s system 99% cpu 7:40.35 total
>
>
> --
> something dumb

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 8:59:20 PM UTC+3, Branimir Maksimovic wrote:
>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> > On Thursday, July 8, 2021 at 7:36:04 PM UTC+3, Kent Dickey wrote:
>> >> In article <teidncg9kKcBvnr9...@giganews.com>,
>> >> Kent Dickey <ke...@provalid.com> wrote:
>> >> >In article <c5a7429b-b7e6-4fbc...@googlegroups.com>,
>> >> >Michael S <already...@yahoo.com> wrote:
>> >> >>In the meantime I simplified and improved this program.
>> >> >>With new variant nDigits=11 is no longer interesting as a benchmark (too
>> >> >>fast), but nDigits=12 and nDigits=13 are now well suited.
>> >> >>On my Xeon-E they took, respectively, 9m47.099s and 1m50.575s
>> >> >>Code:
>> >> >>
>> >> >>//-- beg
>> >> >>#include <stdint.h>
>> >> >>#include <stdio.h>
>> >> >>#include <stdlib.h>
>> >> >>#include <string.h>
>> >> >>
>> >> >>static unsigned long long oneChildsInRange(int nDigits);
>> >> >>int main(int argz, char** argv)
>> >> >>{
>> >> >> if (argz < 2) {
>> >> >> fprintf(stderr, "Usage:\n%s nDigits\n", argv[0]);
>> >> >> return 1;
>> >> >> }
>> >> >>
>> >> >> char* endp;
>> >> >> int nDigits = strtol(argv[1], &endp, 0);
>> >> >> if (endp == argv[1]) {
>> >> >> fprintf(stderr, "Bad nDigits argument '%s'. Not a number.\n", argv[1]);
>> >> >> return 1;
>> >> >> }
>> >> >>
>> >> >> if (nDigits < 11 || nDigits > 19) {
>> >> >> fprintf(stderr, "Please specify nDigits argument in range [11:19].\n");
>> >> >> return 1;
>> >> >> }
>> >> >>
>> >> >> printf("%2d %20llu\n", nDigits, oneChildsInRange(nDigits));
>> >> >> return 0;
>> >> >>}
>> >> >>
>> >> >>typedef struct {
>> >> >> uint8_t isChildTab[19+10]; // [i] = i % nDigits == 0
>> >> >> uint8_t x10remTab [19+10]; // [i] = (10*i) % nDigits
>> >> >>} tabs_t;
>> >> >>
>> >> >>static unsigned long long countChildsRecursive(
>> >> >> int prefix_nChilds, // 0 or 1
>> >> >> const uint8_t prefixRem[],
>> >> >> int prefixlen,
>> >> >> const tabs_t* tabs)
>> >> >>{
>> >> >> unsigned long long cnt = 0;
>> >> >> for (int suffix = prefix_nChilds; suffix < 10; ++suffix) {
>> >> >> int nChilds = suffix ? prefix_nChilds : 1;
>> >> >> const uint8_t *isChild = &tabs->isChildTab[suffix];
>> >> >> for (int i = 0; i < prefixlen; ++i)
>> >> >> nChilds += isChild[prefixRem[i]];
>> >> >>
>> >> >> if (nChilds < 2) {
>> >> >> if (tabs->isChildTab[prefixlen+1]) { // all digits processed
>> >> >> cnt += nChilds;
>> >> >> } else {
>> >> >> // extend prefix
>> >> >> uint8_t prefixRemEx[20];
>> >> >> for (int i = 0; i < prefixlen; ++i)
>> >> >> prefixRemEx[i] = tabs->x10remTab[prefixRem[i]+suffix];
>> >> >> prefixRemEx[prefixlen] = tabs->x10remTab[suffix];
>> >> >> cnt += countChildsRecursive(nChilds, prefixRemEx, prefixlen+1, tabs);
>> >> >> }
>> >> >> }
>> >> >> }
>> >> >> return cnt;
>> >> >>}
>> >> >>
>> >> >>static unsigned long long oneChildsInRange(int nDigits)
>> >> >>{
>> >> >> // initialize look-up tables
>> >> >> tabs_t tabs;
>> >> >> for (int i = 0; i < 19+10; ++i) {
>> >> >> tabs.isChildTab[i] = i % nDigits == 0;
>> >> >> tabs.x10remTab [i] = (i*10) % nDigits;
>> >> >> }
>> >> >>
>> >> >> unsigned long long cnt = 0;
>> >> >> for (int pref = 1; pref < 10; ++pref) {
>> >> >> uint8_t prefixRem[1];
>> >> >> prefixRem[0] = tabs.x10remTab[pref];
>> >> >> cnt += countChildsRecursive(0, prefixRem, 1, &tabs);
>> >> >> }
>> >> >>
>> >> >> return cnt;
>> >> >>}
>> >> >>
>> >> >>//-- end
>> >> >>
>> >> >>Unlike the previous code, this variant on x86-64 is faster when compiled
>> >> >>with 'clang -march=native -O2'.
>> >> >>gcc is significantly slower.
>> >> >
>> >> >OK, on my Mac Mini M1 (all Apple compilers are clang):
>> >> >
>> >> >---
>> >> >m1-mini-bash$ cc -O2 -o michaels2 michaels2.c
>> >> >m1-mini-bash$ time ./michaels2 13
>> >> >13 1057516028
>> >> >
>> >> >real 1m32.648s
>> >> >user 1m32.311s
>> >> >sys 0m0.098s
>> >> >m1-mini-bash$ time ./michaels2 12
>> >> >12 55121700430
>> >> >
>> >> >real 7m30.212s
>> >> >user 7m29.783s
>> >> >sys 0m0.432s
>> >> >---
>> >> And I copied over the x86 executable (I do this since then I'm sure Apple can
>> >> in no way "cheat") and ran it:
>> >>
>> >> --
>> >> m1-mini-bash$ time ./michaels2.x86 13
>> >> 13 1057516028
>> >>
>> >> real 1m26.825s
>> >> user 1m26.541s
>> >> sys 0m0.108s
>> >> m1-mini-bash$ time ./michaels2.x86 12
>> >> 12 55121700430
>> >>
>> >> real 8m5.908s
>> >> user 8m5.452s
>> >> sys 0m0.465s
>> >> --
>> >>
>> >> Yes, that's right, for the argument "13", the JIT x86 code on Apple M1 ran
>> >> FASTER than the native compiled code. This did not happen for the "12"
>> >> version (which runs much longer).
>> >> >
>> >> >And for comparison, my iMac Pro:
>> >> >
>> >> >---
>> >> >imacpro-bash$ cc -O2 -o michaels2 michaels2.c
>> >> >imacpro-bash$ time michaels2 13
>> >> >13 1057516028
>> >> >
>> >> >real 1m50.732s
>> >> >user 1m50.517s
>> >> >sys 0m0.055s
>> >> >imacpro-bash$ time michaels2 12
>> >> >12 55121700430
>> >> >
>> >> >real 9m45.505s
>> >> >user 9m45.260s
>> >> >sys 0m0.181s
>> >> >---
>> >>
>> >> Kent
>> >
>> > Diz JIT boy is a BAD mazafaka!
>> >
>> It's not JIT, it's M1; native code is faster than x86 :P
>> bmaxa@Branimirs-Air euler % gcc -O3 euler413i.c -o euler413i -target x86_64-apple-darwin
>
> To see the effect Kent observed, you should compile the x86-64 binary with "clang -O2 -march=skylake".
> As I said above, for this particular program clang-x64 is measurably faster than gcc-x64.
> That's assuming that your gcc is a real gcc rather than an alias for clang, as is typical on macOS.

I have a real gcc for ARM (gcc-11). Yes, in this particular case it is slower,
but most of the time it produces significantly faster code than Apple's clang.
And yes, gcc is an alias for clang on macOS, so when I want the real gcc I use the gcc-11 command.
Here is the result with your parameters passed to clang:
bmaxa@Branimirs-Air euler % gcc -O2 euler413i.c -o euler413i -march=skylake -target x86_64-apple-darwin
bmaxa@Branimirs-Air euler % time ./euler413i 13
13 1057516028
./euler413i 13 96.14s user 0.44s system 99% cpu 1:36.94 total
bmaxa@Branimirs-Air euler % time ./euler413i 12
12 55121700430
./euler413i 12 487.02s user 2.13s system 99% cpu 8:09.21 total


On 7/6/2021 10:07 PM, Kent Dickey wrote:
> In article <sc2qtq$5la$1@gioia.aioe.org>,
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 7/4/2021 5:05 AM, Branimir Maksimovic wrote:
>>> Fantastic chip, blows away my 2700X in single thread.
>>> 3950 scimark2 score vs 3050 2700X.
>>> But, chip scales badly. Tried WCG and when running 2 threads no slowdown.
>>> However with 3 threads slowdown is 25%, with 4 threads slowdown is ~50%.
>>> So chip is actually like i3 with two cores+HT.
>>> But it is much faster as single thread performance is fantastic.
>>> What is puzzling is that 5-7 thread does not have such slowdown,
>>> that is on low power cores performance loss per thread is more like 10-15%.
>>>
>>
>> Can you try to run the following program, and post your results? C++17...
>>
>> https://pastebin.com/raw/CYZ78gVj
>>
>> Should compile right up. It's a poor man's RCU, with debugging variables
>> turned on for a sanity check. Basically, a proxy collector to be able to
>> get RCU like abilities using 100% standard C++.
>
> This looks like a program which spawns dozens of threads. The M1 chip only
> has 8 cores (4 fast, 4 slow), so I'm not really sure what this benchmark
> shows. It may be that it does better with fewer cores (less contention),
> it may do worse.
[...]

Thank you! I am interested in the output. I should have some more time
tonight to analyze the results. A lot of contention is in the debug
parameters. Also, I have not yet padded and aligned data structures to
cache lines. However, I should have some more time in a couple of days
to post a new version that should run a heck of a lot more efficiently.
My poor man's RCU algorithm should be able to get some pretty decent
numbers without having to use kernel RCU. It's an analog in user space
that can be created in 100% pure C++.

One of my goals is to distribute it out so threads are not hammering the
same proxy collector.

The interesting part is going to be comparing the numbers you got vs the
new version.

Also, I am creating way too many threads. I need to trim that down to:

https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency

Also, the cache lines should be:

https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
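
A rough sketch of the two changes mentioned above (cap the thread count at what the hardware reports, and pad per-thread data to a cache line), assuming C++17. The names kCacheLine, PaddedCounter and clampWorkers are made up for illustration and are not taken from the pastebin program; the 64-byte fallback is likewise an assumption for standard libraries that do not define the interference-size constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <thread>

// Use the library-provided constant when it exists; otherwise fall back to
// an assumed 64 bytes (common, but not guaranteed on every CPU).
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kCacheLine =
    std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64;
#endif

// Hypothetical per-thread counter. alignas rounds both the alignment and the
// size up to kCacheLine, so adjacent array elements never share a line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long long value = 0;
};

// Clamp a requested worker count to [1, hardware_concurrency()].
// hardware_concurrency() is only a hint and may return 0 when unknown.
inline unsigned clampWorkers(unsigned requested) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;
    assert(hw >= 1);  // std::clamp requires lo <= hi
    return std::clamp(requested, 1u, hw);
}
```

A caller could then size its thread pool with clampWorkers(wanted) and keep one PaddedCounter per worker, so that neighbouring counters do not false-share a line.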

Re: My experience with Apple M1 chip

<0701eb1d-f5bd-4fce-84e2-c09455ca7c43n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=18530&group=comp.arch#18530


X-Received: by 2002:a05:622a:11c3:: with SMTP id n3mr30771116qtk.211.1625786787339;
Thu, 08 Jul 2021 16:26:27 -0700 (PDT)
X-Received: by 2002:aca:bfd6:: with SMTP id p205mr5487355oif.122.1625786787068;
Thu, 08 Jul 2021 16:26:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 16:26:26 -0700 (PDT)
In-Reply-To: <lvFFI.331$nj3.62@fx15.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.191; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.191
References: <T6VEI.159$VU3.17@fx46.iad> <19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad> <80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <A74FI.172$Yv3.30@fx41.iad>
<d5a8db38-b2b8-403d-9692-b4bdbd4388f0n@googlegroups.com> <xmDFI.1310$6U5.249@fx02.iad>
<84dc1c7c-4db9-4699-9fb0-a97a6cc3cf27n@googlegroups.com> <lvFFI.331$nj3.62@fx15.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0701eb1d-f5bd-4fce-84e2-c09455ca7c43n@googlegroups.com>
Subject: Re: My experience with Apple M1 chip
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 08 Jul 2021 23:26:27 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: Michael S - Thu, 8 Jul 2021 23:26 UTC

On Thursday, July 8, 2021 at 7:14:13 PM UTC+3, Branimir Maksimovic wrote:
> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> > On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> >> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
> >> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
> >> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
> >> >> > Michael S <already...@yahoo.com> wrote:
> >> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
> >> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
> >> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
> >> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
> >> >> >>either in single thread or in multithread throughput. Unless, of course,
> >> >> >>absolute majority of run time spent in native libraries.
> >> >> >>> You don't count that M1 is ~25-33% faster single core then any x86 :P
> >> >> >>
> >> >> >>I took it into account.
> >> >> >>
> >> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
> >> >> >>for *any* x86.
> >> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
> >> >> >>than top Comet Lake.
> >> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
> >> >> >>rather close.
> >> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
> >> >> >>Rocket Lake closely.
> >> >> >
> >> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
> >> >> > predict branches,
> >> >> Yes, my assembler doesn't work as well as on x86 ;p
> >> >> likes loop unrolling very much :p
> >> >> Got 10% just coknverting suboutine in macro and calling several times :P
> >> >> or working set in the 100-200KB range). It is not the
> >> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
> >> >> Sorry can't gree. Blows away x86 alright.
> >> >> At < 10W
> >> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
> >> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
> >> >> > command line, doesn't require me to install anything else, should run in less
> >> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
> >> >> > those results to any system you like. I don't think comparing optimized AVX
> >> >> > is going to be useful, but simple integer or floating point algorithms would
> >> >> > be best.
> >> >> It can compare with optimised AVX alright :P
> >> >>
> >> >
> >> > Do you want to test it?
> >> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrisinc) and the same algorithm in
> >> > plain C++ that can be, in theory, vectorized by good compiler.
> >> Of course, would be interresting how M1 does that.
> >
> > It's in my public github repo already5chosen/others under directory cholesky_solver.
> > You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
> > for the same algorithm in plain c++.
> > For speed measurement, I was mostly concerned with N=85.
> >
> > Unfortunately, there is a big chance that without additional explanations you will not be able proceed.
> > The repo was intended for myself so didn't contain comprehensive readme.
> > Tomorrow, or much later today, I could answer few questions, but right now I have to work. :(
> >
> >> >
> >> > The most interesting would be compiling for Intel Mac and then running binary through Rosetta, as Kent did for my first test.
> >> > I don't know if you can do it without Intel Mac.
> >> Apples gcc support cross compilation. I build x86 binaries no problem. Only thing is if they contain AVX code won't work
> >> as Rosetta does not supports AVX...
> >
> > That a pity.
> > I heard about it a year ago, but completely forgot.
> >
> >> >
> >> >
> >> >
> >> >> >
> >> >> > Kent
> >> >>
> >> >>
> >> >> --
> >> >> something dumb
> >>
> >>
> >> --
> >> something dumb
> bmaxa@Branimirs-Air cholesky_solver % g++-11 -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 9.845e-12. N= 128. 84.79 usec. 17.657 GMADD/s

That's pretty good, even if not as fast as optimized AVX2 on top x86.

> ./chol 128 1.48s user 0.02s system 85% cpu 1.752 total
> bmaxa@Branimirs-Air cholesky_solver % g++ -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 1.001e-11. N= 128. 258.35 usec. 5.795 GMADD/s
> ./chol 128 4.16s user 0.02s system 95% cpu 4.358 total

That's no good. But clang on x86 is equally bad.

>
> gcc is much faster then Apples clang so it seems in this case...
>
>
> --
> something dumb

Now you can compile and run outer_product_c2x2hiv on your Zen2 and compare the M1 with optimized AVX2 coded in intrinsic functions, on a CPU that is ~30% slower than the fastest x86 out there. The intrinsic version is still not as fast as real asm, which is another 10-15% faster, but I only have asm for Windows, so you can't test that. I expect 19-20 GMADD/s for N=128. My Xeon-E gets 20.5.
Just don't forget to specify -march=skylake on the gcc command line.

BTW, I am not that interested in N=128. For me the important data size is N=83.
As far as the goodness of a CPU goes, getting high GMADD/s for smaller values of N is more challenging. I also consider it more important,
not just for me, but for the majority of people who run linear algebra on general-purpose CPUs. That's because for bigger N, CPUs are
not competitive with GPUs anyway, but for N under 100 the CPU is often the fastest available engine.
I suspect that with smaller N the M1 will do relatively better.

On 7/6/2021 5:07 PM, Branimir Maksimovic wrote:
> On 2021-07-07, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 7/4/2021 5:05 AM, Branimir Maksimovic wrote:
>>> Fantastic chip, blows away my 2700X in single thread.
>>> 3950 scimark2 score vs 3050 2700X.
>>> But, chip scales badly. Tried WCG and when running 2 threads no slowdown.
>>> However with 3 threads slowdown iz 25%, with 4 threads slowdown is ~50%.
>>> So chip is actually like i3 with two cores+HT.
>>> But it is much faster as single thread perfomance is fantastic.
>>> What is puzzling is that 5-7 thread does not have such slowdown,
>>> that is on low power cores performance loss per thread is more like 10-15%.
>>>
>>
>> Can you try to run the following program, and post your results? C++17...
>>
>> https://pastebin.com/raw/CYZ78gVj
>>
>> Should compile right up. Its a poor mans RCU, with debugging variables
>> turned on for a sanity check. Basically, a proxy collector to be able to
>> get RCU like abilities using 100% standard C++.
> bmaxa@Branimirs-Air Chris % time ./test1
> Chris M. Thomassons Proxy Collector Port ver .0.0.2...
> _______________________________________
>
> Booting threads...
> Threads running...
> Threads completed!
>
> node_allocations = 92400000
> node_deallocations = 92400000
>
> dtor_collect = 249854
> release_collect = 905
> quiesce_complete = 250759
> quiesce_begin = 250759
> quiesce_complete_nodes = 92400000
>
>
> Test Completed!
>
> ./test1 136.73s user 1.26s system 657% cpu 20.992 total
>
>

Thank you! I will have another version that should get better numbers,
run faster. No debug, and aligned and padded structures. Also, it just
might be distributed.

Re: My experience with Apple M1 chip

<kgMFI.3932$Nq7.3401@fx33.iad>

https://www.novabbs.com/devel/article-flat.php?id=18532&group=comp.arch#18532

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx33.iad.POSTED!not-for-mail
Newsgroups: comp.arch
From: branimir...@gmail.com (Branimir Maksimovic)
Subject: Re: My experience with Apple M1 chip
References: <T6VEI.159$VU3.17@fx46.iad>
<19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad>
<80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <A74FI.172$Yv3.30@fx41.iad>
<d5a8db38-b2b8-403d-9692-b4bdbd4388f0n@googlegroups.com>
<xmDFI.1310$6U5.249@fx02.iad>
<84dc1c7c-4db9-4699-9fb0-a97a6cc3cf27n@googlegroups.com>
<lvFFI.331$nj3.62@fx15.iad>
<0701eb1d-f5bd-4fce-84e2-c09455ca7c43n@googlegroups.com>
User-Agent: slrn/1.0.3 (Darwin)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 136
Message-ID: <kgMFI.3932$Nq7.3401@fx33.iad>
X-Complaints-To: abuse@usenet-news.net
NNTP-Posting-Date: Thu, 08 Jul 2021 23:56:00 UTC
Organization: usenet-news.net
Date: Thu, 08 Jul 2021 23:56:00 GMT
X-Received-Bytes: 8708

by: Branimir Maksimovic - Thu, 8 Jul 2021 23:56 UTC

On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, July 8, 2021 at 7:14:13 PM UTC+3, Branimir Maksimovic wrote:
>> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> > On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
>> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
>> >> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
>> >> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>> >> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
>> >> >> > Michael S <already...@yahoo.com> wrote:
>> >> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
>> >> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
>> >> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
>> >> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
>> >> >> >>either in single thread or in multithread throughput. Unless, of course,
>> >> >> >>absolute majority of run time spent in native libraries.
>> >> >> >>> You don't count that M1 is ~25-33% faster single core then any x86 :P
>> >> >> >>
>> >> >> >>I took it into account.
>> >> >> >>
>> >> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
>> >> >> >>for *any* x86.
>> >> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
>> >> >> >>than top Comet Lake.
>> >> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
>> >> >> >>rather close.
>> >> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
>> >> >> >>Rocket Lake closely.
>> >> >> >
>> >> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
>> >> >> > predict branches,
>> >> >> Yes, my assembler doesn't work as well as on x86 ;p
>> >> >> likes loop unrolling very much :p
>> >> >> Got 10% just coknverting suboutine in macro and calling several times :P
>> >> >> or working set in the 100-200KB range). It is not the
>> >> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
>> >> >> Sorry can't gree. Blows away x86 alright.
>> >> >> At < 10W
>> >> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
>> >> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
>> >> >> > command line, doesn't require me to install anything else, should run in less
>> >> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
>> >> >> > those results to any system you like. I don't think comparing optimized AVX
>> >> >> > is going to be useful, but simple integer or floating point algorithms would
>> >> >> > be best.
>> >> >> It can compare with optimised AVX alright :P
>> >> >>
>> >> >
>> >> > Do you want to test it?
>> >> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrisinc) and the same algorithm in
>> >> > plain C++ that can be, in theory, vectorized by good compiler.
>> >> Of course, would be interresting how M1 does that.
>> >
>> > It's in my public github repo already5chosen/others under directory cholesky_solver.
>> > You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
>> > for the same algorithm in plain c++.
>> > For speed measurement, I was mostly concerned with N=85.
>> >
>> > Unfortunately, there is a big chance that without additional explanations you will not be able proceed.
>> > The repo was intended for myself so didn't contain comprehensive readme.
>> > Tomorrow, or much later today, I could answer few questions, but right now I have to work. :(
>> >
>> >> >
>> >> > The most interesting would be compiling for Intel Mac and then running binary through Rosetta, as Kent did for my first test.
>> >> > I don't know if you can do it without Intel Mac.
>> >> Apples gcc support cross compilation. I build x86 binaries no problem. Only thing is if they contain AVX code won't work
>> >> as Rosetta does not supports AVX...
>> >
>> > That a pity.
>> > I heard about it a year ago, but completely forgot.
>> >
>> >> >
>> >> >
>> >> >
>> >> >> >
>> >> >> > Kent
>> >> >>
>> >> >>
>> >> >> --
>> >> >> something dumb
>> >>
>> >>
>> >> --
>> >> something dumb
>> bmaxa@Branimirs-Air cholesky_solver % g++-11 -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
>> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
>> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 9.845e-12. N= 128. 84.79 usec. 17.657 GMADD/s
>
> That's prety good, even if not as fast as optimized AVX2 on top x86.
>
>> ./chol 128 1.48s user 0.02s system 85% cpu 1.752 total
>> bmaxa@Branimirs-Air cholesky_solver % g++ -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
>> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
>> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 1.001e-11. N= 128. 258.35 usec. 5.795 GMADD/s
>> ./chol 128 4.16s user 0.02s system 95% cpu 4.358 total
>
> That's no good. But clang on x86 is equally bad.
>
>>
>> gcc is much faster then Apples clang so it seems in this case...
>>
>>
>> --
>> something dumb
>
> Now you can compile and run outer_product_c2x2hiv on you Zen2 and compare M1 with optimized AVX2 coded in intrinsic functions (which is still not as fast as real asm, which is another 10-15% faster, but I only have asm for Windows, so you can't test) on CPU that is ~30% slower than the fastest x86 out here. I expect 19-20 GMADD/s for N=128. My Xeon-E gets 20.5.
> Just don't forget to specify -march=skylake in gcc command line.
>
> BTW, I am not that interested in N=128. For me an important data size is N=83.
> As far as goodness of the CPU goes, getting high GMADD/s for smaller values of N is more challenging. Also, I consider it more important,
> not just for me, but for majority of people that run Linear Algebra on general-purpose CPUs. That's because for bigger Ns CPUs are
> not competitive with GPUs anyway, but for N under 100, CPU is often the fastest available engine.
> I suspect that with smaller N M1 will do relatively better.
>
>
On Air with heavy load, ~2500 MHz:
bmaxa@Branimirs-Air cholesky_solver % time ./chol 83
Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.484e-11. N= 83. 108.82 usec. 3.887 GMADD/s
./chol 83 4.20s user 0.01s system 93% cpu 4.503 total

On 2700X with heavy load, ~2200 MHz:
/.../cholesky_solver/outer_product_c2x2hiv >>> time ./chol 83 ±[●●][master]
Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.370e-11. N= 83. 80.11 usec. 5.280 GMADD/s
./chol 83 3.36s user 0.02s system 99% cpu 3.407 total

So the 2700X is better in this case, but wait, let's try clang:
~/.../cholesky_solver/outer_product_c2x2hiv >>> time ./chol 83 ±[●●][master]
Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.523e-11. N= 83. 78.16 usec. 5.412 GMADD/s
./chol 83 4.32s user 0.03s system 99% cpu 4.385 total

So clang is slower on both x86 and M1 in this case, but the x86 has 8 threads busy with WCG
and the M1 has just two threads busy with WCG :P
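
As a cross-check on the N=83 figures above (a sketch, not part of the benchmark): each output line reports both a time per solve and a rate, so their product is the operation count the benchmark assumes. The helper names are hypothetical; the inputs are copied from the three runs as posted.

```cpp
// 1 GMADD/s sustained for 1 usec is 1e9 * 1e-6 = 1e3 multiply-adds.
constexpr double implied_madds(double usec, double gmadd_per_s) {
    return usec * gmadd_per_s * 1e3;
}

// How many times longer run A takes per solve than run B.
constexpr double time_ratio(double usec_a, double usec_b) {
    return usec_a / usec_b;
}

// N=83 figures as posted above.
constexpr double m1_madds    = implied_madds(108.82, 3.887);  // M1 (Air)
constexpr double zen_madds   = implied_madds(80.11, 5.280);   // 2700X, first run
constexpr double zen_madds2  = implied_madds(78.16, 5.412);   // 2700X, clang run
constexpr double m1_vs_2700x = time_ratio(108.82, 80.11);     // about 1.36
```

All three products come out near 4.23e5 multiply-adds and agree to within about 0.01%, so the runs did the same amount of work and the GMADD/s figures are directly comparable. On these numbers the M1 run took about 1.36 times as long per solve as the first 2700X run. The comparison is rough: the two machines were under different WCG load, and the post does not say which build of ./chol the M1 run used.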

--
something dumb

Re: My experience with Apple M1 chip

<954859f4-e053-4338-9f58-bbd4f7d02c9an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18537&group=comp.arch#18537

X-Received: by 2002:a05:620a:126e:: with SMTP id b14mr34200846qkl.36.1625793144807;
Thu, 08 Jul 2021 18:12:24 -0700 (PDT)
X-Received: by 2002:aca:dbd6:: with SMTP id s205mr148934oig.155.1625793144554;
Thu, 08 Jul 2021 18:12:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 8 Jul 2021 18:12:24 -0700 (PDT)
In-Reply-To: <kgMFI.3932$Nq7.3401@fx33.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.191; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.191
References: <T6VEI.159$VU3.17@fx46.iad> <19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad> <80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com> <A74FI.172$Yv3.30@fx41.iad>
<d5a8db38-b2b8-403d-9692-b4bdbd4388f0n@googlegroups.com> <xmDFI.1310$6U5.249@fx02.iad>
<84dc1c7c-4db9-4699-9fb0-a97a6cc3cf27n@googlegroups.com> <lvFFI.331$nj3.62@fx15.iad>
<0701eb1d-f5bd-4fce-84e2-c09455ca7c43n@googlegroups.com> <kgMFI.3932$Nq7.3401@fx33.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <954859f4-e053-4338-9f58-bbd4f7d02c9an@googlegroups.com>
Subject: Re: My experience with Apple M1 chip
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 09 Jul 2021 01:12:24 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: Michael S - Fri, 9 Jul 2021 01:12 UTC

On Friday, July 9, 2021 at 2:56:04 AM UTC+3, Branimir Maksimovic wrote:
> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> > On Thursday, July 8, 2021 at 7:14:13 PM UTC+3, Branimir Maksimovic wrote:
> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> >> > On Thursday, July 8, 2021 at 4:48:16 PM UTC+3, Branimir Maksimovic wrote:
> >> >> On 2021-07-08, Michael S <already...@yahoo.com> wrote:
> >> >> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
> >> >> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
> >> >> >> > In article <80ac50a0-dde2-4a66...@googlegroups.com>,
> >> >> >> > Michael S <already...@yahoo.com> wrote:
> >> >> >> >>On Tuesday, July 6, 2021 at 7:16:01 PM UTC+3, Branimir Maksimovic wrote:
> >> >> >> >>> On 2021-07-06, Michael S <already...@yahoo.com> wrote:
> >> >> >> >>> > I can believe that M1@3.2 GHz/Rosetta is able to run x64 software as
> >> >> >> >>fast as i3-8100B but have trouble believing that it could match i7-8700B
> >> >> >> >>either in single thread or in multithread throughput. Unless, of course,
> >> >> >> >>absolute majority of run time spent in native libraries.
> >> >> >> >>> You don't count that M1 is ~25-33% faster single core then any x86 :P
> >> >> >> >>
> >> >> >> >>I took it into account.
> >> >> >> >>
> >> >> >> >>Besides, while it's true for x86 CPUs in prev-gen Mac-Mini it's not true
> >> >> >> >>for *any* x86.
> >> >> >> >>M1 is slower than top Zen3 bins and about the same or a little slower
> >> >> >> >>than top Comet Lake.
> >> >> >> >>Probably somewhat slower than top Tiger Lake, but that comparison is
> >> >> >> >>rather close.
> >> >> >> >>Probably, measurably slower than top Rocket Lake, but I didn't look at
> >> >> >> >>Rocket Lake closely.
> >> >> >> >
> >> >> >> > I have a Mac Mini M1, and it seems fast--very fast for some workloads (hard to
> >> >> >> > predict branches,
> >> >> >> Yes, my assembler doesn't work as well as on x86 ;p
> >> >> >> likes loop unrolling very much :p
> >> >> >> Got 10% just coknverting suboutine in macro and calling several times :P
> >> >> >> or working set in the 100-200KB range). It is not the
> >> >> >> > fastest CPU on the planet, but it likely is the fastest laptop CPU.
> >> >> >> Sorry can't gree. Blows away x86 alright.
> >> >> >> At < 10W
> >> >> >> > at the AC plug it compares pretty favorably to 60W CPUs. If you have a
> >> >> >> > relatively short benchmark (say, one file, C or C++, can be run from the Unix
> >> >> >> > command line, doesn't require me to install anything else, should run in less
> >> >> >> > than 5 minutes), I can compile it and run it for you, and then you can compare
> >> >> >> > those results to any system you like. I don't think comparing optimized AVX
> >> >> >> > is going to be useful, but simple integer or floating point algorithms would
> >> >> >> > be best.
> >> >> >> It can compare with optimised AVX alright :P
> >> >> >>
> >> >> >
> >> >> > Do you want to test it?
> >> >> > I have a linear algebra core (Cholesky decomposition) coded in optimised AVX (Intel intrisinc) and the same algorithm in
> >> >> > plain C++ that can be, in theory, vectorized by good compiler.
> >> >> Of course, would be interresting how M1 does that.
> >> >
> >> > It's in my public github repo already5chosen/others under directory cholesky_solver.
> >> > You can try it yourself, e.g. outer_product_c2x2hiv for intrinsic-based variant vs outer_product_c2x2hi
> >> > for the same algorithm in plain c++.
> >> > For speed measurement, I was mostly concerned with N=85.
> >> >
> >> > Unfortunately, there is a big chance that without additional explanations you will not be able proceed.
> >> > The repo was intended for myself so didn't contain comprehensive readme.
> >> > Tomorrow, or much later today, I could answer few questions, but right now I have to work. :(
> >> >
> >> >> >
> >> >> > The most interesting would be compiling for Intel Mac and then running binary through Rosetta, as Kent did for my first test.
> >> >> > I don't know if you can do it without Intel Mac.
> >> >> Apples gcc support cross compilation. I build x86 binaries no problem. Only thing is if they contain AVX code won't work
> >> >> as Rosetta does not supports AVX...
> >> >
> >> > That a pity.
> >> > I heard about it a year ago, but completely forgot.
> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> > Kent
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> something dumb
> >> >>
> >> >>
> >> >> --
> >> >> something dumb
> >> bmaxa@Branimirs-Air cholesky_solver % g++-11 -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> >> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> >> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 9.845e-12. N= 128. 84.79 usec. 17.657 GMADD/s
> >
> > That's prety good, even if not as fast as optimized AVX2 on top x86.
> >
> >> ./chol 128 1.48s user 0.02s system 85% cpu 1.752 total
> >> bmaxa@Branimirs-Air cholesky_solver % g++ -O3 outer_product_c2x2hi/chol.cpp main.cpp -o chol -std=c++11
> >> bmaxa@Branimirs-Air cholesky_solver % time ./chol 128
> >> Layout='R'. N= 128. max. err: decomposition 5.116e-13, solver 1.001e-11. N= 128. 258.35 usec. 5.795 GMADD/s
> >> ./chol 128 4.16s user 0.02s system 95% cpu 4.358 total
> >
> > That's no good. But clang on x86 is equally bad.
> >
> >>
> >> gcc is much faster then Apples clang so it seems in this case...
> >>
> >>
> >> --
> >> something dumb
> >
> > Now you can compile and run outer_product_c2x2hiv on you Zen2 and compare M1 with optimized AVX2 coded in intrinsic functions (which is still not as fast as real asm, which is another 10-15% faster, but I only have asm for Windows, so you can't test) on CPU that is ~30% slower than the fastest x86 out here. I expect 19-20 GMADD/s for N=128. My Xeon-E gets 20.5.
> > Just don't forget to specify -march=skylake in gcc command line.
> >
> > BTW, I am not that interested in N=128. For me an important data size is N=83.
> > As far as goodness of the CPU goes, getting high GMADD/s for smaller values of N is more challenging. Also, I consider it more important,
> > not just for me, but for majority of people that run Linear Algebra on general-purpose CPUs. That's because for bigger Ns CPUs are
> > not competitive with GPUs anyway, but for N under 100, CPU is often the fastest available engine.
> > I suspect that with smaller N M1 will do relatively better.
> >
> >
> On Air with heavy load ~2500Mhz
> maxa@Branimirs-Air cholesky_solver % time ./chol 83
> Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.484e-11. N= 83. 108.82 usec. 3.887 GMADD/s
> ./chol 83 4.20s user 0.01s system 93% cpu 4.503 total
>
> On 2700X with heavy load ~2200Mhz
> /.../cholesky_solver/outer_product_c2x2hiv >>> time ./chol 83 ±[●●][master]
> Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.370e-11. N= 83. 80.11 usec. 5.280 GMADD/s
> ./chol 83 3.36s user 0.02s system 99% cpu 3.407 total
>
> So 2700X is better in this case, but wait, let's try clang
> ~/.../cholesky_solver/outer_product_c2x2hiv >>> time ./chol 83 ±[●●][master]
> Layout='R'. N= 83. max. err: decomposition 3.126e-13, solver 1.523e-11. N= 83. 78.16 usec. 5.412 GMADD/s
> ./chol 83 4.32s user 0.03s system 99% cpu 4.385 total
>
> So clang is slower for both x86 and M1 in this case, but x86 is 8 threads busy with WCG
> and M1 is just two threads busy with WCG :P
>

Re: My experience with Apple M1 chip

<sc8nf1$tmi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18541&group=comp.arch#18541

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: My experience with Apple M1 chip
Date: Fri, 9 Jul 2021 07:39:44 +0200
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <sc8nf1$tmi$1@dont-email.me>
References: <T6VEI.159$VU3.17@fx46.iad>
<19dcc459-6eb5-4191-a186-c50d12ed347fn@googlegroups.com>
<2l%EI.8$gE.2@fx21.iad>
<80ac50a0-dde2-4a66-b09c-62663cd5b4aan@googlegroups.com>
<SJGdnUhI6OG5N3n9nZ2dnUU7-ffNnZ2d@giganews.com>
<187875de-0cd7-4e6e-b4a9-71a9eb1f5527n@googlegroups.com>
<zQ5FI.779$VU3.610@fx46.iad> <2021Jul7.122809@mips.complang.tuwien.ac.at>
<YigFI.3408$Nq7.1210@fx33.iad> <2021Jul8.084725@mips.complang.tuwien.ac.at>
<jwva6mwprzp.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 9 Jul 2021 05:39:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c1e782a2bc08943000c770b237edd756";
logging-data="30418"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/9vvqpqDDUqYOSFy7MrVdrS5pUAkHbWV8="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:rzFRKGbY8Xky2doscaZ2VFUiwmM=
In-Reply-To: <jwva6mwprzp.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US

by: Marcus - Fri, 9 Jul 2021 05:39 UTC

On 2021-07-08, Stefan Monnier wrote:
>> The M1 is pretty good. Too bad it's only available in Apple products.
>
> Yes, this raises some philosophical/political issues for me, but the
> more real problems are technical: AFAIU you can't buy a machine with
> such a CPU where the RAM and SSD can be later upgraded.
> That's a kind of planned obsolescence that I find very problematic, also
> from an environmental point of view.
>

I agree, but OTOH it's the clear trend in the industry. E.g. try
upgrading the RAM or storage of a smartphone, or even replacing the
battery. SoC and tighter integration seem to be the way to go.

I believe that there's actually a technical reason for the locked RAM
size - by putting the DRAM so close to the CPU silicon they can probably
get better memory performance (but it could also just be a cost thing,
I don't know).

/Marcus

jgd@cix.co.uk (John Dallman) writes:
>In article <2021Jul8.164728@mips.complang.tuwien.ac.at>,
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Branimir Maksimovic <branimir.maksimovic@gmail.com> writes:
>> >Worse is that I can't install Linux at this time
>>
>> Yes, because it's only available in Apple products.
>>
>> I admit to getting an iBook G4 in 2004, but at that time you could
>> install Linux on it. Mine never had MacOS installed.
>
>There is a crowdfunded Linux project underway for the M1, and it seems to
>be making reasonable progress. While some of Apple's decisions have not
>been especially helpful for the Linux work, they have not locked down the
>bootloader, so there's no need for jailbreaking.
>
>https://www.patreon.com/marcan

If Apple really wanted to sell hardware, they would support such an
effort. If they don't, I go along with their wishes and won't buy
their stuff. Efforts without manufacturer support tend to be less and
less effective as hardware becomes more complicated. E.g., look at
the Nouveau driver for Nvidia graphics cards.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Branimir Maksimovic <branimir.maksimovic@gmail.com> wrote:
> On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>
> >
> > The most interesting would be compiling for Intel Mac and then running binary through Rosetta, as Kent did for my first test.
> > I don't know if you can do it without Intel Mac.
>
> Apples gcc support cross compilation. I build x86 binaries no problem. Only thing is if they contain AVX code won't work
> as Rosetta does not supports AVX...

GCC has supported cross compilation for ages. The problem is/was with
the assembler and linker. About 15 years ago I looked at this. At
that time Apple used a hacked GNU assembler and published the sources.
But the sources depended on Apple header files, and I know of nobody who
succeeded in compiling them without a Mac. There have been substantial
changes in the Apple toolchain, but did the essential part change?
More precisely, can you get an assembler and linker targeting the Mac
on other platforms?

--
Waldek Hebisch

On 2021-07-10, antispam@math.uni.wroc.pl <antispam@math.uni.wroc.pl> wrote:
> Branimir Maksimovic <branimir.maksimovic@gmail.com> wrote:
>> On 2021-07-08, Michael S <already5chosen@yahoo.com> wrote:
>> > On Wednesday, July 7, 2021 at 12:43:00 AM UTC+3, Branimir Maksimovic wrote:
>> >> On 2021-07-06, Kent Dickey <ke...@provalid.com> wrote:
>>
>> >
>> > The most interesting would be compiling for Intel Mac and then running binary through Rosetta, as Kent did for my first test.
>> > I don't know if you can do it without Intel Mac.
>>
>> Apples gcc support cross compilation. I build x86 binaries no problem. Only thing is if they contain AVX code won't work
>> as Rosetta does not supports AVX...
>
> Gcc supported cross compilation for ages. The problem is/was with
> assembler and linker. About 15 years ago I looked at this. At
> that time Apple used hacked GNU assembler and they published sources.
They have it alright, I use it to program the M1 :P
It is the only choice for the M1 right now :P
I tried to cope with fasmg, but it seems it does not produce valid binaries yet.
And I am not keen to code that myself :P
So GNU as it is :P

> But the sources depended on Apple header files, and I know of nobody who
> succeeded in compiling them without a Mac. There have been substantial
> changes in the Apple toolchain, but did the essential part change?
> More precisely, can you get an assembler and linker targeting Mac
> on other platforms?

fasmg works all right (for x86).

--
bmaxa now listens Last Rockers by Vice Squad from Punk and Disorderly Volume 1

Leveraging always beats prototyping.
