novaBBS - comp.arch - Odd-length Data Once Again

Re: Odd-length Data Once Again

<so2v75$v44$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=22175&group=comp.arch#22175

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-282-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Odd-length Data Once Again
Date: Mon, 29 Nov 2021 16:30:29 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <so2v75$v44$1@newsreader4.netcologne.de>
References: <9ec1b1f6-d685-4e27-83c1-506f90f6d3f2n@googlegroups.com>
<ee025a51-ec51-4c05-afba-14dac73a7bben@googlegroups.com>
<345a3478-acfc-4a13-a11b-4756b560a66an@googlegroups.com>
<fd376748-c164-4449-8215-29c4c92bf34dn@googlegroups.com>
<20bc194b-6eb2-41ed-b8f5-cf98d000b134n@googlegroups.com>
<bacef91a-d240-4e8b-9e36-53956f58dd40n@googlegroups.com>
<095a1b86-b4ad-438c-9051-747c8dbd8252n@googlegroups.com>
<so0fi9$1olt$1@gioia.aioe.org>
<2021Nov29.134003@mips.complang.tuwien.ac.at>
Injection-Date: Mon, 29 Nov 2021 16:30:29 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-282-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:282:0:7285:c2ff:fe6c:992d";
logging-data="31876"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Mon, 29 Nov 2021 16:30 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>Since then every compiler where I have checked the asm/machine code
>>output will use effectively the same approach, sometimes simplifying it
>>slightly for smaller divisors.
>
> For u/7 on AMD64 (latency measured on Skylake) there are at least the
> following sequences:
>
> gcc Robison/fish ertl
> latency=8c latency=6.25c latency=6c
> %Cl=0x4924924924924925
> %C=0x2492492492492493 %C=0x9249249249249249 %Ch=0x2492492492492492
> movabs $C,%rdx movabs $C,%rax movabs $Cl,%rax
> mov %rdi,%rax mov %rax,%rcx mul %rdi
> mul %rdx mul %rdi mov %rdx,%rcx
> sub %rdx,%rdi add %rcx,%rax movabs $Ch,%rax
> shr %rdi adc $0x0,%rdx mul %rdi
> lea (%rdx,%rdi),%rax shr $0x2,%rdx add %rcx,%rax
> shr $0x2,%rax mov %rdx,%rax adc $0x0,%rdx
> mov %rdx,%rax

Makes me wonder... what do you think of the 128-bit division and
remainder by some constants with gcc11, like

unsigned int divrem_10 (__uint128_t x, __uint128_t *div)
{ *div = x / 10;
return x % 10;
}

? (see https://godbolt.org/z/czqWEn8jo ).

Re: Odd-length Data Once Again

<2021Nov29.180631@mips.complang.tuwien.ac.at>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=22176&group=comp.arch#22176

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Odd-length Data Once Again
Date: Mon, 29 Nov 2021 17:06:31 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 14
Message-ID: <2021Nov29.180631@mips.complang.tuwien.ac.at>
References: <9ec1b1f6-d685-4e27-83c1-506f90f6d3f2n@googlegroups.com> <345a3478-acfc-4a13-a11b-4756b560a66an@googlegroups.com> <fd376748-c164-4449-8215-29c4c92bf34dn@googlegroups.com> <20bc194b-6eb2-41ed-b8f5-cf98d000b134n@googlegroups.com> <bacef91a-d240-4e8b-9e36-53956f58dd40n@googlegroups.com> <095a1b86-b4ad-438c-9051-747c8dbd8252n@googlegroups.com> <so0fi9$1olt$1@gioia.aioe.org> <2021Nov29.134003@mips.complang.tuwien.ac.at> <so2r6a$lhn$1@gioia.aioe.org>
Injection-Info: reader02.eternal-september.org; posting-host="ef6c9b24978eca151723df781a349157";
logging-data="5541"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Vf5dF/SGp06xLYyS5m8Dv"
Cancel-Lock: sha1:8YsT8YMr+ZVrG0Q94MQneBzIGVE=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Mon, 29 Nov 2021 17:06 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>I suspect that for divisors with less than max significant bits, your
>version will also generate a sufficiently accurate factional part which,
>after multiplication, can generate the remainder from the division.

Interesting idea, but I did not investigate it; the classical approach
of multiplying the quotient by the divisor and subtracting the result
from the dividend looked good enough (and it's not clear if it's
slower).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Odd-length Data Once Again

<so38vd$6r4$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=22177&group=comp.arch#22177

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Odd-length Data Once Again
Date: Mon, 29 Nov 2021 20:17:01 +0100
Organization: Aioe.org NNTP Server
Message-ID: <so38vd$6r4$1@gioia.aioe.org>
References: <9ec1b1f6-d685-4e27-83c1-506f90f6d3f2n@googlegroups.com>
<ee025a51-ec51-4c05-afba-14dac73a7bben@googlegroups.com>
<345a3478-acfc-4a13-a11b-4756b560a66an@googlegroups.com>
<fd376748-c164-4449-8215-29c4c92bf34dn@googlegroups.com>
<20bc194b-6eb2-41ed-b8f5-cf98d000b134n@googlegroups.com>
<bacef91a-d240-4e8b-9e36-53956f58dd40n@googlegroups.com>
<095a1b86-b4ad-438c-9051-747c8dbd8252n@googlegroups.com>
<so0fi9$1olt$1@gioia.aioe.org> <2021Nov29.134003@mips.complang.tuwien.ac.at>
<so2v75$v44$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="7012"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Mon, 29 Nov 2021 19:17 UTC

Thomas Koenig wrote:
> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>> Since then every compiler where I have checked the asm/machine code
>>> output will use effectively the same approach, sometimes simplifying it
>>> slightly for smaller divisors.
>>
>> For u/7 on AMD64 (latency measured on Skylake) there are at least the
>> following sequences:
>>
>> gcc Robison/fish ertl
>> latency=8c latency=6.25c latency=6c
>> %Cl=0x4924924924924925
>> %C=0x2492492492492493 %C=0x9249249249249249 %Ch=0x2492492492492492
>> movabs $C,%rdx movabs $C,%rax movabs $Cl,%rax
>> mov %rdi,%rax mov %rax,%rcx mul %rdi
>> mul %rdx mul %rdi mov %rdx,%rcx
>> sub %rdx,%rdi add %rcx,%rax movabs $Ch,%rax
>> shr %rdi adc $0x0,%rdx mul %rdi
>> lea (%rdx,%rdi),%rax shr $0x2,%rdx add %rcx,%rax
>> shr $0x2,%rax mov %rdx,%rax adc $0x0,%rdx
>> mov %rdx,%rax
>
> Makes me wonder... what do you think of the 128-bit division and
> remainder by some constants with gcc11, like
>
> unsigned int divrem_10 (__uint128_t x, __uint128_t *div)
> {
> *div = x / 10;
> return x % 10;
> }
>
> ? (see https://godbolt.org/z/czqWEn8jo ).
>

It looks like 25-30 cycles of latency on a 3+ wide core?

It isn't horrible, but it also isn't the best we can do:

What I think is that this particular code will pretty much always occur
in the context of transforming multiple digits to decimal, in which case
I am sure I could outperform a loop of this code by a big factor using
my divide & conquer approach:

An uint128 can require up to 39 decimal digits, so we start by splitting
the input into 13-digit chunks with a constant div-mod by 10^13, then we
can extract all 39 digits in parallel:

Start by scaling each chunk by 2^60/1e12, leaving the top digit in the
top 4 bits of a 64-bit register, then repeatedly mask away those 4 bits,
multiply the remainder by 5 (a single-cycle LEA) and move the mask down
one bit position, so effectively a mul-by-10.

Since all these operations are fast and unrolled by the three parallel
chunks, we'll get close to a digit/cycle, possibly staying under 100
cycles for the full conversion?

An easier approach will do the same using divrem_1e13() then convert
each chunk using standard 64-bit operations:

void u128_to_ascii(char *buf, int buflen, __uint128_t n)
{ if (buflen < 40) return null;
__uint128_t t;
uint64_t c0, c1, c2;

t = n / (__uint128_t) 10000000000000LL;
c0 = (uint64_t) n % 10000000000000LL;

c2 = (uint64_t) t / (__uint128_t) 10000000000000LL;
c0 = (uint64_t) t % 10000000000000LL;

sprintf(buf,"%013d%013d%013d", c2, c1, c0);

return buf;
}

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Odd-length Data Once Again

<so394s$6r4$2@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=22178&group=comp.arch#22178

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Odd-length Data Once Again
Date: Mon, 29 Nov 2021 20:19:56 +0100
Organization: Aioe.org NNTP Server
Message-ID: <so394s$6r4$2@gioia.aioe.org>
References: <9ec1b1f6-d685-4e27-83c1-506f90f6d3f2n@googlegroups.com>
<345a3478-acfc-4a13-a11b-4756b560a66an@googlegroups.com>
<fd376748-c164-4449-8215-29c4c92bf34dn@googlegroups.com>
<20bc194b-6eb2-41ed-b8f5-cf98d000b134n@googlegroups.com>
<bacef91a-d240-4e8b-9e36-53956f58dd40n@googlegroups.com>
<095a1b86-b4ad-438c-9051-747c8dbd8252n@googlegroups.com>
<so0fi9$1olt$1@gioia.aioe.org> <2021Nov29.134003@mips.complang.tuwien.ac.at>
<so2r6a$lhn$1@gioia.aioe.org> <2021Nov29.180631@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="7012"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Mon, 29 Nov 2021 19:19 UTC

Anton Ertl wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>> I suspect that for divisors with less than max significant bits, your
>> version will also generate a sufficiently accurate factional part which,
>> after multiplication, can generate the remainder from the division.
>
> Interesting idea, but I did not investigate it; the classical approach
> of multiplying the quotient by the divisor and subtracting the result
> from the dividend looked good enough (and it's not clear if it's
> slower).

It is probably just marginally slower, by needing a couple more
housekeeping instructions, but nothing significant.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Odd-length Data Once Again

<so3er4$alu$1@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=22179&group=comp.arch#22179

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-282-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Odd-length Data Once Again
Date: Mon, 29 Nov 2021 20:57:09 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <so3er4$alu$1@newsreader4.netcologne.de>
References: <9ec1b1f6-d685-4e27-83c1-506f90f6d3f2n@googlegroups.com>
<ee025a51-ec51-4c05-afba-14dac73a7bben@googlegroups.com>
<345a3478-acfc-4a13-a11b-4756b560a66an@googlegroups.com>
<fd376748-c164-4449-8215-29c4c92bf34dn@googlegroups.com>
<20bc194b-6eb2-41ed-b8f5-cf98d000b134n@googlegroups.com>
<bacef91a-d240-4e8b-9e36-53956f58dd40n@googlegroups.com>
<095a1b86-b4ad-438c-9051-747c8dbd8252n@googlegroups.com>
<so0fi9$1olt$1@gioia.aioe.org>
<2021Nov29.134003@mips.complang.tuwien.ac.at>
<so2v75$v44$1@newsreader4.netcologne.de> <so38vd$6r4$1@gioia.aioe.org>
Injection-Date: Mon, 29 Nov 2021 20:57:09 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-282-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:282:0:7285:c2ff:fe6c:992d";
logging-data="10942"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Mon, 29 Nov 2021 20:57 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>> Terje Mathisen <terje.mathisen@tmsw.no> writes:
>>>> Since then every compiler where I have checked the asm/machine code
>>>> output will use effectively the same approach, sometimes simplifying it
>>>> slightly for smaller divisors.
>>>
>>> For u/7 on AMD64 (latency measured on Skylake) there are at least the
>>> following sequences:
>>>
>>> gcc Robison/fish ertl
>>> latency=8c latency=6.25c latency=6c
>>> %Cl=0x4924924924924925
>>> %C=0x2492492492492493 %C=0x9249249249249249 %Ch=0x2492492492492492
>>> movabs $C,%rdx movabs $C,%rax movabs $Cl,%rax
>>> mov %rdi,%rax mov %rax,%rcx mul %rdi
>>> mul %rdx mul %rdi mov %rdx,%rcx
>>> sub %rdx,%rdi add %rcx,%rax movabs $Ch,%rax
>>> shr %rdi adc $0x0,%rdx mul %rdi
>>> lea (%rdx,%rdi),%rax shr $0x2,%rdx add %rcx,%rax
>>> shr $0x2,%rax mov %rdx,%rax adc $0x0,%rdx
>>> mov %rdx,%rax
>>
>> Makes me wonder... what do you think of the 128-bit division and
>> remainder by some constants with gcc11, like
>>
>> unsigned int divrem_10 (__uint128_t x, __uint128_t *div)
>> {
>> *div = x / 10;
>> return x % 10;
>> }
>>
>> ? (see https://godbolt.org/z/czqWEn8jo ).
>>
>
> It looks like 25-30 cycles of latency on a 3+ wide core?
>
> It isn't horrible, but it also isn't the best we can do:
>
> What I think is that this particular code will pretty much always occur
> in the context of transforming multiple digits to decimal, in which case
> I am sure I could outperform a loop of this code by a big factor using
> my divide & conquer approach:

Actually, truth be told, this was partially inspired by my search
to find non-trivial numbers whose sum of digits is the same for
every prime number up to 17 (you may remember the discussion about
this some time ago).

I've found 70911040973874056146188543 and 77332999599545910254098143
with an algorithm that used the approach above, which gained me
an OEIS entry (even if it is one of the "less interesting" ones).

> An uint128 can require up to 39 decimal digits, so we start by splitting
> the input into 13-digit chunks with a constant div-mod by 10^13, then we
> can extract all 39 digits in parallel:

Hm, yes, that approach could have sped up my calculations somewhat :-)

> Start by scaling each chunk by 2^60/1e12, leaving the top digit in the
> top 4 bits of a 64-bit register, then repeatedly mask away those 4 bits,
> multiply the remainder by 5 (a single-cycle LEA) and move the mask down
> one bit position, so effectively a mul-by-10.
>
> Since all these operations are fast and unrolled by the three parallel
> chunks, we'll get close to a digit/cycle, possibly staying under 100
> cycles for the full conversion?

I am tempted to put this approach libgfortran, but unfortunately
the general overhad of Fortran I/O is very high.

> An easier approach will do the same using divrem_1e13() then convert
> each chunk using standard 64-bit operations:
>
> void u128_to_ascii(char *buf, int buflen, __uint128_t n)
> {
> if (buflen < 40) return null;
> __uint128_t t;
> uint64_t c0, c1, c2;
>
> t = n / (__uint128_t) 10000000000000LL;
> c0 = (uint64_t) n % 10000000000000LL;
>
> c2 = (uint64_t) t / (__uint128_t) 10000000000000LL;
> c0 = (uint64_t) t % 10000000000000LL;
>
> sprintf(buf,"%013d%013d%013d", c2, c1, c0);
>
> return buf;
> }

That, as well.

"Why should we subsidize intellectual curiosity?" -- Ronald Reagan

devel / comp.arch / Odd-length Data Once Again

Subject	Author
Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	Thomas Koenig
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	MitchAlsup
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	JimBrakefield
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	MitchAlsup
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	JimBrakefield
Re: Odd-length Data Once Again	BGB
Re: Odd-length Data Once Again	Quadibloc
Re: Odd-length Data Once Again	BGB
Re: Odd-length Data Once Again	Thomas Koenig
Re: Odd-length Data Once Again	BGB
Re: Odd-length Data Once Again	Thomas Koenig
Re: Odd-length Data Once Again	BGB
Re: Odd-length Data Once Again	Terje Mathisen
Re: Odd-length Data Once Again	BGB
Division by constants: (was: Odd-length Data Once Again)	Anton Ertl
Re: Odd-length Data Once Again	Terje Mathisen
Re: Odd-length Data Once Again	Anton Ertl
Re: Odd-length Data Once Again	Terje Mathisen
Re: Odd-length Data Once Again	Anton Ertl
Re: Odd-length Data Once Again	Terje Mathisen
Re: Odd-length Data Once Again	Thomas Koenig
Re: Odd-length Data Once Again	Terje Mathisen
Re: Odd-length Data Once Again	Thomas Koenig