Welcome to novaBBS (click a section below)

mail files register newsreader groups login

Message-ID:

The universe is all a spin-off of the Big Bang.

Re: (FP)MADD and data scheduling

Subject	Author
MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	antispam
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	Thomas Koenig
Re: MADD instruction (integer multiply and add)	antispam
Re: MADD instruction (integer multiply and add)	aph
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	Thomas Koenig
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	Thomas Koenig
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	EricP
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	robf...@gmail.com
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	Terje Mathisen
Re: MADD instruction (integer multiply and add)	Thomas Koenig
Re: MADD instruction (integer multiply and add)	Terje Mathisen
Re: MADD instruction (integer multiply and add)	Thomas Koenig
Re: MADD instruction (integer multiply and add)	Theo
Re: MADD instruction (integer multiply and add)	Terje Mathisen
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	Terje Mathisen
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	BGB
Re: MADD instruction (integer multiply and add)	antispam
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	MitchAlsup
Re: MADD instruction (integer multiply and add)	Marcus
Re: MADD instruction (integer multiply and add)	EricP
Re: MADD instruction (integer multiply and add)	BGB
(FP)MADD and data scheduling	Stefan Monnier
Re: (FP)MADD and data scheduling	EricP
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Stefan Monnier
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Stefan Monnier
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Stefan Monnier
Re: (FP)MADD and data scheduling	Stefan Monnier
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Ivan Godard
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Ivan Godard
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Terje Mathisen
Re: (FP)MADD and data scheduling	MitchAlsup
Re: (FP)MADD and data scheduling	Terje Mathisen
Re: (FP)MADD and data scheduling	Thomas Koenig
Re: (FP)MADD and data scheduling	MitchAlsup
Re: MADD instruction (integer multiply and add)	Marcus

Pages:123

Re: MADD instruction (integer multiply and add)

<sltrii$mf5$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21845&group=comp.arch#21845

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Wed, 3 Nov 2021 12:25:04 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sltrii$mf5$1@gioia.aioe.org>
References: <slm4ja$e0b$1@dont-email.me> <sltom7$14sa$1@gioia.aioe.org>
<sltqf4$nck$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="23013"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Wed, 3 Nov 2021 11:25 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>
>> On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
>> that is probably fast enough, and it provides the most general building
>> block:
>>
>> mov rax,b
>> mul c ;; 4 or 5 cycles
>>
>> add rax,a ;; 1 cycle
>>
>> adc rdx,0
>> add rax,carry ;; 1 cycle
>>
>> adc rdx,0 ;; 1 cycle
>>
>> Doing it in hardware is as you note almost free, zero or one additional
>> cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
>> or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
>> general CPU, even if you cheat and make both outputs implied. :-(
>
> Digging through the POWER ISA... in 3.0B aka POWER9 you can do
> (RA, RB, RC, RT1 and RT2 refer to suitably defined registers)
>
> addic RC, RC, 0 ! RC = RC + Carry
> maddhdu RT1, RA, RB, RC ! RT1 = high(RA*RB + RC)
> maddld RT2, RA, RB, RC ! RT2 = low(RA*RB + RC)
>
That is pretty nice as long as the maddhdu and maddld can overlap for
all or all_minus_1 cycles.

I used the wrong term for 'carry' above here, it is another full 64-bit
input, so the code above would not work the same way.

The idea is of course that 64+(64*64)+64 can maximally result in a
128-bit result with all bits set, i.e. no overflow is possible.

The mul-add-add is however a near-perfect building block for
crypto-class intermediate length (256-4096 bits) bigint math.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<sltuj8$qlk$1@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21846&group=comp.arch#21846

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Wed, 3 Nov 2021 12:16:40 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sltuj8$qlk$1@newsreader4.netcologne.de>
References: <slm4ja$e0b$1@dont-email.me> <sltom7$14sa$1@gioia.aioe.org>
<sltqf4$nck$1@newsreader4.netcologne.de> <sltrii$mf5$1@gioia.aioe.org>
Injection-Date: Wed, 3 Nov 2021 12:16:40 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:a34:0:7285:c2ff:fe6c:992d";
logging-data="27316"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Wed, 3 Nov 2021 12:16 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>
>>> On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
>>> that is probably fast enough, and it provides the most general building
>>> block:
>>>
>>> mov rax,b
>>> mul c ;; 4 or 5 cycles
>>>
>>> add rax,a ;; 1 cycle
>>>
>>> adc rdx,0
>>> add rax,carry ;; 1 cycle
>>>
>>> adc rdx,0 ;; 1 cycle
>>>
>>> Doing it in hardware is as you note almost free, zero or one additional
>>> cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
>>> or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
>>> general CPU, even if you cheat and make both outputs implied. :-(
>>
>> Digging through the POWER ISA... in 3.0B aka POWER9 you can do
>> (RA, RB, RC, RT1 and RT2 refer to suitably defined registers)
>>
>> addic RC, RC, 0 ! RC = RC + Carry
>> maddhdu RT1, RA, RB, RC ! RT1 = high(RA*RB + RC)
>> maddld RT2, RA, RB, RC ! RT2 = low(RA*RB + RC)
>>
> That is pretty nice as long as the maddhdu and maddld can overlap for
> all or all_minus_1 cycles.

There is no reason why these two instructions could not run in
parallel, though - there is no dependency between them.

> I used the wrong term for 'carry' above here, it is another full 64-bit
> input, so the code above would not work the same way.

Hm... so the question is when to add the carry. Doing this
at the end (probably the sane way) would lead to

maddld RT_low, RA, RB, RC
maddhdu RT_high, RA, RB, RC
add. RT_low, R_Carry
addic RT_high, RT_high, 0

An alternative might be

add. RA, R_Carry
maddld RT_low, RA, RB, RC
maddhdu RT_high, RA, RB, RC
addic RT_high, RT_high, 0

but I am not sure which version would be better, if there is
a difference at all.

> The idea is of course that 64+(64*64)+64 can maximally result in a
> 128-bit result with all bits set, i.e. no overflow is possible.
>
> The mul-add-add is however a near-perfect building block for
> crypto-class intermediate length (256-4096 bits) bigint math.

Which is why IBM added it in POWER9 (although they called it
blockchain or something like that - marketing).

Re: MADD instruction (integer multiply and add)

<cPf*Mrryy@news.chiark.greenend.org.uk>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21865&group=comp.arch#21865

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!.POSTED!not-for-mail
From: theom+n...@chiark.greenend.org.uk (Theo)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: 04 Nov 2021 21:33:14 +0000 (GMT)
Organization: University of Cambridge, England
Lines: 26
Message-ID: <cPf*Mrryy@news.chiark.greenend.org.uk>
References: <slm4ja$e0b$1@dont-email.me>
NNTP-Posting-Host: chiark.greenend.org.uk
X-Trace: chiark.greenend.org.uk 1636061596 28085 212.13.197.229 (4 Nov 2021 21:33:16 GMT)
X-Complaints-To: abuse@chiark.greenend.org.uk
NNTP-Posting-Date: Thu, 4 Nov 2021 21:33:16 +0000 (UTC)
User-Agent: tin/1.8.3-20070201 ("Scotasay") (UNIX) (Linux/3.16.0-11-amd64 (x86_64))
Originator: theom@chiark.greenend.org.uk ([212.13.197.229])

by: Theo - Thu, 4 Nov 2021 21:33 UTC

Marcus <m.delete@this.bitsnbites.eu> wrote:
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

32-bit ARM has a MLA instruction. This came in in ARM2, which added a
hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
bearing in mind there's no cache so it depends on DRAM for instruction
fetch), although it implies they can terminate early. I think the add
therefore comes 'free' in terms of cycles.

So it appeared in ARM architecture v2 and has stuck around since. I think I
remember Sophie Wilson saying it was added to aid in graphics, although
somebody else suggests sound.

> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.

Indeed, it can cut down the instruction overhead of DSP things (and sound
and graphics) by quite a bit.

Theo

Re: MADD instruction (integer multiply and add)

<sm2j3m$e7q$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21874&group=comp.arch#21874

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Fri, 5 Nov 2021 07:31:17 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sm2j3m$e7q$1@gioia.aioe.org>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="14586"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Fri, 5 Nov 2021 06:31 UTC

Theo wrote:
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>> MADD instruction (the addend is set to zero).
>>
>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>
> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> bearing in mind there's no cache so it depends on DRAM for instruction
> fetch), although it implies they can terminate early. I think the add
> therefore comes 'free' in terms of cycles.
>
> So it appeared in ARM architecture v2 and has stuck around since. I think I
> remember Sophie Wilson saying it was added to aid in graphics, although
> somebody else suggests sound.
>
>> It seems to me that it's a very useful instruction, and that it has
>> a relatively small hardware cost.
>
> Indeed, it can cut down the instruction overhead of DSP things (and sound
> and graphics) by quite a bit.

That's not true, at least not for the standard single-wide version: The
latency of the MUL is typically 4-5 cycles (or even more) while an ADD
is 1 cycle, so that's the only saving in this case.

Adding one or two register-size chunks to an N*N->2N MUL is however
quite useful, even more so on architectures without carry flag(s).

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<oPmdnVGtNe_BaBn8nZ2dnUU78VXNnZ2d@supernews.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21879&group=comp.arch#21879

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!border1.nntp.ams1.giganews.com!nntp.giganews.com!buffer1.nntp.ams1.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Fri, 05 Nov 2021 04:11:56 -0500
Sender: Andrew Haley <aph@zarquon.pink>
From: aph...@littlepinkcloud.invalid
Subject: Re: MADD instruction (integer multiply and add)
Newsgroups: comp.arch
References: <slm4ja$e0b$1@dont-email.me> <slm94k$pqk$1@z-news.wcss.wroc.pl> <slmcja$5ph$1@dont-email.me> <slmoso$b12$1@z-news.wcss.wroc.pl>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-305.25.1.el8_4.x86_64 (x86_64))
Message-ID: <oPmdnVGtNe_BaBn8nZ2dnUU78VXNnZ2d@supernews.com>
Date: Fri, 05 Nov 2021 04:11:56 -0500
Lines: 16
X-Trace: sv3-gxfb7IzO6kWLlNFlhYmM7UkKbIn6ld6aHHO7tKB+uSi0rlO47HB4+Eo510zDcHTdFrPqKUORtDbAAhV!vxzo2KnAOhWBPH+SmeSmL/sSkm6pKdM43Eck5dmaoWE+dWAuFd5jllhjRsCYwfkqMyQCA1i1uaI2!g4XporY5
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 1773

by: aph...@littlepinkcloud.invalid - Fri, 5 Nov 2021 09:11 UTC

antispam@math.uni.wroc.pl wrote:
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>> Which instruction is that? I know about FMA3 for floating-point, but how
>> about integer?
>
> Well, I was tinking about floating-point. However, there is old
> PMADDWD instruction. It is limited to take 4 packed 16-bit numbers
> and produces 2 dot products a0*b0 + a1*b1 and a2*b2 + a3*b3. 64-bit
> version (128-bit dot products of pairs of 64-bit numbers) of this
> would be nice...

VPMADD52LUQ is the useful one. Packed Multiply of Unsigned 52-bit
Integers and Add the Low 52-bit Products to Qword
Accumulators. AVX-512, I'm afraid.

Andrew.

Re: MADD instruction (integer multiply and add)

<87c18710-5ceb-44c5-ba4d-056042030784n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21892&group=comp.arch#21892

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:620a:2903:: with SMTP id m3mr33917157qkp.452.1636133123250;
Fri, 05 Nov 2021 10:25:23 -0700 (PDT)
X-Received: by 2002:a05:6808:128d:: with SMTP id a13mr23177716oiw.51.1636133122973;
Fri, 05 Nov 2021 10:25:22 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 5 Nov 2021 10:25:22 -0700 (PDT)
In-Reply-To: <sm2j3m$e7q$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <87c18710-5ceb-44c5-ba4d-056042030784n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 05 Nov 2021 17:25:23 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 60

by: MitchAlsup - Fri, 5 Nov 2021 17:25 UTC

On Friday, November 5, 2021 at 1:31:21 AM UTC-5, Terje Mathisen wrote:
> Theo wrote:
> > Marcus <m.de...@this.bitsnbites.eu> wrote:
> >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> >> MADD instruction (the addend is set to zero).
> >>
> >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> >
> > 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> > hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> > and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> > bearing in mind there's no cache so it depends on DRAM for instruction
> > fetch), although it implies they can terminate early. I think the add
> > therefore comes 'free' in terms of cycles.
> >
> > So it appeared in ARM architecture v2 and has stuck around since. I think I
> > remember Sophie Wilson saying it was added to aid in graphics, although
> > somebody else suggests sound.
> >
> >> It seems to me that it's a very useful instruction, and that it has
> >> a relatively small hardware cost.
> >
> > Indeed, it can cut down the instruction overhead of DSP things (and sound
> > and graphics) by quite a bit.
> That's not true, at least not for the standard single-wide version: The
> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> is 1 cycle, so that's the only saving in this case.
>
> Adding one or two register-size chunks to an N*N->2N MUL is however
> quite useful, even more so on architectures without carry flag(s).
<
Why stop there::
<
void Long_multiplication( uint64_t multiplicand[],
multiplier[],
sum[],
ilength, jlength )
{ for( uint64_t i = 0;
i < (ilength + jlength);
i++ )
sum[i] = 0;

for( uint64_t acarry = j = 0; j < jlength; j++ )
{
for( uint64_t mcarry = i = 0; i < ilength; i++ )
{
{mcarry, product} = multiplicand[i]*multiplier[j]
+ mcarry;
{acarry,sum[i+j]} = {sum[i+j]+acarry} + product;
}
}
}

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<sm47vq$ta7$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21895&group=comp.arch#21895

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Fri, 5 Nov 2021 16:32:35 -0500
Organization: A noiseless patient Spider
Lines: 217
Message-ID: <sm47vq$ta7$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 5 Nov 2021 21:33:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="3f43b7b0bac6dfa7ddfd57dcc0f1744c";
logging-data="30023"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/25EqtgHTcd+TqUX6B9sMX"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:q4geTuXKQ86F0Ww13r2SRnduI8w=
In-Reply-To: <sm2j3m$e7q$1@gioia.aioe.org>
Content-Language: en-US

by: BGB - Fri, 5 Nov 2021 21:32 UTC

On 11/5/2021 1:31 AM, Terje Mathisen wrote:
> Theo wrote:
>> Marcus <m.delete@this.bitsnbites.eu> wrote:
>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>>> MADD instruction (the addend is set to zero).
>>>
>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>
>> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
>> hardware multiply after ARM1 went without. The ARM2 datasheet says
>> both MUL
>> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait
>> states,
>> bearing in mind there's no cache so it depends on DRAM for instruction
>> fetch), although it implies they can terminate early. I think the add
>> therefore comes 'free' in terms of cycles.
>>
>> So it appeared in ARM architecture v2 and has stuck around since. I
>> think I
>> remember Sophie Wilson saying it was added to aid in graphics, although
>> somebody else suggests sound.
>>
>>> It seems to me that it's a very useful instruction, and that it has
>>> a relatively small hardware cost.
>>
>> Indeed, it can cut down the instruction overhead of DSP things (and sound
>> and graphics) by quite a bit.
>
> That's not true, at least not for the standard single-wide version: The
> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> is 1 cycle, so that's the only saving in this case.
>

Such is the issue in my case:
MUL is 1-3 cycles in my case (32x32=>64);
MUL is also relatively infrequent in my tests;
MAC saves typically 1 cycle over MUL+ADD;
...

I didn't see much gain from adding it, because there are generally just
not enough integer multiplies to begin with.

Most of the integer MUL+ADD were in the form of "arrays of structs" and
similar, which benefited primarily from an "Rn=Rp+Rm*Imm" instruction.

However, even with this instruction able to fairly effectively handle
this use-case, there wasn't enough of them in the general case to make
much visible impact on performance.

One could argue that it could be "better" to have a full 64-bit hardware
multiply and not "fake it" with 32-bit multiply ops, however, the logic
for "faking it" fits into the pipeline moderately well. The most obvious
addition would be if there were a DMULS.L variant which could take the
high halves of a register (and avoid the need for shift ops).

Say (64x64=>128):
MOV 0, R17 | DMULUH.L R4, R5, R16 //1c, Lo(R4) * Hi(R5)
MOV 0, R19 | DMULUH.L R5, R4, R18 //1c, Lo(R5) * Hi(R4)
DMULSHH.L R4, R5, R3 //1c, Hi(R4) * Hi(R5)
DMULU.L R4, R5, R2 //1c, Lo(R4) * Lo(R5)
ADDX R16, R18, R20 //2c (interlock)
SHLDX R20, 32, R20 //1c (*2)
ADDX R2, R20, R2 //1c
RTS //2c (predicted)

Time: ~ 10 clock cycles (excluding function-call overheads).

Otherwise, as-is, one would need to spend an extra clock cycle on a pair
of "SHLD.Q" operations or similar.

*2: Newly added encoding.

The closest exception to this (DMACS.L being "kinda pointless") was
Dhrystone, which has some multidimensional arrays, which were used
highly enough to have a visible effect, but even then it was still
fairly modest.

Dhrystone score is still pretty weak though (~ 69k ATM; 0.79 DMIPS/MHz).

Though, this would seem to still be pretty solid at least if going by
"vintage stats" (eg; results for this benchmark as posted back in the
1990s).

Testing on my PC, there is actually a fairly large difference between
compilers and between optimization settings when it comes to this
benchmark (and both GCC and Clang seem to give notably higher numbers
than MSVC on this test; need to compare "-Os" vs "/O2" or similar to
give MSVC much hope of winning this one).

Still not really gotten "RISC-V Mode" working well enough to test an
RV64I build of Dhrystone to see how it compares; with both versions
running on the same hardware (could confirm or deny the "GCC is using
arcane magic on this benchmark" hypothesis).

Though, probably doesn't help that BGBCC kinda sucks even vs MSVC when
it comes to things like register allocation (despite x86-64 having half
as many registers, MSVC still somewhat beats BGBCC at the "not spilling
registers to memory all the time" game).

I actually got a lot more performance gains more recently mostly by some
fiddly with the register allocation logic (and was, for the first time
in a while, actually able to get a visible improvement in terms of Doom
framerate, *).

*: Doom is now ~ 15-25 (for the most part), now occasionally hitting the
30 fps limiter, and no longer dropping into single-digit territory.

This was a tweak that mostly eliminated a certain amount of "double
loading" (loading the same value from memory into multiple registers
using multiple memory loads). As well as some "spill value to memory,
immediately reload into another register" cases, ...

Mostly this was by adding logic to check whether a given variable would
be referenced again within the same basic-block, preferentially loading
a value into a register and working on the in-register version if this
was the case, or preferentially spilling to or operating on memory (via
scratch registers if needed) if this variable will not be used again
within the same basic block (the previous register allocation logic did
not use any sort of "forward looking" behavior).

This was along with recently running into some code (while working on
DMACS.L), which was handling scaled-addressing by acting as if it were
still generating code for SuperH, namely trying to build index-scale
from fixed-shift operators and ADD operations, and seemingly unaware
that the ISA now has 3R and 3RI operations (*3).

Could still be better here.

*3: Like, BJX2 is well past the stage of needing to do things like:
MOV RsrcA, Rdst
ADD RsrcB, Rdst
....

Or, trying to implement things like Rd=Rs*1280 as:
MOV Rs, Rd
SHLL2 Rd
ADD Rs, Rd
SHLL8 Rd

Just sorta used "#if 0" on a lot of this, since a 3RI MUL is now the
faster option (but, there are still some amount of "dark corners" like
this in the codegen I guess). Granted, the ISA had been enough of a
moving target that much of the codegen is sort of like a mass of layers
partly divided along different parts of the ISAs development.

So, some high-level parts of the codegen still pretend they are
targeting SuperH, then emit instructions with encodings for a much
earlier version of the ISA, which are then bit-twiddled into their newer
encodings. Some amount of it should probably be rewritten, but endless
"quick and dirty hacks" was an easier prospect than "just go and rewrite
all this from a clean slate".

Well, and if I did this, almost may as well go and finally banish the
"XML Demon" from the compiler frontend (because, well, using DOM as the
basis for ones' C compiler AST system was not such a great idea in
retrospect; have spent over a decade dealing with the fallout from this
decision, but it never being quite bad enough to justify "throw it out
and rewrite the whole thing from the ground up").

Though, my younger self was from an era when XML and SQL and similar
were hyped as the silver bullets to end all of ones' woes (well, also
Java; but my younger self was put off enough by how painful and awkward
it was, and at the time its performance was kinda trash which didn't
exactly help matters). Actually, this seems to be a common theme with
this era, most of the "silver bullet" technologies were about like
asking someone to build a house with an oversized lead mallet.

I guess this also slightly reduces the ranking position of "MOV Rm, Rn",
but there is still a lot more 'MOV' than there probably should be.

Probably still need to work on things, eg:
Trivial functions referenced via function pointers should probably not
create stack frames and save/restore GBR;
....

> Adding one or two register-size chunks to an N*N->2N MUL is however
> quite useful, even more so on architectures without carry flag(s).
>

The widening MUL is the default in my case; the narrow MUL is actually
implemented in hardware by taking the widening version and then sign or
zero extending the result.

Note that sign and zero extending arithmetic operators are fairly useful
in terms of keeping old C code behaving as expected. Some amount of the
code I am working with is prone to misbehave if integer values go "out
of range" rather than implementing modulo 2^32 wrapping behavior; which
in turn means either needing any operations which are prone to produce
out-of-range results to either have sign/zero extended versions, or
needing to insert explicit sign or zero extensions.

Click here to read the complete article

Re: MADD instruction (integer multiply and add)

<sm5nfh$or0$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21899&group=comp.arch#21899

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sat, 6 Nov 2021 12:04:16 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sm5nfh$or0$1@gioia.aioe.org>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<87c18710-5ceb-44c5-ba4d-056042030784n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="25440"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Sat, 6 Nov 2021 11:04 UTC

MitchAlsup wrote:
> On Friday, November 5, 2021 at 1:31:21 AM UTC-5, Terje Mathisen wrote:
>> Theo wrote:
>>> Marcus <m.de...@this.bitsnbites.eu> wrote:
>>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>>>> MADD instruction (the addend is set to zero).
>>>>
>>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>
>>> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
>>> hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
>>> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
>>> bearing in mind there's no cache so it depends on DRAM for instruction
>>> fetch), although it implies they can terminate early. I think the add
>>> therefore comes 'free' in terms of cycles.
>>>
>>> So it appeared in ARM architecture v2 and has stuck around since. I think I
>>> remember Sophie Wilson saying it was added to aid in graphics, although
>>> somebody else suggests sound.
>>>
>>>> It seems to me that it's a very useful instruction, and that it has
>>>> a relatively small hardware cost.
>>>
>>> Indeed, it can cut down the instruction overhead of DSP things (and sound
>>> and graphics) by quite a bit.
>> That's not true, at least not for the standard single-wide version: The
>> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
>> is 1 cycle, so that's the only saving in this case.
>>
>> Adding one or two register-size chunks to an N*N->2N MUL is however
>> quite useful, even more so on architectures without carry flag(s).
> <
> Why stop there::
> <
> void Long_multiplication( uint64_t multiplicand[],
> multiplier[],
> sum[],
> ilength, jlength )
> {
> for( uint64_t i = 0;
> i < (ilength + jlength);
> i++ )
> sum[i] = 0;
>
> for( uint64_t acarry = j = 0; j < jlength; j++ )
> {
> for( uint64_t mcarry = i = 0; i < ilength; i++ )
> {
> {mcarry, product} = multiplicand[i]*multiplier[j]
> + mcarry;
> {acarry,sum[i+j]} = {sum[i+j]+acarry} + product;
> }
> }
> }

That looks a lot like a crypto-size bigint MUL, it can of course be
synthesized very easily if you have that MUL-ADD-ADD intrinsic, er even
better, the My66000 CARRY feature.

BTW, I assume you would normally have code at the end to detect
overflow, i.e. a non-null final mcarry?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<c53b8dce-409a-4f52-8be3-a1e76b19681dn@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21901&group=comp.arch#21901

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:620a:27c3:: with SMTP id i3mr21562083qkp.442.1636218823158;
Sat, 06 Nov 2021 10:13:43 -0700 (PDT)
X-Received: by 2002:a9d:5c18:: with SMTP id o24mr9895289otk.243.1636218822915;
Sat, 06 Nov 2021 10:13:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 6 Nov 2021 10:13:42 -0700 (PDT)
In-Reply-To: <sm5nfh$or0$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <87c18710-5ceb-44c5-ba4d-056042030784n@googlegroups.com>
<sm5nfh$or0$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c53b8dce-409a-4f52-8be3-a1e76b19681dn@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 06 Nov 2021 17:13:43 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Sat, 6 Nov 2021 17:13 UTC

On Saturday, November 6, 2021 at 6:04:20 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, November 5, 2021 at 1:31:21 AM UTC-5, Terje Mathisen wrote:
> >> Theo wrote:
> >>> Marcus <m.de...@this.bitsnbites.eu> wrote:
> >>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> >>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> >>>> MADD instruction (the addend is set to zero).
> >>>>
> >>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> >>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> >>>
> >>> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> >>> hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> >>> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> >>> bearing in mind there's no cache so it depends on DRAM for instruction
> >>> fetch), although it implies they can terminate early. I think the add
> >>> therefore comes 'free' in terms of cycles.
> >>>
> >>> So it appeared in ARM architecture v2 and has stuck around since. I think I
> >>> remember Sophie Wilson saying it was added to aid in graphics, although
> >>> somebody else suggests sound.
> >>>
> >>>> It seems to me that it's a very useful instruction, and that it has
> >>>> a relatively small hardware cost.
> >>>
> >>> Indeed, it can cut down the instruction overhead of DSP things (and sound
> >>> and graphics) by quite a bit.
> >> That's not true, at least not for the standard single-wide version: The
> >> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> >> is 1 cycle, so that's the only saving in this case.
> >>
> >> Adding one or two register-size chunks to an N*N->2N MUL is however
> >> quite useful, even more so on architectures without carry flag(s).
> > <
> > Why stop there::
> > <
> > void Long_multiplication( uint64_t multiplicand[],
> > multiplier[],
> > sum[],
> > ilength, jlength )
> > {
> > for( uint64_t i = 0;
> > i < (ilength + jlength);
> > i++ )
> > sum[i] = 0;
> >
> > for( uint64_t acarry = j = 0; j < jlength; j++ )
> > {
> > for( uint64_t mcarry = i = 0; i < ilength; i++ )
> > {
> > {mcarry, product} = multiplicand[i]*multiplier[j]
> > + mcarry;
> > {acarry,sum[i+j]} = {sum[i+j]+acarry} + product;
> > }
> > }
> > }
> That looks a lot like a crypto-size bigint MUL, it can of course be
> synthesized very easily if you have that MUL-ADD-ADD intrinsic, er even
> better, the My66000 CARRY feature.
>
> BTW, I assume you would normally have code at the end to detect
> overflow, i.e. a non-null final mcarry?
<
As written it is a large×large multiply in unsigned with a large+large
product:: does not overflow.
<
Could easily be changed to signed and detect OVERFLOW into the
highest significant container.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<sm6qi3$fe6$2@z-news.wcss.wroc.pl>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21906&group=comp.arch#21906

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.ams4!peer.am4.highwinds-media.com!news.highwinds-media.com!newsfeed.neostrada.pl!unt-exc-01.news.neostrada.pl!newsfeed.pionier.net.pl!pwr.wroc.pl!news.wcss.wroc.pl!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sat, 6 Nov 2021 21:02:59 +0000 (UTC)
Organization: Politechnika Wroclawska
Lines: 40
Message-ID: <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
NNTP-Posting-Host: hera.math.uni.wroc.pl
X-Trace: z-news.wcss.wroc.pl 1636232579 15814 156.17.86.1 (6 Nov 2021 21:02:59 GMT)
X-Complaints-To: abuse@news.pwr.wroc.pl
NNTP-Posting-Date: Sat, 6 Nov 2021 21:02:59 +0000 (UTC)
Cancel-Lock: sha1:F8xAa1GuROnG2QhucmDzQJLEln0=
User-Agent: tin/2.4.3-20181224 ("Glen Mhor") (UNIX) (Linux/4.19.0-10-amd64 (x86_64))
X-Received-Bytes: 3027

by: antis...@math.uni.wroc.pl - Sat, 6 Nov 2021 21:02 UTC

Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> Theo wrote:
> > Marcus <m.delete@this.bitsnbites.eu> wrote:
> >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> >> MADD instruction (the addend is set to zero).
> >>
> >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> >
> > 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> > hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> > and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> > bearing in mind there's no cache so it depends on DRAM for instruction
> > fetch), although it implies they can terminate early. I think the add
> > therefore comes 'free' in terms of cycles.
> >
> > So it appeared in ARM architecture v2 and has stuck around since. I think I
> > remember Sophie Wilson saying it was added to aid in graphics, although
> > somebody else suggests sound.
> >
> >> It seems to me that it's a very useful instruction, and that it has
> >> a relatively small hardware cost.
> >
> > Indeed, it can cut down the instruction overhead of DSP things (and sound
> > and graphics) by quite a bit.
>
> That's not true, at least not for the standard single-wide version: The
> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> is 1 cycle, so that's the only saving in this case.

Typical application of muladd is accumulating dot product-like things.
In such case what matter is throughput of multiply and latency of
add. In fact, by using multiple acumulators one can compensate
for latency of add. If muladd has higher throughput, then it is
highly useful. I would hope for muladd that has 1 cycle latency
with respect to added argument, so that one could execute chained
muladd-s one per clock, but maybe this is too much...
--
Waldek Hebisch

Re: MADD instruction (integer multiply and add)

<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21907&group=comp.arch#21907

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:d45:: with SMTP id 66mr21172954qkn.395.1636234542737;
Sat, 06 Nov 2021 14:35:42 -0700 (PDT)
X-Received: by 2002:a05:6808:120e:: with SMTP id a14mr28838855oil.122.1636234542512;
Sat, 06 Nov 2021 14:35:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 6 Nov 2021 14:35:42 -0700 (PDT)
In-Reply-To: <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 06 Nov 2021 21:35:42 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Sat, 6 Nov 2021 21:35 UTC

On Saturday, November 6, 2021 at 4:03:02 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> Terje Mathisen <terje.m...@tmsw.no> wrote:
> > Theo wrote:
> > > Marcus <m.de...@this.bitsnbites.eu> wrote:
> > >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> > >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> > >> MADD instruction (the addend is set to zero).
> > >>
> > >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> > >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> > >
> > > 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> > > hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> > > and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> > > bearing in mind there's no cache so it depends on DRAM for instruction
> > > fetch), although it implies they can terminate early. I think the add
> > > therefore comes 'free' in terms of cycles.
> > >
> > > So it appeared in ARM architecture v2 and has stuck around since. I think I
> > > remember Sophie Wilson saying it was added to aid in graphics, although
> > > somebody else suggests sound.
> > >
> > >> It seems to me that it's a very useful instruction, and that it has
> > >> a relatively small hardware cost.
> > >
> > > Indeed, it can cut down the instruction overhead of DSP things (and sound
> > > and graphics) by quite a bit.
> >
> > That's not true, at least not for the standard single-wide version: The
> > latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> > is 1 cycle, so that's the only saving in this case.
> Typical application of muladd is accumulating dot product-like things.
> In such case what matter is throughput of multiply and latency of
> add. In fact, by using multiple acumulators one can compensate
> for latency of add. If muladd has higher throughput, then it is
> highly useful. I would hope for muladd that has 1 cycle latency
> with respect to added argument, so that one could execute chained
> muladd-s one per clock, but maybe this is too much...
<
Any reasonable 2-wide machine can already do this (in order machines
merely need to be code scheduled, out-of-order machines just need a
big enough window to contain the MUL+ADD latency.) There is not a lot
of gain by making it a single instruction, nor is there a large HW cost
to making it a single instruction.
<
Today's designers should treat this as a free variable.
> --
> Waldek Hebisch

Re: MADD instruction (integer multiply and add)

<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21908&group=comp.arch#21908

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:44cc:: with SMTP id r195mr55990561qka.77.1636234634959;
Sat, 06 Nov 2021 14:37:14 -0700 (PDT)
X-Received: by 2002:a9d:5c18:: with SMTP id o24mr10881503otk.243.1636234634749;
Sat, 06 Nov 2021 14:37:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 6 Nov 2021 14:37:14 -0700 (PDT)
In-Reply-To: <fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl> <fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 06 Nov 2021 21:37:14 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Sat, 6 Nov 2021 21:37 UTC

On Saturday, November 6, 2021 at 4:35:43 PM UTC-5, MitchAlsup wrote:
> On Saturday, November 6, 2021 at 4:03:02 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> > Terje Mathisen <terje.m...@tmsw.no> wrote:
> > > Theo wrote:
> > > > Marcus <m.de...@this.bitsnbites.eu> wrote:
> > > >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> > > >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> > > >> MADD instruction (the addend is set to zero).
> > > >>
> > > >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> > > >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> > > >
> > > > 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
> > > > hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
> > > > and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
> > > > bearing in mind there's no cache so it depends on DRAM for instruction
> > > > fetch), although it implies they can terminate early. I think the add
> > > > therefore comes 'free' in terms of cycles.
> > > >
> > > > So it appeared in ARM architecture v2 and has stuck around since. I think I
> > > > remember Sophie Wilson saying it was added to aid in graphics, although
> > > > somebody else suggests sound.
> > > >
> > > >> It seems to me that it's a very useful instruction, and that it has
> > > >> a relatively small hardware cost.
> > > >
> > > > Indeed, it can cut down the instruction overhead of DSP things (and sound
> > > > and graphics) by quite a bit.
> > >
> > > That's not true, at least not for the standard single-wide version: The
> > > latency of the MUL is typically 4-5 cycles (or even more) while an ADD
> > > is 1 cycle, so that's the only saving in this case.
> > Typical application of muladd is accumulating dot product-like things.
> > In such case what matter is throughput of multiply and latency of
> > add. In fact, by using multiple acumulators one can compensate
> > for latency of add. If muladd has higher throughput, then it is
> > highly useful. I would hope for muladd that has 1 cycle latency
> > with respect to added argument, so that one could execute chained
> > muladd-s one per clock, but maybe this is too much...
> <
> Any reasonable 2-wide machine can already do this (in order machines
> merely need to be code scheduled, out-of-order machines just need a
> big enough window to contain the MUL+ADD latency.) There is not a lot
> of gain by making it a single instruction, nor is there a large HW cost
> to making it a single instruction.
> <
> Today's designers should treat this as a free variable.
<
I should also add, making integer MUL+ADD a single instruction is a
LOT easier with a combined register file than separate integer and FP files.
{The 3-operand data path is already present for FMAC/FMAD.}
> > --
> > Waldek Hebisch

Re: MADD instruction (integer multiply and add)

<smd5k2$h8v$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21925&group=comp.arch#21925

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 9 Nov 2021 07:48:34 +0100
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <smd5k2$h8v$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 9 Nov 2021 06:48:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="0f7f9e92766f63ffcf44285c2f854404";
logging-data="17695"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18zpKcR/kKl3EDLPldSmheDs2SS2XGb5vs="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:IPUMZjCR3EZmC+Q1FGUDq+tlTb4=
In-Reply-To: <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
Content-Language: en-US

by: Marcus - Tue, 9 Nov 2021 06:48 UTC

On 2021-11-06 22:37, MitchAlsup wrote:
> On Saturday, November 6, 2021 at 4:35:43 PM UTC-5, MitchAlsup wrote:
>> On Saturday, November 6, 2021 at 4:03:02 PM UTC-5, anti...@math.uni.wroc.pl wrote:
>>> Terje Mathisen <terje.m...@tmsw.no> wrote:
>>>> Theo wrote:
>>>>> Marcus <m.de...@this.bitsnbites.eu> wrote:
>>>>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>>>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>>>>>> MADD instruction (the addend is set to zero).
>>>>>>
>>>>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>>>>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>>>
>>>>> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
>>>>> hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
>>>>> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
>>>>> bearing in mind there's no cache so it depends on DRAM for instruction
>>>>> fetch), although it implies they can terminate early. I think the add
>>>>> therefore comes 'free' in terms of cycles.
>>>>>
>>>>> So it appeared in ARM architecture v2 and has stuck around since. I think I
>>>>> remember Sophie Wilson saying it was added to aid in graphics, although
>>>>> somebody else suggests sound.
>>>>>
>>>>>> It seems to me that it's a very useful instruction, and that it has
>>>>>> a relatively small hardware cost.
>>>>>
>>>>> Indeed, it can cut down the instruction overhead of DSP things (and sound
>>>>> and graphics) by quite a bit.
>>>>
>>>> That's not true, at least not for the standard single-wide version: The
>>>> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
>>>> is 1 cycle, so that's the only saving in this case.
>>> Typical application of muladd is accumulating dot product-like things.
>>> In such case what matter is throughput of multiply and latency of
>>> add. In fact, by using multiple acumulators one can compensate
>>> for latency of add. If muladd has higher throughput, then it is
>>> highly useful. I would hope for muladd that has 1 cycle latency
>>> with respect to added argument, so that one could execute chained
>>> muladd-s one per clock, but maybe this is too much...
>> <
>> Any reasonable 2-wide machine can already do this (in order machines
>> merely need to be code scheduled, out-of-order machines just need a
>> big enough window to contain the MUL+ADD latency.) There is not a lot
>> of gain by making it a single instruction, nor is there a large HW cost
>> to making it a single instruction.
>> <
>> Today's designers should treat this as a free variable.
> <
> I should also add, making integer MUL+ADD a single instruction is a
> LOT easier with a combined register file than separate integer and FP files.
> {The 3-operand data path is already present for FMAC/FMAD.}

That's actually one of the reasons I went this route: I reasoned that I
needed fused multiply-add for floating-point anyway. Same thing with the
bitwise SELect (IIRC it's called MIX in My 66000).

/Marcus

Re: MADD instruction (integer multiply and add)

<smd86q$vkd$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21926&group=comp.arch#21926

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 9 Nov 2021 08:32:41 +0100
Organization: A noiseless patient Spider
Lines: 72
Message-ID: <smd86q$vkd$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 9 Nov 2021 07:32:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="0f7f9e92766f63ffcf44285c2f854404";
logging-data="32397"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+j/V9UMd3erW3STVBRhJdeUiZrl9hUts8="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:jyJJO/7bHLMLfdE+sUnSlwt+yLU=
In-Reply-To: <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
Content-Language: en-US

by: Marcus - Tue, 9 Nov 2021 07:32 UTC

On 2021-11-06 22:02, antispam@math.uni.wroc.pl wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>> Theo wrote:
>>> Marcus <m.delete@this.bitsnbites.eu> wrote:
>>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>>>> MADD instruction (the addend is set to zero).
>>>>
>>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>
>>> 32-bit ARM has a MLA instruction. This came in in ARM2, which added a
>>> hardware multiply after ARM1 went without. The ARM2 datasheet says both MUL
>>> and MLA take 'up to' 16 S-cycles (sequential cycles, ie without wait states,
>>> bearing in mind there's no cache so it depends on DRAM for instruction
>>> fetch), although it implies they can terminate early. I think the add
>>> therefore comes 'free' in terms of cycles.
>>>
>>> So it appeared in ARM architecture v2 and has stuck around since. I think I
>>> remember Sophie Wilson saying it was added to aid in graphics, although
>>> somebody else suggests sound.
>>>
>>>> It seems to me that it's a very useful instruction, and that it has
>>>> a relatively small hardware cost.
>>>
>>> Indeed, it can cut down the instruction overhead of DSP things (and sound
>>> and graphics) by quite a bit.
>>
>> That's not true, at least not for the standard single-wide version: The
>> latency of the MUL is typically 4-5 cycles (or even more) while an ADD
>> is 1 cycle, so that's the only saving in this case.
>
> Typical application of muladd is accumulating dot product-like things.
> In such case what matter is throughput of multiply and latency of
> add. In fact, by using multiple acumulators one can compensate
> for latency of add. If muladd has higher throughput, then it is
> highly useful. I would hope for muladd that has 1 cycle latency
> with respect to added argument, so that one could execute chained
> muladd-s one per clock, but maybe this is too much...
>

I think that DSP:s commonly use a dedicated accumulator register that
works like a low-latency feedback into to the MAC adder, thus enabling
1 MAC / cycle (per MAC unit) without the need for insn scheduling or
interlocks.

In my design it would be possible to achieve the same thing if I
implemented "late forwarding". I.e. identify that we're using the output
from a previous MADD operation as an addend to the next MADD operation,
and rather than waiting for that result to be ready before starting the
multiplication, go ahead and start the operation immediately and forward
the MADD result into the adder.

(Let's see if this ASCII art works...)

Here's how the MADD pipeline works now (output->input latency = +2):

MUL1 : MUL2 : MUL3/ADD :
^ |
+-------------------------+

....and with late forwarding (no extra output->input latency):

MUL1 : MUL2 : MUL3/ADD :
^ |
+-----------+

I still have not implemented late forwarding, though, as it feels like
there are a number of things that you can easily get wrong.

/Marcus

Re: MADD instruction (integer multiply and add)

<MruiJ.26002$IB7.11654@fx02.iad>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21927&group=comp.arch#21927

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx02.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl> <fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
In-Reply-To: <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 31
Message-ID: <MruiJ.26002$IB7.11654@fx02.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 09 Nov 2021 13:11:08 UTC
Date: Tue, 09 Nov 2021 08:10:26 -0500
X-Received-Bytes: 2542

by: EricP - Tue, 9 Nov 2021 13:10 UTC

MitchAlsup wrote:
> On Saturday, November 6, 2021 at 4:35:43 PM UTC-5, MitchAlsup wrote:
>> On Saturday, November 6, 2021 at 4:03:02 PM UTC-5, anti...@math.uni.wroc.pl wrote:
>>> Typical application of muladd is accumulating dot product-like things.
>>> In such case what matter is throughput of multiply and latency of
>>> add. In fact, by using multiple acumulators one can compensate
>>> for latency of add. If muladd has higher throughput, then it is
>>> highly useful. I would hope for muladd that has 1 cycle latency
>>> with respect to added argument, so that one could execute chained
>>> muladd-s one per clock, but maybe this is too much...
>> <
>> Any reasonable 2-wide machine can already do this (in order machines
>> merely need to be code scheduled, out-of-order machines just need a
>> big enough window to contain the MUL+ADD latency.) There is not a lot
>> of gain by making it a single instruction, nor is there a large HW cost
>> to making it a single instruction.
>> <
>> Today's designers should treat this as a free variable.
> <
> I should also add, making integer MUL+ADD a single instruction is a
> LOT easier with a combined register file than separate integer and FP files.
> {The 3-operand data path is already present for FMAC/FMAD.}

A base-indexed store needs 3 register operand paths,
and 4 operand buses if you allow an immediate too,
if it doesn't use your trick of reading the register to
store at the Write Back stage instead of Register Read.

It can make multi-issue operand buses a bit of a rats nest.

(FP)MADD and data scheduling

<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21928&group=comp.arch#21928

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: (FP)MADD and data scheduling
Date: Tue, 09 Nov 2021 08:30:32 -0500
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="31d4f715a1a83541dcd0e8665e88c831";
logging-data="12500"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/uUpVNAAxxmiYDCL2Togoh"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:z74P7rYyiADlCk4Gdx1LTTgwj/g=
sha1:m2SDesEnHT7fRmpv09qhy10eu0k=

by: Stefan Monnier - Tue, 9 Nov 2021 13:30 UTC

> I should also add, making integer MUL+ADD a single instruction is a
> LOT easier with a combined register file than separate integer and FP files.
> {The 3-operand data path is already present for FMAC/FMAD.}

Reminds me of a question: IIUC, the MADD instruction will typically have
a latency of a few cycles (let's say 3) but the latency of the signal
is not the same for all 3 inputs. More specifically, IIUC the 3rd
("carry") input can arrive as late as the last cycle without impacting
the overall latency.

This means that in theory we can have a sequence of MADD instructions
all accumulating into the same register at a rate of 1 per cycle
(assuming the unit is pipelined), but if all inputs are "read" at
the same time, then the rate goes down to 1/3 per cycle.

How do OoO cores avoid this throughput problem?
Do they schedule the MADD by predicting that the carry will be available
2 cycles later? How do they do this prediction? What if it fails?

AFAIK the same kind of problem affects stores since the input registers
affecting the address are needed earlier than the input register
containing the to-be-stored data, but IIUC these are solved more easily
because the two parts affect different units so the stores can
conceptually be split into 2 sub-operations and go back to waiting in
the ROB between the two.

Stefan

Re: (FP)MADD and data scheduling

<hgwiJ.4483$Pl1.754@fx23.iad>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21929&group=comp.arch#21929

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx23.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: (FP)MADD and data scheduling
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl> <fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com> <jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
In-Reply-To: <jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 77
Message-ID: <hgwiJ.4483$Pl1.754@fx23.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 09 Nov 2021 15:15:25 UTC
Date: Tue, 09 Nov 2021 10:15:03 -0500
X-Received-Bytes: 4592

by: EricP - Tue, 9 Nov 2021 15:15 UTC

Stefan Monnier wrote:
>> I should also add, making integer MUL+ADD a single instruction is a
>> LOT easier with a combined register file than separate integer and FP files.
>> {The 3-operand data path is already present for FMAC/FMAD.}
>
> Reminds me of a question: IIUC, the MADD instruction will typically have
> a latency of a few cycles (let's say 3) but the latency of the signal
> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> ("carry") input can arrive as late as the last cycle without impacting
> the overall latency.
>
> This means that in theory we can have a sequence of MADD instructions
> all accumulating into the same register at a rate of 1 per cycle
> (assuming the unit is pipelined), but if all inputs are "read" at
> the same time, then the rate goes down to 1/3 per cycle.
>
> How do OoO cores avoid this throughput problem?
> Do they schedule the MADD by predicting that the carry will be available
> 2 cycles later? How do they do this prediction? What if it fails?
>
> AFAIK the same kind of problem affects stores since the input registers
> affecting the address are needed earlier than the input register
> containing the to-be-stored data, but IIUC these are solved more easily
> because the two parts affect different units so the stores can
> conceptually be split into 2 sub-operations and go back to waiting in
> the ROB between the two.
>
>
> Stefan

The store data isn't needed until later in the LSQ pipeline,
but not so late that it blocks a potential store-load forwarding.

This is from the design of my simulator's Load-Store Queue...

Some addresses don't require calculation or reading a register,
for example immediate absolutes.
RIP-relative can be calculated in Decode because it has
an adder for calculating RIP-rel branch destinations.
Such Effective Address (EA) can issue straight from the front end
uOp pipeline to the Load-Store Queue.

Others require reading a register but no calculation, e.g. [reg].
That register may be ready, in which case the EA can go straight to
the LSQ, or be in-flight and require waiting for the EA to be forwarded.

Others require a trip through AGEN for EA calculation,
possibly syncing with in-flight operands being forwarded.

Once the EA is available then comes virtual translate by state
machine(s) which could have multiple page table walkers at once,
generating multiple outstanding L1 or L2 misses.
A load or store data might straddle a page boundary requiring
multiple EA translates and producing multiple Physical Addresses (PA)
(These page table walker(s) multiplex their requests with other
LSQ operations talking to D$L1 cache.)

After the above is complete and we have 1 or 2 PA's comes address
disambiguation where it looks at older and younger LSQ entries
to see if any are for the same cache line and subject to
store ordering (for stores) or store-load forwarding (for loads).
And there can be 1 or 2 of those for data that straddles cache lines.

After all of that the store data is actually needed in the LSQ
in case younger load(s) require store-load forwarding
or the store instruction is next to retire.
That store data may come from a register or from in-flight forwarding.

The younger load(s) that are candidates for S-L-forwarding may already be
waiting in the LSQ, held up by an unresolved older store address or data,
or may arrive later.
Either of the store or load data may straddle cache lines.

Then the LSQ talks to the D$L1 cache which may itself allow
multiple hit-under-miss operations.

Re: (FP)MADD and data scheduling

<63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21930&group=comp.arch#21930

copy link Newsgroups: comp.arch

X-Received: by 2002:ac8:5853:: with SMTP id h19mr10096414qth.166.1636474267439;
Tue, 09 Nov 2021 08:11:07 -0800 (PST)
X-Received: by 2002:a9d:6358:: with SMTP id y24mr6827464otk.85.1636474267139;
Tue, 09 Nov 2021 08:11:07 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 9 Nov 2021 08:11:06 -0800 (PST)
In-Reply-To: <jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
Subject: Re: (FP)MADD and data scheduling
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 09 Nov 2021 16:11:07 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Tue, 9 Nov 2021 16:11 UTC

On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> > I should also add, making integer MUL+ADD a single instruction is a
> > LOT easier with a combined register file than separate integer and FP files.
> > {The 3-operand data path is already present for FMAC/FMAD.}
> Reminds me of a question: IIUC, the MADD instruction will typically have
> a latency of a few cycles (let's say 3) but the latency of the signal
> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> ("carry") input can arrive as late as the last cycle without impacting
> the overall latency.
>
> This means that in theory we can have a sequence of MADD instructions
> all accumulating into the same register at a rate of 1 per cycle
> (assuming the unit is pipelined), but if all inputs are "read" at
> the same time, then the rate goes down to 1/3 per cycle.
>
> How do OoO cores avoid this throughput problem?
<
We build the function unit to accept all 3 operands on the same clock.
Then we don't use the 3rd operand until "later"
<
> Do they schedule the MADD by predicting that the carry will be available
> 2 cycles later? How do they do this prediction? What if it fails?
>
> AFAIK the same kind of problem affects stores since the input registers
> affecting the address are needed earlier than the input register
> containing the to-be-stored data, but IIUC these are solved more easily
<
In fact you do not need the data until you have verified permissions
(and possibly cache hit.)
<
> because the two parts affect different units so the stores can
> conceptually be split into 2 sub-operations and go back to waiting in
> the ROB between the two.
>
>
> Stefan

Re: (FP)MADD and data scheduling

<e36ccf35-2bd8-40ed-bb6e-77a42bb0a6f5n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21931&group=comp.arch#21931

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:6214:16ca:: with SMTP id d10mr8576952qvz.14.1636474423364;
Tue, 09 Nov 2021 08:13:43 -0800 (PST)
X-Received: by 2002:a9d:5c18:: with SMTP id o24mr6688617otk.243.1636474423122;
Tue, 09 Nov 2021 08:13:43 -0800 (PST)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 9 Nov 2021 08:13:42 -0800 (PST)
In-Reply-To: <hgwiJ.4483$Pl1.754@fx23.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org> <hgwiJ.4483$Pl1.754@fx23.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e36ccf35-2bd8-40ed-bb6e-77a42bb0a6f5n@googlegroups.com>
Subject: Re: (FP)MADD and data scheduling
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 09 Nov 2021 16:13:43 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 5425

by: MitchAlsup - Tue, 9 Nov 2021 16:13 UTC

On Tuesday, November 9, 2021 at 9:15:30 AM UTC-6, EricP wrote:
> Stefan Monnier wrote:
> >> I should also add, making integer MUL+ADD a single instruction is a
> >> LOT easier with a combined register file than separate integer and FP files.
> >> {The 3-operand data path is already present for FMAC/FMAD.}
> >
> > Reminds me of a question: IIUC, the MADD instruction will typically have
> > a latency of a few cycles (let's say 3) but the latency of the signal
> > is not the same for all 3 inputs. More specifically, IIUC the 3rd
> > ("carry") input can arrive as late as the last cycle without impacting
> > the overall latency.
> >
> > This means that in theory we can have a sequence of MADD instructions
> > all accumulating into the same register at a rate of 1 per cycle
> > (assuming the unit is pipelined), but if all inputs are "read" at
> > the same time, then the rate goes down to 1/3 per cycle.
> >
> > How do OoO cores avoid this throughput problem?
> > Do they schedule the MADD by predicting that the carry will be available
> > 2 cycles later? How do they do this prediction? What if it fails?
> >
> > AFAIK the same kind of problem affects stores since the input registers
> > affecting the address are needed earlier than the input register
> > containing the to-be-stored data, but IIUC these are solved more easily
> > because the two parts affect different units so the stores can
> > conceptually be split into 2 sub-operations and go back to waiting in
> > the ROB between the two.
> >
> >
> > Stefan
> The store data isn't needed until later in the LSQ pipeline,
> but not so late that it blocks a potential store-load forwarding.
>
> This is from the design of my simulator's Load-Store Queue...
>
> Some addresses don't require calculation or reading a register,
> for example immediate absolutes.
> RIP-relative can be calculated in Decode because it has
> an adder for calculating RIP-rel branch destinations.
> Such Effective Address (EA) can issue straight from the front end
> uOp pipeline to the Load-Store Queue.
>
> Others require reading a register but no calculation, e.g. [reg].
> That register may be ready, in which case the EA can go straight to
> the LSQ, or be in-flight and require waiting for the EA to be forwarded.
>
> Others require a trip through AGEN for EA calculation,
> possibly syncing with in-flight operands being forwarded.
<
At this point, it is generally easier to perform calculation in AGEN
and just have the data path route the bits there over the operand
buses. This gives uniform timing which makes pipelining easier.
>
> Once the EA is available then comes virtual translate by state
> machine(s) which could have multiple page table walkers at once,
> generating multiple outstanding L1 or L2 misses.
> A load or store data might straddle a page boundary requiring
> multiple EA translates and producing multiple Physical Addresses (PA)
> (These page table walker(s) multiplex their requests with other
> LSQ operations talking to D$L1 cache.)
>
> After the above is complete and we have 1 or 2 PA's comes address
> disambiguation where it looks at older and younger LSQ entries
> to see if any are for the same cache line and subject to
> store ordering (for stores) or store-load forwarding (for loads).
> And there can be 1 or 2 of those for data that straddles cache lines.
>
> After all of that the store data is actually needed in the LSQ
> in case younger load(s) require store-load forwarding
> or the store instruction is next to retire.
> That store data may come from a register or from in-flight forwarding.
>
> The younger load(s) that are candidates for S-L-forwarding may already be
> waiting in the LSQ, held up by an unresolved older store address or data,
> or may arrive later.
> Either of the store or load data may straddle cache lines.
>
> Then the LSQ talks to the D$L1 cache which may itself allow
> multiple hit-under-miss operations.

Re: (FP)MADD and data scheduling

<jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21932&group=comp.arch#21932

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: (FP)MADD and data scheduling
Date: Tue, 09 Nov 2021 14:21:26 -0500
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
<63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="4d9e46e2f6b32b080aeadf8ee73d1edf";
logging-data="15461"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18doflBmVas1vGuq9H7A+Ei"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:alcZptIUwdvRZIi9JqqG8IUVBdM=
sha1:w6alVXJAMXrO1aT7DSTShAK8oDQ=

by: Stefan Monnier - Tue, 9 Nov 2021 19:21 UTC

MitchAlsup [2021-11-09 08:11:06] wrote:
> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>> Reminds me of a question: IIUC, the MADD instruction will typically have
>> a latency of a few cycles (let's say 3) but the latency of the signal
>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>> ("carry") input can arrive as late as the last cycle without impacting
>> the overall latency.
>>
>> This means that in theory we can have a sequence of MADD instructions
>> all accumulating into the same register at a rate of 1 per cycle
>> (assuming the unit is pipelined), but if all inputs are "read" at
>> the same time, then the rate goes down to 1/3 per cycle.
>>
>> How do OoO cores avoid this throughput problem?
> We build the function unit to accept all 3 operands on the same clock.
> Then we don't use the 3rd operand until "later"

IOW you don't avoid this throughput problem?
I mean:

accum1 = MADD(x1, x2, accum);
accum2 = MADD(x3, x4, accum1);

ends up with a latency of 2*N cycles instead of N+1, right?
Because we can't start the second MADD before the first is over :-(

For integers, we can replace the code with;

x12 = MUL(x1, x2);
x34 = MUL(x3, x4);
accum1 = ADD(x12, accum);
accum2 = ADD(x34, accum1);

with latency N+2, which is likely better than 2*N.

Stefan

Re: (FP)MADD and data scheduling

<4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21933&group=comp.arch#21933

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:6214:29eb:: with SMTP id jv11mr10602848qvb.13.1636493160530;
Tue, 09 Nov 2021 13:26:00 -0800 (PST)
X-Received: by 2002:a9d:5c18:: with SMTP id o24mr8380974otk.243.1636493160330;
Tue, 09 Nov 2021 13:26:00 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 9 Nov 2021 13:26:00 -0800 (PST)
In-Reply-To: <jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org> <63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
<jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>
Subject: Re: (FP)MADD and data scheduling
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 09 Nov 2021 21:26:00 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3454

by: MitchAlsup - Tue, 9 Nov 2021 21:26 UTC

On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
> MitchAlsup [2021-11-09 08:11:06] wrote:
> > On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >> Reminds me of a question: IIUC, the MADD instruction will typically have
> >> a latency of a few cycles (let's say 3) but the latency of the signal
> >> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >> ("carry") input can arrive as late as the last cycle without impacting
> >> the overall latency.
> >>
> >> This means that in theory we can have a sequence of MADD instructions
> >> all accumulating into the same register at a rate of 1 per cycle
> >> (assuming the unit is pipelined), but if all inputs are "read" at
> >> the same time, then the rate goes down to 1/3 per cycle.
> >>
> >> How do OoO cores avoid this throughput problem?
> > We build the function unit to accept all 3 operands on the same clock.
> > Then we don't use the 3rd operand until "later"
> IOW you don't avoid this throughput problem?
> I mean:
>
> accum1 = MADD(x1, x2, accum);
> accum2 = MADD(x3, x4, accum1);
>
> ends up with a latency of 2*N cycles instead of N+1, right?
<
Yes,
<
> Because we can't start the second MADD before the first is over :-(
<
Yes,
<
But as long as you do not exceed the size of the execution window,
it all works; you can put new instructions into the window every
cycle, you can retire instructions from the window every cycle,
and each function units can start a calculation every cycle. All
without SW having to schedule the code or to apply any Herculean
effort in code selection.
<
>
> For integers, we can replace the code with;
>
> x12 = MUL(x1, x2);
> x34 = MUL(x3, x4);
> accum1 = ADD(x12, accum);
> accum2 = ADD(x34, accum1);
>
> with latency N+2, which is likely better than 2*N.
<
Given the IMUL takes 3 cycles, the count is N+3;
accum1 = cannot begin until x12 = MUL completes.
>
>
> Stefan

Re: (FP)MADD and data scheduling

<jwvpmr9htrz.fsf-monnier+comp.arch@gnu.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21934&group=comp.arch#21934

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: (FP)MADD and data scheduling
Date: Tue, 09 Nov 2021 17:04:02 -0500
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <jwvpmr9htrz.fsf-monnier+comp.arch@gnu.org>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
<63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
<jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>
<4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="4d9e46e2f6b32b080aeadf8ee73d1edf";
logging-data="7637"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18p+NTwvOSNaMitrLcQ325r"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:2XqM39hN3js3hDVj4hMbC9+S6LA=
sha1:8FeVpUgj/K9mlLamI/pEUERvXCw=

by: Stefan Monnier - Tue, 9 Nov 2021 22:04 UTC

>> For integers, we can replace the code with;
>>
>> x12 = MUL(x1, x2);
>> x34 = MUL(x3, x4);
>> accum1 = ADD(x12, accum);
>> accum2 = ADD(x34, accum1);
>>
>> with latency N+2, which is likely better than 2*N.
> <
> Given the IMUL takes 3 cycles, the count is N+3;
> accum1 = cannot begin until x12 = MUL completes.

I don't understand. My "N" was the latency of MUL (and MADD).
So with your N=3 it means the MADD version takes 6 cycles whiles the
MUL+ADD version only takes 5.

Stefan

Re: (FP)MADD and data scheduling

<46c8e027-becb-4792-80f7-8815750ff6d5n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21935&group=comp.arch#21935

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:2c85:: with SMTP id s127mr9048565qkh.81.1636497033138;
Tue, 09 Nov 2021 14:30:33 -0800 (PST)
X-Received: by 2002:a05:6808:1441:: with SMTP id x1mr1185689oiv.175.1636497032978;
Tue, 09 Nov 2021 14:30:32 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 9 Nov 2021 14:30:32 -0800 (PST)
In-Reply-To: <jwvpmr9htrz.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <slm4ja$e0b$1@dont-email.me> <cPf*Mrryy@news.chiark.greenend.org.uk>
<sm2j3m$e7q$1@gioia.aioe.org> <sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com> <9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org> <63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
<jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org> <4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>
<jwvpmr9htrz.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <46c8e027-becb-4792-80f7-8815750ff6d5n@googlegroups.com>
Subject: Re: (FP)MADD and data scheduling
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 09 Nov 2021 22:30:33 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Tue, 9 Nov 2021 22:30 UTC

On Tuesday, November 9, 2021 at 4:04:05 PM UTC-6, Stefan Monnier wrote:
> >> For integers, we can replace the code with;
> >>
> >> x12 = MUL(x1, x2);
> >> x34 = MUL(x3, x4);
> >> accum1 = ADD(x12, accum);
> >> accum2 = ADD(x34, accum1);
> >>
> >> with latency N+2, which is likely better than 2*N.
> > <
> > Given the IMUL takes 3 cycles, the count is N+3;
> > accum1 = cannot begin until x12 = MUL completes.
> I don't understand. My "N" was the latency of MUL (and MADD).
> So with your N=3 it means the MADD version takes 6 cycles whiles the
> MUL+ADD version only takes 5.
<
+----------+----------+----------+----------+----------+
| cycle 1| cycle2|cycle 3|cycle 4|cycle 5|
+----------+----------+----------+----------+----------+
Cycle 1 MUL x12 begins
Cycle 2 MUL x23 begins
Cycle 3 no instruction gets launched
Cycle 4 1st ADD begins MUL x12 forwards to ADD
Cycle 5 2nd Add begins 1st Add ends MUL x34 forwards to ADDs
Cycle 6 2nd ADD ends
>
>
> Stefan

Re: MADD instruction (integer multiply and add)

<smet7t$m1r$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21936&group=comp.arch#21936

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 9 Nov 2021 16:36:35 -0600
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <smet7t$m1r$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<MruiJ.26002$IB7.11654@fx02.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 9 Nov 2021 22:37:50 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2f821c1b1471f46fad26ee9d654e9eac";
logging-data="22587"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18IKPudJTDtoVNDNikTENgk"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.0
Cancel-Lock: sha1:9IfJSf58lmfhsrtAjT6Y6GCaa7M=
In-Reply-To: <MruiJ.26002$IB7.11654@fx02.iad>
Content-Language: en-US

by: BGB - Tue, 9 Nov 2021 22:36 UTC

On 11/9/2021 7:10 AM, EricP wrote:
> MitchAlsup wrote:
>> On Saturday, November 6, 2021 at 4:35:43 PM UTC-5, MitchAlsup wrote:
>>> On Saturday, November 6, 2021 at 4:03:02 PM UTC-5,
>>> anti...@math.uni.wroc.pl wrote:
>>>> Typical application of muladd is accumulating dot product-like
>>>> things. In such case what matter is throughput of multiply and
>>>> latency of add. In fact, by using multiple acumulators one can
>>>> compensate for latency of add. If muladd has higher throughput, then
>>>> it is highly useful. I would hope for muladd that has 1 cycle
>>>> latency with respect to added argument, so that one could execute
>>>> chained muladd-s one per clock, but maybe this is too much...
>>> <
>>> Any reasonable 2-wide machine can already do this (in order machines
>>> merely need to be code scheduled, out-of-order machines just need a
>>> big enough window to contain the MUL+ADD latency.) There is not a lot
>>> of gain by making it a single instruction, nor is there a large HW
>>> cost to making it a single instruction. < Today's designers should
>>> treat this as a free variable.
>> <
>> I should also add, making integer MUL+ADD a single instruction is a
>> LOT easier with a combined register file than separate integer and FP
>> files.
>> {The 3-operand data path is already present for FMAC/FMAD.}
>
> A base-indexed store needs 3 register operand paths,
> and 4 operand buses if you allow an immediate too,
> if it doesn't use your trick of reading the register to
> store at the Write Back stage instead of Register Read.
>
> It can make multi-issue operand buses a bit of a rats nest.
>
>

Side note:
For reasons like this, the BJX2 effectively drops down to 2-lane when
doing a memory store, because the memory store eats Lane 3 to supply the
3rd register port.

Partial result is that 3-wide bundles end up being rare, because:
* It is hard to find enough usable ILP;
* They are limited to "ALU|ALU|ALU" or "ALU|ALU|LD".

Also, a lot of the time:
* Load and Store ops tend to be the dominant operations.
* Other non Ld/St ops are frequently run in parallel with a Ld/St op.
....

Or, if expressed as 4R operations, the ports would look like (With a
6R/3W regfile, labeled as Rs/Rt/Ru/Rv/Rx/Ry -> Rm/Rn/Ro):
1-Wide:
Sc: Rs, Rt, Rx, Rm
W : Ru:Rs, Rv:Rt, Ry:Rx, Rm:Rn
2-Wide:
Ru, Rv, Ry, Rn | Rs, Rt, Rx, Rm
3-Wide:
Rx, Ry, ZR, Ro | Ru, Rv, ZR, Rn | Rs, Rt, ZR, Rm

For a few operations, there is an implicit 4th 'Imm' value field, which
is normally forwarded into one of the other ports if an Immed form of an
instruction is used, but serves its own role in a few cases:
As an optional additional displacement for Load/Store operations (if
enabled, *);
Giving the Rounding-Mode for Op64 encodings of FPU operations (this is
decoded as zero for the 32-bit encodings, and treated as constant RNE
rounding).

There is another encoding which encodes a dynamic rounding mode, which
may be used if "FENV_ACCESS" is enabled for a given function.

*, Namely, one of:
(Rm, Ro*Sc, Disp9) //Explicit Scale
(Rm, Ro, Disp11) //Implicit Scale (Element Size)
This 'extra' displacement being implicitly unscaled.

It isn't currently used, and it looks like the compiler would need to be
able to merge several accesses together to be able to use it
effectively, eg:
foo.bar.baz[index];
Which could, in theory, be merged into a single operation, rather than
several LEA operations followed by a Load operation (namely, this is how
BGBCC would currently generate here).

Otherwise, past few days was mostly caught up in partly redesigning
BGBCC's AST system. I had hoped to be able to "escape" from it being
in-effect built on top of XML (or, at least, an abstract model
resembling that of XML), but (sadly) this would also have required a
near complete rewrite of the compiler front-end.

In effect, was able to eliminate its used of linked lists.

So, now the AST nodes look like, roughly:
Radix-8 key/value mapping (fewer than 8 keys = single node);
0-7 keys: Single Node
8-15 keys: Three Nodes (splits B-Tree style)
16-63: Adds 1 node for every 8 keys.
64-511: Uses a 3 level tree.
Radix-16 sub-node list.
0-3 children: Encoded using keys;
4-15 children: Encoded as a single list node.
16-31 children: Encoded as 3 list nodes
32-255 children: Adds a list node for every 16 children.
256-4095 use a 3 level tree;
...

The keys may be either node attributes, or child nodes in some cases
(most "simple" cases fitting within a single node structure).
The key field is generally stored as 16-bit number holding a 4-bit type
tag and 12-bit symbol (*1). Keys are stored sorted by index (so that
binary search can be used, if above a certain minimum, for 1-4 keys,
linear search is faster; for non-leaf nodes, the key would hold the
value of the lowest numbered subkey).

The value field is 64 bits, with a type interpreted based on the key's
type tag.

*1: Interned index numbers are used rather than strings because this is
both more compact and significantly faster. Numeric types are expressed
directly as integer or floating point values (int128 or float128 values
are split into high and low halves). String values (includes most
symbols from the source program, as well as "gensym" names) are
generally interned into a (larger) string table. One can probably guess
from this where some early bottlenecks were at. The ASTs make no
provision for XML namespaces or similar, as these were mostly N/A for a
compiler (so were dropped early on).

I had initially considered storing everything as keys, however:
This does not scale very well;
This would put a hard limit on the number of child nodes per parent node
(and the limit was small enough to be blown out by things like top-level
function prototypes and similar).

This replaces the use of linked lists. This slightly reduces the size of
each node (despite me increasing the number of keys per node), while at
the same time eliminating most of the need for doing mass tree cloning
when doing expression reduction operations (this was wasteful both in
terms of clock cycles and memory footprint; but was needed because nodes
with internal linked-list fields can't be shared between multiple trees;
but this restriction does not apply to nodes held in an array).

The main (significant) change needed to the rest of the compiler was
mostly rewriting bunches of logic to replace linked-list walks with
trying to index into a node (and changing out logic in a lot of places
to deal with the side-effects of lists of nodes now needing another node
to hold them in; where this change wasn't exactly entirely transparent
in some places).

Well, this is one checklist item; another would be getting around to
moving the compiler's bytecode IR stage over to using TLV packaging (and
also properly split up the AST->RIL and RIL->3AC stages into two
separate stages; better nail down the specification and semantics for
the IR).

....

Re: (FP)MADD and data scheduling

<smeuqt$1ug$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=21937&group=comp.arch#21937

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: (FP)MADD and data scheduling
Date: Tue, 9 Nov 2021 15:05:01 -0800
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <smeuqt$1ug$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<cPf*Mrryy@news.chiark.greenend.org.uk> <sm2j3m$e7q$1@gioia.aioe.org>
<sm6qi3$fe6$2@z-news.wcss.wroc.pl>
<fdfa62db-b315-4391-9017-808b8d4593e8n@googlegroups.com>
<9c6d350c-7a3c-4b4f-924d-4ffdbbe5d560n@googlegroups.com>
<jwv8rxx1n9r.fsf-monnier+comp.arch@gnu.org>
<63b10a50-271c-45e8-a3f9-f8eb692cadb9n@googlegroups.com>
<jwvwnlhjgcm.fsf-monnier+comp.arch@gnu.org>
<4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 9 Nov 2021 23:05:01 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="bc28e0455447dfdc28d8964d8a3ea7da";
logging-data="2000"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18nNYzSc7RTLAOYR4w1z1Qg"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.0
Cancel-Lock: sha1:mWa2x7APD85bTvDK3TkuzZzHLM8=
In-Reply-To: <4d97acdd-2c0b-4503-818a-802377b33024n@googlegroups.com>
Content-Language: en-US

by: Ivan Godard - Tue, 9 Nov 2021 23:05 UTC

On 11/9/2021 1:26 PM, MitchAlsup wrote:
> On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
>> MitchAlsup [2021-11-09 08:11:06] wrote:
>>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>>>> Reminds me of a question: IIUC, the MADD instruction will typically have
>>>> a latency of a few cycles (let's say 3) but the latency of the signal
>>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>>>> ("carry") input can arrive as late as the last cycle without impacting
>>>> the overall latency.
>>>>
>>>> This means that in theory we can have a sequence of MADD instructions
>>>> all accumulating into the same register at a rate of 1 per cycle
>>>> (assuming the unit is pipelined), but if all inputs are "read" at
>>>> the same time, then the rate goes down to 1/3 per cycle.
>>>>
>>>> How do OoO cores avoid this throughput problem?
>>> We build the function unit to accept all 3 operands on the same clock.
>>> Then we don't use the 3rd operand until "later"
>> IOW you don't avoid this throughput problem?
>> I mean:
>>
>> accum1 = MADD(x1, x2, accum);
>> accum2 = MADD(x3, x4, accum1);
>>
>> ends up with a latency of 2*N cycles instead of N+1, right?
> <
> Yes,
> <
>> Because we can't start the second MADD before the first is over :-(
> <
> Yes,
> <
> But as long as you do not exceed the size of the execution window,
> it all works; you can put new instructions into the window every
> cycle, you can retire instructions from the window every cycle,
> and each function units can start a calculation every cycle. All
> without SW having to schedule the code or to apply any Herculean
> effort in code selection.

Except for reduction and other inter-instruction data dependencies. Then
you pay the full latency.

Pages:123

server_pubkey.txt

rocksolid light 0.9.81
clearnet tor