comp.arch / Re: (FP)MADD and data scheduling

Subject -- Author
* MADD instruction (integer multiply and add) -- Marcus
+- Re: MADD instruction (integer multiply and add) -- Marcus
+* Re: MADD instruction (integer multiply and add) -- antispam
|`* Re: MADD instruction (integer multiply and add) -- Marcus
| +- Re: MADD instruction (integer multiply and add) -- Thomas Koenig
| `* Re: MADD instruction (integer multiply and add) -- antispam
|  `- Re: MADD instruction (integer multiply and add) -- aph
+* Re: MADD instruction (integer multiply and add) -- MitchAlsup
|`* Re: MADD instruction (integer multiply and add) -- Marcus
| `* Re: MADD instruction (integer multiply and add) -- Thomas Koenig
|  +- Re: MADD instruction (integer multiply and add) -- BGB
|  `* Re: MADD instruction (integer multiply and add) -- Marcus
|   `- Re: MADD instruction (integer multiply and add) -- Thomas Koenig
+* Re: MADD instruction (integer multiply and add) -- BGB
|`* Re: MADD instruction (integer multiply and add) -- Marcus
| `* Re: MADD instruction (integer multiply and add) -- BGB
|  `* Re: MADD instruction (integer multiply and add) -- Marcus
|   +* Re: MADD instruction (integer multiply and add) -- EricP
|   |`* Re: MADD instruction (integer multiply and add) -- Marcus
|   | `- Re: MADD instruction (integer multiply and add) -- MitchAlsup
|   `* Re: MADD instruction (integer multiply and add) -- BGB
|    `* Re: MADD instruction (integer multiply and add) -- robf...@gmail.com
|     +- Re: MADD instruction (integer multiply and add) -- MitchAlsup
|     `- Re: MADD instruction (integer multiply and add) -- BGB
+* Re: MADD instruction (integer multiply and add) -- Terje Mathisen
|`* Re: MADD instruction (integer multiply and add) -- Thomas Koenig
| `* Re: MADD instruction (integer multiply and add) -- Terje Mathisen
|  `- Re: MADD instruction (integer multiply and add) -- Thomas Koenig
`* Re: MADD instruction (integer multiply and add) -- Theo
 `* Re: MADD instruction (integer multiply and add) -- Terje Mathisen
  +* Re: MADD instruction (integer multiply and add) -- MitchAlsup
  |`* Re: MADD instruction (integer multiply and add) -- Terje Mathisen
  | `- Re: MADD instruction (integer multiply and add) -- MitchAlsup
  +- Re: MADD instruction (integer multiply and add) -- BGB
  `* Re: MADD instruction (integer multiply and add) -- antispam
   +* Re: MADD instruction (integer multiply and add) -- MitchAlsup
   |`* Re: MADD instruction (integer multiply and add) -- MitchAlsup
   | +- Re: MADD instruction (integer multiply and add) -- Marcus
   | +* Re: MADD instruction (integer multiply and add) -- EricP
   | |`- Re: MADD instruction (integer multiply and add) -- BGB
   | `* (FP)MADD and data scheduling -- Stefan Monnier
   |  +* Re: (FP)MADD and data scheduling -- EricP
   |  |`- Re: (FP)MADD and data scheduling -- MitchAlsup
   |  `* Re: (FP)MADD and data scheduling -- MitchAlsup
   |   `* Re: (FP)MADD and data scheduling -- Stefan Monnier
   |    +* Re: (FP)MADD and data scheduling -- MitchAlsup
   |    |+* Re: (FP)MADD and data scheduling -- Stefan Monnier
   |    ||`* Re: (FP)MADD and data scheduling -- MitchAlsup
   |    || `* Re: (FP)MADD and data scheduling -- Stefan Monnier
   |    ||  `* Re: (FP)MADD and data scheduling -- Stefan Monnier
   |    ||   `- Re: (FP)MADD and data scheduling -- MitchAlsup
   |    |`* Re: (FP)MADD and data scheduling -- Ivan Godard
   |    | `* Re: (FP)MADD and data scheduling -- MitchAlsup
   |    |  `* Re: (FP)MADD and data scheduling -- Ivan Godard
   |    |   `* Re: (FP)MADD and data scheduling -- MitchAlsup
   |    |    `* Re: (FP)MADD and data scheduling -- Terje Mathisen
   |    |     `* Re: (FP)MADD and data scheduling -- MitchAlsup
   |    |      `- Re: (FP)MADD and data scheduling -- Terje Mathisen
   |    `* Re: (FP)MADD and data scheduling -- Thomas Koenig
   |     `- Re: (FP)MADD and data scheduling -- MitchAlsup
   `- Re: MADD instruction (integer multiply and add) -- Marcus

Re: (FP)MADD and data scheduling

<6db97b2e-7d66-4d5d-a48f-6e132aa364dfn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21938&group=comp.arch#21938

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 9 Nov 2021 23:26 UTC

On Tuesday, November 9, 2021 at 5:05:03 PM UTC-6, Ivan Godard wrote:
> On 11/9/2021 1:26 PM, MitchAlsup wrote:
> > On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
> >> MitchAlsup [2021-11-09 08:11:06] wrote:
> >>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >>>> Reminds me of a question: IIUC, the MADD instruction will typically have
> >>>> a latency of a few cycles (let's say 3) but the latency of the signal
> >>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >>>> ("carry") input can arrive as late as the last cycle without impacting
> >>>> the overall latency.
> >>>>
> >>>> This means that in theory we can have a sequence of MADD instructions
> >>>> all accumulating into the same register at a rate of 1 per cycle
> >>>> (assuming the unit is pipelined), but if all inputs are "read" at
> >>>> the same time, then the rate goes down to 1/3 per cycle.
> >>>>
> >>>> How do OoO cores avoid this throughput problem?
> >>> We build the function unit to accept all 3 operands on the same clock.
> >>> Then we don't use the 3rd operand until "later"
> >> IOW you don't avoid this throughput problem?
> >> I mean:
> >>
> >> accum1 = MADD(x1, x2, accum);
> >> accum2 = MADD(x3, x4, accum1);
> >>
> >> ends up with a latency of 2*N cycles instead of N+1, right?
> > <
> > Yes,
> > <
> >> Because we can't start the second MADD before the first is over :-(
> > <
> > Yes,
> > <
> > But as long as you do not exceed the size of the execution window,
> > it all works; you can put new instructions into the window every
> > cycle, you can retire instructions from the window every cycle,
> > and each function unit can start a calculation every cycle. All
> > without SW having to schedule the code or to apply any Herculean
> > effort in code selection.
> Except for reduction and other inter-instruction data dependencies. Then
> you pay the full latency.
<
So does essentially everybody.
<
It was not until the invention of the quire that reduction arithmetic
became modern.
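The full-latency cost of a reduction that Mitch and Ivan agree on above has a well-known software-side workaround the thread does not spell out: carry several independent accumulators so that back-to-back MADDs never depend on each other, and combine them once at the end. A minimal Python sketch (the function name and lane count are illustrative, not from the thread):

```python
# Sketch: hiding MADD latency in a reduction with multiple accumulators.
# A standard software technique; names and the lane count are illustrative.
def dot(xs, ys, lanes=3):
    accs = [0] * lanes                     # one independent dependence chain per lane
    for i, (x, y) in enumerate(zip(xs, ys)):
        accs[i % lanes] += x * y           # each chain only waits on itself
    return sum(accs)                       # short final combine at the end

print(dot([1, 2, 3, 4], [5, 6, 7, 8]))     # 70
```

With `lanes` at least the MADD latency, a pipelined unit can start one MADD per cycle instead of one every N cycles, at the price of the short final combine.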

Re: (FP)MADD and data scheduling

<smf1v7$kdg$1@dont-email.me>

From: iva...@millcomputing.com (Ivan Godard)
 by: Ivan Godard - Tue, 9 Nov 2021 23:58 UTC

On 11/9/2021 3:26 PM, MitchAlsup wrote:
> On Tuesday, November 9, 2021 at 5:05:03 PM UTC-6, Ivan Godard wrote:
>> On 11/9/2021 1:26 PM, MitchAlsup wrote:
>>> On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
>>>> MitchAlsup [2021-11-09 08:11:06] wrote:
>>>>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>>>>>> Reminds me of a question: IIUC, the MADD instruction will typically have
>>>>>> a latency of a few cycles (let's say 3) but the latency of the signal
>>>>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>>>>>> ("carry") input can arrive as late as the last cycle without impacting
>>>>>> the overall latency.
>>>>>>
>>>>>> This means that in theory we can have a sequence of MADD instructions
>>>>>> all accumulating into the same register at a rate of 1 per cycle
>>>>>> (assuming the unit is pipelined), but if all inputs are "read" at
>>>>>> the same time, then the rate goes down to 1/3 per cycle.
>>>>>>
>>>>>> How do OoO cores avoid this throughput problem?
>>>>> We build the function unit to accept all 3 operands on the same clock.
>>>>> Then we don't use the 3rd operand until "later"
>>>> IOW you don't avoid this throughput problem?
>>>> I mean:
>>>>
>>>> accum1 = MADD(x1, x2, accum);
>>>> accum2 = MADD(x3, x4, accum1);
>>>>
>>>> ends up with a latency of 2*N cycles instead of N+1, right?
>>> <
>>> Yes,
>>> <
>>>> Because we can't start the second MADD before the first is over :-(
>>> <
>>> Yes,
>>> <
>>> But as long as you do not exceed the size of the execution window,
>>> it all works; you can put new instructions into the window every
>>> cycle, you can retire instructions from the window every cycle,
>>> and each function unit can start a calculation every cycle. All
>>> without SW having to schedule the code or to apply any Herculean
>>> effort in code selection.
>> Except for reduction and other inter-instruction data dependencies. Then
>> you pay the full latency.
> <
> So does essentially everybody.
> <
> It was not until the invention of the quire that reduction arithmetic
> became modern.
>

Please forgive my ignorance, but what's a "quire" in this context?

Re: (FP)MADD and data scheduling

<ebad945b-0ee9-4544-8010-667be6aa513bn@googlegroups.com>

From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 10 Nov 2021 01:01 UTC

On Tuesday, November 9, 2021 at 5:58:33 PM UTC-6, Ivan Godard wrote:
> On 11/9/2021 3:26 PM, MitchAlsup wrote:
> > On Tuesday, November 9, 2021 at 5:05:03 PM UTC-6, Ivan Godard wrote:
> >> On 11/9/2021 1:26 PM, MitchAlsup wrote:
> >>> On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
> >>>> MitchAlsup [2021-11-09 08:11:06] wrote:
> >>>>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >>>>>> Reminds me of a question: IIUC, the MADD instruction will typically have
> >>>>>> a latency of a few cycles (let's say 3) but the latency of the signal
> >>>>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >>>>>> ("carry") input can arrive as late as the last cycle without impacting
> >>>>>> the overall latency.
> >>>>>>
> >>>>>> This means that in theory we can have a sequence of MADD instructions
> >>>>>> all accumulating into the same register at a rate of 1 per cycle
> >>>>>> (assuming the unit is pipelined), but if all inputs are "read" at
> >>>>>> the same time, then the rate goes down to 1/3 per cycle.
> >>>>>>
> >>>>>> How do OoO cores avoid this throughput problem?
> >>>>> We build the function unit to accept all 3 operands on the same clock.
> >>>>> Then we don't use the 3rd operand until "later"
> >>>> IOW you don't avoid this throughput problem?
> >>>> I mean:
> >>>>
> >>>> accum1 = MADD(x1, x2, accum);
> >>>> accum2 = MADD(x3, x4, accum1);
> >>>>
> >>>> ends up with a latency of 2*N cycles instead of N+1, right?
> >>> <
> >>> Yes,
> >>> <
> >>>> Because we can't start the second MADD before the first is over :-(
> >>> <
> >>> Yes,
> >>> <
> >>> But as long as you do not exceed the size of the execution window,
> >>> it all works; you can put new instructions into the window every
> >>> cycle, you can retire instructions from the window every cycle,
> >>> and each function unit can start a calculation every cycle. All
> >>> without SW having to schedule the code or to apply any Herculean
> >>> effort in code selection.
> >> Except for reduction and other inter-instruction data dependencies. Then
> >> you pay the full latency.
> > <
> > So does essentially everybody.
> > <
> > It was not until the invention of the quire that reduction arithmetic
> > became modern.
> >
> Please forgive my ignorance, but what's a "quire" in this context?
<
quire is the error-free accumulator designed into posits. It is as long
as required to be able to accumulate any result calculated and not
lose a single bit of precision (2048 bits for posit effective double)
<
We (they actually) could do something similar for 754..........costing
2100 bits in IEEE double (since you have to account for denorms)
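The quire idea can be modelled in software: a binary64 value is an integer mantissa times a power of two, so accumulating into one wide fixed-point integer is exact, with a single rounding at the very end. A toy Python model (the `FRAC` width and function names are illustrative; a hardware quire is a fixed-width register with the sizes Mitch quotes, not a bignum):

```python
import math

FRAC = 1200  # fraction bits; headroom for binary64 denormals (down to 2**-1074)

def quire_add(acc: int, x: float) -> int:
    """Exactly accumulate x into the integer 'quire' acc (no intermediate rounding)."""
    m, e = math.frexp(x)             # x == m * 2**e with 0.5 <= |m| < 1
    mi = int(m * (1 << 53))          # exact 53-bit signed integer mantissa
    return acc + (mi << (e - 53 + FRAC))

def quire_round(acc: int) -> float:
    """Convert back to float; the only rounding in the whole reduction."""
    return acc / (1 << FRAC)         # big-int division is correctly rounded

acc = 0
for x in (1e16, 1.0, -1e16):         # naive float summation returns 0.0 here
    acc = quire_add(acc, x)
print(quire_round(acc))              # 1.0
```

The example shows the point of an error-free accumulator: the cancellation that loses the 1.0 in ordinary rounded summation cannot happen, because nothing is rounded until the final conversion.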

Re: (FP)MADD and data scheduling

<smg10t$t20$1@gioia.aioe.org>

From: terje.ma...@tmsw.no (Terje Mathisen)
 by: Terje Mathisen - Wed, 10 Nov 2021 08:48 UTC

MitchAlsup wrote:
> On Tuesday, November 9, 2021 at 5:58:33 PM UTC-6, Ivan Godard wrote:
>> On 11/9/2021 3:26 PM, MitchAlsup wrote:
>>> On Tuesday, November 9, 2021 at 5:05:03 PM UTC-6, Ivan Godard wrote:
>>>> On 11/9/2021 1:26 PM, MitchAlsup wrote:
>>>>> On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
>>>>>> MitchAlsup [2021-11-09 08:11:06] wrote:
>>>>>>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>>>>>>>> Reminds me of a question: IIUC, the MADD instruction will typically have
>>>>>>>> a latency of a few cycles (let's say 3) but the latency of the signal
>>>>>>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>>>>>>>> ("carry") input can arrive as late as the last cycle without impacting
>>>>>>>> the overall latency.
>>>>>>>>
>>>>>>>> This means that in theory we can have a sequence of MADD instructions
>>>>>>>> all accumulating into the same register at a rate of 1 per cycle
>>>>>>>> (assuming the unit is pipelined), but if all inputs are "read" at
>>>>>>>> the same time, then the rate goes down to 1/3 per cycle.
>>>>>>>>
>>>>>>>> How do OoO cores avoid this throughput problem?
>>>>>>> We build the function unit to accept all 3 operands on the same clock.
>>>>>>> Then we don't use the 3rd operand until "later"
>>>>>> IOW you don't avoid this throughput problem?
>>>>>> I mean:
>>>>>>
>>>>>> accum1 = MADD(x1, x2, accum);
>>>>>> accum2 = MADD(x3, x4, accum1);
>>>>>>
>>>>>> ends up with a latency of 2*N cycles instead of N+1, right?
>>>>> <
>>>>> Yes,
>>>>> <
>>>>>> Because we can't start the second MADD before the first is over :-(
>>>>> <
>>>>> Yes,
>>>>> <
>>>>> But as long as you do not exceed the size of the execution window,
>>>>> it all works; you can put new instructions into the window every
>>>>> cycle, you can retire instructions from the window every cycle,
>>>>> and each function unit can start a calculation every cycle. All
>>>>> without SW having to schedule the code or to apply any Herculean
>>>>> effort in code selection.
>>>> Except for reduction and other inter-instruction data dependencies. Then
>>>> you pay the full latency.
>>> <
>>> So does essentially everybody.
>>> <
>>> It was not until the invention of the quire that reduction arithmetic
>>> became modern.
>>>
>> Please forgive my ignorance, but what's a "quire" in this context?
> <
> quire is the error-free accumulator designed into posits. It is as long
> as required to be able to accumulate any result calculated and not
> lose a single bit of precision (2048 bits for posit effective double)
> <
> We (they actually) could do something similar for 754..........costing
> 2100 bits in IEEE double (since you have to account for denorms)
>
We have discussed this many times here, afair always using the
SuperAccumulator term?

The key is that with carry-save/redundant storage, it becomes trivial to
accept one or more updates every cycle: Each full adder is just 2 or 3
gate delays, right?

If I was targeting just one update/cycle I would probably work in
multi-bit (8?) bit chunks with additional carry storage only at the
boundaries.

Actually extracting and rounding the final result requires an FF1
circuit that covers the entire array, a way to extract ~55/114 bits from
that point on and a circuit to OR together all trailing bits.

If the accumulator is stored as bytes, then you can just grab the 8 or
16 bytes that encompass the target mantissa and shift them down by 0-7
bits; that seems very doable, while the tail-end zero/non-zero detector
can be OR'ed into the bottom bit.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: (FP)MADD and data scheduling

<jwvlf1wyw5y.fsf-monnier+comp.arch@gnu.org>

From: monn...@iro.umontreal.ca (Stefan Monnier)
 by: Stefan Monnier - Wed, 10 Nov 2021 13:31 UTC

MitchAlsup [2021-11-09 14:30:32] wrote:
> On Tuesday, November 9, 2021 at 4:04:05 PM UTC-6, Stefan Monnier wrote:
>> >> For integers, we can replace the code with;
>> >>
>> >> x12 = MUL(x1, x2);
>> >> x34 = MUL(x3, x4);
>> >> accum1 = ADD(x12, accum);
>> >> accum2 = ADD(x34, accum1);
>> >>
>> >> with latency N+2, which is likely better than 2*N.
>> > <
>> > Given the IMUL takes 3 cycles, the count is N+3;
>> > accum1 = cannot begin until x12 = MUL completes.
>> I don't understand. My "N" was the latency of MUL (and MADD).
> >> So with your N=3 it means the MADD version takes 6 cycles while the
>> MUL+ADD version only takes 5.
> <
> +----------+----------+----------+----------+----------+
> | cycle 1  | cycle 2  | cycle 3  | cycle 4  | cycle 5  |
> +----------+----------+----------+----------+----------+
> Cycle 1: MUL x12 begins
> Cycle 2: MUL x34 begins
> Cycle 3: no instruction gets launched
> Cycle 4: 1st ADD begins; MUL x12 forwards to ADD
> Cycle 5: 2nd ADD begins; 1st ADD ends; MUL x34 forwards to 2nd ADD
> Cycle 6: 2nd ADD ends

So we agree: the MUL+ADD version has lower latency than the MADD version?

Stefan
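The disagreement over cycle counts can be checked with a small dependence-based model: schedule each op as soon as its inputs are ready, assuming fully pipelined units, unrestricted issue width, and zero-cost forwarding (latencies N=3 for MUL/MADD and 1 for ADD are taken from the discussion; the assumptions beyond that are mine):

```python
# ASAP (as-soon-as-possible) schedule over a dependence graph; a sketch,
# assuming fully pipelined units and unlimited issue width.
LAT = {"MUL": 3, "ADD": 1, "MADD": 3}

def finish_times(prog):
    """prog: list of (dest, op, sources). Returns the cycle each dest is ready."""
    ready = {}
    for dest, op, srcs in prog:
        start = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = start + LAT[op] - 1
    return ready

madd = [("a1", "MADD", ["x1", "x2", "a0"]),
        ("a2", "MADD", ["x3", "x4", "a1"])]
split = [("x12", "MUL", ["x1", "x2"]), ("x34", "MUL", ["x3", "x4"]),
         ("a1", "ADD", ["x12", "a0"]), ("a2", "ADD", ["x34", "a1"])]
print(finish_times(madd)["a2"], finish_times(split)["a2"])   # 6 5
```

This reproduces Stefan's counts (2N = 6 versus N+2 = 5); Mitch's table issues the second MUL a cycle later, which a single multiplier port would force but which does not change the critical path here.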

Re: (FP)MADD and data scheduling

<jwvee7ohww8.fsf-monnier+comp.arch@gnu.org>

From: monn...@iro.umontreal.ca (Stefan Monnier)
 by: Stefan Monnier - Wed, 10 Nov 2021 15:18 UTC

> So we agree: the MUL+ADD version has lower latency than the MADD version?

I find this disappointing: I thought non-legacy architectures wouldn't
suffer from such cases where a "complex" instruction results in slower
code than its decomposition into its constituent simpler operations.

I suspect that for FPMADD the problem doesn't occur because FPADD isn't
cheap (which both means that the decomposed code will be slower, and
also that even if we tried to be super-extra careful about scheduling,
FPMADD wouldn't be able to delay reading its third argument as much as
MADD would (it's not just a matter of stuffing the bits into a carry
save adder but you also need to first shift/align them according to the
exponent)).

Stefan
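Stefan's alignment point is the crux: before the addend's bits can enter the adder tree (carry-save or otherwise), its mantissa must be shifted by the exponent difference. A toy model with bare integer mantissas makes the step visible (values and the function name are illustrative; no normalization or rounding is attempted):

```python
# Minimal sketch of FP pre-alignment: shift the smaller operand's mantissa
# right by the exponent difference before the integer add.
def fp_add(m1, e1, m2, e2):
    """Add m1*2**e1 + m2*2**e2; returns an unnormalized (mantissa, exponent)."""
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    return m1 + (m2 >> (e1 - e2)), e1      # the shift is the alignment step

print(fp_add(0b1010, 3, 0b1100, 1))        # (13, 3): 10*8 + 12*2 = 104 = 13*8
```

That data-dependent shift is why an FPMADD cannot simply stuff its third operand into a carry-save row at the last cycle the way an integer MADD can.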

Re: (FP)MADD and data scheduling

<97a4f7b1-57dc-486f-8f8c-eb02c38b9698n@googlegroups.com>

From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 10 Nov 2021 19:39 UTC

On Wednesday, November 10, 2021 at 2:48:34 AM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Tuesday, November 9, 2021 at 5:58:33 PM UTC-6, Ivan Godard wrote:
> >> On 11/9/2021 3:26 PM, MitchAlsup wrote:
> >>> On Tuesday, November 9, 2021 at 5:05:03 PM UTC-6, Ivan Godard wrote:
> >>>> On 11/9/2021 1:26 PM, MitchAlsup wrote:
> >>>>> On Tuesday, November 9, 2021 at 1:21:29 PM UTC-6, Stefan Monnier wrote:
> >>>>>> MitchAlsup [2021-11-09 08:11:06] wrote:
> >>>>>>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >>>>>>>> Reminds me of a question: IIUC, the MADD instruction will typically have
> >>>>>>>> a latency of a few cycles (let's say 3) but the latency of the signal
> >>>>>>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >>>>>>>> ("carry") input can arrive as late as the last cycle without impacting
> >>>>>>>> the overall latency.
> >>>>>>>>
> >>>>>>>> This means that in theory we can have a sequence of MADD instructions
> >>>>>>>> all accumulating into the same register at a rate of 1 per cycle
> >>>>>>>> (assuming the unit is pipelined), but if all inputs are "read" at
> >>>>>>>> the same time, then the rate goes down to 1/3 per cycle.
> >>>>>>>>
> >>>>>>>> How do OoO cores avoid this throughput problem?
> >>>>>>> We build the function unit to accept all 3 operands on the same clock.
> >>>>>>> Then we don't use the 3rd operand until "later"
> >>>>>> IOW you don't avoid this throughput problem?
> >>>>>> I mean:
> >>>>>>
> >>>>>> accum1 = MADD(x1, x2, accum);
> >>>>>> accum2 = MADD(x3, x4, accum1);
> >>>>>>
> >>>>>> ends up with a latency of 2*N cycles instead of N+1, right?
> >>>>> <
> >>>>> Yes,
> >>>>> <
> >>>>>> Because we can't start the second MADD before the first is over :-(
> >>>>> <
> >>>>> Yes,
> >>>>> <
> >>>>> But as long as you do not exceed the size of the execution window,
> >>>>> it all works; you can put new instructions into the window every
> >>>>> cycle, you can retire instructions from the window every cycle,
> >>>>> and each function unit can start a calculation every cycle. All
> >>>>> without SW having to schedule the code or to apply any Herculean
> >>>>> effort in code selection.
> >>>> Except for reduction and other inter-instruction data dependencies. Then
> >>>> you pay the full latency.
> >>> <
> >>> So does essentially everybody.
> >>> <
> >>> It was not until the invention of the quire that reduction arithmetic
> >>> became modern.
> >>>
> >> Please forgive my ignorance, but what's a "quire" in this context?
> > <
> > quire is the error-free accumulator designed into posits. It is as long
> > as required to be able to accumulate any result calculated and not
> > lose a single bit of precision (2048 bits for posit effective double)
> > <
> > We (they actually) could do something similar for 754..........costing
> > 2100 bits in IEEE double (since you have to account for denorms)
> >
> We have discussed this many times here, afair always using the
> SuperAccumulator term?
>
> The key is that with carry-save/redundant storage, it becomes trivial to
> accept one or more updates every cycle: Each full adder is just 2 or 3
> gate delays, right?
<
A single 3-2 counter is 1 gate of delay.
A single 4-2 compressor is 2 gates of delay.
So, basically, you can add as many things into the accumulator as you
can afford to route to the accumulator. Certainly 8 lanes per cycle is
reasonable.
>
> If I was targeting just one update/cycle I would probably work in
> multi-bit (8?) bit chunks with additional carry storage only at the
> boundaries.
>
> Actually extracting and rounding the final result requires an FF1
> circuit that covers the entire array, a way to extract ~55/114 bits from
> that point on and a circuit to OR together all trailing bits.
<
In general, it is fairly easy to remember where the HoB is.
In general, assembling the sticky bit is only a few gates of delay (5-ish)
>
> If the accumulator is stored as bytes, then you can just grab the 8 or
> 16 bytes that encompass the target mantissa and shift it down by 0-7
> bits, that seems very doable, while the tail end zero/non-zero detector
> can be OR'ed into the bottom bit.
<
The posit guys suggest a model where one accesses the quire as an array in
memory, accessing 2 quadwords per accumulation based on the exponent
of the intermediate augend.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: (FP)MADD and data scheduling

<d1ca31a4-725f-470f-b278-d884424262dan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21950&group=comp.arch#21950

 by: MitchAlsup - Wed, 10 Nov 2021 19:41 UTC

On Wednesday, November 10, 2021 at 9:18:36 AM UTC-6, Stefan Monnier wrote:
> > So we agree: the MUL+ADD version has lower latency than the MADD version?
> I find this disappointing: I thought non-legacy architectures wouldn't
> suffer from such cases where a "complex" instruction results in slower
> code than its decomposition into its constituent simpler operations.
>
> I suspect that for FPMADD the problem doesn't occur because FPADD isn't
> cheap (which both means that the decomposed code will be slower, and
> also that even if we tried to be super-extra careful about scheduling,
> FPMADD wouldn't be able to delay reading its third argument as much as
> MADD would (it's not just a matter of stuffing the bits into a carry
> save adder but you also need to first shift/align them according to the
> exponent)).
<
Integer MAC has the property that no bits have been lost.
Floating MAC has the property that both the multiply and the add are performed
prior to any rounding (rounding = loss of precision).
<
So while IMAC can be done as 2 instructions, FMAC cannot.
>
>
> Stefan

Re: (FP)MADD and data scheduling

<smiqk5$2d8$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=21951&group=comp.arch#21951

 by: Terje Mathisen - Thu, 11 Nov 2021 10:17 UTC

MitchAlsup wrote:
> On Wednesday, November 10, 2021 at 2:48:34 AM UTC-6, Terje Mathisen wrote:
>> Actually extracting and rounding the final result requires an FF1
>> circuit that covers the entire array, a way to extract ~55/114 bits from
>> that point on and a circuit to OR together all trailing bits.
> <
> In general, it is fairly easy to remember where the HoB is.
> In general, assembling the sticky bit is only a few gates of delay (5-ish)
>>
Sticky is just a wide zero/non-zero detector, right?

>> If the accumulator is stored as bytes, then you can just grab the 8 or
>> 16 bytes that encompasses the target mantissa and shift it down by 0-7
>> bits, that seems very doable, while the tail end zero/non-zero detector
>> can be OR'ed into the bottom bit.
> <
> The posit guys suggest a model where one accesses the quire as an array in
> memory, accessing 2 quadwords per accumulation based on the exponent
> of the intermediate augend.

If I had to implement it in SW, I'm guessing that an approach which
allows delayed carry updates would be nice.

I.e. something like 48 bits stored in each 64-bit block, each addition
splits the input into two parts based on the exponent, and we can
blindly add up to 65535 entries before we have to propagate the carries,
i.e. usually just a single iteration of this at the end?

Using 50% storage (32 bits in each 64-bit word) would make the alignment
slightly faster.

It would also be possible to let each block be either positive or
negative, with just a very small increase in the final carry/overflow
propagation.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: (FP)MADD and data scheduling

<sml30k$aqj$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=21960&group=comp.arch#21960

 by: Thomas Koenig - Fri, 12 Nov 2021 06:53 UTC

Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
> MitchAlsup [2021-11-09 08:11:06] wrote:
>> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
>>> Reminds me of a question: IIUC, the MADD instruction will typically have
>>> a latency of a few cycles (let's say 3) but the latency of the signal
>>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
>>> ("carry") input can arrive as late as the last cycle without impacting
>>> the overall latency.
>>>
>>> This means that in theory we can have a sequence of MADD instructions
>>> all accumulating into the same register at a rate of 1 per cycle
>>> (assuming the unit is pipelined), but if all inputs are "read" at
>>> the same time, then the rate goes down to 1/3 per cycle.
>>>
>>> How do OoO cores avoid this throughput problem?
>> We build the function unit to accept all 3 operands on the same clock.
>> Then we don't use the 3rd operand until "later"
>
> IOW you don't avoid this throughput problem?
> I mean:
>
> accum1 = MADD(x1, x2, accum);
> accum2 = MADD(x3, x4, accum1);
>
> ends up with a latency of 2*N cycles instead of N+1, right?
> Because we can't start the second MADD before the first is over :-(

Depends, I think.

The most likely use case is going to be multiplication of
long integers for crypto. For this, you will need access
to the low and high part of the result, and you will
need another add. Let's also assume that the
architecture has separate high and low multiply + adds.

So, looking at the pseudocode that Mitch recently posted:

{mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
{acarry, sum[i+j]} = {sum[i+j] + acarry} + product;

with i the inner loop and J in a register. This would
then translate to (roughly)

LD Rmi, R_multiplicand + i << 3
LD Rs_ij, R_sum
MADD R_product, R_mi, R_mj, R_mcarry
MADDH R_mcarry, R_mi, R_mj, R_mcarry
ADD R_temp, Rs_ij, R_acarry ! Sets a carry
ADDC R_acarry, #0 ! Increment if carry set
ADD R_sum, R_temp, R_product ! Sets a carry
ST Rs_ij, R_sum
ADDC R_acarry, #0 ! Increment if carry set

The summation is a bit awkward (if I have that right), but there
is an advantage in using the multiply+add, as the latency is used
up by other instructions.

Re: (FP)MADD and data scheduling

<01cdf744-2661-4581-ad43-89c1044bef14n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21971&group=comp.arch#21971

 by: MitchAlsup - Fri, 12 Nov 2021 17:42 UTC

On Friday, November 12, 2021 at 12:53:10 AM UTC-6, Thomas Koenig wrote:
> Stefan Monnier <mon...@iro.umontreal.ca> schrieb:
> > MitchAlsup [2021-11-09 08:11:06] wrote:
> >> On Tuesday, November 9, 2021 at 7:30:35 AM UTC-6, Stefan Monnier wrote:
> >>> Reminds me of a question: IIUC, the MADD instruction will typically have
> >>> a latency of a few cycles (let's say 3) but the latency of the signal
> >>> is not the same for all 3 inputs. More specifically, IIUC the 3rd
> >>> ("carry") input can arrive as late as the last cycle without impacting
> >>> the overall latency.
> >>>
> >>> This means that in theory we can have a sequence of MADD instructions
> >>> all accumulating into the same register at a rate of 1 per cycle
> >>> (assuming the unit is pipelined), but if all inputs are "read" at
> >>> the same time, then the rate goes down to 1/3 per cycle.
> >>>
> >>> How do OoO cores avoid this throughput problem?
> >> We build the function unit to accept all 3 operands on the same clock.
> >> Then we don't use the 3rd operand until "later"
> >
> > IOW you don't avoid this throughput problem?
> > I mean:
> >
> > accum1 = MADD(x1, x2, accum);
> > accum2 = MADD(x3, x4, accum1);
> >
> > ends up with a latency of 2*N cycles instead of N+1, right?
> > Because we can't start the second MADD before the first is over :-(
> Depends, I think.
>
> The most likely use case is going to be multiplication of
> long integers for crypto. For this, you will need access
> to the low and high part of the result, and you will
> need another add. Let's also assume that the
> architecture has separate high and low multiply + adds.
>
> So, looking at the pseudocode that Mitch recently posted:
>
> {mcarry, product} = multiplicand[i]*multiplier[j] + mcarry;
> {acarry, sum[i+j]} = {sum[i+j] + acarry} + product;
>
> with i the inner loop index and j in a register. This would
> then translate to (roughly)
>
> LD Rmi, R_multiplicand + i << 3
> LD Rs_ij, R_sum
> MADD R_product, R_mi, R_mj, R_mcarry
> MADDH R_mcarry, R_mi, R_mj, R_mcarry
> ADD R_temp, Rs_ij, R_acarry ! Sets a carry
> ADDC R_acarry, #0 ! Increment if carry set
> ADD R_sum, R_temp, R_product ! Sets a carry
> ST Rs_ij, R_sum
> ADDC R_acarry, #0 ! Increment if carry set
<
< then translate to (exactly)
ADD Rij,Ri,Rj // you seem to have missed this
LDD Rmc,[Rmpc+Rj<<3] // multiplicand[i]
LDD Rsum,[Rsmp+Rij<<3] // sum[i+j]
CARRY Rcm,{IO}
MUL Rpr,Rmp,Rmc // MAC {Rcm,Rpr},{Rmp,Rcm},Rmc
CARRY Rca,{IO}
ADD Rsum,Rsum,Rpr // ADC {Rca,Rsum},{Rsum,Rca},Rpr
STD Rsum,[Rsmp+Rij<<3] // sum[i+j]
<
>
> The summation is a bit awkward (if I have that right), but there
> is an advantage in using the multiply+add, as the latency is used
> up by other instructions.
