devel / comp.arch / Re: MADD instruction (integer multiply and add)

Subject  Author
* MADD instruction (integer multiply and add)  Marcus
+- Re: MADD instruction (integer multiply and add)  Marcus
+* Re: MADD instruction (integer multiply and add)  antispam
|`* Re: MADD instruction (integer multiply and add)  Marcus
| +- Re: MADD instruction (integer multiply and add)  Thomas Koenig
| `* Re: MADD instruction (integer multiply and add)  antispam
|  `- Re: MADD instruction (integer multiply and add)  aph
+* Re: MADD instruction (integer multiply and add)  MitchAlsup
|`* Re: MADD instruction (integer multiply and add)  Marcus
| `* Re: MADD instruction (integer multiply and add)  Thomas Koenig
|  +- Re: MADD instruction (integer multiply and add)  BGB
|  `* Re: MADD instruction (integer multiply and add)  Marcus
|   `- Re: MADD instruction (integer multiply and add)  Thomas Koenig
+* Re: MADD instruction (integer multiply and add)  BGB
|`* Re: MADD instruction (integer multiply and add)  Marcus
| `* Re: MADD instruction (integer multiply and add)  BGB
|  `* Re: MADD instruction (integer multiply and add)  Marcus
|   +* Re: MADD instruction (integer multiply and add)  EricP
|   |`* Re: MADD instruction (integer multiply and add)  Marcus
|   | `- Re: MADD instruction (integer multiply and add)  MitchAlsup
|   `* Re: MADD instruction (integer multiply and add)  BGB
|    `* Re: MADD instruction (integer multiply and add)  robf...@gmail.com
|     +- Re: MADD instruction (integer multiply and add)  MitchAlsup
|     `- Re: MADD instruction (integer multiply and add)  BGB
+* Re: MADD instruction (integer multiply and add)  Terje Mathisen
|`* Re: MADD instruction (integer multiply and add)  Thomas Koenig
| `* Re: MADD instruction (integer multiply and add)  Terje Mathisen
|  `- Re: MADD instruction (integer multiply and add)  Thomas Koenig
`* Re: MADD instruction (integer multiply and add)  Theo
 `* Re: MADD instruction (integer multiply and add)  Terje Mathisen
  +* Re: MADD instruction (integer multiply and add)  MitchAlsup
  |`* Re: MADD instruction (integer multiply and add)  Terje Mathisen
  | `- Re: MADD instruction (integer multiply and add)  MitchAlsup
  +- Re: MADD instruction (integer multiply and add)  BGB
  `* Re: MADD instruction (integer multiply and add)  antispam
   +* Re: MADD instruction (integer multiply and add)  MitchAlsup
   |`* Re: MADD instruction (integer multiply and add)  MitchAlsup
   | +- Re: MADD instruction (integer multiply and add)  Marcus
   | +* Re: MADD instruction (integer multiply and add)  EricP
   | |`- Re: MADD instruction (integer multiply and add)  BGB
   | `* (FP)MADD and data scheduling  Stefan Monnier
   |  +* Re: (FP)MADD and data scheduling  EricP
   |  |`- Re: (FP)MADD and data scheduling  MitchAlsup
   |  `* Re: (FP)MADD and data scheduling  MitchAlsup
   |   `* Re: (FP)MADD and data scheduling  Stefan Monnier
   |    +* Re: (FP)MADD and data scheduling  MitchAlsup
   |    |+* Re: (FP)MADD and data scheduling  Stefan Monnier
   |    ||`* Re: (FP)MADD and data scheduling  MitchAlsup
   |    || `* Re: (FP)MADD and data scheduling  Stefan Monnier
   |    ||  `* Re: (FP)MADD and data scheduling  Stefan Monnier
   |    ||   `- Re: (FP)MADD and data scheduling  MitchAlsup
   |    |`* Re: (FP)MADD and data scheduling  Ivan Godard
   |    | `* Re: (FP)MADD and data scheduling  MitchAlsup
   |    |  `* Re: (FP)MADD and data scheduling  Ivan Godard
   |    |   `* Re: (FP)MADD and data scheduling  MitchAlsup
   |    |    `* Re: (FP)MADD and data scheduling  Terje Mathisen
   |    |     `* Re: (FP)MADD and data scheduling  MitchAlsup
   |    |      `- Re: (FP)MADD and data scheduling  Terje Mathisen
   |    `* Re: (FP)MADD and data scheduling  Thomas Koenig
   |     `- Re: (FP)MADD and data scheduling  MitchAlsup
   `- Re: MADD instruction (integer multiply and add)  Marcus

MADD instruction (integer multiply and add)

<slm4ja$e0b$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21781&group=comp.arch#21781

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 14:10:01 +0100
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <slm4ja$e0b$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 31 Oct 2021 13:10:02 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9f47d7f3b2855841cbd486681992b10d";
logging-data="14347"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19LkeSQw+r9DM2ULyLsRxiC4aE0eZEKYgk="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:o/jcEbt58hMSZ8/8JxkL+uJIAEE=
Content-Language: en-US
X-Mozilla-News-Host: snews://news.eternal-september.org:563
 by: Marcus - Sun, 31 Oct 2021 13:10 UTC

Hello group!

I just did some analysis of some code that included integer
multiplications, and concluded that having a fused multiply-and-add
(a.k.a. MAC) instruction would eliminate many ADD instructions, since
a multiplication operation more often than not comes together with a
following addition operation.

Furthermore, the instruction eliminates the (potential) data dependency
latency between the MUL instruction and the ADD instruction.

So, I went ahead and added a simple MADD instruction to my ISA, of the
form:

MADD R1, R2, R3 ; R1 <- R1 + R2 * R3

This was a very simple addition to my hardware implementation, and in
my FPGA design it didn't really add any extra delay (the addition is
performed in the pipeline stage directly after the multiplication,
concurrently with some other result fixup operations). Thus the cost of
the ADD instruction could be completely eliminated (both in code size
and in instruction cycle count).
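
For readers who want the semantics spelled out, here is a minimal C sketch of what the destructive three-operand MADD above computes, together with the kind of index calculation it tends to absorb. The function names and the `stride` parameter are illustrative only and do not come from the post; unsigned 32-bit arithmetic is assumed so that wrap-around is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the instruction form from the post:
 *   MADD R1, R2, R3   ; R1 <- R1 + R2 * R3
 * Unsigned arithmetic is used so that overflow wraps modulo 2^32. */
static inline uint32_t madd(uint32_t acc, uint32_t a, uint32_t b)
{
    return acc + a * b;
}

/* Typical use: flat index of element (row, col) in a row-major matrix.
 * `stride` (elements per row) is a hypothetical parameter name. */
static inline uint32_t flat_index(uint32_t row, uint32_t col, uint32_t stride)
{
    assert(col < stride);            /* column must lie inside the row */
    return madd(col, row, stride);   /* col + row * stride */
}
```

Without MADD, `flat_index` would need a MUL followed by a dependent ADD; with it, the addend is simply the starting value of the destination register.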

I also noticed that ARMv8 has a similar (but more flexible 4-operand)
instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
MADD instruction (the addend is set to zero).

However, many other ISAs (apart from DSP ISAs) seem to be lacking this
instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

It seems to me that it's a very useful instruction, and that it has
a relatively small hardware cost.

/Marcus

Re: MADD instruction (integer multiply and add)

<slm55e$hms$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21782&group=comp.arch#21782

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 14:19:41 +0100
Organization: A noiseless patient Spider
Lines: 7
Message-ID: <slm55e$hms$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 31 Oct 2021 13:19:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9f47d7f3b2855841cbd486681992b10d";
logging-data="18140"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1900UD5hU48YIdmLlQAe6H5kBscMOT6MKQ="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:vPa5SaQWSVJqS3Gj60HZLUWS0Pk=
In-Reply-To: <slm4ja$e0b$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Sun, 31 Oct 2021 13:19 UTC

On 2021-10-31, Marcus wrote:
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

Correction: The POWER ISA has MADDLD.

/Marcus

Re: MADD instruction (integer multiply and add)

<slm94k$pqk$1@z-news.wcss.wroc.pl>

https://www.novabbs.com/devel/article-flat.php?id=21784&group=comp.arch#21784

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.ams4!peer.am4.highwinds-media.com!news.highwinds-media.com!newsfeed.neostrada.pl!unt-exc-02.news.neostrada.pl!newsfeed.pionier.net.pl!pwr.wroc.pl!news.wcss.wroc.pl!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 14:27:32 +0000 (UTC)
Organization: Politechnika Wroclawska
Lines: 51
Message-ID: <slm94k$pqk$1@z-news.wcss.wroc.pl>
References: <slm4ja$e0b$1@dont-email.me>
NNTP-Posting-Host: hera.math.uni.wroc.pl
X-Trace: z-news.wcss.wroc.pl 1635690452 26452 156.17.86.1 (31 Oct 2021 14:27:32 GMT)
X-Complaints-To: abuse@news.pwr.wroc.pl
NNTP-Posting-Date: Sun, 31 Oct 2021 14:27:32 +0000 (UTC)
Cancel-Lock: sha1:cMVOnH5Bm7ojwSjd5ZyH82Ucb6Q=
User-Agent: tin/2.4.3-20181224 ("Glen Mhor") (UNIX) (Linux/4.19.0-10-amd64 (x86_64))
X-Received-Bytes: 2967
 by: antis...@math.uni.wroc.pl - Sun, 31 Oct 2021 14:27 UTC

Marcus <m.delete@this.bitsnbites.eu> wrote:
> Hello group!
>
> I just did some analysis of some code that included integer
> multiplications, and concluded that having a fused multiply-and-add
> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> a multiplication operation more often than not comes together with a
> following addition operation.
>
> Furthermore the instruction eliminates the (potential) data dependency
> latency between the MUL instruction and the ADD instruction.
>
> So, I went ahead and added a simple MADD instruction to my ISA, on the
> form:
>
> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
>
> This was a very simple addition to my hardware implementation, and in
> my FPGA design it didn't really add any extra delay (the addition is
> performed in the pipeline stage directly after the multiplication,
> concurrently with some other result fixup operations). Thus the cost of
> the ADD instruction could be completely eliminated (both in code size
> and in instruction cycle count).
>
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

x86 has it in its vector extensions.

> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.

It is a very useful instruction, in particular when correctly
rounded. However, it is not clear to me that the cost is small.
Namely, consider

pr1 = x*y
pr2 = madd(x, y, -pr1)

where madd means hardware madd. When correctly rounded,
pr2 will deliver the low-order bits of the product, which means
that your multiplier has to produce all bits of the product.
IIUC a "normal" FP multiplier can skip most of the low-order
bits, just setting a few flags to be able to round correctly.

--
Waldek Hebisch

Re: MADD instruction (integer multiply and add)

<slmcja$5ph$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21787&group=comp.arch#21787

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 16:26:33 +0100
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <slmcja$5ph$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slm94k$pqk$1@z-news.wcss.wroc.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 31 Oct 2021 15:26:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9f47d7f3b2855841cbd486681992b10d";
logging-data="5937"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fSkC9ExU9Tpxlked3Kh3qS1PeB1cxoXU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:DTZdllEC6x2YNPSXDu6i/cICKKI=
In-Reply-To: <slm94k$pqk$1@z-news.wcss.wroc.pl>
Content-Language: en-US
 by: Marcus - Sun, 31 Oct 2021 15:26 UTC

On 2021-10-31 15:27, antispam@math.uni.wroc.pl wrote:
> Marcus <m.delete@this.bitsnbites.eu> wrote:
>> Hello group!
>>
>> I just did some analysis of some code that included integer
>> multiplications, and concluded that having a fused multiply-and-add
>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>> a multiplication operation more often than not comes together with a
>> following addition operation.
>>
>> Furthermore the instruction eliminates the (potential) data dependency
>> latency between the MUL instruction and the ADD instruction.
>>
>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>> form:
>>
>> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
>>
>> This was a very simple addition to my hardware implementation, and in
>> my FPGA design it didn't really add any extra delay (the addition is
>> performed in the pipeline stage directly after the multiplication,
>> concurrently with some other result fixup operations). Thus the cost of
>> the ADD instruction could be completely eliminated (both in code size
>> and in instruction cycle count).
>>
>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>> MADD instruction (the addend is set to zero).
>>
>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>
> x86 has it in its vector extensions.

Which instruction is that? I know about FMA3 for floating-point, but how
about integer?

However, for the scalar integer ISA they don't have it, which is what I
was thinking about (things like memory address calculations for matrix
indices etc. use integer MUL + ADD). See:

x86: https://godbolt.org/z/dTEej5rP4 (imull + leal)

...but ARM does, for instance:

ARMv7: https://godbolt.org/z/K3Ecsavq1 (mla) <- Correction: ARMv7 has it
ARMv8: https://godbolt.org/z/EPczq6Wcc (madd)

>> It seems to me that it's a very useful instruction, and that it has
>> a relatively small hardware cost.
>
> It is very useful instruction, in particular when correctly
> rounded. However, it is not clear for me that cost is small.
> Namely, consider
>
> pr1 = x*y
> pr2 = madd(x, y, -pr1)
>
> where madd means hardware madd. When correctly rounded
> pr2 will deliver low order bits of product. Which means
> that your multiplier has to produce all bits of product.
> IIUC "normal" FP multiplier can skip most of low order
> bits, just setting few flags to be able to round correctly.

Are you talking about integer or floating-point?

/Marcus

Re: MADD instruction (integer multiply and add)

<slmg8c$rfa$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=21789&group=comp.arch#21789

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-5748-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 16:29:00 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <slmg8c$rfa$1@newsreader4.netcologne.de>
References: <slm4ja$e0b$1@dont-email.me> <slm94k$pqk$1@z-news.wcss.wroc.pl>
<slmcja$5ph$1@dont-email.me>
Injection-Date: Sun, 31 Oct 2021 16:29:00 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-5748-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:5748:0:7285:c2ff:fe6c:992d";
logging-data="28138"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 31 Oct 2021 16:29 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:

> However for the scalar integer ISA they don't have it, which is what I
> was thinking about (things like memory address calculations for matrix
> indices etc use integer MUL + ADD).

Because multiplication used to be so expensive, strength reduction,
i.e. replacing multiplication with repeated addition, has received
considerable attention in compiler optimization. IIRC this is
something the very first optimizing compiler, for FORTRAN, did.

And if strength reduction does not work:

Even if an ISA does not have multiply + add as an instruction,
many of them support

MUL R1,R2,R3 // R1=R2*R3
L R4, R5(R1) // R4 = Mem(R5+R1)

so the addition is done in the load/store instruction. Then,
there is less advantage for array operations because an indirect
load is just a load with a fixed offset of zero.
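
To make the strength-reduction point concrete, here is a small C sketch of the same loop before and after the transformation. It is written by hand for illustration (a compiler would do this internally), and the function and parameter names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum one column of a row-major matrix; one multiply per iteration. */
static long sum_column_mul(const int *m, size_t rows, size_t stride, size_t col)
{
    assert(col < stride);
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        sum += m[i * stride + col];   /* MUL + ADD (or one MADD) per index */
    return sum;
}

/* The same loop after strength reduction: i * stride becomes a
 * running offset that is bumped by `stride` each iteration. */
static long sum_column_add(const int *m, size_t rows, size_t stride, size_t col)
{
    assert(col < stride);
    long sum = 0;
    size_t idx = col;                 /* equals i * stride + col for i = 0 */
    for (size_t i = 0; i < rows; i++) {
        sum += m[idx];
        idx += stride;                /* repeated addition replaces the multiply */
    }
    return sum;
}
```

In the second version there is no multiply left for a MADD to fuse with, which is one reason the instruction shows up less often in compiled loops than source code might suggest.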

Re: MADD instruction (integer multiply and add)

<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21790&group=comp.arch#21790

X-Received: by 2002:a37:f902:: with SMTP id l2mr18561805qkj.511.1635698149406;
Sun, 31 Oct 2021 09:35:49 -0700 (PDT)
X-Received: by 2002:a05:6808:1444:: with SMTP id x4mr16252335oiv.157.1635698149220;
Sun, 31 Oct 2021 09:35:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 31 Oct 2021 09:35:49 -0700 (PDT)
In-Reply-To: <slm4ja$e0b$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7d65:9bc5:f275:a261;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7d65:9bc5:f275:a261
References: <slm4ja$e0b$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 31 Oct 2021 16:35:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 55
 by: MitchAlsup - Sun, 31 Oct 2021 16:35 UTC

On Sunday, October 31, 2021 at 8:10:04 AM UTC-5, Marcus wrote:
> Hello group!
>
> I just did some analysis of some code that included integer
> multiplications, and concluded that having a fused multiply-and-add
> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> a multiplication operation more often than not comes together with a
> following addition operation.
<
My 66000 has dabbled with having an integer MADD instruction.
>
> Furthermore the instruction eliminates the (potential) data dependency
> latency between the MUL instruction and the ADD instruction.
<
For many integer multiplies, the end result is an index into an array/matrix
where one has access to a "free" ADD in the memory reference addressing
mode.
>
> So, I went ahead and added a simple MADD instruction to my ISA, on the
> form:
>
> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
>
> This was a very simple addition to my hardware implementation, and in
> my FPGA design it didn't really add any extra delay (the addition is
> performed in the pipeline stage directly after the multiplication,
> concurrently with some other result fixup operations). Thus the cost of
> the ADD instruction could be completely eliminated (both in code size
> and in instruction cycle count).
<
Done "properly" the multiply and the MADD both produce 2×n results
(at least optionally).
<
But the cost of the ADD should be 2-gates of delay (where a 2-input
carry select adder is 11 gates of delay for 64-bit results,) because
the ADD portion can be done before conversion from carry-save to
binary form.
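
The carry-save argument can be modelled in a few lines of C. This is a software sketch of the idea, not a description of any particular multiplier: `ps` and `pc` stand for the sum and carry words that a multiplier tree would hand to its final adder, and those names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* One level of 3:2 compression (a row of full adders). Three inputs
 * are reduced to a sum word and a carry word; no carry propagates
 * along the word, so the delay is that of a single full adder. */
static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *sum, uint64_t *carry)
{
    *sum   = a ^ b ^ c;
    *carry = ((a & b) | (a & c) | (b & c)) << 1;
}

/* The product arrives in carry-save form as ps + pc. Folding in the
 * addend costs one extra 3:2 level; the single carry-propagate add
 * at the end is needed for a plain MUL anyway. */
static uint64_t madd_from_carry_save(uint64_t ps, uint64_t pc, uint64_t addend)
{
    uint64_t s, c;
    csa(ps, pc, addend, &s, &c);
    assert(s + c == ps + pc + addend);   /* invariant, modulo 2^64 */
    return s + c;                        /* the one carry-propagate addition */
}
```

The extra work for the ADD is the `csa` call alone, which is where the "2 gates of delay" figure comes from, as opposed to a second full-width adder.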
>
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
<
It never made the "statistics" needed to be warranted; it was always on the edge.
The other reason is that the 3-operand encoding field is nearly full.
>
> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.
>
> /Marcus

Re: MADD instruction (integer multiply and add)

<slmnvr$tl0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21794&group=comp.arch#21794

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 13:40:50 -0500
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <slmnvr$tl0$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 31 Oct 2021 18:40:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="255153086e9a2e933d3d50607344d723";
logging-data="30368"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18EGhdT3vY9+y1dEp3J/uWK"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:3LWgW+WFYrTYGKLQvNvboPbuqeI=
In-Reply-To: <slm4ja$e0b$1@dont-email.me>
Content-Language: en-US
 by: BGB - Sun, 31 Oct 2021 18:40 UTC

On 10/31/2021 8:10 AM, Marcus wrote:
> Hello group!
>
> I just did some analysis of some code that included integer
> multiplications, and concluded that having a fused multiply-and-add
> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> a multiplication operation more often than not comes together with a
> following addition operation.
>
> Furthermore the instruction eliminates the (potential) data dependency
> latency between the MUL instruction and the ADD instruction.
>
> So, I went ahead and added a simple MADD instruction to my ISA, on the
> form:
>
>   MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>
> This was a very simple addition to my hardware implementation, and in
> my FPGA design it didn't really add any extra delay (the addition is
> performed in the pipeline stage directly after the multiplication,
> concurrently with some other result fixup operations). Thus the cost of
> the ADD instruction could be completely eliminated (both in code size
> and in instruction cycle count).
>
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>
> It seems to me that it's a very useful instruction, and that it has
> a relatively small hardware cost.
>

Goes looking:
Yeah, it appears that this is a pretty common pattern.

Though, MUL isn't a super high priority op (in a few programs, several
orders of magnitude less common than the load/store ops; even vs ADD by
itself).

The debate, I guess, is whether it would save enough to be worthwhile.

Going by some of my clock-cycle usage stats, it would likely save around
0.01% of the total clock-cycles for programs like Doom and similar.

It does result in an increase in LUT cost and timing getting a little
worse. I ended up shoving it into the adders which deal with high-order
products and sign-extension. Causes resource cost to increase by around 2%.

But, there are cases where it could be useful.
Went and defined a DMAC extension, adding:
MACS.L Rm, Ro, Rn //Rn=SExt(Rn+(Rm*Ro))
MACU.L Rm, Ro, Rn //Rn=ZExt(Rn+(Rm*Ro))
DMACS.L Rm, Ro, Rn //Rn=Rn+(Rm*Ro) //Widening MAC
DMACU.L Rm, Ro, Rn //Rn=Rn+(Rm*Ro) //Widening MAC

Debate being mostly whether it is worth a 2% increase in LUT budget
(~1kLUT).
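
As a reading aid, the four DMAC forms listed above can be modelled in C roughly as follows. This is an interpretation of the one-line comments in the post, not a specification: it assumes 64-bit registers, that the `.L` forms operate on the low 32 bits, and that "widening" means a 32x32 to 64-bit product added to a 64-bit accumulator. The lowercase function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* MACS.L: 32-bit multiply-accumulate, result sign-extended to 64 bits. */
static uint64_t macs_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    uint32_t r = (uint32_t)rn + (uint32_t)rm * (uint32_t)ro;  /* wraps mod 2^32 */
    return (uint64_t)(int64_t)(int32_t)r;  /* two's-complement conversion assumed */
}

/* MACU.L: 32-bit multiply-accumulate, result zero-extended to 64 bits. */
static uint64_t macu_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    uint32_t r = (uint32_t)rn + (uint32_t)rm * (uint32_t)ro;
    return r;
}

/* DMACS.L: signed 32x32 -> 64-bit product added to a 64-bit accumulator. */
static uint64_t dmacs_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    int64_t prod = (int64_t)(int32_t)(uint32_t)rm * (int32_t)(uint32_t)ro;
    return rn + (uint64_t)prod;
}

/* DMACU.L: unsigned 32x32 -> 64-bit product added to a 64-bit accumulator. */
static uint64_t dmacu_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    uint64_t prod = (uint64_t)(uint32_t)rm * (uint32_t)ro;
    return rn + prod;
}

/* Small usage example showing where the four forms differ. */
static void mac_demo(void)
{
    assert(macs_l(0x7FFFFFFFu, 1, 1) == 0xFFFFFFFF80000000ull);
    assert(macu_l(0x7FFFFFFFu, 1, 1) == 0x0000000080000000ull);
    assert(dmacs_l(10, 0xFFFFFFFFu, 3) == 7);               /* 10 + (-1 * 3) */
    assert(dmacu_l(10, 0xFFFFFFFFu, 3) == 0x300000007ull);  /* 10 + 4294967295 * 3 */
}
```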

...

Re: MADD instruction (integer multiply and add)

<slmoso$b12$1@z-news.wcss.wroc.pl>

https://www.novabbs.com/devel/article-flat.php?id=21797&group=comp.arch#21797

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.ams4!peer.am4.highwinds-media.com!news.highwinds-media.com!newsfeed.neostrada.pl!unt-exc-02.news.neostrada.pl!wsisiz.edu.pl!news.icm.edu.pl!newsfeed.pionier.net.pl!pwr.wroc.pl!news.wcss.wroc.pl!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Sun, 31 Oct 2021 18:56:25 +0000 (UTC)
Organization: Politechnika Wroclawska
Lines: 88
Message-ID: <slmoso$b12$1@z-news.wcss.wroc.pl>
References: <slm4ja$e0b$1@dont-email.me> <slm94k$pqk$1@z-news.wcss.wroc.pl> <slmcja$5ph$1@dont-email.me>
NNTP-Posting-Host: hera.math.uni.wroc.pl
X-Trace: z-news.wcss.wroc.pl 1635706585 11298 156.17.86.1 (31 Oct 2021 18:56:25 GMT)
X-Complaints-To: abuse@news.pwr.wroc.pl
NNTP-Posting-Date: Sun, 31 Oct 2021 18:56:25 +0000 (UTC)
Cancel-Lock: sha1:Y9x1LQ7Dski1OMrESMFaSrPOo0Y=
User-Agent: tin/2.4.3-20181224 ("Glen Mhor") (UNIX) (Linux/4.19.0-10-amd64 (x86_64))
X-Received-Bytes: 4862
 by: antis...@math.uni.wroc.pl - Sun, 31 Oct 2021 18:56 UTC

Marcus <m.delete@this.bitsnbites.eu> wrote:
> On 2021-10-31 15:27, antispam@math.uni.wroc.pl wrote:
> > Marcus <m.delete@this.bitsnbites.eu> wrote:
> >> Hello group!
> >>
> >> I just did some analysis of some code that included integer
> >> multiplications, and concluded that having a fused multiply-and-add
> >> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> >> a multiplication operation more often than not comes together with a
> >> following addition operation.
> >>
> >> Furthermore the instruction eliminates the (potential) data dependency
> >> latency between the MUL instruction and the ADD instruction.
> >>
> >> So, I went ahead and added a simple MADD instruction to my ISA, on the
> >> form:
> >>
> >> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
> >>
> >> This was a very simple addition to my hardware implementation, and in
> >> my FPGA design it didn't really add any extra delay (the addition is
> >> performed in the pipeline stage directly after the multiplication,
> >> concurrently with some other result fixup operations). Thus the cost of
> >> the ADD instruction could be completely eliminated (both in code size
> >> and in instruction cycle count).
> >>
> >> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> >> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> >> MADD instruction (the addend is set to zero).
> >>
> >> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> >> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> >
> > x86 has it in its vector extensions.
>
> Which instruction is that? I know about FMA3 for floating-point, but how
> about integer?

Well, I was thinking about floating-point. However, there is the old
PMADDWD instruction. It is limited to taking 4 packed 16-bit numbers
per operand and producing 2 dot products, a0*b0 + a1*b1 and a2*b2 + a3*b3.
A 64-bit version of this (128-bit dot products of pairs of 64-bit
numbers) would be nice...
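
A scalar C model of the operation just described may help. It mirrors the 64-bit (MMX) form with four signed 16-bit lanes per operand; the function name is made up, and real code would use the instruction or a compiler intrinsic instead.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMADDWD on 64-bit operands: four signed 16-bit
 * lanes in, two signed 32-bit pairwise dot products out. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    assert(a && b && out);
    for (int i = 0; i < 2; i++) {
        /* Each 16x16 product fits in 32 bits. The pair is summed in
         * unsigned arithmetic so the one overflowing input (all four
         * lanes equal to -32768) wraps instead of being undefined. */
        uint32_t lo = (uint32_t)((int32_t)a[2 * i]     * b[2 * i]);
        uint32_t hi = (uint32_t)((int32_t)a[2 * i + 1] * b[2 * i + 1]);
        out[i] = (int32_t)(lo + hi);
    }
}
```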

> However for the scalar integer ISA they don't have it, which is what I
> was thinking about (things like memory address calculations for matrix
> indices etc use integer MUL + ADD). See:
>
> x86: https://godbolt.org/z/dTEej5rP4 (imull + leal)
>
> ...but ARM does, for instance:
>
> ARMv7: https://godbolt.org/z/K3Ecsavq1 (mla) <- Correction: ARMv7 has it
> ARMv8: https://godbolt.org/z/EPczq6Wcc (madd)
>
> >> It seems to me that it's a very useful instruction, and that it has
> >> a relatively small hardware cost.
> >
> > It is very useful instruction, in particular when correctly
> > rounded. However, it is not clear for me that cost is small.
> > Namely, consider
> >
> > pr1 = x*y
> > pr2 = madd(x, y, -pr1)
> >
> > where madd means hardware madd. When correctly rounded
> > pr2 will deliver low order bits of product. Which means
> > that your multiplier has to produce all bits of product.
> > IIUC "normal" FP multiplier can skip most of low order
> > bits, just setting few flags to be able to round correctly.
>
> Are you talking about integer or floating-point?

I was thinking about floating point. I would like to have an integer
version, but to be really useful it would also have to produce the high
bits of the result, which means 4 input registers and 2 output registers.
32-bit ARM does this; I am not sure if they kept it for 64-bit
mode.
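
The "4 inputs, 2 outputs" shape is easiest to appreciate in a multi-precision inner loop. The sketch below models such an operation on 32-bit words in portable C; 32-bit ARM's UMAAL is, to my knowledge, an instruction of this form, but the helper names here are invented and the code is only an illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hi:lo = a * b + c + d on unsigned 32-bit words. The result always
 * fits in 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void mul_add_add_32(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                           uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    uint64_t r = (uint64_t)a * b + c + d;
    *hi = (uint32_t)(r >> 32);
    *lo = (uint32_t)r;
}

/* r[0..n-1] += x[0..n-1] * y; returns the carry-out word.
 * One multiply-add-add per word, with no separate carry handling. */
static uint32_t addmul_1(uint32_t *r, const uint32_t *x, size_t n, uint32_t y)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++)
        mul_add_add_32(x[i], y, r[i], carry, &carry, &r[i]);
    return carry;
}
```

Because the double-width result can absorb both the previous carry and the existing digit without overflowing, the loop body needs no add-with-carry chain, which is what makes the high half of the result so valuable here.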

More generally, most CPU designers seem to skimp on integer arithmetic.
The effect is that for my needs I may be forced to use the FPU as a
"poor man's" integer unit: it delivers substandard results (53 bits
instead of 64 or 128) and eats memory bandwidth (as I need
64-bit containers even though only, say, 25 bits are significant),
but it has higher throughput than the real integer unit...

--
Waldek Hebisch

Re: MADD instruction (integer multiply and add)

<slo9bj$56d$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21804&group=comp.arch#21804

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Mon, 1 Nov 2021 09:43:30 +0100
Organization: A noiseless patient Spider
Lines: 110
Message-ID: <slo9bj$56d$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 1 Nov 2021 08:43:31 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="033b5ce2219ae813ff32162858b292db";
logging-data="5325"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18foZg3wTHTJLe3uGH3FmOS444L/nzvqVM="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:vN2Ffg67LSjpnzJQcv1CIboGX6U=
In-Reply-To: <762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
Content-Language: en-US
 by: Marcus - Mon, 1 Nov 2021 08:43 UTC

On 2021-10-31 17:35, MitchAlsup wrote:
> On Sunday, October 31, 2021 at 8:10:04 AM UTC-5, Marcus wrote:
>> Hello group!
>>
>> I just did some analysis of some code that included integer
>> multiplications, and concluded that having a fused multiply-and-add
>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>> a multiplication operation more often than not comes together with a
>> following addition operation.
> <
> My 66000 has dabbled with having an integer MADD instruction.
>>
>> Furthermore the instruction eliminates the (potential) data dependency
>> latency between the MUL instruction and the ADD instruction.
> <
> For many integer multiplies, the end result is an index into an array/matrix
> where one has access to a "free" ADD in the memory reference addressing
> mode.
>>
>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>> form:
>>
>> MADD R1, R2, R3 ; R1 <- R1 + R2 * R3
>>
>> This was a very simple addition to my hardware implementation, and in
>> my FPGA design it didn't really add any extra delay (the addition is
>> performed in the pipeline stage directly after the multiplication,
>> concurrently with some other result fixup operations). Thus the cost of
>> the ADD instruction could be completely eliminated (both in code size
>> and in instruction cycle count).
> <
> Done "properly" the multiply and the MADD both produce 2×n results
> (at least optionally).
> <
> But the cost of the ADD should be 2-gates of delay (where a 2-input
> carry select adder is 11 gates of delay for 64-bit results,) because
> the ADD portion can be done before conversion from carry-save to
> binary form.
>>
>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>> MADD instruction (the addend is set to zero).
>>
>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
> <
> The "statistics" never quite warranted it; it was always on the edge.
> The other reason is that the 3-operand encoding field is nearly full.

In my ISA I use destructive 3-operand (a = a + b x c), which isn't quite
as flexible as a full 4-operand version (a = b + c x d). OTOH it seems
to cover the most common cases without requiring any additional moves
(and I think that a pre-move is better than a post-add anyway).
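
To make the operand trade-off concrete, here is a minimal C model of the two forms. The function names `madd3` and `madd4_via_premove` are made up for illustration, and wrapping 32-bit arithmetic is assumed; this is only a sketch of the semantics, not of the hardware.

```c
#include <stdint.h>

/* Destructive 3-operand form: acc <- acc + b * c (wraps modulo 2^32). */
static inline uint32_t madd3(uint32_t acc, uint32_t b, uint32_t c)
{
    return acc + b * c;
}

/* 4-operand form a = b + c * d, emulated with a pre-move. */
static inline uint32_t madd4_via_premove(uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t a = b;          /* the extra MOV */
    return madd3(a, c, d);   /* MADD a, c, d  */
}
```

When the addend is dead after the operation (the usual accumulate case), the move disappears and only the MADD remains.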

A quick test on a few code bases with my MADD-enabled GCC (keep in mind
that my GCC machine description is far from perfect):

Quake:
# of MUL: 146
# of MADD: 162 (52%)

Doom:
# of MUL: 395
# of MADD: 152 (28%)

Sqlite:
# of MUL: 152
# of MADD: 77 (33%)

That's not a huge number of multiplications, but OTOH the share of
multiplications that can utilize MADD is roughly 30% or better (and when
I look at the cases where MUL is still being used, some of them should be
possible to transform to MADD too, even though GCC fails to do so).

The code snippet that originally got me thinking about the issue was
a fixed point (16.16) bilinear sampler routine:

https://godbolt.org/z/xGaszj7qY

After adding MADD to my ISA I get the following code (GCC):

sample_bilinear:
madd r3, r4, r2 ; Calculate row addresses
add r3, r1, r3
add r2, r3, r2
ldub r8, [r3, #0] ; Load four adjacent values
ldub r7, [r2, #0]
ldub r4, [r3, #1]
ldub r1, [r2, #1]
lsl r3, r8, #16
sub r1, r1, r7
sub r4, r4, r8
lsl r2, r7, #16
madd r3, r4, r5 ; Lerp horizontally, row 1
madd r2, r1, r5 ; Lerp horizontally, row 2
lsl r1, r3, #7
sub r2, r2, r3
lsr r2, r2, #9
add r1, r1, #0x400000 ; Round
madd r1, r2, r6 ; Lerp vertically
ebfu r1, r1, #<23:8>
ret

In this routine four instructions could be dropped (17%).

BTW, bilinear sampling is one of the things that is very hard to do
efficiently in SIMD (at least w/o gather load), so scalar
implementations are quite common.
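
For readers who want to see where the multiply-adds come from, here is a rough C sketch of a 16.16 bilinear sampler. It is not the code behind the godbolt link: the signature, the name `sample_bilinear_sketch` and the single-channel 8-bit image layout are assumptions, and the final lerp is widened to 64 bits instead of using the shift trick seen in the assembly above.

```c
#include <stdint.h>

/* Hypothetical signature: 8-bit single-channel image, integer texel
   coordinates (x, y) and 16-bit fractions fx, fy in [0, 65535]. */
static uint8_t sample_bilinear_sketch(const uint8_t *img, int32_t stride,
                                      int32_t x, int32_t y,
                                      int32_t fx, int32_t fy)
{
    /* Row addresses: x + y * stride is the first multiply-add. */
    const uint8_t *row0 = img + (x + y * stride);
    const uint8_t *row1 = row0 + stride;

    int32_t p00 = row0[0], p10 = row0[1];
    int32_t p01 = row1[0], p11 = row1[1];

    /* Horizontal lerps in 8.16 fixed point: p * 2^16 + (q - p) * fx. */
    int32_t top = p00 * 65536 + (p10 - p00) * fx;
    int32_t bot = p01 * 65536 + (p11 - p01) * fx;

    /* Vertical lerp in 8.32 with a rounding constant, widened to 64 bits
       so that (bot - top) * fy cannot overflow. */
    int64_t acc = (int64_t)top * 65536 + INT64_C(0x80000000)
                + (int64_t)(bot - top) * fy;

    return (uint8_t)(acc >> 32);   /* acc is never negative here */
}
```

Each of the four `base + a * b` expressions is a candidate for one MADD, which matches the four instructions saved in the listing.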

/Marcus

Re: MADD instruction (integer multiply and add)

<slobli$krq$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21805&group=comp.arch#21805
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Mon, 1 Nov 2021 10:22:58 +0100
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <slobli$krq$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 1 Nov 2021 09:22:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="033b5ce2219ae813ff32162858b292db";
logging-data="21370"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+48eKuqDhaYkvBB9sSgZeafFZbforH4Ow="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:zLqGP/VXHz2mVomDmPqcP70sv74=
In-Reply-To: <slmnvr$tl0$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Mon, 1 Nov 2021 09:22 UTC

On 2021-10-31 19:40, BGB wrote:
> On 10/31/2021 8:10 AM, Marcus wrote:
>> Hello group!
>>
>> I just did some analysis of some code that included integer
>> multiplications, and concluded that having a fused multiply-and-add
>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>> a multiplication operation more often than not comes together with a
>> following addition operation.
>>
>> Furthermore the instruction eliminates the (potential) data dependency
>> latency between the MUL instruction and the ADD instruction.
>>
>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>> form:
>>
>>    MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>>
>> This was a very simple addition to my hardware implementation, and in
>> my FPGA design it didn't really add any extra delay (the addition is
>> performed in the pipeline stage directly after the multiplication,
>> concurrently with some other result fixup operations). Thus the cost of
>> the ADD instruction could be completely eliminated (both in code size
>> and in instruction cycle count).
>>
>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>> MADD instruction (the addend is set to zero).
>>
>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>
>> It seems to me that it's a very useful instruction, and that it has
>> a relatively small hardware cost.
>>
>
> Goes looking:
> Yeah, it appears that this is a pretty common pattern.
>
> Though, MUL isn't a super high priority op (in a few programs, several
> orders of magnitude less common than the load/store ops; even vs ADD by
> itself).
>
> The debate, I guess, is whether it would save enough to be worthwhile.
>
>
> Going by some of my clock-cycle usage stats, it would likely save around
> 0.01% of the total clock-cycles for programs like Doom and similar.
>
> It does result in an increase in LUT cost and timing getting a little
> worse. I ended up shoving it into the adders which deal with high-order
> products and sign-extension. Causes resource cost to increase by around 2%.
>
>
> But, there are cases where it could be useful.
>   Went and defined a DMAC extension, adding:
>     MACS.L   Rm, Ro, Rn  //Rn=SExt(Rn+(Rm*Ro))
>     MACU.L   Rm, Ro, Rn  //Rn=ZExt(Rn+(Rm*Ro))
>     DMACS.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>     DMACU.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>
> Debate being mostly whether it is worth a 2% increase in LUT budget
> (~1kLUT).
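
As a reading aid for the four quoted mnemonics, here is one possible C model. It assumes 64-bit registers, that the `.L` forms work on the low 32 bits, and that "widening" means a 32x32->64 product added to a 64-bit accumulator. These semantics are inferred from the comments in the quote, not taken from any BJX2 specification, and the function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits without relying on
   implementation-defined conversions. */
static int64_t sext32(uint32_t v)
{
    return (v & 0x80000000u) ? (int64_t)v - INT64_C(0x100000000) : (int64_t)v;
}

/* MACS.L (assumed): 32-bit wrapping MAC, result sign-extended. */
static uint64_t macs_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    uint32_t r = (uint32_t)rn + (uint32_t)rm * (uint32_t)ro;
    return (uint64_t)sext32(r);
}

/* MACU.L (assumed): 32-bit wrapping MAC, result zero-extended. */
static uint64_t macu_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    uint32_t r = (uint32_t)rn + (uint32_t)rm * (uint32_t)ro;
    return (uint64_t)r;
}

/* DMACS.L (assumed): signed 32x32->64 product added to a 64-bit Rn. */
static uint64_t dmacs_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    int64_t prod = sext32((uint32_t)rm) * sext32((uint32_t)ro);
    return rn + (uint64_t)prod;
}

/* DMACU.L (assumed): unsigned 32x32->64 product added to a 64-bit Rn. */
static uint64_t dmacu_l(uint64_t rn, uint64_t rm, uint64_t ro)
{
    return rn + (uint64_t)(uint32_t)rm * (uint64_t)(uint32_t)ro;
}
```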

Yes, I saw a ~1% LUT increase & ~4% register increase in my CPU core
(the registers are probably for the pipelined passing of the addend to
the final multiplier stage). Though, granted, I have not been very
restrictive with the LUT budget. I'm still under 50% logic utilization
for my complete SoC computer design. Lack of BRAM is a bigger issue for
me.

The single MADD instruction was just a PoC. I'm going to investigate if
MSUB makes sense too, and then I think I'm going to rearrange the
opcodes so that I can use an immediate value for one of the
multiplicands (and the divisor for division), which could make for a
nice improvement in some situations.

/Marcus

Re: MADD instruction (integer multiply and add)

<slpba7$p5p$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=21814&group=comp.arch#21814
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Mon, 1 Nov 2021 18:23:03 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <slpba7$p5p$1@newsreader4.netcologne.de>
References: <slm4ja$e0b$1@dont-email.me>
<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
<slo9bj$56d$1@dont-email.me>
Injection-Date: Mon, 1 Nov 2021 18:23:03 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:a34:0:7285:c2ff:fe6c:992d";
logging-data="25785"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 1 Nov 2021 18:23 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:

> A quick test on a few code bases with my MADD-enabled GCC (keep in mind
> that my GCC machine description is far from perfect):
>
> Quake:
> # of MUL: 146
> # of MADD: 162 (52%)

Here are a few numbers for POWER binaries, which has the madd*
instructions:

[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmadd" | wc -l
1066
[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmul" | wc -l
6350
[tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | wc -l
400530

[tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
0
[tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
318
[tkoenig@gcc135 bin]$ objdump --disassemble troff | wc -l
106736

So, even in a very matrix-oriented code like libgfortran, the number
of integer multiply and adds is rather low - 0.25%.
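
For anyone who wants to redo the arithmetic on the counts above, a small helper is enough. The name `percent_of` is made up for this sketch. Note that the `wc -l` totals count every line of objdump output, including labels and blank lines, so they slightly overstate the instruction count.

```c
/* Percentage of `part` in `whole`; returns 0 for an empty whole. */
static double percent_of(long part, long whole)
{
    return whole > 0 ? 100.0 * (double)part / (double)whole : 0.0;
}
```

With the libgfortran numbers, `percent_of(1066, 400530)` comes out at roughly 0.27% of all disassembly lines, and `percent_of(1066, 1066 + 6350)` at roughly 14% of the multiply-type instructions.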

Re: MADD instruction (integer multiply and add)

<slpki7$gps$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21816&group=comp.arch#21816
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Mon, 1 Nov 2021 16:00:45 -0500
Organization: A noiseless patient Spider
Lines: 182
Message-ID: <slpki7$gps$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
<slo9bj$56d$1@dont-email.me> <slpba7$p5p$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 1 Nov 2021 21:00:55 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="39e69ebddc83cae1d253ba7434b8243c";
logging-data="17212"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ty9FOvzMhwuo7yYCbh0so"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:X/p0PCGChrkbQGfo4b13UQJ+Myo=
In-Reply-To: <slpba7$p5p$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Mon, 1 Nov 2021 21:00 UTC

On 11/1/2021 1:23 PM, Thomas Koenig wrote:
> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>
>> A quick test on a few code bases with my MADD-enabled GCC (keep in mind
>> that my GCC machine description is far from perfect):
>>
>> Quake:
>> # of MUL: 146
>> # of MADD: 162 (52%)
>
> Here are a few numbers for POWER binaries, which has the madd*
> instructions:
>
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmadd" | wc -l
> 1066
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmul" | wc -l
> 6350
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | wc -l
> 400530
>
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
> 0
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
> 318
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | wc -l
> 106736
>
> So, even in a very matrix-oriented code like libgfortran, the number
> of integer multiply and adds is rather low - 0.25%.
>

This is closer to what I was seeing:
Around 1/4 to 1/8 of MUL can be conjoined with an ADD.

Combined with relatively few of these cases being in the hot path, the
average case savings are pretty small (in terms of clock cycles).

It is a promising operation for things like IDCT in JPEG or MPEG, but if
one has a machine where much of the bottleneck for JPEG decoding is in
the Huffman stage, the savings from a slightly faster IDCT are small.

At least for BJX2, there would likely be a bigger advantage for a
JPEG-style format using Rice coding or similar rather than Huffman.

Granted, throwing Rice coding at problems seems to be anathema to many
compression people.

As I have added it experimentally in BGBCC (this was a hassle, ended up
needing to add support for trinary operators to multiple compiler
stages), it is likely that it would be more efficient to spread it out
using "+=" operators.

So, rather than writing, say:
i0=a00*c00 + a01*c01 + a02*c02 + a03*c03;
i1=a10*c10 + a11*c11 + a12*c12 + a13*c13;
i2=a20*c20 + a21*c21 + a22*c22 + a23*c23;
i3=a30*c30 + a31*c31 + a32*c32 + a33*c33;
It would work out more efficient to write it more like:
i0 =a00*c00; i1 =a10*c10; i2 =a20*c20; i3 =a30*c30;
i0+=a01*c01; i1+=a11*c11; i2+=a21*c21; i3+=a31*c31;
i0+=a02*c02; i1+=a12*c12; i2+=a22*c22; i3+=a32*c32;
i0+=a03*c03; i1+=a13*c13; i2+=a23*c23; i3+=a33*c33;

Mostly because the former would stumble on interlocks, and thus cost ~ 3
cycles per MAC, whereas the latter would average ~ 1 cycle per MAC, at
least, subject to the whims of the register allocator and similar.
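
The same interleaving can be written with arrays. This is only a sketch with a made-up function name; whether a given compiler keeps this order is up to its scheduler and register allocator.

```c
/* Four independent 4-term dot products, accumulated column by column so
   that consecutive MACs write different accumulators. */
static void dot4x4_interleaved(const int a[4][4], const int c[4][4],
                               int out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i][0] * c[i][0];          /* plain MUL starts each sum */
    for (int k = 1; k < 4; k++)
        for (int i = 0; i < 4; i++)
            out[i] += a[i][k] * c[i][k];     /* one MAC per statement */
}
```

Each accumulator is touched only once every four statements, which gives a multi-cycle MAC time to finish before its result is needed again.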

Available encoding options would be:
MAC with reg coeff;
MAC with Imm5u coeff (32-bit encoding);
MAC with Imm29s coeff (64-bit encoding).

With this operation (if supported) being limited to Lane 1 (not subject
to WEX).

It is possible I could add MACS.W variants (as a special case of
MULS.W), depending mostly on whether I want to spend the encoding space on
it, which could be combined with WEX (allowed in Lanes 1 or 2 with a
1-cycle latency). These could give a performance advantage if the input
values are 'char' or 'short' (*1).

....

*1:
const short c00=15, c01=7, ...;
short a00, a01, a02, ...;
int i0, i1, i2, i3;
....
i0=a00*c00 + a01*c01 + a02*c02 + a03*c03;
....

The tradeoff here is mostly that relatively little "normal" code goes
out of its way to use 'short' for multiplier inputs rather than 'int'
(given 'int' is pretty much the default type for pretty much everything,
even when technically overkill). So, it is likely that a "MACS.W
instruction" would end up hardly ever being used (well, much outside of
things like audio/video codecs or similar).

Part of the reason this case can work OK is that this operation would map
fairly directly onto a DSP48 element.
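
A minimal C sketch of the kind of code that would benefit: 16-bit inputs multiplied and accumulated into a 32-bit sum. The function name is hypothetical, and the caller is assumed to keep the running sum within the range of `int32_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int16 vectors; each iteration is one 16x16->32
   multiply-accumulate. */
static int32_t mac16_dot(const int16_t *a, const int16_t *c, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)c[i];
    return acc;
}
```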

I still have encoding space available, but given I have already used up
"most of it" (*2), this is starting to be cause for concern (though, the
rate of expansion has slowed considerably).

*2: I wanted to leave the F3 and F9 blocks as "user blocks", but if I
run out of space in the F0 block, that would mean I would start needing
to add new instructions mostly in the Op64 space, which would kinda suck
(any instructions added here can't currently be WEXified).

Quick survey:
I have enough remaining 3R space (in 32-bit land) for around 68 3R
encodings (56 generic, 12 Ld/St).

This would be out of a starting space of 167 ( 192 - 24 - 1; with 24 for
2R spaces, and 1 for 1R/0R space ).

Actually, the actual starting space would have been significantly
larger, apart from ISA designs cutting off a big chunk of it for WEX.
The Predicate and XGPR spaces, meanwhile, were reclaimed from 16-bit
Land (16-bit land is basically already full; and XGPR ate the encoding
space that was previously occupied by the Op24 encodings; with XGPR and
Op24 assumed to be mutually exclusive).

There is enough remaining encoding space in Op64 land for around 1
million 3R encodings. Op48 space also theoretically exists, but is
incompatible with the WEXifier (which is built around the assumption of
being able to work with machine-code in terms of 32-bit units).

Where, currently:
F0nm_0eoZ Ld/St (Full)
F0nm_1eoZ ALU, 2R space (Mostly full; only 2R remains)
F0nm_2eoZ SIMD and FPU ops (Full)
F0nm_3eoZ 0R/1R/2R spaces, ALUX ops (Mostly only 1R and 2R remains)
F0nm_4eoZ Ld/St (MOV.X, MOV.C)
F0nm_5eoZ SIMD, FPU, DMULx, More ALU, ... (Full)
F0nm_6eoZ MAC, UTX/UAB ops, FPU (Mostly full)
F0nm_7eoZ Mostly Unused, more 2R space
F0nm_8eoZ Ld/St (XMOV; Mostly Full)
F0nm_9eoZ Unused (Reclaimed from original FPU)
F0nm_AeoZ Unused
F0nm_BeoZ Unused
F0dd_Cddd BRA Disp20
F0dd_Dddd BSR Disp20
F0dd_Eddd BT Disp20
F0dd_Fddd BF Disp20

The F1 block is now full.
The F2 block only really has Imm10 slots left.

The F3 block is Reserved for User extensions; These are intended for
implementation specific instructions (such as if one were to use BJX2 as
part of a GPU or Neural Net processor, these instructions would probably
go here); Similar for F9.

Can't really claim the remaining space in F8 as more 3R space, as the
block layout of the F8 block is incompatible with the F0 block layout.

Note that the 2R spaces are limited to 2R encodings, which are currently
things like:
OP Rm, Rn
OP Imm6, Rn

There is still plenty of 2R space left.

It could be possible to reclaim some encoding space, but with the usual
drawback that doing so tends to break binary compatibility with existing
code (preferably avoided).

....

Re: MADD instruction (integer multiply and add)

<slpoq0$ejh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21819&group=comp.arch#21819
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Mon, 1 Nov 2021 17:13:10 -0500
Organization: A noiseless patient Spider
Lines: 152
Message-ID: <slpoq0$ejh$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 1 Nov 2021 22:13:20 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="39e69ebddc83cae1d253ba7434b8243c";
logging-data="14961"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18s7xpVKY2L7hldWDktmyVD"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:+mXmYzN/x/GkD/1zNk0MVZHD8ws=
In-Reply-To: <slobli$krq$1@dont-email.me>
Content-Language: en-US
 by: BGB - Mon, 1 Nov 2021 22:13 UTC

On 11/1/2021 4:22 AM, Marcus wrote:
> On 2021-10-31 19:40, BGB wrote:
>> On 10/31/2021 8:10 AM, Marcus wrote:
>>> Hello group!
>>>
>>> I just did some analysis of some code that included integer
>>> multiplications, and concluded that having a fused multiply-and-add
>>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>>> a multiplication operation more often than not comes together with a
>>> following addition operation.
>>>
>>> Furthermore the instruction eliminates the (potential) data dependency
>>> latency between the MUL instruction and the ADD instruction.
>>>
>>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>>> form:
>>>
>>>    MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>>>
>>> This was a very simple addition to my hardware implementation, and in
>>> my FPGA design it didn't really add any extra delay (the addition is
>>> performed in the pipeline stage directly after the multiplication,
>>> concurrently with some other result fixup operations). Thus the cost of
>>> the ADD instruction could be completely eliminated (both in code size
>>> and in instruction cycle count).
>>>
>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
>>> MADD instruction (the addend is set to zero).
>>>
>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>
>>> It seems to me that it's a very useful instruction, and that it has
>>> a relatively small hardware cost.
>>>
>>
>> Goes looking:
>> Yeah, it appears that this is a pretty common pattern.
>>
>> Though, MUL isn't a super high priority op (in a few programs, several
>> orders of magnitude less common than the load/store ops; even vs ADD
>> by itself).
>>
>> The debate, I guess, is whether it would save enough to be worthwhile.
>>
>>
>> Going by some of my clock-cycle usage stats, it would likely save
>> around 0.01% of the total clock-cycles for programs like Doom and
>> similar.
>>
>> It does result in an increase in LUT cost and timing getting a little
>> worse. I ended up shoving it into the adders which deal with
>> high-order products and sign-extension. Causes resource cost to
>> increase by around 2%.
>>
>>
>> But, there are cases where it could be useful.
>>    Went and defined a DMAC extension, adding:
>>      MACS.L   Rm, Ro, Rn  //Rn=SExt(Rn+(Rm*Ro))
>>      MACU.L   Rm, Ro, Rn  //Rn=ZExt(Rn+(Rm*Ro))
>>      DMACS.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>      DMACU.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>
>> Debate being mostly whether it is worth a 2% increase in LUT budget
>> (~1kLUT).
>
> Yes, I saw a ~1% LUT increase & ~4% register increase in my CPU core
> (the registers are probably for the pipelined passing of the addend to
> the final multiplier stage). Though, granted, I have not been very
> restrictive with the LUT budget. I'm still under 50% logic utilization
> for my complete SoC computer design. Lack of BRAM is a bigger issue for
> me.
>

I had to give up on doing dual-core, so I am currently stuck with a
single CPU core on the XC7A100 with the current feature-set (though, a
whole lot of LUTs are being spent on having 64B cache lines in the L2
cache).

I had considered SMT as a possible cheaper alternative, but this is
still pretty far from being fully implemented.

Currently using ~ 70% (44000 / 63400 LUTs), and 118 / 135 BRAM, ...

As noted, this is with the current major configuration:
WEX-3W (3 Execute Lanes)
ALUX (128-bit ADD/SUB/Shift)
FP-SIMD (128-bit packed Floating-Point SIMD)
Also 64-bit packed Half
Also 64-bit 3x FP21 (S.E5.M15)
Some FP8 Ops ( 32-bit, 4x FP8; E4.M4 / S.E4.M3 )
Block-Texture and Block-Audio Decoders (UTX/UAB)
RISC-V Mode (Secondary Decoders)
XMOV (96-bit virtual address space)
...
64B L2 cache lines
256K L2 Cache
16K + 32K L1 Caches
256x 4-way TLB (2-way when using 96-bit VAs)
...

Without the MAC instructions, it is ~ 68% of the LUT budget.
Using 16B L2 cache lines saves ~ 8%, but somewhat reduces memory
bandwidth (and is basically required for the DRAM-backed framebuffer to
work effectively).

Currently not enabled (but exist):
FPUX / Long-Double (Hardware support for a 96-bit / S.E15.M80 format);
BLINT / BLERP (Hardware Bilinear Interpolator).

These features are both fairly expensive and also fairly niche.
The absence of FPUX can be partly offset by ALUX being able to implement
FPU emulation via 128-bit integer operations. Well, except FMUL is still
kinda slow when one only has a 32x32->64 multiplier (eg: int128
multiply is kinda slow).

The software emulation does the S.E15.M112 format (Quad Precision).

In practice, neither __float128 nor "long double" are commonly used, so
it isn't too much of an issue at present.

While the hardware bilinear interpolator did work, it didn't offer that
big of an advantage over doing it in software (to offset its cost).

Though, partly for TKRA-GL, did end up with some of the bilinear paths
using a slightly cheaper but approximate interpolator (namely, a
cost-reduced 3-point interpolation algo partly "inspired" by the
Nintendo 64).

> The single MADD instruction was just a PoC. I'm going to investigate if
> MSUB makes sense too, and then I think I'm going to rearrange the
> opcodes so that I can use an immediate value for one of the
> multiplicands (and the divisor for division), which could make for a
> nice improvement in some situations.
>

In my case, I ended up fiddling with it more, and adding some Imm5u
encodings as well (since Imm5 fits into 3R space).

There was an Imm9 space, but none of this remains (it was mostly used
for ALU ops).

> /Marcus

Re: MADD instruction (integer multiply and add)

<slqqnd$53c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21821&group=comp.arch#21821
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 2 Nov 2021 08:52:13 +0100
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <slqqnd$53c$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me>
<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
<slo9bj$56d$1@dont-email.me> <slpba7$p5p$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 2 Nov 2021 07:52:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fbf5ac804f37548c4fae79bd6f976f55";
logging-data="5228"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/A54/fL16M+0UvaWZi9v/xS7LuEPI24Ec="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:Hdlcp0FmZpOhf8xGL0pJN4J9CB8=
In-Reply-To: <slpba7$p5p$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Marcus - Tue, 2 Nov 2021 07:52 UTC

On 2021-11-01 19:23, Thomas Koenig wrote:
> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>
>> A quick test on a few code bases with my MADD-enabled GCC (keep in mind
>> that my GCC machine description is far from perfect):
>>
>> Quake:
>> # of MUL: 146
>> # of MADD: 162 (52%)
>
> Here are a few numbers for POWER binaries, which has the madd*
> instructions:
>
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmadd" | wc -l
> 1066
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmul" | wc -l
> 6350
> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | wc -l
> 400530
>
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
> 0
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
> 318
> [tkoenig@gcc135 bin]$ objdump --disassemble troff | wc -l
> 106736
>
> So, even in a very matrix-oriented code like libgfortran, the number
> of integer multiply and adds is rather low - 0.25%.
>

Thanks for the stats!

Unfortunately I do not yet have access to a lot of larger code bases
(e.g. my GCC toolchain is missing OS/posix functions), so I have to use
a few "simple" ones when I try out new features (e.g. Quake).

I wonder if your findings would translate to MRISC32 (apples to
apples?). Also, was troff compiled for POWER9?

Goes checking GNU troff...

Ok, I was able to build a bunch of MRISC32 flavored .o files from the
GNU troff source (including stuff going into libgroff.a), but program
linking failed because I do not have support for shared linking (or
something to that effect, I did not bother to try to get it to work at
this point).

I added all the .o files to a static library, libstuff.a, and got:

$ mrisc32-elf-objdump -d libstuff.a | grep '\smadd\s' | wc -l
35
$ mrisc32-elf-objdump -d libstuff.a | grep '\smul\s' | wc -l
42
$ mrisc32-elf-objdump -d libstuff.a | wc -l
32903

...so ~45% of the multiplications transform to MADD for MRISC32. I guess
there could be differences in the POWER ISA that allow it to do other
transformations?

/Marcus

Re: MADD instruction (integer multiply and add)

<slr2cl$47v$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21823&group=comp.arch#21823
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 2 Nov 2021 11:03:01 +0100
Organization: A noiseless patient Spider
Lines: 184
Message-ID: <slr2cl$47v$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 2 Nov 2021 10:03:01 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fbf5ac804f37548c4fae79bd6f976f55";
logging-data="4351"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19VrKjcWX5qhZMixmqlwrFF4jowwpeAU24="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:dTccSZ1jRZw7POw6VA+3P/G5YtA=
In-Reply-To: <slpoq0$ejh$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Tue, 2 Nov 2021 10:03 UTC

On 2021-11-01 23:13, BGB wrote:
> On 11/1/2021 4:22 AM, Marcus wrote:
>> On 2021-10-31 19:40, BGB wrote:
>>> On 10/31/2021 8:10 AM, Marcus wrote:
>>>> Hello group!
>>>>
>>>> I just did some analysis of some code that included integer
>>>> multiplications, and concluded that having a fused multiply-and-add
>>>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>>>> a multiplication operation more often than not comes together with a
>>>> following addition operation.
>>>>
>>>> Furthermore the instruction eliminates the (potential) data dependency
>>>> latency between the MUL instruction and the ADD instruction.
>>>>
>>>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>>>> form:
>>>>
>>>>    MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>>>>
>>>> This was a very simple addition to my hardware implementation, and in
>>>> my FPGA design it didn't really add any extra delay (the addition is
>>>> performed in the pipeline stage directly after the multiplication,
>>>> concurrently with some other result fixup operations). Thus the cost of
>>>> the ADD instruction could be completely eliminated (both in code size
>>>> and in instruction cycle count).
>>>>
>>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for
>>>> the
>>>> MADD instruction (the addend is set to zero).
>>>>
>>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking
>>>> this
>>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>>
>>>> It seems to me that it's a very useful instruction, and that it has
>>>> a relatively small hardware cost.
>>>>
>>>
>>> Goes looking:
>>> Yeah, it appears that this is a pretty common pattern.
>>>
>>> Though, MUL isn't a super high priority op (in a few programs,
>>> several orders of magnitude less common than the load/store ops; even
>>> vs ADD by itself).
>>>
>>> The debate, I guess, is whether it would save enough to be worthwhile.
>>>
>>>
>>> Going by some of my clock-cycle usage stats, it would likely save
>>> around 0.01% of the total clock-cycles for programs like Doom and
>>> similar.
>>>
>>> It does result in an increase in LUT cost and timing getting a little
>>> worse. I ended up shoving it into the adders which deal with
>>> high-order products and sign-extension. Causes resource cost to
>>> increase by around 2%.
>>>
>>>
>>> But, there are cases where it could be useful.
>>>    Went and defined a DMAC extension, adding:
>>>      MACS.L   Rm, Ro, Rn  //Rn=SExt(Rn+(Rm*Ro))
>>>      MACU.L   Rm, Ro, Rn  //Rn=ZExt(Rn+(Rm*Ro))
>>>      DMACS.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>>      DMACU.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>>
>>> Debate being mostly whether it is worth a 2% increase in LUT budget
>>> (~1kLUT).
>>
>> Yes, I saw a ~1% LUT increase & ~4% register increase in my CPU core
>> (the registers are probably for the pipelined passing of the addend to
>> the final multiplier stage). Though, granted, I have not been very
>> restrictive with the LUT budget. I'm still under 50% logic utilization
>> for my complete SoC computer design. Lack of BRAM is a bigger issue for
>> me.
>>
>
> I had to give up on doing dual-core, so I am currently stuck with a
> single CPU core on the XC7A100 with the current feature-set (though, a
> whole lot of LUTs are being spent on having 64B cache lines in the L2
> cache).
>
> I had considered SMT as a possible cheaper alternative, but this is
> still pretty far from being fully implemented.
>
>
> Currently using ~ 70% (44000 / 63400 LUTs), and 118 / 135 BRAM, ...
>
> As noted, this is with the current major configuration:
>   WEX-3W (3 Execute Lanes)
>   ALUX (128-bit ADD/SUB/Shift)
>   FP-SIMD (128-bit packed Floating-Point SIMD)
>     Also 64-bit packed Half
>     Also 64-bit 3x FP21 (S.E5.M15)
>     Some FP8 Ops ( 32-bit, 4x FP8; E4.M4 / S.E4.M3 )
>   Block-Texture and Block-Audio Decoders (UTX/UAB)
>   RISC-V Mode (Secondary Decoders)
>   XMOV (96-bit virtual address space)
>   ...
>   64B L2 cache lines
>   256K L2 Cache
>   16K + 32K L1 Caches
>   256x 4-way TLB (2-way when using 96-bit VAs)
>   ...
>
>
> Without the MAC instructions, it is ~ 68% of the LUT budget.
> Using 16B L2 cache lines saves ~ 8%, but somewhat reduces memory
> bandwidth (and is basically required for the DRAM-backed framebuffer to
> work effectively).
>
>
> Currently not enabled (but exist):
>   FPUX / Long-Double (Hardware support for a 96-bit / S.E15.M80 format);
>   BLINT / BLERP (Hardware Bilinear Interpolator).
>
>
> These features are both fairly expensive and also fairly niche.
> The absence of FPUX can be partly offset by ALUX being able to implement
> FPU emulation via 128-bit integer operations. Well, except FMUL is still
> kinda slow when one only has a 32x32->64 multiplier (eg: int128
> multiply is kinda slow).
>
> The software emulation does the S.E15.M112 format (Quad Precision).
>
> In practice, neither __float128 nor "long double" are commonly used, so
> it isn't too much of an issue at present.
>
>
> While the hardware bilinear interpolator did work, it didn't offer that
> big of an advantage over doing it in software (to offset its cost).
>
> Though, partly for TKRA-GL, did end up with some of the bilinear paths
> using a slightly cheaper but approximate interpolator (namely, a
> cost-reduced 3-point interpolation algo partly "inspired" by the
> Nintendo 64).
>
>
>> The single MADD instruction was just a PoC. I'm going to investigate if
>> MSUB makes sense too, and then I think I'm going to rearrange the
>> opcodes so that I can use an immediate value for one of the
>> multiplicands (and the divisor for division), which could make for a
>> nice improvement in some situations.
>>
>
> In my case, I ended up fiddling with it more, and adding some Imm5u
> encodings as well (since Imm5 fits into 3R space).
>
> There was an Imm9 space, but none of this remains (it was mostly used
> for ALU ops).
>
>

I ended up ditching MSUB - it was almost never used. I now have the
following variants, which cover most of my needs (things are also
simplified by the fact that my ISA is 32-bit, so I don't have to worry
about widening combinations etc):

  MADD R1,R2,R3    ; R1 += R2 * R3
  MADD R1,R2,#imm  ; R1 += R2 * imm

...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
that can either be placed in bits <13:0> (sign extended to 32 bits) or
in bits <31:18> (LSB extended to bits <17:0>). I've found that this
particular immediate format is quite useful for most arithmetic and
bitwise logic instructions.

As usual, this also works with vector registers and packed data types,
so e.g:

  MADD.H V1,V2,R3    ; V1[k] += V2[k] * R3  (packed half-word)
  MADD   V1,V2,#imm  ; V1[k] += V2[k] * imm

I still have the option to add another MADD version that changes the
order of the operands in order to avoid MOV:s. For instance, it could
work like this:

  MADD213 R1,R2,R3  ; R1 = R2 + R1 * R3

...but I think that at that point we're getting into diminishing
returns.
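
For reference, the two operand orders can be modelled in a few lines of Python. This is only a sketch of the semantics described above, assuming 32-bit wrap-around arithmetic; the function names and the Horner example are illustrative and are not taken from the MRISC32 tools.

```python
MASK32 = 0xFFFFFFFF  # assumption: results wrap to 32 bits


def madd(r1, r2, r3):
    """MADD R1,R2,R3 ; R1 += R2 * R3 (returns the new R1)."""
    return (r1 + r2 * r3) & MASK32


def madd213(r1, r2, r3):
    """Hypothetical MADD213 R1,R2,R3 ; R1 = R2 + R1 * R3 (returns the new R1)."""
    return (r2 + r1 * r3) & MASK32


def horner(coeffs, x):
    """Evaluate a polynomial, highest-order coefficient first.

    Each step is acc = c + acc * x, which maps onto the MADD213 operand
    order with the accumulator as both source and destination.
    """
    acc = 0
    for c in coeffs:
        acc = madd213(acc, c, x)
    return acc
```

With the plain MADD order the same loop would have to accumulate into the register holding the coefficient and then move the result back, which is the kind of MOV the reordered variant would avoid.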

/Marcus

Re: MADD instruction (integer multiply and add)

<slr8uu$uq7$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=21825&group=comp.arch#21825
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 2 Nov 2021 11:55:10 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <slr8uu$uq7$1@newsreader4.netcologne.de>
References: <slm4ja$e0b$1@dont-email.me>
<762616e8-e291-4a00-bdb6-a8611111f1b0n@googlegroups.com>
<slo9bj$56d$1@dont-email.me> <slpba7$p5p$1@newsreader4.netcologne.de>
<slqqnd$53c$1@dont-email.me>
Injection-Date: Tue, 2 Nov 2021 11:55:10 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:a34:0:7285:c2ff:fe6c:992d";
logging-data="31559"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 2 Nov 2021 11:55 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:
> On 2021-11-01 19:23, Thomas Koenig wrote:
>> Marcus <m.delete@this.bitsnbites.eu> schrieb:
>>
>>> A quick test on a few code bases with my MADD-enabled GCC (keep in mind
>>> that my GCC machine description is far from perfect):
>>>
>>> Quake:
>>> # of MUL: 146
>>> # of MADD: 162 (52%)
>>
>> Here are a few numbers for POWER binaries, which has the madd*
>> instructions:
>>
>> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmadd" | wc -l
>> 1066
>> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | grep -P "\tmul" | wc -l
>> 6350
>> [tkoenig@gcc135 lib64]$ objdump --disassemble libgfortran.a | wc -l
>> 400530
>>
>> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
>> 0
>> [tkoenig@gcc135 bin]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
>> 318
>> [tkoenig@gcc135 bin]$ objdump --disassemble troff | wc -l
>> 106736
>>
>> So, even in a very matrix-oriented code like libgfortran, the number
>> of integer multiply and adds is rather low - 0.25%.
>>
>
> Thanks for the stats!
>
> Unfortunately I do not yet have access to a lot of larger code bases
> (e.g. my GCC toolchain is missing OS/posix functions), so I have to use
> a few "simple" ones when I try out new features (e.g. Quake).
>
> I wonder if your findings would translate to MRISC32 (apples to
> apples?). Also, was troff compiled for POWER9?

No, it wasn't.

Here is some data for libgsl, built with recent trunk with -mcpu=power9:

[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | grep -P "\tmadd" | wc -l
1045
[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | grep -P "\tmul" | wc -l
3416
[tkoenig@gcc135 .libs]$ objdump --disassemble libgsl.a | wc -l
633746

> Goes checking GNU troff...

Recent gcc trunk and the most recent groff gives me, with -mcpu=power9:

[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | grep -P "\tmadd" | wc -l
79
[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | grep -P "\tmul" | wc -l
265
[tkoenig@gcc135 groff-1.22.4]$ objdump --disassemble troff | wc -l
118783
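
To put the -mcpu=power9 counts above on a common scale, here is a small Python sketch that turns them into percentages. The numbers are copied from the objdump output; the disassembly line count also includes labels and blank lines, so the share of all lines is only a rough figure. The helper name is made up for this example.

```python
# (madd count, mul count, total disassembly lines), copied from the output above
counts = {
    "libgsl.a": (1045, 3416, 633746),
    "troff (groff-1.22.4)": (79, 265, 118783),
}


def shares(madd, mul, total):
    """Return (madd as % of all lines, madd as % of madd + mul)."""
    return 100.0 * madd / total, 100.0 * madd / (madd + mul)


for name, (madd, mul, total) in counts.items():
    of_all, of_mults = shares(madd, mul, total)
    print(f"{name}: {of_all:.3f}% of lines, {of_mults:.1f}% of multiplies")
```

The second figure assumes that the "\tmul" pattern does not also match the madd* mnemonics, so the two counts are disjoint.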

[...]

> ...so ~45% of the multiplications transform to MADD for MRISC32. I guess
> there could be differences in the POWER ISA that allows it to do other
> transformations?

Possibly, I have not looked into this in that detail.

What would be interesting for your ISA is the effect that this
has on bignum multiplication, especially with vectorization
(important for crypto, of course).

Re: MADD instruction (integer multiply and add)

<c4bgJ.15658$IB7.11873@fx02.iad>

https://www.novabbs.com/devel/article-flat.php?id=21830&group=comp.arch#21830
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx02.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me> <slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me> <slr2cl$47v$1@dont-email.me>
In-Reply-To: <slr2cl$47v$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 19
Message-ID: <c4bgJ.15658$IB7.11873@fx02.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 02 Nov 2021 13:30:48 UTC
Date: Tue, 02 Nov 2021 09:30:43 -0400
X-Received-Bytes: 1552
 by: EricP - Tue, 2 Nov 2021 13:30 UTC

Marcus wrote:
>
> I ended up ditching MSUB - it was almost never used. I now have the
> following variants, which covers most of my needs (things are also
> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> about widening combinations etc):
>
> MADD R1,R2,R3 ; R1 += R2 * R3
> MADD R1,R2,#imm ; R1 += R2 * imm
>
> ...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> particular immediate format is quite useful for most arithmetic and
> bitwise logic instructions.

What do you do about immediate bits [17:14]?

Re: MADD instruction (integer multiply and add)

<slrilu$u51$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21832&group=comp.arch#21832
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 2 Nov 2021 15:41:02 +0100
Organization: A noiseless patient Spider
Lines: 75
Message-ID: <slrilu$u51$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me>
<slr2cl$47v$1@dont-email.me> <c4bgJ.15658$IB7.11873@fx02.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 2 Nov 2021 14:41:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fbf5ac804f37548c4fae79bd6f976f55";
logging-data="30881"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/PtjdAe5iVaCxAfR5ZTEeVIq+bukWkJkU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:hOHGP8ZmMy+EPHVDhP2rBnZkcfo=
In-Reply-To: <c4bgJ.15658$IB7.11873@fx02.iad>
Content-Language: en-US
 by: Marcus - Tue, 2 Nov 2021 14:41 UTC

On 2021-11-02 14:30, EricP wrote:
> Marcus wrote:
>>
>> I ended up ditching MSUB - it was almost never used. I now have the
>> following variants, which covers most of my needs (things are also
>> simplified by the fact that my ISA is 32-bit, so I don't have to worry
>> about widening combinations etc):
>>
>>   MADD R1,R2,R3    ; R1 += R2 * R3
>>   MADD R1,R2,#imm  ; R1 += R2 * imm
>>
>> ...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
>> that can either be placed in bits <13:0> (sign extended to 32 bits) or
>> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
>> particular immediate format is quite useful for most arithmetic and
>> bitwise logic instructions.
>
> What do you do about immediate bits [17:14]?

That's another instruction (or possibly even two more instructions).
I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
and actually expands it to 2 instructions (LDI + OR) if needed. It's a
classic RISC solution, but at least slightly better than the naive
solution thanks to the extra Hi/Lo flag...

I have an I21HL 21-bit encoding that's used by the LDI instruction, which
places the 20 bits in <19:0> or <31:12>.

So, with the smaller I15HL format you can express:

* Signed integers in the range [-8192, 8191].
* Bit masks & values such as:
  - 0xFF000000
  - 0xFFFFF000
  - 0x1FFFFFFF
  - 0x7F800000 (exponent mask for IEEE 754 binary-32)
  - 0x41A80000 (21.0F in IEEE 754 binary-32)
  - etc.

...and with the I21HL format you can express even wider ranges, e.g.
integers in the range [-524288, 524287], or 32-bit floating-point
values with up to 11 fractional bits (e.g. -1020.5F).
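
A quick way to check which constants fit is to model the two forms in Python. This is a sketch based only on the description in this post, not on the manual or the assembler: the low form is the field sign-extended to 32 bits, the high form puts the field in the top bits and replicates its LSB into the remaining low bits. The function name is hypothetical, and I21HL is taken here as a 20-bit field (going by the [-524288, 524287] range).

```python
MASK32 = 0xFFFFFFFF


def hl_form(value, field_bits):
    """Return "lo", "hi" or None for a 32-bit constant.

    field_bits is 14 for I15HL and (assumed) 20 for I21HL.
    """
    v = value & MASK32
    # Low form: the field is sign-extended to 32 bits.
    signed = v - (1 << 32) if v & 0x80000000 else v
    if -(1 << (field_bits - 1)) <= signed < (1 << (field_bits - 1)):
        return "lo"
    # High form: the field sits in the top bits, and its least significant
    # bit is copied into every bit below it.
    shift = 32 - field_bits
    fill = (1 << shift) - 1
    lsb = (v >> shift) & 1
    if (v & fill) == (fill if lsb else 0):
        return "hi"
    return None
```

For instance, hl_form(0x1FFFFFFF, 14) takes the high form because the field's LSB is 1 and the low 18 bits are all ones, while hl_form(500000, 14) returns None and needs the wider format.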

It's all in the ISA manual [1] (section 1.5: Immediate value encoding).

I have found that with these encodings I get away with a single 32-bit
instruction word for most common operations, e.g.

  MIN R2, R1, #5000

  XOR R2, R1, #0x07F00000

...and two 32-bit instruction words suffice in many of the situations
where one is not enough, e.g:

  LDI R2, #500000
  ADD R2, R1, R2

And finally there's the three-instruction fall-back for rare cases:

  LDI R2, #0x12344000
  OR  R2, R2, #0x00001678  ; R2 = 0x12345678
  ADD R2, R1, R2

In the latter case you can just write "LDI R2, #0x12345678" and the
assembler will expand it for you (and I'd expect advanced CPU front-
ends to be able to fuse the instruction pair into a single instruction).
Also, I've found that most of these large constants will be preloaded
into registers outside of loops (thanks to LICM and a sufficient number
of architectural registers).
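
The LDI expansion can be sketched like this in Python. It is only one split that reproduces the 0x12345678 example above (upper 19 bits via LDI, lower 13 bits via OR); the actual assembler may choose differently. The function name is hypothetical and the I21HL field is assumed to be 20 bits wide.

```python
MASK32 = 0xFFFFFFFF


def expand_ldi(value):
    """Return a list of (mnemonic, immediate) pairs that load a 32-bit value."""
    v = value & MASK32
    signed = v - (1 << 32) if v & 0x80000000 else v
    # Low form of the (assumed) 20-bit field: sign-extended constant.
    if -(1 << 19) <= signed < (1 << 19):
        return [("LDI", v)]
    # High form: bits <31:12> given, bit 12 replicated into bits <11:0>.
    if (v & 0xFFF) == (0xFFF if (v >> 12) & 1 else 0):
        return [("LDI", v)]
    # Fallback: the LDI part has bits <12:0> clear, so it always fits the
    # high form; the OR part is in [0, 8191], which fits the low form of
    # the 14-bit field and cannot disturb the upper bits.
    return [("LDI", v & 0xFFFFE000), ("OR", v & 0x1FFF)]
```

Here expand_ldi(0x12345678) gives the LDI #0x12344000 / OR #0x00001678 pair from the example, and expand_ldi(500000) stays a single LDI.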

/Marcus

[1] https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf

Re: MADD instruction (integer multiply and add)

<9ec8cabf-0410-47a0-ac4d-8f0105a33e43n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21834&group=comp.arch#21834
X-Received: by 2002:ac8:46c8:: with SMTP id h8mr4150053qto.208.1635883302731;
Tue, 02 Nov 2021 13:01:42 -0700 (PDT)
X-Received: by 2002:a05:6808:128d:: with SMTP id a13mr7030856oiw.51.1635883302489;
Tue, 02 Nov 2021 13:01:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 2 Nov 2021 13:01:42 -0700 (PDT)
In-Reply-To: <slrilu$u51$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:50b2:2c45:9648:67c4;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:50b2:2c45:9648:67c4
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me> <slr2cl$47v$1@dont-email.me>
<c4bgJ.15658$IB7.11873@fx02.iad> <slrilu$u51$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9ec8cabf-0410-47a0-ac4d-8f0105a33e43n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 02 Nov 2021 20:01:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 38
 by: MitchAlsup - Tue, 2 Nov 2021 20:01 UTC

On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
> On 2021-11-02 14:30, EricP wrote:
> > Marcus wrote:
> >>
> >> I ended up ditching MSUB - it was almost never used. I now have the
> >> following variants, which covers most of my needs (things are also
> >> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> >> about widening combinations etc):
> >>
> >> MADD R1,R2,R3 ; R1 += R2 * R3
> >> MADD R1,R2,#imm ; R1 += R2 * imm
> >>
> >> ...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> >> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> >> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> >> particular immediate format is quite useful for most arithmetic and
> >> bitwise logic instructions.
> >
> > What do you do about immediate bits [17:14]?
<
> That's another instruction (or possibly even two more instructions).
> I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
> and actually expands it to 2 instructions (LDI + OR) if needed. It's a
> classic RISC solution, but at least slightly better than the naive
> solution thanks to the extra Hi/Lo flag...
<
But not as nice as full length immediates available in the instruction set.
<
MUL R7,#123456789101112,-R9
<
So, while this takes 3 words of instruction space (64-bit immediate) it
takes only 1 pipelined cycle of execution (4 cycles of latency) and can
be performed even in lower end implementations as 1 instruction

>
> /Marcus
>
>
> [1] https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf

Re: MADD instruction (integer multiply and add)

<sls6lf$3k9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21835&group=comp.arch#21835
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Tue, 2 Nov 2021 15:21:57 -0500
Organization: A noiseless patient Spider
Lines: 270
Message-ID: <sls6lf$3k9$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me>
<slr2cl$47v$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 2 Nov 2021 20:22:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="3f8c70927fa185e93928b5ee53279b55";
logging-data="3721"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18z1diRKc1N3MyF5oDx9Gnk"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:G+v/wfdm8SEcYT+TI1/SGBJ08I0=
In-Reply-To: <slr2cl$47v$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 2 Nov 2021 20:21 UTC

On 11/2/2021 5:03 AM, Marcus wrote:
> On 2021-11-01 23:13, BGB wrote:
>> On 11/1/2021 4:22 AM, Marcus wrote:
>>> On 2021-10-31 19:40, BGB wrote:
>>>> On 10/31/2021 8:10 AM, Marcus wrote:
>>>>> Hello group!
>>>>>
>>>>> I just did some analysis of some code that included integer
>>>>> multiplications, and concluded that having a fused multiply-and-add
>>>>> (a.k.a MAC) instruction would eliminate many ADD instructions, since
>>>>> a multiplication operation more often than not comes together with a
>>>>> following addition operation.
>>>>>
>>>>> Furthermore the instruction eliminates the (potential) data dependency
>>>>> latency between the MUL instruction and the ADD instruction.
>>>>>
>>>>> So, I went ahead and added a simple MADD instruction to my ISA, on the
>>>>> form:
>>>>>
>>>>>    MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>>>>>
>>>>> This was a very simple addition to my hardware implementation, and in
>>>>> my FPGA design it didn't really add any extra delay (the addition is
>>>>> performed in the pipeline stage directly after the multiplication,
>>>>> concurrently with some other result fixup operations). Thus the
>>>>> cost of
>>>>> the ADD instruction could be completely eliminated (both in code size
>>>>> and in instruction cycle count).
>>>>>
>>>>> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
>>>>> instruction. In fact, in ARMv8 the MUL instruction is an /alias/
>>>>> for the
>>>>> MADD instruction (the addend is set to zero).
>>>>>
>>>>> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking
>>>>> this
>>>>> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?
>>>>>
>>>>> It seems to me that it's a very useful instruction, and that it has
>>>>> a relatively small hardware cost.
>>>>>
>>>>
>>>> Goes looking:
>>>> Yeah, it appears that this is a pretty common pattern.
>>>>
>>>> Though, MUL isn't a super high priority op (in a few programs,
>>>> several orders of magnitude less common than the load/store ops;
>>>> even vs ADD by itself).
>>>>
>>>> The debate, I guess, is whether it would save enough to be worthwhile.
>>>>
>>>>
>>>> Going by some of my clock-cycle usage stats, it would likely save
>>>> around 0.01% of the total clock-cycles for programs like Doom and
>>>> similar.
>>>>
>>>> It does result in an increase in LUT cost and timing getting a
>>>> little worse. I ended up shoving it into the adders which deal with
>>>> high-order products and sign-extension. Causes resource cost to
>>>> increase by around 2%.
>>>>
>>>>
>>>> But, there are cases where it could be useful.
>>>>    Went and defined a DMAC extension, adding:
>>>>      MACS.L   Rm, Ro, Rn  //Rn=SExt(Rn+(Rm*Ro))
>>>>      MACU.L   Rm, Ro, Rn  //Rn=ZExt(Rn+(Rm*Ro))
>>>>      DMACS.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>>>      DMACU.L  Rm, Ro, Rn  //Rn=Rn+(Rm*Ro)  //Widening MAC
>>>>
>>>> Debate being mostly whether it is worth a 2% increase in LUT budget
>>>> (~1kLUT).
>>>
>>> Yes, I saw a ~1% LUT increase & ~4% register increase in my CPU core
>>> (the registers are probably for the pipelined passing of the addend to
>>> the final multiplier stage). Though, granted, I have not been very
>>> restrictive with the LUT budget. I'm still under 50% logic utilization
>>> for my complete SoC computer design. Lack of BRAM is a bigger issue for
>>> me.
>>>
>>
>> I had to give up on doing dual-core, so I am currently stuck with a
>> single CPU core on the XC7A100 with the current feature-set (though, a
>> whole lot of LUTs are being spent on having 64B cache lines in the L2
>> cache).
>>
>> I had considered SMT as a possible cheaper alternative, but this is
>> still pretty far from being fully implemented.
>>
>>
>> Currently using ~ 70% (44000 / 63400 LUTs), and 118 / 135 BRAM, ...
>>
>> As noted, this is with the current major configuration:
>>    WEX-3W (3 Execute Lanes)
>>    ALUX (128-bit ADD/SUB/Shift)
>>    FP-SIMD (128-bit packed Floating-Point SIMD)
>>      Also 64-bit packed Half
>>      Also 64-bit 3x FP21 (S.E5.M15)
>>      Some FP8 Ops ( 32-bit, 4x FP8; E4.M4 / S.E4.M3 )
>>    Block-Texture and Block-Audio Decoders (UTX/UAB)
>>    RISC-V Mode (Secondary Decoders)
>>    XMOV (96-bit virtual address space)
>>    ...
>>    64B L2 cache lines
>>    256K L2 Cache
>>    16K + 32K L1 Caches
>>    256x 4-way TLB (2-way when using 96-bit VAs)
>>    ...
>>
>>
>> Without the MAC instructions, it is ~ 68% of the LUT budget.
>> Using 16B L2 cache lines saves ~ 8%, but somewhat reduces memory
>> bandwidth (and is basically required for the DRAM-backed framebuffer
>> to work effectively).
>>
>>
>> Currently not enabled (but exist):
>>    FPUX / Long-Double (Hardware support for a 96-bit / S.E15.M80 format);
>>    BLINT / BLERP (Hardware Bilinear Interpolator).
>>
>>
>> These features are both fairly expensive and also fairly niche.
>> The absence of FPUX can be partly offset by ALUX being able to
>> implement FPU emulation via 128-bit integer operations. Well, except
>> FMUL is still kinda slow when one only has a 32x32->64 multiplier
>> (eg: int128 multiply is kinda slow).
>>
>> The software emulation does the S.E15.M112 format (Quad Precision).
>>
>> In practice, neither __float128 nor "long double" are commonly used,
>> so it isn't too much of an issue at present.
>>
>>
>> While the hardware bilinear interpolator did work, it didn't offer
>> that big of an advantage over doing it in software (to offset its cost).
>>
>> Though, partly for TKRA-GL, did end up with some of the bilinear paths
>> using a slightly cheaper but approximate interpolator (namely, a
>> cost-reduced 3-point interpolation algo partly "inspired" by the
>> Nintendo 64).
>>
>>
>>> The single MADD instruction was just a PoC. I'm going to investigate if
>>> MSUB makes sense too, and then I think I'm going to rearrange the
>>> opcodes so that I can use an immediate value for one of the
>>> multiplicands (and the divisor for division), which could make for a
>>> nice improvement in some situations.
>>>
>>
>> In my case, I ended up fiddling with it more, and adding some Imm5u
>> encodings as well (since Imm5 fits into 3R space).
>>
>> There was an Imm9 space, but none of this remains (it was mostly used
>> for ALU ops).
>>
>>
>
> I ended up ditching MSUB - it was almost never used. I now have the
> following variants, which covers most of my needs (things are also
> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> about widening combinations etc):
>
>   MADD R1,R2,R3    ; R1 += R2 * R3
>   MADD R1,R2,#imm  ; R1 += R2 * imm
>
> ...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> particular immediate format is quite useful for most arithmetic and
> bitwise logic instructions.
>
> As usual, this also works with vector registers and packed data types,
> so e.g:
>
>   MADD.H V1,V2,R3    ; V1[k] += V2[k] * R3  (packed half-word)
>   MADD   V1,V2,#imm  ; V1[k] += V2[k] * imm
>

From my listing:
* F0nm_6go0 ? MACS.L Rm, Ro, Rn
* F0nm_6Go0 ? MACS.L Rm, Imm5u, Rn
* F0nm_6go1 ? MACU.L Rm, Ro, Rn
* F0nm_6Go1 ? MACU.L Rm, Imm5u, Rn
* F0nm_6go2 ? DMACS.L Rm, Ro, Rn
* F0nm_6Go2 ? DMACS.L Rm, Imm5u, Rn
* F0nm_6go3 ? DMACU.L Rm, Ro, Rn
* F0nm_6Go3 ? DMACU.L Rm, Imm5u, Rn
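
As a rough model of what the four ops compute, going only by the comments in the earlier listing (Rn=SExt(...), Rn=ZExt(...), "Widening MAC"), here is a Python sketch. It assumes 64-bit registers, 32-bit multiplier inputs, and that the S/U suffix selects signed versus unsigned treatment; none of that is confirmed beyond those comments, and the function names are just the mnemonics in lower case.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sext32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def macs_l(rn, rm, ro):
    """Rn = SExt(Rn + Rm*Ro): 32-bit result, sign-extended to 64 bits."""
    return sext32(rn + rm * ro) & MASK64


def macu_l(rn, rm, ro):
    """Rn = ZExt(Rn + Rm*Ro): 32-bit result, zero-extended to 64 bits."""
    return (rn + rm * ro) & MASK32


def dmacs_l(rn, rm, ro):
    """Widening signed MAC: 64-bit Rn += signed 32x32->64 product."""
    return (rn + sext32(rm) * sext32(ro)) & MASK64


def dmacu_l(rn, rm, ro):
    """Widening unsigned MAC: 64-bit Rn += unsigned 32x32->64 product."""
    return (rn + (rm & MASK32) * (ro & MASK32)) & MASK64
```

The narrow forms only differ in how the upper half of Rn is filled, since the low 32 bits of the sum are the same for signed and unsigned inputs.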


Re: MADD instruction (integer multiply and add)

<e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21836&group=comp.arch#21836
X-Received: by 2002:a7b:cb12:: with SMTP id u18mr10594060wmj.109.1635895164518;
Tue, 02 Nov 2021 16:19:24 -0700 (PDT)
X-Received: by 2002:a05:6808:10d2:: with SMTP id s18mr7757580ois.30.1635895163852;
Tue, 02 Nov 2021 16:19:23 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 2 Nov 2021 16:19:23 -0700 (PDT)
In-Reply-To: <sls6lf$3k9$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:7413:9cd9:f77d:15cf;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:7413:9cd9:f77d:15cf
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me> <slr2cl$47v$1@dont-email.me>
<sls6lf$3k9$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Tue, 02 Nov 2021 23:19:24 +0000
Content-Type: text/plain; charset="UTF-8"
 by: robf...@gmail.com - Tue, 2 Nov 2021 23:19 UTC

On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
> On 2021-11-02 14:30, EricP wrote:
> > Marcus wrote:
> >>
> >> I ended up ditching MSUB - it was almost never used. I now have the
> >> following variants, which covers most of my needs (things are also
> >> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> >> about widening combinations etc):
> >>
> >> MADD R1,R2,R3 ; R1 += R2 * R3
> >> MADD R1,R2,#imm ; R1 += R2 * imm
> >>
> >> ...where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> >> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> >> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> >> particular immediate format is quite useful for most arithmetic and
> >> bitwise logic instructions.
> >
> > What do you do about immediate bits [17:14]?
<
> That's another instruction (or possibly even two more instructions).
> I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
> and actually expands it to 2 instructions (LDI + OR) if needed. It's a
> classic RISC solution, but at least slightly better than the naive
> solution thanks to the extra Hi/Lo flag...
<

>But not as nice as full length immediates available in the instruction set.
<
>MUL R7,#123456789101112,-R9
<
>So, while this takes 3 words of instruction space (64-bit immediate) it
>takes only 1 pipelined cycle of execution (4 cycles of latency) and can
>be performed even in lower end implementations as 1 instruction

>
> /Marcus
>
>
I like to use prefix instructions to extend constants because they can be applied to
extending displacements as well as immediates. The prefix and following instruction
can be fetched and processed as a single unit if desired. For Thor there are short
11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
parcel sizes. Rather than have whole bunches of instructions that process immediates
piece-meal for a 64-bit machine, prefixes are used. Shifting constants around does
not work for all instructions, like divide, but prefixes do. Using prefixes also does not
require the use of intermediate registers. So, full length immediates and displacements
are supported for most instructions in Thor. Gotta have those full-length constants!

Thor has a fast multiply-and-add instruction, single cycle hopefully, in addition to regular
multiply-and-add. As the popular instruction format is 3R, I use all three source register
slots for many instructions. Logic operations have 3R forms Ra & Rb & Rc and the
compiler will support them.

Re: MADD instruction (integer multiply and add)

<25d17d5d-e7fd-4c7f-8df6-f4149c103ba4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21837&group=comp.arch#21837
X-Received: by 2002:a7b:c049:: with SMTP id u9mr11517403wmc.102.1635901110612;
Tue, 02 Nov 2021 17:58:30 -0700 (PDT)
X-Received: by 2002:a05:6808:1923:: with SMTP id bf35mr1504165oib.7.1635901109652;
Tue, 02 Nov 2021 17:58:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 2 Nov 2021 17:58:29 -0700 (PDT)
In-Reply-To: <e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:50b2:2c45:9648:67c4;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:50b2:2c45:9648:67c4
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me> <slr2cl$47v$1@dont-email.me>
<sls6lf$3k9$1@dont-email.me> <e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25d17d5d-e7fd-4c7f-8df6-f4149c103ba4n@googlegroups.com>
Subject: Re: MADD instruction (integer multiply and add)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 03 Nov 2021 00:58:30 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 3 Nov 2021 00:58 UTC

On Tuesday, November 2, 2021 at 6:19:26 PM UTC-5, robf...@gmail.com wrote:
> On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
> > On 2021-11-02 14:30, EricP wrote:
> > > Marcus wrote:
> > >>
> > >> I ended up ditching MSUB - it was almost never used. I now have the
> > >> following variants, which covers most of my needs (things are also
> > >> simplified by the fact that my ISA is 32-bit, so I don't have to worry
> > >> about widening combinations etc):
> > >>
> > >> MADD R1,R2,R3 ; R1 += R2 * R3
> > >> MADD R1,R2,#imm ; R1 += R2 * imm
> > >>
> > >> ....where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
> > >> that can either be placed in bits <13:0> (sign extended to 32 bits) or
> > >> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
> > >> particular immediate format is quite useful for most arithmetic and
> > >> bitwise logic instructions.
> > >
> > > What do you do about immediate bits [17:14]?
> <
> > That's another instruction (or possibly even two more instructions).
> > I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
> > and actually expands it to 2 instructions (LDI + OR) if needed. It's a
> > classic RISC solution, but at least slightly better than the naive
> > solution thanks to the extra Hi/Lo flag...
> <
>
> >But not as nice as full length immediates available in the instruction set.
> <
> >MUL R7,#123456789101112,-R9
> <
> >So, while this takes 3 words of instruction space (64-bit immediate) it
> >takes only 1 pipelined cycle of execution (4 cycles of latency) and can
> >be performed even in lower end implementations as 1 instruction
>
> >
> > /Marcus
> >
> >
> I like to use prefix instructions to extend constants because they can be applied to
> extending displacements as well as immediates. The prefix and following instruction
> can be fetched and processed as a single unit if desired.
<
In My 66000, there is a group of OpCodes that contain 16-bit immediates 0b1xxxxx
and another set of groups that can attach 32-bit or 64-bit immediates 0b00xxxx. The
second set of groups also contains all the reg-reg OpCodes and the shifts with 12-bit
immediates.
<
I use a 4-bit encoding that also deals with all of the signs, immediates, displacements
and whether the immediate is operand[1] or operand[2]. One can argue that I have
lost entropy using such an encoding, but the pattern extends to the 1-operand and
3-operand formats without change in the 4-bit encoding.
<
> For Thor there are short
> 11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
> instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
> 23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
> parcel sizes.
<
My 66000 is getting code density comparable to x86-64 without finding any need for
16-bit parcels. When I looked at adding 16-bit parcels, it screwed too many things
up for what looked to be a minor improvement in OpCode density; essentially eating
the vast majority of expansion room in the OpCode space, which did not look worthwhile
as a long term plan.
<
> Rather than have whole bunches of instructions that process immediates
> piece-meal for a 64-bit machine, prefixes are used.
<
Rather than having a whole bunch of instructions that process immediates, I gave EVERY
instruction access to every size of immediate (within reason).
<
> Shifting constants around does
> not work for all instructions, like divide, but prefixes do. Using prefixes also does not
> require the use of intermediate registers. So, full length immediates and displacements
> are supported for most instructions in Thor. Gotta have those full- length constants!
>
> Thor has a fast multiply and add instruction, single cycle hopefully, in addition to regular
> multiply-and add.
<
It may be fully pipelined, but it is not going to be 1 cycle of latency. I haven't seen a
multiplier (integer) faster than 3 cycles and most of them are 4-cycles.
<
Each layer in the 4-2 compressor tree (that performs the actual multiplication) is
2-gates of delay, and you need at least 5-layers of these (plus wire delay, lots
of wire delay) and following the tree, you need an 11-gate delay adder; add in 3 gates
for Booth recoding, 1 gate for sign/unsign control and you have a good 26-gates of delay.
You might be able to route this and make a 2-cycle pipeline, but I have not seen it done
in an otherwise high perf processor.
<
> As the popular instruction format is 3R, I use all three source register
> slots for many instructions. Logic operations have 3R forms Ra & Rb & Rc and the
> compiler will support them.
<
But can you do :: Ra &~ Rb | Rc in a single shot ?

Re: MADD instruction (integer multiply and add)

<slteul$j1c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21839&group=comp.arch#21839

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Wed, 3 Nov 2021 02:49:31 -0500
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <slteul$j1c$1@dont-email.me>
References: <slm4ja$e0b$1@dont-email.me> <slmnvr$tl0$1@dont-email.me>
<slobli$krq$1@dont-email.me> <slpoq0$ejh$1@dont-email.me>
<slr2cl$47v$1@dont-email.me> <sls6lf$3k9$1@dont-email.me>
<e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 3 Nov 2021 07:49:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d931f29ad12e5c3486665eff45c79b3e";
logging-data="19500"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX192+8kPBKZ5jKfs/KT3c8Yr"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:xW/lIr21H4EP497wLuBhqbf6fZg=
In-Reply-To: <e8116c97-4f6b-45f3-8e25-5258554e899fn@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 3 Nov 2021 07:49 UTC

On 11/2/2021 6:19 PM, robf...@gmail.com wrote:
> On Tuesday, November 2, 2021 at 9:41:05 AM UTC-5, Marcus wrote:
>> On 2021-11-02 14:30, EricP wrote:
>>> Marcus wrote:
>>>>
>>>> I ended up ditching MSUB - it was almost never used. I now have the
>>>> following variants, which covers most of my needs (things are also
>>>> simplified by the fact that my ISA is 32-bit, so I don't have to worry
>>>> about widening combinations etc):
>>>>
>>>> MADD R1,R2,R3 ; R1 += R2 * R3
>>>> MADD R1,R2,#imm ; R1 += R2 * imm
>>>>
>>>> ....where imm uses my 15-bit "I15HL" format, which is a 14-bit immediate
>>>> that can either be placed in bits <13:0> (sign extended to 32 bits) or
>>>> in bits <31:18> (LSB extended to bits <17:0>). I've found that this
>>>> particular immediate format is quite useful for most arithmetic and
>>>> bitwise logic instructions.
>>>
>>> What do you do about immediate bits [17:14]?
> <
>> That's another instruction (or possibly even two more instructions).
>> I have a LoaD Immediate (LDI) instruction that accepts any 32-bit value,
>> and actually expands it to 2 instructions (LDI + OR) if needed. It's a
>> classic RISC solution, but at least slightly better than the naive
>> solution thanks to the extra Hi/Lo flag...
> <
>
>> But not as nice as full length immediates available in the instruction set.
> <
>> MUL R7,#123456789101112,-R9
> <
>> So, while this takes 3 words of instruction space (64-bit immediate) it
>> takes only 1 pipelined cycle of execution (4 cycles of latency) and can
>> be performed even in lower end implementations as 1 instruction
>
>>
>> /Marcus
>>
>>

( misplaced response, going to assume Google Groups or something... )

> I like to use prefix instructions to extend constants because they can be applied to
> extending displacements as well as immediates. The prefix and following instruction
> can be fetched and processed as a single unit if desired. For Thor there are short
> 11-bit immediate instructions that do not accept a prefix and longer 23-bit immediate
> instructions that can be extended to 64 bits using just one prefix. Prefixes can add 7,
> 23, 41 or 55 bits. The number of bits for the instruction can then be tailored in 16-bit
> parcel sizes. Rather than have whole bunches of instructions that process immediates
> piece-meal for a 64-bit machine, prefixes are used. Shifting constants around does
> not work for all instructions, like divide, but prefixes do. Using prefixes also does not
> require the use of intermediate registers. So, full length immediates and displacements
> are supported for most instructions in Thor. Gotta have those full- length constants!
>
> Thor has a fast multiply and add instruction, single cycle hopefully, in addition to regular
> multiply-and add. As the popular instruction format is 3R, I use all three source register
> slots for many instructions. Logic operations have 3R forms Ra & Rb & Rc and the
> compiler will support them.
>

The multiply op is 1-3 cycles, depending on interlocks (mostly as with
Load, not using the result for at least 2 cycles makes it faster).

The experimental MAC / DMAC form does not add any additional latency
(but does have a non-zero LUT cost and similar).

But, yeah, in BJX2, the encodings for Jumbo and Op64 forms are in-effect
built from the use of prefixes.

This seems to be fairly cheap and easy to sort out in a decoder, even if
the actual encoding of the constants is a bit of a bit-twiddly mess.

Decided to skip listing out all of the immediate encodings in BJX2.

Then again, bit-twiddly encodings for immediate values are hardly an
issue unique to BJX2 (look at the organization of the bits in the branch
displacement in RISC-V's JAL instruction, for comparison).
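
For comparison, a minimal C sketch of what decoding the RISC-V JAL
displacement involves. The bit positions follow the standard J-type
layout; the function name rv_jal_offset is only illustrative and not
taken from any particular toolchain.

```c
#include <stdint.h>

/* Reassemble the J-type immediate of a 32-bit RISC-V JAL instruction.
 * Instruction bit -> immediate bit:
 *   [31]    -> imm[20]
 *   [30:21] -> imm[10:1]
 *   [20]    -> imm[11]
 *   [19:12] -> imm[19:12]
 * imm[0] is implicitly zero. */
static int32_t rv_jal_offset(uint32_t insn)
{
    uint32_t imm = 0;
    imm |= ((insn >> 31) & 0x1u)   << 20;
    imm |= ((insn >> 21) & 0x3FFu) << 1;
    imm |= ((insn >> 20) & 0x1u)   << 11;
    imm |= ((insn >> 12) & 0xFFu)  << 12;
    /* Sign-extend from bit 20 without relying on
     * implementation-defined unsigned-to-signed conversion. */
    return (int32_t)(imm ^ 0x100000u) - 0x100000;
}
```

So the decoder needs four separate field moves plus a sign extension
for one displacement, which is the kind of scrambling meant above.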

The recent addition of 4R/4RI encodings for the MAC and DMAC
instructions was extended to allow encoding a few new address modes (as
an experiment).

Namely:
(Rm, Disp17s) (*1)
(Rm, Ro, Disp11)
(Rm, Ro*Sc, Disp9)

Seems able to pass timing as well, though supporting both an index and
displacement at the same time is debatable. Ended up making the
displacements unscaled in this case. Sc is 1/2/4/8 for now.

These don't use any additional encoding space in terms of 32-bit ops.

*1: Though, the simple case can also be encoded with a Jumbo prefix for
Disp33s.
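
As a rough illustration of the (Rm, Ro*Sc, Disp) form described above,
here is a hypothetical C model of the effective-address calculation. It
is a sketch under assumptions, not the BJX2 definition: the function
name, the log2 representation of the scale, and passing the
displacement already extended to 64 bits are all made up for the
example (the post does not say whether Disp11 and Disp9 are signed).
What it does take from the post is that the displacement is unscaled
and that Sc is one of 1/2/4/8.

```c
#include <stdint.h>

/* Hypothetical model: EA = Rm + Ro*Sc + Disp.
 * sc_log2 selects Sc = 1, 2, 4 or 8 (an assumed representation);
 * disp is applied unscaled, as described in the post. */
static uint64_t ea_index_disp(uint64_t rm, uint64_t ro,
                              unsigned sc_log2, int64_t disp)
{
    uint64_t scaled_index = ro << (sc_log2 & 3u);
    return rm + scaled_index + (uint64_t)disp;
}
```

With sc_log2 = 0 and ro = 0 this collapses to the plain (Rm, Disp)
case.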

A lot of these encodings are still untested though, and I still don't
really know if they are "sane".

Re: MADD instruction (integer multiply and add)

<sltom7$14sa$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=21841&group=comp.arch#21841

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Wed, 3 Nov 2021 11:35:49 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sltom7$14sa$1@gioia.aioe.org>
References: <slm4ja$e0b$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="37770"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Wed, 3 Nov 2021 10:35 UTC

Marcus wrote:
> Hello group!
>
> I just did some analysis of some code that included integer
> multiplications, and concluded that having a fused multiply-and-add
> (a.k.a MAC) instruction would eliminate many ADD instructions, since
> a multiplication operation more often than not comes together with a
> following addition operation.
>
> Furthermore the instruction eliminates the (potential) data dependency
> latency between the MUL instruction and the ADD instruction.
>
> So, I went ahead and added a simple MADD instruction to my ISA, on the
> form:
>
>   MADD R1, R2, R3  ; R1 <- R1 + R2 * R3
>
> This was a very simple addition to my hardware implementation, and in
> my FPGA design it didn't really add any extra delay (the addition is
> performed in the pipeline stage directly after the multiplication,
> concurrently with some other result fixup operations). Thus the cost of
> the ADD instruction could be completely eliminated (both in code size
> and in instruction cycle count).
>
> I also noticed that ARMv8 has a similar (but more flexible 4-operand)
> instruction. In fact, in ARMv8 the MUL instruction is an /alias/ for the
> MADD instruction (the addend is set to zero).
>
> However, many other ISA:s (apart from DSP ISA:s) seem to be lacking this
> instruction (x86, RISC-V, ARMv7, POWER?, My 66000?). How come?

On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
that is probably fast enough, and it provides the most general building
block:

mov rax,b
mul c ;; 4 or 5 cycles

add rax,a ;; 1 cycle

adc rdx,0
add rax,carry ;; 1 cycle

adc rdx,0 ;; 1 cycle

Doing it in hardware is as you note almost free, zero or one additional
cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
general CPU, even if you cheat and make both outputs implied. :-(
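
A minimal C sketch of the same building block, for reference. It
assumes a compiler with the unsigned __int128 extension (GCC and Clang
on 64-bit targets); the type and function names are illustrative only.
The result always fits in 128 bits, since
(2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.

```c
#include <stdint.h>

/* High and low 64-bit halves of a 128-bit result. */
typedef struct { uint64_t hi, lo; } u128_parts;

/* hi:lo = a + (b * c) + carry, all inputs unsigned 64-bit.
 * Cannot overflow 128 bits (see the bound in the text above). */
static u128_parts mul_add_carry(uint64_t a, uint64_t b,
                                uint64_t c, uint64_t carry)
{
    unsigned __int128 t = (unsigned __int128)b * c;
    t += a;
    t += carry;
    u128_parts r = { (uint64_t)(t >> 64), (uint64_t)t };
    return r;
}
```

This is the 4-input/2-output shape in question: four source values in,
a high and a low word out.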

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: MADD instruction (integer multiply and add)

<sltqf4$nck$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=21843&group=comp.arch#21843

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: MADD instruction (integer multiply and add)
Date: Wed, 3 Nov 2021 11:06:12 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sltqf4$nck$1@newsreader4.netcologne.de>
References: <slm4ja$e0b$1@dont-email.me> <sltom7$14sa$1@gioia.aioe.org>
Injection-Date: Wed, 3 Nov 2021 11:06:12 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-a34-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:a34:0:7285:c2ff:fe6c:992d";
logging-data="23956"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 3 Nov 2021 11:06 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

> On x86 you get the full form (hi,lo = a + (b*c) + carry) in 7-8 cycles,
> that is probably fast enough, and it provides the most general building
> block:
>
> mov rax,b
> mul c ;; 4 or 5 cycles
>
> add rax,a ;; 1 cycle
>
> adc rdx,0
> add rax,carry ;; 1 cycle
>
> adc rdx,0 ;; 1 cycle
>
> Doing it in hardware is as you note almost free, zero or one additional
> cycle over the basic 64x64->128 MUL, but as shown above, you only save 2
> or 3 cycles and it is hard to fit a 4-input/2-output instruction in a
> general CPU, even if you cheat and make both outputs implied. :-(

Digging through the POWER ISA... in 3.0B aka POWER9 you can do
(RA, RB, RC, RT1 and RT2 refer to suitably defined registers)

addic RC, RC, 0 ! RC = RC + Carry
maddhdu RT1, RA, RB, RC ! RT1 = high(RA*RB + RC)
maddld RT2, RA, RB, RC ! RT2 = low(RA*RB + RC)
