
Three-way add

<tu0m6q$1nh4a$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=31037&group=comp.arch#31037

by: Thomas Koenig - Sun, 5 Mar 2023 00:02 UTC

I've been wondering if a three-way add would be worth the bother
(and the opcode space), and I wrote a little Perl script to
go through the output of the current My 66000 compiler to find
opportunities.

The script is not 100% perfect, because it does not look at all
opportunities for adding if more than two add statements come in
sequence, and it does not detect adds that are not adjacent.
But it should be good enough for a general idea.

Results (number of positives, number of instructions,
percentage):

Perl:         1440  640291  0.225 %
gnuplot:       192  151402  0.127 %
embench-iot:   218   26405  0.826 %

By comparison, the number of fmac instructions:

Perl: 29
gnuplot: 1016
embench-iot: 29

So... not a great deal of instruction space to be saved, probably,
and I will refrain from advocating its inclusion in any ISA :-)
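
As a quick sanity check, the percentages can be recomputed from the raw counts. This is only a sketch: the dictionary layout and the helper name `percent` are made up for illustration, and the fmac share assumes the fmac counts refer to the same instruction totals as the add counts, which the post does not state explicitly.

```python
# Counts copied from the results above:
# (three-way add candidates, instructions, fmac instructions).
results = {
    "Perl": (1440, 640291, 29),
    "gnuplot": (192, 151402, 1016),
    "embench-iot": (218, 26405, 29),
}

def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, expressed in percent."""
    return 100.0 * part / total

for name, (adds, instructions, fmacs) in results.items():
    # The fmac column assumes the same instruction total (an assumption).
    print(f"{name:12s} add3 {percent(adds, instructions):.3f} %  "
          f"fmac {percent(fmacs, instructions):.3f} %")
```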

And here's the script:

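#! /usr/bin/perl -w
use strict;

my $i = 0;    # instruction lines seen
my $n = 0;    # adjacent add pairs that could fuse into a three-way add

while (<>) {
    # Instructions are indented; skip labels and directives such as ".text".
    next unless /^\s[^.]/;
    $i++;
    # First add: remember its destination register.
    next unless /^\s+add\s+(r[0-9]+),[^,]+,[^,]+/;
    my $r1 = $1;
    # Peek at the following line (it is consumed and not counted in $i).
    my $next = <>;
    last unless defined $next;
    # Second add: operands may be negated. Each operand stops at whitespace
    # so that the last one does not swallow the trailing newline.
    if ($next =~ /\s+add\s+r[0-9]+,-?([^,\s]+),-?([^,\s]+)/) {
        $n++ if $r1 eq $1 || $r1 eq $2;
    }
}

# Guard against empty input before computing the percentage.
printf("%d %d %.3f %%\n", $n, $i, $i ? $n / $i * 100 : 0);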
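
For readers who prefer Python, the same idea can be sketched as below. It is not a line-for-line port: it checks every adjacent pair of instruction lines instead of consuming the second line, so its counts can differ from the Perl script's. The sample assembly and the names `SAMPLE`, `FIRST`, `SECOND` and `count_pairs` are hypothetical; only the assumed syntax (indented instructions, directives starting with a dot, `add rd,rs1,rs2` with optional minus signs) is taken from the regular expressions in the script.

```python
import re

# Hypothetical sample in the assumed syntax: tab-indented instructions,
# directives starting with '.', labels at column 0.
SAMPLE = """\
\t.text
main:
\tadd\tr1,r2,r3
\tadd\tr4,r1,r5
\tldd\tr6,[r4]
\tadd\tr7,r8,r9
\tadd\tr10,r11,-r7
"""

# First add of a pair: capture the destination register.
FIRST = re.compile(r"^\s+add\s+(r[0-9]+),[^,\s]+,[^,\s]+")
# Second add of a pair: capture both source operands without their sign.
SECOND = re.compile(r"^\s+add\s+r[0-9]+,-?([^,\s]+),-?([^,\s]+)")

def count_pairs(text: str) -> tuple[int, int]:
    """Return (dependent adjacent add pairs, instruction lines)."""
    # Keep only instruction lines: leading whitespace, not a directive.
    lines = [line for line in text.splitlines() if re.match(r"^\s[^.]", line)]
    pairs = 0
    for current, following in zip(lines, lines[1:]):
        first = FIRST.match(current)
        second = SECOND.match(following)
        # A candidate: the first add's result feeds the second add.
        if first and second and first.group(1) in second.groups():
            pairs += 1
    return pairs, len(lines)

pairs, total = count_pairs(SAMPLE)
print(f"{pairs} {total} {100.0 * pairs / total:.3f} %")
```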

Re: Three-way add

<1b037e94-e84e-48cf-ba3f-7b65c53b6628n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31038&group=comp.arch#31038

by: MitchAlsup - Sun, 5 Mar 2023 01:05 UTC

On Saturday, March 4, 2023 at 6:02:37 PM UTC-6, Thomas Koenig wrote:
<
> I've been wondering if a three-way add would be worth the bother
> (and the opcode space), and I wrote a little Perl script to
> go through the output of the current My 66000 compiler to find
> opportunities.
<
An interesting notion::
<
My 66000 AGEN is inherently a 3-input addition, and when one considers
<
CARRY R8,{I}
ADD R9,-R17,-R19
<
This is easier to code in Verilog as a 3-input adder than it is to code as
the first 3 bits being a 3-input adder and the other 61 bits being a 2-input
adder. All that you are doing is converting 61 3-input full adders into
2-input half-adders. Both 3-input full adders and 2-input adders are
2 gates of logic and 1 gate of delay.
<
Done as a 3-input adder, one can get what Thomas was looking for
using the above notation. The first multiprecision addition (or subtraction)
gets 63-bits from carry, and all subsequent carry outputs contain (0,1,2)-bits.
So, one can "get" 3-input integer adds using CARRY as a leading step.
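
One common way to picture a 3-input adder is a row of full adders used as 3:2 compressors (carry-save), followed by an ordinary 2-input add. The sketch below models that on 64-bit values. It is a generic textbook illustration, not a description of the My 66000 datapath or of the CARRY instruction, and the names `carry_save` and `add3` are made up.

```python
MASK64 = (1 << 64) - 1

def carry_save(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three 64-bit operands to a sum word and a carry word."""
    partial_sum = a ^ b ^ c                        # per-bit sum of a full adder
    carries = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, moved up one column
    return partial_sum & MASK64, carries & MASK64

def add3(a: int, b: int, c: int) -> int:
    """Three-input add modulo 2**64: one compression layer, then one normal add."""
    s, k = carry_save(a & MASK64, b & MASK64, c & MASK64)
    return (s + k) & MASK64

print(hex(add3(1, 2, 3)))
print(hex(add3(MASK64, MASK64, 5)))
```

The compression layer has no carry propagation across bit positions, which is why it adds only a small, fixed delay in front of the carry-propagate adder.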
<
..................
>
> The script is not 100% perfect, because it does not look at all
> opportunities for adding if more than two add statements come in
> sequence, and it does not detect adds that are not adjacent.
> But it should be good enough for a general idea.
<
Thanks for looking.
>
> Results (numbers of positives, numbers of instructions,
> percentage):
>
> Perl: 1440 640291 0.225 %
> gnuplot: 192 151402 0.127 %
> embench-iot: 218 26405 0.826 %
>
> By comparision, the number of fmac instructions:
>
> Perl: 29
> gnuplot: 1016
> embench-iot: 29
>
> So... not a great deal of instruction space to be saved, probably,
> and I will refrain from advocating its inclusion in any ISA :-)
>
> And here's the script:
>
> #! /usr/bin/perl -w
> $i = 0;
> $n = 0;
>
> while (<>) {
> next unless /^\s[^.]/;
> $i++;
> next unless /^\s+add\s+(r[0-9]+),([^,]+),([^,]+)/;
> $r1 = $1;
> $next = <>;
> if ($next =~ /\s+add\s+(r[0-9]+),-?([^,]+),-?([^,]+)/ )
> {
> if ($r1 eq $2 || $r1 eq $3) {
> $n++;
> }
> }
> }
>
> printf ("%d %d %.3f %%\n", $n, $i, $n/$i*100.);

Re: Three-way add

<2023Mar5.092646@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=31040&group=comp.arch#31040

by: Anton Ertl - Sun, 5 Mar 2023 08:26 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>I've been wondering if a three-way add would be worth the bother
>(and the opcode space)

My memory told me that SuperSPARC supports three-way add, but looking
at
<https://vincyjoseph.files.wordpress.com/2013/04/supersparcwhitepaper.pdf>,
it actually supports two dependent integer ALU instructions in each cycle.

Similarly, Willamette and Northwood (the first two Pentium 4
generations) support two dependent integer ALU operations per cycle.

Prescott (third generation Pentium 4) and AFAIK HyperSPARC (which
replaced SuperSPARC) did not have this feature, so apparently it was
not that important for performance (or it would have impacted cycle
time more than it would have helped IPC).

Implementationwise SuperSPARC used two ALUs for independent operations
followed by a crossbar followed by another two ALUs for the dependent
operation (so, given the total limit of three instructions per cycle,
there could be two ALU operations feeding one ALU operation or one ALU
operation feeding two ALU operations).

The Willamette/Northwood used two double-pumped ALUs, also on a
three-wide architecture, resulting in similar execution capabilities
as the SuperSPARC. I don't know if you could forward from one ALU to
the other every half-cycle. The data paths should be present (there
has to be forwarding every full cycle), so it probably could, but I
wonder if and how the timing worked out; after all, the Willamette and
Northwood were significantly higher clocked than the competition even
without double-pumping, how could they fit two (16-bit) ALU delays and
that much forwarding into one short cycle?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Three-way add

<28fd2615-a93e-4916-8622-fd45620b453en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31239&group=comp.arch#31239

by: Agner Fog - Sat, 18 Mar 2023 09:05 UTC

Anton Ertl wrote:
>(the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.

The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinarily high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
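
A staggered adder can be sketched as two half-width additions done in consecutive steps, with the carry from the lower half handed to the upper half. The model below assumes a 32-bit operand split into two 16-bit halves purely for illustration; the widths, the name `staggered_add` and the step structure are assumptions, not details taken from Intel documentation.

```python
HALF = 16                       # assumed width of each half of the adder
HALF_MASK = (1 << HALF) - 1
WORD_MASK = (1 << (2 * HALF)) - 1

def staggered_add(a: int, b: int) -> tuple[int, int, int]:
    """Add two 32-bit values in two steps; return (low, high, result)."""
    a &= WORD_MASK
    b &= WORD_MASK
    # Step 1: only the lower halves are added; the carry is kept for step 2.
    low_sum = (a & HALF_MASK) + (b & HALF_MASK)
    low, carry = low_sum & HALF_MASK, low_sum >> HALF
    # Step 2: the upper halves plus the carry from step 1.
    high = ((a >> HALF) + (b >> HALF) + carry) & HALF_MASK
    return low, high, (high << HALF) | low

low, high, result = staggered_add(0x0001FFFF, 0x00000001)
print(f"low={low:#06x} high={high:#06x} result={result:#010x}")
```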

My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
Y = ± A ± B ± C
I found that this instruction is very useful. The integer version executes in a single clock cycle. If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.

There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.
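
The order dependence is easy to reproduce with ordinary double-precision arithmetic. The values below are chosen for illustration only; `math.fsum` from the Python standard library is used as a reference because it returns the correctly rounded sum of its inputs.

```python
import math

a, b, c = 1.0e16, -1.0e16, 1.0

left_to_right = (a + b) + c     # a + b is exactly 0.0, so the result is 1.0
right_to_left = a + (b + c)     # b + c rounds back to -1.0e16, so the result is 0.0
exact = math.fsum([a, b, c])    # correctly rounded sum of all three operands

print(left_to_right, right_to_left, exact)
```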

Re: Three-way add

<2352fb83-9e2c-4a45-8c54-c92b8ca8c492n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31249&group=comp.arch#31249

by: MitchAlsup - Sat, 18 Mar 2023 18:43 UTC

On Saturday, March 18, 2023 at 4:05:35 AM UTC-5, Agner Fog wrote:
> Anton Ertl wrote:
> >(the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.
> The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinary high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
>
> My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
> Y = ± A ± B ± C
> I found that this instruction is very useful. The integer version executes in a single clock cycle. If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.
<
Integer multiplication, or the multiplier tree inside the FMAC unit?
<
Some side effects::
Y = - A - B - C has a carry in of {3} and can have a carry out of {0, 1, 2}
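
The carry-in of 3 follows from two's complement negation: each -X is ~X + 1, so three negated operands contribute three increments. A minimal sketch on 64-bit values (the name `neg_sum3` is made up, and only the result is modelled, not the carry-out):

```python
MASK64 = (1 << 64) - 1

def neg_sum3(a: int, b: int, c: int) -> int:
    """-a - b - c modulo 2**64, computed as ~a + ~b + ~c + 3."""
    inverted = (~a & MASK64) + (~b & MASK64) + (~c & MASK64)
    return (inverted + 3) & MASK64   # the 3 is the combined carry-in

print(hex(neg_sum3(5, 7, 9)))
```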
>
> There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.
<
There are 2 ways to utilize 3-operand FP addition::
a) Error Free Addition (IEEE 754-2019 calls this Augmented Addition)
b) Three Way Sum
<
(a) has several good properties:: none of the bits of {a, residual} overlap,
and {a, residual} = {a, residual} + b is easy to calculate and round.
(b) has none of these good properties.
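
In software, error-free addition is usually illustrated with Knuth's TwoSum, which returns the rounded sum together with the rounding error. The sketch below is that classic algorithm, not the IEEE 754-2019 operation itself: as far as I recall, the standard's augmented addition prescribes its own rounding rule for ties, so results can differ in tie cases. The helper name `two_sum` is made up.

```python
from fractions import Fraction

def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

s, e = two_sum(1.0e16, 1.0)
print(s, e)
# Check exactness with rational arithmetic (no rounding involved).
print(Fraction(s) + Fraction(e) == Fraction(1.0e16) + Fraction(1.0))
```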
<

Re: Three-way add

<190b14e7-c39d-4c14-9043-af8bf261a326n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31255&group=comp.arch#31255

by: MitchAlsup - Sat, 18 Mar 2023 21:46 UTC

On Saturday, March 18, 2023 at 4:05:35 AM UTC-5, Agner Fog wrote:
> Anton Ertl wrote:
> >(the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.
<
> The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinary high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
>
> My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
> Y = ± A ± B ± C
> I found that this instruction is very useful. The integer version executes in a single clock cycle.
<
I just went through 100,000 lines of ASM out of Brian's compiler. I am finding Y = ± A ± B ± C
is way under 1% (probably near 0.2%*); that is not 1% of instructions executed, but 1% of integer ADDs
executed. I ran into 3×-4× as many (i×j+k) IMACs as 3-operand adds.
{CoreMark, EMBench, LLVM front end, BLAS, Other Numerics, 560 subroutines}
>
Perhaps this is because my memory reference instructions operate under the
[Rbase+Rindex<<scale+Displacement] template, getting rid of most of the
need for 3-operand adds.
<
Nearly 1/4th of 3-operand ADDs want an immediate; I saw a range of {-257..+168}
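
To put that range into encoding terms, one can compute the smallest two's complement field that holds each end. This is just arithmetic on the two numbers quoted above; the helper name `signed_bits` is made up.

```python
def signed_bits(value: int) -> int:
    """Smallest two's complement width (in bits) that can represent `value`."""
    if value >= 0:
        return value.bit_length() + 1      # one extra bit for the sign
    return (-value - 1).bit_length() + 1   # e.g. -256 still fits in 9 bits

for bound in (-257, 168):
    print(bound, signed_bits(bound))
```

Under this calculation -257 just misses a 9-bit signed field (which covers -256..+255) and needs 10 bits, while +168 fits in 9.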
<
> If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.
<
A 3-operand ADD/SUB circuit is 2 gate delays longer than a 2-operand ADD/SUB.
However, a 3-operand ADD-scale is 1 gate shorter than your 3-operand ADD/SUB,
and you don't do [Rb+Ri<<sc+Disp] address generation!! Why?
>
> There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.

Re: Three-way add

<ce241638-82f4-4cfb-b00d-a67fb58cfca2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31274&group=comp.arch#31274

by: MitchAlsup - Sun, 19 Mar 2023 20:29 UTC

On Saturday, March 18, 2023 at 4:46:08 PM UTC-5, MitchAlsup wrote:
> On Saturday, March 18, 2023 at 4:05:35 AM UTC-5, Agner Fog wrote:
> > Anton Ertl wrote:
> > >(the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.
> <
> > The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinary high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
> >
> > My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
> > Y = ± A ± B ± C
> > I found that this instruction is very useful. The integer version executes in a single clock cycle.
> <
> I just went through 100,000 lines of ASM out of Brian's compiler. I am finding Y = ± A ± B ± C
> is way under 1% (probably near 0.2%*); not 1% of instructions executed, 1% of integer ADDs
> executed. I ran into 3×-4× (i×j+k) IMAC than 3-operand adds.
> {CoreMark, EMBencvh, LLVM front end, BLAS, Other Numerics, 560 subroutines}
> >
> Perhaps this is because my memory reference instructions operate under the
> [Rbase+Rindex<<scale+Displacement] template, getting rid of most of the
> need for 3-operand adds. I ran into 3×-4× (i×j+k) IMAC than 3-operand adds.
> <
> Nearly 1/4th of 3-operand ADDs want an immediate; I saw a range of {-257...+168}
> <
> > If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.
> <
> A 3-oprand ADD/SUB circuit is 2 gates delays longer than a 2-operand ADD/SUB.
<----------------------------------------------------------------------------------
> However, a 3-operand ADD-scale is 1 gates shorter than your 3-operand ADD/SUB
> and you don't do [Rb+Ri<<sc+Disp] Address generation !! Why ?
<
I withdraw this question. And I especially withdraw the implied accusation.
The ForwardCom ISA certainly has this address mode.
<
Agner, I am sorry my memory is so flaky these days.
<------------------------------------------------------------------------------------
> >
> > There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.

Re: Three-way add

<tv9k1e$3fig8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=31286&group=comp.arch#31286

by: Ivan Godard - Mon, 20 Mar 2023 12:37 UTC

On 3/18/2023 2:05 AM, Agner Fog wrote:
> Anton Ertl wrote:
>> (the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.
>
> The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinary high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
>
> My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
> Y = ± A ± B ± C
> I found that this instruction is very useful. The integer version executes in a single clock cycle. If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.
>
> There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.

Comprehensive addressing has a three-operand add even without scaling:
A[i].f has base address, index, and immediate offset.
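
The address of A[i].f can be written out as a three-term sum. The sketch below uses a hypothetical element size of 16 bytes and a field offset of 8 bytes; the function name and all numbers are illustrative assumptions. With a power-of-two element size the index term is a shift, which is the scaled form; without scaling hardware the index would have to be pre-multiplied, but the final address is still a three-operand add.

```python
MASK64 = (1 << 64) - 1

def effective_address(base: int, index: int, scale_log2: int, displacement: int) -> int:
    """base + (index << scale_log2) + displacement, modulo 2**64."""
    return (base + (index << scale_log2) + displacement) & MASK64

# Hypothetical layout: A starts at 0x1000, elements are 16 bytes, field f is at offset 8.
A_BASE, ELEM_SIZE_LOG2, F_OFFSET = 0x1000, 4, 8

# Address of A[3].f using the scaled form.
print(hex(effective_address(A_BASE, 3, ELEM_SIZE_LOG2, F_OFFSET)))
# Same address with the index already multiplied by the element size (no scaling).
print(hex(effective_address(A_BASE, 3 * 16, 0, F_OFFSET)))
```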

Re: Three-way add

<8f094b13-ed71-4ffc-9c39-35ef354ea01en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31358&group=comp.arch#31358

by: MitchAlsup - Fri, 24 Mar 2023 21:20 UTC

On Monday, March 20, 2023 at 7:37:08 AM UTC-5, Ivan Godard wrote:
> On 3/18/2023 2:05 AM, Agner Fog wrote:
> > Anton Ertl wrote:
> >> (the first two Pentium 4 generations) support two dependent integer ALU operation per cycle.
> >
> > The Pentium 4 NetBurst architecture had double clock frequency on integer ALUs and half frequency on vector ALUs. In other words, an integer addition in general purpose registers was 4 times faster than an integer vector addition. The extraordinary high speed was obtained by making the adder staggered, which means that the upper half was added one clock cycle later than the lower half. Independent 64-bit additions could run at pipelined speed, while subsequent dependent additions had to wait an extra clock cycle. (This is discussed on p. 59 in my microarchitecture manual).
> >
> > My ForwardCom instruction set has a 3-way addition instruction with option bits for sign change on each operand:
> > Y = ± A ± B ± C
> > I found that this instruction is very useful. The integer version executes in a single clock cycle. If you can do a 64-bit multiplication in ~3 clock cycles then you can also do two additions in one clock cycle because a multiplication involves many additions.
> >
> > There is a problem with floating point 3-operand addition because the precision depends on the order of the operands. Maybe you need to take the two numerically largest operands first, or calculate the intermediate sum with very high precision.
<
> Comprehensive addressing has a three-operand add even without scaling:
> A[i].f has base address, index, and immediate offset.
<
If A ends up being unresolved during the link process, it may also require an
indirection through GOT to get the address of the section containing A.
