devel / comp.arch / Oops (The Burroughs Scientific Processor)

Subject                                                      Author
* Oops (The Burroughs Scientific Processor)                  Quadibloc
+- Re: Oops (The Burroughs Scientific Processor)             Brett
`* Re: Oops (The Burroughs Scientific Processor)             Anton Ertl
 `* Re: Oops (The Burroughs Scientific Processor)            Quadibloc
  `* Re: Oops (The Burroughs Scientific Processor)           Anton Ertl
   +* Re: Oops (The Burroughs Scientific Processor)          Quadibloc
   |`- Re: Oops (The Burroughs Scientific Processor)         Anton Ertl
   `* Re: Oops (The Burroughs Scientific Processor)          Thomas Koenig
    +* Re: Oops (The Burroughs Scientific Processor)         Michael S
    |`- Re: Oops (The Burroughs Scientific Processor)        Thomas Koenig
    +- Re: Oops (The Burroughs Scientific Processor)         antispam
    `- Re: Oops (The Burroughs Scientific Processor)         Anton Ertl

Oops (The Burroughs Scientific Processor)

<3a9250d0-6729-44a3-861f-8bd171ada65cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24569&group=comp.arch#24569

Newsgroups: comp.arch
 by: Quadibloc - Mon, 4 Apr 2022 05:50 UTC

As people who have read my posts are aware, I am fond of the
Cray-1 style of vector architecture.
Also, I am fond of designs with more than one address bus, so that
a wide memory is divided across its width into independently
addressed sections, allowing unaligned data to be accessed at full speed.
Another thing I'm interested in is a design that allows
computations to take place at very great speed.

One very important type of computation performed by high-speed scientific
computers is matrix multiplication. Since the elements in the rows of one
matrix are multiplied by the elements in the columns of the other one, it is
important to be able to perform vector operations with _stride_; that is,
instead of reading consecutive elements from one matrix, elements that are
spaced apart by a certain amount must be read.
One common case is matrices whose dimensions are a power of two; the
stride is then also a power of two, so successive strided accesses
keep hitting the same memory bank within a wide memory bus, preventing
the bus from delivering operands at its full bandwidth.
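
To make the bank-conflict point concrete, here is a small C sketch
(my illustration, not anything from the BSP literature) that counts
how many distinct banks a strided access stream touches. With 16
banks, a stride of 16 stays on a single bank; with 17 banks, any
stride that is not a multiple of 17 cycles through all of them:

#include <stdio.h>

/* Count how many distinct banks are touched by n strided accesses
   when addresses are interleaved across the given number of banks. */
static int banks_touched(int banks, int stride, int n)
{
    int seen[32] = {0}, count = 0, i;
    for (i = 0; i < n; i++) {
        int b = (i * stride) % banks;
        if (!seen[b]) { seen[b] = 1; count++; }
    }
    return count;
}

int main(void)
{
    printf("16 banks, stride 16:   %d\n", banks_touched(16, 16, 64));
    printf("17 banks, stride 16:   %d\n", banks_touched(17, 16, 64));
    printf("17 banks, stride 2048: %d\n", banks_touched(17, 2048, 64));
    return 0;
}

This prints 1, 17, and 17: the power-of-two stride serializes on one
bank, while the prime bank count spreads any reasonable stride evenly.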
The Burroughs Scientific Processor was an attempt to deal with this issue. It
had a memory bus wide enough to transfer seventeen 64-bit floating-point
numbers, and that bus was divided into seventeen parts, each with its own
address bus.
Addresses were divided by 17 by multiplying them by 15: since 17 times
15 is 255, this turns the problem into an exact division by 255, which
is cheap to perform with an end-around carry.
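
Here is a minimal C sketch of how that arithmetic can work (my
reconstruction for illustration, not the actual BSP circuit; the
end-around-carry identity used below is exact over the range checked):

#include <assert.h>

/* addr/17 without a divider: multiply by 15, turning the problem into
   division by 255 (= 17 * 15), and divide by 255 with an end-around
   carry: (y + (y >> 8) + 1) >> 8 equals y/255 for all y <= 65534. */
static unsigned div17(unsigned addr)
{
    unsigned y = addr * 15;
    return (y + (y >> 8) + 1) >> 8;
}

int main(void)
{
    unsigned addr;
    for (addr = 0; addr <= 4368; addr++) {   /* 15 * 4368 <= 65534 */
        assert(div17(addr) == addr / 17);
        assert(addr - 17 * div17(addr) == addr % 17);  /* the bank */
    }
    return 0;
}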

I had been thinking about what it would take to design a modern version of
this computer.
One thing is that the behavior of modern DRAM is not suitable for this kind
of memory subsystem. But it's possible to get external static RAM modules,
or a considerable amount of static RAM could be included within a multi-
chip module.
But if I wanted it to run at a high speed, where it just keeps continuously
repeating the same operation for cycle after cycle on each of multiple
buses, it seemed to me that the design was growing into something even
more impractical than it started out as.
However, on further thought, I see now that it isn't really so bad.
One needs _one_ memory subsystem, built around static RAM, that
follows the design of the Burroughs Scientific Processor. Yes, this
is already highly impractical, I know. But one doesn't need three or
four of them, as I feared I might.
And one only needs one other memory subsystem - a conventional
one made from DRAM that is of about the same width. Wide enough
for sixteen double-precision floating-point numbers, so that would be
1,024 bits wide.
And that's it.
Because what happens in each cycle during sustained matrix
multiplication goes like this:
A strided fetch from the BSP-like static RAM subsystem, where the
successive elements are at some arbitrary spacing (as long as the
stride isn't a multiple of 17 - and an arbitrary stride is far less
likely to be a multiple of 17 than of 16), and a fetch of consecutive
elements from the conventional subsystem.
And then in the CPU, these numbers are shoved into the multiplication
pipeline, and another fetch is done.
Eventually, when one gets to the end of a row of the first matrix, and
a column of the second one, one has calculated one element of the
product matrix; when one has sixteen of those elements, one can store
them from cache into the same conventional DRAM as contained the
first matrix; that is a minimal overhead, not justifying a third memory
subsystem.
So a roaring monster of a matrix multiplier does not need multiple
BSP-style static-RAM memory subsystems, just one of them, and it
definitely doesn't need memory that is also capable of adding FP64
numbers sent to it to what is already in storage (with a one-per-cycle
pipeline, and without needing a heat sink, to boot).

Instead, this amazing dream machine is... not _utterly_
preposterous?

John Savard

Re: Oops (The Burroughs Scientific Processor)

<t2e1q3$jro$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=24570&group=comp.arch#24570

Newsgroups: comp.arch
 by: Brett - Mon, 4 Apr 2022 06:04 UTC

Quadibloc <jsavard@ecn.ab.ca> wrote:
> [...]
> And one only needs one other memory subsystem - a conventional
> one made from DRAM that is of about the same width. Wide enough
> for sixteen double-precision floating-point numbers, so that would be
> 1,024 bits wide.
> [...]

Buy an Apple M1 Pro; it has four times the DRAM bandwidth of everyone else.

Re: Oops (The Burroughs Scientific Processor)

<2022Apr4.131340@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24573&group=comp.arch#24573

Newsgroups: comp.arch
 by: Anton Ertl - Mon, 4 Apr 2022 11:13 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>One very important type of computation performed by high-speed scientific
>computers is matrix multiplication. Since the elements in the rows of one
>matrix are multiplied by the elements in the columns of the other one, it is
>important to be able to perform vector operations with _stride_;

Not at all. Matrix multiply provides a lot of freedom in reordering
operations. You are apparently thinking of

for (i=0; i<n; i++)
  for (j=0; j<p; j++) {
    for (k=0, r=0.0; k<m; k++)
      r += a[i*m+k]*b[k*p+j];
    c[i*p+j]=r;
  }

which is already slow because of the recurrence through r (incurring
an FP addition latency per iteration of the inner loop). Let's first
transform this into the (slightly slower)

for (i=0; i<n; i++)
  for (j=0; j<p; j++)
    c[i*p+j] = 0.0;
for (i=0; i<n; i++)
  for (j=0; j<p; j++)
    for (k=0; k<m; k++)
      c[i*p+j] += a[i*m+k]*b[k*p+j];

which is basically the same, but using c[i*p+j] instead of r.

But now you can perform loop interchange. The best among the 6
orderings of the second loop nest is:

for (i=0; i<n; i++)
  for (k=0; k<m; k++)
    for (j=0; j<p; j++)
      c[i*p+j] += a[i*m+k]*b[k*p+j];

and guess what: In the inner loop, this accesses the same element of a
in every iteration, while accessing b and c sequentially. No strided
access needed.

Moreover, matrix multiply is amenable to cache blocking, which reduces
the main memory bandwidth needed so much that matrix multiply speed is
limited by FPU bandwidth.
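
For illustration, a minimally cache-blocked version of the loop nest
above might look like this (a sketch only; real BLAS kernels add
packing, register tiling and more, and the tile size is a tunable):

#define BLK 64   /* tile size; three BLK x BLK tiles should fit in cache */

/* c must be zeroed first, as in the second loop nest above */
void matmul_blocked(double *c, const double *a, const double *b,
                    int n, int m, int p)
{
    int ii, kk, jj, i, k, j;
    for (ii = 0; ii < n; ii += BLK)
        for (kk = 0; kk < m; kk += BLK)
            for (jj = 0; jj < p; jj += BLK)
                /* the i-k-j ordering from above, applied to one tile */
                for (i = ii; i < ii + BLK && i < n; i++)
                    for (k = kk; k < kk + BLK && k < m; k++)
                        for (j = jj; j < jj + BLK && j < p; j++)
                            c[i*p + j] += a[i*m + k] * b[k*p + j];
}

Each tile of b and c is then reused BLK times from cache instead of
being refetched from DRAM.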

There are other HPC operations that are memory-bandwidth limited, and
where optimizing strided access may be helpful, but matrix multiply is
not one of them.

>Instead, this amazing dream machine is... not _utterly_
>preposterous?

It's preposterous for matrix multiplication.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Oops (The Burroughs Scientific Processor)

<5605c20f-a704-486c-ba1a-ccb3e08910d5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24574&group=comp.arch#24574

Newsgroups: comp.arch
 by: Quadibloc - Mon, 4 Apr 2022 16:49 UTC

On Monday, April 4, 2022 at 5:33:29 AM UTC-6, Anton Ertl wrote:

> which is already slow because of the recurrence through r (incurring
> an FP addition latency per iteration of the inner loop).

I am assuming that the CPU will be designed so as to avoid that
completely. The fetches take place. The multiplies take place. The
products are then sent off to a pipeline to be added together, with
no waiting before fetching and multiplying continue.

It is important to know that matrix multiplication, at least, can
be done easily on inexpensive conventional hardware. What I propose
is almost twice as fast in memory traffic: with the loop-interchanged
version, each element of a 2048 by 2048 product costs 2048 reads of
a, 2048 reads of b, 2048 reads of c, and 2048 writes of c, where my
scheme needs only 2048 reads of a, 2048 reads of b, and a single
write of c. But it certainly does require expensive hardware.

John Savard

Re: Oops (The Burroughs Scientific Processor)

<2022Apr5.094909@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24594&group=comp.arch#24594

Newsgroups: comp.arch
 by: Anton Ertl - Tue, 5 Apr 2022 07:49 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>On Monday, April 4, 2022 at 5:33:29 AM UTC-6, Anton Ertl wrote:
>
>> which is already slow because of the recurrence through r (incurring
>> an FP addition latency per iteration of the inner loop).
>
>I am assuming that the CPU will be designed so as to be
>completely able to avoid that. The fetches take place. The
>multiplies take place. The products are then sent off to
>a pipeline to be added together, with no waiting to continue
>fetching and multiplying.

That's what every OoO CPU does. However, this works only until you
have filled up the reorder buffer or rename registers (whichever is
full first); after that new adds and multiplies have to wait until an
add retires (and allows all finished instructions after it to be
retired). From that point onward, the performance is determined by
the latency of the adds. See the end of this posting for empirical
results.

Plus, many implementations of matrix multiply use multiply-add.
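
For example (a sketch of the standard software remedy, not code from
any particular library), the recurrence can be broken in software by
keeping several independent partial sums:

#include <math.h>

/* One dot product with four independent partial sums: consecutive
   fma()s feed different accumulators, so the adds no longer form a
   single serial dependence chain. */
double dot4(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 = fma(x[i],     y[i],     s0);
        s1 = fma(x[i + 1], y[i + 1], s1);
        s2 = fma(x[i + 2], y[i + 2], s2);
        s3 = fma(x[i + 3], y[i + 3], s3);
    }
    for (; i < n; i++)        /* leftover elements */
        s0 = fma(x[i], y[i], s0);
    return (s0 + s1) + (s2 + s3);
}

But this is exactly the kind of smarts the hardware was supposed to
make unnecessary.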

>It is important to know that matrix multiplication, at least, can
>be done easily on inexpensive conventional hardware. What I propose
>is almost twice as fast in memory traffic: with the loop-interchanged
>version, each element of a 2048 by 2048 product costs 2048 reads of
>a, 2048 reads of b, 2048 reads of c, and 2048 writes of c, where my
>scheme needs only 2048 reads of a, 2048 reads of b, and a single
>write of c. But it certainly does require expensive hardware.

What I showed was just the first step in getting a fast matrix
multiplication implementation; a very good one is OpenBLAS. Let's see
how they perform on 2048x2048 DP matrix multiplication compared to the
naive approach, on a 4GHz Core i5-6600K (256KB L2, 6MB L3 cache,
DDR4-2133 memory):

          naive      first-step        OpenBLAS
206,800,266,916  15,523,161,561   1,331,211,591  cycles
 60,307,651,807  23,934,570,817   4,252,192,328  instructions
  8,483,988,740      46,086,939         323,246  LLC-load-misses
      2,961,335       2,296,936       2,578,391  LLC-store-misses
   51.703015790     3.981516397     0.344595877  seconds time elapsed

[OpenBLAS is used with OMP_NUM_THREADS=1 here for better comparability]

So the naive approach that you want to design hardware for results in
many more DRAM accesses than the first-step approach. The magic of
caches and hardware prefetchers combines to reduce LLC misses
(unfortunately I don't find a performance counter that tells me the
amount of DRAM reads and writes). OpenBLAS manages that even better.

You can also see that first-step reduces the number of instructions by
a factor of 2.5, thanks to auto-vectorization (sometimes it works) using
256-bit AVX. OpenBLAS uses a lot of smarts, FMA and unrolling to achieve
another reduction in instructions by a factor of more than 4.

Finally, in the cycles and seconds department, the naive approach
takes 13 times more time than first-step, and 150 times more time than
OpenBLAS; that's indeed due to the bad access pattern to array b,
which you want to mitigate with your special hardware. Even the
addition latency does not explain the slowness of the naive algorithm
in this case (there are 8G adds in 2048x2048 matrix multiplication;
with 4 cycles latency that would explain only 32G cycles).

I am all for designing hardware such that software developers can
concentrate on the functionality rather than having to work around
hardware peculiarities. But in this case, you are proposing hardware
motivated by one specialized problem that has already been solved
efficiently on existing hardware (with a little (first-step) or a lot
(OpenBLAS) of smarts), and using your special hardware would require
special software adaptation (for the placement of b), so it would not
even satisfy the ideal of just being able to make the naive algorithm
fast without further ado. And we already have the software for making
matrix multiply run fast on existing, cheap hardware.

And here are the promised empirical results for the latency effect: we
reduce the size of our arrays such that they fit into the L3 cache and
do a 500x500 matrix multiply, for a total of 125M FP adds (with a total
latency of 500M cycles at 4 cycles per add).

        naive   first-step     OpenBLAS
  516,676,742  161,294,024   38,004,823  cycles
  920,763,078  367,494,328   82,732,065  instructions
       23,205       20,543       20,511  LLC-load-misses
      130,501      129,572      176,072  LLC-store-misses
  0.131489890  0.042522620  0.011523877  seconds time elapsed

Here the DRAM accesses play a minor role, and the latency of the FP
adds dominates the performance of the naive approach. The first-step
approach has nice parallel adds and is limited by the CPU resources.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Oops (The Burroughs Scientific Processor)

<c7d5850d-e863-41b9-9d3a-76be866e8a60n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24598&group=comp.arch#24598

Newsgroups: comp.arch
 by: Quadibloc - Tue, 5 Apr 2022 14:56 UTC

On Tuesday, April 5, 2022 at 3:28:39 AM UTC-6, Anton Ertl wrote:

> That's what every OoO CPU does. However, this works only until you
> have filled up the reorder buffer or rename registers (whichever is
> full first);

Oh, yes. However, what I'm talking about is a CPU that is
especially designed to be able to do things like matrix
multiplication in an _efficient_ manner.

Thus, it is to be programmed in such a fashion that it can
just keep pipelining the multiplications indefinitely and
continuously, without _ever_ running out of resources.
It is not infeasible to design a CPU this way.

However, I certainly agree that in practice at the present
time, it's better to use COTS hardware. A factor of 2 in
memory bandwidth is nothing beside a factor of 10 or
100 in cost. So I'm not at all saying you're wrong.

But you have acknowledged that there are other matrix
operations where efficient strided access is useful.

It's not enough to just _design_ hardware that's especially
suited to bleeding-edge HPC, letting one get more work out
of each square millimetre of silicon than present designs
permit.

It is also necessary to ensure that such hardware is fully
cost-competitive with commodity microprocessors, in dollars
per square millimetre of die area on any given process.

I'm sure that some readers who have gotten as far as the
preceding paragraph will now have fallen off their chairs,
and will be rolling on the floor laughing. So, given that I am
actually serious here, I suppose I had better explain how I
think such an ambitious goal could conceivably be achieved.

Look at a gaming rig... even from three years ago, before the
GPU shortages started.

Think of how much processing power it has.

Think of how this kind of processing power, just used for playing
games, would look to someone in 1979, let alone 1968.

Hence, my plan is the following: after designing ultra high-end
HPC hardware, don't just manufacture enough of it to go into
the nation's supercomputers. Instead, also score a design win
in the world's game consoles and desktop PCs.

John Savard

Re: Oops (The Burroughs Scientific Processor)

<2022Apr5.175518@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24600&group=comp.arch#24600

Newsgroups: comp.arch
 by: Anton Ertl - Tue, 5 Apr 2022 15:55 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>On Tuesday, April 5, 2022 at 3:28:39 AM UTC-6, Anton Ertl wrote:
>
>> That's what every OoO CPU does. However, this works only until you
>> have filled up the reorder buffer or rename registers (whichever is
>> full first);
>
>Oh, yes. However, what I'm talking about is a CPU that is
>especially designed to be able to do things like matrix
>multiplication in an _efficient_ manner.

A CPU with two AVX-512 FMA units? The increase in the number of
register names in AVX-512 increases the computation-to-load/store
ratio even more than before.
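
A scalar sketch of that register-tiling effect (my illustration, not
OpenBLAS's actual kernel): with a 4x4 accumulator tile, each step of
the k loop performs 16 multiply-adds for only 8 loads, and more
architectural registers let the tile, and with it the
computation-to-load/store ratio, grow:

/* Accumulate a 4x4 tile of c (layouts as in the loop nests earlier
   in the thread: a is n x m, b is m x p, c is n x p, row-major). */
void tile4x4(double *c, const double *a, const double *b,
             int m, int p, int i, int j)
{
    double t[4][4] = {{0.0}};   /* 16 accumulators, ideally registers */
    int k, x, y;
    for (k = 0; k < m; k++)
        for (x = 0; x < 4; x++)
            for (y = 0; y < 4; y++)   /* 8 loads feed 16 multiply-adds */
                t[x][y] += a[(i + x)*m + k] * b[k*p + (j + y)];
    for (x = 0; x < 4; x++)
        for (y = 0; y < 4; y++)
            c[(i + x)*p + (j + y)] += t[x][y];
}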

>Thus, it is to be programmed in such a fashion that it can
>just keep pipelining the multiplications indefinitely and
>continuously, without _ever_ running out of resources.
>It is not infeasible to design a CPU this way.

Sure, with a slightly better implementation of matrix multiply.

>But you have acknowledged that there are other matrix
>operations where efficient strided access is useful.

There are HPC routines that are memory-limited. Efficient
implementations of matrix multiply are not among them.

Despite memory-limited HPC tasks, memory systems like the Burroughs
one have not been seen on supercomputers in recent decades, so
apparently the disadvantages of this design outweigh its advantages.

>Hence, my plan is the following: after designing ultra high-end
>HPC hardware, don't just manufacture enough of it to go into
>the nation's supercomputers. Instead, also score a design win
>in the world's game consoles and desktop PCs.

Sounds like the story of the Cell Broadband Engine, which was not a
long-term success in game consoles.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Oops (The Burroughs Scientific Processor)

<t2jr3m$s80$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=24610&group=comp.arch#24610

Newsgroups: comp.arch
 by: Thomas Koenig - Wed, 6 Apr 2022 10:47 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

>           naive      first-step       OpenBLAS
> 206,800,266,916  15,523,161,561  1,331,211,591  cycles

I have run a few benchmarks on OpenBLAS myself, and it was my
impression that they use fast matrix multiplication above some
threshold - if you calculate the FLOPs per cycle assuming the
standard algorithm, it goes up for very large matrices.
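
(A worked example of that check, using the 2048x2048 OpenBLAS numbers
quoted above and assuming the standard 2*N^3 FLOP count:
2 * 2048^3 / 1,331,211,591 cycles is about 12.9 DP FLOPs per cycle,
roughly 80% of the 16 FLOPs/cycle peak of a core with two 256-bit FMA
units - consistent with an O(N^3) algorithm running near peak.)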

Re: Oops (The Burroughs Scientific Processor)

<dd5a05c4-8e7a-4445-88b5-12371b15e2f9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24611&group=comp.arch#24611

Newsgroups: comp.arch
 by: Michael S - Wed, 6 Apr 2022 15:28 UTC

On Wednesday, April 6, 2022 at 1:47:21 PM UTC+3, Thomas Koenig wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >           naive      first-step       OpenBLAS
> > 206,800,266,916  15,523,161,561  1,331,211,591  cycles
> I have run a few benchmarks on OpenBLAS myself, and it was my
> impression that they use fast matrix multiplication above some
> threshold - if you calculate the FLOPs per cycle assuming the
> standard algorithm, it goes up for very large matrices.

Maybe they use non-O(N**3) algorithms for very big values of N,
but I am pretty sure that they don't use them for N=500 or 1000.

If you see FLOPs/clock higher than expected, the most likely
explanation is not a different algorithm but a wrong assumption
about the clock frequency.

Re: Oops (The Burroughs Scientific Processor)

<t2kdum$bgo$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=24612&group=comp.arch#24612

Newsgroups: comp.arch
 by: Thomas Koenig - Wed, 6 Apr 2022 16:08 UTC

Michael S <already5chosen@yahoo.com> schrieb:
> On Wednesday, April 6, 2022 at 1:47:21 PM UTC+3, Thomas Koenig wrote:
>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> >           naive      first-step       OpenBLAS
>> > 206,800,266,916  15,523,161,561  1,331,211,591  cycles
>> I have run a few benchmarks on OpenBLAS myself, and it was my
>> impression that they use fast matrix multiplication above some
>> threshold - if you calculate the FLOPs per cycle assuming the
>> standard algorithm, it goes up for very large matrices.
>
> Maybe they use non-O(N**3) algorithms for very big values of N,
> but I am pretty sure that they don't use them for N=500 or 1000.

That's what I also found. The benchmarks I ran were against
the libgfortran implementation. I was rather surprised and
pleased that gfortran actually held its own up to a matrix
size of somewhere around 1000; then OpenBLAS really took off.

> If you see FLOPs/clock higher than expected, the most likely
> explanation is not a different algorithm but a wrong assumption
> about the clock frequency.

Unfortunately, gfortran's algorithm didn't take off,
so I guess that can be discounted :-)

Re: Oops (The Burroughs Scientific Processor)

<t2keeb$qcl$2@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24613&group=comp.arch#24613

Newsgroups: comp.arch
 by: antis...@math.uni.wroc.pl - Wed, 6 Apr 2022 16:17 UTC

Thomas Koenig <tkoenig@netcologne.de> wrote:
> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
> >           naive      first-step       OpenBLAS
> > 206,800,266,916  15,523,161,561  1,331,211,591  cycles
>
> I have run a few benchmarks on OpenBLAS myself, and it was my
> impression that they use fast matrix multiplication above some
> threshold - if you calculate the FLOPs per cycle assuming the
> standard algorithm, it goes up for very large matrices.

Hmm, in the past I looked at Atlas results. At that time Atlas
used a relatively simple n^3 algorithm; the smart part was dividing
the matrix into blocks to "optimize" cache use. That involved
copies and other "lower order" overhead, with the effect that
top speed was attained only for large matrices. On a modern
machine, due to vector instructions and bigger caches, top
speed is likely to require even larger matrices than in the past.

--
Waldek Hebisch

Re: Oops (The Burroughs Scientific Processor)

<2022Apr6.185322@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24615&group=comp.arch#24615

Newsgroups: comp.arch
 by: Anton Ertl - Wed, 6 Apr 2022 16:53 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>>           naive      first-step       OpenBLAS
>> 206,800,266,916  15,523,161,561  1,331,211,591  cycles
>
>I have run a few benchmarks on OpenBLAS myself, and it was my
>impression that they use fast matrix multiplication above some
>threshold - if you calculate the FLOPs per cycle assuming the
>standard algorithm, it goes up for very large matrices.

Let's see:

for i in 1024 512 1024 2048 4096 8192; do LC_NUMERIC=en_US.utf8 OMP_NUM_THREADS=1 perf stat -B -e cycles -e instructions -e LLC-load-misses -e LLC-store-misses openblas $i; done

        512          1024           2048            4096             8192
 43,137,904   189,235,048  1,297,630,540   9,948,584,662   78,683,259,360  cycles
 93,860,449   551,710,788  4,204,698,737  33,154,242,476  263,489,265,458  insts
     15,151        75,822        302,376       1,268,831        6,034,937  load-miss
    159,264       662,265      2,520,866       9,848,313       38,892,869  store-miss
       2.18          2.92           3.24            3.33             3.35  IPC

The factor in the instructions between the 8192 and 4096 case is
7.947, close enough to the naive 8 that the gap can be explained by
lower overheads (loop initialization, initialization of the input
arrays and writing the result to a file) for the larger matrices. So
I doubt that we are seeing an n^x algorithm with x<3. What is
interesting is that the IPC keeps increasing significantly even
though we are already starting with big matrices. The 1024 run at
the start (result not shown) is used for warming the CPU up.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
