A Modern Version of the BSP

https://www.novabbs.com/devel/article-flat.php?id=22171&group=comp.arch#22171

Newsgroups: comp.arch
Date: Sun, 28 Nov 2021 23:05:07 -0800 (PST)
Message-ID: <c236aef8-4ca8-4dd3-a84d-21e658b34494n@googlegroups.com>
Subject: A Modern Version of the BSP
From: jsav...@ecn.ab.ca (Quadibloc)

The Burroughs Scientific Processor employed a unique design with 17
memory banks running in parallel so that matrix multiplication could be
highly efficient: because 17 is prime, a matrix could have power-of-two
dimensions, and the required stride wouldn't cause clashes in memory
accesses.
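The stride argument can be sketched numerically. As a minimal illustration (the real BSP's bank mapping was more involved than a plain modulo), compare how many banks a power-of-two stride actually touches with 16 banks versus 17:

```python
# A minimal sketch of why a prime bank count avoids stride conflicts.
# With 16 banks, a power-of-two stride revisits the same few banks;
# with 17 banks (17 is prime), any stride that is not a multiple of 17
# cycles through all 17 banks before repeating.

def banks_touched(num_banks, stride, accesses=32):
    """Return the set of banks hit by a strided access pattern."""
    return {(i * stride) % num_banks for i in range(accesses)}

for stride in (1, 2, 4, 8, 16):
    hit16 = len(banks_touched(16, stride))
    hit17 = len(banks_touched(17, stride))
    print(f"stride {stride:2d}: 16 banks -> {hit16:2d} used, "
          f"17 banks -> {hit17:2d} used")
```

With 16 banks, a stride of 16 hammers a single bank; with 17 banks, every power-of-two stride spreads accesses across all 17.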

Maybe the idea doesn't really make sense, because surely it is trivial to pad
an array with an extra row and/or column if necessary.

But ignoring that for now: if one's goal really is high LINPACK scores,
then, drawing on the previous discussion of odd-length memory accesses,
I've come up with some thoughts as to how such a machine would look.

Basically, I see a bus which is mostly shared between two memory
subsystems.

On the left is DRAM: 16 modules, each 72 bits wide.

On the right is external static RAM: 17 modules, each 72 bits wide.

Coming out of the chip... there are 33 address buses. Except that the
low-order bits of corresponding buses on the left and right share the
same pins.

And, similarly, there are 33 data buses. But here the pin count is much
closer to that of only 17 data buses, because only the four high-order
bits of corresponding buses on the left and right are independent and
go to separate pins on the processor.

The R/W pins are completely separate for the buses on the two sides of the
chip.
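A back-of-the-envelope pin count makes the sharing scheme concrete, under my reading of the description: 17 right-side buses of 72 bits, with each left-side bus sharing all but its 4 high-order bits with the corresponding right-side bus. The post also routes the ECC lines (8 of the 72 bits per module) separately per side, which adds pins; all figures here are illustrative, since the post gives no exact counts.

```python
# Rough data-bus pin arithmetic for the shared-bus scheme (illustrative;
# widths inferred from the 72-bit module width stated in the post).

BUS_WIDTH = 72        # one module's data bus, including its ECC lines
LEFT, RIGHT = 16, 17  # DRAM-side and static-RAM-side module counts

fully_separate = (LEFT + RIGHT) * BUS_WIDTH   # 33 independent buses
shared = RIGHT * BUS_WIDTH + LEFT * 4         # only 4 high bits separate
shared_ecc = shared + LEFT * 8                # plus per-side ECC lines

print(fully_separate, shared, shared_ecc)     # 2376 1288 1416
```

Even with the separately routed ECC lines, the shared scheme needs well under two-thirds of the pins of 33 fully independent buses.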

For "normal" operation - that is, computation not involving vectors - the
high-order bits of corresponding left and right buses are just passed through
the chip. The seventeenth bank of static RAM is not used.

On both sides of the CPU, ECC is employed on pairs of memory modules,
so that 16 bits of ECC are used with 128 bits of data. (The data bus lines for ECC are also separate for the left, DRAM, side and the right, static
RAM, side.)
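One consistent reading of "16 bits of ECC with 128 bits of data" is a pair of standard 64-bit SECDED codes (the post later hints at stronger codes like chipkill, so this is just one interpretation). A quick check with the extended Hamming bound:

```python
# Check that 16 ECC bits per 128 data bits matches two standard SECDED
# codes: an extended Hamming (SEC-DED) code on m data bits needs r
# check bits with 2**(r-1) >= m + r.

def secded_check_bits(data_bits):
    """Smallest r such that an extended Hamming code covers data_bits."""
    r = 1
    while 2 ** (r - 1) < data_bits + r:
        r += 1
    return r

print(secded_check_bits(64))      # 8 check bits per 64-bit word
print(2 * secded_check_bits(64))  # 16 per 128-bit pair, as in the post
```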

The option exists to divide the 16-channel memory into two 8-channel
memories or four 4-channel memories. This allows programs that don't
use vectors to avoid wasting bandwidth on data they won't use; instead,
more programs can run at the same time without interfering with each
other, and the memory bandwidth is employed more efficiently.

Four 4-channel memories is the maximum because 128 bits of data,
due to the ECC, is the irreducible unit, and two of those buses are needed
to provide efficient access to unaligned data.
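The need for two buses per unaligned access can be sketched as follows, assuming the 128-bit (16-byte) ECC granule from above is the smallest readable unit; the addressing details are my own illustration:

```python
# An unaligned access can straddle two adjacent 128-bit ECC units; with
# two independent channels, both units can be fetched concurrently.

UNIT = 16  # bytes per 128-bit ECC unit

def units_touched(byte_addr, nbytes=16):
    """List the ECC units covered by an access of nbytes at byte_addr."""
    first = byte_addr // UNIT
    last = (byte_addr + nbytes - 1) // UNIT
    return list(range(first, last + 1))

print(units_touched(0))   # aligned access: one unit
print(units_touched(8))   # unaligned access: straddles two units
```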

However, the chip has ways to deal with a different memory organization
in other parts of the memory.

In the BSP mode... ECC is still 16 bits per 128 bits of data on the left side.

On the right side, it is 12 bits per 60 bits of data.

So the top four data bits are now used for ECC, and the same four bits
from the left side are routed to the seventeenth bank of memory modules
on the right side.

Twelve bits of ECC are enough for DECTED; I don't know if chipkill can
be managed that way. Maybe if the chips are 8 bits wide instead of 4
bits, so there are fewer possibilities to take care of.

So on the DRAM side, the vector data is 'packed'; on the static RAM
side, it is unpacked into a seventeen-wide form with seventeen
independent address buses, just what's needed for BSP-style operation.
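The repacking step can be sketched as follows, under my reading: on the DRAM side, 60-bit vector elements are stored contiguously ("packed"); on the static RAM side, each element occupies one 72-bit module word (60 data bits plus 12 ECC bits), one element per module across the 17 modules. ECC generation is omitted and the function names are my own; only the bit-level reshuffling is shown:

```python
# Pack/unpack between the contiguous DRAM-side stream and the
# seventeen-wide static-RAM-side form (ECC generation omitted).

ELEM_BITS = 60
MODULES = 17

def unpack(packed, count):
    """Split a packed bit stream into 60-bit elements, assigned
    round-robin to the 17 static-RAM modules."""
    mask = (1 << ELEM_BITS) - 1
    return [(i % MODULES, (packed >> (i * ELEM_BITS)) & mask)
            for i in range(count)]

def pack(elems):
    """Inverse: concatenate the 60-bit payloads back into one stream."""
    packed = 0
    for i, (_module, word) in enumerate(elems):
        packed |= word << (i * ELEM_BITS)
    return packed

data = 0x123456789ABCDEF0FEDCBA9876543210
assert pack(unpack(data, 3)) == data  # lossless round trip
```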

This means, though, that we need to use 60-bit floating-point.

But what if one needs floats larger than 64 bits instead?

Well, given that with DRAM, at least, it is most effective if one is
fetching data from 16 consecutive addresses at once... an obvious
strategy suggests itself.

Use 120 bit floats, put the first half of the float in one word, the second
half of the float in the following word, within a single memory module...

and deem the next 120 bits of memory space to be the first two words
in the next memory module over.

So the memory structure is a bit interleaved here.
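The interleaved layout for 120-bit floats can be sketched as an address mapping, as I read it: element i lives entirely within one memory module, in two consecutive 60-bit words, and consecutive elements go to consecutive modules. The module count is an assumption (17 matches the static-RAM side; on the DRAM side it would be 16):

```python
# Map a 120-bit vector element to its (module, word pair) location in
# the interleaved layout described above.

def locate(i, modules=17):
    """Map 120-bit element i to (module, first_word, second_word)."""
    module = i % modules
    first = 2 * (i // modules)
    return module, first, first + 1

print(locate(0))    # (0, 0, 1)
print(locate(1))    # (1, 0, 1)  next module over, same word pair
print(locate(17))   # (0, 2, 3)  wrapped around to the next word pair
```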

Perhaps this verbal description, without diagrams, of such a complicated
setup is as clear as mud. I hope at least that Mitch Alsup can understand
it, although he may deem it to be the veriest insanity.

John Savard
