novaBBS - comp.arch - Falcon Shores

On 2/26/2022 10:09 AM, Quadibloc wrote:
> Intel has announced its plans for a multi-chip module to go in a Xeon socket which will include integrated graphics on a separate die.
>
> Except that the GPU on the chip is really intended for use for things like AI models.
>
> And Raja Koduri even said it would offer "a vastly simplified GPU programming model".
>
> This almost sounds like what I've been waiting for!
>

If it has an open ISA, that could be useful.

Seems like Intel could also stick an FPGA in the CPU die, maybe have it
able to access the same memory bus as the CPU in a (hopefully sane)
manner (it being given its own region of memory to play around in).

Better still if it could achieve reasonably high clock speeds, and not
be absurdly expensive.

If on-chip, it seems like it wouldn't be asking too much to have a
512-bit memory bus interface or similar. Preferably avoiding some of the
needless complexity of AXI and similar.

Would be happy if it worked sort of like a ringbus, say:
Send a request over a clock-edge:
Address, Data, Operation Mode, Sequence Number;
Eventually a response comes back out the other end;
Multiple requests may be in-flight at the same time;
...
The ring advances one request per clock cycle.
It seems probable that the bus interface would provide the clock.

I guess it is also a question of how good of an FPGA one could make if
using the same process as a CPU (vs 28nm or 40nm).

Like, hopefully it could achieve higher clock speeds than can be
achieved on an Artix or Spartan or similar.

I guess it would be sorta like a Cyclone V, just with Xeon cores in
place of the Cortex-A9 cores or similar?... (Then again, this would just
leave it as being a significantly more expensive DE10; would need an
FPGA significantly better than a Cyclone V to justify this...).

....

On Sat, 26 Feb 2022 20:11:21 -0600, BGB wrote:

> Seems like Intel could also stick an FPGA in the CPU die, maybe have it
> able to access the same memory bus as the CPU in a (hopefully sane)
> manner (it being given its own region of memory to play around in).

Intel did, a range of CPUs was announced in 2018 but I don't see any
mention of them in the current product list.

The first article that comes up for me in a search is this:

<https://www.nextplatform.com/2018/05/24/a-peek-inside-that-intel-xeon-fpga-hybrid-chip/>

This kind of thing should be easier for AMD to build now that they have
bought Xilinx, put the FPGA on a chiplet inside the package.

On 2/27/2022 6:15 AM, Robert Swindells wrote:
> On Sat, 26 Feb 2022 20:11:21 -0600, BGB wrote:
>
>> Seems like Intel could also stick an FPGA in the CPU die, maybe have it
>> able to access the same memory bus as the CPU in a (hopefully sane)
>> manner (it being given its own region of memory to play around in).
>
> Intel did, a range of CPUs was announced in 2018 but I don't see any
> mention of them in the current product list.
>
> The first article that comes up for me in a search is this:
>
> <https://www.nextplatform.com/2018/05/24/a-peek-inside-that-intel-xeon-fpga-hybrid-chip/>
>
> This kind of thing should be easier for AMD to build now that they have
> bought Xilinx, put the FPGA on a chiplet inside the package.

Would be cool, but then realizes that even if it were available, I
probably couldn't afford it...

I guess possible "fantasy" boards at the moment:
Something with a Kintex (1);
Something with a XC7A200T (such as the Nexys Video, 2).

1: Could experiment with higher clock speeds and more expensive designs.

2: This has a somewhat better FPGA, more RAM, ...
XC7A200T at a -2 speed grade (vs -1)
512MB of RAM
...

But, at present, don't have $500 to burn on an FPGA board...

Had noted recently that my project has significantly fewer likes and
followers than another project which seemingly is far less developed,
and which proposes a core with stats well outside what seems viable on a
low end FPGA.

They want 128K 4-way L1's and a 512K 8-way L2. Short of something like
an XC7K325 or similar, this likely falls into "not gonna happen" territory.
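
For a rough sense of scale, the arithmetic can be sketched as below. The
64-byte line size, the split into separate L1 I$ and D$, and the figure of
32 Kbit of data per block RAM are assumptions for illustration, not numbers
from the other project. Tag arrays and other state are not counted.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets in a set-associative cache. */
uint32_t cache_sets(uint32_t size_bytes, uint32_t ways, uint32_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    assert(size_bytes % (ways * line_bytes) == 0);
    return size_bytes / (ways * line_bytes);
}

/* Block RAMs needed for the data array alone, rounding up. */
uint32_t data_brams(uint32_t size_bytes, uint32_t bits_per_bram)
{
    uint64_t bits = (uint64_t)size_bytes * 8;
    assert(bits_per_bram > 0);
    return (uint32_t)((bits + bits_per_bram - 1) / bits_per_bram);
}
```

Under these assumptions, a 128K 4-way L1 has 512 sets and needs 32 blocks
for its data, and a 512K 8-way L2 has 1024 sets and needs 128 blocks. Two
L1s plus the L2 come to 192 blocks for data alone, which can then be
compared against the block RAM count in the target device's datasheet.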

In my case, I have ended up burning most of this month on trying
to get the MMU / TLB to be stable...

Also kinda annoyed because "Desktop Window Manager" is trying to eat the
GPU (excessive GPU activity, even with most graphical features turned
off). Starts wondering if maybe this is why my PC has been running like
a space heater.

Well, besides having recently been running ~ 4-6 Verilog
simulations at a time trying to hunt down these MMU bugs. Most bugs
found thus far were due to edge cases involving the L1 caches and TLB
miss interrupt handling. Also battled with a bug where stall timing was
causing the interrupt dispatch to "miss" the interrupt.

Namely, it seemed that a case exists where:
Branch triggers a TLB Miss in the L1 I$
L1 does the NOP Slide Miss case
(Needed so that the TLB Miss ISR can initiate)
Some other stall signal happens at the same time (unidentified);
...

In this case, the Branch and ISR mechanism would fight over where
control flow would go, and the ISR would drop out (thinking the ISR's
control-flow transfer had been initiated), with control flow instead
going to the target of the branch instruction.
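
The general shape of this kind of race can be illustrated with a small C
model. This is only a guess at the failure mode, not the actual Verilog,
and every name in it is hypothetical. The "buggy" step marks the interrupt
as dispatched even when a stall blocked its redirect; the "fixed" step
clears the pending flag only once the redirect has actually landed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;           /* current fetch address */
    int      isr_pending;  /* TLB-miss interrupt waiting to be dispatched */
} Core;

/* Inputs seen by the control-flow logic in one cycle. */
typedef struct {
    int      stall;          /* some unrelated stall is asserted */
    int      branch_taken;   /* a taken branch is being held by the pipeline */
    uint32_t branch_target;
    uint32_t isr_vector;
} Cycle;

/* Buggy variant: the pending flag is dropped even if the stall
 * prevented the ISR redirect, so the held branch wins afterwards. */
void step_buggy(Core *c, const Cycle *in)
{
    int isr = c->isr_pending;
    c->isr_pending = 0;              /* bug: cleared even when stalled */
    if (in->stall)
        return;
    if (isr)
        c->pc = in->isr_vector;
    else if (in->branch_taken)
        c->pc = in->branch_target;
    else
        c->pc += 4;                  /* assumed fixed step, model only */
}

/* Fixed variant: a stall holds all state, and the pending flag is
 * cleared only in the cycle where the ISR redirect is applied. */
void step_fixed(Core *c, const Cycle *in)
{
    if (in->stall)
        return;
    if (c->isr_pending) {
        c->pc = in->isr_vector;
        c->isr_pending = 0;
    } else if (in->branch_taken) {
        c->pc = in->branch_target;
    } else {
        c->pc += 4;                  /* assumed fixed step, model only */
    }
}
```

Feeding both variants one stalled cycle followed by one unstalled cycle,
with an interrupt pending and a branch held, sends the buggy model to the
branch target and the fixed model to the ISR vector.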

These sorts of issues are a pain to track down (given the amount of time
it takes to launch Doom or Quake or similar in a simulation and wait for
the bug to occur).

Had tried making the TLB size a lot smaller (to hopefully shake out bugs
faster), but this is less effective than one might think (sometimes the
bug will only occur with a specific TLB size or similar).

It doesn't help that in Verilator, the behavior of Verilog code may
sometimes change with the addition or removal of $display statements (or
adding a $display causes it to start freaking out about "circular
dependencies" or similar).

In a few cases, have ended up needing to put the debug $display's in
their own "always" block, as this seems to reduce their effect on the
rest of the simulation.

I don't really get it though, as it seems like a $display should be
more-or-less invisible as far as all the logic is concerned (and
presumably Vivado prunes them away in synthesis).

Otherwise, had started working on adding S3M playback support to my MOD
player code (and dealing with the documentation on these formats tending
to be rather poor), but it is hard to do this in parallel with looking
for bugs in my Verilog code.

Like, the debugging effort is "just enough activity" that one can't
really go and work on some other coding project in parallel.
Or, at least, I am not "mentally flexible" enough to switch back and forth
between debugging Verilog and waiting for one of my simulations to
crash, while also working on another unrelated coding project.

That and still occasionally dealing with compiler bugs resulting from my
effort to implement support for the 128-bit ABI (and starting to suspect
I may need, at some point, to redesign how the backend compiler stages
work).

The current approach kinda sucks:
Generate instruction words as they were in early versions of the ISA
(mix of late BJX1 and early BJX2 encodings);
Repack instructions into the current ISA's encoding;
Do bit-twiddling (after the fact) to deal with the expanded register space;
Emit to output section;
Repack again, to deal with stuff like 40x2 bundle encoding, etc;
Go back over again (after the basic block is emitted) to shuffle
instructions and try to put stuff into bundles.

With most of this logic working on instructions being emitted 16 bits at
a time (some amount of the logic operating as finite state machines
triggered by conditions in the emitted instruction words).

This approach is kinda, well, crap...

Would make more sense to redo the backend in terms of "a whole
instruction at a time", though this is hindered slightly:
The largest instruction formats are 96 bits;
MSVC doesn't have int128 or similar, which is sorta an issue;
Passing around instruction encodings as structs would also suck.
It is either that, or structs operating at a "Meta ASM" level.
Tradeoff being between an ASM-like representation,
Or a semi-encoded machine instruction
(would likely also need to exist).

But, a struct is probably still better than passing a bunch of pointers
to 16-bit words or similar and driving finite state machines over a word
at a time.
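
A minimal sketch of the struct option, assuming an instruction is stored
as up to six 16-bit words (96 bits) so that no 128-bit integer type is
needed. The type and function names are hypothetical and are not taken
from BGBCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INSN_MAX_WORDS 6        /* 6 x 16 bits = 96 bits, the largest format */

/* One whole encoded instruction, small enough to pass around by value. */
typedef struct {
    uint16_t word[INSN_MAX_WORDS];
    uint8_t  nwords;            /* number of valid 16-bit words, 1..6 */
} Insn;

/* Output section being filled by the emitter. */
typedef struct {
    uint16_t *buf;
    size_t    len;              /* words written so far */
    size_t    cap;              /* capacity in words */
} Section;

/* Build an instruction from an array of 16-bit words. */
Insn insn_make(const uint16_t *w, uint8_t n)
{
    Insn op = {{0}, 0};
    assert(n >= 1 && n <= INSN_MAX_WORDS);
    for (uint8_t i = 0; i < n; i++)
        op.word[i] = w[i];
    op.nwords = n;
    return op;
}

/* Emit a whole instruction at once; returns 0 if the section is full. */
int section_emit(Section *s, const Insn *op)
{
    if (s->len + op->nwords > s->cap)
        return 0;
    for (uint8_t i = 0; i < op->nwords; i++)
        s->buf[s->len++] = op->word[i];
    return 1;
}
```

With something along these lines, later passes (repacking, bundling) could
take and return an Insn rather than watching a stream of 16-bit words go
by, at the cost of copying a 14-byte struct around.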

Rewriting the instruction-emitter backend of BGBCC's BJX2 backend to be
"less crap" would be a fair bit of effort though.

This whole mess being partly because the backend originally evolved out
of my SuperH backend, where SH2 and SH4 had fixed-length 16-bit
instructions (so working with everything in terms of 16-bit words was
less of an issue).

....

Re: Falcon Shores

by: Quadibloc - Mon, 28 Feb 2022 13:06 UTC

On Saturday, February 26, 2022 at 7:11:28 PM UTC-7, BGB wrote:

That is true. I know that there were some supercomputers that included
FPGAs paired with the conventional CPUs for better effectiveness on some
types of problems.

I wasn't thinking about that possibility overmuch, though, since the _big_
thing supercomputing needs is lots of floating-point muscle. This is
being achieved at a reduced cost in power consumption by using lots
of GPUs in supercomputers.

Because GPUs tend to lack flexibility, though, they're not able to effectively
accelerate _all_ the floating-point calculations in a scientific computation.
That's why I was rooting for NEC's SX-Aurora TSUBASA because it seemed to
offer power approaching that of a GPU but with the flexibility of a vector
supercomputer, something easier to program which could take on a larger
proportion of the floating-point workload.

That's what made the Intel announcement exciting to me. Not only were these
new chips - Xeons, for sale to supercomputer builders only, or for use in Intel's
own supercomputers, so they would be expensive even if available - going to
have GPUs designed for computation (instead of video support) in them, but
there would be architectural changes making them *easier to program*.

That was what was really exciting to me, as it sounded like this new architecture
would be more effective, more flexible, allowing a larger proportion of a problem to
be offloaded to the GPU.

While initially this would be something expensive and of limited availability, HBM has
been seen in consumer products, and _lots_ of CPUs have at least a small GPU
embedded (to provide graphics capability, of course).

If the pathway is opened, though, to increasing the floating-point power of the microprocessor,
and there are applications for such power that are attractive to the consumer - physics
computations in games, for example - we could see supercomputer power trickling down
to the desktop.

John Savard

Quadibloc <jsavard@ecn.ab.ca> wrote:
> Intel has announced its plans for a multi-chip module to go in a Xeon
> socket which will include integrated graphics on a separate die.
>
> Except that the GPU on the chip is really intended for use for things like
> AI models.
>
> And Raja Koduri even said it would offer "a vastly simplified GPU
> programming model".
>
> This almost sounds like what I've been waiting for!

I watched his talk:
https://www.intel.com/content/www/us/en/events/investor-meeting.html
but it seems that's *all* he said on the programming model. So we're none
the wiser really.

The 'software' talk was pushing oneAPI, whose main benefit seems to be 'it's
not CUDA'. It didn't seem to me to be particularly simplifying, only that
you can run it on things that aren't GPUs (with big caveats).

Has there been any more detail on the 'simplified' GPU programming model?

Theo

Re: Falcon Shores

by: Quadibloc - Tue, 1 Mar 2022 00:53 UTC
