devel / comp.arch / Re: Could we build a better 6502?

Subject  Author
* Could we build a better 6502?  Thomas Koenig
+* Re: Could we build a better 6502?  Quadibloc
|+* Re: Could we build a better 6502?  John Levine
||+* Re: Could we build a better 6502?  MitchAlsup
|||`* Re: Could we build a better 6502?  aph
||| `* Re: Could we build a better 6502?  Anton Ertl
|||  `* Re: Could we build a better 6502?  MitchAlsup
|||   +- Re: Could we build a better 6502?  Thomas Koenig
|||   +- Re: Could we build a better 6502?  Anton Ertl
|||   `* Re: Could we build a better 6502?  Quadibloc
|||    `* Re: Could we build a better 6502?  Thomas Koenig
|||     +* Re: Could we build a better 6502?  Brian G. Lucas
|||     |`* Re: Could we build a better 6502?  Quadibloc
|||     | +- Re: Could we build a better 6502?  Brian G. Lucas
|||     | `- Re: Could we build a better 6502?  Anton Ertl
|||     +* Re: Could we build a better 6502?  Stephen Fuld
|||     |+- Re: Could we build a better 6502?  Terje Mathisen
|||     |`* Re: Could we build a better 6502?  pec...@gmail.com
|||     | +* Re: Could we build a better 6502?  MitchAlsup
|||     | |+* Re: Could we build a better 6502?  pec...@gmail.com
|||     | ||`* Re: Could we build a better 6502?  Stephen Fuld
|||     | || `- Re: Could we build a better 6502?  pec...@gmail.com
|||     | |`* Re: Could we build a better 6502?  Timothy McCaffrey
|||     | | +- Re: Could we build a better 6502?  Michael Barry
|||     | | `* Re: Could we build a better 6502?  Thomas Koenig
|||     | |  `* Re: Could we build a better 6502?  Timothy McCaffrey
|||     | |   +* Re: Could we build a better 6502?  pec...@gmail.com
|||     | |   |`* Re: Could we build a better 6502?  Michael Barry
|||     | |   | `- Re: Could we build a better 6502?  Thomas Koenig
|||     | |   `* Re: Could we build a better 6502?  chris
|||     | |    `* Re: Could we build a better 6502?  pec...@gmail.com
|||     | |     +* Re: Could we build a better 6502?  MitchAlsup
|||     | |     |`- Re: Could we build a better 6502?  Thomas Koenig
|||     | |     `* Re: Could we build a better 6502?  chris
|||     | |      `* Re: Could we build a better 6502?  George Neuner
|||     | |       `* Re: Could we build a better 6502?  chris
|||     | |        +* Re: Could we build a better 6502?  MitchAlsup
|||     | |        |`* Re: Could we build a better 6502?  Thomas Koenig
|||     | |        | +- Re: Could we build a better 6502?  Bernd Linsel
|||     | |        | `* Re: Could we build a better 6502?  David Brown
|||     | |        |  `* Re: Could we build a better 6502?  chris
|||     | |        |   `* Re: Could we build a better 6502?  David Brown
|||     | |        |    `* Re: Could we build a better 6502?  Terje Mathisen
|||     | |        |     `* Re: Could we build a better 6502?  Thomas Koenig
|||     | |        |      `- Re: Could we build a better 6502?  Terje Mathisen
|||     | |        `* Re: Could we build a better 6502?  Al Grant
|||     | |         `- Re: Could we build a better 6502?  chris
|||     | `* Re: Could we build a better 6502?  Thomas Koenig
|||     |  +- Re: Could we build a better 6502?  MitchAlsup
|||     |  +- Re: Could we build a better 6502?  pec...@gmail.com
|||     |  +* Re: Could we build a better 6502?  Thomas Koenig
|||     |  |+* Re: Could we build a better 6502?  Stefan Monnier
|||     |  ||`* Re: Could we build a better 6502?  Ivan Godard
|||     |  || `* Re: Could we build a better 6502?  Stefan Monnier
|||     |  ||  `* Re: Could we build a better 6502?  John Dallman
|||     |  ||   +- Re: Could we build a better 6502?  Stefan Monnier
|||     |  ||   +* Re: Could we build a better 6502?  pec...@gmail.com
|||     |  ||   |`- Re: Could we build a better 6502?  Ivan Godard
|||     |  ||   `- Re: Could we build a better 6502?  Stephen Fuld
|||     |  |`* Re: Could we build a better 6502?  pec...@gmail.com
|||     |  | `* Re: Could we build a better 6502?  Thomas Koenig
|||     |  |  `- Re: Could we build a better 6502?  pec...@gmail.com
|||     |  `* Re: Could we build a better 6502?  Thomas Koenig
|||     |   +* Re: Could we build a better 6502?  Anton Ertl
|||     |   |+* Re: Could we build a better 6502?  Thomas Koenig
|||     |   ||`* Re: Could we build a better 6502?  pec...@gmail.com
|||     |   || `- Re: Could we build a better 6502?  MitchAlsup
|||     |   |`* Re: Could we build a better 6502?  David Schultz
|||     |   | +* Re: Could we build a better 6502?  Anton Ertl
|||     |   | |`- Re: Could we build a better 6502?  David Schultz
|||     |   | `* Re: Could we build a better 6502?  MitchAlsup
|||     |   |  `* Re: Could we build a better 6502?  pec...@gmail.com
|||     |   |   `- Re: Could we build a better 6502?  MitchAlsup
|||     |   `- Re: Could we build a better 6502?  MitchAlsup
|||     `* Re: Could we build a better 6502?  Anton Ertl
|||      `* Re: Could we build a better 6502?  Thomas Koenig
|||       `* Re: Could we build a better 6502?  MitchAlsup
|||        +* Re: Could we build a better 6502?  Marcus
|||        |+* Re: Could we build a better 6502?  MitchAlsup
|||        ||`* Re: Could we build a better 6502?  Thomas Koenig
|||        || `- Re: Could we build a better 6502?  Anton Ertl
|||        |`- Re: Could we build a better 6502?  Thomas Koenig
|||        `- Re: Could we build a better 6502?  Thomas Koenig
||+* Re: Could we build a better 6502?  Quadibloc
|||`- Re: Could we build a better PDP-8, was 6502?  John Levine
||`- Re: Could we build a better 6502?  Tim Rentsch
|`* Re: Could we build a better 6502?  Quadibloc
| +* Re: Could we build a better 6502?  Thomas Koenig
| |`* Re: Could we build a better 6502?  Anton Ertl
| | `* Re: Could we build a better 6502?  David Schultz
| |  `* Re: Could we build a better 6502?  Brett
| |   `* Re: Could we build a better 6502?  David Schultz
| |    `* Re: Could we build a better 6502?  Brett
| |     `* Re: Could we build a better 6502?  David Schultz
| |      `* Re: Could we build a better 6502?  Brett
| |       `- Re: Could we build a better 6502?  David Schultz
| +* Re: Could we build a better 6502?  Stefan Monnier
| |`* Re: Could we build a better 6502?  Thomas Koenig
| | +* Re: Could we build a better 6502?  Stefan Monnier
| | |+* Re: Could we build a better 6502?  MitchAlsup
| | ||`- Re: Could we build a better 6502?  pec...@gmail.com
| | |`* Re: Could we build a better 6502?  pec...@gmail.com
| | +- Re: Could we build a better 6502?  MitchAlsup
| | `* Re: Could we build a better 6502?  pec...@gmail.com
| `- Re: Could we build a better 6502?  MitchAlsup
+* Re: Could we build a better 6502?  Marcus
+* Re: Could we build a better 6502?  MitchAlsup
+* Re: Could we build a better 6502?  EricP
+* Re: Could we build a better 6502?  Guillaume
+- Re: Could we build a better 6502?  EricP
+* Re: Could we build a better 6502?  Timothy McCaffrey
+- Re: Could we build a better 6502?  JimBrakefield
+* Re: Could we build a better 6502?  Anssi Saari
+* Re: Could we build a better 6502?  John Dallman
+* Re: Could we build a better 6502?  Anton Ertl
+* Re: Could we build a better 6502?  Michael Barry
+* Re: Could we build a better 6502?  pec...@gmail.com
+* Re: Could we build a better 6502?  Bernd Linsel
+- Re: Could we build a better 6502?  clamky
+* Re: Could we build a better 6502?  Quadibloc
`- Re: Could we build a better 6502?  Quadibloc

Re: OoO S/360 descendants

<2021Aug19.190106@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=20002&group=comp.arch#20002

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Thu, 19 Aug 2021 17:01:06 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 13
Message-ID: <2021Aug19.190106@mips.complang.tuwien.ac.at>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <87fsv6a2kv.fsf@localhost> <2021Aug19.125511@mips.complang.tuwien.ac.at> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <e7dba3e8-2973-4b6d-aaca-98346ea30f40n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="e8056037f627f5941f54caa74988213e";
logging-data="30938"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18e0Pk/38CWj8vNb94FsY3K"
Cancel-Lock: sha1:nRDQHJS2j7mS3DE5ZPoDiKCzt3E=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 19 Aug 2021 17:01 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Thursday, August 19, 2021 at 9:43:32 AM UTC-5, John Levine wrote:
>> A long time ago, DEC sold the VAX-11/780 as a one MIPS machine even though it really ran
>> about 500 KIPS. It was roughly as fast as a 370/158 which ran at 1 IBM MIPS.
><
>VAX-11/780 ran at about 4 clocks per instruction.

The VAX-11/780 ran at 5MHz, so if it ran at 0.5 VAX MIPS, it had a CPI of 10.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: OoO S/360 descendants

<hiwTI.4158$Nf1.2818@fx26.iad>

https://www.novabbs.com/devel/article-flat.php?id=20003&group=comp.arch#20003

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx26.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <87fsv6a2kv.fsf@localhost> <2021Aug19.125511@mips.complang.tuwien.ac.at> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <e7dba3e8-2973-4b6d-aaca-98346ea30f40n@googlegroups.com>
In-Reply-To: <e7dba3e8-2973-4b6d-aaca-98346ea30f40n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 28
Message-ID: <hiwTI.4158$Nf1.2818@fx26.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 19 Aug 2021 17:12:45 UTC
Date: Thu, 19 Aug 2021 13:12:22 -0400
X-Received-Bytes: 2297
 by: EricP - Thu, 19 Aug 2021 17:12 UTC

MitchAlsup wrote:
> On Thursday, August 19, 2021 at 9:43:32 AM UTC-5, John Levine wrote:
> <
> Not just instructions, but function units designed and dedicated to doing these things.
>> A long time ago, DEC sold the VAX-11/780 as a one MIPS machine even though it really ran
>> about 500 KIPS. It was roughly as fast as a 370/158 which ran at 1 IBM MIPS.
> <
> VAX-11/780 ran at about 4 clocks per instruction.

According to my 780 hardware manual, the CPU had a 200 ns micro-word cycle.
It had an 8 byte instruction prefetch buffer.
Unified instruction + data 8 KB 2-way set assoc cache with an
8 byte line size and a 4 byte data path to the CPU for I or D,
and an effective cache access time of 290 ns at a 95% hit ratio.
Main memory had an effective access time of 1800 ns for 8 bytes.
The TLB was fully assoc, 128 entries for the 512 byte pages.

The main memory bus, called the Synchronous Backplane Interconnect
(SBI), had a 200 ns clock and used 3-4 pipelined phases to send control,
address, and 4 bytes of data, or 8 bytes of data with an extended cycle.

I can see possibly 5 micro-ops for a reg-reg instruction:
read opcode, read opspec-1, read opspec-2, operate, store.
Maybe even a little overlap of the decode and execute phases.
But there is also enough room in the above HW to drop the
effective throughput to 0.5 MIPS for average instructions.
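
As a sketch in Python of how those numbers bracket the MIPS figures
(the 5 micro-cycle count is the best-case estimate above; 10 cycles
per instruction matches Anton's CPI figure upthread):

cycle = 200e-9                        # micro-word cycle time, seconds
for cpi, label in [(5, "5 micro-cycles/instr, all cache hits"),
                   (10, "10 micro-cycles/instr, per Anton's CPI")]:
    mips = 1.0 / (cpi * cycle) / 1e6
    print(f"{label}: {mips:.1f} MIPS")   # -> 1.0 MIPS and 0.5 MIPS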

Re: OoO S/360 descendants

<sfm3mf$uqe$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20005&group=comp.arch#20005

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Thu, 19 Aug 2021 10:17:01 -0700
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <sfm3mf$uqe$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<jwvtujq6d9n.fsf-monnier+comp.arch@gnu.org>
<sfb5qi$m3u$1@newsreader4.netcologne.de>
<84126256-8745-4707-9445-130860761e76n@googlegroups.com>
<2021Aug17.112004@mips.complang.tuwien.ac.at>
<61a8609e-7cda-46cd-9723-4936e719e927n@googlegroups.com>
<2021Aug18.102524@mips.complang.tuwien.ac.at>
<524743a0-0c40-4543-b68e-17b2af752d7en@googlegroups.com>
<2021Aug18.155716@mips.complang.tuwien.ac.at> <87fsv6a2kv.fsf@localhost>
<2021Aug19.125511@mips.complang.tuwien.ac.at>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<k9vTI.36153$Fu2.9688@fx16.iad>
<71d05233-5e4c-4886-ae4d-d3375f782c0cn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 19 Aug 2021 17:17:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c4b354f149773bd247bf578055677a10";
logging-data="31566"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zfi/ZxoylNLrfAxqFIaztKOkuCMSCV2c="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:D9so00vMbspCp1Kl25pOLAJ8C6s=
In-Reply-To: <71d05233-5e4c-4886-ae4d-d3375f782c0cn@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Thu, 19 Aug 2021 17:17 UTC

On 8/19/2021 9:44 AM, Michael S wrote:
> On Thursday, August 19, 2021 at 6:54:59 PM UTC+3, EricP wrote:
>> Michael S wrote:
>>> On Thursday, August 19, 2021 at 2:11:04 PM UTC+3, Anton Ertl wrote:
>>>> Anne & Lynn Wheeler <ly...@garlic.com> writes:
>>>>> z196 seemed to have been the last where there were real live benchmark
>>>>> numbers ... since then things got a lot more obfuscated ...
>>>> Speaks against competitive performance.
>>>
>>> I think, even in the period where, according to Lynn, there "were real live benchmark numbers",
>>> they were always IBM mainframe vs some old IBM mainframe.
>>> I don't believe that in the last 35-40 years IBM ever published CPU benchmarks that can be compared
>>> directly against IBM's competitors or even vs other IBM-made computers.
>> Probably because it doesn't help their sales so why do it.
>> Does anybody port _to_ zSeries?
>
> So, bragging rights do not matter any more?

That assumes that, to IBM, it ever did. While I am sure there are
exceptions, going back to at least the 1970s, IBM sold its systems
(emphasis on "systems", not just CPUs) primarily to IT managers and on
the basis of IBM's support, reliability, quality of peripherals, etc.,
rarely on CPU speed. And, as others have pointed out, most "business
applications" were I/O bound anyway, so CPU speed didn't matter much.

As time wore on, by say the 1990s, when microprocessors became
performance competitive, the appeal didn't change. It just started
becoming less compelling to customers, and their sales slowly started to
plateau.

By now, I expect the number of new name customers is close to zero, and
their goal is to keep their existing base (software lock-in), and add
capacity and functionality for those customers, in order to keep them
from converting away.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants (was: Could we build a better 6502?)

<sfm4ru$2g9v$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=20006&group=comp.arch#20006

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants (was: Could we build a better 6502?)
Date: Thu, 19 Aug 2021 17:37:02 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sfm4ru$2g9v$1@gal.iecc.com>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <sfjpgd$25l8$1@gal.iecc.com> <sflvss$vq5$1@dont-email.me> <5c500666-e0da-4c5f-9eca-aecf80478b4fn@googlegroups.com>
Injection-Date: Thu, 19 Aug 2021 17:37:02 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="82239"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <sde7eg$hgb$1@newsreader4.netcologne.de> <sfjpgd$25l8$1@gal.iecc.com> <sflvss$vq5$1@dont-email.me> <5c500666-e0da-4c5f-9eca-aecf80478b4fn@googlegroups.com>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Thu, 19 Aug 2021 17:37 UTC

According to Michael S <already5chosen@yahoo.com>:
>> > to everything, not just scientific work. They never tried to build another mainframe
>> > with high performance floating point, but they did define a bunch of vector instruction
>> > sets including one for ESA/390 and a different one for zSeries, so someone must be
>> > using them.
>> Don't forget the vector facility for the 3090. Perhaps the new
>> instructions are there just for compatibility.
>
>I am pretty sure that "vector" instructions introduced in z13 are totally unrelated to vector facilities of either 3090 or ES/9000.
>A new facility is SIMD and is rather similar to "vector and scalar unit" of POWER8.

The zSeries manual says that its vector facility is not compatible with ESA/390 and presumably
any of its predecessors. So someone wants it enough to design a new vector instruction set
and to support it in whatever application is using it.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: OoO S/360 descendants

<sfm656$2lrr$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=20007&group=comp.arch#20007

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!usenet.csail.mit.edu!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Thu, 19 Aug 2021 17:59:02 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sfm656$2lrr$1@gal.iecc.com>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
Injection-Date: Thu, 19 Aug 2021 17:59:02 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="87931"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Thu, 19 Aug 2021 17:59 UTC

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>>IBM has lots of very complex instructions to speed up specific tasks.
>
>As the VAX vs. RISC discussion taught us, just because it has very
>complex instructions does not mean that these speed up the tasks. And
>IIRC IBM's 801 project discovered the same even earlier wrt complex 360
>instructions.

Yes, but IBM is not stupid. It is hard to believe they would have added new
instructions if they didn't address hot spots.

>They really have an instruction for heapsort, the sort with the worst
>cache locality among the n*ln(n) sorts?

It doesn't do the whole sort; it does the heapify part, because ...

The way you sort a file too big to fit into memory is to read it a chunk
at a time into memory, sort each chunk, write the sorted chunk out to
one of several intermediate files, run a merge pass to combine chunks
from the intermediate files into longer chunks in another set of
intermediate files, and repeat the merge passes back and forth between
the two sets of intermediate files until it's merged into one long
sorted chunk.

A well known trick to make this faster is every time you are about to
write a record from the sorted memory buffer, read an input record and
try to insert it into the sorted buffer, which you can do unless the
new record sorts before the one you are writing. This can make the
chunks bigger and reduce the number of merge passes. In the common
case that the file is already mostly sorted, it is particularly
effective. The "try to insert it into the sorted buffer" part is
heapify, not surprisingly what their sort accelerator does.
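
A minimal sketch of this trick in Python (an illustration of the
description above, not IBM's code; heapq plays the role of the
heapify step):

import heapq
from itertools import islice

def replacement_selection(records, buffer_size):
    # Build one sorted run. A newly read record may join the current
    # run only if it sorts at or after the record just written;
    # anything smaller is deferred to the next run.
    it = iter(records)
    heap = list(islice(it, buffer_size))   # fill the memory buffer
    heapq.heapify(heap)
    run, next_run = [], []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)               # "write" one record out
        new = next(it, None)               # read one input record
        if new is None:
            continue
        if new >= smallest:
            heapq.heappush(heap, new)      # still fits this run
        else:
            next_run.append(new)           # too late: next run
    return run, next_run

On random input this yields runs of about twice the buffer size; on
mostly sorted input a run can grow to cover nearly the whole file,
which is why the trick is particularly effective there.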

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: OoO S/360 descendants

<sfm68n$2lrr$2@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=20008&group=comp.arch#20008

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Thu, 19 Aug 2021 18:00:55 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sfm68n$2lrr$2@gal.iecc.com>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug19.125511@mips.complang.tuwien.ac.at> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <k9vTI.36153$Fu2.9688@fx16.iad>
Injection-Date: Thu, 19 Aug 2021 18:00:55 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="87931"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug19.125511@mips.complang.tuwien.ac.at> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <k9vTI.36153$Fu2.9688@fx16.iad>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Thu, 19 Aug 2021 18:00 UTC

According to EricP <ThatWouldBeTelling@thevillage.com>:
>Michael S wrote:
>> On Thursday, August 19, 2021 at 2:11:04 PM UTC+3, Anton Ertl wrote:
>>> Anne & Lynn Wheeler <ly...@garlic.com> writes:
>>>> z196 seemed to have been the last where there were real live benchmark
>>>> numbers ... since then things got a lot more obfuscated ...
>>> Speaks against competitive performance.
>>
>> I think, even in the period where, according to Lynn, there "were real live benchmark numbers",
>> they were always IBM mainframe vs some old IBM mainframe.
>> I don't believe that in the last 35-40 years IBM ever published CPU benchmarks that can be compared
>> directly against IBM's competitors or even vs other IBM-made computers.
>
>Probably because it doesn't help their sales so why do it.
>Does anybody port _to_ zSeries?

Considering the extensive Linux support, I assume so.

But nobody does it for CPU speed; it's for reliability and large system scalability.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: OoO S/360 descendants

<memo.20210819214956.15456M@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=20010&group=comp.arch#20010

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Thu, 19 Aug 2021 21:49 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <memo.20210819214956.15456M@jgd.cix.co.uk>
References: <k9vTI.36153$Fu2.9688@fx16.iad>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="6b4b0cf2565cbb3ae37b75130d4e7f43";
logging-data="26955"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+7cy5iW/nEt2qTAysELTw/I2MSBzxMxKM="
Cancel-Lock: sha1:AuMCeSNlUmusLjLb27Kx2PF1MaA=
 by: John Dallman - Thu, 19 Aug 2021 20:49 UTC

In article <k9vTI.36153$Fu2.9688@fx16.iad>,
ThatWouldBeTelling@thevillage.com (EricP) wrote:
> Michael S wrote:
> > I don't believe that in the last 35-40 years IBM ever published CPU
> > benchmarks that can be compared directly against IBM's competitors
> > or even vs other IBM-made computers.
> Probably because it doesn't help their sales so why do it.
> Does anybody port _to_ zSeries?

Not often, but it happens. Linux runs on zSeries, so the software that
comes bundled with RHEL and SLES has all been ported. I took a look in
the spring out of curiosity, and it isn't all that strange, although a
lot of the nomenclature is weird. I stopped when I discovered the entry
price for a z15.

John

Re: OoO S/360 descendants

<sfnfor$1b5$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20016&group=comp.arch#20016

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Fri, 20 Aug 2021 05:49:15 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfnfor$1b5$1@newsreader4.netcologne.de>
References: <k9vTI.36153$Fu2.9688@fx16.iad>
<memo.20210819214956.15456M@jgd.cix.co.uk>
Injection-Date: Fri, 20 Aug 2021 05:49:15 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="1381"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 20 Aug 2021 05:49 UTC

John Dallman <jgd@cix.co.uk> schrieb:
> In article <k9vTI.36153$Fu2.9688@fx16.iad>,
> ThatWouldBeTelling@thevillage.com (EricP) wrote:
>> Michael S wrote:
>> > I don't believe that in the last 35-40 years IBM ever published CPU
>> > benchmarks that can be compared directly against IBM's competitors
>> > or even vs other IBM-made computers.
>> Probably because it doesn't help their sales so why do it.
>> Does anybody port _to_ zSeries?
>
> Not often, but it happens. Linux runs on zSeries, so the software that
> comes bundled with RHEL and SLES has all been ported. I took a look in
> the spring out of curiosity, and it isn't all that strange, although a
> lot of the nomenclature is weird. I stopped when I discovered the entry
> price for a z15.

What is that price?

(I once read that there isn't such a thing as an entry-level
mainframe, but I don't recall any figures).

Re: Could we build a better 6502?

<sfnfud$1b5$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20017&group=comp.arch#20017

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Could we build a better 6502?
Date: Fri, 20 Aug 2021 05:52:13 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sfnfud$1b5$2@newsreader4.netcologne.de>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<4faf234c-eabc-4ed4-b849-5e33df12fdb9n@googlegroups.com>
<jwvtujq6d9n.fsf-monnier+comp.arch@gnu.org>
<sfb5qi$m3u$1@newsreader4.netcologne.de>
<84126256-8745-4707-9445-130860761e76n@googlegroups.com>
<2021Aug17.112004@mips.complang.tuwien.ac.at>
<sfi8ip$feu$1@newsreader4.netcologne.de> <G_9TI.10907$Lv3.8090@fx08.iad>
<2021Aug19.162124@mips.complang.tuwien.ac.at>
<a813eeb9-fda9-414d-b488-8dc661c7b168n@googlegroups.com>
<2021Aug19.184125@mips.complang.tuwien.ac.at>
Injection-Date: Fri, 20 Aug 2021 05:52:13 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="1381"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 20 Aug 2021 05:52 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

> Also, if you want to make maximal use of memory bandwidth, you may
> need to prefetch the instruction (to avoid a memory cycle going unused
> for decoding), which may require an 8-bit register for buffering.

Interestingly enough, the 6502 used a cycle to fetch the next byte
even for single-byte instructions.

Re: OoO S/360 descendants

<memo.20210820103622.15456O@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=20018&group=comp.arch#20018

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Fri, 20 Aug 2021 10:36 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <memo.20210820103622.15456O@jgd.cix.co.uk>
References: <sfnfor$1b5$1@newsreader4.netcologne.de>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="e179bb4acca9c5d2569666779f9d4ea1";
logging-data="2124"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lHO2W6uk4Dx8fP7NFibHmMR0Iy91ucag="
Cancel-Lock: sha1:HrzBfF7pvUlZ8QIZ/TxbOAT1nvk=
 by: John Dallman - Fri, 20 Aug 2021 09:36 UTC

In article <sfnfor$1b5$1@newsreader4.netcologne.de>,
tkoenig@netcologne.de (Thomas Koenig) wrote:

> > I stopped when I discovered the entry price for a z15.
>
> What is that price?
>
> (I once read that there isn't such a thing as an entry-level
> mainframe, but I don't recall any figures).

Modern zSeries machines are extremely modular. The "entry" price is a
minimal configuration, with one of each of the basic kind of units, in a
single frame. This is far more expensive per CPU than a fully-configured
mainframe. For a z15, it seemed to be approximately US$ 220,000.

The alternative is various kinds of software emulator, which are either
free, but only legally usable for Linux, or pretty expensive.
<https://en.wikipedia.org/wiki/PC-based_IBM_mainframe-compatible_systems#z/Architecture_and_today>

John

Re: OoO S/360 descendants

<87eeanrcwy.fsf@localhost>

https://www.novabbs.com/devel/article-flat.php?id=20022&group=comp.arch#20022

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: lyn...@garlic.com (Anne & Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Fri, 20 Aug 2021 12:04:13 -1000
Organization: Wheeler&Wheeler
Lines: 50
Message-ID: <87eeanrcwy.fsf@localhost>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<2021Aug18.155716@mips.complang.tuwien.ac.at>
<3a2c9983-e999-465f-8c41-0492598e1370n@googlegroups.com>
<2021Aug18.184436@mips.complang.tuwien.ac.at>
<sfjpgd$25l8$1@gal.iecc.com> <sflvss$vq5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="34f6a664640adc335fea6f83b0b3b2a0";
logging-data="9013"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FZVe2du6VBE4aLivgyboOiudgHZrhqUE="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:xzCG55wtIGxVoPK0SYq1o28grRA=
sha1:LQwErC2EU+oDNfJVC853r2a7bAc=
 by: Anne & Lynn Whee - Fri, 20 Aug 2021 22:04 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
> Don't forget the vector facility for the 3090. Perhaps the new
> instructions are there just for compatibility.

reference to shutdown of ACS-360
https://people.cs.clemson.edu/~mark/acs_end.html
claim was executives felt that it would advance state-of-art too fast
and shut it down ... Amdahl leaves IBM shortly after. Also mentions some
of the features don't show up until ES9000 in the 90s. Also references
the invention of hyperthreading.

trivia: 370/195 had out-of-order execution and a 64-instruction
pipeline, but didn't have branch prediction, and conditional branches
drained the pipeline. 370/195 could hit 10MIPS but most codes ran at
5MIPS (because of conditional branches). 370/195 sucked me into helping
them implement hyperthreading ... emulating two-processor SMP (which
never got announced or shipped, in part because of the shift of 370 to
virtual memory, which would have required an effectively new machine
for the 195).

more trivia: Future System in the 1st half of the 70s was completely
different from 370 and was going to completely replace 370 ... and
internal politics was killing off 370 projects. The lack of new 370s
during FS is credited with giving the 370 clone makers (Amdahl) their
market foothold.

When FS imploded, there was a mad rush to get stuff back into 370 product
pipelines ... 3033 (remap 168-3 logic to 20% faster chips) and 3081 were
quick&dirty projects kicked off in parallel. some more details
http://www.jfsowa.com/computer/memo125.htm

I had gotten sucked into helping with a 16-processor 370 SMP and lots of
people thought it was really great ... we also got the 3033 processor
engineers to work on it in their spare time (a lot more interesting than
remapping 168-3 logic). Then somebody told the head of their lab that it
might take the POK favorite-son operating system (MVS) decades before
they had effective 16-way support (z900 w/16way released more than 20
years later). Then some of us were invited to never visit POK again (and
the 3033 processor engineers were told to focus totally on 3033).

With 3033 out the door, the 3033 processor engineers start on 3090
(trout, trout1.5, etc). The 3090 processor engineers complained when
vector processing was announced ... they claimed that in the past, the
big issue for vector was that scalar was so slow that memory could
easily keep a dozen units fed, but they had got scalar up to running at
memory speed ... trying to run multiple FP units concurrently would be
memory limited.

--
virtualization experience starting Jan1968, online at home since Mar1970

Re: OoO S/360 descendants

<2021Aug21.140442@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=20029&group=comp.arch#20029

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Sat, 21 Aug 2021 12:04:42 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 73
Message-ID: <2021Aug21.140442@mips.complang.tuwien.ac.at>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com>
Injection-Info: reader02.eternal-september.org; posting-host="1a058b2af70ed0f49c1fa9b430f27a44";
logging-data="19138"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19X3e/UsjQ+6YyoBllvpQvh"
Cancel-Lock: sha1:VyBnbltAUDRnV6Rxwc14PPyM04U=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sat, 21 Aug 2021 12:04 UTC

John Levine <johnl@taugh.com> writes:
>According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>>>IBM has lots of very complex instructions to speed up specific tasks.
>>
>>As the VAX vs. RISC discussion taught us, just because it has very
>>complex instructions does not mean that these speed up the tasks. And
>>IIRC IBM's 801 project discovered the same even earlier wrt complex 360
>>instructions.
>
>Yes, but IBM is not stupid. It is hard to believe they would have added new
>instructions if they didn't address hot spots.

I find it easy to believe that they add instructions for customer
needs (as perceived with the help from IBM's marketing and sales
force) in order to have a unique selling proposition, whether the
instructions address real or imagined hot spots, and whether the
instructions actually provide a speedup or not.

The lack of benchmark results makes me suspicious.

>>They really have an instruction for heapsort, the sort with the worst
>>cache locality among the n*ln(n) sorts?
....
>A well known trick to make this faster is every time you are about to
>write a record from the sorted memory buffer, read an input record and
>try to insert it into the sorted buffer, which you can do unless the
>new record sorts before the one you are writing. This can make the
>chunks bigger and reduce the number of merge passes. In the common
>case that the file is already mostly sorted, it is particularly
>effective. The "try to insert it into the sorted buffer" part is
>heapify, not surprisingly what their sort accelerator does.

Given that in this scenario the buffer consumes most of the available
RAM, heapsort is going to incur ld n main memory latencies for every
insertion, and every extraction. I don't expect that special
instructions will provide any benefit for that.

I would also not use heapsort for this trick; instead, just have the
buffer sorted when starting to write out, collect the late arrivals at
the front (in the places freed by the writeouts), and remember the
smallest of them (maybe keep it at the front). When the smallest of
the late arrivals becomes the smallest overall, sort the late
arrivals, and then select between the main buffer and the late
arrivals on output. You can have another generation of late arrivals,
but that complicates memory management, and merging on output, and I
doubt it gives much benefit.
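
A Python sketch of one possible reading of this scheme (my
reconstruction of the paragraph above, simplified to a single
generation of late arrivals, with refilling stopped once the final
merge begins):

import heapq

def write_run(buffer, inputs, out):
    # The run buffer is sorted once, before writing starts; late
    # arrivals are collected unsorted and sorted only when needed.
    buffer.sort()
    late = []
    for i, rec in enumerate(buffer):
        # Once the smallest late arrival would sort next, sort the
        # late arrivals and merge them with the rest of the buffer
        # (min() here stands in for "remember the smallest").
        if late and min(late) <= rec:
            out.extend(heapq.merge(buffer[i:], sorted(late)))
            return
        out.append(rec)
        # A record read now may join this run only if it sorts at or
        # after the record just written; next-run handling not shown.
        if inputs and inputs[0] >= rec:
            late.append(inputs.pop(0))
    out.extend(sorted(late))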

Finally, the whole scenario seems to be designed for small main memory
and huge mass storage (maybe mainframes in the 1960s). Currently I
can buy 128GB DRAM for <EUR 500 (a little more for ECC DRAM). Even
with 1000TB of mass storage (costs at least EUR 18000 for the HDDs
alone, plus infrastructure, plus redundancy), the mass storage is only
8000 times as big as the RAM, so I don't need this trick: I just do
one pass creating 4000 sorted chunks (we can only use half of the mass
storage, because the other half is consumed by the last generation),
and then another pass for merging the chunks (only 4000 records need
to be in memory in this pass). Maybe heapsort could be used in this
merge pass; if the sort-relevant parts of the records fit in the
cache, the bad cache behaviour of heapsort does not play a role. I am
still not convinced that dedicated instructions provide a benefit.
Most likely they are microcoded and do just the same thing in
microcode that library routines do in machine code; and even if they
are implemented in hardware, they still have to do all the cache
accesses, so I don't expect them to be faster.
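
The arithmetic behind the two passes, as a sketch with the sizes
assumed above:

ram     = 128 * 2**30           # 128 GB of sort memory
storage = 1000 * 2**40          # 1000 TB of mass storage
usable  = storage // 2          # half is consumed by the output
print(usable // ram)            # 4000 sorted chunks after pass 1;
                                # pass 2 merges them all at once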

You may say that RAM for Z is more expensive, true, but so is mass
storage and the whole system. Will you expend >$220k for a Z, and
then give it <128GB of RAM?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: OoO S/360 descendants

<jwvpmu3lin8.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=20078&group=comp.arch#20078

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Mon, 23 Aug 2021 21:54:13 -0400
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <jwvpmu3lin8.fsf-monnier+comp.arch@gnu.org>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com>
<2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com>
<2021Aug21.140442@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="35037fb91f96f4a62e08ee9590cd7d44";
logging-data="25379"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hN+5nLUMGMr52r/H+CmQa"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:AUm+0bG3QLWFnLWUVIOOKgzBLcA=
sha1:umFNhaBPo0CYjLnNnoG9KVi2FbM=
 by: Stefan Monnier - Tue, 24 Aug 2021 01:54 UTC

> Given that in this scenario the buffer consumes most of the available
> RAM, heapsort is going to incur ld n main memory latencies for every
> insertion, and every extraction. I don't expect that special
> instructions will provide any benefit for that.

As described by John, the heapsort is used to create the "set of initial
sorted lists" which are then merged in a few passes.

The number of passes performed by the merge sort is O(ln N) where N is
the number of elements in this initial set. So if(?) heapsort
implemented in hardware is able to sort up to M elements in time
O(M), you get an N that is M times smaller than your total set of
elements. So you save O(ln M) passes. Whether it makes a significant
difference will depend on M, on the total number of elements, and on
whether the initial list is already mostly sorted or not.
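
A sketch of that pass-count arithmetic in Python (illustrative numbers;
the O(M) hardware sort is Stefan's hypothesis, not a measured fact):

import math

def merge_passes(total, run_len, fan_in=2):
    # passes needed to merge N = total/run_len initial sorted runs
    runs = math.ceil(total / run_len)
    return 0 if runs <= 1 else math.ceil(math.log(runs, fan_in))

print(merge_passes(10**12, 1))      # 40 passes from 1-element runs
print(merge_passes(10**12, 10**6))  # 20 passes: log2(M) = 20 saved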

Stefan

Re: OoO S/360 descendants

<sh5nuv$brr$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20298&group=comp.arch#20298

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Mon, 6 Sep 2021 11:51:09 -0700
Organization: A noiseless patient Spider
Lines: 201
Message-ID: <sh5nuv$brr$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 6 Sep 2021 18:51:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="bfe96b8f39a319452b0d0e6f4160df1d";
logging-data="12155"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19SvmEcC9XZckLj7ZBcVbNJfRuGGBMilpo="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:zsHEGhWcu7Y6fvm8MdA3nWC9fX8=
In-Reply-To: <2021Aug21.140442@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Stephen Fuld - Mon, 6 Sep 2021 18:51 UTC

On 8/21/2021 5:04 AM, Anton Ertl wrote:
> John Levine <johnl@taugh.com> writes:
>> According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>>>> IBM has lots of very complex instructions to speed up specific tasks.
>>>
>>> As the VAX vs. RISC discussion taught us, just because it has very
>>> complex instructions does not mean that these speed up the tasks. And
>>> IIRC IBM's 801 project discovered the same even earlier wrt complex 360
>>> instructions.
>>
>> Yes, but IBM is not stupid. It is hard to believe they would have added new
>> instructions if they didn't address hot spots.
>
> I find it easy to believe that they add instructions for customer
> needs (as perceived with the help from IBM's marketing and sales
> force) in order to have a unique selling proposition, whether the
> instructions address real or imagined hot spots, and whether the
> instructions actually provide a speedup or not.
>
> The lack of benchmark results makes me suspicious.
>
>>> They really have an instruction for heapsort, the sort with the worst
>>> cache locality among the n*ln(n) sorts?
> ...
>> A well known trick to make this faster is every time you are about to
>> write a record from the sorted memory buffer, read an input record and
>> try to insert it into the sorted buffer, which you can do unless the
>> new record sorts before the one you are writing. This can make the
>> chunks bigger and reduce the number of merge passes. In the common
>> case that the file is already mostly sorted, it is particularly
>> effective. The "try to insert it into the sorted buffer" part is
>> heapify, not surprisingly what their sort accelerator does.
>
> Given that in this scenario the buffer consumes most of the available
> RAM, heapsort is going to incur ld n main memory latencies for every
> insertion, and every extraction. I don't expect that special
> instructions will provide any benefit for that.
>
> I would also not use heapsort for this trick; instead, just have the
> buffer sorted when starting to write out, collect the late arrivals at
> the front (in the places freed by the writeouts), and remember the
> smallest of them (maybe keep it at the front). When the smallest of
> the late arrivals becomes the smallest overall, sort the late
> arrivals, and then select between the main buffer and the late
> arrivals on output. You can have another generation of late arrivals,
> but that complicates memory management, and merging on output, and I
> doubt it gives much benefit.
>
> Finally, the whole scenario seems to be designed for small main memory
> and huge mass storage (maybe mainframes in the 1960s). Currently I
> can buy 128GB DRAM for <EUR 500 (a little more for ECC DRAM). Even
> with 1000TB of mass storage (costs at least EUR 18000 for the HDDs
> alone, plus infrastructure, plus redundancy), the mass storage is only
> 8000 times as big as the RAM, so I don't need this trick: I just do
> one pass creating 4000 sorted chunks (we can only use half of the mass
> storage, because the other half is consumed by the last generation),
> and then another pass for merging the chunks (only 4000 records need
> to be in memory in this pass). Maybe heapsort could be used in this
> merge pass; if the sort-relevant parts of the records fit in the
> cache, the bad cache behaviour of heapsort does not play a role. I am
> still not convinced that dedicated instructions provide a benefit.
> Most likely they are microcoded and do just the same thing in
> microcode that library routines do in machine code; and even if they
> are implemented in hardware, they still have to do all the cache
> accesses, so I don't expect them to be faster.

I knew that IBM had implemented instructions to speed up sort, but I
didn't know the details. This thread led me to spend some time
researching them. There is actually a lot of documentation in the IBM
Principles of Operation manuals for the relevant CPUs (available free
on-line for at least the slightly older CPUs; search "Principles of
Operation manual". The newer ones include the 64-bit versions as well
as the 24/32-bit ones). In particular, Appendix A contains a
several-page discussion of them, with a usage example, instruction trace output,
etc. Starting with the manuals led me to other places, including
digging out my old copy of Knuth Vol 3!

There are two separable questions from Anton's post. Do the IBM
instructions speed up sorts, especially in a time of very large memory
and disk systems, and is the usage of replacement (reading more unsorted
records into main memory as older sorted records are written out, and
attempting to add them to the run) worthwhile. I will attempt to answer
each, but first I need to clear up some confusion of what the
instructions do, and where they are used.

For the following, I will use Anton's example of sorting a 500 TB disk
file onto another 500 TB of disk on a system with 128 GB of available
main memory. I assume there is some additional RAM and disk storage to
hold the OS, program, other misc. requirements, etc. I also add the
assumptions that the disk farm is made up of 10 TB hard drives and I
ignore any RAID, fancy storage controllers, etc. Furthermore, just to
make things concrete and make the math simple, I assume the file is made
up of 500 byte records with 16 byte keys.

It is clear that the instructions (there are two of them) are designed
to help in the merge part of an external sort. Taking Anton's example,
we sort the file into 128 GB runs and write them to disk; then, after the
sort, we have a 4,000-way merge. We will read 1/4000 of each run into
memory and merge them, replenishing each buffer from its run on disk.

The naive way of doing the merge is comparing the first unmerged record
in each run to the first unmerged record in the other 3,999 runs.
This is clearly not optimal. The accepted algorithm is to create a
binary tree (called a tournament tree) where each leaf
points to the lowest unmerged record in one run. So, for 4,000
runs, the tree has a depth of 12, and we only have to do 12 comparisons
instead of 4,000. This is a big issue, as we are going to do this one
trillion times! Reducing the CPU cost of managing the tree is worthwhile.
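
In Python, a binary heap gives the same log2(k) comparison bound as the
tournament tree; a sketch of the merge loop (an illustration, not IBM's
Update Tree):

import heapq

def kway_merge(runs, emit):
    # Merge k sorted runs with ~log2(k) comparisons per output record.
    heap = []
    for r, it in enumerate(map(iter, runs)):
        first = next(it, None)
        if first is not None:
            heap.append((first, r, it))    # r breaks ties, so the
    heapq.heapify(heap)                    # iterators never compare
    while heap:
        rec, r, it = heap[0]               # current tournament winner
        emit(rec)
        nxt = next(it, None)               # replenish from run r
        if nxt is None:
            heapq.heappop(heap)            # run r is exhausted
        else:
            heapq.heapreplace(heap, (nxt, r, it))

For example, kway_merge([[1, 4], [2, 5], [3, 6]], print) emits 1
through 6 in order; with 4,000 runs, each heapreplace costs about 12
comparisons, matching the depth-12 tree above.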

There is another issue. Keys may be arbitrarily long. I assumed 16
bytes, which is pretty modest. Remember, there may be a compound key
consisting of multiple fields within each record. But doing comparisons
of arbitrary-length fields is fairly costly, and the keys are duplicated
in each tree element, so the tree elements could get pretty large.
IBM's solution to this problem is to create what they call "code words"
for each key. In 64 bit mode, each code word consists of eight bytes.
The low-order 6 bytes contain the three half words containing the
first bit where the full key differs from the key of that node's
tournament winner, and two bytes to tell the starting half word of the
code word within the full key. At the cost of the code to create the
code words, this usually speeds up the comparison (always a 64 bit
compare), and generally reduces the size of the tree, which improves
data cache usage.
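
A rough Python sketch of that construction (my reconstruction from the
description above, not the architected format; it glosses over how
equal keys and differing offsets are made to compare correctly):

def code_word(key: bytes, winner: bytes) -> int:
    # Pack a 2-byte halfword offset plus the next three halfwords
    # (6 bytes) of the key into one 8-byte comparand.
    n = max(len(key), len(winner))
    k, w = key.ljust(n, b"\x00"), winner.ljust(n, b"\x00")
    off = next((i // 2 for i in range(0, n, 2) if k[i:i+2] != w[i:i+2]),
               n // 2)                     # first differing halfword
    frag = k[2 * off:2 * off + 6].ljust(6, b"\x00")
    return int.from_bytes(off.to_bytes(2, "big") + frag, "big")

Comparing two such code words is then a single 64-bit integer compare,
as described above.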

The two instructions IBM created are for the searching and update of the
tree (Update Tree), and the insertion of new records into the tree after
each winner is taken out (Compare and Form Code Word). Their operation
and usage are pretty well described in Appendix A of the Principles of
Operation manual. More details are in the instruction descriptions in
chapter 7.

I saw nothing to indicate that these instructions are used in the sort
phase at all. They could conceivably be used to merge the replacement
records into the sorted run, but I saw no evidence that they were used
there.

So, to answer the first question: it is clear, as Anton indicated, that
their functionality could be done with a sequence of other instructions,
perhaps in a library routine. Of course, that is true of most complex
instructions, including, for example, floating point arithmetic. But
looking at Appendix A, it is clear that such code would be pretty
complex, with lots of instructions. Just saving those many instructions'
fetch and decode might be significant.

I saw no benchmark data to indicate any speed advantage of the new
instructions; however, there is some "indirect evidence". There was a
substantial market for sort programs for the S/360 architecture, with at
least two big competitors (Syncsort and CAsort). If IBM tried to
"foist" some feature on its customers that was supposed to improve
performance, but actually didn't, I am pretty sure their competitors
would call them out on it. Also, when 64 bit mode came out, IBM spent
the resources necessary to enhance the instructions for 64 bit mode,
rather than deprecating them. That they did this also provides some
evidence for their continuing value in a large memory system.
Probably better evidence is the fact that the improvement would be
more pronounced with larger memory. Specifically, no one would, or
probably not even could, do a 40-way merge on a small memory system. But
the advantage of the tree (log2 N versus N) grows with the increasing
merge ways allowed by large memory. I realize that these bits of
evidence are circumstantial and sort of hand-waving, and I do wish I
could find actual benchmarks, but it is what it is.


Re: merge accelerators, was OoO S/360 descendants

<sh5ohe$2fui$2@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=20299&group=comp.arch#20299

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: merge accelerators, was OoO S/360 descendants
Date: Mon, 6 Sep 2021 19:01:02 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sh5ohe$2fui$2@gal.iecc.com>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me>
Injection-Date: Mon, 6 Sep 2021 19:01:02 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="81874"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <sde7eg$hgb$1@newsreader4.netcologne.de> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Mon, 6 Sep 2021 19:01 UTC

Thanks, this was great. As far as I can see your analysis is correct;
these instructions are particularly useful for large merges.

According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>On 8/21/2021 5:04 AM, Anton Ertl wrote:
>> ...
>>> A well known trick to make this faster is every time you are about to
>>> write a record from the sorted memory buffer, read an input record and
>>> try to insert it into the sorted buffer, which you can do unless the
>>> new record sorts before the one you are writing. This can make the
>>> chunks bigger and reduce the number of merge passes. In the common
>>> case that the file is already mostly sorted, it is particularly
>>> effective. The "try to insert it into the sorted buffer" part is
>>> heapify, not surprisingly what their sort accelerator does.
>>
>> Given that in this scenario the buffer consumes most of the available
>> RAM, heapsort is going to incur ld n main memory latencies for every
>> insertion, and every extraction. I don't expect that special
>> instructions will provide any benefit for that.
>>
>> I would also not use heapsort for this trick; instead, just have the
>> buffer sorted when starting to write out, collect the late arrivals at
>> the front (in the places freed by the writeouts), and remember the
>> smallest of them (maybe keep it at the front). When the smallest of
>> the late arrivals becomes the smallest overall, sort the late
>> arrivals, and then select between the main buffer and the late
>> arrivals on output. You can have another generation of late arrivals,
>> but that complicates memory management, and merging on output, and I
>> doubt it gives much benefit.
>>
>> Finally, the whole scenario seems to be designed for small main memory
>> and huge mass storage (maybe mainframes in the 1960s). Currently I
>> can buy 128GB DRAM for <EUR 500 (a little more for ECC DRAM). Even
>> with 1000TB of mass storage (costs at least EUR 18000 for the HDDs
>> alone, plus infrastructure, plus redundancy), the mass storage is only
>> 8000 times as big as the RAM, so I don't need this trick: I just do
>> one pass creating 4000 sorted chunks (we can only use half of the mass
>> storage, because the other half is consumed by the last generation),
>> and then another pass for merging the chunks (only 4000 records need
>> to be in memory in this pass). Maybe heapsort could be used in this
>> merge pass; if the sort-relevant parts of the records fit in the
>> cache, the bad cache behaviour of heapsort does not play a role. I am
>> still not convinced that dedicated instructions provides a benefit.
>> Most likely they are microcoded and do just the same thing in
>> microcode that library routines do in machine code; and even if they
>> are implemented in hardware, they still have to do all the cache
>> accesses, so I don't expect them to be faster.
>
>I knew that IBM had implemented instructions to speed up sort, but I
>didn't know the details. This thread led me to spend some time
>researching them. There is actually a lot of documentation in the IBM
>Principles of Operation manuals for the relevant CPUs (available free
>on-line for at least the slightly older CPUs. - search "Principles of
>operation manual". The newer ones include the 64 bit versions as well
>as the 24/32 bit ones). In particular, Appendix A contains a several
>page discussion of them, with a usage example, instruction trace output,
>etc. Starting with the manuals led me to other places, including
>digging out my old copy of Knuth Vol 3!
>
>There are two separable questions from Anton's post. Do the IBM
>instructions speed up sorts, especially in a time of very large memory
>and disk systems, and is the usage of replacement (reading more unsorted
>records into main memory as older sorted records are written out, and
>attempting to add them to the run) worthwhile. I will attempt to answer
>each, but first I need to clear up some confusion of what the
>instructions do, and where they are used.
>
>For the following, I will use Anton's example of sorting a 500 TB disk
>file onto another 500 TB of disk on a system with 128 GB of available
>main memory. I assume there is some additional RAM and disk storage to
>hold the OS, program, other misc. requirements, etc. I also add the
>assumptions that the disk farm is made up of 10 TB hard drives and I
>ignore any RAID, fancy storage controllers, etc. Furthermore, just to
>make things concrete and make the math simple, I assume the file is made
>up of 500 byte records with 16 byte keys.
>
>It is clear that the instructions (there are two of them) are designed
>to help in the merge part of an external sort. Taking Anton's example,
>We sort the file into 128 GB runs and write them to disk, then after the
>sort, we have a 4,000 way merge. We will read 1/4000 of each run into
>memory and merge them, replenishing each run from its run on disk.
>
>The naive way of doing the merge is comparing the first unmerged record
>in each run to the first unmerged record in the other 3,999 runs.
>This is clearly not optimal. The accepted algorithm is to create a
>binary tree (called a tournament tree) where the leaf of each branch
>points to the lowest unmerged record in each run. So, for 4,000
>runs, the tree has a depth of 12, and we only have to do 12 comparisons
>instead of 4,000. This is a big issue, as we are going to do this one
>trillion times! Improving the CPU cost of managing the tree is worthwhile.
>
>There is another issue. Keys may be arbitrarily long. I assumed 16
>bytes which is pretty modest. Remember, there may be a compound key
>consisting of multiple fields within each record. But doing comparisons
>of arbitrary length fields is fairly costly, and the keys are duplicated
>in each tree element, so the tree elements could get pretty large.
>IBM's solution to this problem is to create what they call "code words"
>for each key. In 64 bit mode, each code word consists of eight bytes.
>The low-order six bytes contain the three halfwords starting at the
>first bit where the full key differs from the key of that node's
>tournament winner, and the other two bytes give the starting halfword
>of the code word within the full key. At the cost of the code to create the
>code words, this usually speeds up the comparison (always a 64 bit
>compare), and generally reduces the size of the tree, which improves
>data cache usage.
>
>The two instructions IBM created are for the searching and update of the
>tree (Update Tree), and the insertion of new records into the tree after
>each winner is taken out (Compare and Form Code Word). Their operation
>and usage are pretty well described in Appendix A of the Principles of
>Operation manual. More details are in the instruction descriptions in
>chapter 7.
>
>
>I saw nothing to indicate that these instructions are used in the sort
>phase at all. They could conceivably be used to merge the replacement
>records into the sorted run, but I saw no evidence that they were used
>there.
>
>So, to answer the first question. It is clear, as Anton indicated, that
>their functionality could be done with a sequence of other instructions,
>perhaps in a library routine. Of course, that is true of most complex
>instructions, including, for example, floating point arithmetic. But
>looking at Appendix A, it is clear that code would be pretty complex
>with lots of instructions. Just saving on those many instructions'
>fetch and decode might be significant.
>
>I saw no benchmark data to indicate any speed advantage of the new
>instructions, however there is some "indirect evidence". There was a
>substantial market for sort programs for S/360 architecture, with at
>least two big competitors (Syncsort and CAsort). If IBM tried to
>"foist" some feature on its customers that was supposed to improve
>performance, but actually didn't, I am pretty sure their competitors
>would call them out on it. Also, when 64 bit mode came out, IBM spent
>the resources necessary to enhance the instructions for 64 bit mode,
>rather than deprecating them. That they did this also provides some
>evidence for their continuing value in a large memory system.
> Probably better evidence is the fact that the improvement would be
>more pronounced with larger memory. Specifically, no one would, or
>probably not even could, do a 40 way merge on a small memory system. But
>the advantage of the tree (log2 N versus N) grows with the increasing
>merge ways allowed with large memory. I realize that these bits of
>evidence are circumstantial and sort of hand waving, and I do wish I
>could find actual benchmarks, but it is what it is.
>
>As to the second question, whether to do the replacement when in the
>sort phase, I think that answer is clearly yes. While it does increase
>CPU time a little, and adds complexity, the elapsed time savings are huge.
>
>Using replacement selection, if the data is random, the average run
>length is about twice that of the non-replacement runs. As
>mentioned above, in the real world, it is frequently even longer. But
>let's assume a factor of two. So instead of merging 4000 runs of length
>N, we will merge 2000 runs of length 2N. In either case, we read in a
>"chunk" of each run, and when all the records in a chunk are merged, it
>will be replaced by reading in the next chunk in that run. The fact that
>we have half as many runs to merge means that the chunks of each run
>that we read in can be twice as large. We are assuming 1,000 TB of disk
>consisting of 10 TB drives, so we have 100 physical drives. The runs
>are spread evenly across all the drives (it doesn't make too much
>difference if you assume the input file and the intermediate file are on
>separate sets of 50 drives or both spread across 100 physical drives).
>Since the chunks are exhausted randomly in the merge, you can't predict
>from which run you are going to read the next chunk. Since there are 2
>or 4 thousand runs and 100 drives, the odds that the next chunk you need
>from a drive is the next chunk of the run you last read there are very
>modest (about .05 for 2,000 runs spread over 100 drives). Thus each read
>is likely to require a full seek and latency. So the advantage of twice
>as big chunks is clear. It reduces the number of seek plus latency times
>by a factor of two. (Note that the transfer time is the same for the two
>cases.) In our example, from 4,000 I/Os to 2,000 I/Os. If we assume 10 ms
>per seek plus latency, that is 2,000 times 10 ms, or 20 seconds saved,
>which dwarfs the additional CPU time required to do the replacement.
>
>In any case, I apologize for such a long post, and, of course, any
>errors. And I thank the other posters in this thread for giving me the
>motivation to spend the time learning about this stuff. :-)
>
>
>
>--
> - Stephen Fuld
>(e-mail address disguised to prevent spam)


Re: OoO S/360 descendants

<c4743268-4cf7-47f7-92ae-5780b3f0817dn@googlegroups.com>

 by: MitchAlsup - Mon, 6 Sep 2021 19:16 UTC

On Monday, September 6, 2021 at 1:51:14 PM UTC-5, Stephen Fuld wrote:
> [snip -- Stephen Fuld's post, quoted in full above]
<
A nice little write up.


Re: OoO S/360 descendants

<2021Sep7.131827@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Tue, 7 Sep 2021 11:18 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>The naive way of doing the merge is comparing the first unmerged record
>in each run to the first unmerged record in the other 3,999 runs.
>This is clearly not optimal. The accepted algorithm is to create a
>binary tree (called a tournament tree) where the leaf of each branch
>points to the lowest unmerged record in each run.

I wondered whether there is a difference between a tournament tree and
a heap (as in heapsort). It seems that the difference is that all the
data in the tournament tree is in the leaves, and some of it is
repeated in the internal nodes; by contrast, in the heap each record
is represented only once, either in a leaf or in an internal node.

It's not clear to me that a tournament tree is superior to a heap, and
if so, why.
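
For concreteness, a minimal C sketch of the two layouts as I understand
them (names and field choices are mine, purely illustrative):

/* Binary heap: each key is stored exactly once, in an implicit
   array; the children of node i live at 2*i+1 and 2*i+2. */
typedef struct {
    const char *key[4096];      /* one slot per record, nothing else */
    int         n;
} Heap;

/* Tournament ("loser") tree for a k-way merge: the current head key
   of each run sits in one of k leaves; the k-1 internal nodes repeat
   information, each remembering the loser of the comparison played
   there, so the overall winner falls out at the root. */
typedef struct {
    const char *key;            /* head-of-run key                   */
    int         run;            /* which run it came from            */
} Leaf;

typedef struct {
    int   k;                    /* number of runs                    */
    int  *loser;                /* internal nodes, indexed 1..k-1    */
    Leaf *leaves;               /* k leaves, one per run             */
} LoserTree;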

>I saw nothing to indicate that these instructions are used in the sort
>phase at all. They could conceivably be used to merge the replacement
>records into the sorted run, but I saw no evidence that they were used
>there.

Ok, that confirms my thoughts.

>So, to answer the first question. It is clear, as Anton indicated, that
>their functionality could be done with a sequence of other instructions,
>perhaps in a library routine. Of course, that is true of most complex
>instructions, including, for example, floating point arithmetic. But
>looking at Appendix A, it is clear that code would be pretty complex
>with lots of instructions. Just saving on those many instructions'
>fetch and decode might be significant.

I doubt that instruction fetching and decoding is an issue. This code
would be executed all the time and therefore would be in the decoded
instruction cache (in a CPU that has that, as many modern CPUs have).
And fetching the decoded microcode from the microcode store costs,
too.

The reorganization into code words may be something that you can do
more efficiently in dedicated hardware. How that works and when it
helps is unclear to me.

> Probably better evidence is the fact that the improvement would be
>more pronounced with larger memory. Specifically, no one would, or
>probably not even could, do a 40 way merge on a small memory system.

Why not? You need only memory for 40 records; or 4000 for a 4000-way
merge. With 500-byte records and 4000-way merge, that would be 2MB.
And if you are really pressed for memory, you only need the keys in
memory and load the rest of the record only when you know you need it
for copying to the output.

>As to the second question, whether to do the replacement when in the
>sort phase, I think that answer is clearly yes. While it does increase
>CPU time a little, and adds complexity, the elapsed time savings are huge.
>
>Using replacement selection, if the data is random, the average run
>length is about twice that of the non-replacement runs. As
>mentioned above, in the real world, it is frequently even longer. But
>let's assume a factor of two. So instead of merging 4000 runs of length
>N, we will merge 2000 runs of length 2N. In either case, we read in a
>"chunk" of each run, and when all the records in a chunk are merged, it
>will be replaced by reading in the next chunk in that run. The fact that
>we have half as many runs to merge means that the chunks of each run
>that we read in can be twice as large. We are assuming 1,000 TB of disk
>consisting of 10 TB drives, so we have 100 physical drives. The runs
>are spread evenly across all the drives (it doesn't make too much
>difference if you assume the input file and the intermediate file are on
>separate sets of 50 drives or both spread across 100 physical drives).
>Since the chunks are exhausted randomly in the merge, you can't predict
>from which run you are going to read the next chunk. Since there are 2
>or 4 thousand runs and 100 drives, the odds that the next chunk you need
>from a drive is the next chunk of the run you last read there are very
>modest (about .05 for 2,000 runs spread over 100 drives). Thus each read
>is likely to require a full seek and latency. So the advantage of twice
>as big chunks is clear. It reduces the number of seek plus latency times
>by a factor of two. (Note that the transfer time is the same for the two
>cases.) In our example, from 4,000 I/Os to 2,000 I/Os. If we assume 10 ms
>per seek plus latency, that is 2,000 times 10 ms, or 20 seconds saved,
>which dwarfs the additional CPU time required to do the replacement.

If we assume that the whole merging is CPU-bound (otherwise, why add
instructions that speed up the CPU?), loading the next chunk of a run
can be started when the existing chunk of that run reaches a low-water
mark. In that case the seeks happen in parallel with the CPU work, and
the choice between a 2000-way and a 4000-way merge makes no difference
on the I/O side. It does reduce log2 n from 12 to 11, though,
which makes a difference in a CPU-bound setting (and more of a
difference than the 20s, see below).
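
As a sketch of how such a per-run low-water mark could look (POSIX AIO
purely for illustration; the Run layout, CHUNK, and LOW_WATER are
invented, and completion handling and error checking are omitted):

#include <aio.h>
#include <string.h>

enum { CHUNK = 1 << 20, LOW_WATER = CHUNK / 4 };

typedef struct {
    char         buf[2][CHUNK];  /* double buffer for this run        */
    size_t       pos, len;       /* consumption point in active buf   */
    int          active;         /* which buffer is being consumed    */
    int          fd;             /* the run's file on disk            */
    off_t        next_off;       /* where the run's next chunk starts */
    struct aiocb cb;             /* the in-flight read, if any        */
    int          in_flight;
} Run;

/* Call after consuming each record from r's active buffer. */
static void maybe_prefetch(Run *r)
{
    if (r->in_flight || r->len - r->pos > LOW_WATER)
        return;               /* plenty left, or read already queued */
    memset(&r->cb, 0, sizeof r->cb);
    r->cb.aio_fildes = r->fd;
    r->cb.aio_buf    = r->buf[1 - r->active];
    r->cb.aio_nbytes = CHUNK;
    r->cb.aio_offset = r->next_off;
    aio_read(&r->cb);         /* seek+read overlaps with the merging */
    r->in_flight = 1;
}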

If the whole merging step is I/O-bound, it means that we have to read
5TB from each drive and write 5TB to each drive. At 300MB/s (these
are HDDs), this takes 33333s (>9h). Do you go to extra lengths to
save 20s in that setting?

Is it CPU-bound or I/O-bound? The 100 HDDs can deliver (or consume)
at a total of 30GB/s; that's about what PCIe 4.0x16 can do, so a
mainstream PC may be (barely) able to support that much I/O; probably
something like a Threadripper or so is more appropriate. Anyway, if
you have 8 cores, that's ~2GB/s of input and 2GB/s of output per core;
with the factor of 12 from 4000-way merging that means looking at
24GB/s (in some caches) or 6-8 bytes/cycle to produce the 2GB/s of
output. Actually, given that only a part of the record is compared,
it's less. I think that this should be doable without code words, and
if not, a Ryzen 5950X would give you 16 cores to play with. So I
think it's relatively easy (and cheap) to make that I/O bound.

If you have data on SSDs, things are different, but the cost is even
higher: You need a lot of PCIe lanes to avoid a bottleneck there, and
then you have a platform that supports more cores. If you buy all
these cores, the result might still be I/O-bound: If 8 cores are
enough for 16 PCIe 4.0 lanes, 64 cores may be enough for 128 PCIe 4.0
lanes.

>In any case, I apologize for such a long post

What is the world coming to? People apologize for informative and
well-thought-out posts with the corresponding length:-).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: OoO S/360 descendants

<sh8570$t79$1@dont-email.me>

 by: Stephen Fuld - Tue, 7 Sep 2021 16:49 UTC

On 9/7/2021 4:18 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> The naive way of doing the merge is comparing the first unmerged record
>> in each run to the first unmerged record in the other 3,999 runs.
>> This is clearly not optimal. The accepted algorithm is to create a
>> binary tree (called a tournament tree) where the leaf of each branch
>> points to the lowest unmerged record in each run.
>
> I wondered whether there is a difference between a tournament tree and
> a heap (as in heapsort). It seems that the difference is that all the
> data in the tournament tree is in the leaves, and some of it is
> repeated in the internal nodes; by contrast, in the heap each record
> is represented only once, either in a leaf or in an internal node.
>
> It's not clear to me that a tournament tree is superior to a heap, and
> if so, why.

See Knuth, Volume 3 section 5.4.1. Note that this reference is to the
original Vol 3.

>> I saw nothing to indicate that these instructions are used in the sort
>> phase at all. They could conceivably be used to merge the replacement
>> records into the sorted run, but I saw no evidence that they were used
>> there.
>
> Ok, that confirms my thoughts.
>
>> So, to answer the first question. It is clear, as Anton indicated, that
>> their functionality could be done with a sequence of other instructions,
>> perhaps in a library routine. Of course, that is true of most complex
>> instructions, including, for example, floating point arithmetic. But
>> looking at Appendix A, it is clear that code would be pretty complex
>> with lots of instructions. Just saving on those many instructions'
>> fetch and decode might be significant.
>
> I doubt that instruction fetching and decoding is an issue. This code
> would be executed all the time and therefore would be in the decoded
> instruction cache (in a CPU that has that, as many modern CPUs have).
> And fetching the decoded microcode from the microcode store costs,
> too.

Perhaps. But remember, these instructions were developed way before
things like decoded instruction caches were even a glimmer in the HW
guys' eyes! :-)

I do vaguely remember (perhaps John Levine or the Wheelers could give
more specifics) that in the early 1980s(?) IBM spent some effort
looking at executed instruction sequences in all their code with the
idea of combining a frequently executed sequence of instructions into a
single instruction. The sort instructions probably came out of that.

> The reorganization into code words may be something that you can do
> more efficiently in dedicated hardware. How that works and when it
> helps is unclear to me.

How it works is quite well documented in the aforementioned Appendix A.
It is helpful when the key is longer than 4 bytes (32 bit mode), or 8
bytes (64 bit mode). The longer the key is, the more the benefit.
Remember, in commercial workloads, the keys can be quite long, and even
made up of multiple fields, e.g. last name, first name, DOB.
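
To make the compression idea concrete, a rough C sketch of forming
such a code word (my paraphrase of the Appendix A description, not
the actual CFC encoding; in particular the real instruction lays out
the fields so that one unsigned 64-bit compare still orders the keys
correctly, which this toy version does not promise):

#include <stdint.h>

/* keylen counts 16-bit halfwords; keys compare halfword by halfword */
static uint64_t code_word(const uint16_t *key, const uint16_t *winner,
                          unsigned keylen)
{
    unsigned i = 0;
    while (i < keylen && key[i] == winner[i])
        i++;                            /* first differing halfword  */
    uint64_t cw = (uint64_t)i << 48;    /* 2 bytes: starting offset  */
    for (unsigned j = 0; j < 3; j++) {  /* 6 bytes: three halfwords  */
        uint16_t hw = (i + j < keylen) ? key[i + j] : 0;
        cw |= (uint64_t)hw << (32 - 16 * j);
    }
    return cw;    /* one 8-byte compare now stands in for a long key */
}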

>> Probably better evidence is the fact that the improvement would be
>> more pronounced with larger memory. Specifically, no one would, or
>> probably not even could, do a 40 way merge on a small memory system.
>
> Why not? You need only memory for 40 records; or 4000 for a 4000-way
> merge. With 500-byte records and 4000-way merge, that would be 2MB.
> And if you are really pressed for memory, you only need the keys in
> memory and load the rest of the record only when you know you need it
> for copying to the output.

Count the number of I/O operations, then add 10 ms to the run time for
each one. Reading in one record at a time, even if you could (remember
this is OS/360 with CKD disks, not Unix), would be a disaster.

>> As to the second question, whether to do the replacement when in the
>> sort phase, I think that answer is clearly yes. While it does increase
>> CPU time a little, and adds complexity, the elapsed time savings are huge.

snip

> If we assume that the whole merging is CPU-bound (otherwise, why add
> instructions that speed up the CPU?), loading the next chunk of a run
> can be started when the existing chunk of that run reaches a low-water
> mark.

I struggled with trying to figure out whether it was CPU- or I/O-bound.
What I finally came down to was that I think it alternates.

I don't think your low water mark would work. First of all, it is
likely (given a random key distribution) that many blocks will hit that
mark at about the same time. You don't want to start the disk seeking
for one, then find out that you really need a different one on the same
disk. You would have to wait for the first seek to finish before
starting the "correct" one. You have just lengthened the wait for the
next block. :-(

> In that case the seeks are in parallel with the CPU, and the
> difference between 2000-way and 4000-way merge does not make a
> difference on the I/O side.

You can't really do them in parallel, so the difference is still a
factor of two in the number of I/Os. That is why I came up with the
idea of alternately CPU- and I/O-bound.

> It reduces ln n from 12 to 11, though,
> which makes a difference in a CPU-bound setting (and more of a
> difference than the 20s, see below).
>
> If the whole merging step is I/O-bound, it means that we have to read
> 5TB from each drive and write 5TB to each drive. At 300MB/s (these
> are HDDs), this takes 33333s (>9h). Do you go to extra lengths to
> save 20s in that setting?
>
> Is it CPU-bound or I/O-bound?

As I said, I struggled with the same question. :-) I suspect that the
amount of overlapped I/O is small. Thus the instructions help during
part of the merge. And remember, this is a multi-programming
environment with a very expensive CPU, so reducing the CPU usage allows
it to be more available for the other programs running.

> The 100 HDDs can deliver (or consume)
> at a total of 30GB/s; that's about what PCIe 4.0x16 can do, so a
> mainstream PC may be (barely) able to support that much I/O; probably
> something like a Threadripper or so is more appropriate. Anyway, if
> you have 8 cores, that's ~2GB/s of input and 2GB/s of output per core;
> with the factor of 12 from 4000-way merging that means looking at
> 24GB/s (in some caches) or 6-8 bytes/cycle to produce the 2GB/s of
> output. Actually, given that only a part of the record is compared,
> it's less. I think that this should be doable without code words, and
> if not, a Ryzen 5950X would give you 16 cores to play with. So I
> think it's relatively easy (and cheap) to make that I/O bound.
>
> If you have data on SSDs, things are different, but the cost is even
> higher: You need a lot of PCIe lanes to avoid a bottleneck there, and
> then you have a platform that supports more cores. If you buy all
> these cores, the result might still be I/O-bound: If 8 cores are
> enough for 16 PCIe 4.0 lanes, 64 cores may be enough for 128 PCIe 4.0
> lanes.

I haven't given any thought at all to what the situation is on an X86
CPU with fixed length block disks, etc. Remember, these instructions
are specific to S/360 descendants.

>> In any case, I apologize for such a long post
>
> What is the world coming to? People apologize for informative and
> well-thought-out posts with the corresponding length:-).

Thank you!

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<58fb9406-6a25-4554-ac22-173334f9c6ecn@googlegroups.com>

 by: MitchAlsup - Tue, 7 Sep 2021 16:57 UTC

On Tuesday, September 7, 2021 at 11:49:38 AM UTC-5, Stephen Fuld wrote:
> [snip]
<
> Perhaps. But remember, these instructions were developed way before
> things like decoded instruction caches were even a glimmer in the HW
> guys' eyes! :-)
<
I was doing a Decoded Instruction Cache in 1990, merely as a point of reference.
> [snip]
> I don't think your low water mark would work. First of all, it is
> likely (given a random key distribution) that many blocks will hit that
> mark at about the same time. You don't want to start the disk seeking
> for one, then find out that you really need a different one on the same
> disk. You would have to wait for the first seek to finish before
> starting the "correct" one. You have just lengthened the wait for the
> next block. :-(
<
Modern SATA disks can seek faster than the platter rotates, and so
read/write multiple sectors per revolution.
> [snip]

Re: OoO S/360 descendants

<2021Sep8.082614@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Wed, 8 Sep 2021 06:26 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 9/7/2021 4:18 AM, Anton Ertl wrote:
>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>> Just saving on those many instructions'
>>> fetch and decode might be significant.
>>
>> I doubt that instruction fetching and decoding is an issue. This code
>> would be executed all the time and therefore would be in the decoded
>> instruction cache (in a CPU that has that, as many modern CPUs have).
>> And fetching the decoded microcode from the microcode store costs,
>> too.
>
>Perhaps. But remember, these instructions were developed way before
>things like decoded instruction caches were even a glimmer in the HW
>guys' eyes! :-)

Sure, when microcode was faster than ordinary instructions (in the
1970s?), such an instruction might have had an advantage from that
alone, but today?

>I do vaguely remember (perhaps John Levine or the Wheelers could give
>more specifics) that in the early 1980s(?) IBM spent some effort
>looking at executed instruction sequences in all their code with the
>idea of combining a frequently executed sequence of instructions into a
>single instruction. The sort instructions probably came out of that.

The IBM 801 project certainly looked at instruction usage statistics
and eliminated all complex ones in order to allow fast pipelined
execution of the rest.

The way instruction statistics are typically taken, you probably would
not see something like tournament tree selection. I expect that this
stuff came from a higher-level profiling activity, and that's the
usual narrative: Some significant percentage of cycles is spent on
sorting, that's why they designed instructions to support that.

>>> Probably better evidence is the fact that the improvement would be
>>> more pronounced with larger memory. Specifically, no one would, or
>>> probably not even could, do a 40 way merge on a small memory system.
>>
>> Why not? You need only memory for 40 records; or 4000 for a 4000-way
>> merge. With 500-byte records and 4000-way merge, that would be 2MB.
>> And if you are really pressed for memory, you only need the keys in
>> memory and load the rest of the record only when you know you need it
>> for copying to the output.
>
>Count the number of I/O operations, then add 10 ms to the run time for
>each one. Reading in one record at a time, even if you could (remember
>this is OS/360 with CKD disks, not Unix), would be a disaster.

Yes, it's probably more efficient to do a 4000-way merge in multiple
smaller steps if you have so little memory, but you certainly can do
the 4000-way merge in one step; it's just slow.

>> If we assume that the whole merging is CPU-bound (otherwise, why add
>> instructions that speed up the CPU?), loading the next chunk of a run
>> can be started when the existing chunk of that run reaches a low-water
>> mark.
>
>I struggled with trying to figure out whether it was CPU- or I/O-bound.
>What I finally came down to was that I think it alternates.

Certainly in the merging phase it's relatively straightforward to
overlap CPU and I/O so much that one of them becomes the limit all the
time. For the initial sorting phase, you would need to use half the
memory as a buffer for the CPU stuff and half as a buffer for the I/O
stuff, but then the runs would only be half as long (~64GB), and you
would need an 8000-way merge in the end.

>I don't think your low water mark would work. First of all, it is
>likely (given a random key distribution) that many blocks will hit that
>mark at about the same time. You don't want to start the disk seeking
>for one, then find out that you really need a different one on the same
>disk. You would have to wait for the first seek to finish before
>starting the "correct" one. You have just lengthened the wait for the
>next block. :-(

Not a problem. Just make the low-water-mark high enough, and the
probability of running out of customers (CPUs being idle) becomes
arbitrarily low (basic queuing theory). You can stagger the
low-water-marks for runs on the same HDD to make it less likely to
have the requests bunch up. You only get a problem if you need so
many seeks that seek+read time exceeds the CPU time.
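
A tiny variation on the prefetch sketch upthread to show the
staggering idea (BASE and STEP are invented tuning knobs):

enum { CHUNK = 1 << 20, BASE = CHUNK / 4, STEP = CHUNK / 32 };

/* Per-run low-water mark, staggered so that runs sharing a drive do
   not all hit their threshold at the same time. */
static unsigned low_water(unsigned run_on_drive)
{
    return BASE + run_on_drive * STEP;
}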

>> The 100 HDDs can deliver (or consume)
>> at a total of 30GB/s; that's about what PCIe 4.0x16 can do, so a
>> mainstream PC may be (barely) able to support that much I/O; probably
>> something like a Threadripper or so is more appropriate. Anyway, if
>> you have 8 cores, that's ~2GB/s of input and 2GB/s of output per core;
>> with the factor of 12 from 4000-way merging that means looking at
>> 24GB/s (in some caches) or 6-8 bytes/cycle to produce the 2GB/s of
>> output. Actually, given that only a part of the record is compared,
>> it's less. I think that this should be doable without code words, and
>> if not, a Ryzen 5950X would give you 16 cores to play with. So I
>> think it's relatively easy (and cheap) to make that I/O bound.
>>
>> If you have data on SSDs, things are different, but the cost is even
>> higher: You need a lot of PCIe lanes to avoid a bottleneck there, and
>> then you have a platform that supports more cores. If you buy all
>> these cores, the result might still be I/O-bound: If 8 cores are
>> enough for 16 PCIe 4.0 lanes, 64 cores may be enough for 128 PCIe 4.0
>> lanes.
>
>I haven't given any thought at all to what the situation is on an X86
>CPU with fixed length block disks, etc. Remember, these instructions
>are specific to S/360 descendants.

I am not limiting myself to S/360 descendants just because they have
such instructions. If I can solve the problem just as effectively
with AMD64-based servers, there is no need for specialized hardware.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: OoO S/360 descendants

<2021Sep8.105140@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Wed, 8 Sep 2021 08:51 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 9/7/2021 4:18 AM, Anton Ertl wrote:
>> It's not clear to me that a tournament tree is superior to a heap, and
>> if so, why.
>
>See Knuth, Volume 3 section 5.4.1. Note that this reference is to the
>original Vol 3.

Being too lazy to visit my non-home office where that book resides, I
looked at
<https://en.wikipedia.org/wiki/K-way_merge_algorithm#Direct_k-way_merge>,
and it says:

|A heap uses approximately 2*log(k) comparisons in each step because it
|handles the tree from the root down to the bottom and needs to compare
|both children of each node. Meanwhile, a tournament tree only needs
|log(k) comparisons because it starts on the bottom of the tree and
|works up to the root, only making a single comparison in each
|layer. The tournament tree should therefore be the preferred
|implementation.
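
A minimal C sketch of that leaf-to-root replay (layout as in my sketch
upthread, repeated here so this stands alone; the indexing is the
textbook convention where leaf i's parent is internal node (i+k)/2):

#include <string.h>

typedef struct { const char *key; } Leaf;
typedef struct { int k; int *loser; Leaf *leaves; } LoserTree;

/* After the previous winner's run has supplied its next key into
   leaves[leaf], walk back up to the root. Exactly one comparison is
   played per level -- log2(k) in total -- because each internal node
   already stores the loser of its last match; a heap's sift-down
   must look at both children, hence the ~2*log2(k). */
static int replay(LoserTree *t, int leaf)
{
    int winner = leaf;
    for (int n = (leaf + t->k) / 2; n >= 1; n /= 2) {
        if (strcmp(t->leaves[t->loser[n]].key,
                   t->leaves[winner].key) < 0) {
            int tmp = t->loser[n];   /* stored loser beats contender:  */
            t->loser[n] = winner;    /* contender stays here as loser, */
            winner = tmp;            /* old loser plays on upward      */
        }
    }
    return winner;                   /* leaf index of the new winner   */
}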

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: OoO S/360 descendants

<nb3_I.19779$rsCb.16610@fx01.iad>

 by: EricP - Wed, 8 Sep 2021 13:48 UTC

Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> I haven't given any thought at all to what the situation is on an X86
>> CPU with fixed length block disks, etc. Remember, these instructions
>> are S/360 descendants specific.
>
> I am not limiting myself to S/360 descendants just because they have
> such instructions. If I can solve the problem just as effectively
> with AMD64-based servers, there is no need for specialized hardware.

Sort-merge might make use of a general HW scatter-gather DMA mechanism
integrated with IO so records are gathered during the disk write operation.

Rather than sort records by moving them, sort a vector of descriptors.
Then pass this sorted vector containing virtual address and length of each
item to the IO write. IO translates each virtual descriptor to one or more
physical descriptors, because records can straddle physical pages.
Then pass this physical fragment descriptor vector directly to the
disk DMA device so it can stream the record fragments directly
to sequential disk blocks in one long write.

In other words, the physical descriptor vector acts like an IOMMU
but for arbitrary byte length fragments rather than whole pages.

Using multiple buffers, async IO, multiple SMP threads,
cache-aware sorts on multiple cores,
one could really go to town on concurrency for this.
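
(A short C sketch of the descriptor idea above. The structure, KEYLEN,
and the gather-write call are assumptions for illustration, not an
existing API. The point is that the sort permutes small descriptors
while the record bytes stay put until the final gathered write.)

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    #define KEYLEN 16          /* assume the key is the record's first
                                  16 bytes (hypothetical convention)  */

    struct vdesc {             /* one record: where it lives, how long */
        void   *vaddr;
        size_t  len;
    };

    /* Compare two records through their descriptors; only the
       descriptors ever move, never the record bytes themselves. */
    static int by_key(const void *a, const void *b)
    {
        const struct vdesc *x = a, *y = b;
        return memcmp(x->vaddr, y->vaddr, KEYLEN);
    }

    void sort_and_write(struct vdesc *v, size_t nrec)
    {
        qsort(v, nrec, sizeof v[0], by_key);
        /* The IO layer would now translate each (vaddr, len) into one
           or more physical fragments and stream them to sequential
           disk blocks in one long write:
           io_write_gather(v, nrec);        hypothetical call */
    }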

Re: OoO S/360 descendants

<shamca$r0u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20347&group=comp.arch#20347

 by: Stephen Fuld - Wed, 8 Sep 2021 15:54 UTC

On 9/7/2021 11:26 PM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 9/7/2021 4:18 AM, Anton Ertl wrote:
>>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>>> Just saving on those many instructions'
>>>> fetch and decode might be significant.
>>>
>>> I doubt that instruction fetching and decoding is an issue. This code
>>> would be executed all the time and therefore would be in the decoded
>>> instruction cache (in a CPU that has that, as many modern CPUs have).
>>> And fetching the decoded microcode from the microcode store costs,
>>> too.
>>
>> Perhaps. But remember, these instructions were developed way before
>> things like decoded instruction cache were even a glimmer in the HW guys
>> eyes! :-)
>
> Sure, when microcode was faster than ordinary instructions (in the
> 1970s?), such an instruction might have had an advantage from that
> alone, but today?

Well, that is a different question. While they are, of course,
maintained for compatibility, IBM has a new solution.

https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15

Note that this paper even gives some benchmark results, though they are
IBM internal.

>> I do vaguely remember, (perhaps John Levine or the Wheelers could give
>> more specifics) that in the early 1980s (?), IBM spent some effort
>> looking at executed instruction sequences in all their code with the
>> idea of combining a frequently executed sequence of instructions into a
>> single instruction. The sort instructions probably came out of that.
>
> The IBM 801 project certainly looked at instruction usage statistics
> and eliminated all complex ones in order to allow fast pipelined
> execution of the rest.

And that certainly affected later designs. But since the 801 and the
sort instructions were roughly contemporaneous (I don't have an accurate
date for the latter, but it appears to be late 1970s or early 1980s) and
given the inertia of a large company like IBM, I can certainly see
people in S/370 land getting the development resources for the new
instructions.

> The way instruction statistics are typically taken, you probably would
> not see something like tournament tree selection. I expect that this
> stuff came from a higher-level profiling activity, and that's the
> usual narrative: Some significant percentage of cycles is spent on
> sorting, that's why they designed instructions to support that.

Could very well be.

>
>>>> Probably better evidence is the fact that the improvement would be
>>>> more pronounced with larger memory. Specifically, no one would, or
>>>> probably could not even, do a 40-way merge on a small memory system.
>>>
>>> Why not? You need only memory for 40 records; or 4000 for a 4000-way
>>> merge. With 500-byte records and 4000-way merge, that would be 2MB.
>>> And if you are really pressed for memory, you only need the keys in
>>> memory and load the rest of the record only when you know you need it
>>> for copying to the output.
>>
>> Count the number of I/O operations, then add 10 ms to the run time for
>> each one. Reading in one record at a time, even if you could (remember
>> this is OS/360 with CKD disks, not Unix), would be a disaster.
>
> Yes, it's probably more efficient to do a 4000-way merge in multiple
> smaller steps if you have so little memory, but you certainly can do
> the 4000-way merge in one step, it's just slow.
>
>>> If we assume that the whole merging is CPU-bound (otherwise, why add
>>> instructions that speed up the CPU?), loading the next chunk of a run
>>> can be started when the existing chunk of that run reaches a low-water
>>> mark.
>>
>> I struggled with trying to figure out if it was CPU versus I/O bound.
>> What I finally came down to was that I think it alternates.
>
> Certainly in the merging phase it's relatively straightforward to
> overlap CPU and I/O so much that one of them becomes the limit all the
> time. For the initial sorting phase, you would need to use half the
> memory as a buffer for the CPU stuff and half as a buffer for the I/O
> stuff, but then the runs would only be half as long (~64GB), and you
> would need an 8000-way merge in the end.

Given that, according to Wikipedia (DFSORT)

> Sort/Merge is very frequently used; often the most commonly used application program in a mainframe shop generally consuming about twenty percent of the processing power of the shop.

I am sure that lots of smart people spent lots of time and brainpower to
optimize it.

>
>> I don't think your low water mark would work. First of all, it is
>> likely (given a random key distribution) that many blocks will hit that
>> mark at about the same time. You don't want to start the disk seeking
>> for one, then find out that you really need a different one on the same
>> disk. You would have to wait for the first seek to finish before
>> starting the "correct" one. You have just lengthened the wait for the
>> next block. :-(
>
> Not a problem. Just make the low-water-mark high enough, and the
> probability of running out of customers (CPUs being idle) becomes
> arbitrarily low (basic queuing theory). You can stagger the
> low-water-marks for runs on the same HDD to make it less likely to
> have the requests bunch up. You only get a problem if you need so
> many seeks that seek+read time exceeds the CPU time.

I'm missing something in your explanation. You don't want to have
prefetches for all the runs, as the prefetch buffers would take as much
space as the runs themselves. I suppose you could prefetch only a small
part of each run, but then you need another I/O to read in the rest of
the run, and again, you don't know which run you will need, so you have
the same problem.
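
(For concreteness, a minimal C sketch of the low-water-mark refill
being debated; the structure and the async read call are assumptions.
Whether the per-run windows fit in memory is exactly the point in
dispute above.)

    #include <stddef.h>

    /* One run's in-memory window during the merge (illustrative). */
    struct runbuf {
        unsigned char *buf;       /* window into this run               */
        size_t         avail;     /* bytes not yet consumed             */
        size_t         low_mark;  /* refill threshold, staggered per HDD */
        int            pending;   /* nonzero while an async read is out  */
    };

    /* Called whenever the merge consumes a record from run r: start the
       refill before the window runs dry, so seek+read overlaps compute. */
    void maybe_refill(struct runbuf *r)
    {
        if (!r->pending && r->avail < r->low_mark) {
            r->pending = 1;
            /* start_async_read(r);     hypothetical async IO call */
        }
    }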

>>> The 100 HDDs can deliver (or consume)
>>> at a total of 30GB/s; that's about what PCIe 4.0x16 can do, so a
>>> mainstream PC may be (barely) able to support that much I/O; probably
>>> something like a Threadripper or so is more appropriate. Anyway, if
>>> you have 8 cores, that's ~2GB/s of input and 2GB/s of output per core;
>>> with the factor of 12 from 4000-way merging that means looking at
>>> 24GB/s (in some caches) or 6-8 bytes/cycle to produce the 2GB/s of
>>> output. Actually, given that only a part of the record is compared,
>>> it's less. I think that this should be doable without code words, and
>>> if not, a Ryzen 5950X would give you 16 cores to play with. So I
>>> think it's relatively easy (and cheap) to make that I/O bound.
>>>
>>> If you have data on SSDs, things are different, but the cost is even
>>> higher: You need a lot of PCIe lanes to avoid a bottleneck there, and
>>> then you have a platform that supports more cores. If you buy all
>>> these cores, the result might still be I/O-bound: If 8 cores are
>>> enough for 16 PCIe 4.0 lanes, 64 cores may be enough for 128 PCIe 4.0
>>> lanes.
>>
>> I haven't given any thought at all to what the situation is on an X86
>> CPU with fixed length block disks, etc. Remember, these instructions
>> are S/360 descendants specific.
>
> I am not limiting myself to S/360 descendants just because they have
> such instructions.

I don't think anyone is proposing such instructions for X86, almost
certainly for very good and sufficient reasons. When they were
developed, no one would have considered "farms" of X86 systems. I have
no idea what percentage of time is spent on sorts on X86 systems, but
it is almost undoubtedly much less than on Z series, so even for the
X86 server-market chips, it is almost certainly not worthwhile.

> If I can solve the problem just as effectively
> with AMD64-based servers, there is no need for specialized hardware.

You are ignoring the reality of why customers stay on IBM mainframes.
Efficiency is only a relatively small part of it.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<357b0396-74c4-4bd9-a9a9-3d743cb5f25dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20350&group=comp.arch#20350

 by: MitchAlsup - Wed, 8 Sep 2021 16:09 UTC

On Wednesday, September 8, 2021 at 8:49:10 AM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >> I haven't given any thought at all to what the situation is on an X86
> >> CPU with fixed length block disks, etc. Remember, these instructions
> >> are S/360 descendants specific.
> >
> > I am not limiting myself to S/360 descendants just because they have
> > such instructions. If I can solve the problem just as effectively
> > with AMD64-based servers, there is no need for specialized hardware.
> Sort-merge might make use of a general HW scatter-gather DMA mechanism
> integrated with IO so records are gathered during the disk write operation.
>
> Rather than sort records by moving them, sort a vector of descriptors.
> Then pass this sorted vector containing virtual address and length of each
> item to the IO write. IO translates each virtual descriptor to one or more
> physical descriptors, because records can straddle physical pages.
> Then pass this physical fragment descriptor vector directly to the
> disk DMA device so it can stream the record fragments directly
> to sequential disk blocks in one long write.
>
> In other words, the physical descriptor vector acts like an IOMMU
> but for arbitrary byte length fragments rather than whole pages.
<
But if you already HAVE an I/O MMU, then you just need the virtual
address and length (into an address space) and simply program the
I/O using virtual addresses. The fact that any given record straddles
a page boundary is then not exposed until the DMA actually crosses it.
<
This should save a bunch of analysis and copying prior to performing
the I/O.
>
> Using multiple buffers, async IO, multiple SMP threads,
> cache-aware sorts on multiple cores,
> one could really go to town on concurrency for this.
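
(Mitch's variant, sketched in the same style: with an IOMMU the sorted
virtual descriptors go to the device as-is, and page crossings are
resolved during the DMA itself. The call and the asid parameter are
hypothetical.)

    #include <stddef.h>

    struct vdesc { void *vaddr; size_t len; };

    /* No pre-translation pass into physical fragments: the device
       issues virtual addresses in address space asid, and the IOMMU
       resolves each page boundary only when the DMA actually crosses
       it. */
    void write_via_iommu(int asid, const struct vdesc *v, size_t nrec)
    {
        /* iommu_dma_write(asid, v, nrec);   hypothetical call */
        (void)asid; (void)v; (void)nrec;
    }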

Re: OoO S/360 descendants

<shapci$gp8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20351&group=comp.arch#20351

 by: Stephen Fuld - Wed, 8 Sep 2021 16:46 UTC

On 9/8/2021 6:48 AM, EricP wrote:
> Anton Ertl wrote:
>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>>> I haven't given any thought at all to what the situation is on an X86
>>> CPU with fixed length block disks, etc.  Remember, these instructions
>>> are S/360 descendants specific.
>>
>> I am not limiting myself to S/360 descendants just because they have
>> such instructions.  If I can solve the problem just as effectively
>> with AMD64-based servers, there is no need for specialized hardware.
>
> Sort-merge might make use of a general HW scatter-gather DMA mechanism
> integrated with IO so records are gathered during the disk write operation.

Given that S/360 had the equivalent of scatter/gather (called Data
Chaining in S/360 parlance), they might very well have used it. As I
said, I really couldn't find much information on the internal algorithm
used by IBM's sort program.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

