Re: OoO S/360 descendants

<61kpjgdl06tr73icbhf6fk3fr8bcvbpqr1@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=20409&group=comp.arch#20409
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Sat, 11 Sep 2021 12:27:31 -0400
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <61kpjgdl06tr73icbhf6fk3fr8bcvbpqr1@4ax.com>
References: <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="18bf165de774a610e34d2e16eab226cc";
logging-data="2788"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX189kjWE61dvwQY5kwipO9GpiNWdocQGOKY="
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:uZyuqTiB46J6ODbPTD7fh4IDkFA=
 by: George Neuner - Sat, 11 Sep 2021 16:27 UTC

On Fri, 10 Sep 2021 07:47:06 -0700, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:

>On 9/9/2021 7:14 AM, EricP wrote:
>
>snip
>
>> Combining the "external sort" approach with multiple rotating work buffers,
>> buffer A is doing an async read, B is sorting, C is async writing.
>> When all are done, rotate the buffers.
>> Balance the buffer size, read and write IO times vs sort time.
>> That keeps 1 cpu/thread and 2 disks busy.
>
>I am not sure that is the best approach. If you have N bytes of main
>memory, you have divided it into three equal sized pieces (in order to
>be able to "rotate them". That means the length of the run will be
>about 1/3 of what it would be if you used all of the memory as one big
>buffer. Read data into that buffer, then sort it, then write it out.

Mostly agreed, but with one caveat: there /may/ be some value to
multiplexing if storage is /really/ slow compared to RAM.

>The main determinant of elapsed time in an external sort is the number
>of merge passes. As the size of the sorted runs goes up, the number of
>passes goes down, and as we discussed in Anton's example, even when you
>get to the final merge pass, having fewer runs allows fewer I/Os. So,
>yes, you lose concurrency, but the longer runs may gain in elapsed time
>by doing less work.

Only one merge actually is required: partial merges are needed only if
you don't have enough on-line storage to simultaneously access all the
slices.

Merging requires buffering just one record from each slice. Depending
on RAM, potentially very many slices can be merged together in a
single pass.
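
As an illustration of that single-pass k-way merge (a Python sketch under the
assumptions above; the record values and slice iterators are made up):

import heapq

def kway_merge(slices):
    # Buffer exactly one record per slice, held in a min-heap keyed on the record.
    heap = []
    for i, it in enumerate(map(iter, slices)):
        rec = next(it, None)
        if rec is not None:
            heapq.heappush(heap, (rec, i, it))
    while heap:
        rec, i, it = heapq.heappop(heap)
        yield rec
        nxt = next(it, None)              # refill this slice's one-record buffer
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))

# Merging three sorted runs in one pass:
print(list(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]])))

(heapq.merge in the standard library does the same job; the point is only that
the memory needed per slice is a single record, or in practice one read buffer.)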

Re: OoO S/360 descendants

<shini3$l13$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20410&group=comp.arch#20410
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Sat, 11 Sep 2021 10:04:02 -0700
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <shini3$l13$1@dont-email.me>
References: <sflqmi$16gk$1@gal.iecc.com>
<2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com>
<2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me>
<2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me>
<2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <61kpjgdl06tr73icbhf6fk3fr8bcvbpqr1@4ax.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 11 Sep 2021 17:04:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="55b9c3774d49381d89c2586ab3ab136c";
logging-data="21539"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19/lMbLN4FHl5TKE9UqT+Lm8R8n0uI+uhE="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:0CdvCR8bk8zHK6Kr9HFRXnyyDOg=
In-Reply-To: <61kpjgdl06tr73icbhf6fk3fr8bcvbpqr1@4ax.com>
Content-Language: en-US
 by: Stephen Fuld - Sat, 11 Sep 2021 17:04 UTC

On 9/11/2021 9:27 AM, George Neuner wrote:
> On Fri, 10 Sep 2021 07:47:06 -0700, Stephen Fuld
> <sfuld@alumni.cmu.edu.invalid> wrote:

snip

>> The main determinant of elapsed time in an external sort is the number
>> of merge passes. As the size of the sorted runs goes up, the number of
>> passes goes down, and as we discussed in Anton's example, even when you
>> get to the final merge pass, having fewer runs allows fewer I/Os. So,
>> yes, you lose concurrency, but the longer runs may gain in elapsed time
>> by doing less work.
>
> Only one merge actually is required: partial merges are needed only if
> you don't have enough on-line storage to simultaneously access all the
> slices.

Yes, but see my post in reply to Anton's about the benefit of longer runs.

>
> Merging requires buffering just one record from each slice. Depending
> on RAM, potentially very many slices can be merged together in a
> single pass.

You are ignoring the disk I/O. First of all, we are not talking about a
Unix model of I/O where the actual disk reads are hidden behind a big
buffer. But even if we were, you have to look at the actual disk I/Os
to that buffer. You want to minimize the number of physical I/Os. The
way to do that is to read longer runs into memory with one disk read,
which in turn means reducing the "ways" of the merge, so you can get
more data from each run into memory.
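
A back-of-the-envelope way to see this (illustrative Python; the 1 TB and
128 MB figures are assumptions, not from the thread): with M bytes of memory
split across k open runs, each run gets an M/k read buffer, so merging D bytes
takes about k*D/M physical reads.

def final_merge_reads(data_bytes, mem_bytes, k_runs):
    buf_per_run = mem_bytes / k_runs       # each open run gets M/k of buffer
    return data_bytes / buf_per_run        # = k * D / M physical reads in total

D = 1_000_000_000_000                      # 1 TB to merge
M = 128 * 2**20                            # 128 MB of merge memory
for k in (4, 12):                          # full-memory runs vs 1/3-length runs
    print(k, round(final_merge_reads(D, M, k)))
# Tripling the number of runs triples the number of physical reads (each a third
# the size), so the seek-plus-latency overhead is paid three times as often.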

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<547%I.12692$dI3.3254@fx10.iad>

https://www.novabbs.com/devel/article-flat.php?id=20412&group=comp.arch#20412
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx10.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me>
In-Reply-To: <shfr5c$vc8$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 62
Message-ID: <547%I.12692$dI3.3254@fx10.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 11 Sep 2021 19:03:29 UTC
Date: Sat, 11 Sep 2021 15:02:52 -0400
X-Received-Bytes: 4038
 by: EricP - Sat, 11 Sep 2021 19:02 UTC

Stephen Fuld wrote:
> On 9/9/2021 7:14 AM, EricP wrote:
>
> snip
>
>> Combining the "external sort" approach with multiple rotating work
>> buffers,
>> buffer A is doing an async read, B is sorting, C is async writing.
>> When all are done, rotate the buffers.
>> Balance the buffer size, read and write IO times vs sort time.
>> That keeps 1 cpu/thread and 2 disks busy.
>
> I am not sure that is the best approach. If you have N bytes of main
> memory, you have divided it into three equal sized pieces (in order to
> be able to "rotate them". That means the length of the run will be
> about 1/3 of what it would be if you used all of the memory as one big
> buffer. Read data into that buffer, then sort it, then write it out.

My suggestion keeps the bottleneck source HDD continuously busy and
parallels the read of the source, the write of the temp, and sort.
That gets 2 IO going at once (actually more because I'd start the
merges before the source read was finished).

Your suggestion only utilizes the source disk less than 50% because
it serializes everything.

The dest drive sequential file is the other bottleneck because it can't
start to be written until the last record is read from the source
(that last record might sort first in the dest file).

A "minimum" sort time would start final merge and dest file write
immediately after reading the last source record, and continuously
stream the output as fast as the dest disk can take it.

> The main determinant of elapsed time in an external sort is the number
> of merge passes. As the size of the sorted runs goes up, the number of
> passes goes down, and as we discussed in Anton's example, even when you
> get to the final merge pass, having fewer runs allows fewer I/Os. So,
> yes, you lose concurrency, but the longer runs may gain in elapsed time
> by doing less work.

Yes, in part. But for the final merge file write, if there are 8 temp
and 1 dest HDD then we can read and merge far faster than write.

Looking at a randomly selected Enterprise disk from Western Digital,
a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
and an 8 TB SATA disk is 255 MB/s.
They don't give seek time anymore but I'll assume it's 15-20 ms average.

The system bus, which operates in the 6 giga-transfers per second range,
hopefully can support 8 or 9 concurrent DMAs plus execute code.

Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
A 15 ms seek costs 2.76 MB of lost disk transfer time.
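
For reference, the arithmetic behind those two figures, using the 184 MB/s
sustained rate quoted above (Python; decimal bytes, as drive vendors count them):

capacity = 1e12            # 1 TB source/dest
rate     = 184e6           # 184 MB/s sustained transfer
seek     = 0.015           # assumed 15 ms seek

print(capacity / rate)     # ~5435 s, i.e. about 90 min to read or write the drive
print(seek * rate / 1e6)   # ~2.76 MB of transfer forgone per 15 ms seek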

What I've been trying to work out is what is the optimum order of
operations that keeps as many disks busy as possible doing overlapped,
concurrent reads, sort, merges, and writes such that the final merge
file write can start the instant the last source record is read.

Re: OoO S/360 descendants

<Np8%I.119318$o45.36113@fx46.iad>

https://www.novabbs.com/devel/article-flat.php?id=20413&group=comp.arch#20413
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx46.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
In-Reply-To: <547%I.12692$dI3.3254@fx10.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 38
Message-ID: <Np8%I.119318$o45.36113@fx46.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 11 Sep 2021 20:34:53 UTC
Date: Sat, 11 Sep 2021 16:34:32 -0400
X-Received-Bytes: 2946
 by: EricP - Sat, 11 Sep 2021 20:34 UTC

EricP wrote:
> Stephen Fuld wrote:
>
>> The main determinant of elapsed time in an external sort is the number
>> of merge passes. As the size of the sorted runs goes up, the number
>> of passes goes down, and as we discussed in Anton's example, even when
>> you get to the final merge pass, having fewer runs allows fewer I/Os.
>> So, yes, you lose concurrency, but the longer runs may gain in elapsed
>> time by doing less work.
>
> Yes, in part. But for the final merge file write, if there are 8 temp
> and 1 dest HDD then we can read and merge far faster than write.
>
> Looking at a randomly selected Enterprise disk from Western Digital,
> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
> and a 8 TB SATA disk is 255 MB/s.
> They don't give seek time anymore but I'll assume it's 15-20 ms average.
>
> The system bus which operate in the 6 giga-transfers per second range
> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>
> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
> A 15 ms seek costs 2.76 MB of lost disk transfer time.

Also I'm assuming a sequential read or write has 0 seek time,
just transfer time, as the disk controller remembers where the head is,
and the disks are private.

> What I've been trying to work out is what is the optimum order of
> operations that keeps as many disks busy as possible doing overlapped,
> concurrent reads, sort, merges, and writes such that the final merge
> file write can start the instant the last source record is read.

This is a bit like Towers of Hanoi but with 8 pegs.
I'm also reminded of a PERT chart for evaluating project schedules.

Re: OoO S/360 descendants

<shorsb$48h$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20436&group=comp.arch#20436
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Mon, 13 Sep 2021 17:54:33 -0700
Organization: A noiseless patient Spider
Lines: 120
Message-ID: <shorsb$48h$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 00:54:36 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2033d2784763cde1b4479e4ba38020e8";
logging-data="4369"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3SSiymreVhWhFSn6SyM7rTQyFzLBKprM="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:VlQRipORinx5g0n6JfQPF4DJ4+k=
In-Reply-To: <547%I.12692$dI3.3254@fx10.iad>
Content-Language: en-US
 by: Stephen Fuld - Tue, 14 Sep 2021 00:54 UTC

On 9/11/2021 12:02 PM, EricP wrote:
> Stephen Fuld wrote:
>> On 9/9/2021 7:14 AM, EricP wrote:
>>
>> snip
>>
>>> Combining the "external sort" approach with multiple rotating work
>>> buffers,
>>> buffer A is doing an async read, B is sorting, C is async writing.
>>> When all are done, rotate the buffers.
>>> Balance the buffer size, read and write IO times vs sort time.
>>> That keeps 1 cpu/thread and 2 disks busy.
>>
>> I am not sure that is the best approach.  If you have N bytes of main
>> memory, you have divided it into three equal sized pieces (in order to
>> be able to "rotate them".  That means the length of the run will be
>> about 1/3 of what it would be if you used all of the memory as one big
>> buffer.  Read data into that buffer, then sort it, then write it out.
>
> My suggestion keeps the bottleneck source HDD continuously busy

No, it doesn't. See below.

> and
> parallels the read of the source, the write of the temp, and sort.
> That gets 2 IO going at once (actually more because I'd start the
> merges before the source read was finished).
>
> Your suggestion only utilizes the source disk less than 50% because
> it serializes everything.
>
> The dest drive sequential file is the other bottleneck because it can't
> start to be written until the last record is read from the source
> (that last record might sort first in the dest file).
>
> A "minimum" sort time would start final merge and dest file write
> immediately after reading the last source record, and continuously
> stream the output as fast as the dest disk can take it.
>
>> The main determinant of elapsed time in an external sort is the number
>> of merge passes.  As the size of the sorted runs goes up, the number
>> of passes goes down, and as we discussed in Anton's example, even when
>> you get to the final merge pass, having fewer runs allows fewer I/Os.
>> So, yes, you lose concurrency, but the longer runs may gain in elapsed
>> time by doing less work.
>
> Yes, in part. But for the final merge file write, if there are 8 temp
> and 1 dest HDD then we can read and merge far faster than write.

Probably not.

> Looking at a randomly selected Enterprise disk from Western Digital,
> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
> and a 8 TB SATA disk is 255 MB/s.
> They don't give seek time anymore but I'll assume it's 15-20 ms average.

OK. But remember, that is seek plus rotational latency. For a 7,200
RPM drive, the latency averages 4.17 ms.

> The system bus which operate in the 6 giga-transfers per second range
> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>
> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
> A 15 ms seek costs 2.76 MB of lost disk transfer time.
>
> What I've been trying to work out is what is the optimum order of
> operations that keeps as many disks busy as possible doing overlapped,
> concurrent reads, sort, merges, and writes such that the final merge
> file write can start the instant the last source record is read.

OK, let's split the analysis into two parts, the sort phase and the
merge phase. As we will see, your proposal may (there is not enough
information to tell) win on the sort phase, but it will lose big on the
merge phase.

For the sort phase, your big error is assuming you can keep the source
disk 100% busy. This is due to the rotational latency. After you
complete a read, you can't start processing the next read during the
inter-record gap, so you will slip a revolution. Then there will be a
"pause" of 8.33 ms between the end of one transfer and the start of the
next. Of course, some of this is overlapped with the command processing
time.

Both your proposal and mine will suffer this pause, but your proposal
suffers it three times as often, as it requires three times the number
of reads. Assuming you stripe the sorted output blocks across the
(assumed) 8 intermediate drives, it is just too hard to figure out how
much of the 8.33 ms after the last write to that drive is used, so let's
just assume it is random, i.e. 4.17 ms. Once again, your proposal
suffers from three times as many of those as my proposal.
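
Under these assumptions the latency overhead of the sort phase scales directly
with the number of buffer fills (a sketch; the 1 TB and 128 MB sizes are
illustrative, not from the thread):

def sort_phase_latency(data_bytes, buf_bytes, read_pause=0.00833, write_pause=0.00417):
    fills = data_bytes / buf_bytes              # one read + one write per buffer fill
    return fills * (read_pause + write_pause)   # seconds of pure latency, no transfer

D = 1_000_000_000_000                           # 1 TB source
M = 128 * 2**20                                 # 128 MB of memory
print(sort_phase_latency(D, M))                 # whole memory as one buffer
print(sort_phase_latency(D, M / 3))             # memory split three ways: ~3x the overhead
# Whether the extra overlap of the split-memory scheme wins that back is exactly
# the open question below.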

So, your proposal gains on parallelism, but loses on disk latency. We
just don't have enough specifics to know which effect is larger. :-(

I should also note that your proposal uses more CPU time, both for
processing three times as many read and write commands, and on the sort
itself. I.e. it costs less CPU time to sort one block of length 3N than
three blocks of length N. It's not clear if this affects elapsed time.

OK, let's move on to the merge phase. As I went into in response to
Anton, the benefit of the 3X longer runs, which means 1/3 the number of
runs to merge, really shows up here. In summary, the fewer the number of
runs to merge, the more of each run you can keep in main memory, so the
fewer reads to get it into memory. Your proposal requires three times
the number of I/Os to the intermediate disks, which means 3 times the
full (assumed) 15-20 ms seek plus latency (the transfer time is, of
course, the same).

You mentioned above that the reads of the intermediate disks would be
faster than the writes to the output drive. The problem is that you are
paying the full seek plus latency on the reads, but only the latency on
the writes. In fact, they will be serialized, as you have to complete
the write to free up the memory to be able to do the read.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<j710J.25224$rsCb.8927@fx01.iad>

https://www.novabbs.com/devel/article-flat.php?id=20444&group=comp.arch#20444
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx01.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shorsb$48h$1@dont-email.me>
In-Reply-To: <shorsb$48h$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 212
Message-ID: <j710J.25224$rsCb.8927@fx01.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 14 Sep 2021 13:06:23 UTC
Date: Tue, 14 Sep 2021 09:05:32 -0400
X-Received-Bytes: 11135
 by: EricP - Tue, 14 Sep 2021 13:05 UTC

Stephen Fuld wrote:
> On 9/11/2021 12:02 PM, EricP wrote:
>> Stephen Fuld wrote:
>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>
>>> snip
>>>
>>>> Combining the "external sort" approach with multiple rotating work
>>>> buffers,
>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>> When all are done, rotate the buffers.
>>>> Balance the buffer size, read and write IO times vs sort time.
>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>
>>> I am not sure that is the best approach. If you have N bytes of main
>>> memory, you have divided it into three equal sized pieces (in order
>>> to be able to "rotate them". That means the length of the run will
>>> be about 1/3 of what it would be if you used all of the memory as one
>>> big buffer. Read data into that buffer, then sort it, then write it
>>> out.
>>
>> My suggestion keeps the bottleneck source HDD continuously busy
>
> No, it doesn't. See below.

I will try to change your conclusion.

>> and
>> parallels the read of the source, the write of the temp, and sort.
>> That gets 2 IO going at once (actually more because I'd start the
>> merges before the source read was finished).
>>
>> Your suggestion only utilizes the source disk less than 50% because
>> it serializes everything.
>>
>> The dest drive sequential file is the other bottleneck because it can't
>> start to be written until the last record is read from the source
>> (that last record might sort first in the dest file).
>>
>> A "minimum" sort time would start final merge and dest file write
>> immediately after reading the last source record, and continuously
>> stream the output as fast as the dest disk can take it.
>>
>>> The main determinant of elapsed time in an external sort is the
>>> number of merge passes. As the size of the sorted runs goes up, the
>>> number of passes goes down, and as we discussed in Anton's example,
>>> even when you get to the final merge pass, having fewer runs allows
>>> fewer I/Os. So, yes, you lose concurrency, but the longer runs may
>>> gain in elapsed time by doing less work.
>>
>> Yes, in part. But for the final merge file write, if there are 8 temp
>> and 1 dest HDD then we can read and merge far faster than write.
>
> Probably not.
>
>> Looking at a randomly selected Enterprise disk from Western Digital,
>> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
>> and a 8 TB SATA disk is 255 MB/s.
>> They don't give seek time anymore but I'll assume it's 15-20 ms average.
>
> OK. But remember, that is seek plus rotational latency. For a 7,200
> RPM drive, the latency averages 4.17 ms.

I assume the seek time includes the rotational latency.
I don't see any reason to separate it out.

>> The system bus which operate in the 6 giga-transfers per second range
>> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>>
>> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
>> A 15 ms seek costs 2.76 MB of lost disk transfer time.
>>
>> What I've been trying to work out is what is the optimum order of
>> operations that keeps as many disks busy as possible doing overlapped,
>> concurrent reads, sort, merges, and writes such that the final merge
>> file write can start the instant the last source record is read.
>
> OK, let's split the analysis into two parts, the sort phase and the
> merge phase.

I would overlap some of the merge with the sort phase.
Those temp disks are mostly just sitting there - might as well use them.
But go on.

> As we will see, your proposal may (there is not enough
> information to tell) win on the sort phase, it will lose big on the
> merge phase.
>
> For the sort phase, your big error is assuming you can keep the source
> disk 100% busy. This is due to the rotational latency. After you
> complete a read, you can't start processing the next read during the
> inter-record gap, so you will slip a revolution. The there will be a
> "pause" of 8.33 ms between the end of one transfer and the start of the
> next. Of course, some of this is overlapped with the command processing
> time.

Understood.
But this is why it is important to do contiguous sequential reads
on the source. The drive actually reads and caches the whole track,
and only transfers the sectors you asked for. If you do another read
to the same track there is zero seek/rotate time, just transfer time.

I also believe the drive manufacturer's specified "sustained" transfer
rate must already include minor sequential track seek time
for it to be a sustained rate.

Also I would try to order the actual async read operations such that
there is always a next read queued before the current read finishes.
Most SATA drives now incorporate Native Command Queuing so
those commands could even already be in the drive controller.

For these reasons it is reasonable to believe that multiple reads
to the source file's contiguous sequential logical block numbers
should stream at the drive's specified "max sustained transfer rate"
with no interruptions.
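
A minimal sketch of that ordering in host code (Python; the chunk size and
file handling are illustrative): submit the next chunk read before working on
the current one, so the drive's queue never drains.

import concurrent.futures

CHUNK = 32 * 2**20      # 32 MB chunks

def read_chunk(path, offset, size=CHUNK):
    with open(path, "rb") as f:     # a real version would keep the fd open
        f.seek(offset)
        return f.read(size)

def stream(path, process):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending, offset = pool.submit(read_chunk, path, 0), CHUNK
        while True:
            data = pending.result()
            if not data:
                break
            pending = pool.submit(read_chunk, path, offset)   # next read queued first
            offset += CHUNK
            process(data)           # sort/merge work overlaps the queued read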

> Both your proposal and mine will suffer this pause, but your proposal
> suffers it three times as often, as it requires three times the number
> of reads. Assuming you stripe the sorted output blocks across the
> (assumed) 8 intermediate drives, it is just too hard to figure out how
> much of the 8.33 ms after the last write to that drive is used, so let's
> just assume it is random, i.e. 4.17 ms. Once again, your proposal
> suffers from three times as many of those as my proposal.

Understood. But there are two seeks, the source read and the temp write.
I explained why I believe the source reads can be streamed at the
specified sustained transfer rate with no interruptions.

The temp writes do seek, but I would write fragments to multiple disks
so the seek time can be overlapped with writes to other disks.
Plus I'd be merging earlier fragments in parallel
with writes of new fragments - ideally the source and
all 8 temp disks are seeking, reading or writing at all times.
This is the parallel Towers of Hanoi I referred to.

Also I didn't say this before but should mention,
I'd preallocate all temp files as contiguous disk blocks
at the run start so there is no file system interaction,
and no reason for the file system to move the disk head
for meta data reading or writing.
Then manage the assignment of fragments to file blocks myself.
This helps ensure the disk heads are always where I left them.

> So, your proposal gains on parallelism, but loses on disk latency. We
> just don't have enough specifics to know which effect is larger. :-(
>
> I should also note that your proposal uses more CPU time, both for
> processing three times as many read and write commands, and on the sort
> itself. I.e. it costs less CPU time to sort one block of length 3N as
> three blocks of length N. Its not clear if this effects elapsed time.

I disagree with this.

IO time approximates T = A + B*size, where A covers the OS time to queue
an IO plus the disk seek time, and B covers the per-byte costs: the OS
time pinning pages, setting up IOMMU page maps and tearing them down
afterwards, and the disk block transfers themselves.

Time to read, say, 32 MB should therefore be dominated by the per-size
terms: the transfer alone is 32 MB/184 MB/s = 173 ms, or roughly 10
times the seek time, if any seek was needed.
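
Plugging the 32 MB chunk into that model with the same assumptions
(illustrative values for A and B):

A = 0.015                  # ~15 ms fixed cost: queueing plus seek, if any
B = 1 / 184e6              # per-byte cost dominated by the 184 MB/s transfer rate
size = 32e6                # 32 MB chunk
print(A + B * size)        # ~0.189 s total; the ~0.174 s transfer term dominates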

For the sort, it doesn't matter if it is more expensive, as long as it is
not on the critical path and completes before the reads and writes it runs
alongside. The goal is minimum elapsed time.

> OK, let's move on to the merge phase. As I went into in response to
> Anton, the benefits of the 3X longer runs, which means 1/3 the number of
> runs to merge really shows up here. In summary, the fewer the number of
> runs to merge, the more of each run you can keep in main memory, so the
> fewer reads to get it into memory. Your proposal requires three times
> the number of I/Os, to the intermediate disks which means 3 times the
> full (assumed) 15-20 ms seek plus latency (the transfer time is, of
> course, the same).

I would agree if that were solely what I proposed.
But I also said I'd start part of the merge during the sort phase.
That keeps many disks and CPUs busy concurrently.
This is the parallel Towers of Hanoi part.

So the question is, is there an order to the parallel Towers of Hanoi
that results in the shortest elapsed time.

> You mentioned above that the reads of the intermediate disks would be
> faster than the writes to the output drive. The problem is that you are
> paying the full seek plus latency on the reads, but only the latency on
> the writes. In fact, they will be serialized, as you have to complete
> the write to free up the memory to be able to do the read.


Re: OoO S/360 descendants

<DY10J.22723$6u6.16167@fx03.iad>

https://www.novabbs.com/devel/article-flat.php?id=20445&group=comp.arch#20445
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx03.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shorsb$48h$1@dont-email.me> <j710J.25224$rsCb.8927@fx01.iad>
In-Reply-To: <j710J.25224$rsCb.8927@fx01.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 74
Message-ID: <DY10J.22723$6u6.16167@fx03.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 14 Sep 2021 14:03:15 UTC
Date: Tue, 14 Sep 2021 10:02:31 -0400
X-Received-Bytes: 4424
 by: EricP - Tue, 14 Sep 2021 14:02 UTC

EricP wrote:
> Stephen Fuld wrote:
>> On 9/11/2021 12:02 PM, EricP wrote:
>>> Stephen Fuld wrote:
>>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>>
>>>> snip
>>>>
>>>>> Combining the "external sort" approach with multiple rotating work
>>>>> buffers,
>>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>>> When all are done, rotate the buffers.
>>>>> Balance the buffer size, read and write IO times vs sort time.
>>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>>
>>>> I am not sure that is the best approach. If you have N bytes of
>>>> main memory, you have divided it into three equal sized pieces (in
>>>> order to be able to "rotate them". That means the length of the run
>>>> will be about 1/3 of what it would be if you used all of the memory
>>>> as one big buffer. Read data into that buffer, then sort it, then
>>>> write it out.
>>>
>>> My suggestion keeps the bottleneck source HDD continuously busy
>>
>> No, it doesn't. See below.
>
> I will try to change your conclusion.

Let me walk through examples of the two scenarios.
Let's say I have 128 MB of memory, 1 source and 8 temp disks.
Let's say a "chunk" is 32 MB and that it takes 1 time "tick"
to read, sort, or write one chunk.

The big-fragment way uses all 128 MB memory to read 1 fragment of
size 4 chunks, sort 4, write 4 to temp disk.
Elapsed time: 12 ticks for 1 fragment of size 4 chunks.

The small-pipelined way splits memory 4 ways, 32 MB for each concurrent
read, sort, write of the source, plus the last 32 MB for concurrent merge.

In the following table, Rn means Read chunk n, Sn Sort chunk n,
Wn Write n, MWabc Merge chunks a b c and Write as one fragment.
Dn are temp disks 1..8

Tick Source Cpu D1 D2 D3 D4 D5 D6 D7 D8
1 R1 <-- Read chunk 1
2 R2 S1 <-- Sort chunk 1
3 R3 S2 W1 <-- Write chunk 1
4 R4 S3 W2
5 R5 S4 W3
6 R6 S5 R1 R2 R3 W4 <-- Read 123 for merge starts
7 R7 S6 R1 R2 R3 MW123 W5 <-- Merge Write to D4 starts
8 R8 S7 R1 R2 R3 MW123 W6
9 R9 S8 W7 R4 R5 R6 <-- Read 456 for merge starts
10 R10 S9 W8 R4 R5 R6 MW456 <-- Merge Write to D8 starts
11 R11 S10 W9 R4 R5 R6 MW456
12 R12 S11 R7 R8 R9 W10 MW456 <-- Read 789 for merge starts

At the end of 12 ticks the pipelined approach will have:
- read 12 chunks of 32 MB, sorted 11, written 10
= 320 MB transferred from source to temp disk.
- merged chunks 1,2,3 and 4,5,6 into two 96 MB fragments.
- started the merge on 7,8,9 which completes at tick 15.

The question is how to continue this and merge multiple 96 MB fragments,
say 2, 3 or 4, while continuing with the above.

The goal is to wind up with the minimum number of pre-merged fragments,
distributed across the disks, such that there are few enough fragments
so that the final merge can stream its output at full speed.
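
A tiny model of the two schedules (one tick per read, sort, or write of a
32 MB chunk; the merge traffic on the temp disks is left out for simplicity):

def big_fragment_ticks(n_chunks):
    return 3 * n_chunks        # read, sort, write each fragment serially: 3 ticks/chunk

def pipelined_ticks(n_chunks):
    return n_chunks + 2        # read chunk i at tick i, sort at i+1, write at i+2

for n in (4, 12, 40):
    print(n, big_fragment_ticks(n), pipelined_ticks(n))
# 4 chunks: 12 vs 6 ticks; 12 chunks: 36 vs 14 -- the same 12-ticks-per-fragment
# figure as above, versus a source disk that is never left idle.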

Re: OoO S/360 descendants

<r820J.12892$dI3.10242@fx10.iad>

https://www.novabbs.com/devel/article-flat.php?id=20446&group=comp.arch#20446
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx10.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shorsb$48h$1@dont-email.me> <j710J.25224$rsCb.8927@fx01.iad>
In-Reply-To: <j710J.25224$rsCb.8927@fx01.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 78
Message-ID: <r820J.12892$dI3.10242@fx10.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 14 Sep 2021 14:15:51 UTC
Date: Tue, 14 Sep 2021 10:15:32 -0400
X-Received-Bytes: 4499
 by: EricP - Tue, 14 Sep 2021 14:15 UTC

Updated table... missed the last merge-write MW123 to D4 at tick 9.

EricP wrote:
> Stephen Fuld wrote:
>> On 9/11/2021 12:02 PM, EricP wrote:
>>> Stephen Fuld wrote:
>>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>>
>>>> snip
>>>>
>>>>> Combining the "external sort" approach with multiple rotating work
>>>>> buffers,
>>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>>> When all are done, rotate the buffers.
>>>>> Balance the buffer size, read and write IO times vs sort time.
>>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>>
>>>> I am not sure that is the best approach. If you have N bytes of
>>>> main memory, you have divided it into three equal sized pieces (in
>>>> order to be able to "rotate them". That means the length of the run
>>>> will be about 1/3 of what it would be if you used all of the memory
>>>> as one big buffer. Read data into that buffer, then sort it, then
>>>> write it out.
>>>
>>> My suggestion keeps the bottleneck source HDD continuously busy
>>
>> No, it doesn't. See below.
>
> I will try to change your conclusion.

Let me walk through examples of the two scenarios.
Let's say I have 128 MB of memory, 1 source and 8 temp disks.
Let's say a "chunk" is 32 MB and that it takes 1 time "tick"
to read, sort, or write one chunk.

The big-fragment way uses all 128 MB memory to read 1 fragment of
size 4 chunks, sort 4, write 4 to temp disk.
Elapsed time: 12 ticks for 1 fragment of size 4 chunks.

The small-pipelined way splits memory 4 ways, 32 MB for each concurrent
read, sort, write of the source, plus the last 32 MB for concurrent merge.

In the following table, Rn means Read chunk n, Sn Sort chunk n,
Wn Write n, MWabc Merge chunks a b c and Write as one fragment.
Dn are temp disks 1..8

Tick Source Cpu D1 D2 D3 D4 D5 D6 D7 D8
1 R1 <-- Read chunk 1
2 R2 S1 <-- Sort chunk 1
3 R3 S2 W1 <-- Write chunk 1
4 R4 S3 W2
5 R5 S4 W3
6 R6 S5 R1 R2 R3 W4 <-- Read 123 for merge starts
7 R7 S6 R1 R2 R3 MW123 W5 <-- Merge Write to D4 starts
8 R8 S7 R1 R2 R3 MW123 W6
9 R9 S8 W7 MW123 R4 R5 R6 <-- Read 456 for merge starts
10 R10 S9 W8 R4 R5 R6 MW456 <-- Merge Write to D8 starts
11 R11 S10 W9 R4 R5 R6 MW456
12 R12 S11 R7 R8 R9 W10 MW456 <-- Read 789 for merge starts

At the end of 12 ticks the pipelined approach will have:
- read 12 chunks of 32 MB, sorted 11, written 10
= 320 MB transferred from source to temp disk.
- merged chunks 1,2,3 and 4,5,6 into two 96 MB fragments.
- started the merge on 7,8,9 which completes at tick 15.

The question is how to continue this and merge multiple 96 MB fragments,
say 2, 3 or 4, while continuing with the above.

The goal is to wind up with the minimum number of pre-merged fragments,
distributed across the disks, such that there are few enough fragments
so that the final merge can stream its output at full speed.

Re: OoO S/360 descendants

<shqf88$186v$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=20448&group=comp.arch#20448
Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Tue, 14 Sep 2021 17:31:20 +0200
Organization: Aioe.org NNTP Server
Message-ID: <shqf88$186v$1@gioia.aioe.org>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shorsb$48h$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="41183"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 14 Sep 2021 15:31 UTC

Stephen Fuld wrote:
> OK, let's split the analysis into two parts, the sort phase and the
> merge phase.  As we will see, your proposal may (there is not enough
> information to tell) win on the sort phase, it will lose big on the
> merge phase.

I agree with your analysis, i.e. have as few merge sources as possible,
but I think you are wrong here:
>
> For the sort phase, your big error is assuming you can keep the source
> disk 100% busy.  This is due to the rotational latency.  After you
> complete a read, you can't start processing the next read during the
> inter-record gap, so you will slip a revolution.  The there will be a
> "pause" of 8.33 ms between the end of one transfer and the start of the
> next.  Of course, some of this is overlapped with the command processing
> time.

This is where I'm pretty sure that the sustained read/write rate
includes the skip from one cylinder to the next, and that to minimize
these the drives are formatted with a skew between each pair of
cylinders, as well as a skew from one platter to the next.

It makes perfect sense for disk manufacturers to make this simple
optimization since it doesn't cost them anything; it just gives a free
improvement in the full disk read/write times.

I believe that they will also start reading from the first sector which
is readable, so that the read rate would be optimal whether or not they
employ skew, i.e. just the minimum possible head-to-head or
cylinder-to-cylinder settle time plus half a sector.

I know that this was done way back in the floppy disk era, before having
a full track buffer was effectively free.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: OoO S/360 descendants

<shqluf$tnc$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20450&group=comp.arch#20450
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Tue, 14 Sep 2021 10:25:32 -0700
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <shqluf$tnc$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shorsb$48h$1@dont-email.me> <shqf88$186v$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 17:25:35 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="728b35c2b6000ff47e06eadc4d9a326b";
logging-data="30444"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18KdCVxI6oNmezgbUOIsq9Mw/2STV1Mv44="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:jRtznq3PtWbNXDVTGHigHXrJQnw=
In-Reply-To: <shqf88$186v$1@gioia.aioe.org>
Content-Language: en-US
 by: Stephen Fuld - Tue, 14 Sep 2021 17:25 UTC

On 9/14/2021 8:31 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> OK, let's split the analysis into two parts, the sort phase and the
>> merge phase.  As we will see, your proposal may (there is not enough
>> information to tell) win on the sort phase, it will lose big on the
>> merge phase.
>
> I agree with your analysis, i.e. have as few merge sources as possible,
> but I think you are wrong here:
>>
>> For the sort phase, your big error is assuming you can keep the source
>> disk 100% busy.  This is due to the rotational latency.  After you
>> complete a read, you can't start processing the next read during the
>> inter-record gap, so you will slip a revolution.  The there will be a
>> "pause" of 8.33 ms between the end of one transfer and the start of
>> the next.  Of course, some of this is overlapped with the command
>> processing time.
>
> This is where I'm pretty sure that the sustained read/write rate
> includes the skip from one cylinder to the next, and that to minimize
> these the drives are formatted with a skew between each pair of
> cylinders, as well as a skew from one platter to the next.

Yes, that is true. But that wasn't what I was talking about. Once the
transfer starts, head and cylinder skew keep the sustained rate pretty
constant. I was talking about the initial latency before the start of
the first byte transferred. See below.

>
> It makes perfect sense for disk manufactureres to make this simple
> optimization since it doesn't cost them anything, it just gives a free
> improvement in the full disk read/write times.
>
> I belive that they will also start reading from the first sector which
> is readable, so that the read rate would be optimal no matter if they
> employ skew or not, i.e. just the minimum possible head to head or
> cylinder to cylinder settle time plus half a sector.

For all of this analysis, I was assuming an unbuffered disk. While all
modern disks have a large buffer/cache, including it complicates the
analysis because you would then have to get into cache size and
organizations on the drives, differences between transfer rate between
the buffer and the disk versus between the host and the buffer, etc.
This makes doing the analysis pretty much impossible without a lot more
data from the drive manufacturer.

Without a buffer, when the drive first comes on cylinder at the
completion of the seek, you have to wait for the record you want to come
under the read/write heads, i.e. you can't just start reading wherever
the disk is and pick up the beginning of the transfer when the disk
spins around, as you have no place to store the end part of the track.
That is the rotational latency. And as Eric said, today it is usually
included in the seek time specified (if it is) by the manufacturer, but
given the RPM of the drive, it is trivially calculated and separated out
for analysis. If the drive is unbuffered, it is necessary to consider
the latency, as you pay for it even if the drive is on the requested
cylinder.
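
For reference, the average rotational latency is half a revolution, so it
follows directly from the spindle speed:

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, round(0.5 * 60.0 / rpm * 1000, 2), "ms")   # 7200 RPM -> 4.17 ms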

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<Fb50J.83957$F26.72380@fx44.iad>

https://www.novabbs.com/devel/article-flat.php?id=20455&group=comp.arch#20455
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx44.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shorsb$48h$1@dont-email.me> <shqf88$186v$1@gioia.aioe.org>
In-Reply-To: <shqf88$186v$1@gioia.aioe.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 58
Message-ID: <Fb50J.83957$F26.72380@fx44.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 14 Sep 2021 17:44:05 UTC
Date: Tue, 14 Sep 2021 13:42:17 -0400
X-Received-Bytes: 4011
 by: EricP - Tue, 14 Sep 2021 17:42 UTC

Terje Mathisen wrote:
> Stephen Fuld wrote:
>> OK, let's split the analysis into two parts, the sort phase and the
>> merge phase. As we will see, your proposal may (there is not enough
>> information to tell) win on the sort phase, it will lose big on the
>> merge phase.
>
> I agree with your analysis, i.e. have as few merge sources as possible,
> but I think you are wrong here:
>>
>> For the sort phase, your big error is assuming you can keep the source
>> disk 100% busy. This is due to the rotational latency. After you
>> complete a read, you can't start processing the next read during the
>> inter-record gap, so you will slip a revolution.  Then there will be a
>> "pause" of 8.33 ms between the end of one transfer and the start of
>> the next. Of course, some of this is overlapped with the command
>> processing time.
>
> This is where I'm pretty sure that the sustained read/write rate
> includes the skip from one cylinder to the next, and that to minimize
> these the drives are formatted with a skew between each pair of
> cylinders, as well as a skew from one platter to the next.
>
> It makes perfect sense for disk manufacturers to make this simple
> optimization since it doesn't cost them anything, it just gives a free
> improvement in the full disk read/write times.
>
> I believe that they will also start reading from the first sector which
> is readable, so that the read rate would be optimal no matter if they
> employ skew or not, i.e. just the minimum possible head to head or
> cylinder to cylinder settle time plus half a sector.
>
> I know that this was done way back in the floppy disk era, before having
> a full track buffer was effectively free.
>
> Terje

Also, SATA and other disk control protocols optionally support
Native Command Queuing (NCQ), which allows up to 32 commands
to be transferred into the controller at once.
This allows the disk controller to make local optimizations and
choose the next command to begin immediately after the current one.

NCQ means potentially any mix of DMAs to 32 different process address
spaces taking place concurrently, or 32 IOs to the same address space,
or any interleaving of the above.

A SATA-3.0 link can do 6 Gb/s, 600 MB/s transfer rate.
The disk I was looking at had 128 MB of cache.
Such a disk controller could easily DMA to read/write a surface
while servicing multiple other cached DMA's.

While the current file IO is outstanding, one can use async overlapped IO
to prepare the next file op (inode lookups, pages pinned, IOMMU set up),
have it queued in the driver, and have it transferred via NCQ to the
controller before the first IO completes.
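
To make the "keep several requests outstanding" idea concrete, here is a
minimal user-space sketch in Python for a POSIX system (read_overlapped and
its 1 MiB request size are my own placeholders; whether the requests
actually reach the drive as NCQ commands is up to the driver and hardware):

import os
from concurrent.futures import ThreadPoolExecutor

QUEUE_DEPTH = 32       # mirrors NCQ's 32-command queue
BLOCK = 1 << 20        # 1 MiB per request, arbitrary for this sketch

def read_block(fd, offset):
    # pread is positioned, so concurrent calls on one fd don't fight
    # over a shared file offset.
    return offset, os.pread(fd, BLOCK, offset)

def read_overlapped(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            # For a terabyte file you would cap how far ahead you submit;
            # here every block is queued up front to keep the sketch short.
            futures = [pool.submit(read_block, fd, off)
                       for off in range(0, size, BLOCK)]
            for fut in futures:
                offset, data = fut.result()
                ...  # hand (offset, data) to the next stage here
    finally:
        os.close(fd)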

Re: OoO S/360 descendants

<shqq0u$rvq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20457&group=comp.arch#20457

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Tue, 14 Sep 2021 11:35:08 -0700
Organization: A noiseless patient Spider
Lines: 334
Message-ID: <shqq0u$rvq$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shorsb$48h$1@dont-email.me> <j710J.25224$rsCb.8927@fx01.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 18:35:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="728b35c2b6000ff47e06eadc4d9a326b";
logging-data="28666"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19qj81duxZR3DKwnavRKnVgXZuKQzaYzbU="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:HT9CN8YYOjfbaigELREF5wJjA8M=
In-Reply-To: <j710J.25224$rsCb.8927@fx01.iad>
Content-Language: en-US
 by: Stephen Fuld - Tue, 14 Sep 2021 18:35 UTC

On 9/14/2021 6:05 AM, EricP wrote:
> Stephen Fuld wrote:
>> On 9/11/2021 12:02 PM, EricP wrote:
>>> Stephen Fuld wrote:
>>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>>
>>>> snip
>>>>
>>>>> Combining the "external sort" approach with multiple rotating work
>>>>> buffers,
>>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>>> When all are done, rotate the buffers.
>>>>> Balance the buffer size, read and write IO times vs sort time.
>>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>>
>>>> I am not sure that is the best approach.  If you have N bytes of
>>>> main memory, you have divided it into three equal sized pieces (in
>>>> order to be able to "rotate them".  That means the length of the run
>>>> will be about 1/3 of what it would be if you used all of the memory
>>>> as one big buffer.  Read data into that buffer, then sort it, then
>>>> write it out.
>>>
>>> My suggestion keeps the bottleneck source HDD continuously busy
>>
>> No, it doesn't.  See below.
>
> I will try to change your conclusion.

Good. I enjoy an intellectual disagreement, as each of us might learn
something. :-)

>>> and
>>> parallels the read of the source, the write of the temp, and sort.
>>> That gets 2 IO going at once (actually more because I'd start the
>>> merges before the source read was finished).
>>>
>>> Your suggestion only utilizes the source disk less than 50% because
>>> it serializes everything.
>>>
>>> The dest drive sequential file is the other bottleneck because it can't
>>> start to be written until the last record is read from the source
>>> (that last record might sort first in the dest file).
>>>
>>> A "minimum" sort time would start final merge and dest file write
>>> immediately after reading the last source record, and continuously
>>> stream the output as fast as the dest disk can take it.
>>>
>>>> The main determinant of elapsed time in an external sort is the
>>>> number of merge passes.  As the size of the sorted runs goes up, the
>>>> number of passes goes down, and as we discussed in Anton's example,
>>>> even when you get to the final merge pass, having fewer runs allows
>>>> fewer I/Os.  So, yes, you lose concurrency, but the longer runs may
>>>> gain in elapsed time by doing less work.
>>>
>>> Yes, in part. But for the final merge file write, if there are 8 temp
>>> and 1 dest HDD then we can read and merge far faster than write.
>>
>> Probably not.
>>
>>> Looking at a randomly selected Enterprise disk from Western Digital,
>>> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
>>> and a 8 TB SATA disk is 255 MB/s.
>>> They don't give seek time anymore but I'll assume it's 15-20 ms average.
>>
>> OK.  But remember, that is seek plus rotational latency.  For a 7,200
>> RPM drive, the latency averages 4.17 ms.
>
> I assume the seek time includes the rotational latency.
> I don't see any reason to separate it out.

The reason to separate it out is that, if you are assuming unbuffered
disks, as I am (see my response to Terje later in this thread), you pay
the latency on each transfer, even if you have total control over where
the heads are and can reasonably assume zero seek time.

>>> The system bus which operate in the 6 giga-transfers per second range
>>> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>>>
>>> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
>>> A 15 ms seek costs 2.76 MB of lost disk transfer time.
>>>
>>> What I've been trying to work out is what is the optimum order of
>>> operations that keeps as many disks busy as possible doing overlapped,
>>> concurrent reads, sort, merges, and writes such that the final merge
>>> file write can start the instant the last source record is read.
>>
>> OK, let's split the analysis into two parts, the sort phase and the
>> merge phase.
>
> I would overlap some of the merge with the sort phase.
> Those temp disks are mostly just sitting there - might as well use them.

To preview my detailed response below, yes, the disks are not fully
used, but the memory is. If you use some of the memory for the merge
simultaneously with the sort, then less memory is available for the
sort, and the sorted runs will be shorter, hence more of them. That
means a higher N in the N-way merge, and thus more reads to bring them
back in during the merge phase.
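
To put rough numbers on that effect (sizes purely illustrative, and run
length taken simply as the memory devoted to the sort):

# Illustrative only: a 1 TB input file and 16 GB of sort memory.
FILE_GB, MEM_GB = 1000, 16

for label, run_gb in (("one big buffer", MEM_GB), ("1/3 of memory", MEM_GB / 3)):
    runs = -(-FILE_GB // run_gb)       # ceiling division: number of runs, N
    buf_mb = MEM_GB * 1024 / runs      # memory per run during the N-way merge
    reads = FILE_GB * 1024 / buf_mb    # roughly one read (and seek) per refill
    print(f"{label:15s}: N = {runs:4.0f} runs, {buf_mb:6.1f} MB per merge buffer, "
          f"~{reads:6.0f} merge reads")

Shrinking the sort workspace to a third triples N, cuts each merge buffer
to a third, and so roughly triples the reads (and seeks) in the merge phase.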

> But go on.
>
>> As we will see, your proposal may (there is not enough information to
>> tell) win on the sort phase, it will lose big on the merge phase.
>>
>> For the sort phase, your big error is assuming you can keep the source
>> disk 100% busy.  This is due to the rotational latency.  After you
>> complete a read, you can't start processing the next read during the
>> inter-record gap, so you will slip a revolution.  Then there will be a
>> "pause" of 8.33 ms between the end of one transfer and the start of
>> the next.  Of course, some of this is overlapped with the command
>> processing time.
>
> Understood.
> But this is why it is important to do contiguous sequential reads
> on the source. The drive actually reads and caches the whole track,
> and only transfers the sectors you asked for. If you do another read
> to the same track there is zero seek/rotate time, just transfer time.

You are assuming a buffered disk, which I specifically am not assuming.
In my response to Terje, I tried to explain that if you don't assume
an unbuffered disk, but instead (as you seem to imply below), assume a
current modern disk with all its smarts and large buffer/cache, the
analysis becomes very dependent upon the details of the caching
algorithm, etc. within the disk, and without more information from the
manufacturer, is almost impossible.

> I also believe the drive manufacturer's specified "sustained" transfer
> rate must already include minor sequential track seek time
> for it to be a sustained rate.

That's true.

>
> Also I would try to order the actual async read operations such that
> there is always a next read queued before the current read finishes.
> Most SATA drives now incorporate Native Command Queuing so
> those commands could even already be in the drive controller.

Again, that assumes a buffered drive with read ahead, which I was not
assuming. Note that if the drive has a read-ahead buffer, you don't
even need command queuing: assuming the buffer-to-host rate is
sufficiently larger than the disk-to-buffer rate, the time lost to
command processing is made up by the rate difference.

> For these reasons it is reasonable to believe that multiple reads
> to the source file's contiguous sequential logical block numbers
> should stream at the drives specified "max sustained transfer rate"
> with no interruptions.

If you assume buffered drives, then yes.

>> Both your proposal and mine will suffer this pause, but your proposal
>> suffers it three times as often, as it requires three times the number
>> of reads.  Assuming you stripe the sorted output blocks across the
>> (assumed) 8 intermediate drives, it is just too hard to figure out how
>> much of the 8.33 ms after the last write to that drive is used, so
>> let's just assume it is random, i.e. 4.17 ms.  Once again, your
>> proposal suffers from three times as many of those as my proposal.
>
> Understood. But there are two seeks, the source read and the temp write.
> I explained why I believe the source reads can be streamed at the
> specified sustained transfer rate with no interruptions.

Yes, with the buffered drive assumption.

> The temp writes do seek but I would write fragments to multiple disks
> so the seek time can be overlapped with writes to other disks.

I am not sure what you are saying here. Are you saying that, assuming
your initial sorted run is of size M/3, where M is the amount of memory
available for the buffer, that you are going to split this up such that
each run on disk will be (M/3)/8 long? You can do that, of course, and
you do overlap and thus reduce the time to write out the intermediate
disks in the sort phase, but you more than lose that during the merge
phase. See below.

> Plus I'd be merging earlier fragments in parallel
> with writes of new fragments - ideally the source and
> all 8 temp disks are seeking, reading or writing at all times.
> This is the parallel Towers of Hanoi I referred to.


Re: OoO S/360 descendants

<shrng9$8mr$1@z-news.wcss.wroc.pl>


https://www.novabbs.com/devel/article-flat.php?id=20469&group=comp.arch#20469

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.pionier.net.pl!pwr.wroc.pl!news.wcss.wroc.pl!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Wed, 15 Sep 2021 02:58:17 +0000 (UTC)
Organization: Politechnika Wroclawska
Lines: 132
Message-ID: <shrng9$8mr$1@z-news.wcss.wroc.pl>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
NNTP-Posting-Host: hera.math.uni.wroc.pl
X-Trace: z-news.wcss.wroc.pl 1631674697 8923 156.17.86.1 (15 Sep 2021 02:58:17 GMT)
X-Complaints-To: abuse@news.pwr.wroc.pl
NNTP-Posting-Date: Wed, 15 Sep 2021 02:58:17 +0000 (UTC)
Cancel-Lock: sha1:KfogUCXZxnl4VWodKENOddcYMcc=
User-Agent: tin/2.4.3-20181224 ("Glen Mhor") (UNIX) (Linux/4.19.0-10-amd64 (x86_64))
 by: antis...@math.uni.wroc.pl - Wed, 15 Sep 2021 02:58 UTC

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> Stephen Fuld wrote:
> > On 9/9/2021 7:14 AM, EricP wrote:
> >
> > snip
> >
> >> Combining the "external sort" approach with multiple rotating work
> >> buffers,
> >> buffer A is doing an async read, B is sorting, C is async writing.
> >> When all are done, rotate the buffers.
> >> Balance the buffer size, read and write IO times vs sort time.
> >> That keeps 1 cpu/thread and 2 disks busy.
> >
> > I am not sure that is the best approach. If you have N bytes of main
> > memory, you have divided it into three equal sized pieces (in order to
> > be able to "rotate them". That means the length of the run will be
> > about 1/3 of what it would be if you used all of the memory as one big
> > buffer. Read data into that buffer, then sort it, then write it out.
>
> My suggestion keeps the bottleneck source HDD continuously busy and
> parallels the read of the source, the write of the temp, and sort.
> That gets 2 IO going at once (actually more because I'd start the
> merges before the source read was finished).

The basic problem with your suggestion is that you base it on the lousy
wikipedia method. The classic approach is to use a tree or heap sort.
Each handles input as it comes in, so a rather small double buffer
is enough to keep the input disc busy. The sort "moves" data from one
buffer to the sort work area; in parallel the second buffer receives
disc data. Once one buffer is worked out you switch the roles of the
buffers. Moreover, the classic approach uses two trees (or heaps),
old and new. All records in the old tree are smaller than all records
in the new tree. Incoming records are put in the old tree if they are
in range of the old tree, otherwise they are put in the new tree. The
largest record from the old tree is sent to the output file. Note that
after each step the old tree either stays the same or shrinks, so
eventually it will shrink to nothing. Then the current output run will
end, the former new tree will take the role of the old tree, and the
new new tree will be empty. Also, a new output run will start. Assuming
a random distribution of records this tends to produce runs _twice_ as
long as work memory. The Wikipedia example had a file which is
9 times larger than work memory; with such a file it does not
make much difference whether you get a 9 way merge (as Wikipedia
suggests), a 27 way merge (your approach) or a 5 way merge
(expected with the classic approach). But the number of runs
affects the number of buffers you need and consequently the size
of the buffers. As long as the number of runs is bigger than the
number of available discs (which IMHO is the expected case)
there will be seeks during the merge, and bigger buffers mean
less time lost on seeks.

I wrote "moves" above because heapsort actually moves records.
Treesort just moves pointers and the records stay in their buffers,
so copying is only from the input buffer to the output buffer (once
the output is done, the buffer becomes ready for reuse as an input buffer).
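
A minimal sketch of that two-heap scheme in Python, in the usual
ascending-run form (output the smallest eligible record); memory_slots
stands in for however many records the sort workspace holds, and this is
the textbook idea rather than Wirth's code:

import heapq
import random
from itertools import islice

def replacement_selection(records, memory_slots, emit_run):
    # 'current' feeds the run being written; records that can no longer
    # join it wait in 'pending' for the next run.  With randomly ordered
    # keys the runs average about twice memory_slots in length.
    it = iter(records)
    current = list(islice(it, memory_slots))
    heapq.heapify(current)
    pending, run = [], []
    for rec in it:
        smallest = heapq.heappop(current)
        run.append(smallest)
        if rec >= smallest:          # still fits in the current ascending run
            heapq.heappush(current, rec)
        else:                        # smaller than what we just wrote: next run
            pending.append(rec)
        if not current:              # current run is finished, swap the heaps
            emit_run(run)
            run, current, pending = [], pending, []
            heapq.heapify(current)
    run.extend(sorted(current))      # drain whatever is left of the last run
    if run:
        emit_run(run)
    if pending:
        pending.sort()
        emit_run(pending)

runs = []
replacement_selection((random.random() for _ in range(50_000)), 1_000, runs.append)
print(len(runs), sum(len(r) for r in runs) / len(runs))  # ~26 runs, mean near 2000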

> Your suggestion only utilizes the source disk less than 50% because
> it serializes everything.

As explained, classically this is not the case: input, sorting and
output can work in parallel with only a moderate amount of memory
assigned to buffers.

> The dest drive sequential file is the other bottleneck because it can't
> start to be written until the last record is read from the source
> (that last record might sort first in the dest file).
>
> A "minimum" sort time would start final merge and dest file write
> immediately after reading the last source record, and continuously
> stream the output as fast as the dest disk can take it.

Well, starting the merge means reading the starting blocks of _all_ runs,
and you need enough memory for this. Your assumption of a single input
drive and a single output drive but multiple work drives makes the problem
potentially quite asymmetric, and in such an asymmetric setup there may
be some gain from making the final run smaller and using the freed memory
to start merging earlier. However, really big files will not
fit on a single disc, and splitting big files between multiple
discs is a standard performance trick, so your assumption is
_very_ unnatural.

> > of merge passes. As the size of the sorted runs goes up, the number of
> > passes goes down, and as we discussed in Anton's example, even when you
> > get to the final merge pass, having fewer runs allows fewer I/Os. So,
> > yes, you lose concurrency, but the longer runs may gain in elapsed time
> > by doing less work.
>
> Yes, in part. But for the final merge file write, if there are 8 temp
> and 1 dest HDD then we can read and merge far faster than write.
>
> Looking at a randomly selected Enterprise disk from Western Digital,
> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
> and a 8 TB SATA disk is 255 MB/s.
> They don't give seek time anymore but I'll assume it's 15-20 ms average.
>
> The system bus which operate in the 6 giga-transfers per second range
> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>
> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
> A 15 ms seek costs 2.76 MB of lost disk transfer time.

Which means that a 3 MB buffer will reduce the loss due to seeks to 50%.
With a reasonable amount of RAM, 3 MB buffers allow a 1000-way merge, which
should be enough for anything that fits on an 8 TB drive. Which in
turn means that there is no gain from using more than 2 work drives.
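
Spelling out that arithmetic with the numbers used above (184 MB/s
sustained, 15 ms assumed for seek plus latency):

RATE_MBPS = 184.0      # sustained rate of the quoted 1 TB drive
SEEK_MS   = 15.0       # assumed seek time, including latency

for buf_mb in (1, 3, 10, 30):
    transfer_ms = buf_mb / RATE_MBPS * 1000.0
    lost = SEEK_MS / (SEEK_MS + transfer_ms)   # fraction of time spent seeking
    print(f"{buf_mb:3d} MB buffer: {transfer_ms:5.1f} ms transfer, "
          f"{lost:4.0%} of the time lost to seeks")

A 3 MB buffer streams for about 16 ms, so roughly half the time goes to
the seek, and 1000 such buffers for a 1000-way merge need only about
3 GB of RAM.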

> What I've been trying to work out is what is the optimum order of
> operations that keeps as many disks busy as possible doing overlapped,
> concurrent reads, sort, merges, and writes such that the final merge
> file write can start the instant the last source record is read.

Other folks tried to go beyond the single input drive limit. As you
and others noted, the system bus can support more drives. Assume that
the time to transfer the file from disc is T (for simplicity I assume
equal read and write speed). Your assumption means that the sort time
is no better than 2T (input read + output write). A two-phase approach
means that data is read twice (once from the input, a second time from
the temporary files) and written twice (once to the temporary files, a
second time to the final output). With 8 drives and evenly split work
this can be done in 4T/8 = T/2, that is 4 times faster than your
approach. Of course, there will be some time lost to seeks, but a 50%
loss would still lead to doing the work twice as fast as you can.
Since the original input is essentially sequential, and the same holds
for the final output, one can use a much larger block size there,
limiting the time lost to seeks. Alternatively, one can connect more
discs than the bus bandwidth limit allows, so time spent seeking is
essentially free.
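
Plugging in the drive numbers quoted above (taking 1 TB as 1,000,000 MB;
purely illustrative):

RATE_MBPS = 184.0                   # 1 TB drive from the example above
FILE_MB   = 1_000_000               # ~1 TB
T = FILE_MB / RATE_MBPS             # one full pass: ~5435 s

single_in_out  = 2 * T              # bound by one input pass + one output pass
two_phase_8drv = 4 * T / 8          # read twice + write twice, over 8 drives
with_seek_loss = 2 * two_phase_8drv # pessimistic 50% of the time lost to seeks

print(f"T = {T:.0f} s (~{T/60:.1f} min)")
print(f"single source/dest bound : {single_in_out:.0f} s")
print(f"two-phase over 8 drives  : {two_phase_8drv:.0f} s "
      f"({with_seek_loss:.0f} s even with 50% seek loss)")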

--
Waldek Hebisch

Re: OoO S/360 descendants

<shrvtf$12vj$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=20471&group=comp.arch#20471

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Wed, 15 Sep 2021 07:21:50 +0200
Organization: Aioe.org NNTP Server
Message-ID: <shrvtf$12vj$1@gioia.aioe.org>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shorsb$48h$1@dont-email.me> <shqf88$186v$1@gioia.aioe.org>
<shqluf$tnc$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="35827"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Wed, 15 Sep 2021 05:21 UTC

Stephen Fuld wrote:
> On 9/14/2021 8:31 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> OK, let's split the analysis into two parts, the sort phase and the
>>> merge phase.  As we will see, your proposal may (there is not enough
>>> information to tell) win on the sort phase, it will lose big on the
>>> merge phase.
>>
>> I agree with your analysis, i.e. have as few merge sources as
>> possible, but I think you are wrong here:
>>>
>>> For the sort phase, your big error is assuming you can keep the
>>> source disk 100% busy.  This is due to the rotational latency.
>>> After you complete a read, you can't start processing the next read
>>> during the inter-record gap, so you will slip a revolution.  Then
>>> there will be a "pause" of 8.33 ms between the end of one transfer
>>> and the start of the next.  Of course, some of this is overlapped
>>> with the command processing time.
>>
>> This is where I'm pretty sure that the sustained read/write rate
>> includes the skip from one cylinder to the next, and that to minimize
>> these the drives are formatted with a skew between each pair of
>> cylinders, as well as a skew from one platter to the next.
>
>
> Yes, that is true.  But that wasn't what I was talking about.  Once the
> transfer starts, head and cylinder skew keep the sustained rate pretty
> constant.  I was talking about the initial latency before the start of
> the first byte transferred.  See below.

OK, that's fine, we are in complete agreement here!

I just make sure that I always read/write disk data in sufficiently
large streaming blocks that the initial seek cannot dominate. I.e. at
the very minimum I will have a block stream time (i.e. size) which is
about the same as the seek time. Back in the 100 MB/s days this meant at
least 1 MB/block in order to match a 10 ms seek time. Increase for
higher transfer rates, decrease for shorter seeks.

For our particular example, any chunk size of 10 MB or larger will be
enough to drown the initial seek overhead.
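
The same rule of thumb as a couple of lines of Python (the 0.5 default
just means "streaming time at least equal to the seek time"; the
184 MB/s and 15 ms figures are the ones assumed earlier in the thread):

def block_mb_for(rate_mb_s, seek_ms, max_seek_fraction=0.5):
    # Smallest block whose streaming time keeps the seek overhead at or
    # below max_seek_fraction of the total time for that block.
    stream_s = (seek_ms / 1000.0) * (1 - max_seek_fraction) / max_seek_fraction
    return rate_mb_s * stream_s

print(block_mb_for(100, 10))        # 1.0  -> the old 1 MB per 10 ms at 100 MB/s
print(block_mb_for(184, 15))        # 2.76 -> break-even for the drive quoted here
print(block_mb_for(184, 15, 0.1))   # ~24.8 -> to push seek overhead down to ~10%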

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: OoO S/360 descendants

<shs0dc$ilh$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20472&group=comp.arch#20472

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Tue, 14 Sep 2021 22:30:18 -0700
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <shs0dc$ilh$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com>
<sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at>
<sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at>
<sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at>
<sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shorsb$48h$1@dont-email.me> <shqf88$186v$1@gioia.aioe.org>
<shqluf$tnc$1@dont-email.me> <shrvtf$12vj$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Sep 2021 05:30:20 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="069dde333bf2ad63f9d5b82aeb172f4a";
logging-data="19121"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18mQEQtT6ntZhUBFDweMglNqNchTKkouYo="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.1.0
Cancel-Lock: sha1:+/+ix7oKHYON5mhkmONcnrxITVQ=
In-Reply-To: <shrvtf$12vj$1@gioia.aioe.org>
Content-Language: en-US
 by: Stephen Fuld - Wed, 15 Sep 2021 05:30 UTC

On 9/14/2021 10:21 PM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 9/14/2021 8:31 AM, Terje Mathisen wrote:
>>> Stephen Fuld wrote:
>>>> OK, let's split the analysis into two parts, the sort phase and the
>>>> merge phase.  As we will see, your proposal may (there is not
>>>> enough information to tell) win on the sort phase, it will lose big
>>>> on the merge phase.
>>>
>>> I agree with your analysis, i.e. have as few merge sources as
>>> possible, but I think you are wrong here:
>>>>
>>>> For the sort phase, your big error is assuming you can keep the
>>>> source disk 100% busy.  This is due to the rotational latency.
>>>> After you complete a read, you can't start processing the next read
>>>> during the inter-record gap, so you will slip a revolution.  Then
>>>> there will be a "pause" of 8.33 ms between the end of one transfer
>>>> and the start of the next.  Of course, some of this is overlapped
>>>> with the command processing time.
>>>
>>> This is where I'm pretty sure that the sustained read/write rate
>>> includes the skip from one cylinder to the next, and that to minimize
>>> these the drives are formatted with a skew between each pair of
>>> cylinders, as well as a skew from one platter to the next.
>>
>>
>> Yes, that is true.  But that wasn't what I was talking about.  Once
>> the transfer starts, head and cylinder skew keep the sustained rate
>> pretty constant.  I was talking about the initial latency before the
>> start of the first byte transferred.  See below.
>
> OK, that's fine, we are in complete agreement here!
>
> I just make sure that I always read/write disk data in sufficiently
> large streaming blocks that the initial seek cannot dominate. I.e. at
> the very minimum I will have a block stream time (i.e. size) which is
> about the same as the seek time. Back in the 100 MB/s days this meant at
> least 1 MB/block in order to match a 10 ms seek time. Increase for
> higher transfer rates, decrease for shorter seeks.

Yes. And the problem with Eric's approach is that the large N for the N
way merge means that you have less memory per string, and so shorter blocks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<ceo0J.24026$6u6.12365@fx03.iad>


https://www.novabbs.com/devel/article-flat.php?id=20473&group=comp.arch#20473

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx03.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shrng9$8mr$1@z-news.wcss.wroc.pl>
In-Reply-To: <shrng9$8mr$1@z-news.wcss.wroc.pl>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 201
Message-ID: <ceo0J.24026$6u6.12365@fx03.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 15 Sep 2021 15:23:52 UTC
Date: Wed, 15 Sep 2021 11:23:02 -0400
X-Received-Bytes: 11058
 by: EricP - Wed, 15 Sep 2021 15:23 UTC

antispam@math.uni.wroc.pl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Stephen Fuld wrote:
>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>
>>> snip
>>>
>>>> Combining the "external sort" approach with multiple rotating work
>>>> buffers,
>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>> When all are done, rotate the buffers.
>>>> Balance the buffer size, read and write IO times vs sort time.
>>>> That keeps 1 cpu/thread and 2 disks busy.
>>> I am not sure that is the best approach. If you have N bytes of main
>>> memory, you have divided it into three equal sized pieces (in order to
>>> be able to "rotate them". That means the length of the run will be
>>> about 1/3 of what it would be if you used all of the memory as one big
>>> buffer. Read data into that buffer, then sort it, then write it out.
>> My suggestion keeps the bottleneck source HDD continuously busy and
>> parallels the read of the source, the write of the temp, and sort.
>> That gets 2 IO going at once (actually more because I'd start the
>> merges before the source read was finished).
>
> Basic problem with you suggestion is that you base it on lousy
> wikipedia method.

I'm not sure which wikipedia method you are referring to.
If you mean the one here (I'll call this the "batch method"):

https://en.wikipedia.org/wiki/External_sorting#External_merge_sort

then that does not allow streaming data from the source disk.
Assuming the read, sort, and write phases are equal in duration,
that utilizes the source disk 33% of the time.

I am suggesting a pipelined approach which keeps the source disk
streaming 100% of the time by using many temp disks in
parallel to work around the disk IO bandwidth bottleneck.

> Classic approch is to use tree or heap sort.
> Each handles input as it comes in, so rather small double buffer
> is enough to keep input disc busy. Sort "moves" data from one
> buffer to sort work area, in parallel the second buffer receives
> disc data. Once one buffer is worked out you switch roles of
> buffers.

Right, a pipeline.
But a pipeline requires that memory be split into multiple buffers,
and therefore the chunk (unit of sorted records) size must be
smaller than with the batch method, resulting in more fragments to merge.
Which is Stephen's point.
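
For what it's worth, the rotate-the-buffers scheduling itself is tiny.
A Python sketch of just the overlap structure (read_block, sort_block and
write_block are placeholder callables you plug in, and CPython's GIL
limits how much a pure-Python sort really overlaps with the I/O):

import threading

def run_pipeline(read_block, sort_block, write_block, n_blocks):
    # In round i, block i is read into one buffer while block i-1 is
    # sorted in a second and block i-2 is written from a third.
    bufs = [None, None, None]

    def reader(i):
        bufs[i % 3] = read_block(i)      # fill a buffer with block i

    def sorter(i):
        sort_block(bufs[i % 3])          # sort block i in place

    def writer(i):
        write_block(i, bufs[i % 3])      # write out sorted block i

    for rnd in range(n_blocks + 2):      # two extra rounds drain the pipe
        jobs = []
        if rnd < n_blocks:
            jobs.append(threading.Thread(target=reader, args=(rnd,)))
        if 0 <= rnd - 1 < n_blocks:
            jobs.append(threading.Thread(target=sorter, args=(rnd - 1,)))
        if 0 <= rnd - 2 < n_blocks:
            jobs.append(threading.Thread(target=writer, args=(rnd - 2,)))
        for t in jobs:
            t.start()
        for t in jobs:
            t.join()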

> Moreover, classic approach uses two trees (or heaps),
> old and new. All records in old tree are smaller than all records
> in new tree. Incoming records are put in old tree if they are
> in range of old tree, otherwise are put in new tree. Largest
> record from old tree is sent to output file. Note that after
> each step old tree either stays the same or shrinks so eventually
> it will shrink to nothing. Then current output run will end,
> former new tree will take role of old tree, and the new new tree
> will be empty. Also, new output run will start. Assuming random
> distribution of records this tends to produce runs _twice_ as
> long as work memory. Wikipedia example had file which is
> 9 times larger than work memory, with such file it does not
> make much difference if you get 9 way merge (as Wikepedia
> suggest), 27 ways merge (your approach) or 5 way merge
> (expected with classic approach). But numer of runs
> affects number of buffer you need and consequently size
> of buffers. As long as number of runs is bigger than
> number of available discs (which IMHO is expected case)
> there will be seeks durning merge and bigger buffers means
> less time lost on seeks.

I am not familiar with the 2-tree algorithm giving runs twice as long.
Which wikipedia article is this?

> I wrote "moved" above because heapsort actually moves records.
> Treesort just moves pointers and records stay in their buffers,
> so copying is only from input buffer to output buffer (once
> output is done buffer gets ready for reuse as input buffer).

Treesort using descriptors is what got me started on this:
use the sorted descriptors as a DMA fragment list that is
passed directly to the output device, so that the records
are gathered by the DMA, eliminating the move step(s).

>> Your suggestion only utilizes the source disk less than 50% because
>> it serializes everything.
>
> As explained classiscaly this is not the case: input, sorting and
> output can work in parallel with only moderate amout of memory
> assigned to buffers.

The pipelined approach, which is what I was suggesting.

>> The dest drive sequential file is the other bottleneck because it can't
>> start to be written until the last record is read from the source
>> (that last record might sort first in the dest file).
>>
>> A "minimum" sort time would start final merge and dest file write
>> immediately after reading the last source record, and continuously
>> stream the output as fast as the dest disk can take it.
>
> Well, start merge mean read starting blocks of _all_ runs and you
> need enough memory for this. Your assumption of single input
> drive and single output drive but multiple work drives makes problem
> potentially quite asymetic and in such asymetric setup there may
> be some gain from making final run smaller and using freed memory
> to start merging earlier. However, really big files will not
> fit on single disc and splitting big files between multiple
> disc is standard performance trick, so you assumption is
> _very_ unnatural.

As I understand it, the scenario is defined by the benchmark
(see Common Rules section):

http://sortbenchmark.org/

as a trillion+ byte record sort. The file size is chosen to be larger
than the amount of main memory, so they must partition the sort.
Presumably this is a primary key sort prior to a database load.

Note that the current winners in the above list are using SSDs,
not HDDs, so they have zero seek time to deal with.
Past winners, listed below them, used HDDs.

This came from the 1985 Datamation sort benchmark defined as:
- Input is a disk-resident file of a million 100-byte records.
- Records have 10-byte key fields and can't be compressed.
- The input record keys are in random order.
- The output file must be a permutation of the input file sorted
in key ascending order.

Other than that there are few rules (it must use the OS file system).
The implementation may use all the "mean tricks" typical of operating
systems utilities. It can access the files via low-level interfaces,
it can use undocumented interfaces, and it can use as many disks,
processors and as much memory as it likes.

The file size is chosen to be larger than main memory,
and $ per sort-time is a benchmark consideration.
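
Not the benchmark's official generator, just a toy stand-in matching the
record format described above, handy for small-scale experiments with the
schemes discussed in this thread:

import os

RECORD_LEN = 100   # bytes per record, per the definition above

def write_records(path, n_records):
    # Each record is RECORD_LEN random (hence incompressible) bytes;
    # the first 10 bytes serve as the key.
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(os.urandom(RECORD_LEN))

# write_records("sort_input.dat", 1_000_000)   # ~100 MB, the original 1985 size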

>>> of merge passes. As the size of the sorted runs goes up, the number of
>>> passes goes down, and as we discussed in Anton's example, even when you
>>> get to the final merge pass, having fewer runs allows fewer I/Os. So,
>>> yes, you lose concurrency, but the longer runs may gain in elapsed time
>>> by doing less work.
>> Yes, in part. But for the final merge file write, if there are 8 temp
>> and 1 dest HDD then we can read and merge far faster than write.
>>
>> Looking at a randomly selected Enterprise disk from Western Digital,
>> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
>> and a 8 TB SATA disk is 255 MB/s.
>> They don't give seek time anymore but I'll assume it's 15-20 ms average.
>>
>> The system bus which operate in the 6 giga-transfers per second range
>> hopefully can support 8 or 9 concurrent DMA's plus execute code.
>>
>> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
>> A 15 ms seek costs 2.76 MB of lost disk transfer time.
>
> Which means that 3M buffer will reduce loss due to seeks to 50%.
> With resonable amout of RAM 3M buffers allow 1000 way merge which
> should be enough for anything that fits on 8T drive. Which in
> turn means that there is no gain from using more than 2 work drives.

All benchmark winners use many disks, 8 seems to be a common number.
I see Nsort in 2004 used 2350 disks.

>> What I've been trying to work out is what is the optimum order of
>> operations that keeps as many disks busy as possible doing overlapped,
>> concurrent reads, sort, merges, and writes such that the final merge
>> file write can start the instant the last source record is read.
>
> Other folks tried to go beyond single input drive limit. As you
> and other noted system bus can support more drives. Assuming that
> time to transfer file from disc is T (for simplictity I assume
> equal read and write speed). Your assumption means that sort
> time is not better that 2T (input read + output write). Two
> phase approach means that date is read twice (once from input,
> second time from temporary files) and written twice (once to
> temporary files, second time to final output). With 8 drives
> and evenly split work this can be done in 4T/8 = T/2, that
> is 4 times faster than your approach. Of course, there will
> be some time lost to seeks, but 50% loss still would lead to
> doing work twice as fast as you can. Since orignal input is
> essentially sequential and the same with final output one
> can use much larger block size there limiting time lost to
> seeks. Alternatively, one can connect more discs than
> bus bandwidth limit, so time spent seeking is essentially
> free.


Re: OoO S/360 descendants

<sht9p3$2tl$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20474&group=comp.arch#20474

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Wed, 15 Sep 2021 10:16:19 -0700
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <sht9p3$2tl$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com>
<2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me>
<2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me>
<2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org>
<Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad>
<shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad>
<shrng9$8mr$1@z-news.wcss.wroc.pl> <ceo0J.24026$6u6.12365@fx03.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 15 Sep 2021 17:16:20 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="069dde333bf2ad63f9d5b82aeb172f4a";
logging-data="2997"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+xTO9a8w1kA9SX4R/6fAuQNKJQtZTjagE="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.1.0
Cancel-Lock: sha1:DEA3maBLKj1i9+VSK2Qy20EebB4=
In-Reply-To: <ceo0J.24026$6u6.12365@fx03.iad>
Content-Language: en-US
 by: Stephen Fuld - Wed, 15 Sep 2021 17:16 UTC

On 9/15/2021 8:23 AM, EricP wrote:
> antispam@math.uni.wroc.pl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>> Stephen Fuld wrote:
>>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>>
>>>> snip
>>>>
>>>>> Combining the "external sort" approach with multiple rotating work
>>>>> buffers,
>>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>>> When all are done, rotate the buffers.
>>>>> Balance the buffer size, read and write IO times vs sort time.
>>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>> I am not sure that is the best approach.  If you have N bytes of
>>>> main memory, you have divided it into three equal sized pieces (in
>>>> order to be able to "rotate them".  That means the length of the run
>>>> will be about 1/3 of what it would be if you used all of the memory
>>>> as one big buffer.  Read data into that buffer, then sort it, then
>>>> write it out.
>>> My suggestion keeps the bottleneck source HDD continuously busy and
>>> parallels the read of the source, the write of the temp, and sort.
>>> That gets 2 IO going at once (actually more because I'd start the
>>> merges before the source read was finished).
>>
>> Basic problem with you suggestion is that you base it on lousy
>> wikipedia method.
>
> I'm not sure which wikipedia method you are referring to.
> If you mean the one here (I'll call this the "batch method"):
>
> https://en.wikipedia.org/wiki/External_sorting#External_merge_sort
>
> then that does not allow streaming data from the source disk.
> Assuming read, sort, and write phases are equal in duration
> that utilizes the source disk 33% of the time.

I believe what Waldek was referring to was that article, specifically
statement number 1, which says you sort the initial load "by some
conventional method". With no disrespect for either you or Waldek, I
will try to explain it using different language.

He was talking about using a merge sort as the basic sorting method,
which combines naturally with the run elongation methods we have
discussed (and which the article mentions in the last sentence of that
section, with its reference to Knuth), and which allows reading the
input in parallel with doing the sort and writing the intermediate file.

Specifically,

1. Read the first block of the input file.

2. Then you create the merge tree for that block in parallel with
reading the second block.

3. Read the 3rd block of the input file in parallel with adding the
records in the 2nd block to the merge tree.

4. Repeat step 3 until the sort work area in main memory is filled.

5. Write the lowest block from the merge tree to the intermediate file.

6. Read the next block of the input file into the space vacated by the
block just written out.

7. Merge the records in the newly read block into the big merge tree.
If any of them have keys lower than the highest key in the block just
written out, start a new tree for them.

8. Continue in parallel, reading blocks from the input file, merging the
records into either the "old" merge tree (if their keys are greater than
the key of the highest record written out so far) or the "new" merge tree
(if not), and writing a block from the low end of the "old" merge tree to
the intermediate file.

9. At some point, you will have written out all of the records from the
"old" tree. At that point, the "new" tree becomes the new "old" tree,
and you start a new "new" tree. That marks the end of the run. Repeat
as needed.

This method naturally generates longer runs, and allows parallelism
between reading the source disk, sorting (really merging) the records in
main memory, and writing the intermediate file. The question of whether
to do the writes to the intermediate file by moving the records to an
output buffer, or using gather write techniques depends upon the
relative costs for the specific hardware involved.
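
As a footnote on the gather-write alternative: on a POSIX system it can
be as simple as handing the sorted record descriptors (memoryviews into
the input buffers) to writev, so nothing is copied in user space. A
sketch, assuming full writes and ignoring error handling:

import os

def gather_write(fd, sorted_views, max_iov=1024):
    # sorted_views: memoryviews pointing at the records where they already
    # sit in the input buffers, in output order.  IOV_MAX caps how many
    # segments one writev call may take, hence the batching.
    for start in range(0, len(sorted_views), max_iov):
        batch = sorted_views[start:start + max_iov]
        written = os.writev(fd, batch)
        assert written == sum(len(v) for v in batch)  # sketch: assume full writes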

I hope this helps.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<shtgrr$4j4$1@z-news.wcss.wroc.pl>


https://www.novabbs.com/devel/article-flat.php?id=20475&group=comp.arch#20475

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!alphared!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.pionier.net.pl!pwr.wroc.pl!news.wcss.wroc.pl!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Wed, 15 Sep 2021 19:17:15 +0000 (UTC)
Organization: Politechnika Wroclawska
Lines: 255
Message-ID: <shtgrr$4j4$1@z-news.wcss.wroc.pl>
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shrng9$8mr$1@z-news.wcss.wroc.pl> <ceo0J.24026$6u6.12365@fx03.iad>
NNTP-Posting-Host: hera.math.uni.wroc.pl
X-Trace: z-news.wcss.wroc.pl 1631733435 4708 156.17.86.1 (15 Sep 2021 19:17:15 GMT)
X-Complaints-To: abuse@news.pwr.wroc.pl
NNTP-Posting-Date: Wed, 15 Sep 2021 19:17:15 +0000 (UTC)
Cancel-Lock: sha1:y20oC0s9n6Zmh9qT4Ik8HYMgunY=
User-Agent: tin/2.4.3-20181224 ("Glen Mhor") (UNIX) (Linux/4.19.0-10-amd64 (x86_64))
 by: antis...@math.uni.wroc.pl - Wed, 15 Sep 2021 19:17 UTC

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> antispam@math.uni.wroc.pl wrote:
> > EricP <ThatWouldBeTelling@thevillage.com> wrote:
> >> Stephen Fuld wrote:
> >>> On 9/9/2021 7:14 AM, EricP wrote:
> >>>
> >>> snip
> >>>
> >>>> Combining the "external sort" approach with multiple rotating work
> >>>> buffers,
> >>>> buffer A is doing an async read, B is sorting, C is async writing.
> >>>> When all are done, rotate the buffers.
> >>>> Balance the buffer size, read and write IO times vs sort time.
> >>>> That keeps 1 cpu/thread and 2 disks busy.
> >>> I am not sure that is the best approach. If you have N bytes of main
> >>> memory, you have divided it into three equal sized pieces (in order to
> >>> be able to "rotate them". That means the length of the run will be
> >>> about 1/3 of what it would be if you used all of the memory as one big
> >>> buffer. Read data into that buffer, then sort it, then write it out.
> >> My suggestion keeps the bottleneck source HDD continuously busy and
> >> parallels the read of the source, the write of the temp, and sort.
> >> That gets 2 IO going at once (actually more because I'd start the
> >> merges before the source read was finished).
> >
> > Basic problem with you suggestion is that you base it on lousy
> > wikipedia method.
>
> I'm not sure which wikipedia method you are referring to.
> If you mean the one here (I'll call this the "batch method"):
>
> https://en.wikipedia.org/wiki/External_sorting#External_merge_sort
>
> then that does not allow streaming data from the source disk.
> Assuming read, sort, and write phases are equal in duration
> that utilizes the source disk 33% of the time.

Yes, that one.
> I am suggesting of a pipelined approach which keeps the source disk
> streaming 100% of the time by using many temp disks in
> parallel to work around the disk IO bandwidth bottleneck.
>
> > Classic approch is to use tree or heap sort.
> > Each handles input as it comes in, so rather small double buffer
> > is enough to keep input disc busy. Sort "moves" data from one
> > buffer to sort work area, in parallel the second buffer receives
> > disc data. Once one buffer is worked out you switch roles of
> > buffers.
>
> Right, a pipeline.
> But pipeline requires that memory be spit into multiple buffers
> and therefore the chunk (unit of sorted records) size must be
> smaller than batch method, resulting in more fragments to merge.
> Which is Stephen's point.
>
> > Moreover, classic approach uses two trees (or heaps),
> > old and new. All records in old tree are smaller than all records
> > in new tree. Incoming records are put in old tree if they are
> > in range of old tree, otherwise are put in new tree. Largest
> > record from old tree is sent to output file. Note that after
> > each step old tree either stays the same or shrinks so eventually
> > it will shrink to nothing. Then current output run will end,
> > former new tree will take role of old tree, and the new new tree
> > will be empty. Also, new output run will start. Assuming random
> > distribution of records this tends to produce runs _twice_ as
> > long as work memory. Wikipedia example had file which is
> > 9 times larger than work memory, with such file it does not
> > make much difference if you get 9 way merge (as Wikepedia
> > suggest), 27 ways merge (your approach) or 5 way merge
> > (expected with classic approach). But numer of runs
> > affects number of buffer you need and consequently size
> > of buffers. As long as number of runs is bigger than
> > number of available discs (which IMHO is expected case)
> > there will be seeks durning merge and bigger buffers means
> > less time lost on seeks.
>
> I am not familiar with the 2-tree algorithm giving runs twice as long.
> Which wikipedia article is this?

I am not aware of any reference in wikipedia. I learned it from
Wirth's "Algorithms + data structures = programs" (the old edition;
I once had the new edition in hand and did not look at the sort part,
but I noticed that some other material was gone).

> > I wrote "moved" above because heapsort actually moves records.
> > Treesort just moves pointers and records stay in their buffers,
> > so copying is only from input buffer to output buffer (once
> > output is done buffer gets ready for reuse as input buffer).
>
> Treesort using descriptors is what got me started on this.
> Using the sorted descriptors as a DMA fragment list that
> is passed directly to the output device so that the records
> are gathered by the DMA, eliminating the move step(s).
>
> >> Your suggestion only utilizes the source disk less than 50% because
> >> it serializes everything.
> >
> > As explained classiscaly this is not the case: input, sorting and
> > output can work in parallel with only moderate amout of memory
> > assigned to buffers.
>
> The pipelined approach, which is what I was suggesting.
>
> >> The dest drive sequential file is the other bottleneck because it can't
> >> start to be written until the last record is read from the source
> >> (that last record might sort first in the dest file).
> >>
> >> A "minimum" sort time would start final merge and dest file write
> >> immediately after reading the last source record, and continuously
> >> stream the output as fast as the dest disk can take it.
> >
> > Well, start merge mean read starting blocks of _all_ runs and you
> > need enough memory for this. Your assumption of single input
> > drive and single output drive but multiple work drives makes problem
> > potentially quite asymetic and in such asymetric setup there may
> > be some gain from making final run smaller and using freed memory
> > to start merging earlier. However, really big files will not
> > fit on single disc and splitting big files between multiple
> > disc is standard performance trick, so you assumption is
> > _very_ unnatural.
>
> As I understand it, the scenario is defined by the benchmark
> (see Common Rules section):
>
> http://sortbenchmark.org/
>
> as a trillion+ byte record sort. The file size is chosen to be larger
> than the amount of main memory so they must partition the sort.
> Presumable this is a primary key sort prior to a database load.
>
> Note that the above list of current winners are using SSD's
> not HDD's so they have zero seek time to deal with.
> Past winners, listed below them, used HDD.

Data rates clearly show that the data is _not_ coming from a
single disc. Note:

: * File or device striping (RAID 0) are allowed (encouraged) to get
: bandwidth. If file striping is used then the concatenated files
: must form a sorted file.

so they indeed encourage distributing files between multiple discs.

> This came from the 1985 Datamation sort benchmark defined as:
> - Input is a disk-resident file of a million 100-byte records.
> - Records have 10-byte key fields and can't be compressed.
> - The input record keys are in random order.
^^^^^^^^^^^^^
They write:

: * The sort input records must be 100 bytes in length, with the first
: 10 bytes being a random key.
^^^^^^^^^^
A random key is an easier assumption than an arbitrary (possibly hostile)
key distribution sorted in random order. Anyway, at the scale
of this benchmark the probable scheme is 3-phase:

1) Collect statistics from a random sample and compute approximate
p-quantiles for p=1/N, 2/N, ..., (N-1)/N. Use them in a multiway
split to get N parts of approximately equal size.
The split can be done in parallel using N compute nodes.
2) Form runs for the merge stage.
3) Merge the runs.

Steps 2 and 3 are done separately on each part, assigning a separate
node to each part. Those steps are essentially the merge sort
discussed before. Phase 1, that is the splitting, allows distributing
the load between nodes.
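
Phase 1 is small enough to sketch in toy Python (uniform random floats
standing in for the benchmark's 10-byte keys, and a 1,000-record sample
size picked arbitrarily):

import bisect
import random

def make_splitters(sample_keys, n_parts):
    # Approximate i/N quantiles of the key distribution, estimated
    # from a random sample of the input.
    s = sorted(sample_keys)
    return [s[len(s) * i // n_parts] for i in range(1, n_parts)]

def part_of(key, splitters):
    # Index (0 .. N-1) of the part this key belongs to.
    return bisect.bisect_right(splitters, key)

keys = [random.random() for _ in range(100_000)]
splitters = make_splitters(random.sample(keys, 1_000), 4)
counts = [0] * 4
for k in keys:
    counts[part_of(k, splitters)] += 1
print(counts)   # the four parts come out close to 25,000 keys each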

> - The output file must be a permutation of the input file sorted
> in key ascending order.
>
> Other than that there are few rules (it must use the OS file system).
> The implementation may use all the "mean tricks" typical of operating
> systems utilities. It can access the files via low-level interfaces,
> it can use undocumented interfaces, and it can use as many disks,
> processors and as much memory as it likes.
>
> The file size is chosen to be larger than main memory,
> and $ per sort-time is a benchmark consideration.
>
> >>> of merge passes. As the size of the sorted runs goes up, the number of
> >>> passes goes down, and as we discussed in Anton's example, even when you
> >>> get to the final merge pass, having fewer runs allows fewer I/Os. So,
> >>> yes, you lose concurrency, but the longer runs may gain in elapsed time
> >>> by doing less work.
> >> Yes, in part. But for the final merge file write, if there are 8 temp
> >> and 1 dest HDD then we can read and merge far faster than write.
> >>
> >> Looking at a randomly selected Enterprise disk from Western Digital,
> >> a 1 TB SATA disk has a max sustained transfer rate of 184 MB/s
> >> and a 8 TB SATA disk is 255 MB/s.
> >> They don't give seek time anymore but I'll assume it's 15-20 ms average.
> >>
> >> The system bus which operate in the 6 giga-transfers per second range
> >> hopefully can support 8 or 9 concurrent DMA's plus execute code.
> >>
> >> Using the 1 TB as source&dest gives 5435 sec = 90 min to read or write.
> >> A 15 ms seek costs 2.76 MB of lost disk transfer time.
> >
> > Which means that 3M buffer will reduce loss due to seeks to 50%.
> > With resonable amout of RAM 3M buffers allow 1000 way merge which
> > should be enough for anything that fits on 8T drive. Which in
> > turn means that there is no gain from using more than 2 work drives.
>
> All benchmark winners use many disks, 8 seems to be a common number.
> I see Nsort in 2004 used 2350 disks.


Re: OoO S/360 descendants

<FJw0J.84969$Kv2.19863@fx47.iad>


https://www.novabbs.com/devel/article-flat.php?id=20476&group=comp.arch#20476

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shrng9$8mr$1@z-news.wcss.wroc.pl> <ceo0J.24026$6u6.12365@fx03.iad> <sht9p3$2tl$1@dont-email.me>
In-Reply-To: <sht9p3$2tl$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 106
Message-ID: <FJw0J.84969$Kv2.19863@fx47.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 16 Sep 2021 01:03:33 UTC
Date: Wed, 15 Sep 2021 21:02:38 -0400
X-Received-Bytes: 5994
 by: EricP - Thu, 16 Sep 2021 01:02 UTC

Stephen Fuld wrote:
> On 9/15/2021 8:23 AM, EricP wrote:
>> antispam@math.uni.wroc.pl wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>>>> Stephen Fuld wrote:
>>>>> On 9/9/2021 7:14 AM, EricP wrote:
>>>>>
>>>>> snip
>>>>>
>>>>>> Combining the "external sort" approach with multiple rotating work
>>>>>> buffers,
>>>>>> buffer A is doing an async read, B is sorting, C is async writing.
>>>>>> When all are done, rotate the buffers.
>>>>>> Balance the buffer size, read and write IO times vs sort time.
>>>>>> That keeps 1 cpu/thread and 2 disks busy.
>>>>> I am not sure that is the best approach. If you have N bytes of
>>>>> main memory, you have divided it into three equal sized pieces (in
>>>>> order to be able to "rotate them". That means the length of the
>>>>> run will be about 1/3 of what it would be if you used all of the
>>>>> memory as one big buffer. Read data into that buffer, then sort
>>>>> it, then write it out.
>>>> My suggestion keeps the bottleneck source HDD continuously busy and
>>>> parallels the read of the source, the write of the temp, and sort.
>>>> That gets 2 IO going at once (actually more because I'd start the
>>>> merges before the source read was finished).
>>>
>>> Basic problem with you suggestion is that you base it on lousy
>>> wikipedia method.
>>
>> I'm not sure which wikipedia method you are referring to.
>> If you mean the one here (I'll call this the "batch method"):
>>
>> https://en.wikipedia.org/wiki/External_sorting#External_merge_sort
>>
>> then that does not allow streaming data from the source disk.
>> Assuming read, sort, and write phases are equal in duration,
>> that utilizes the source disk 33% of the time.
>
> I believe what Waldek was referring to was that article, specifically
> statement number 1, saying you sort the initial load "by some
> conventional method". With no disrespect to either you or Waldek, I
> will try to explain it to you using different language.
>
> He was talking about using a merge sort as the basic sorting method,
> which combines naturally with the run elongation methods we have
> discussed (and which the article mentions in the last sentence of that
> section, with its reference to Knuth), and allows reading the input in
> parallel with doing the sort and writing the intermediate file.
>
> Specifically,
>
> 1. Read the first block of the input file.
>
> 2. Then you create the merge tree for that block in parallel with
> reading the second block.
>
> 3. Read the 3rd block of the input file in parallel with adding the
> records in the 2nd block to the merge tree.
>
> 4. Repeat step 3 until the sort work area in main memory is filled.
>
> 5. Write the lowest block from the merge tree to the intermediate file.
>
> 6. Read the next block of the input file into the space vacated by
> the just-written block.

Ah... understood. It reuses the empty holes.

> 7. Merge the records in the newly read block into the big merge tree.
> If any of them have keys lower than the highest key in the block just
> written out, start a new tree.
>
> 8. Continue in parallel, reading blocks from the input file, merging
> the records into either the "old" merge tree (if their keys are greater
> than the highest key written out so far) or the "new" merge tree (if
> not), and writing a block from the low end of the "old" merge tree to
> the intermediate file.
>
> 9. At some point, you will have written out all of the records from
> the "old" tree. At that point, the "new" tree becomes the new "old"
> tree, and you start a new "new" tree. That marks the end of the run.
> Repeat as needed.
>
> This method naturally generates longer runs, and allows parallelism
> between reading the source disk, sorting (really merging) the records in
> main memory, and writing the intermediate file. The question of whether
> to do the writes to the intermediate file by moving the records to an
> output buffer, or using gather write techniques depends upon the
> relative costs for the specific hardware involved.
>
> I hope this helps.

It did, thank you.
This is the replacement-selection sort you referred to earlier.

The resulting sorted chunks would be variable length, with a minimum
size of the initial 3 input buffers and a maximum size of the
whole source file if it is already sorted.

I might not do it in quite this exact way.
I want to do big, contiguous reads with double buffers for overlapped
read-ahead, and that makes it a little more complicated.
But moving records in memory is cheaper than doing another IO,
so a bit of record shuffling could keep this pipelined.
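
For concreteness, a single-threaded toy version of that run formation
(replacement selection with a heap standing in for the merge tree; the
capacity and integer keys are illustrative, and the overlapped I/O is
left out):

import heapq, random

def replacement_selection(records, capacity):
    # Yields sorted runs. For random input the runs average about
    # 2*capacity records; presorted input comes out as a single run.
    it = iter(records)
    heap = []                        # the "old" tree
    for rec in it:                   # fill the work area
        heapq.heappush(heap, rec)
        if len(heap) >= capacity:
            break
    pending = []                     # the "new" tree for the next run
    while heap:
        run = []
        while heap:
            low = heapq.heappop(heap)
            run.append(low)          # "write" the lowest record
            nxt = next(it, None)     # "read" into the vacated space
            if nxt is None:
                continue
            if nxt >= low:           # can still join the current run
                heapq.heappush(heap, nxt)
            else:                    # must wait for the next run
                pending.append(nxt)
        yield run
        heap, pending = pending, []
        heapq.heapify(heap)

random.seed(1)
data = [random.randrange(10**6) for _ in range(100_000)]
runs = list(replacement_selection(data, 1000))
print(len(runs), sum(len(r) for r in runs) // len(runs))   # ~50 runs, ~2000 each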

Re: OoO S/360 descendants

<ItJ0J.70507$YW.34958@fx05.iad>


https://www.novabbs.com/devel/article-flat.php?id=20480&group=comp.arch#20480

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx05.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
References: <sde7eg$hgb$1@newsreader4.netcologne.de> <5f1ee158-53e6-43dd-9a68-f3a1a37bb4f5n@googlegroups.com> <sflqmi$16gk$1@gal.iecc.com> <2021Aug19.173450@mips.complang.tuwien.ac.at> <sfm656$2lrr$1@gal.iecc.com> <2021Aug21.140442@mips.complang.tuwien.ac.at> <sh5nuv$brr$1@dont-email.me> <2021Sep7.131827@mips.complang.tuwien.ac.at> <sh8570$t79$1@dont-email.me> <2021Sep8.082614@mips.complang.tuwien.ac.at> <nb3_I.19779$rsCb.16610@fx01.iad> <shb2ea$1mso$1@gioia.aioe.org> <Ez9_I.3$_n1.2@fx14.iad> <2Fo_I.1458$ol1.620@fx42.iad> <shfr5c$vc8$1@dont-email.me> <547%I.12692$dI3.3254@fx10.iad> <shorsb$48h$1@dont-email.me> <j710J.25224$rsCb.8927@fx01.iad> <shqq0u$rvq$1@dont-email.me>
In-Reply-To: <shqq0u$rvq$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 229
Message-ID: <ItJ0J.70507$YW.34958@fx05.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 16 Sep 2021 15:34:00 UTC
Date: Thu, 16 Sep 2021 11:33:27 -0400
X-Received-Bytes: 11543
X-Original-Bytes: 11491
 by: EricP - Thu, 16 Sep 2021 15:33 UTC

Stephen Fuld wrote:
> On 9/14/2021 6:05 AM, EricP wrote:
>
>> The temp writes do seek but I would write fragments to multiple disks
>> so the seeks time can be overlapped with writes to other disks.
>
> I am not sure what you are saying here. Are you saying that, assuming
> your initial sorted run is of size M/3, where M is the amount of memory
> available for the buffer, that you are going to split this up such that
> each run on disk will be (M/3)/8 long? You can do that, of course, and
> you do overlap and thus reduce the time to write out the intermediate disks
> in the sort phase, but you more than lose that during the merge phase.
> See below.

No. It starts with a streaming sort pass of M/4 (I needed some extra
working memory for the later merge pass so M/3 became M/4).
That streams chunks 1,2,3 to disks D1 D2 D3.

Then while chunks 4,5,6 are streaming to disks D5 D6 D7
it merges 1,2,3 and puts that C3 triple size chunk on D4.

Then while chunks 7,8,9 are streaming to disks D1 D2 D3
it merges 4,5,6 and puts that C3 triple size chunk on D8.
And so on.

All that can be done with minimal disk head movement that can be
overlapped across disks without disturbing the source streaming.

To merge the C3 size chunks requires moving the disk heads.
When there are four C3 triple chunks they merge to a C12 chunk.

So it starts out streaming smaller chunks but almost immediately
starts concurrent merging.

The net result is (hopefully) that doing something less efficiently
in parallel is faster than doing it more efficiently serially.

(This is what my later post with the table tried to show with
the chunks/runs sorting and merging and moving between disks.)

>> Plus I'd be merging earlier fragments in parallel
>> with writes of new fragments - ideally the source and
>> all 8 temp disks are seeking, reading or writing at all times.
>> This is the parallel Towers of Hanoi I referred to.
>
> But then you are "interleaving" the writes of the sort phase with the
> reads for the merge phase, and you give up the locality on the writes,
> so each write incurs a full seek, not just a latency.

Initially it is reading and writing different disks sequentially
at the same time so there is locality and the stream is continuous.

However, later concurrent merges do require seeks, reads and writes
to disks that will cause interference between the operations.
My hope is that by juggling the order of disk operations,
any such seeks can be hidden in the overlap.

>> Also I didn't say this before but should mention,
>> I'd preallocate all temp files as contiguous disk blocks
>> at the run start so there is no file system interaction,
>> and no reason for the file system to move the disk head
>> for meta data reading or writing.
>> Then manage the assignment of fragments to file blocks myself.
>> This helps ensure the disk heads are always where I left them.
>
> Sure. I was essentially assuming the same thing.
>
>
>>> So, your proposal gains on parallelism, but loses on disk latency.
>>> We just don't have enough specifics to know which effect is larger.
>>> :-(
>>>
>>> I should also note that your proposal uses more CPU time, both for
>>> processing three times as many read and write commands, and on the
>>> sort itself. I.e. it costs less CPU time to sort one block of length
>>> 3N than three blocks of length N. It's not clear if this affects
>>> elapsed time.
>>
>> I disagree with this.
>>
>> IO time approximates T = A + B*size
>> where for A there is an OS time queuing any IO and disk time to seek,
>
> Yes, although with the unbuffered drive assumption, it is useful to
> split A into seek and latency. I assumed that the OS time was
> negligible compared to those.
>
>> and for B the OS time pinning pages and setting up IOMMU page maps
>> and tearing them down afterwards, and for the disk transfer blocks.
>
> Yes. Again, for simplicity, I was ignoring the OS time.
>
>
>> Time to read, say, 32 MB should be dominated by per-block OS overhead
>> and transfer time 32 MB/184 MB/s = 173 ms or 10 times the seek time,
>> if any seek was needed.
>
> OK.
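
Plugging the numbers above into that model (OS terms dropped, as assumed here):

A_s = 0.015                  # assumed seek, latency folded in
B_s_per_mb = 1 / 184.0       # 1 TB drive, 184 MB/s sustained
t_xfer = B_s_per_mb * 32     # ~0.174 s to transfer 32 MB
t_io = A_s + t_xfer          # ~0.189 s with one seek, i.e. under 10% overhead
print(round(t_xfer * 1000), round(t_io * 1000))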
>
>
>> For the sort, it doesn't matter if it is more expensive as long as it is
>> not on the critical path and completes before its parallel reads and writes.
>
> Agreed, but I think it is on the critical path.
>
>> The goal is minimum elapsed time.
>
> On that, we absolutely agree! :-)
>
>>> OK, let's move on to the merge phase. As I went into in response to
>>> Anton, the benefits of the 3X longer runs, which means 1/3 the number
>>> of runs to merge really shows up here. In summary, the fewer the
>>> number of runs to merge, the more of each run you can keep in main
>>> memory, so the fewer reads to get it into memory. Your proposal
>>> requires three times the number of I/Os, to the intermediate disks
>>> which means 3 times the full (assumed) 15-20 ms seek plus latency
>>> (the transfer time is, of course, the same).
>>
>> I agree if this was solely what I proposed.
>> But I also said I'd start part of the merge during the sort phase.
>> That keeps many disks and cpus busy concurrently.
>> This is the parallel Towers of Hanoi part.
>
> I don't think you have taken into account the reduction of memory
> allocated to the sort by using some of it for the merge, and the
> resultant increase in the number of intermediate runs, thus increase in
> the number of I/Os required.

That is a factor, but on the other side it allows the source to
be streamed continuously and the merges to run concurrently.
The net result could be a win (it is worth exploring anyway).
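
To put a number on that trade-off, a toy calculation (the file and memory
sizes are invented; only the ratios matter):

def merge_phase_reads(file_gb, mem_gb, run_gb):
    runs = file_gb / run_gb           # fewer, longer runs -> fewer runs
    buf_gb = mem_gb / runs            # merge buffer each run gets
    return file_gb / buf_gb           # reads (and seeks) to pull every run in

print(merge_phase_reads(1000, 16, 16))       # runs the size of memory: ~3.9e3 reads
print(merge_phase_reads(1000, 16, 16 / 3))   # runs a third as long:    ~1.2e4 reads

So runs a third as long cost roughly three times the merge-phase reads,
each paying a seek.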

> You also have introduced a potential queue for the intermediate disks,
> that is, a write for the sort phase may be queued behind a read for the
> merge phase or vice versa. This will appear as an increase in A in your
> equation above. Note this is independent of whether you assume the
> drive does the queuing or the host.

I did think of that - not sure of a solution though.
There isn't a way to indicate priorities to the drives.

The merge buffer reads and writes are smaller anyway so if a write of
a larger sort chunk gets stuck behind a read of a smaller merge chunk,
the interference will be of short duration.

>> So the question is, is there an order to the parallel Towers of Hanoi
>> that results in the shortest elapsed time?
>>
>>> You mentioned above that the reads of the intermediate disks would be
>>> faster than the writes to the output drive. The problem is that you
>>> are paying the full seek plus latency on the reads, but only the
>>> latency on the writes. In fact, they will be serialized, as you have
>>> to complete the write to free up the memory to be able to do the read.
>>
>> Yes, for either approach.
>> If there are too many fragments to merge then the seeks to load
>> the fragment buffers will take longer than it takes to write
>> the next merged output buffer, and it will stall.
>
> Agreed. And since your proposal results in more fragments, for a given
> amount of memory, each fragment will be shorter, thus more reads will be
> required for the merge.

Initially shorter, but they are immediately merged concurrently
on different disks.

>> I see this as, each sorted fragment is read in small chunks into
>> a circular buffer. Merge looks at the record at the head of each
>> of those circular buffers and copies the next one to the output.
>
> OK. I am assuming you will allocate as many circular buffers as you
> have fragments. So the more fragments you have, the smaller each buffer
> will have to be, and so each fragment will be divided into a larger number
> of chunks. That larger number of chunks means more disk read operations
> to read each fragment.

The sort-phase merge doesn't have to allocate a particular number of buffers.
For it to be a win it just has to improve things enough that the final sort is better.
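
A toy version of that merge loop, with each run read in fixed-size chunks
into its own buffer (in-memory lists stand in for the runs on disk and each
refill stands in for a seek plus read; the double-buffered output side is
omitted):

import heapq, itertools

def kway_merge(runs, chunk_records):
    readers = [iter(r) for r in runs]
    bufs = [list(itertools.islice(rd, chunk_records)) for rd in readers]
    heap = [(b[0], i, 0) for i, b in enumerate(bufs) if b]
    heapq.heapify(heap)
    while heap:
        key, i, pos = heapq.heappop(heap)
        yield key                          # next record of the merged output
        pos += 1
        if pos == len(bufs[i]):            # buffer emptied: refill from that run
            bufs[i] = list(itertools.islice(readers[i], chunk_records))
            pos = 0
        if bufs[i]:
            heapq.heappush(heap, (bufs[i][pos], i, pos))

runs = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]   # three presorted runs
print(list(kway_merge(runs, 2)))                       # 1..12 in order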

>> As each input fragment is emptied, we must do a seek and read to refill.
>
> Not fragment, (in your terminology) chunk. BTW, the usual terminology
> for what you are calling fragments is "runs".
>
>> As each output circular buffer 1/2 fills up, that half is written
>> and we start the next half.
>
> I am not sure I follow this. Are the output circular buffers just
> traditional "double buffering" before writing to the disk? How many
> sets of output circular buffers do we have? If more than one, why write
> only half at a time?


Re: more than we wanted to know about sort/merge, was OoO S/360 descendants

<si2ba8$sc5$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20487&group=comp.arch#20487

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: more than we wanted to know about sort/merge, was OoO S/360
descendants
Date: Fri, 17 Sep 2021 08:13:05 -0700
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <si2ba8$sc5$1@dont-email.me>
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
<2021Sep8.082614@mips.complang.tuwien.ac.at>
<nb3_I.19779$rsCb.16610@fx01.iad> <shapci$gp8$1@dont-email.me>
<shb3ic$poi$1@gal.iecc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 17 Sep 2021 15:13:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="32ab916d4223fc35344731d5c0028ecb";
logging-data="29061"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/iydVhAo0H3qmkLcpnEeujo07PE8BU9S4="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.1.0
Cancel-Lock: sha1:JWuiksGCgicUz0X2FUEK+gOIscQ=
In-Reply-To: <shb3ic$poi$1@gal.iecc.com>
Content-Language: en-US
 by: Stephen Fuld - Fri, 17 Sep 2021 15:13 UTC

On 9/8/2021 12:39 PM, John Levine wrote:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>>> Sort-merge might make use of general HW scatter-gather DMA mechanism
>>> integrated with IO so records are gathered during the disk write operation.
>>
>> Given that S/360 had the equivalent of scatter/gather (called Data
>> Chaining in S/360 parlance), they might very well have used it. As I
>> said, I really couldn't find much information on the internal algorithm
>> used by IBM's sort program.
>
> Probably not. On low end systems data chaining didn't work very well
> because the system sometimes didn't have time to fetch and decode another
> CCW in the middle of a disk block.

OK, I didn't know that. Thanks.

> I get the impression it was mostly used
> in specific applications like separating the key and the data in ISAM data.
>
>> As I said, I really couldn't find much information on the internal algorithm
>> used by IBM's sort program.
>
> http://bitsavers.org/pdf/ibm/360/dos/plm/Y24-5021-0_DOS_Sort_Merge_PLM_Aug66.pdf
> http://bitsavers.org/pdf/ibm/370/OS_VS/sort_merge/LY33-8042-6_OS_VS_Sort-Merge_Logic_197909.pdf
>
> Looks to me like they paid a lot of attention to arranging the data on disk
> to minimize rotational and seek delay, but they moved the data into the
> output buffer, no chaining funny business.

I spent a little time going over those, and I agree about the disk
arrangement stuff. I didn't spend the time needed to understand it in
detail, and I don't intend to. :-). But they still don't give
information on the "blocksort" algorithm, specifically, how they sort
the data within the blocks, i.e. quicksort, heapsort, etc. I suppose
they could use a version of mergesort, as they have the instructions to
make the tree manipulation fast, but ISTM that the memory space required
for the tree (unsorted data has an average run length of 2) would be
prohibitive. But I really don't know.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<si2ere$i36$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20488&group=comp.arch#20488

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Fri, 17 Sep 2021 09:13:32 -0700
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <si2ere$i36$1@dont-email.me>
References: <k9vTI.36153$Fu2.9688@fx16.iad>
<memo.20210819214956.15456M@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 17 Sep 2021 16:13:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="32ab916d4223fc35344731d5c0028ecb";
logging-data="18534"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19dnuLyMUvrIKGcUgDZ4Zl9sVtjz74zoCk="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.1.0
Cancel-Lock: sha1:4+7EQ7Xcle+HgRpPnvlGuvfO1pA=
In-Reply-To: <memo.20210819214956.15456M@jgd.cix.co.uk>
Content-Language: en-US
 by: Stephen Fuld - Fri, 17 Sep 2021 16:13 UTC

On 8/19/2021 1:48 PM, John Dallman wrote:
> In article <k9vTI.36153$Fu2.9688@fx16.iad>,
> ThatWouldBeTelling@thevillage.com (EricP) wrote:
>> Michael S wrote:
>>> I don't believe that in last 35-40 years IBM ever published CPU
>>> benchmarks that can be compared directly against IBM's competitors
>>> or even vs other IBM-made computers.
>> Probably because it doesn't help their sales so why do it.
>> Does anybody port _to_ zSeries?
>
> Not often, but it happens. Linux runs on zSeries, so the software that
> comes bundled with RHEL and SLES has all been ported. I took a look in
> the spring out of curiosity, and it isn't all that strange, although a
> lot of the nomenclature is weird. I stopped when I discovered the entry
> price for a z15.

I am pretty sure that running your company's modeling software is not
the target market for the Z series.

For some ideas about what the target market is, see the comments to the
Ars Technica article on Telum

https://arstechnica.com/gadgets/2021/09/ibms-telum-mainframe-processor-introduces-new-cache-architecture/?comments=1

BTW, I suspect the differences between IBM's design rules and those for
commodity processors may account for the "mercurial cores" we discussed
in another thread here.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: OoO S/360 descendants

<memo.20210917203610.17860B@jgd.cix.co.uk>


https://www.novabbs.com/devel/article-flat.php?id=20490&group=comp.arch#20490

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: OoO S/360 descendants
Date: Fri, 17 Sep 2021 20:36 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 13
Message-ID: <memo.20210917203610.17860B@jgd.cix.co.uk>
References: <si2ere$i36$1@dont-email.me>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="3eb1f83a374bc7b52f59f9b5736b351c";
logging-data="14701"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Gu59NvmQC274abz2yR/Ev7O5YeaWKqg0="
Cancel-Lock: sha1:j3wW0haOTkz3FslpenQf/0O9AFc=
 by: John Dallman - Fri, 17 Sep 2021 19:36 UTC

In article <si2ere$i36$1@dont-email.me>, sfuld@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> > I took a look in the spring out of curiosity, and it isn't all
> > that strange, although a lot of the nomenclature is weird. I
> > stopped when I discovered the entry price for a z15.
> I am pretty sure that running your company's modeling software is
> not the target market for the Z series.

Me too. I first looked at it because it's the only big-endian platform
whose market share is not plummeting.

John

Re: OoO S/360 descendants

<16b04a52-04ed-47a7-a5c3-5c138666c673n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20493&group=comp.arch#20493

X-Received: by 2002:a05:6214:734:: with SMTP id c20mr17533828qvz.13.1631993326043;
Sat, 18 Sep 2021 12:28:46 -0700 (PDT)
X-Received: by 2002:a4a:e7d4:: with SMTP id y20mr14051853oov.16.1631993325805;
Sat, 18 Sep 2021 12:28:45 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 18 Sep 2021 12:28:45 -0700 (PDT)
In-Reply-To: <si2ere$i36$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.224; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.224
References: <k9vTI.36153$Fu2.9688@fx16.iad> <memo.20210819214956.15456M@jgd.cix.co.uk>
<si2ere$i36$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <16b04a52-04ed-47a7-a5c3-5c138666c673n@googlegroups.com>
Subject: Re: OoO S/360 descendants
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 18 Sep 2021 19:28:46 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 34
 by: Michael S - Sat, 18 Sep 2021 19:28 UTC

On Friday, September 17, 2021 at 7:13:36 PM UTC+3, Stephen Fuld wrote:
> On 8/19/2021 1:48 PM, John Dallman wrote:
> > In article <k9vTI.36153$Fu2....@fx16.iad>,
> > ThatWould...@thevillage.com (EricP) wrote:
> >> Michael S wrote:
> >>> I don't believe that in last 35-40 years IBM ever published CPU
> >>> benchmarks that can be compared directly against IBM's competitors
> >>> or even vs other IBM-made computers.
> >> Probably because it doesn't help their sales so why do it.
> >> Does anybody port _to_ zSeries?
> >
> > Not often, but it happens. Linux runs on zSeries, so the software that
> > comes bundled with RHEL and SLES has all been ported. I took a look in
> > the spring out of curiosity, and it isn't all that strange, although a
> > lot of the nomenclature is weird. I stopped when I discovered the entry
> > price for a z15.
>
> I am pretty sure that running your company's modeling software is not
> the target market for the Z series.
>
> For some ideas about what the target market is, see the comments to the
> Ars Technica article on Telum
>
> https://arstechnica.com/gadgets/2021/09/ibms-telum-mainframe-processor-introduces-new-cache-architecture/?comments=1
>
> BTW, I suspect the differences between IBM's design rules and those for
> commodity processors may account for the "mercurial cores" we discussed
> in another thread here.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

My guess is that "mercurial cores" are caused by out-of-spec PSUs.
My other guess is that on Dell/HPE/Lenovo servers it happens much less often
than on servers that Google orders directly from QCT, even when both types
of servers are manufactured on the same floor.

Re: Could we build a better 6502?

<3fd3af12-9e1c-4e41-8c67-a67c8df73921n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=21216&group=comp.arch#21216

X-Received: by 2002:ac8:7319:: with SMTP id x25mr20015252qto.147.1634395144340;
Sat, 16 Oct 2021 07:39:04 -0700 (PDT)
X-Received: by 2002:a05:6808:1806:: with SMTP id bh6mr4738936oib.0.1634395144136;
Sat, 16 Oct 2021 07:39:04 -0700 (PDT)
Path: rocksolid2!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 16 Oct 2021 07:39:03 -0700 (PDT)
In-Reply-To: <sde7eg$hgb$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3a:e00:2d8c:b6a1:c50f:c954;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3a:e00:2d8c:b6a1:c50f:c954
References: <sde7eg$hgb$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3fd3af12-9e1c-4e41-8c67-a67c8df73921n@googlegroups.com>
Subject: Re: Could we build a better 6502?
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 16 Oct 2021 14:39:04 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 9
 by: Quadibloc - Sat, 16 Oct 2021 14:39 UTC

While I felt that a "better 6502" in the sense mentioned - a more
optimized minimal design - probably was not achievable to any
significant extent, other kinds of "better 6502" have been discussed
in this thread.

In another newsgroup, I learned of this one:

http://www.e-basteln.de/computing/65f02/65f02/#supported-host-systems

John Savard
