
devel / comp.arch / Re: Squeezing Those Bits: Concertina II

Thread index (7 pages): "Squeezing Those Bits: Concertina II", started by Quadibloc, with replies from MitchAlsup, Stephen Fuld, Ivan Godard, Terje Mathisen, BGB, Anton Ertl, John Dallman, EricP, Anssi Saari, Marcus, Stefan Monnier, George Neuner, Thomas Koenig, and JimBrakefield.

Re: Squeezing Those Bits: Concertina II

<2021Jun4.133953@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17379&group=comp.arch#17379

 by: Anton Ertl - Fri, 4 Jun 2021 11:39 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>And if we take these metrics at face value, long superinstructions
>(i.e., that combine many simple instructions, e.g., a whole basic
>block) seem to be optimal. But they typically do not occur in hot
>code of other programs (unless another program contains pretty much
>the same basic block in hot code, which is not the case for most of
>the hot basic blocks in the corpus), so most of these
>superinstructions go to waste; we typically limit the number of
>superinstructions to limit the compile time and the size of the
>interpreter, so wasted superinstructions are a problem. In the end,
>we do not get good superinstructions out of this approach.

To make this more concrete, consider the selection of 400
superinstructions. If we select the dynamically most frequent
sequences, among these superinstructions will be maybe the 300 most
frequently executed basic blocks, and 100 superinstructions that
happen to statically occur several times in, say, the basic blocks with
ranks 301-500. If 250 of the 300 most frequently executed basic blocks
are not repeated in other programs, this strategy means that you pay
the price for 400 superinstructions, but get the benefit of only 150.
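
As a rough illustration of that accounting (a toy sketch using the counts
above, not anything from an actual implementation):

    selected   = 400                      # superinstructions we pay for
    hot        = 300                      # picked because they top the dynamic counts
    hot_unique = 250                      # of those, blocks not repeated in other programs
    generic    = selected - hot           # 100 picked for static repetition (ranks ~301-500)

    useful = (hot - hot_unique) + generic # blocks that other programs actually reuse
    print("pay for", selected, "- benefit from", useful)   # pay for 400 - benefit from 150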

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<cf4be607-9728-431a-8651-4e5856bf2e86n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17380&group=comp.arch#17380

 by: MitchAlsup - Fri, 4 Jun 2021 12:52 UTC

On Friday, June 4, 2021 at 12:39:25 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Really fast machines 12-gates
> > more typical machines 16-gates
> [...]
>
> Thanks for the numbers!
>
> What kind of gates are being counted here? An inverter certainly
> has a lower delay than an XOR gate, for example. Are the
> delay values converted to some normalized form, like a NAND gate?

4-input NAND gate driving 4 inputs of another 4-input NAND gate,
or
3-input NOR gate driving 4 inputs of a 4-input NAND gate.
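
So "N gates per cycle" means N of those normalized delays of logic between
flip-flops. As a purely illustrative calibration (the per-gate figure is an
assumption, not from the post): at roughly 15 ps per such gate, 16 gates/cycle
is about 240 ps, or ~4.2 GHz, and 12 gates/cycle is about 180 ps, or ~5.6 GHz,
before any flip-flop overhead.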

Re: Squeezing Those Bits: Concertina II

<9b4bc6c4-8071-4a6f-b884-061155c06685n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17381&group=comp.arch#17381

 by: MitchAlsup - Fri, 4 Jun 2021 12:55 UTC

On Friday, June 4, 2021 at 3:41:56 AM UTC-5, Quadibloc wrote:
> On Thursday, June 3, 2021 at 11:39:25 PM UTC-6, Thomas Koenig wrote:
>
> > What kind of gates are being counted here? An inverter certainly
> > has a lower delay than an XOR gate, for example. Are the
> > delay values converted to some normalized form, like a NAND gate?
> Yes, you have it. The delay required by a NAND gate is considered to be
> 'one gate delay', and so an XOR gate counts as two layers of gates.
<
If you have both polarities of a signal, the 3-input XOR gate is 1 gate of delay.
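(To spell that out - this is a gloss, not part of the post: with both
polarities of a, b, c available, the 3-input XOR can be written directly as
the four-term sum of products

    xor3(a,b,c) = a'b'c + a'bc' + ab'c' + abc

which can be realized as one complex AND-OR gate instead of a cascade of
2-input XORs.)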
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<memo.20210604135524.5316Q@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=17382&group=comp.arch#17382

 by: John Dallman - Fri, 4 Jun 2021 12:55 UTC

In article <5c59c673-da7b-49ad-9877-cf3a2ec313c4n@googlegroups.com>,
jsavard@ecn.ab.ca (Quadibloc) wrote:

> And, indeed, while Intel's original small Atom cores were in-order,
> they eventually switched over to even giving those a simple
> out-of-order capability, since transistor densities had increased,
> and the original Atom cores were perceived as having very poor
> performance.
>
> And yet people didn't complain about the performance of the
> 486 DX. So I would be inclined to blame software bloat.

The 486DX was the fastest thing around in the Windows space for several
years. You could not get anything better.

The Atoms were compared against much faster Intels and AMDs. That made
them look slow, even if they were faster than the 486 (I don't know if
they were - never used 'em).

John

Re: Squeezing Those Bits: Concertina II

<F%puI.17928$jf1.10608@fx37.iad>

https://www.novabbs.com/devel/article-flat.php?id=17383&group=comp.arch#17383

 by: EricP - Fri, 4 Jun 2021 13:36 UTC

Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> With a 20 gate per cycle design point, one can build a 6-wide reservation
>>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
>>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
>>> pipeline.
>> If we count 5-gates of delay for the clock-boundary's flip-flop, that
>> means:
>>
>> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
>>
>>> At 16 cycles this necessarily becomes 9-10 stages.
>>> At 12 gates this necessarily becomes 12-15 stages.
>> And that gives:
>>
>> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
>> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>>
>> So at least in terms of the latency of a single instruction going
>> through the whole pipeline, the gain of targeting a lower-clocked
>> design seems clear ;-)
>
> But that's not particularly relevant. You want to minimize the total
> execution time of a program; and, with a few exceptions (e.g., PAUSE),
> one instruction does not wait until the previous instruction has left
> the pipeline; if it did, there would be no point in pipelining.
>
> Instead, a data-flow instruction waits until its operands are
> available (and the functional unit is available). For simple ALU
> operations, this typically takes 1 cycle (exceptions:
> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
> made deep pipelines a win, until CPUs ran into power limits ~2005.
>
> - anton

The relevance of latency comes in, I think, when one considers the effect
of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
injects a bubble whose size is independent of the number of stages.

If we go from 6 stages, 20+5 gates to 12 stages, 12+5 gates
we increase the clock by a factor of (20+5)/(12+5) = 1.47.
But a bubble now takes 2x as many clocks to recover from.
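
To put rough numbers on that trade-off (a toy model, not EricP's: assume 1
instruction per cycle of useful work, one full pipeline refill per mispredict,
and a made-up 2% mispredict rate):

    def runtime(insts, mispredicts, stages, gates_per_cycle, ff=5):
        cycle  = gates_per_cycle + ff            # clock period, in gate delays
        cycles = insts + mispredicts * stages    # each refill costs ~one cycle per stage
        return cycles * cycle                    # total time, in gate delays

    n, miss = 1_000_000, 20_000                  # assumed 2% mispredict rate
    shallow = runtime(n, miss, stages=6,  gates_per_cycle=20)
    deep    = runtime(n, miss, stages=12, gates_per_cycle=12)
    print(deep / shallow)                        # ~0.75: deeper still wins here

Under this toy model the deeper pipe keeps most of its 1.47x clock advantage
at a 2% miss rate and only falls behind once refills cost on the order of 15%
of instructions - but every refill eats into the gain, which is the point.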

Also adding pipeline stages doesn't change the speed of data cache.
Adding pipeline stages to increase the frequency means we somewhat
decrease the unused cache access time between loads and stores.
However if the D$ cache access saturates, stages should have minimal impact.

It also depends on how one measures performance.
More stages means higher frequency means higher potential issued MIPS.
If instead we count retired MIPS, to take into account bubbles and
any back pressure (stall) effects of D$ cache access,
I would expect to see much less actual benefit.

Re: Squeezing Those Bits: Concertina II

<0a3ff2e4-2363-4fdd-bbee-b3a54124f909n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17384&group=comp.arch#17384

 by: Quadibloc - Fri, 4 Jun 2021 15:00 UTC

On Friday, June 4, 2021 at 6:55:30 AM UTC-6, John Dallman wrote:

> The Atoms were compared against much faster Intels and AMDs. That made
> them look slow, even if they were faster than the 486 (I don't know if
> they were - never used 'em).

That's true enough, but I think it was more than their being slow _by comparison_.

Except perhaps indirectly.

Since the other chips were so fast, software used that speed to provide extra
functionality. The Atom would have been fast enough to run Windows 3.1
and the software for it at a speed no one would be inclined to complain about.

John Savard

Re: Squeezing Those Bits: Concertina II

<a52db472-bdca-4b91-b838-48a7aafea082n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17385&group=comp.arch#17385

 by: Quadibloc - Fri, 4 Jun 2021 15:01 UTC

On Friday, June 4, 2021 at 6:55:17 AM UTC-6, MitchAlsup wrote:

> If you have both polarities of a signal, the 3-input XOR gate is 1 gate of delay.

Yes, but since we don't use ECL any more (except for a low-power variant in
certain specialized chips) that usually is not the case.

John Savard

Re: Squeezing Those Bits: Concertina II

<2021Jun4.160318@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17386&group=comp.arch#17386

 by: Anton Ertl - Fri, 4 Jun 2021 14:03 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>And, indeed, while Intel's original small Atom cores were in-order,
>they eventually switched over to even giving those a simple
>out-of-order capability, since transistor densities had increased,
>and the original Atom cores were perceived as having very poor
>performance.

The OoO AMD E-450 and Celeron J1900 are indeed about twice as fast in
my measurements as the in-order Atom 330.

Silvermont (OoO "Atom") and Bobcat (AMD's low-power core, OoO) are
indeed twice as fast as Bonnell (in-order "Atom") in my measurements.

I don't think that transistor density was the issue: Intel could do
big dual-core CPUs on a single chip in 2008. I think they believed
that they could save power with in-order, and then AMD showed them
that one can also do low-power OoO.

TDP  Model          uarch       proc  Turbo    cores                execution
 8W  Atom 330       Bonnell     45nm  1.6GHz   dual-core            in-order
10W  Atom D2700     Bonnell     32nm  2.13GHz  dual-core+graphics   in-order
10W  Celeron J1900  Silvermont  22nm  2.41GHz  quad-core+graphics   OoO
18W  AMD E-450      Bobcat      40nm  1.65GHz  dual-core+graphics   OoO
15W  AMD A8-6410    Puma        28nm  2.4GHz   quad-core+graphics   OoO

>And yet people didn't complain about the performance of the
>486 DX. So I would be inclined to blame software bloat.

People certainly want to run software on current hardware that would
be too slow on a 486. Some of the increased resource requirements are
bloat, some are better functionality.

But unlike the 486, Bonnell was not the best-performing CPU for
desktops on its introduction. Instead, it was released into a world
where software had been written for Athlon 64 and Core 2 Duo CPUs.
And compared to those, it was slow. Below are LaTeX benchmark results
listed in the order of CPU release dates.

                                                                     run time (s)
2003 Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit)             0.76
2006 Core 2 Duo E6600, 2400MHz, 4MB L2, Debian Etch (64-bit)              0.592
2008 Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit          2.368
2011 AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit       1.216
2013 Celeron J1900 (Silvermont) 2416MHz (Shuttle XS35V4) Ubuntu 16.10     1.052
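
From those numbers, Bonnell needs 4x the run time of the 2006 Core 2 Duo
(2.368 s vs 0.592 s) and roughly 3x that of the 2003 Athlon 64
(2.368 s vs 0.76 s).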

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<2021Jun4.190940@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17391&group=comp.arch#17391

 by: Anton Ertl - Fri, 4 Jun 2021 17:09 UTC

jgd@cix.co.uk (John Dallman) writes:
>The Atoms were compared against much faster Intels and AMDs. That made
>them look slow, even if they were faster than the 486 (I don't know if
>they were - never used 'em).

LaTeX benchmark:
run time (s)
- Intel 486, 66 MHz, 256K L2-Cache, Redhat-Linux (pcs) 93.4
- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323

24x the clock rate, 40 times the performance.
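
(That is 1600 MHz / 66 MHz ~ 24 for clock and 93.4 s / 2.323 s ~ 40 for run
time, i.e. roughly 1.7x the per-clock throughput of the 486.)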

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<b338ea39-397f-43c2-828a-14161e0964fan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17396&group=comp.arch#17396

 by: MitchAlsup - Fri, 4 Jun 2021 17:38 UTC

On Friday, June 4, 2021 at 8:36:40 AM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > Stefan Monnier <mon...@iro.umontreal.ca> writes:
> >>> With a 20 gate per cycle design point, one can build a 6-wide reservation
> >>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
> >>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
> >>> pipeline.
> >> If we count 5-gates of delay for the clock-boundary's flip-flop, that
> >> means:
> >>
> >> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
> >>
> >>> At 16 cycles this necessarily becomes 9-10 stages.
> >>> At 12 gates this necessarily becomes 12-15 stages.
> >> And that gives:
> >>
> >> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
> >> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
> >>
> >> So at least in terms of the latency of a single instruction going
> >> through the whole pipeline, the gain of targeting a lower-clocked
> >> design seems clear ;-)
> >
> > But that's not particularly relevant. You want to minimize the total
> > execution time of a program; and, with a few exceptions (e.g., PAUSE),
> > one instruction does not wait until the previous instruction has left
> > the pipeline; if it did, there would be no point in pipelining.
> >
> > Instead, a data-flow instruction waits until its operands are
> > available (and the functional unit is available). For simple ALU
> > operations, this typically takes 1 cycle (exceptions:
> > Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
> > made deep pipelines a win, until CPUs ran into power limits ~2005.
> >
> > - anton
<
> The relevance of latency comes in, I think, when one considers the effect
> of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
> injects a bubble whose size is independent of the number of stages.
<
Make that number of stages TIMES width of execution.
{and yes, I saw that you wrote that in negative context}
>
> If we go from 6 stages, 20+5 gates to 12 stages, 12+5 gates
> we increase the clock by a factor of (20+5)/(12+5) = 1.47.
> But a bubble now takes 2x as many clocks to recover from.
<
And whereas branch predictors continue to get better, L1 cache hit
ratios are essentially frozen by size and sets. So what started out as
branch prediction limited (1990) ends up as L2 latency limited (2000+).
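As a rough illustration (with assumed numbers, not Mitch's): at a 95% L1 hit
rate and a 12-cycle L2, the average load carries 0.05 x 12 = 0.6 extra cycles.
Double the clock and that same L2, fixed in nanoseconds, looks like ~24 cycles,
so the penalty grows to ~1.2 cycles per load even though the hit rate hasn't
moved, while better predictors keep shaving the mispredict side.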
>
> Also adding pipeline stages doesn't change the speed of data cache.
<
What changes the throughput of the data cache is ports. If you can
perform 4 accesses per cycle to a 4-way banked cache, throughput
goes way up and you quickly realize that you have to adequately port
the L2 similarly. You want simultaneous misses in L1 to be handled
simultaneously in the L2 !!
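A back-of-the-envelope sketch of the banking argument (a toy model, assuming
addresses hit banks uniformly at random and each bank serves one access per
cycle):

    def served_per_cycle(n_banks, k_accesses):
        # expected number of distinct banks touched by k independent accesses;
        # accesses that collide on a bank serialize into later cycles
        return n_banks * (1 - (1 - 1/n_banks) ** k_accesses)

    print(served_per_cycle(4, 4))   # ~2.73 of 4 accesses served per cycle
    print(served_per_cycle(8, 4))   # ~3.30 - more banks, fewer conflicts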
<
We worried a lot about the number of wires in the 1990s, but even GPUs
get 10 layers of metal, and IBM is using 17 layers in its modern mainframes.
With this wire resource, there is little reason NOT to bank the cache hierarchy.
<
{Aside: many GPUs run 1024 wires from and another 1024 wires to the L1 cache
(not including the addresses and control). These busses pass data back
and forth in the same "beat" structure as the SIMT calculations.}
<
> Adding pipeline stages to increase the frequency means we somewhat
> decrease the unused cache access time between loads and stores.
> However if the D$ cache access saturates, stages should have minimal impact.
<
Add ports and AGEN width to eliminate saturation.
>
> It also depends on how one measures performance.
<
There is only one sane metric here: wall clock time for the entire application.
<
> More stages means higher frequency means higher potential issued MIPS.
> If instead we count retired MIPS, to take into account bubbles and
<
Only compiler and CPU architects should be able to see unretired statistics.
<
> any back pressure (stall) effects of D$ cache access,
> I would expect to see much less actual benefit.

Re: Squeezing Those Bits: Concertina II

<21f5f7af-9bcc-4549-ad0c-82c17bff5dc1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17397&group=comp.arch#17397

 by: MitchAlsup - Fri, 4 Jun 2021 17:42 UTC

On Friday, June 4, 2021 at 10:01:51 AM UTC-5, Quadibloc wrote:
> On Friday, June 4, 2021 at 6:55:17 AM UTC-6, MitchAlsup wrote:
>
> > If you have both polarities of a signal, the 3-input XOR gate is 1 gate of delay.
<
> Yes, but since we don't use ECL any more (except for a low-power variant in
> certain specialized chips) that usually is not the case.
<
After a flip-flop, you ALWAYS have both polarities !!
<
In certain places (like carry chains) it is faster to compute both true and complement
signaling at only minor area and power cost.
<
In multiplier trees, computing both true and complement 3-in XORs saves 1/3rd
of the fall through time of the entire tree !
<
So, some places it is worth it.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<f4uuI.24831$G11.15159@fx01.iad>

https://www.novabbs.com/devel/article-flat.php?id=17399&group=comp.arch#17399

 by: EricP - Fri, 4 Jun 2021 18:14 UTC

Anton Ertl wrote:
> jgd@cix.co.uk (John Dallman) writes:
>> The Atoms were compared against much faster Intels and AMDs. That made
>> them look slow, even if they were faster than the 486 (I don't know if
>> they were - never used 'em).
>
> LaTeX benchmark:
> run time (s)
> - Intel 486, 66 MHz, 256K L2-Cache, Redhat-Linux (pcs) 93.4
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
>
> 24x the clock rate, 40 times the performance.
>
> - anton

Did they restrict the benchmark to one core?

"The Atom 330 is actually two identical Atom 230 dies packaged together..."
https://en.wikichip.org/wiki/intel/atom/330

The 330 is Bonnell microarchitecture, in-order, superscalar, 16-19 stages.
Cache for _each_ of the 230 dies is:
L1I$: 32 KB 8 way, L1D$: 24 KB 6 way, L2$: 512 KB 8 way
https://en.wikichip.org/wiki/intel/microarchitectures/bonnell

The 80486DX2-66 has a 5-stage pipeline.
On-chip cache: 8 KB unified (4-way set associative); any L2$ is external.

Re: Squeezing Those Bits: Concertina II

<s9dr3k$59c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17400&group=comp.arch#17400

 by: BGB - Fri, 4 Jun 2021 18:23 UTC

On 6/4/2021 3:44 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> So, I bought a new phone as a replacement, and what kind of fancy new
>> CPU is it running?... Cortex-A53...
>>
>>
>> So, given that apparently phones are still happy enough mostly running
>> on 2-wide in-order superscalar cores, the incentive to go to bigger OoO
>> cores seems mostly limited to "higher end" devices.
>
> Apple uses OoO for both their big cores and their little cores.
>

Granted, I wasn't necessarily talking about Apple here. Most of their
phones are rather expensive and mostly higher-end devices.

I was thinking mostly in terms of phones you find in the $100 - $300
range, which I suspect are probably a lot more common than some $600 -
$1000 iPhone or similar.

Granted, RasPi has gone over to OoO now, and the Raspberry Pi 4 is still
relatively affordable, implying that the cost of the CPU isn't really
the dominant factor here.

> Why ARM's customers still go for in-order is somewhat of a mystery to
> me. We can see on
>
> <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
>
> that the OoO Cortex A75 is more efficient (as well as more performant)
> than the A55 at almost all performance points of the A55, for SPEC2006
> Int+FP. ARM claims that the workloads on their little cores differ
> significantly from that used for producing these kinds of benchmarks,
> and that the A55 has better efficiency there. Unfortunately, they do
> not give any evidence for that, so this may be just the usual
> marketing stuff to make their bad decisions look good. But even if
> they are right, what is the intended application area and what are the
> efficiency needs for your architecture?
>

I was originally designing my ISA mostly with the intent of using it as
a robot controller. Granted, I have mostly been using it to run Doom and
Quake and similar, but there was non-zero overlap.

Similar for the software-rasterized OpenGL: Some of the features used
for the software GL were originally imagined in the context of real-time
image processing tasks.

Even if not exactly blazingly fast (and I don't particularly consider
GLQuake all that playable at single digit framerates), it seems to be
doing reasonably well for what it is. Software Quake is similar, with both
running at single-digit framerates.

Though, I suspect it would fare a lot better (in relation to an ARM
device or similar) if it could run at a similar clock speed (and with
similar memory bandwidth).

Something like a Raspberry Pi kinda holds a bit of an advantage, but the
relative difference seems to be smaller than one might otherwise expect
given the RasPi having a (fairly massive) advantage in terms of clock-speed.

In particular, software-rendered OpenGL on a RasPi still doesn't exactly
make it into the double-digits. Even on my Ryzen, it is still "kinda meh".

By reverse extrapolation, one would expect software GL on something
running at 50MHz to be more in "frames per minute" territory.

Can note that for my Software GL, partly to work around weak
code generation, part of the inner core of the rasterizer was
written in ASM. Whereas on x86 and ARM, it is mostly using a C
version, and some of BJX2's SIMD constructs don't map particularly
closely to those of SSE or NEON.

Though, granted, things are somewhat less impressive when using Doom as the
benchmark, which seems to follow a more linear relationship here.

Things like Dhrystone are more ambiguous, and it seems to look better
with "vintage" stats than modern ones (it appears that the DMIPS value
has inflated over time for comparable hardware).

Though, at the moment it is pulling off ~ 37400 dhrystones/second (~ 21
DMIPS, 0.42 DMIPS/MHz), which appears to put it in similar territory to
a 486DX2-66.
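
(For reference: 37,400 Dhrystones/s over the 1757/s VAX-11/780 baseline is
~21.3 DMIPS, and at the 50 MHz mentioned above that works out to
~0.43 DMIPS/MHz.)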

Though, can note that the benchmark does depend a bit on integer
division and strcmp, neither of which are "particularly" fast in my case
(there are not any specialized instructions for these cases).

Then again, I had noted at one point that when I tried to do a port
of BGBCC to generate code for ARM32, its performance (relative to GCC or
Clang) was pretty much atrocious...

Though, its generated code isn't *that* awful, so it is unclear what the
main factor is, apart from ARM's relative lack of register space meaning
that the generated code consists mostly of LD/ST ops... (*)

So, it is also possible that this could be a factor as well.

*: I suspect a factor here is the use of registers for temporaries: cases
where a variable's value goes through a temporary register rather than
being used directly are not ideal for register pressure. Combine that
with a compiler which isn't really smart enough to realize when temporary
values are no longer needed and can be discarded (so these intermediate
values tend to end up being stored back to the stack frame on the off
chance they are needed later, ...).

With BJX2 having roughly 27 (generic/usable) GPRs, it is able to keep a
lot more stuff in registers, vs ARM32 only having 11.

But, there isn't really a good/easy way to fix some of this.

>> And, it appears this is not entirely recent: these sorts of 2-wide
>> superscalar cores seem to have been dominant in phones and consumer
>> electronics for roughly the past 15-20 years or so.
>
> Not sure what you mean with dominant. OoO cores have been used on
> smartphones since the Cortex-A9, used in, e.g., the Apple A5 (2011).
>
> As for other consumer electronics: If you don't need much performance,
> no need for an expensive OoO core.
>

Dominant, as in, the vast majority are using 2-wide superscalar, rather
than OoO cores. While OoO isn't exactly new, and presumably not that
much more expensive (if it is competitive in terms of area, ...), only a
relative minority of devices use it.

It seems like in the late 90s, consumer electronics / phones / ...
mostly went from single-issue cores to dual-issue, and then just sort of
sat there...

Re: Squeezing Those Bits: Concertina II

<s9dtqg$hbi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17402&group=comp.arch#17402

 by: Marcus - Fri, 4 Jun 2021 19:10 UTC

On 2021-06-04, BGB wrote:
[...]
> Things like Dhrystone are more ambiguous, and it seems to look better
> with "vintage" stats than modern ones (it appears that the DMIPS value
> has inflated over time for comparable hardware).
>
> Though, at the moment it is pulling off ~ 37400 dhrystones/second (~ 21
> DMIPS, 0.42 DMIPS/MHz), which appears to put it in similar territory to
> a 486DX2-66.
>

I would guess that part of what you're measuring is the compiler
maturity level. On my in-order CPU, which does not even have an I$ or
D$ (just a shared single-cycle 32-bit BRAM bus), I got something like
0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
C library functions (memcpy, etc.). Before optimizing the libc routines
I got 0.5 DMIPS/MHz.

With a proper I$ (that I'm currently working on) I expect to get closer
to 1 DMIPS/MHz.
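
For reference, the DMIPS figures quoted above are just Dhrystones/second
divided by the VAX 11/780 baseline of 1757, normalized by clock for
DMIPS/MHz; a quick cross-check against the numbers in this thread:

  /* Sanity check of the figures quoted above (1757 Dhrystones/s is the
     usual VAX 11/780 baseline). */
  #include <stdio.h>
  int main(void) {
      double dhry_per_sec = 37400.0;               /* figure quoted above */
      double mhz          = 50.0;                  /* clock quoted above  */
      double dmips        = dhry_per_sec / 1757.0; /* ~21.3 DMIPS         */
      printf("%.1f DMIPS, %.2f DMIPS/MHz\n", dmips, dmips / mhz);
      return 0;
  }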

> Though, can note that the benchmark does depend a bit on integer
> division and strcmp, neither of which are "particularly" fast in my case
> (there are not any specialized instructions for these cases).
>
>
>
> Then again, I had noted one time though, that when I tried to do a port
> of BGBCC to generate code for ARM32, its performance (relative to GCC or
> Clang) was pretty much atrocious...
>
> Though, its generated code isn't *that* awful, so it is unclear what the
> main factor, apart from ARM's relative lack of register space meaning
> that the generated code consists mostly of LD/ST ops... (*)
>
> So, it is also possible that this could be a factor as well.
>
>
> *: I suspect a factor here is using registers for temporaries, where
> cases where a variables' value goes through a temporary register rather
> than being able to used directly is not ideal for register pressure.
> Combined with a compiler which isn't really smart enough to realize when
> temporary values are no longer needed and can be discarded (so these
> intermediate values from temporaries tend to frequently end up being
> stored back to the stack frame in the off chance they are needed later,
> ...).
>
> With BJX2 having roughly 27 (generic/usable) GPRs, it is able to keep a
> lot more stuff in registers, vs ARM32 only having 11.
>
> But, there isn't really a good/easy way to fix some of this.
>
>
>>> And, it appears this is not entirely recent: these sorts of 2-wide
>>> superscalar cores seem to have been dominant in phones and consumer
>>> electronics for roughly the past 15-20 years or so.
>>
>> Not sure what you mean with dominant.  OoO cores have been used on
>> smartphones since the Cortex-A9, used in, e.g., the Apple A5 (2011).
>>
>> As for other consumer electronics: If you don't need much performance,
>> no need for an expensive OoO core.
>>
>
> Dominant, as-in, the vast majority are using 2-wide superscalar, rather
> than OoO cores. While OoO isn't exactly new, and presumably not that
> much more expensive (if it is competitive in terms of area, ...), only a
> relative minority of devices use it.
>
> It seems like in the late 90s, consumer electronics / phones / ...
> mostly went from single-issue cores to dual-issue, and then just sort of
> sat there...
>

Re: Squeezing Those Bits: Concertina II

<s9e5r2$6j3$1@dont-email.me>

 by: BGB - Fri, 4 Jun 2021 21:26 UTC

On 6/4/2021 2:10 PM, Marcus wrote:
> I would guess that part of what you're measuring is the compiler
> maturity level. On my in-order CPU that does not even have I$ nor D$
> (but a shared single cycle 32-bit BRAM bus), I got something like
> 0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
> C library functions (memcpy etc). Before optimizing the libc routines
> I got 0.5 DMIPS/MHz.
>
> With a proper I$ (that I'm currently working on) I expect to get closer
> to 1 DMIPS/MHz.
>

In terms of stats at present:

I have an L1 I$ and D$ (both 16K, direct-mapped):
  Memcpy (L1): 250 MB/s
  Memset (L1): 320 MB/s
L2 cache is 128K, 2-way set-associative:
  Memcpy (L2): ~50 MB/s
  Memset (L2): ~90 MB/s
RAM (DDR2, 50MHz):
  Memcpy: ~9 MB/s
  Memset: ~17 MB/s

Memory access, if properly pipelined, is 1 cycle for an L1 hit.
It is 2 or 3 cycles if an interlock stall occurs (e.g., trying to use a
value directly following a load).

There is also a branch-predictor and similar, ...

My compiler does have a few weaknesses:
It isn't really able to use the VLIW capabilities effectively, so most of
what it produces is scalar code;
It isn't super great at avoiding things like needless MOV instructions
or Load/Store ops;
It always creates stack frames, even for trivial leaf functions (this
could be changed, but would add a lot of complexity to the C compiler,
and would require the codegen to first prove that the function doesn't
contain any hidden function calls or similar);
It doesn't perform inlining, at all;
A certain subset of operators are implemented effectively using
"call-threading" (i.e., rather than the compiler doing it itself, it
spits out hidden calls into the C runtime, *1);
...

*1: This generally happens for operators which don't exist natively in
the ISA and which can't be implemented effectively within a short
instruction sequence. Things like integer divide, modulo, and large
multiply generally fall into this category. Large arrays, VLAs, large
struct variables, copying or returning a struct by value, etc., may
also generate hidden runtime calls. Some vector ops also involve runtime
calls, and some extensions (such as the __variant type, __float128, ...)
are implemented almost entirely via runtime calls.
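
To make the "call-threading" case concrete, a small made-up example (the
helper name __rt_divs64 is hypothetical, not BGBCC's actual runtime
symbol): the divide is lowered as if the source had called a runtime
function directly.

  /* Made-up illustration of call-threading: the operator becomes a hidden
     call into the C runtime rather than an inline instruction sequence.
     __rt_divs64 is an invented name standing in for the real helper. */
  #include <stdio.h>

  static long long __rt_divs64(long long a, long long b) {
      return a / b;                     /* stand-in for a software divide */
  }

  int main(void) {
      long long x = 1000, y = 7;
      long long z1 = x / y;             /* what the source says            */
      long long z2 = __rt_divs64(x, y); /* roughly what the compiler emits */
      printf("%lld %lld\n", z1, z2);
      return 0;
  }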

As noted, its code generation tends to go through a stack model, which
in turn uses temporary variables.

So:

  z=3*x+y;

Might be compiled as (pseudocode):

  PUSH 3
  LOAD x
  BINOP '*'
  LOAD y
  BINOP '+'
  STORE z

Or, in something closer to its original notation:

  3 $x * $y + =z

Which might become, effectively (C-like pseudocode):

  _t0_1i = 3;
  _t1_1i = x;
  _t0_2i = _t0_1i * _t1_1i;
  _t1_2i = y;
  _t0_3i = _t0_2i + _t1_2i;
  z = _t0_3i;

But, then "optimized" back to:

  _t0_2i = x * 3;
  z = _t0_2i + y;

But, not always. If the types don't match exactly, there might be
left-over type-conversion ops, say:

  _t0_1 = (int)3;
  _t1_1 = (int)x;

Or, say, the values are computed as "int" but the destination is "long":

  _t0_3i = _t0_2i + _t1_2i;
  _t0_4l = (long)_t0_3i;
  z = _t0_4l;

These cases may prevent the forwarding, but don't otherwise change the
value. The result is the occasional needless MOV or EXTS.L instruction
(which also increases register pressure).

The codegen backend then does a more or less direct translation of this
into machine-code instructions, with a register allocator that maps
variables temporarily onto CPU registers (they are loaded on demand from
memory, and written back to memory at the end of the current basic
block). The register allocator may also evict (and write back) a
register's value if it needs to access another variable and no
unassigned registers are left.

A variable may also be statically assigned to a CPU register, in which
case no memory write-back occurs and the same variable maps to the same
register throughout the entire function. This only works for local
variables within a certain range of primitive types, and only up to a
certain maximum number of variables in any given function. If the
"register" keyword is used, it adds a fairly large weight in favor of
the variable being picked for this.
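
For what it's worth, the allocation scheme described above boils down to
something like the following sketch (not BGBCC's actual code; the names,
the victim choice, and the tiny register count are all invented for
illustration):

  /* Load on demand, evict + write back when full, flush at the end of the
     basic block.  Statically-assigned variables would be exempt from the
     flush; that part is omitted here. */
  #include <stdio.h>

  #define NREGS 4                  /* tiny on purpose; BJX2 has ~27 usable */

  static int reg_var[NREGS];       /* which variable id a register holds  */
  static int reg_dirty[NREGS];     /* needs write-back to the stack frame */

  static void emit_load(int v, int r)  { printf("  LOAD  v%d -> r%d\n", v, r); }
  static void emit_store(int v, int r) { printf("  STORE r%d -> v%d\n", r, v); }

  /* Give 'var' a register, evicting (and writing back) a victim if needed. */
  static int get_reg(int var) {
      int r, free_r = -1;
      for (r = 0; r < NREGS; r++) {
          if (reg_var[r] == var) return r;   /* already resident       */
          if (reg_var[r] < 0) free_r = r;    /* remember a free slot   */
      }
      r = (free_r >= 0) ? free_r : 0;        /* simplistic victim pick */
      if (free_r < 0 && reg_dirty[r])
          emit_store(reg_var[r], r);         /* write back the evictee */
      emit_load(var, r);                     /* load on demand         */
      reg_var[r] = var;
      reg_dirty[r] = 0;
      return r;
  }

  /* At a basic-block boundary, write back and forget everything. */
  static void end_basic_block(void) {
      for (int r = 0; r < NREGS; r++) {
          if (reg_var[r] >= 0 && reg_dirty[r]) emit_store(reg_var[r], r);
          reg_var[r] = -1;
          reg_dirty[r] = 0;
      }
  }

  int main(void) {
      for (int r = 0; r < NREGS; r++) reg_var[r] = -1;
      for (int v = 0; v < 6; v++)            /* touch more vars than regs  */
          reg_dirty[get_reg(v)] = 1;         /* pretend each one is written */
      end_basic_block();
      return 0;
  }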


Re: Squeezing Those Bits: Concertina II

<sm0a6o5mia9.fsf@lakka.kapsi.fi>

 by: Anssi Saari - Fri, 4 Jun 2021 21:28 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:

> On Friday, June 4, 2021 at 3:07:37 AM UTC-6, Anton Ertl wrote:
>
>> Apple uses OoO for both their big cores and their little cores.
>
> And, indeed, while Intel's original small Atom cores were in-order,
> they eventually switched over to even giving those a simple
> out-of-order capability, since transistor densities had increased,
> and the original Atom cores were percieved as having very poor
> performance.

I'm actually retiring an old Atom system. D510 CPU, Bonnell uarch, 45
nm, dual cores, 1.67 GHz. Early last decade these sold for $60 and that
included a motherboard.

It has served as a little file server and for that it's fine. But things
like a web browser are pretty frustrating: even starting one is slow, let
alone rendering any pages. Any crypto likewise. Even a remote desktop
tool like x2go bogs down when starting up; that's apparently because
some parts of it are shell scripts or Perl.

I replaced it with the cheapest recent Intel CPU thing I could find, a
Celeron G5900 (Comet Lake, 14 nm, dual cores, 3.4 GHz). It runs rings
around the old Atom.

> And yet people didn't complain about the performance of the
> 486 DX. So I would be inclined to blame software bloat.

I don't know, I seem to recall decoding and showing jpegs was pretty
slow on a 486. MP3 audio decoding in software took a Pentium or at least
a fairly fast 486 and highly optimized software. Crappy MPEG-1 video
needed a hardware decoder card... Word for Windows 2.0 ran fine.

Re: Squeezing Those Bits: Concertina II

<6b323233-0b86-4b3a-b8b3-3b6652a60275n@googlegroups.com>

 by: Quadibloc - Fri, 4 Jun 2021 21:48 UTC

On Friday, June 4, 2021 at 11:38:25 AM UTC-6, MitchAlsup wrote:

> Only compiler and CPU architects should be able to see unretired statistics.

I wouldn't be quite as strict as that, although I agree that it should not be possible
for malware to read this kind of information about programs to use it as a side
channel and so on.

It's enough if you have to turn on the visibility of this stuff in the BIOS before
booting up. Having to prove you're a compiler or CPU architect in order to
buy a special edition chip at the computer store is just too complicated.

I suppose the chipmaker could ship you a special cryptographic key with which
to flash the microcode too, but I think that is also too complicated.

John Savard

Re: Squeezing Those Bits: Concertina II

<2021Jun4.234019@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Fri, 4 Jun 2021 21:40 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> jgd@cix.co.uk (John Dallman) writes:
>>> The Atoms were compared against much faster Intels and AMDs. That made
>>> them look slow, even if they were faster than the 486 (I don't know if
>>> they were - never used 'em).
>>
>> LaTeX benchmark:
>> run time (s)
>> - Intel 486, 66 MHz, 256K L2-Cache, Redhat-Linux (pcs) 93.4
>> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
>>
>> 24x the clock rate, 40 times the performance.
>>
>> - anton
>
>Did they restrict the benchmark to one core?

No, but it's a single-threaded benchmark.

>The 330 is Bonnell microarchitecture, in-order, superscalar, 16-19 stages.

Interesting that they had such a long pipeline and yet a relatively low
clock.

>The 80486DX-66 is 5 stage pipeline.
>On-chip cache (probably direct mapped) I$: 8KB, D$: 16 KB, L2$ is external.

The 486 has a unified L1 cache. On the 486 DX2/66 the L1 cache is 8KB,
IIRC 4-way set-associative.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<s9ea03$30n$1@dont-email.me>

 by: BGB - Fri, 4 Jun 2021 22:37 UTC

On 6/4/2021 5:00 AM, Quadibloc wrote:
> On Friday, June 4, 2021 at 3:07:37 AM UTC-6, Anton Ertl wrote:
>
>> Apple uses OoO for both their big cores and their little cores.
>
> And, indeed, while Intel's original small Atom cores were in-order,
> they eventually switched over to even giving those a simple
> out-of-order capability, since transistor densities had increased,
> and the original Atom cores were percieved as having very poor
> performance.
>

In my experience with them, at similar clock speeds, the original Atom
gets beaten pretty hard by ARM32.

Even in some cases where the x86 PC should have a pretty solid advantage
(such as a 2003 era laptop vs a RasPi2), in my tests, it was still
pretty close to an even match.

From what I can tell, it appears that x86 benefits a lot more from OoO
than ARM did, and Aarch64 is still pretty solid even with in-order
implementations.

However, an OoO x86 machine does seem to be a lot more tolerant of
lackluster code generation than an in-order ARM machine (where, if the
generated code kinda sucks, its performance on an ARM machine also
sucks). The x86 machine seems to just sort of take whatever garbage one
throws at it and makes it "sorta fast-ish" (even if it is basically just
a big mess of memory loads and stores with a bunch of hidden function
calls and similar thrown in).

So, in any case, having an effective compiler does seem to be a big
factor in getting good performance from ARM32 or Aarch64.

> And yet people didn't complain about the performance of the
> 486 DX. So I would be inclined to blame software bloat.
>

While there are limits to what the hardware can do, a lot of the
"general slowness and unresponsiveness" in modern PCs seems more likely
due to bloat than anything else...

Re: Squeezing Those Bits: Concertina II

<4260718b-60a4-4e76-b228-63100a2c386en@googlegroups.com>

 by: MitchAlsup - Fri, 4 Jun 2021 23:31 UTC

On Friday, June 4, 2021 at 4:48:38 PM UTC-5, Quadibloc wrote:
> On Friday, June 4, 2021 at 11:38:25 AM UTC-6, MitchAlsup wrote:
>
> > Only compiler and CPU architects should be able to see unretired statistics.
> I wouldn't be quite as strict as that, although I agree that it should not be possible
> for malware to read this kind of information about programs to use it as a side
> channel and so on.
<
Other than "curiosity", why should an application user see unretired statistics?
Other than "curiosity", why should an application writer see unretired statistics?
<
Can you think of ANY reason a JIT should be able to see unretired statistics?
Can you think of ANY reason a JIT writer should be able to see unretired statistics?
I fear there are a myriad of side-channels in there...
>
> It's enough if you have to turn on the visibility of this stuff in the BIOS before
> booting up.
<
It is possible to configure a My 66000 system to come out of reset already
running multi-threaded, multi-tasking, Hypervisor-Supervisor, with the TLBs
turned on... all that some "special" tasks need to do is configure and
enumerate I/O devices, and initialize and clear DRAM. So who is going to
get sufficient privilege to turn this stuff off?
<
> Having to prove you're a compiler or CPU architect in order to
> buy a special edition chip at the computer store is just too complicated.
>
> I suppose the chipmaker could ship you a special cryptographic key with which
> to flash the microcode too, but I think that is also too complicated.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<0e582144-9ba1-4907-82b1-320fdd0bc11en@googlegroups.com>

 by: Quadibloc - Sat, 5 Jun 2021 03:43 UTC

On Friday, June 4, 2021 at 5:31:49 PM UTC-6, MitchAlsup wrote:

> Other than "curiosity" why should an application user see unretired statistics?
> Other than "curiosity" why should an application writer see unretired statistics ?

My comment wasn't aimed at saying they should, merely at the impracticality
of enforcing the restriction if you do want to make those statistics visible to
compiler and CPU architects.

In the case of CPU architects, I suppose it's simple enough, since the ones who
are to see those statistics for a given CPU *work for the company that made
it*. And so of course they can be shown many things about the CPU they're
working on that are hidden from mere mortals. But in the case of compiler
architects, my comment would stand.

> So who is
> going to get sufficient privilege to turn this stuff off ?

Nobody should be able to get sufficient privilege to turn that stuff
*on* except at bootup... but I don't see the problem of letting ordinary
privileged processes turn it _off_, or, indeed, even the riff-raff.

John Savard

Re: Squeezing Those Bits: Concertina II

<2021Jun5.151842@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sat, 5 Jun 2021 13:18 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> Instead, a data-flow instruction waits until its operands are
>> available (and the functional unit is available). For simple ALU
>> operations, this typically takes 1 cycle (exceptions:
>> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
>> made deep pipelines a win, until CPUs ran into power limits ~2005.
>>
>> - anton
>
>The relevance of latency comes in, I think, when one considers the effect
>of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
>injects a bubble whose size is independent of the number of stages.

A branch mispredict is (in the best case) feedback from the stage that
recognizes the misprediction to the instruction fetch stage. Here the
latency (in cycles and in ns) becomes longer with more pipeline
stages. Fortunately branch mispredictions are rare.

Caches these days seem to be clocked and pipelined, allowing a request
per cycle or so (more for L1), with the shared L3 having its own
clock. So maybe the latency also increases with the pipelining
overhead. This could explain why Apple can access 128KB in the same
~1ns that Intel needs for accessing 48KB: Apple only divides the 1ns
into 3 cycles, Intel into 5.

>It also depends on how one measures performance.
>More stages means higher frequency means higher potential issued MIPS.
>If instead we count retired MIPS, to take into account bubbles and
>any back pressure (stall) effects of D$ cache access,
>I would expect to see much less actual benefit.

Of course you measure the time to complete the program. Given the
quality of branch prediction in the early 2000s, 52 stages seemed to
be the optimal pipeline depth for the Pentium 4 [sprangle&carmean02],
and both Intel (Tejas) and AMD (Mitch Alsup's K9) were on that path,
until both canceled the projects in 2005. My guess is that they were
both betting on a cooling technology that evaporated in 2005.

Since then the sweet spot seems to have been the 14-19 stages or so
that Intel and AMD have been using (wikichip claims 19 stages for
Zen-Zen3 and 14-19 for Skylake and Ice Lake). But Apple's A14 shows
us that you can do lower-clocked (and likely shorter-pipeline) cores
that have so much more IPC that they have competitive performance.
Makes me wonder whether there is an even sweeter spot in between.
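
Just to make that tradeoff explicit, here is a toy model (mine, not the
model from [sprangle&carmean02] or [hrishikesh+02]; all the constants are
invented): the cycle time shrinks as the logic is split across more
stages but each stage pays a latch overhead, while the mispredict penalty
in cycles grows with the depth.

  /* Toy pipeline-depth model; every constant below is made up for
     illustration only. */
  #include <stdio.h>

  int main(void) {
      double total_logic = 160.0; /* logic depth per instruction, FO4 (assumed) */
      double latch       = 2.0;   /* latch overhead per stage, FO4 (assumed)    */
      double mpi         = 0.05;  /* mispredicts per instruction (assumed)      */
      double cpi_base    = 1.0;   /* CPI ignoring mispredicts (assumed)         */

      for (int n = 5; n <= 60; n += 5) {
          double cycle = total_logic / n + latch;  /* FO4 per cycle   */
          double cpi   = cpi_base + mpi * n;       /* resteer ~ depth */
          printf("%2d stages: %6.2f FO4 per instruction\n", n, cycle * cpi);
      }
      return 0;
  }

With these made-up numbers the minimum lands around 40 stages; the point
is only that there is a minimum, not where it is.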

@InProceedings{sprangle&carmean02,
author = {Eric Sprangle and Doug Carmean},
title = {Increasing Processor Performance by Implementing
Deeper Pipelines},
crossref = {isca02},
pages = {25--34},
url = {http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/deep-pipelines-isca02.pdf},
annote = {This paper starts with the Williamette (Pentium~4)
pipeline and discusses and evaluates changes to the
pipeline length. In particular, it gives numbers on
how lengthening various latencies would affect IPC;
on a per-cycle basis the ALU latency is most
important, then L1 cache, then L2 cache, then branch
misprediction; however, the total effect of
lengthening the pipeline to double the clock rate
gives the reverse order (because branch
misprediction gains more cycles than the other
latencies). The paper reports 52 pipeline stages
with 1.96 times the original clock rate as optimal
for the Pentium~4 microarchitecture, resulting in a
reduction of 1.45 of core time and an overall
speedup of about 1.29 (including waiting for
memory). Various other topics are discussed, such as
nonlinear effects when introducing bypasses, and
varying cache sizes. Recommended reading.}
}

@InProceedings{hrishikesh+02,
author = {M. S. Hrishikesh and Norman P. Jouppi and Keith
I. Farkas and Doug Burger and Stephen W. Keckler and
Premkishore Shivakumar},
title = {The Optimal Logic Depth per Pipeline Stage is 6 to 8
FO4 Inverter Delays},
crossref = {isca02},
pages = {14--24},
annote = {This paper takes a low-level simulator of the 21264,
varies the number of pipeline stages, uses this to
run a number of workloads (actually only traces from
them), and reports performance results for
them. With a latch overhead of about 2 FO4
inverters, the optimal pipeline stage length is
about 8 FO4 inverters (with work-load-dependent
variations). Discusses various issues involved in
quite some depth. In particular, this paper
discusses how to pipeline the instruction window
design (which has been identified as a bottleneck in
earlier papers).}
}

@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<2021Jun5.160330@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sat, 5 Jun 2021 14:03 UTC

BGB <cr88192@gmail.com> writes:
>In my experience with them, at similar clock speeds, the original Atom
>gets beaten pretty hard by ARM32.

What do you mean by "ARM32"? If you mean the 32-bit ARM
architecture, there are many different cores that implement this
architecture. For the LaTeX benchmark I have:

run time (s)
- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
- OMAP4 Panda board ES (1.2GHz Cortex-A9) Ubuntu 12.04 2.984
- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323

For Gforth I have, e.g.:

sieve bubble matrix fib fft
0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
0.410 0.520 0.260 0.635 0.280 Exynos 4 (Cortex A9) 1.6GHz; gcc-4.8.x
0.600 0.650 0.310 0.870 0.450 Odroid C2 Cortex A53 32b 1536MHz, gcc 5.3.1
0.390 0.490 0.270 0.520 0.260 Odroid C2 Cortex A53 64b 1536MHz, gcc 5.3.1

So, yes, OoO 32-bit ARMs like the Cortex-A9 have better
performance/clock than Bonnell, but the Cortex-A53 in 32-bit mode not so
much. This is quite surprising, because I would expect a RISC to suffer
less from an in-order implementation than a CISC. That expected
advantage is realized in the 64-bit A53 result.

> From what I can tell, it appears that x86 benefits a lot more from OoO
>than ARM did, and Aarch64 is still pretty solid even with in-order
>implementations.

OoO implementations are a lot faster on both architectures:

LaTeX:

- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
- AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
- Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
- Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224

Gforth:

sieve bubble matrix fib fft
0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0

>However, an OoO x86 machine does seem to be a lot more tolerant of
>lackluster code generation than an in-order ARM machine (where, if the
>generated code kinda sucks, its performance on an ARM machine also
>sucks).

Not sure what you mean by "lackluster code generation", but OoO of
course deals better with code that has not been scheduled for in-order
architectures. There is also the effect on OoO that instructions can
often hide in the shadow of long dependency paths. But if you make
the dependency path longer, you feel that at least as hard on an OoO
CPU as on an in-order CPU.

>The x86 machine seems to just sort of take whatever garbage one
>throws at it and makes it "sorta fast-ish" (even if it is basically just
>a big mess of memory loads and stores with a bunch of hidden function
>calls and similar thrown in).

I don't know what you mean by "hidden function calls and similar",
but the stuff about "a big mess of memory loads and stores" sounds
like the ancient (and wrong) myth that loads and stores are free on
IA-32 and AMD64. I actually have a nice benchmark for that:

The difference between gforth and gforth-fast --ss-number=0 is that
gforth has some extra loads and stores (it stores and loads the
top-of-stack all the time, and it stores the Forth instruction pointer
all the time); let's see how they perform on a Skylake (i5 6600K):

sieve bubble matrix fib fft
0.080 0.108 0.044 0.080 0.028 gforth-fast --ss-number=0
0.128 0.208 0.084 0.140 0.056 gforth

One might think that the Zen3 (Ryzen 7 5800X) with its improved
store-to-load forwarding is more tolerant of the extra loads and
stores of gforth, but there is still a lot of difference:

sieve bubble matrix fib fft
0.079 0.062 0.034 0.053 0.022 Zen3 gforth-fast --ss-number=0
0.102 0.135 0.053 0.161 0.056 Zen3 gforth
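
Not Gforth's actual code (the comparison above is between two
configurations of the same VM), but the shape of the difference is
roughly this: a top of stack kept in memory pays loads and stores on
every primitive, while one cached in a local variable can stay in a
register.

  /* Toy sketch only: memory-resident TOS vs. register-cached TOS.
     'volatile' keeps the compiler from optimizing the memory traffic away. */
  #include <stdio.h>

  static volatile long long stack[64];

  static long long sum_memory_tos(long long n) {
      volatile long long *sp = stack;
      sp[0] = 0;
      for (long long i = 1; i <= n; i++) {
          sp[1] = i;                  /* push: a store            */
          sp[0] = sp[0] + sp[1];      /* +: two loads and a store */
      }
      return sp[0];
  }

  static long long sum_register_tos(long long n) {
      long long tos = 0;              /* TOS can live in a register */
      for (long long i = 1; i <= n; i++)
          tos += i;                   /* +: register-to-register    */
      return tos;
  }

  int main(void) {
      printf("%lld %lld\n", sum_memory_tos(1000000), sum_register_tos(1000000));
      return 0;
  }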

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<2021Jun5.165558@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sat, 5 Jun 2021 14:55 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Friday, June 4, 2021 at 4:48:38 PM UTC-5, Quadibloc wrote:
>> On Friday, June 4, 2021 at 11:38:25 AM UTC-6, MitchAlsup wrote:
>>
>> > Only compiler and CPU architects should be able to see unretired statistics.
>> I wouldn't be quite as strict as that, although I agree that it should not be possible
>> for malware to read this kind of information about programs to use it as a side
>> channel and so on.
><
>Other than "curiosity" why should an application user see unretired statistics?
>Other than "curiosity" why should an application writer see unretired statistics ?

Isn't curiosity enough? Maybe we want to see how mispredictions
translate into speculated and non-retired instructions.

>I fear there are a myriad of side-channels in there........

Possibly, but if I can read the data directly, there is no need to
worry about side channels.

>So who is
>going to get sufficient privilege to turn this stuff off ?

The default in Linux is that users cannot use performance counters,
neither those for retired stuff nor those for unretired stuff. root
can do it and can enable it for all users on machines where all users
are trusted. AFAIK there is a capability that allows more
fine-grained control, but I have not looked into that.
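
For the concrete mechanism on Linux: the retired-instruction counter is
available to user space through perf_event_open(), and whether an
unprivileged process may open it is governed by the
kernel.perf_event_paranoid sysctl (the finer-grained capability is
presumably CAP_PERFMON on newer kernels). A minimal sketch:

  /* Count retired instructions for a bit of work in the current process
     (Linux-specific).  Whether the open succeeds for an unprivileged user
     depends on kernel.perf_event_paranoid. */
  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = PERF_TYPE_HARDWARE;
      attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
      attr.disabled = 1;
      attr.exclude_kernel = 1;
      attr.exclude_hv = 1;

      /* no glibc wrapper for perf_event_open(), so go through syscall() */
      int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
      if (fd < 0) { perror("perf_event_open"); return 1; }

      ioctl(fd, PERF_EVENT_IOC_RESET, 0);
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

      volatile long x = 0;
      for (long i = 0; i < 1000000; i++) x += i;  /* the work being counted */

      ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
      uint64_t count = 0;
      if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
          printf("retired instructions: %llu\n", (unsigned long long)count);
      close(fd);
      return 0;
  }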

Why should one have to enable this in the BIOS when root can already
access all of memory?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<94e94636-a2eb-4e0e-b1ec-3b704cc4a9c3n@googlegroups.com>

 by: MitchAlsup - Sat, 5 Jun 2021 15:16 UTC

On Saturday, June 5, 2021 at 10:04:58 AM UTC-5, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, June 4, 2021 at 4:48:38 PM UTC-5, Quadibloc wrote:
> >> On Friday, June 4, 2021 at 11:38:25 AM UTC-6, MitchAlsup wrote:
> >>
> >> > Only compiler and CPU architects should be able to see unretired statistics.
> >> I wouldn't be quite as strict as that, although I agree that it should not be possible
> >> for malware to read this kind of information about programs to use it as a side
> >> channel and so on.
> ><
> >Other than "curiosity" why should an application user see unretired statistics?
> >Other than "curiosity" why should an application writer see unretired statistics ?
<
> Isn't curiosity enough? Maybe we want to see how mispredictions
> translate into speculated and non-retired instructions.
<
And maybe you want to see if your Spectre attack hit on anything!?
<
> >I fear there are a myriad of side-channels in there........
<
> Possibly, but if I can read the data directly, there is no need to
> worry about side channels.
<
The Branch predictor carries too much weight not to be shared.
<
> >So who is
> >going to get sufficient privilege to turn this stuff off ?
> The default in Linux is that users cannot use performance counters,
> neither those for retired stuff nor those for unretired stuff. root
> can do it and can enable it for all users on machines where all users
> are trusted. AFAIK there is a capability that allows more
> fine-grained control, but I have not looked into that.
>
> Why should one have to enable this in the BIOS when root can already
> access all of memory?
<
My 66000 comes out of reset with the TLB turned on, so even root
has limitations.
<
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
