devel / comp.arch / Re: Squeezing Those Bits: Concertina II

Re: Squeezing Those Bits: Concertina II

From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 10:30:55 -0700
Message-ID: <s9b3kg$nvf$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=17354&group=comp.arch#17354

On 6/2/2021 9:36 AM, Anton Ertl wrote:
> Marcus <m.delete@this.bitsnbites.eu> writes:
>> A different axis is the dynamic instruction count (i.e. instruction
>> execution frequencies) which I think is more relevant for
>> performance analysis/tuning of an ISA.
>
> One would think so, but the problem is that in a corpus of large
> programs the dynamically executed instructions are mostly confined to
> a few hot spots, and the rest of the large programs hardly plays
> any role. And as a consequence, the most frequent dynamically
> executed instruction sequences tend to be not very representative of
> what happens in other programs.

While I am sure that happens, I don't see why it is a problem, assuming
your corpus of programs is representative of the workload you expect.
While each program's hot spots may be different, the "average" across
all of them should be what you are optimizing for.
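The averaging Fuld describes can be sketched directly: normalize each program's dynamic instruction mix, then average across the corpus. The programs and mixes below are invented for illustration:

```python
# Illustration of averaging dynamic instruction mixes across a corpus.
# Each program is dominated by its own hot spot, but the corpus-wide
# average is what you would optimize the ISA for.
from collections import Counter

def corpus_average(mixes, weights=None):
    """Weighted average of per-program dynamic instruction mixes."""
    weights = weights or [1.0] * len(mixes)
    total = Counter()
    for mix, w in zip(mixes, weights):
        n = sum(mix.values())
        for op, count in mix.items():
            total[op] += w * count / n  # normalize before weighting
    s = sum(total.values())
    return {op: v / s for op, v in total.items()}

# Two invented programs with different hot spots:
prog_a = {"load": 70, "add": 20, "branch": 10}   # memory-bound loop
prog_b = {"fmul": 60, "load": 25, "branch": 15}  # FP kernel
avg = corpus_average([prog_a, prog_b])
# loads dominate the average even though prog_b's hot spot is FP
```

The weights let you bias the average toward whichever programs best represent the expected workload.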

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Squeezing Those Bits: Concertina II

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Thu, 03 Jun 2021 13:50:02 -0400
Message-ID: <ID8uI.26574$431.1627@fx39.iad>
https://www.novabbs.com/devel/article-flat.php?id=17355&group=comp.arch#17355

MitchAlsup wrote:
> On Thursday, June 3, 2021 at 2:08:55 AM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> In 30 total gates, and in 4-gates of delay, one can decode My 66000
>>> instructions to determine their length, and the offsets in the instruction
>>> stream of any constants.
>> How many gate delays does a modern architecture usually have per cycle?
>> (I know different gates have different delays, but a ballpark figure
>> would be very interesting).
> <
> Really fast machines 12-gates
> more typical machines 16-gates
> <
>>> IBM 360 does this in 2-gates of delay. Both
>>> share the advantage that the bits decoded are all fixed over all instruction
>>> formats. So, My 66000 is only 2-gates behind at this point.
>>> <
>>> From here we setup a tree of find next instruction logic (2-gates) and we
>>> parse (determine starting word and all constant words) 2 then 4 then 8
>>> then 16 instructions at 2-gates of delay each level. So parsing 16 instructions
>>> per cycle is a 12-gate delay problem. Easy Peasy in a 16-gate cycle. The
>>> original IBM 360 would be 2-gates shorter, the current monstrosity would
>>> be longer.
> <
>> Is 16 gates per cycle particularly short?
> <
> It is enough to perform an add, drive the result bus and make setup timing
> as a forwarded result->operand. 64-bit add = 11-gates. Drive result bus 3-
> gates, consume as operand 2-gates.
> <
> It used to be enough to perform a direct mapped cache hit and load-align
> in 2 cycles, it is no longer, the minimum number is 3 cycles. Many of the
> really fast machines have gone to 4 cycles to use set associative caches.
> <
> I prefer a 20-gate per cycle design point. This is slow enough to reuse
> the bit lines in the register file reads on 1/2 cycle, writes on the other
> 1/2 cycle, and slow enough to "hit" cache SRAMs twice per cycle
> (solving many porting problems.)
> <
> With a 20 gate per cycle design point, one can build a 6-wide reservation
> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
> pipeline.
> <
> At 16 cycles this necessarily becomes 9-10 stages.
> <
> At 12 gates this necessarily becomes 12-15 stages.

It should also be pointed out that the stage flip-flops remain
as a constant 5 gate delay overhead for each stage.
Also, a larger stage does not necessarily split conveniently
into evenly sized smaller stages.

This means that 2x the stages at 1/2 the size does not run at 2x frequency,
and the total latency from pipeline input to output grows.
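EricP's point can be checked with the numbers quoted here: with a constant 5-gate flip-flop overhead at each stage boundary, halving the logic per stage does not halve the cycle time, and total latency grows. A small sketch (gate counts taken from the posts above):

```python
# Effect of the constant ~5-gate flip-flop overhead per pipeline stage.
FF = 5  # gate delays consumed by the stage boundary flip-flops

def cycle_gates(logic_gates):
    """Total gate delays per cycle: stage logic plus flip-flop overhead."""
    return logic_gates + FF

def speedup_from_halving(logic_gates):
    """Clock speedup from splitting one stage into two half-size stages."""
    return cycle_gates(logic_gates) / cycle_gates(logic_gates // 2)

# Splitting a 20-gate stage in two: the cycle shrinks from 25 to 15
# gate delays (about 1.67x, not 2x), while the total latency for the
# pair of stages grows from 25 to 30 gate delays.
```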

Re: Squeezing Those Bits: Concertina II

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: Thu, 03 Jun 2021 13:53:42 -0400
Message-ID: <jwvfsxy6dna.fsf-monnier+comp.arch@gnu.org>
https://www.novabbs.com/devel/article-flat.php?id=17356&group=comp.arch#17356

MitchAlsup [2021-06-03 10:02:35] wrote:
> Really fast machines 12-gates
> more typical machines 16-gates

By "fast" I think you mean "high clock rate" rather than talking about
the actual real-life performance, right?

Also, here and in earlier posts you've mentioned such numbers a few
times and they've always been multiples of 4. Any particular reason why
the target cycle length (counted in gates) should be a multiple of 4?

Stefan

Re: Squeezing Those Bits: Concertina II

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: Thu, 03 Jun 2021 14:02:19 -0400
Message-ID: <jwv5yyu6db6.fsf-monnier+comp.arch@gnu.org>
https://www.novabbs.com/devel/article-flat.php?id=17357&group=comp.arch#17357

> With a 20 gate per cycle design point, one can build a 6-wide reservation
> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
> pipeline.

If we count 5-gates of delay for the clock-boundary's flip-flop, that
means:

(20+5)gates * 6-7 stages = 150-175 gates of total pipeline length

> At 16 cycles this necessarily becomes 9-10 stages.
> At 12 gates this necessarily becomes 12-15 stages.

And that gives:

(16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
(12+5)gates * 12-15 stages = 204-255 gates of total pipeline length

So at least in terms of the latency of a single instruction going
through the whole pipeline, the gain of targeting a lower-clocked
design seems clear ;-)
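The arithmetic is easy to reproduce, using the gate and stage counts quoted from the earlier post:

```python
# Total pipeline length in gate delays:
# (logic gates per cycle + 5 flip-flop gates) * number of stages.
FF = 5

def total_gates(logic, stages):
    return (logic + FF) * stages

configs = {20: (6, 7), 16: (9, 10), 12: (12, 15)}  # gates -> stage range
totals = {g: (total_gates(g, lo), total_gates(g, hi))
          for g, (lo, hi) in configs.items()}
# totals == {20: (150, 175), 16: (189, 210), 12: (204, 255)}
```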

Stefan

Re: Squeezing Those Bits: Concertina II

From: jsav...@ecn.ab.ca (Quadibloc)
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 11:02:56 -0700 (PDT)
Message-ID: <1911b2be-aa70-46af-b76f-99e4889fa133n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=17358&group=comp.arch#17358

On Thursday, June 3, 2021 at 11:53:45 AM UTC-6, Stefan Monnier wrote:
> MitchAlsup [2021-06-03 10:02:35] wrote:

> > Really fast machines 12-gates
> > more typical machines 16-gates

> By "fast" I think you mean "high clock rate" rather than talking about
> the actual real-life performance, right?

When the Pentium 4 first came out, I was confused about this issue. However, since
then, I've learned what is going on here.

You're quite correct that the number of gate delays required to get the result of
any given instruction, like for example a floating-point multiply, isn't going to be
changed by how many cycles that one chops it into.

Therefore, using fewer gate delays per cycle for a higher clock rate is just a marketing
ploy, right? *Wrong.*

Because there's latency, and there's _throughput_.

Just as you can have four programs running at the same time if you have four cores,
even though none of those programs runs any faster than if you had just one core,
when a computer is pipelined, all those in-between clock cycles where the current
instruction from one program is continuing to execute can be used to start instructions
from other programs.

But there's an _additional_ advantage gained from splitting up the pipeline this way
that you don't get from more cores. With out-of-order execution, those extra instructions
you execute on the in-between cycles can be from the _same_ program.

So, yes, splitting the pipeline up into more stages with fewer gate delays can be useful.

However, this can be taken too far, which is why both the Pentium 4 and Bulldozer
were failures instead of successes.

John Savard

Re: Squeezing Those Bits: Concertina II

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 14:47:27 -0500
Message-ID: <s9bbkq$kh4$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=17359&group=comp.arch#17359

On 6/3/2021 4:56 AM, Marcus wrote:
> On 2021-06-02 Anton Ertl wrote:
>> Marcus <m.delete@this.bitsnbites.eu> writes:
>>> A different axis is the dynamic instruction count (i.e. instruction
>>> execution frequencies) which I think is more relevant for
>>> performance analysis/tuning of an ISA.
>>
>> One would think so, but the problem is that in a corpus of large
>> programs the dynamically executed instructions are mostly confined to
>> a few hot spots, and the rest of the large programs hardly plays
>> any role.  And as a consequence, the most frequent dynamically
>> executed instruction sequences tend to be not very representative of
>> what happens in other programs.
>
> Interesting observation. I suspected that something similar would be the
> case, since looping/hot code must give a very strong bias.
>
> I have planned to add instruction frequency profiling to my simulator (I
> already have symbol-based function profiling which has proven very
> useful), but I'll keep in mind the difficulties that you pointed out.
>
> I wonder if you could improve the situation if you used something like
> different scales (logarithmic?) and some sort of filtering or
> thresholding (e.g. count each memory location at most N times) to
> reduce the bias from loops? Or possibly categorize counts into different
> bins (e.g. "cold", "medium", "hot").
>

I have instruction-use stats in my emulator, but they tend to look
something like:
Program hammers it with various load/store instructions;
Slightly down the list, one encounters some branch ops;
Then ADD, TEST, SHLD / SHAD, ...;
Then CMPxx, and then the other ALU ops;

Pretty much everything else tends to be sort of a soup of fractional
percentages at the bottom (except under fairly specific workloads).
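Stats like these fall out of counting opcodes in an execution trace. A minimal sketch (the trace and mnemonic mix below are fabricated, loosely echoing the ranking described):

```python
# Minimal opcode-frequency profile over an execution trace.
from collections import Counter

def instruction_mix(trace):
    """Rank executed opcodes by dynamic frequency, in percent."""
    counts = Counter(trace)
    total = len(trace)
    return [(op, 100.0 * n / total) for op, n in counts.most_common()]

# Fabricated 100-op trace, load/store-heavy like the ranking above:
trace = (["MOV.L"] * 40 + ["BT"] * 15 + ["ADD"] * 12 + ["TEST"] * 10
         + ["SHLD"] * 8 + ["CMPEQ"] * 6 + ["XOR"] * 5 + ["FMUL"] * 4)
mix = instruction_mix(trace)
# mix[0] == ("MOV.L", 40.0); the tail is a soup of small percentages
```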

The patterns do tend to imply that some higher-priority features are:
Load/Store Ops:
Load/Store with constant displacement;
Load/Store with register index scaled by element size;
Load/Store in both signed and unsigned variants.
Followed by Branch ops in (PC, Disp) form;
Followed by ADD/SUB and TEST/CMP in Reg/Immed forms;
Followed by shift ops in "Reg, Imm, Reg" form;
Followed by 3R ALU ops (Both "Reg, Reg, Reg" and "Reg, Imm, Reg");
...

Overall instruction counts are also useful to consider for both size and
performance optimization.

For general-purpose code, there is seemingly no real way to eliminate
the high frequency of load/store operations. In theory, it can be helped
by "strict aliasing" semantics in C, but this only goes so far and is
mostly N/A for code which does these optimizations itself.

Being able to keep things in registers is good, but there is still a
limiting factor when most of one's data is passed around in memory via
pointers (as is typical in most C code).

Loops tend to be fairly common and particularly biased towards using
lots of load/store ops, since frequently they are working on array data,
and arrays are almost invariably kept in memory.

For loads/stores:
The vast majority of constant displacements tend to be positive (unless
one designs the ABI to use negative-offsets relative to a frame-pointer
or similar);
The vast majority of displacements also tend to be scaled by the element
size, so this makes sense as a default.

Negative and misaligned displacements are still common enough, though,
that one may need some way to encode them without an excessive
penalty. In my ISA, these cases were mostly relegated to loading the
displacement into R0 and using a special (Rb, R0) encoding with an
unscaled index.

IME, one also needs both sign and zero extending loads, since then
whatever is the non-default case ends up hammering the sign or zero
extension instructions.

This is also a partial reason why "ADDS.L" and "ADDU.L" and similar
exist in my ISA, since typically:
Code makes heavy use of 32-bit integers;
For better or for worse, wrap-on-overflow is the commonly expected
behavior, and some amount of code will fail otherwise;
These eliminate the vast majority of cases where one might otherwise
need to sign or zero extend a value for a memory access (if the index
register is greater than 32 bits).
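The wrap-on-overflow semantics described here can be modeled with masking and sign extension; this is a sketch of the behavior only, not a definition of the actual ISA's mnemonics:

```python
# Modeling 32-bit wrap-on-overflow adds held in 64-bit registers.
MASK32 = 0xFFFFFFFF

def adds_l(a, b):
    """32-bit wrapping add, result sign-extended (ADDS.L-like)."""
    r = (a + b) & MASK32
    return r - (1 << 32) if r & 0x80000000 else r

def addu_l(a, b):
    """32-bit wrapping add, result zero-extended (ADDU.L-like)."""
    return (a + b) & MASK32

# INT32_MAX + 1 wraps: adds_l gives -2**31, addu_l gives 2**31
```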

Note that AND/OR/XOR don't need 32-bit variants, since their behavior
with sign- or zero-extended values tends to be identical to what a
matching 32-bit operation would have produced.

In my compiler output, there also tends to be a lot of "MOV Reg,Reg"
instructions, but most of these are more due to my compiler being
stupid. If the generated code were closer to optimal, there would be a
lot fewer of these.

At the moment, there doesn't seem to be much obvious "this needs to be
better" stuff left in my core ISA, at least within the confines of the
general ISA design (eg, "Load/Store", ...).

There are a few semi-common patterns which fall outside the load/store
model, most commonly using a memory operand as a source to an ALU
instruction (typically an integer add/subtract).

But, I don't really want to go there, as doing stuff like this would
likely require using a longer pipeline.

However, if one were willing to spend the cost of having such a longer
pipeline, they could potentially gain a few benefits:
Single cycle ALU ops against values loaded from memory;
Potentially being able to have fully pipelined FPU instructions;
...

With the downsides:
The resource cost of register forwarding and interlock handling goes up
rapidly with pipeline length;
Unpredictable branches are not friendly to longer pipelines;
....

There is also the potential for added complexity for instruction
decoding if one wants to go an x86-like route.

Though, it looks like the majority of these cases could be addressed via
a "simple" encoding, eg:
ADDS.L (Rb, Disp), Rn
SUBU.L (Rb, Ri), Rn
...
Basically, encoded like a memory load.

Though, the utility is diminished slightly in that these would (in many
cases) require an additional register MOV:
MOV Rs, Rn
ADDS.L (Rb, Disp), Rn

Whereas:
ADDS.L Rs, (Rb, Disp), Rn
SUBU.L Rs, (Rb, Ri), Rn
Would avoid an additional MOV, but would require being able to encode an
instruction with 4 register IDs.
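The encoding cost of that fourth register ID is easy to quantify. Assuming (hypothetically) 64 architectural registers and 32-bit instruction words, a quick check:

```python
# Bits left for opcode/mode fields after N register IDs in one word.
def opcode_bits(word_bits, reg_fields, regs):
    field = (regs - 1).bit_length()  # bits per register ID
    return word_bits - reg_fields * field

# Four 6-bit register IDs in a 32-bit word leave only 8 opcode bits,
# versus 14 bits with three IDs.
```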

Another option, eg:
ADDS.L Rt, (Rb), Rn
Would be possible, but would be hindered in that it would (nearly
always) require a LEA (thus making it "kinda useless").

....

Either way, it doesn't seem likely that the added cost/complexity of a
longer pipeline would pay off. My guess is that one would end up
paying a lot more for the branch misses than one would gain by saving
a few cycles here and there on ALU ops.

It could also be possible to shove this in as a special case
without making the pipeline longer, but (to have any hope of passing
timing) this would require restricting the range of operand sizes (small
integer types only, with a sign/zero extended result).
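As a quick sanity model of what such a size-restricted op computes (my own sketch; Python ints stand in for 64-bit registers), here is a 32-bit wrap-on-overflow add whose result is then sign- or zero-extended, in the spirit of ADDS.L/ADDU.L:

```python
def sext(value, bits):
    """Sign-extend the low `bits` of `value`."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value & mask) ^ sign) - sign

def zext(value, bits):
    """Zero-extend: keep only the low `bits`."""
    return value & ((1 << bits) - 1)

def adds_l(a, b):  # 32-bit wrapping add, sign-extended result
    return sext(a + b, 32)

def addu_l(a, b):  # same add, zero-extended result
    return zext(a + b, 32)

print(hex(addu_l(0x7FFFFFFF, 1)))  # 0x80000000
print(adds_l(0x7FFFFFFF, 1))       # -2147483648: wrapped, then sign-extended
```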

....

Re: Squeezing Those Bits: Concertina II

<s9bga3$ljs$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17360&group=comp.arch#17360

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Thu, 3 Jun 2021 16:07:04 -0500
Organization: A noiseless patient Spider
Lines: 82
Message-ID: <s9bga3$ljs$1@dont-email.me>
In-Reply-To: <7d9b1862-5d8d-4b07-8c13-9f1caef37cden@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 3 Jun 2021 21:07 UTC

On 6/2/2021 6:33 PM, Quadibloc wrote:
> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>> On Tuesday, June 1, 2021 at 10:33:28 PM UTC-5, Quadibloc wrote:
>
>>> Suppose there's a dependency. If the machine has to take a cycle between starting
>>> each instruction (one-wide decode unit) and an instruction takes X cycles to execute, then
>>> I can deal with dependencies by coding X-1 instructions between the instructions involved in
>>> the dependency.
>
>> And this is where VLIW breaks down. What if X is variable or X changes between implementations?
>
> Ah, but I guess I didn't make myself clear here.
>
> That's how I could deal with dependencies on a RISC CPU, which doesn't have the extra
> features that VLIW offers.
>
> If I have OoO, obviously, if X changes between implementations, it's not an issue.
>
> If I have VLIW, I can now explicitly indicate 'this instruction depends on that previous
> instruction', so the processor knows to wait only for the minimum time necessary (or, at
> worst, the worst-case time required by the previous instruction).
>

Admittedly, I still have slight skepticism of "OoO everywhere" at this
point...

Eg, recently, my old phone had a problem and needed to be replaced. The
battery basically puffed up like a balloon and more or less forcibly
disassembled the phone. Did get a replacement battery though (after
ordering the new phone), so technically have 2 phones now.

So, I bought a new phone as a replacement, and what kind of fancy new
CPU is it running?... Cortex-A53...

So, given that apparently phones are still happy enough mostly running
on 2-wide in-order superscalar cores, the incentive to go to bigger OoO
cores seems mostly limited to "higher end" devices.

And, it appears this is not entirely recent: these sorts of 2-wide
superscalar cores seem to have been dominant in phones and consumer
electronics for roughly the past 15-20 years or so.

I suspect that the limited case of a 3-wide VLIW can potentially do
slightly better at the 2-wide superscalar game than an actual 2-wide
in-order superscalar, provided it can be kept cost-competitive in other
areas.

Then one can mostly try to ride along in a similar "sweet spot" to the
one that seems to favor 2-wide superscalar cores.

In my own design effort, 3-wide seemed to be the local optimum for the
CPU core design (it is a bit more capable at 2-wide code than the 2-wide
design would have been, even if cases where 3 instructions can be run in
parallel tend to be infrequent).

Meanwhile, going much wider than this introduces a bunch of "difficult"
problems, which would require a lot of added heavy lifting to justify
(eg: a 5 or 6 wide core requires "some way to make modulo loop
scheduling effective and do aggressive inlining", otherwise it seems
kinda pointless).

Well, and also, going wider is likely to need 64 GPRs and multiple
predicate registers, otherwise these seem to be a bottleneck when trying
to do modulo loops and inlining (*), ...

But, in normal linear code, going beyond 32 GPRs and a single predicate
offers little advantage (the multiple predicates would make more sense
for scheduling multiple "if()" blocks on top of each other, or
potentially several instances of the same "if()" branch overlapping itself).

*: Mostly noted in my case when writing ASM for some rasterizer loops
and similar. What would otherwise be a single variable needs multiple
registers due to duplication and similar, which can add a significant
amount of register pressure.
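The duplication in (*) shows up even in a toy two-stage software pipeline (a Python sketch, not the actual rasterizer code): overlapping the next iteration's load with the current iteration's use means one source-level variable needs two live registers at once:

```python
def pipelined_sum(arr):
    """Sum a list with the next iteration's load overlapped
    against the current iteration's use."""
    if not arr:
        return 0
    total = 0
    cur = arr[0]          # "register" A: this iteration's value (prologue)
    for i in range(1, len(arr)):
        nxt = arr[i]      # "register" B: next iteration's value, loaded early
        total += cur      # consume this iteration's value
        cur = nxt         # rotate B into A
    return total + cur    # epilogue drains the pipeline
```

Unrolling or modulo-scheduling deeper than two stages multiplies the number of live copies, which is where the pressure toward 64 GPRs comes from.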

....

Re: Squeezing Those Bits: Concertina II

<29a1831c-ead7-4041-b018-778a04f025c5n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17361&group=comp.arch#17361

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 16:00:12 -0700 (PDT)
In-Reply-To: <ID8uI.26574$431.1627@fx39.iad>
Message-ID: <29a1831c-ead7-4041-b018-778a04f025c5n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 3 Jun 2021 23:00 UTC

On Thursday, June 3, 2021 at 12:50:35 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Thursday, June 3, 2021 at 2:08:55 AM UTC-5, Thomas Koenig wrote:
> >> MitchAlsup <Mitch...@aol.com> schrieb:
> >>> In 30 total gates, and in 4-gates of delay, one can decode My 66000
> >>> instructions to determine their length, and the offsets in the instruction
> >>> stream of any constants.
> >> How many gate delays does a modern architecture usually have per cycle?
> >> (I know different gates have different delays, but a ballpark figure
> >> would be very interesting).
> > <
> > Really fast machines 12-gates
> > more typical machines 16-gates
> > <
> >>> IBM 360 does this in 2-gates of delay. Both
> >>> share the advantage that the bits decoded are all fixed over all instruction
> >>> formats. So, My 66000 is only 2-gates behind at this point.
> >>> <
> >>> From here we setup a tree of find next instruction logic (2-gates) and we
> >>> parse (determine starting word and all constant words) 2 then 4 then 8
> >>> then 16 instructions at 2-gates of delay each level. So parsing 16 instructions
> >>> per cycle is a 12-gate delay problem. Easy Peasy in a 16-gate cycle. The
> >>> original IBM 360 would be 2-gates shorter, the current monstrosity would
> >>> be longer.
> > <
> >> Is 16 gates per cycle particularly short?
> > <
> > It is enough to perform an add, drive the result bus and make setup timing
> > as a forwarded result->operand. 64-bit add = 11-gates. Drive result bus 3-
> > gates, consume as operand 2-gates.
> > <
> > It used to be enough to perform a direct mapped cache hit and load-align
> > in 2 cycles, it is no longer, the minimum number is 3 cycles. Many of the
> > really fast machines have gone to 4 cycles to use set associative caches.
> > <
> > I prefer a 20-gate per cycle design point. This is slow enough to reuse
> > the bit lines in the register file reads on 1/2 cycle, writes on the other
> > 1/2 cycle, and slow enough to "hit" cache SRAMs twice per cycle
> > (solving many porting problems.)
> > <
> > With a 20 gate per cycle design point, one can build a 6-wide reservation
> > station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
> > 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
> > pipeline.
> > <
> > At 16 cycles this necessarily becomes 9-10 stages.
> > <
> > At 12 gates this necessarily becomes 12-15 stages.
<
> It should also be pointed out that the stage flip-flops remain
> as a constant 5 gate delay overhead for each stage.
<
Indeed !! 2 actual gates of delay, and 1.5 gates of skew and 1.5 gates of jitter.
<
> Also a larger stage does not necessarilly split conveniently
> into evenly sized smaller stages.
>
> This means that 2x the stages at 1/2 the size does not run at 2x frequency,
> and the total latency from pipeline input to output grows.
<
Mitch's second law:: If you have a pipeline of depth k and you slice
each stage in 1/2, you end up with a pipeline that has 2.5× as many
stages.
<
But EricP is also correct.

If you slice a 20 gate machine into 10 gate stages, you go from a 25
gate clock rate to a 15 gate clock rate, 66% faster rather than 100%
faster.
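That arithmetic can be checked directly; with the 5 gates of flip-flop overhead per stage mentioned above, halving the stage depth does not halve the period:

```python
FF_OVERHEAD = 5  # gates per stage boundary: 2 of flop delay, plus skew and jitter

def period(stage_gates):
    """Effective clock period, in gate delays, for a given stage depth."""
    return stage_gates + FF_OVERHEAD

speedup = period(20) / period(10)   # 25 / 15
print(f"{speedup - 1:.1%} faster")  # about 66.7%, not 100%
```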

Re: Squeezing Those Bits: Concertina II

<3a92ab08-46c3-41a0-a74b-1fced6dacacan@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17362&group=comp.arch#17362

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 16:01:59 -0700 (PDT)
In-Reply-To: <jwv5yyu6db6.fsf-monnier+comp.arch@gnu.org>
Message-ID: <3a92ab08-46c3-41a0-a74b-1fced6dacacan@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 3 Jun 2021 23:01 UTC

On Thursday, June 3, 2021 at 1:02:22 PM UTC-5, Stefan Monnier wrote:
> > With a 20 gate per cycle design point, one can build a 6-wide reservation
> > station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
> > 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
> > pipeline.
> If we count 5-gates of delay for the clock-boundary's flip-flop, that
> means:
>
> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
> > At 16 cycles this necessarily becomes 9-10 stages.
> > At 12 gates this necessarily becomes 12-15 stages.
> And that gives:
>
> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>
> So at least in terms of the latency of a single instruction going
> through the whole pipeline, the gain of targetting a lower-clocked
> design seems clear ;-)
<
Clear, even before one takes a look at the power dissipation !
>
>
> Stefan
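The quoted totals are straightforward to reproduce (same 5-gate flip-flop overhead per stage):

```python
FF = 5  # flip-flop overhead per stage, in gates
for gates, (lo, hi) in [(20, (6, 7)), (16, (9, 10)), (12, (12, 15))]:
    # total pipeline length = (stage gates + flop overhead) * stage count
    print(f"{gates}-gate cycle: {(gates + FF) * lo}-{(gates + FF) * hi} "
          "gates of total pipeline length")
```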

Re: Squeezing Those Bits: Concertina II

<b88b70b5-e4ad-4bcc-85c5-bd42bfbbdc8en@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17363&group=comp.arch#17363

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 16:11:55 -0700 (PDT)
In-Reply-To: <1911b2be-aa70-46af-b76f-99e4889fa133n@googlegroups.com>
Message-ID: <b88b70b5-e4ad-4bcc-85c5-bd42bfbbdc8en@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 3 Jun 2021 23:11 UTC

On Thursday, June 3, 2021 at 1:02:58 PM UTC-5, Quadibloc wrote:
> On Thursday, June 3, 2021 at 11:53:45 AM UTC-6, Stefan Monnier wrote:
> > MitchAlsup [2021-06-03 10:02:35] wrote:
>
> > > Really fast machines 12-gates
> > > more typical machines 16-gates
>
> > By "fast" I think you mean "high clock rate" rather than talking about
> > the actual real-life performance, right?
> When the Pentium 4 first came out, I was confused about this issue. However, since
> then, I've learned what is going on here.
>
> You're quite correct that the number of gate delays required to get the result of
> any given instruction, like for example a floating-point multiply, isn't going to be
> changed by how many cycles that one chops it into.
<
Often in the early stages of design, one will construct the logic diagram with no
flip-flops in the design at all. Then with this logic in hand, one examines where the
stage boundaries can be positioned so that all stages are fairly well balanced, and that certain
clock-dependent logic (like clocked SRAMs) is positioned at an appropriate clock
edge.
<
By and large, given the fall-through logic, one can design multiple different pipelines
each of which is suited for a different position in the performance and power category.
>
> Therefore, using fewer gate delays per cycle for a higher clock rate is just a marketing
> ploy, right? *Wrong.*
>
> Because there's latency, and there's _throughput_.
>
> Just as you can have four programs running at the same time if you have four cores,
> even though none of those programs runs any faster than if you had just one core,
> when a computer is pipelined, all those in-between clock cycles where the current
> instruction from one program is continuing to execute can be used to start instructions
> from other programs.
<
With the branch predictors one had in 1990 one could not afford the depth of execution
windows we see today--branch misprediction repair would waste too many cycles and too
much power.
>
> But there's an _additional_ advantage gained from splitting up the pipeline this way
> that you don't get from more cores. With out-of-order execution, those extra instructions
> you execute on the in-between cycles can be from the _same_ program.
<
This is simply an argument that since the FUs are present and idle, someone else can
use them. GPUs, on the other hand, bring the cores to the data rather than the other
way around !?!
>
> So, yes, splitting the pipeline up into more stages with fewer gate delays can be useful.
<
It is all a balancing point, and the CPU calculation pipeline(s) actually have little to
do with the usable length--branch prediction accuracy, and memory hierarchy latency
determine where the line should be drawn.
>
> However, this can be taken too far, which is why both the Pentium 4 and Bulldozer
> were failures instead of successes.
<
Yep.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<29140af3-45cc-4aca-9a90-52cfaab6388bn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17364&group=comp.arch#17364

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 16:26:41 -0700 (PDT)
In-Reply-To: <s9bbkq$kh4$1@dont-email.me>
Message-ID: <29140af3-45cc-4aca-9a90-52cfaab6388bn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 3 Jun 2021 23:26 UTC

On Thursday, June 3, 2021 at 2:47:40 PM UTC-5, BGB wrote:
> On 6/3/2021 4:56 AM, Marcus wrote:
> > On 2021-06-02 Anton Ertl wrote:
> >> Marcus <m.de...@this.bitsnbites.eu> writes:
> >>> A different axis is the dynamic instruction count (i.e. instruction
> >>> execution frequencies) which I think is more relevant for
> >>> performance analysis/tuning of an ISA.
> >>
> >> One would think so, but the problem is that in a corpus of large
> >> programs the dynamically executed instructions are mostly confined to
> >> a few hot spots, and the rest of the large programs plays hardly
> >> any role. And as a consequence, the most frequent dynamically
> >> executed instruction sequences tend to be not very representative of
> >> what happens in other programs.
> >
> > Interesting observation. I suspected that something similar would be the
> > case, since looping/hot code must give a very strong bias.
> >
> > I have planned to add instruction frequency profiling to my simulator (I
> > already have symbol-based function profiling which has proven very
> > useful), but I'll keep in mind the difficulties that you pointed out.
> >
> > I wonder if you could improve the situation if you used something like
> > different scales (logarithmic?) and some sort of filtering or
> > thresholding (e.g. count each memory location at most N times) to
> > reduce the bias from loops? Or possibly categorize counts into different
> > bins (e.g. "cold", "medium", "hot").
> >
<
> I have instruction-use stats in my emulator, but they tend to look
> something like:
> Program hammers it with various load/store instructions;
> Slightly down the list, one encounters some branch ops;
> Then ADD, TEST, SHLD / SHAD, ...;
> Then CMPxx, and then the other ALU ops;
>
> Pretty much everything else tends to be sort of a soup of fractional
> percentages at the bottom (except under fairly specific workloads).
>
>
> The patterns do tend to imply that some higher-priority features are:
> Load/Store Ops:
> Load/Store with constant displacement;
> Load/Store with register index scaled by element size;
> Load/Store in both signed and unsigned variants.
> Followed by Branch ops in (PC, Disp) form;
> Followed by ADD/SUB and TEST/CMP in Reg/Immed forms;
> Followed by shift ops in "Reg, Imm, Reg" form;
> Followed by 3R ALU ops (Both "Reg, Reg, Reg" and "Reg, Imm, Reg");
> ...
>
> Overall instruction counts are also useful to consider for both size and
> performance optimization.
>
>
> For general-purpose code, there is seemingly no real way to eliminate
> the high frequency of load/store operations. In theory, it can be helped
> by "strict aliasing" semantics in C, but this only goes so far and is
> mostly N/A for code which does these optimizations itself.
>
> Being able to keep things in registers is good, but there is still a
> limiting factor when most of ones' data is passed around in memory via
> pointers (as is typical in most C code).
>
>
> Loops tend to be fairly common and particularly biased towards using
> lots of load/store ops, since frequently they are working on array data,
> and arrays are almost invariably kept in memory.
<
It is moderately difficult to make a loop that is both useful and does not need loads
or stores! But I digress.........
>
>
> For loads/stores:
> The vast majority of constant displacements tend to be positive (unless
> one designs the ABI to use negative-offsets relative to a frame-pointer
> or similar);
<
And of these, with 16-bit displacements, one would actually want about 7/8ths
of the displacements to be positive rather than 1/2 positive and 1/2 negative.
<
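One illustrative way to get that roughly 7/8-positive split (my own sketch, not something from the post) is to bias the 16-bit displacement field instead of sign-extending it:

```python
BITS = 16
BIAS = 1 << (BITS - 3)  # 8192: give 1/8 of the code space to negative offsets

def encode(disp):
    """Map a displacement onto the biased 16-bit field."""
    code = disp + BIAS
    assert 0 <= code < (1 << BITS), "displacement out of range"
    return code

def decode(code):
    return code - BIAS

print(decode(0), decode((1 << BITS) - 1))  # -8192 57343: 7/8 non-negative
```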
> The vast majority of displacements also tend to be scaled by the element
> size, so this makes sense as a default.
>
> Negative and misaligned displacements are still common enough though to
> where one may need some way to be able encode them without an excessive
> penalty. In my ISA, these cases were mostly relegated to loading the
> displacement into R0 and using a special (Rb, R0) encoding with an
> unscaled index.
>
>
> IME, one also needs both sign and zero extending loads, since then
> whatever is the non-default case ends up hammering the sign or zero
> extension instructions.
<
absofriggenlutely !!
>
> This is also a partial reason why "ADDS.L" and "ADDU.L" and similar
> exist in my ISA, since typically:
> Code makes heavy use of 32-bit integers;
> For better or for worse, wrap-on-overflow is the commonly expected
> behavior, and some amount of code will fail otherwise;
> These eliminate the vast majority of cases where one might otherwise
> need to sign or zero extend a value for a memory access (if the index
> register is greater than 32 bits).
>
> Note that AND/OR/XOR don't need 32-bit variants, since their behavior
> with sign- or zero-extended values tends to be identical to what a
> matching sign- or zero-extended operation would have produced.
>
>
>
> In my compiler output, there also tends to be a lot of "MOV Reg,Reg"
> instructions, but most of these are more due to my compiler being
> stupid. If the generated code were closer to optimal, there would be a
> lot fewer of these.
>
I see the majority of MOV Rd,Rs1 instructions at the middle of nested loops
getting set up to do the outer ADD/CMP/BC so the top of the loop code runs
smoothly.
>
>
> At the moment, there doesn't seem to be much obvious "this needs to be
> better" stuff left in my core ISA, at least within the confines of the
> general ISA design (eg, "Load/Store", ...).
<
see below::
>
> There are a few semi-common patterns which fall outside the load/store
> model, most commonly using a memory operand as a source to an ALU
> instruction (typically an integer add/subtract).
>
> But, I don't really want to go there, as doing stuff like this would
> likely require using a longer pipeline.
>
> However, if one were willing to spend the cost of having such a longer
> pipeline, they could potentially gain a few benefits:
> Single cycle ALU ops against values loaded from memory;
> Potentially being able to have fully pipelined FPU instructions;
> ...
>
> With the downsides:
> The resource cost of register forwarding and interlock handling goes up
> rapidly with pipeline length;
> Unpredictable branches are not friendly to longer pipelines;
> ...
>
> There is also the potential for added complexity for instruction
> decoding if one wants to go an x86-like route.
>
> Though, it looks like the majority of these cases could be addressed via
> a "simple" encoding, eg:
> ADDS.L (Rb, Disp), Rn
> SUBU.L (Rb, Ri), Rn
> ...
> Basically, encoded like a memory load.
<
from above::
<
There is nothing that prevents an implementation from CoIssuing these
as paired instructions.
>
> Though, the utility is diminished slightly in that these would (in many
> cases) require an additional register MOV:
> MOV Rs, Rn
> ADDS.L (Rb, Disp), Rn
<
CoIssue
<
If you dig into CoIssue enough, you will see that as many as 30% of RISC
instructions can be paired.
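A toy pairing check for the MOV-feeds-ALU-op case quoted above (illustrative only; operand tuples list sources first and the destination last, matching the syntax used in the thread):

```python
def can_coissue(first, second):
    """True when a MOV's destination is also the next op's destination,
    so the pair could issue together as one fused read-modify-write."""
    return first[0] == "MOV" and first[-1] == second[-1]

pair = [("MOV", "R8", "R4"), ("ADDS.L", "(R5, 8)", "R4")]
print(can_coissue(*pair))  # True: fusable into ADDS.L R8, (R5, 8), R4
```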
<
> Whereas:
> ADDS.L Rs, (Rb, Disp), Rn
> SUBU.L Rs, (Rb, Ri), Rn
> Would avoid an additional MOV, but would require being able to encode an
> instruction with 4 register IDs.
>
> Another option, eg:
> ADDS.L Rt, (Rb), Rn
> Would be possible, but would be hindered in that it would (nearly
> always) require a LEA (thus making it "kinda useless").
>
> ...
>
>
> Either way, it doesn't seem likely that the added cost/complexity of a
> longer pipeline would pay off. My guess is that one would end up
> paying a lot more for the branch misses than one would gain by saving
> a few cycles here and there on ALU ops.
>
> It could be possible also to try to shove it in as a special case
> without making the pipeline longer, but (to have any hope of passing
> timing) would require restricting the range of operand sizes (small
> integer types only, with a sign/zero extended result).
>
> ...

Re: Squeezing Those Bits: Concertina II

<5bb7446c-6c7e-46b9-86e9-756baa3b1351n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17365&group=comp.arch#17365

Newsgroups: comp.arch
Newsgroups: comp.arch
Date: Thu, 3 Jun 2021 16:33:03 -0700 (PDT)
In-Reply-To: <s9bga3$ljs$1@dont-email.me>
Message-ID: <5bb7446c-6c7e-46b9-86e9-756baa3b1351n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 03 Jun 2021 23:33:04 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 3 Jun 2021 23:33 UTC

On Thursday, June 3, 2021 at 4:07:17 PM UTC-5, BGB wrote:
> On 6/2/2021 6:33 PM, Quadibloc wrote:
> > On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
> >> On Tuesday, June 1, 2021 at 10:33:28 PM UTC-5, Quadibloc wrote:
> >
> >>> Suppose there's a dependency. If the machine has to take a cycle between starting
> >>> each instruction (one-wide decode unit) and an instruction takes X cycles to execute, then
> >>> I can deal with dependencies by coding X-1 instructions between the instructions involved in
> >>> the dependency.
> >
> >> And this is where VLIW breaks down. What if X is variable or X changes between implementations?
> >
> > Ah, but I guess I didn't make myself clear here.
> >
> > That's how I could deal with dependencies on a RISC CPU, which doesn't have the extra
> > features that VLIW offers.
> >
> > If I have OoO, obviously, if X changes between implementations, it's not an issue.
> >
> > If I have VLIW, I can now explicitly indicate 'this instruction depends on that previous
> > instruction', so the processor knows to wait only for the minimum time necessary (or, at
> > worst, the worst-case time required by the previous instruction).
> >
> Admittedly, I still have slight skepticism of "OoO everywhere" at this
> point...
>
>
> Eg, recently, my old phone had a problem and needed to be replaced. The
> battery basically puffed up like a balloon and more or less forcibly
> disassembled the phone. Did get a replacement battery though (after
> ordering the new phone), so technically have 2 phones now.
>
> So, I bought a new phone as a replacement, and what kind of fancy new
> CPU is it running?... Cortex-A53...
>
>
> So, given that apparently phones are still happy enough mostly running
> on 2-wide in-order superscalar cores, the incentive to go to bigger OoO
> cores seems mostly limited to "higher end" devices.
<
Typically, this is more of a power-budget thing than an IO-versus-OoO thing.
>
> And, it appears this is not entirely recent: these sorts of 2-wide
> superscalar cores seem to have been dominant in phones and consumer
> electronics for roughly the past 15-20 years or so.
>
>
> I suspect that the limited case of a 3-wide VLIW can potentially do
> slightly better at the 2-wide superscalar game than an actual 2-wide
> in-order superscalar, provided it can be kept cost-competitive in other
> areas.
>
> Then one can mostly try to ride along in a similar "sweet spot" to the
> one that seems to favor 2-wide superscalar cores.
>
> In my own design effort, 3-wide seemed to be the local optimum for the
> CPU core design (it is a bit more capable at 2-wide code than the 2-wide
> design would have been, even if cases where 3 instructions can be run in
> parallel tend to be infrequent).
>
>
>
> Meanwhile, going much wider than this introduces a bunch of "difficult"
> problems, which would require a lot of added heavy lifting to justify
> (eg: a 5 or 6 wide core requires "some way to make modulo loop
> scheduling effective and do aggressive inlining", otherwise it seems
> kinda pointless).
<
These modulo schedulers come with names like "reservation stations",
"Scoreboards" and "Dispatch Stacks".
>
> Well, and also going wider is likely to need 64 GPRs and multiple
> predicate registers, otherwise these seem to be a bottleneck when trying
> to do modulo loops and inlining (*), ...
<
While I am fine equipping a piece of HW with 128 to 256 actual physical registers,
I have given up on making these programmable by the compiler.
>
> But, in normal linear code, going beyond 32 GPRs and a single predicate
> offers little advantage (the multiple predicates would make more sense
> for scheduling multiple "if()" blocks on top of each other, or
> potentially several instances of the same "if()" branch overlapping itself).
<
I might note that putting the result of comparisons into registers enables
one to use said comparison more than once along with other comparisons.
This is a big bug-a-boo for condition codes.
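A toy sketch (Python, purely illustrative and not from the original post) of the reuse property: each compare result is an ordinary value that stays live and composable, where a single condition-code register would be clobbered by the next compare.

```python
# Compare-into-register: each comparison result occupies its own
# "register" and can be consumed several times and combined. With a
# single condition-code register, every new compare destroys the old
# result, forcing re-comparison.
def range_checks(x, lo, hi):
    ge_lo = lo <= x            # first compare result, kept live
    lt_hi = x < hi             # second compare, does not clobber the first
    in_range = ge_lo and lt_hi # both results combined here...
    below = not ge_lo          # ...and ge_lo reused again here
    return in_range, below

assert range_checks(5, 0, 10) == (True, False)
assert range_checks(-1, 0, 10) == (False, True)
```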
>
> *: Mostly noted in my case when writing ASM for some rasterizer loops
> and similar. What would otherwise be a single variable needs multiple
> registers due to duplication and similar, which can add a significant
> amount of register pressure.
>
> ...

Re: Squeezing Those Bits: Concertina II

<s9bp36$k74$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17366&group=comp.arch#17366

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Thu, 3 Jun 2021 18:36:58 -0500
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <s9bp36$k74$1@dont-email.me>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com>
<81deeb7a-4f9f-4e5c-95bd-64eac1fcf53cn@googlegroups.com>
<38e59b03-7103-477a-957e-63ef18b72a4dn@googlegroups.com>
<caf484d6-4574-4909-bc8a-ed944fc9bddcn@googlegroups.com>
<805ec395-f39c-403b-bdc3-5110653e237fn@googlegroups.com>
<563fa215-c166-4906-bf4b-e715c8b002c7n@googlegroups.com>
<s93lcf$1p1$1@dont-email.me>
<2a75fedf-7f84-41df-a12f-46e70a3bd696n@googlegroups.com>
<4b68e3b2-6343-429f-9afd-cb124f378817n@googlegroups.com>
<7180f6f6-d57b-4191-bddd-ef20e4f35a1dn@googlegroups.com>
<86e10294-a1ce-41c3-9d56-6f73afce5dean@googlegroups.com>
<110d93f7-d8bc-4523-869d-16f4249fad00n@googlegroups.com>
<3d8d0ac1-0462-4525-82fd-9dca309f038en@googlegroups.com>
<51734e5c-3a02-4079-a178-f7f46c442504n@googlegroups.com>
<4fb02966-46dc-4218-a26b-836ac68ecbb3n@googlegroups.com>
<ad2a41ce-c25e-4f84-b77c-bea8550f3b7bn@googlegroups.com>
<8ecd4d89-47b3-427e-be13-91cdf0476668n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 3 Jun 2021 23:37:10 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="be4adef6c9734bc2693a6aa371936b50";
logging-data="20708"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zzZjEProIBIlI33v9x7f/"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.2
Cancel-Lock: sha1:1TUkbryKU57mMLYa+XSrX17Fejo=
In-Reply-To: <8ecd4d89-47b3-427e-be13-91cdf0476668n@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 3 Jun 2021 23:36 UTC

On 6/2/2021 9:36 PM, Quadibloc wrote:
> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>
>> Taking the light weight nature of VLIW and crushing it with added baggage.
>
> Yes, that's a fair criticism. But implementations can omit features. Including the option of
> VLIW. (Preferred is to accept VLIW programs, but basically ignore all the hinting.)
>

This is basically what BJX2 does in a single wide core, or if one tries
to use an unsupported profile (it ignores the WEX hints and falls back
to executing instructions sequentially).

Profiles are currently:
0: WEX is not used;
1: 2-wide profile (not actually used ATM);
2: 3-wide profile (currently the only form of WEX used).
3: 3-wide with 2 memory lanes (possible).
4+: TBD

There could, in principle, be wider profiles, but as noted, I suspect the
current set of 32 GPRs and a single predicate is likely to limit this.

One thing I "could" do, in theory, is overload the encoding space I had
used for 24-bit ops in the microcontroller config to give either more
GPRs or more predicates, but the actual encoding space for opcodes would
be pretty small (basically useless with a 32-bit encoding).

A potentially more viable option would be putting this stuff in the
currently unused Op48 space (and then need to deal with the complexity
of multiple instruction sizes within bundles).

Hmm:
1111-111w-qnst-tttt oooo-oooo-oooo-oooo 10pp-pnst-nnnn-ssss

wppp:
0000: Always
0001: -
0010: Pred?T (SR.T)
0011: Pred?F (SR.T)
0100: Pred?T (SR.S)
0101: Pred?F (SR.S)
0110: Pred?T (?)
0111: Pred?F (?)
1000: WEX, Always
1001: -
1010: WEX, Pred?T (SR.T)
1011: WEX, Pred?F (SR.T)
1100: WEX, Pred?T (SR.S)
1101: WEX, Pred?F (SR.S)
1110: WEX, Pred?T (?)
1111: WEX, Pred?F (?)

Pretty ugly, but at least it could work, in theory...
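The wppp table could be read by a decoder along these lines (a hypothetical sketch of the proposed field, not actual BJX2 decode logic):

```python
# Hypothetical decoder for the 4-bit "wppp" field sketched above.
# Bit 3: WEX (execute in parallel with the following instruction).
# Bits 2:0: predicate selector; bit 0 gives the sense (0 = ?T, 1 = ?F),
# bits 2:1 give the flag source (01 = SR.T, 10 = SR.S, 11 = TBD).
FLAG_SOURCES = {1: "SR.T", 2: "SR.S", 3: "?"}

def decode_wppp(wppp):
    wex = bool(wppp & 0b1000)
    ppp = wppp & 0b0111
    if ppp == 0:
        return (wex, None, None)            # execute unconditionally
    if ppp == 1:
        raise ValueError("reserved encoding")
    src = FLAG_SOURCES[ppp >> 1]
    sense = "F" if (ppp & 1) else "T"
    return (wex, src, sense)

assert decode_wppp(0b0000) == (False, None, None)    # Always
assert decode_wppp(0b0011) == (False, "SR.T", "F")   # Pred?F (SR.T)
assert decode_wppp(0b1100) == (True, "SR.S", "T")    # WEX, Pred?T (SR.S)
```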

It is possible that the bundle configurations could be restricted in
order to limit the number of decoders required.

....

>> Today's mainframes from IBM are not S/360 (or even S/370, 3080, 3090); they are another
>> super extended 64-bit, but sometimes 32-bit, ISA with 3 kinds of floating point (HEX, binary, decimal)
>> and everything including the kitchen sink.
>
> Although they still have 00 -> 16 bits, 11-> 48 bits, and 01 and 10 -> 32 bits.
>
> And the things they've added were all things that met the needs of their customers; hex and binary
> floating point give compatibility with old software on the one hand, and the better numerical
> properties of IEEE-754 on the other.
>
> John Savard
>

Re: Squeezing Those Bits: Concertina II

<s9buuh$iaj$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17367&group=comp.arch#17367

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Thu, 3 Jun 2021 20:16:54 -0500
Organization: A noiseless patient Spider
Lines: 139
Message-ID: <s9buuh$iaj$1@dont-email.me>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com>
<38e59b03-7103-477a-957e-63ef18b72a4dn@googlegroups.com>
<caf484d6-4574-4909-bc8a-ed944fc9bddcn@googlegroups.com>
<805ec395-f39c-403b-bdc3-5110653e237fn@googlegroups.com>
<563fa215-c166-4906-bf4b-e715c8b002c7n@googlegroups.com>
<s93lcf$1p1$1@dont-email.me>
<2a75fedf-7f84-41df-a12f-46e70a3bd696n@googlegroups.com>
<4b68e3b2-6343-429f-9afd-cb124f378817n@googlegroups.com>
<7180f6f6-d57b-4191-bddd-ef20e4f35a1dn@googlegroups.com>
<86e10294-a1ce-41c3-9d56-6f73afce5dean@googlegroups.com>
<110d93f7-d8bc-4523-869d-16f4249fad00n@googlegroups.com>
<3d8d0ac1-0462-4525-82fd-9dca309f038en@googlegroups.com>
<51734e5c-3a02-4079-a178-f7f46c442504n@googlegroups.com>
<4fb02966-46dc-4218-a26b-836ac68ecbb3n@googlegroups.com>
<ad2a41ce-c25e-4f84-b77c-bea8550f3b7bn@googlegroups.com>
<7d9b1862-5d8d-4b07-8c13-9f1caef37cden@googlegroups.com>
<s9bga3$ljs$1@dont-email.me>
<5bb7446c-6c7e-46b9-86e9-756baa3b1351n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 4 Jun 2021 01:17:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="be4adef6c9734bc2693a6aa371936b50";
logging-data="18771"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+XGomPIn8wdmPkXr6U0Xkg"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.2
Cancel-Lock: sha1:wEcyr1w4F/ZL5u9oZwxZ+SAeIEY=
In-Reply-To: <5bb7446c-6c7e-46b9-86e9-756baa3b1351n@googlegroups.com>
Content-Language: en-US
 by: BGB - Fri, 4 Jun 2021 01:16 UTC

On 6/3/2021 6:33 PM, MitchAlsup wrote:
> On Thursday, June 3, 2021 at 4:07:17 PM UTC-5, BGB wrote:
>> On 6/2/2021 6:33 PM, Quadibloc wrote:
>>> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>>>> On Tuesday, June 1, 2021 at 10:33:28 PM UTC-5, Quadibloc wrote:
>>>
>>>>> Suppose there's a dependency. If the machine has to take a cycle between starting
>>>>> each instruction (one-wide decode unit) and an instruction takes X cycles to execute, then
>>>>> I can deal with dependencies by coding X-1 instructions between the instructions involved in
>>>>> the dependency.
>>>
>>>> And this is where VLIW breaks down. What if X is variable or X changes between implementations?
>>>
>>> Ah, but I guess I didn't make myself clear here.
>>>
>>> That's how I could deal with dependencies on a RISC CPU, which doesn't have the extra
>>> features that VLIW offers.
>>>
>>> If I have OoO, obviously, if X changes between implementations, it's not an issue.
>>>
>>> If I have VLIW, I can now explicitly indicate 'this instruction depends on that previous
>>> instruction', so the processor knows to wait only for the minimum time necessary (or, at
>>> worst, the worst-case time required by the previous instruction).
>>>
>> Admittedly, I still have slight skepticism of "OoO everywhere" at this
>> point...
>>
>>
>> Eg, recently, my old phone had a problem and needed to be replaced. The
>> battery basically puffed up like a balloon and more or less forcibly
>> disassembled the phone. Did get a replacement battery though (after
>> ordering the new phone), so technically have 2 phones now.
>>
>> So, I bought a new phone as a replacement, and what kind of fancy new
>> CPU is it running?... Cortex-A53...
>>
>>
>> So, given that apparently phones are still happy enough mostly running
>> on 2-wide in-order superscalar cores, the incentive to go to bigger OoO
>> cores seems mostly limited to "higher end" devices.
> <
> Typically, this is more of a power-budget thing than an IO-versus-OoO thing.

Not sure, a lot of these phones claim ~ 2 days of battery life from a
4500mAh battery, which is ~ 12x longer (relative to capacity) than I was
getting from a RasPi running Doom (using a bank of 3x 18650 cells).

Granted, I suspect their metric assumes that it is mostly in one's
pocket, not being used for much.

When I tried running my BJX2 emulator on my last phone, it drained down
the battery in a few hours.

>>
>> And, it appears this is not entirely recent: these sorts of 2-wide
>> superscalar cores seem to have been dominant in phones and consumer
>> electronics for roughly the past 15-20 years or so.
>>
>>
>> I suspect that the limited case of a 3-wide VLIW can potentially do
>> slightly better at the 2-wide superscalar game than an actual 2-wide
>> in-order superscalar, provided it can be kept cost-competitive in other
>> areas.
>>
>> Then one can mostly try to ride along in a similar "sweet spot" to the
>> one that seems to favor 2-wide superscalar cores.
>>
>> In my own design effort, 3-wide seemed to be the local optimum for the
>> CPU core design (it is a bit more capable at 2-wide code than the 2-wide
>> design would have been, even if cases where 3 instructions can be run in
>> parallel tend to be infrequent).
>>
>>
>>
>> Meanwhile, going much wider than this introduces a bunch of "difficult"
>> problems, which would require a lot of added heavy lifting to justify
>> (eg: a 5 or 6 wide core requires "some way to make modulo loop
>> scheduling effective and do aggressive inlining", otherwise it seems
>> kinda pointless).
> <
> These modulo schedulers come with names like "reservation stations",
> "Scoreboards" and "Dispatch Stacks".

If it is done in hardware...

I was imagining the case where the compiler manages to pull it off.
My compiler isn't smart enough for this, but GCC reportedly supports
modulo scheduling, so in principle it is possible.

I can do it on my ISA as well, though it generally requires using
hand-written ASM.
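As a rough sketch of what a modulo-scheduled (software-pipelined) loop does, independent of any particular ISA: the kernel overlaps the load for the next iteration with the compute for the current one, at the cost of prologue/epilogue code and extra live registers. A toy model in Python:

```python
# Toy model of software pipelining: the "load" for iteration i+1 is
# issued alongside the "compute" for iteration i. The extra live copy
# (cur vs nxt) is exactly the kind of duplication that drives up
# register pressure in real modulo-scheduled loops.
def scale_pipelined(src, k):
    out = []
    if not src:
        return out
    cur = src[0]                  # prologue: first load
    for i in range(1, len(src)):  # kernel: load i overlaps compute i-1
        nxt = src[i]
        out.append(cur * k)
        cur = nxt
    out.append(cur * k)           # epilogue: drain the last value
    return out

assert scale_pipelined([1, 2, 3, 4], 3) == [3, 6, 9, 12]
```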

>>
>> Well, and also going wider is likely to need 64 GPRs and multiple
>> predicate registers, otherwise these seem to be a bottleneck when trying
>> to do modulo loops and inlining (*), ...
> <
> While I am fine equipping a piece of HW with 128 to 256 actual physical registers,
> I have given up on making these programmable by the compiler.

Probably.

An FPGA can handle 64 GPRs, but yeah, exposing it at the ISA level is
more of an issue with 32-bit encodings.

>>
>> But, in normal linear code, going beyond 32 GPRs and a single predicate
>> offers little advantage (the multiple predicates would make more sense
>> for scheduling multiple "if()" blocks on top of each other, or
>> potentially several instances of the same "if()" branch overlapping itself).
> <
> I might note that putting the result of comparisons into registers enables
> one to use said comparison more than once along with other comparisons.
> This is a big bug-a-boo for condition codes.

The idea would be to try to use one of several bits in SR or similar as
predicate flags, or as several virtual 1-bit registers (as opposed to
throwing GPRs at the problem).

But, then one needs a few bits to indicate the source flag to use for
predication, and the destination flag for compare ops.

As noted, at present the ISA uses SR.T, but there are potential
candidates for an extended set: T/S, P/Q/R/O.

>>
>> *: Mostly noted in my case when writing ASM for some rasterizer loops
>> and similar. What would otherwise be a single variable needs multiple
>> registers due to duplication and similar, which can add a significant
>> amount of register pressure.
>>
>> ...

Re: Squeezing Those Bits: Concertina II

<s9c62v$i4m$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17368&group=comp.arch#17368

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Thu, 3 Jun 2021 22:18:44 -0500
Organization: A noiseless patient Spider
Lines: 296
Message-ID: <s9c62v$i4m$1@dont-email.me>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com>
<805ec395-f39c-403b-bdc3-5110653e237fn@googlegroups.com>
<563fa215-c166-4906-bf4b-e715c8b002c7n@googlegroups.com>
<s93lcf$1p1$1@dont-email.me>
<2a75fedf-7f84-41df-a12f-46e70a3bd696n@googlegroups.com>
<4b68e3b2-6343-429f-9afd-cb124f378817n@googlegroups.com>
<s94le0$3cr$1@dont-email.me>
<d934adf6-2832-4f14-8235-d3bddc8f0c26n@googlegroups.com>
<s963s5$4sa$1@newsreader4.netcologne.de> <s97m1u$8kh$1@dont-email.me>
<2021Jun2.183620@mips.complang.tuwien.ac.at> <s9a90s$g5j$1@dont-email.me>
<s9bbkq$kh4$1@dont-email.me>
<29140af3-45cc-4aca-9a90-52cfaab6388bn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 4 Jun 2021 03:18:55 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="be4adef6c9734bc2693a6aa371936b50";
logging-data="18582"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/k8MhnmzZKlU7BB/l6qodx"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.2
Cancel-Lock: sha1:yKrnmKZO2pSJqi54clE+rxWEQ2E=
In-Reply-To: <29140af3-45cc-4aca-9a90-52cfaab6388bn@googlegroups.com>
Content-Language: en-US
 by: BGB - Fri, 4 Jun 2021 03:18 UTC

On 6/3/2021 6:26 PM, MitchAlsup wrote:
> On Thursday, June 3, 2021 at 2:47:40 PM UTC-5, BGB wrote:
>> On 6/3/2021 4:56 AM, Marcus wrote:
>>> On 2021-06-02 Anton Ertl wrote:
>>>> Marcus <m.de...@this.bitsnbites.eu> writes:
>>>>> A different axis is the dynamic instruction count (i.e. instruction
>>>>> execution frequencies) which I think is more relevant for
>>>>> performance analysis/tuning of an ISA.
>>>>
>>>> One would think so, but the problem is that in a corpus of large
>>>> programs the dynamically executed instructions are mostly confined to
>>>> a few hot spots, and the rest of the large programs plays hardly
>>>> any role. And as a consequence, the most frequent dynamically
>>>> executed instruction sequences tend to be not very representative of
>>>> what happens in other programs.
>>>
>>> Interesting observation. I suspected that something similar would be the
>>> case, since looping/hot code must give a very strong bias.
>>>
>>> I have planned to add instruction frequency profiling to my simulator (I
>>> already have symbol-based function profiling which has proven very
>>> useful), but I'll keep in mind the difficulties that you pointed out.
>>>
>>> I wonder if you could improve the situation if you used something like
>>> different scales (logarithmic?) and some sort of filtering or
>>> thresholding (e.g. count each memory location at most N times) to
>>> reduce the bias from loops? Or possibly categorize counts into different
>>> bins (e.g. "cold", "medium", "hot").
>>>
> <
>> I have instruction-use stats in my emulator, but they tend to look
>> something like:
>> Program hammers it with various load/store instructions;
>> Slightly down the list, one encounters some branch ops;
>> Then ADD, TEST, SHLD / SHAD, ...;
>> Then CMPxx, and then the other ALU ops;
>>
>> Pretty much everything else tends to be sort of a soup of fractional
>> percentages at the bottom (except under fairly specific workloads).
>>
>>
>> The patterns do tend to imply that some higher-priority features are:
>> Load/Store Ops:
>> Load/Store with constant displacement;
>> Load/Store with register index scaled by element size;
>> Load/Store in both signed and unsigned variants.
>> Followed by Branch ops in (PC, Disp) form;
>> Followed by ADD/SUB and TEST/CMP in Reg/Immed forms;
>> Followed by shift ops in "Reg, Imm, Reg" form;
>> Followed by 3R ALU ops (Both "Reg, Reg, Reg" and "Reg, Imm, Reg");
>> ...
>>
>> Overall instruction counts are also useful to consider for both size and
>> performance optimization.
>>
>>
>> For general-purpose code, there is seemingly no real way to eliminate
>> the high frequency of load/store operations. In theory, it can be helped
>> by "strict aliasing" semantics in C, but this only goes so far and is
>> mostly N/A for code which does these optimizations itself.
>>
>> Being able to keep things in registers is good, but there is still a
>> limiting factor when most of ones' data is passed around in memory via
>> pointers (as is typical in most C code).
>>
>>
>> Loops tend to be fairly common and particularly biased towards using
>> lots of load/store ops, since frequently they are working on array data,
>> and arrays are almost invariably kept in memory.
> <
> It is moderately difficult to make a loop that is both useful and does not need loads
> or stores! But I digress.........

Pretty much.

Loops are sort of a place where memory loads and stores like to gather...

>>
>>
>> For loads/stores:
>> The vast majority of constant displacements tend to be positive (unless
>> one designs the ABI to use negative-offsets relative to a frame-pointer
>> or similar);
> <
> And of these, with 16-bit displacements, one would actually want about 7/8ths
> of the displacements to be positive rather than 1/2 positive and 1/2 negative.
> <

I was thinking of x86, where the common sequence was:
PUSH EBP
MOV EBP, ESP
PUSH EDI
PUSH ESI
...

But, then local variables were frequently accessed using [EBP-Disp],
except in GCC which IIRC typically did [ESP+Disp] or "$disp(%esp)"

>> The vast majority of displacements also tend to be scaled by the element
>> size, so this makes sense as a default.
>>
>> Negative and misaligned displacements are still common enough though to
>> where one may need some way to be able encode them without an excessive
>> penalty. In my ISA, these cases were mostly relegated to loading the
>> displacement into R0 and using a special (Rb, R0) encoding with an
>> unscaled index.
>>

Though, I forgot to mention that negative displacements can still be
encoded using a jumbo prefix.

I ended up going with positive-only displacements originally as, in my
testing, a 9-bit zero-extended displacement tended to do better in terms
of code density and performance than a 9-bit sign-extended displacement.
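For concreteness, the reach of a 9-bit scaled displacement under the two conventions (a worked example assuming 4-byte elements; the helper below is hypothetical, not from the original post):

```python
# Byte-offset range reachable by an N-bit displacement scaled by the
# element size. Zero-extension reaches twice as far forward as
# sign-extension, at the cost of losing negative offsets, which is why
# positive-only displacements can win when most offsets are positive.
def disp_range(bits, scale, signed):
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return (lo * scale, hi * scale)

assert disp_range(9, 4, signed=False) == (0, 2044)     # zero-extended
assert disp_range(9, 4, signed=True) == (-1024, 1020)  # sign-extended
```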

>>
>> IME, one also needs both sign and zero extending loads, since then
>> whatever is the non-default case ends up hammering the sign or zero
>> extension instructions.
> <
> absofriggenlutely !!

Yeah. Several ISAs only provide sign or zero extended loads, which isn't
ideal.

>>
>> This is also a partial reason why "ADDS.L" and "ADDU.L" and similar
>> exist in my ISA, since typically:
>> Code makes heavy use of 32-bit integers;
>> For better or for worse, wrap-on-overflow is the commonly expected
>> behavior, and some amount of code will fail otherwise;
>> These eliminate the vast majority of cases where one might otherwise
>> need to sign or zero extend a value for a memory access (if the index
>> register is greater than 32 bits).
>>
>> Note that AND/OR/XOR don't need 32-bit variants, since their behavior
>> with sign or zero extended values tends to be identical to what a
>> matching sign or zero extended operation would have produced.
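As an aside, the claim about AND/OR/XOR is easy to check mechanically: the low 32 bits of a bitwise op do not depend on how the inputs were extended to 64 bits. A quick sanity check (illustrative, not from the original post):

```python
# The low 32 bits of AND/OR/XOR are unaffected by whether the 32-bit
# inputs were sign- or zero-extended to 64 bits, so no separate 32-bit
# variants of these ops are needed.
M32, M64 = (1 << 32) - 1, (1 << 64) - 1

def sext64(x):
    # sign-extend a 32-bit value to 64 bits
    return (x | ~M32) & M64 if x & 0x80000000 else x

for a in (0x00000001, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    for b in (0x0000FFFF, 0x80000001, 0xDEADBEEF):
        for op in (lambda x, y: x & y,
                   lambda x, y: x | y,
                   lambda x, y: x ^ y):
            # zero-extension is the identity here, so compare against
            # the op on the raw 32-bit values
            assert op(sext64(a), sext64(b)) & M32 == op(a, b) & M32
```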
>>
>>
>>
>> In my compiler output, there also tends to be a lot of "MOV Reg,Reg"
>> instructions, but most of these are more due to my compiler being
>> stupid. If the generated code were closer to optimal, there would be a
>> lot fewer of these.
>>
> I see the majority of MOV Rd,Rs1 instructions at the middle of nested loops
> getting setup to do the outer ADD/CMP/BC so the top of the loop code runs
> smoothly.

In my case, there are a lot which exist more as an artifact of
intermediate IR stages and similar.

Front end uses ASTs;
ASTs are transformed into a Stack/RPN form;
RPN is converted to 3AC;
Codegen then uses the 3AC.

In many cases, the stack machine model works by pushing and popping
temporaries; values are moved between operators, or between source and
destination variables and literals, via these temporaries.

This RPN form long predates my CPU ISA project (and some of my VM
projects had gone back and forth between RPN and 3AC models). RPN
generally had the advantage of being less painful for a compiler
frontend though.

It was generally easier to deal with some of these complexities in an
RPN -> 3AC/SSA converter stage, but this converter stage might also need
to do things like fold literal values back into an operator, fold the
destination back into the operator, and so on. Much of this might not
have been necessary had it gone more directly from the AST to 3AC.

The compiler logic doesn't always filter these out, so values sometimes
get loaded into a register, shuffled through other registers, and only
then operated on after the fact.

Or a bug where:
y=x<<N;
Was being compiled as, effectively:
MOV N, R4
EXTS.L R4, R5
SHAD R8, R5, R9
Rather than:
SHAD R8, N, R9
....

Some other cases turn out to be code which was left over from before
various ops existed (eg: it uses MOV and a 2R op, because when that part
of the codegen was written, a 3R form hadn't been added yet).

And some amount of the compiler logic is a bit crufty as the ISA has
moved a fair bit from where it started out, and a lot of the design
changes were implemented using layers of hacks rather than doing a more
proper rewrite of that part of the compiler.

>>
>>
>> At the moment, there doesn't seem to be much obvious "this needs to be
>> better" stuff left in my core ISA, at least within the confines of the
>> general ISA design (eg, "Load/Store", ...).
> <
> see below::
>>
>> There are a few semi-common patterns which fall outside the load/store
>> model, most commonly using a memory operand as a source to an ALU
>> instruction (typically an integer add/subtract).
>>
>> But, I don't really want to go there, as doing stuff like this would
>> likely require using a longer pipeline.
>>
>> However, if one were willing to spend the cost of having such a longer
>> pipeline, they could potentially gain a few benefits:
>> Single cycle ALU ops against values loaded from memory;
>> Potentially being able to have fully pipelined FPU instructions;
>> ...
>>
>> With the downsides:
>> The resource cost of register forwarding and interlock handling goes up
>> rapidly with pipeline length;
>> Unpredictable branches are not friendly to longer pipelines;
>> ...
>>
>> There is also the potential for added complexity for instruction
>> decoding if one wants to go an x86-like route.
>>
>> Though, it looks like the majority of these cases could be addressed via
>> a "simple" encoding, eg:
>> ADDS.L (Rb, Disp), Rn
>> SUBU.L (Rb, Ri), Rn
>> ...
>> Basically, encoded like a memory load.
> <
> from above::
> <
> There is nothing that prevents an implementation from CoIssuing these
> as paired instructions.
>>
>> Though, the utility is diminished slightly in that these would (in many
>> cases) require an additional register MOV:
>> MOV Rs, Rn
>> ADDS.L (Rb, Disp), Rn
> <
> CoIssue
> <
> If you dig into CoIssue enough, you will see that as many as 30% of RISC
> instructions can be paired.


Re: Squeezing Those Bits: Concertina II

<s9ceab$82d$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=17369&group=comp.arch#17369

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-2e93-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Fri, 4 Jun 2021 05:39:23 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <s9ceab$82d$2@newsreader4.netcologne.de>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com>
<caf484d6-4574-4909-bc8a-ed944fc9bddcn@googlegroups.com>
<805ec395-f39c-403b-bdc3-5110653e237fn@googlegroups.com>
<563fa215-c166-4906-bf4b-e715c8b002c7n@googlegroups.com>
<s93lcf$1p1$1@dont-email.me>
<2a75fedf-7f84-41df-a12f-46e70a3bd696n@googlegroups.com>
<4b68e3b2-6343-429f-9afd-cb124f378817n@googlegroups.com>
<7180f6f6-d57b-4191-bddd-ef20e4f35a1dn@googlegroups.com>
<86e10294-a1ce-41c3-9d56-6f73afce5dean@googlegroups.com>
<110d93f7-d8bc-4523-869d-16f4249fad00n@googlegroups.com>
<3d8d0ac1-0462-4525-82fd-9dca309f038en@googlegroups.com>
<51734e5c-3a02-4079-a178-f7f46c442504n@googlegroups.com>
<859da8cd-bf0b-478d-8d8b-b0d11252dfe1n@googlegroups.com>
<s989ik$itn$1@dont-email.me>
<21c9b7a3-6dbe-4f84-a3bc-e3971552e772n@googlegroups.com>
<7d48604f-f7cd-43f8-be3c-ad3fc9242058n@googlegroups.com>
<s99v64$hsp$2@newsreader4.netcologne.de>
<44eabf62-646d-429e-a977-06c11fdfb2c4n@googlegroups.com>
Injection-Date: Fri, 4 Jun 2021 05:39:23 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-2e93-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:2e93:0:7285:c2ff:fe6c:992d";
logging-data="8269"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 4 Jun 2021 05:39 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> Really fast machines 12-gates
> more typical machines 16-gates

[...]

Thanks for the numbers!

What kind of gates are being counted here? An inverter certainly
has a lower delay than an XOR gate, for example. Are the
delay values converted to some normalized form, like a NAND gate?

Re: Squeezing Those Bits: Concertina II

<2021Jun4.102515@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17370&group=comp.arch#17370

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Fri, 04 Jun 2021 08:25:15 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 38
Message-ID: <2021Jun4.102515@mips.complang.tuwien.ac.at>
References: <698865df-06a6-4ec1-ae71-a36ccc30b30an@googlegroups.com> <51734e5c-3a02-4079-a178-f7f46c442504n@googlegroups.com> <859da8cd-bf0b-478d-8d8b-b0d11252dfe1n@googlegroups.com> <s989ik$itn$1@dont-email.me> <21c9b7a3-6dbe-4f84-a3bc-e3971552e772n@googlegroups.com> <7d48604f-f7cd-43f8-be3c-ad3fc9242058n@googlegroups.com> <s99v64$hsp$2@newsreader4.netcologne.de> <44eabf62-646d-429e-a977-06c11fdfb2c4n@googlegroups.com> <jwv5yyu6db6.fsf-monnier+comp.arch@gnu.org>
Injection-Info: reader02.eternal-september.org; posting-host="fd87124e72ac9b5a596aa7cb694975ef";
logging-data="4984"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19jeTQprr2kIFPJGxazQYv5"
Cancel-Lock: sha1:4A7qnOPSuNu1G68G7tzXhkPkto4=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 4 Jun 2021 08:25 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> With a 20 gate per cycle design point, one can build a 6-wide reservation
>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
>> pipeline.
>
>If we count 5-gates of delay for the clock-boundary's flip-flop, that
>means:
>
> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
>
>> At 16 gates this necessarily becomes 9-10 stages.
>> At 12 gates this necessarily becomes 12-15 stages.
>
>And that gives:
>
> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>
>So at least in terms of the latency of a single instruction going
>through the whole pipeline, the gain of targetting a lower-clocked
>design seems clear ;-)

But that's not particularly relevant. You want to minimize the total
execution time of a program; and, with a few exceptions (e.g., PAUSE),
one instruction does not wait until the previous instruction has left
the pipeline; if it did, there would be no point in pipelining.

Instead, a data-flow instruction waits until its operands are
available (and the functional unit is available). For simple ALU
operations, this typically takes 1 cycle (exceptions:
Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
made deep pipelines a win, until CPUs ran into power limits ~2005.
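Anton's point can be made concrete with a little arithmetic. In this Python sketch, the 5-gate flip-flop overhead is Stefan's assumption from the quoted post; the dependent-chain length is an arbitrary illustration, not from either post:

```python
# Compare the one-time pipeline-fill latency against the cost of a long
# chain of dependent single-cycle ALU ops, in units of gate delays.
FLOP = 5  # assumed flip-flop overhead per clock boundary (from the quoted post)

designs = {20: 7, 16: 10, 12: 15}  # gates per cycle -> stages (upper bounds quoted)

CHAIN = 100  # length of a dependent ALU chain; arbitrary illustration
for gates, stages in designs.items():
    cycle = gates + FLOP                  # gate delays per cycle
    fill = cycle * stages                 # one-time cost: filling the pipeline
    total = cycle * (stages + CHAIN - 1)  # fill plus one cycle per dependent op
    print(f"{gates}-gate design: fill={fill}, chain total={total}")
```

The fill latency is worst for the 12-gate design, but once the pipeline is full each dependent op costs only one (shorter) cycle, so the faster clock wins on the chain.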

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<192ffd17-5b07-4c68-9171-bd2db1d72e89n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17371&group=comp.arch#17371

From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Fri, 4 Jun 2021 08:40 UTC

On Thursday, June 3, 2021 at 9:18:58 PM UTC-6, BGB wrote:
> On 6/3/2021 6:26 PM, MitchAlsup wrote:
> > On Thursday, June 3, 2021 at 2:47:40 PM UTC-5, BGB wrote:

> >> IME, one also needs both sign and zero extending loads, since then
> >> whatever is the non-default case ends up hammering the sign or zero
> >> extension instructions.

> > absofriggenlutely !!

> Yeah. Several ISAs only provide sign or zero extended loads, which isn't
> ideal.

Here, for my fixed-point instructions, I avoid that. The load/store
instructions have two bits for the opcode once the operand type
is determined by whatever means.

Whenever the operand is shorter than the 64-bit integer register,
there are always *four* instructions available:

Load - sign extend
Store
Load Unsigned - zero extend
Insert - no extension, leave remaining bits untouched

That is, of course, merely a natural result of my slavish copying of the IBM
System/360, so I cannot really claim _too_ much credit for recognizing the
necessity of doing so.
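The four variants can be sketched behaviorally (Python, acting on 64-bit register values; the halfword case and the function names are chosen for illustration, not taken from the actual encoding):

```python
# Behavioral sketch of the four sub-register memory-op variants listed
# above, for a 16-bit halfword and a 64-bit register.  Names illustrative.
MASK64 = (1 << 64) - 1

def load_h(mem16):                 # Load: sign-extend 16 -> 64 bits
    return (mem16 - (1 << 16)) & MASK64 if mem16 & 0x8000 else mem16

def load_hu(mem16):                # Load Unsigned: zero-extend
    return mem16 & 0xFFFF

def insert_h(reg64, mem16):        # Insert: replace only the low 16 bits
    return ((reg64 & ~0xFFFF) & MASK64) | (mem16 & 0xFFFF)

def store_h(reg64):                # Store: truncate register to 16 bits
    return reg64 & 0xFFFF
```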

John Savard

Re: Squeezing Those Bits: Concertina II

<96041685-57a9-49d2-a4ad-476b32f8e59cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17372&group=comp.arch#17372

From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Fri, 4 Jun 2021 08:41 UTC

On Thursday, June 3, 2021 at 11:39:25 PM UTC-6, Thomas Koenig wrote:

> What kind of gates are being counted here? An inverter certainly
> has a lower delay than an XOR gate, for example. Are the
> delay values converted to some normalized form, like a NAND gate?

Yes, you have it. The delay required by a NAND gate is considered to be
'one gate delay', and so an XOR gate counts as two layers of gates.
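As a toy illustration of this normalization, one can total up delays along a path. The per-gate numbers below encode only the convention just described (NAND = 1, XOR = 2), not any real cell library, and the carry-per-bit figure is an illustrative assumption:

```python
# Toy delay model in the NAND-normalized convention described above.
DELAY = {"nand": 1, "nor": 1, "inv": 1, "xor": 2, "xnor": 2}

def path_delay(gates):
    """Sum the normalized delays along a list of gates on a path."""
    return sum(DELAY[g] for g in gates)

# A full adder's sum output goes through two XORs:
print(path_delay(["xor", "xor"]))          # 4 units

# A 64-bit ripple-carry chain, assuming ~2 units of carry logic per bit:
print(64 * path_delay(["nand", "nand"]))   # 128 units
```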

John Savard

Re: Squeezing Those Bits: Concertina II

<s9cphi$jm6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17373&group=comp.arch#17373

From: iva...@millcomputing.com (Ivan Godard)
 by: Ivan Godard - Fri, 4 Jun 2021 08:50 UTC

On 6/4/2021 1:25 AM, Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> With a 20 gate per cycle design point, one can build a 6-wide reservation
>>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
>>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
>>> pipeline.
>>
>> If we count 5-gates of delay for the clock-boundary's flip-flop, that
>> means:
>>
>> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
>>
>>> At 16 gates this necessarily becomes 9-10 stages.
>>> At 12 gates this necessarily becomes 12-15 stages.
>>
>> And that gives:
>>
>> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
>> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>>
>> So at least in terms of the latency of a single instruction going
>> through the whole pipeline, the gain of targetting a lower-clocked
>> design seems clear ;-)
>
> But that's not particularly relevant. You want to minimize the total
> execution time of a program; and, with a few exceptions (e.g., PAUSE),
> one instruction does not wait until the previous instruction has left
> the pipeline; if it did, there would be no point in pipelining.
>
> Instead, a data-flow instruction waits until its operands are
> available (and the functional unit is available). For simple ALU
> operations, this typically takes 1 cycle (exceptions:
> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
> made deep pipelines a win, until CPUs ran into power limits ~2005.

Exception: static scheduling, zero cycles

Re: Squeezing Those Bits: Concertina II

<2021Jun4.104421@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17374&group=comp.arch#17374

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Fri, 4 Jun 2021 08:44 UTC

BGB <cr88192@gmail.com> writes:
>So, I bought a new phone as a replacement, and what kind of fancy new
>CPU is it running?... Cortex-A53...
>
>
>So, given that apparently phones are still happy enough mostly running
>on 2-wide in-order superscalar cores, the incentive to go to bigger OoO
>cores seems mostly limited to "higher end" devices.

Apple uses OoO for both their big cores and their little cores.

Why ARM's customers still go for in-order is somewhat of a mystery to
me. We can see on

<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>

that the OoO Cortex A75 is more efficient (as well as more performant)
than the A55 at almost all performance points of the A55, for SPEC2006
Int+FP. ARM claims that the workloads on their little cores differ
significantly from those used for producing these kinds of benchmarks,
and that the A55 has better efficiency there. Unfortunately, they do
not give any evidence for that, so this may be just the usual
marketing stuff to make their bad decisions look good. But even if
they are right, what is the intended application area and what are the
efficiency needs for your architecture?

>And, it appears this is not entirely recent: these sorts of 2-wide
>superscalar cores seem to have been dominant in phones and consumer
>electronics for roughly the past 15-20 years or so.

Not sure what you mean by dominant. OoO cores have been used on
smartphones since the Cortex-A9, used in, e.g., the Apple A5 (2011).

As for other consumer electronics: If you don't need much performance,
no need for an expensive OoO core.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<2021Jun4.110954@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17375&group=comp.arch#17375

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Fri, 4 Jun 2021 09:09 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>What kind of gates are being counted here? An inverter certainly
>has a lower delay than an XOR gate, for example. Are the
>delay values converted to some normalized form, like a NAND gate?

https://en.wikipedia.org/wiki/FO4

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<2021Jun4.111129@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17376&group=comp.arch#17376

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Fri, 4 Jun 2021 09:11 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 6/2/2021 9:36 AM, Anton Ertl wrote:
>> Marcus <m.delete@this.bitsnbites.eu> writes:
>>> A different axis is the dynamic instruction count (i.e. instruction
>>> execution frequencies) which I think is more relevant for
>>> performance analysis/tuning of an ISA.
>>
>> One would think so, but the problem is that in a corpus of large
>> programs the dynamically executed instructions are mostly confined to
>> a few hot spots, and the rest of the large programs them plays hardly
>> any role. And as a consequence, the most frequent dynamically
>> executed instruction sequences tend to be not very representative of
>> what happens in other programs.
>
>While I am sure that happens, I don't see why it is a problem, assuming
>your corpus of programs is representative of the workload you expect.
>While each program's hot spots may be different, the "average" across
>all of them should be what you are optimizing for.

The basic problem is that no realistic corpus is representative of
programs outside the corpus if the characteristics you look at are too
specific. A similar problem is overfitting in machine learning
applications.

In this particular case, using dynamic execution count as the metric
means that of the large programs in the corpus, most code is not
considered for our training purposes. Only the small pieces of hot
code are considered.

And if we take these metrics at face value, long superinstructions
(i.e., ones that combine many simple instructions, e.g., a whole basic
block) seem to be optimal. But they typically do not occur in the hot
code of other programs (unless another program contains pretty much
the same basic block in its hot code, which is not the case for most of
the hot basic blocks in the corpus), so most of these
superinstructions go to waste. Since we typically limit the number of
superinstructions to bound the compile time and the size of the
interpreter, wasted superinstructions are a problem. In the end,
we do not get good superinstructions out of this approach.

You can mitigate this effect by selecting only short superinstructions
(say, 2-3 simple instructions in length), which tend to be more widely
applicable, but then you have already admitted that there is something
wrong with using dynamic execution count as metric.
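The selection problem described above can be sketched as frequency-weighted n-gram counting. The profile format, opcode names, and counts below are invented for illustration; only the length cap and the dynamic-count ranking come from the discussion:

```python
# Rank candidate superinstructions (instruction n-grams) by dynamic
# execution count, with an optional length cap and a budget on how many
# superinstructions the interpreter can afford.
from collections import Counter

def select_superinsts(profile, max_len, budget):
    """profile: list of (basic_block_opcodes, execution_count) pairs."""
    counts = Counter()
    for block, execs in profile:
        for n in range(2, min(max_len, len(block)) + 1):
            for i in range(len(block) - n + 1):
                counts[tuple(block[i:i + n])] += execs
    return [seq for seq, _ in counts.most_common(budget)]

# One hot block dominates the profile; without a cap, its whole contents
# would win.  Capping at 2-3 yields short, more widely applicable sequences.
profile = [(["load", "add", "store"], 1_000_000),
           (["load", "add", "cmp", "branch"], 10)]
print(select_superinsts(profile, max_len=3, budget=4))
```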

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Squeezing Those Bits: Concertina II

<5c59c673-da7b-49ad-9877-cf3a2ec313c4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17377&group=comp.arch#17377

From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Fri, 4 Jun 2021 10:00 UTC

On Friday, June 4, 2021 at 3:07:37 AM UTC-6, Anton Ertl wrote:

> Apple uses OoO for both their big cores and their little cores.

And, indeed, while Intel's original small Atom cores were in-order,
Intel eventually gave even those a simple out-of-order capability,
since transistor densities had increased and the original Atom cores
were perceived as having very poor performance.

And yet people didn't complain about the performance of the
486 DX. So I would be inclined to blame software bloat.

John Savard

Re: Squeezing Those Bits: Concertina II

<8bc75431-bac1-4546-8719-a335ea877d4dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17378&group=comp.arch#17378

From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Fri, 4 Jun 2021 10:07 UTC

On Friday, June 4, 2021 at 2:40:34 AM UTC-6, Quadibloc wrote:

> Whenever the operand is shorter than the 64-bit integer register,
> there are always *four* instructions available:
>
> Load - sign extend
> Store
> Load Unsigned - zero extend
> Insert - no extension, leave remaining bits untouched
>
> That is, of course, merely a natural result of my slavish copying of the IBM
> System/360, so I can not really claim _too_ much credit for recognizing the
> necessity of doing so.

But then, while the System/360 had Insert Character, it did not have Insert Halfword;
and while it had Load Positive and Load Negative, which did not even occur to me,
it had no Load Unsigned. So I guess I can claim _some_ credit... although, of course,
Load Unsigned could be seen as merely a consequence of the fact that once one adds
Insert to Load and Store, one needs *two* bits for the opcode field, and Load
Unsigned then naturally suggests itself.

John Savard
