

devel / comp.arch / Re: Squeezing Those Bits: Concertina II

Re: Squeezing Those Bits: Concertina II

Message-ID: <IpNuI.714160$nn2.306746@fx48.iad>
https://www.novabbs.com/devel/article-flat.php?id=17440&group=comp.arch#17440
Newsgroups: comp.arch
From: ThatWoul...@thevillage.com (EricP)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <b338ea39-397f-43c2-828a-14161e0964fan@googlegroups.com>
 by: EricP - Sat, 5 Jun 2021 16:14 UTC

MitchAlsup wrote:
> On Friday, June 4, 2021 at 8:36:40 AM UTC-5, EricP wrote:
>> Anton Ertl wrote:
>>> Stefan Monnier <mon...@iro.umontreal.ca> writes:
>>>>> With a 20 gate per cycle design point, one can build a 6-wide reservation
>>>>> station machine with back to back integer, 3 cycle LDs, 3 LDs per cycle,
>>>>> 4 cycle FMAC, 17 cycle FDIV; and 6-ported register files into a 6-7 stage
>>>>> pipeline.
>>>> If we count 5-gates of delay for the clock-boundary's flip-flop, that
>>>> means:
>>>>
>>>> (20+5)gates * 6-7 stages = 150-175 gates of total pipeline length
>>>>
>>>>> At 16 cycles this necessarily becomes 9-10 stages.
>>>>> At 12 gates this necessarily becomes 12-15 stages.
>>>> And that gives:
>>>>
>>>> (16+5)gates * 9-10 stages = 189-210 gates of total pipeline length
>>>> (12+5)gates * 12-15 stages = 204-255 gates of total pipeline length
>>>>
>>>> So at least in terms of the latency of a single instruction going
>>>> through the whole pipeline, the gain of targetting a lower-clocked
>>>> design seems clear ;-)
>>> But that's not particularly relevant. You want to minimize the total
>>> execution time of a program; and, with a few exceptions (e.g., PAUSE),
>>> one instruction does not wait until the previous instruction has left
>>> the pipeline; if it did, there would be no point in pipelining.
>>>
>>> Instead, a data-flow instruction waits until its operands are
>>> available (and the functional unit is available). For simple ALU
>>> operations, this typically takes 1 cycle (exceptions:
>>> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
>>> made deep pipelines a win, until CPUs ran into power limits ~2005.
>>>
>>> - anton
> <
>> The relevance of latency comes in, I think, when one considers the effect
>> of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
>> injects a bubble whose size is independent of the number of stages.
> <
> Make that number of stages TIMES width of execution.
> {and yes, I saw that you wrote that in negative context}
>> If we go from 6 stages, 20+5 gates to 12 stages, 12+5 gates
>> we increase the clock by a factor of (20+5)/(12+5) = 1.47.
>> But a bubble now takes 2x as many clocks to recover from.
> <
> And whereas branch predictors continue to get better, L1 cache hit
> ratios are essentially frozen by size and sets. So what started out as
> branch prediction limited (1990) ends up as L2 latency limited (2000+).
>> Also adding pipeline stages doesn't change the speed of data cache.
> <
> What changes the throughput of the data cache is ports. If you can
> perform 4 accesses per cycle to a 4-way banked cache, throughput
> goes way up and you quickly realize that you have to adequately port
> the L2 similarly. You want simultaneous misses in L1 to be handled
> simultaneously in the L2 !!

There are two design dimensions at play here.
The original discussion was the degree to which adding stages to
an in-order pipeline would improve performance. Another dimension
is making stages wider by carrying multiple uOps per stage packet.

Multiple cache ports might benefit wider stages, but I wonder how much.
To utilize the multiple ports, the packet must contain multiple load ops
(stores that cache miss can be saved in cache MSHR buffers if necessary).
It could allow some optimizations, like load prefetching
under miss for multiple load uOps, or early address translation.
However, for a packet to move on from the load stage,
all the load uOps in it must have finished.

Wide stages are orthogonal to the original question of the
beneficial effects of adding more stages.
I don't see that multiple cache ports could be utilized by adding
more stages (I suppose it allows overlap of translate and load
if two memory uOps are sequential, but that's really it).

> <
> We worried a lot about the number of wires in the 1990s, but even GPUs
> get 10 layers of metal, and IBM is using 17 layers in its modern mainframes.
> With this wire resource, there is little reason NOT to bank the cache hierarchy.
> <
> {Aside: many GPUs run 1024 wires from and another 1024 wires to the L1 cache
> (nor including the addresses and control)} And these busses pass data back
> and forth in the same "beat" structure as the SIMT calculation beat structure.}
> <
>> Adding pipeline stages to increase the frequency means we somewhat
>> decrease the unused cache access time between loads and stores.
>> However if the D$ cache access saturates, stages should have minimal impact.
> <
> Add ports and AGEN width to eliminate saturation.

Yes, but wouldn't adding full cache ports be prohibitively expensive,
requiring a whole extra decoder, word lines, bit lines, and sense amps?
It's one thing for a register file, but for a large-ish cache?

>> It also depends on how one measures performance.
> <
> There is only one sane metric here: wall clock time for entire application.
> <

Which is why one should not assume that adding more stages to increase
the clock frequency will necessarily decrease wall clock time.

>> More stages means higher frequency means higher potential issued MIPS.
>> If instead we count retired MIPS, to take into account bubbles and
> <
> Only compiler and CPU architects should be able to see unretired statistics.

I said this because the discussion at that point seemed to assume
that adding pipeline stages to increase the clock frequency by, say,
a factor of 1.47 would increase pipeline output by 1.47.
Increasing the frequency potentially allows instructions to be stuffed
into the pipeline faster, but other considerations limit
the actual gain to less than 1.47.

Re: Squeezing Those Bits: Concertina II

Message-ID: <s9ghd6$uuc$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=17445&group=comp.arch#17445
Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <2021Jun5.160330@mips.complang.tuwien.ac.at>
 by: BGB - Sat, 5 Jun 2021 18:55 UTC

On 6/5/2021 9:03 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> In my experience with them, at similar clock speeds, the original Atom
>> gets beaten pretty hard by ARM32.
>
> What do you mean with "ARM32"? If you mean the 32-bit ARM
> architecture, there are many different cores that implement this
> architecture. For the LateX benchmark I have:
>
> run time (s)
> - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
> - OMAP4 Panda board ES (1.2GHz Cortex-A9) Ubuntu 12.04 2.984
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
>

I had a machine with an Atom N270 (an ASUS Eee running Linux); its
performance kinda sucked vs. a RasPi2 (900MHz Cortex-A53) running
32-bit Raspbian in my tests.

IIRC, at the time I was mostly testing it with color-cell based video
codecs and similar.

These mostly use some arithmetic to calculate color endpoints, then
typically fill blocks of memory with pixel values using 1 or 2 bit
selectors. These can be implemented either with a small lookup table, or
with "c?x:y".

Eg:
ct0=dest;
ct1=dest+stride;
...
ct0[0]=(bpx&0x0001)?clra:clrb;
ct0[1]=(bpx&0x0002)?clra:clrb;
ct0[2]=(bpx&0x0004)?clra:clrb;
ct0[3]=(bpx&0x0008)?clra:clrb;
ct1[0]=(bpx&0x0010)?clra:clrb;
ct1[1]=(bpx&0x0020)?clra:clrb;
...

> For Gforth I have, e.g.:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.410 0.520 0.260 0.635 0.280 Exynos 4 (Cortex A9) 1.6GHz; gcc-4.8.x
> 0.600 0.650 0.310 0.870 0.450 Odroid C2 Cortex A53 32b 1536MHz, gcc 5.3.1
> 0.390 0.490 0.270 0.520 0.260 Odroid C2 Cortex A53 64b 1536MHz, gcc 5.3.1
>
> So, yes, OoO 32-bit ARMs like the Cortex-A9 have better
> performance/clock than Bonnell, but the Cortex-A53 in 32b-moe not su
> much. Which is quite surprising, because I would expect a RISC to
> suffer less from the in-order implementation than a CISC. This
> expected advantage is realized in the 64b-A53 result.
>

OK.

>> From what I can tell, it appears that x86 benefits a lot more from OoO
>> than ARM did, and Aarch64 is still pretty solid even with in-order
>> implementations.
>
> OoO implementations are a lot faster on both architectures:
>
> LaTeX:
>
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>
> Gforth:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>

OK.

When I tested with running interpreters, this was a case that somewhat
favored x86.

>> However, an OoO x86 machine does seem to be a lot more tolerant of
>> lackluster code generation than an in-order ARM machine (where, if the
>> generated code kinda sucks, its performance on an ARM machine also
>> sucks).
>
> Not sure what you mean with "lackluster code generation", but OoO of
> course deals better with code that has not been scheduled for in-order
> architectures. There is also the effect on OoO that instructions can
> often hide in the shadows of long dependency paths. But if you make
> the dependency path longer, you feel that at least as hard on an OoO
> CPU than on an in-order CPU.
>

I was thinking stuff like (pseudocode):
Load A
Load B
Op C=A+B
Load D
Store E
Move E=A
Store C
Move C=D
Op F=C+E
Store F
...

Or (closer to typical GCC "-O0" output):
Load A
Load B
Op C=A+B
Store C
Load A
Load B
Op C=A-B
Store C
...

This stuff does "sorta OK" on x86 machines (within ~ 3x of optimized
code), but poorly on ARM machines (~ 5x-7x slower than optimized).

In another past test trying to do dynamic recompilation of BJX2 code to
32-bit ARM (on a RasPi3), it was a fair bit slower than the FPGA
implementation (despite the RasPi3 having a fairly significant
clock-frequency advantage).

>> The x86 machine seems to just sort of take whatever garbage one
>> throws at it and makes it "sorta fast-ish" (even if it is basically just
>> a big mess of memory loads and stores with a bunch of hidden function
>> calls and similar thrown in).
>
> I don't know what you mean with "hidden function calls and similar",
> but the stuff about "a big mess of memory loads and stores" sounds
> like the ancient (and wrong) myth that loads and stores are free on
> IA-32 and AMD64.

They are not "free", but their performance impact is a lot less obvious.

For the hidden function calls, say, assume that for a certain type,
operators are implemented with function calls, such that:
z=3*x+y;
Compiles as if it were, say:
t0=__convsixi(3);
t1=__mulsxi(t0, x);
t2=__addxi(t1, y);
z=t2;

Likewise, for a simpler compiler, the logic implemented by the
code-generator might be fairly minimal, with the generated code
consisting almost entirely of function calls.

In this case, the code-gen mostly consists of logic for looking up the
name of which function to call.

In BGBCC, runtime calls are still used a lot (BJX2):
  Int (and smaller):
    Handled directly: +, -, *, &, |, ^, <<, >>
    Runtime call: /, % (*1)
  Long / Long Long:
    Handled directly: +, -, &, |, ^, <<, >>
    Runtime call: *, /, %
  Int128:
    Handled directly: &, |, ^
    Handled directly (ALUX): +, -, <<, >>
    Runtime call: *, /, %
    Runtime call (Non-ALUX): +, -, <<, >>
  Float / Double:
    Handled directly: +, -, *
    Runtime call: /
  Float128 / Long Double:
    Runtime call: Everything.
  Variant:
    Runtime call: Everything.
  Vector (vec3f / vec4f):
    Handled directly: +, -, * (pairwise)
    Runtime call: /, %, ^ (cross and dot product)
  ...

*1: Constant division may be implemented as ("(x*C)>>32") in some cases.

Type conversion paths also may or may not involve runtime calls.

The main part of the codegen doesn't necessarily know when function
calls may occur, so is paranoid by default (doesn't use any of the
scratch registers). There is logic that detects if a basic-block is
"pure" (no function calls or complex operations), which enables the use
of scratch registers (by the register allocator) within this block.

Contrast with the SH and BJX1 backends:
  Int (and smaller):
    Handled directly: +, -, &, |, ^
    Runtime call: *, /, %, <<, >> (*2)
  Long Long:
    Runtime call: Everything.
  ...

*2: This was because SH kinda sucked in some ways.
Some of the SH variants also used shift-slides, so being able to do a
shift directly was more of a special case. Much of the FPU operations
were also implemented via function calls (or pretty much the entire FPU
for SoftFP cases).

Likewise, for a local array:
  int a[256];  //allocated directly
But:
  int a[4099]; //runtime call (__alloca)
  //__alloca is in turn built on top of "malloc()".

Likewise:
struct bigstruct_s {
int arr[1999];
};

struct bigstruct_s a, b; //implicit __alloca calls
...
b=a; //implicitly calls memcpy()

Prolog and epilog compression may or may not be counted here; the
called prolog/epilog functions are generated by the compiler. This was
done even for performance-optimized code, since the savings from fewer
I$ misses tended to outweigh the cost of the extra branch instructions
and call/return overhead.

The presence of arrays or similar on the stack may also involve the
addition of "security tokens" to try to detect stack smashing due to
buffer overruns (an idea kinda borrowed from MSVC).

Very little of the C library is handled with builtins, with the main
exception of "memcpy" and similar potentially being handled as a
special-case (if the size is a small constant value, it may be
transformed into memory loads and stores).

Note that some of the C library functions were rewritten to be a little
more efficient than they were originally.

E.g., the C library I am using originally implemented strcmp() kinda like:
while(*srca && *srca++==*srcb++);


Re: Squeezing Those Bits: Concertina II

Message-ID: <s9ghts$10i5$1@gioia.aioe.org>
https://www.novabbs.com/devel/article-flat.php?id=17446&group=comp.arch#17446
Newsgroups: comp.arch
From: terje.ma...@tmsw.no (Terje Mathisen)
Subject: Re: Squeezing Those Bits: Concertina II
 by: Terje Mathisen - Sat, 5 Jun 2021 19:05 UTC

Anssi Saari wrote:
> Quadibloc <jsavard@ecn.ab.ca> writes:
>
>> On Friday, June 4, 2021 at 3:07:37 AM UTC-6, Anton Ertl wrote:
>>
>>> Apple uses OoO for both their big cores and their little cores.
>>
>> And, indeed, while Intel's original small Atom cores were in-order,
>> they eventually switched over to even giving those a simple
>> out-of-order capability, since transistor densities had increased,
>> and the original Atom cores were percieved as having very poor
>> performance.
>
> I'm actually retiring an old Atom system. D510 CPU, Bonnell uarch, 45
> nm, dual cores, 1.67 GHz. Early last decade these sold for $60 and that
> included a motherboard.
>
> It has served as a little file server and for that it's fine. But things
> like a web browser, even starting one let alone trying to render any
> pages is pretty frustrating. Any crypto likewise. Even a remote desktop
> thing like x2go is bogged down when starting up, that's apparently
> because some parts of it are shell scripts or Perl.
>
> I replaced it with the cheapest recent Intel CPU thing I could find, a
> Celeron G5900 (Comet Lake, 14 nm, dual cores, 3.4 GHz). It runs rings
> around the old Atom.
>
>> And yet people didn't complain about the performance of the
>> 486 DX. So I would be inclined to blame software bloat.
>
> I don't know, I seem to recall decoding and showing jpegs was pretty
> slow on a 486. MP3 audio decoding in software took a Pentium or at least
> a fairly fast 486 and highly optimized software. Crappy MPEG-1 video
> needed a hardware decoder card... Word for Windows 2.0 ran fine.
>
I helped optimize the MMX asm in Zoran's SoftDVD, which was the first
pure-software DVD player that could handle 30 fps interlaced with zero
drops on a 200 MHz Pentium MMX. That CPU was effectively 6-12x faster
than the classic 33 MHz 486, but something like Blink video could
probably run well even on a 486.

Decoding MPEG-1 in software was far easier than the DVD MPEG-2 formats.
Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Squeezing Those Bits: Concertina II

Message-ID: <49a5427d-1d8a-4c8a-8b1e-30796df3da76n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=17449&group=comp.arch#17449
Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Subject: Re: Squeezing Those Bits: Concertina II
In-Reply-To: <IpNuI.714160$nn2.306746@fx48.iad>
 by: MitchAlsup - Sat, 5 Jun 2021 19:20 UTC

On Saturday, June 5, 2021 at 11:14:35 AM UTC-5, EricP wrote:
> [...]
>
> There are two design dimensions at play here.
> The original discussion was the degree to which adding stages to
> an in-order pipeline would improve performance. Another dimension
> is making stages wider by carrying multiple uOps per stage packet.
>
> Multiple cache ports might benefit with wider stages but I wonder how much.
<
1/3rd of all RISC instructions are LD/ST (not counting LD constant).
So a 3-wide machine can get by with 1 port, but a 4-wide machine needs 2 ports.
However, in x86 land, LD-Ops increase the mem-refs to 50%, and 2 ports
are desirable.
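The port arithmetic above can be sanity-checked with a trivial sketch (the mem-ref fractions are the thread's rules of thumb, not measurements):

```python
def ports_needed(width, mem_num, mem_den):
    """Cache ports needed to sustain `width` ops/cycle when mem_num/mem_den
    of the instruction mix is loads/stores: ceil(width * mem_num / mem_den),
    computed in exact integer arithmetic."""
    return -(-(width * mem_num) // mem_den)

print(ports_needed(3, 1, 3))  # -> 1: one port covers a 3-wide RISC mix
print(ports_needed(4, 1, 3))  # -> 2: a 4-wide machine needs two
print(ports_needed(4, 1, 2))  # -> 2: x86 LD-Ops push mem-refs to ~50%
```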
<
> To utilize the multiple ports the packet must contain multiple load ops
> (stores that cache miss can be saved in cache MSHR buffers if necessary).
> It could allow some optimizations, like load prefetching
> under miss for multiple load uOps, early address translation.
> However for a packet to move on from the load stage,
> all the load uOps in it must be finished.
>
> Wide stages are orthogonal to the original question of the
> beneficial effects of adding more stages.
> I don't see that multiple cache ports could be utilized by adding
> more stages (I suppose it allows overlap of translate and load
> if two memory uOps are sequential, but that's really it).
>
> > <
> > We worried a lot about the number of wires in the 1990s, but even GPUs
> > get 10 layers of metal, and IBM is using 17 layers in its modern mainframes.
> > With this wire resource, there is little reason NOT to bank the cache hierarchy.
> > <
> > {Aside: many GPUs run 1024 wires from and another 1024 wires to the L1 cache
> > (not including the addresses and control). And these busses pass data back
> > and forth in the same "beat" structure as the SIMT calculation beat structure.}
> > <
> >> Adding pipeline stages to increase the frequency means we somewhat
> >> decrease the unused cache access time between loads and stores.
> >> However if the D$ cache access saturates, stages should have minimal impact.
> > <
> > Add ports and AGEN width to eliminate saturation.
>
> Yes, but wouldn't additional full cache ports be prohibitively expensive,
> requiring a whole extra decoder, word lines, bit lines, sense amps?
> It's one thing for a register file, but for a large-ish cache?
<
Block-partition the data section (block[0]..block[k]; k=2^n), then use a bit of address
comparison to see how to amortize the Tag array and TLB. Back this up with a fully
associative, temporally organized L0 to clean up the common references, and you get
good performance on a 6-wide machine.
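A minimal sketch of the bank-selection idea (line size and bank count are illustrative assumptions, not My 66000 parameters): accesses to distinct banks proceed in parallel, and the most-contended bank sets the cycle count.

```python
LINE_BITS = 6    # 64-byte cache lines (assumption)
BANKS = 4        # 4-way banked data section (assumption)

def bank_of(addr):
    """Bank index taken from the low bits of the line address."""
    return (addr >> LINE_BITS) & (BANKS - 1)

def cycles_for(addrs):
    """Accesses that collide in a bank must serialize; the cycle count
    is the occupancy of the most-contended bank."""
    counts = {}
    for a in addrs:
        b = bank_of(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

print(cycles_for([0x000, 0x040, 0x080, 0x0C0]))  # -> 1: all four banks hit
print(cycles_for([0x000, 0x100, 0x080, 0x0C0]))  # -> 2: two refs collide in bank 0
```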
<
>
> >> It also depends on how one measures performance.
> > <
> > There is only one sane metric here: wall clock time for entire application.
> > <
>
> Which is why one should not assume that adding more stages to increase
> the clock frequency will necessarily decrease wall clock time.
<
The balance point is a lot more delicate than many people realize.
>
> >> More stages means higher frequency means higher potential issued MIPS.
> >> If instead we count retired MIPS, to take into account bubbles and
> > <
> > Only compiler and CPU architects should be able to see unretired statistics.
>
> I said this because the discussion at that point seemed to assume
> that adding pipeline stages to increase the clock frequency by, say,
> a factor of 1.47 would increase pipeline output by 1.47.
<
This is Mitch's first law: whatever first-order performance benefit you think
feature x will bring, it will actually bring no more than SQRT(x). First recognized
around 1988.
<
> Increasing the frequency potentially allows instructions to be stuffed
> into the pipeline faster. There are other considerations that limit
> the actual gains to < 1.47.
<
My guess is 1.21×
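The 1.21× estimate is just the law above applied to the 1.47× clock gain computed upthread; a quick check (a sketch, nothing more):

```python
import math

clock_gain = (20 + 5) / (12 + 5)   # gate-delay ratio from upthread: ~1.47x
realized = math.sqrt(clock_gain)   # Mitch's first law: realized gain ~ sqrt(x)
print(f"{clock_gain:.2f}x clock -> ~{realized:.2f}x realized")
```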

Re: Squeezing Those Bits: Concertina II

<631c11d8-566e-4aa4-98b6-f437ce98e0cbn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17475&group=comp.arch#17475

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 08:16:47 -0700 (PDT)
Message-ID: <631c11d8-566e-4aa4-98b6-f437ce98e0cbn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Sun, 6 Jun 2021 15:16 UTC

On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:

> Taking the light weight nature of VLIW and crushing it with added baggage.

As I've noted, just because the ISA spec includes _both_ VLIW _and_
added baggage, that doesn't mean that implementations are required to
include both; they can omit either one. So the architecture can have two
subsets that make sense, even if the whole thing doesn't.

However, on the page

http://www.quadibloc.com/arch/cp0102.htm

I have now begun sketching out the instruction formats for the added
baggage.

John Savard

Re: Squeezing Those Bits: Concertina II

<BCavI.41354$gZ.37433@fx44.iad>


https://www.novabbs.com/devel/article-flat.php?id=17479&group=comp.arch#17479

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Message-ID: <BCavI.41354$gZ.37433@fx44.iad>
Date: Sun, 06 Jun 2021 16:51:46 -0400
 by: EricP - Sun, 6 Jun 2021 20:51 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
>>> Instead, a data-flow instruction waits until its operands are
>>> available (and the functional unit is available). For simple ALU
>>> operations, this typically takes 1 cycle (exceptions:
>>> Willamette/Northwood 1/2 cycle, Bulldozer: 2 cycles). And that's what
>>> made deep pipelines a win, until CPUs ran into power limits ~2005.
>>>
>>> - anton
>> The relevance of latency comes in, I think, when one considers the effect
>> of bubbles on the pipeline. A branch mispredict or I$L1 cache miss
>> injects a bubble whose size is independent of the number of stages.
>
> A branch mispredict is (in the best case) feedback from the stage that
> recognizes the misprediction to the instruction fetch stage. Here the
> latency (in cycles and in ns) becomes longer with more pipeline
> stages. Fortunately branch mispredictions are rare.
>
> Caches these days seem to be clocked and pipelined, allowing a request
> per cycle or so (more for L1), with the shared L3 having its own
> clock. So maybe the latency also increases with the pipelining
> overhead. This could explain why Apple can access 128KB in the same
> ~1ns that Intel needs for accessing 48KB: Apple only divides the 1ns
> into 3 cycles, Intel into 5.
>
>> It also depends on how one measures performance.
>> More stages means higher frequency means higher potential issued MIPS.
>> If instead we count retired MIPS, to take into account bubbles and
>> any back pressure (stall) effects of D$ cache access,
>> I would expect to see much less actual benefit.
>
> Of course you measure the time to complete the program. Given the
> quality of branch prediction in the early 2000s, 52 stages seemed to
> be the optimal pipeline depth for the Pentium 4 [sprangle&carmean02],
> and both Intel (Tejas) and AMD (Mitch Alsup's K9) were on that path,
> until both canceled the projects in 2005. My guess is that they were
> both betting on a cooling technology that evaporated in 2005.
>
> Since then the sweet spot seems to have been the 14-19 stages or so
> that Intel and AMD have been using (wikichip claims 19 stages for
> Zen-Zen3 and 14-19 for Skylake and Ice Lake). But Apple's A14 shows
> us that you can do lower-clocked (and likely shorter-pipeline) cores
> that have so much more IPC that they have competitive performance.
> Makes me wonder whether there is an even sweeter spot in between.

I was referring to in-order pipelines because the analysis is simpler.
Pipelining in OoO modules like rename, scheduling or issue brings
a whole extra level of complexity.

> @InProceedings{sprangle&carmean02,
> author = {Eric Sprangle and Doug Carmean},
> title = {Increasing Processor Performance by Implementing
> Deeper Pipelines},
> crossref = {isca02},
> pages = {25--34},
> url = {http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/deep-pipelines-isca02.pdf},
> annote = {This paper starts with the Willamette (Pentium~4)
> pipeline and discusses and evaluates changes to the
> pipeline length. In particular, it gives numbers on
> how lengthening various latencies would affect IPC;
> on a per-cycle basis the ALU latency is most
> important, then L1 cache, then L2 cache, then branch
> misprediction; however, the total effect of
> lengthening the pipeline to double the clock rate
> gives the reverse order (because branch
> misprediction gains more cycles than the other
> latencies). The paper reports 52 pipeline stages
> with 1.96 times the original clock rate as optimal
> for the Pentium~4 microarchitecture, resulting in a
> reduction of 1.45 of core time and an overall
> speedup of about 1.29 (including waiting for
> memory). Various other topics are discussed, such as
> nonlinear effects when introducing bypasses, and
> varying cache sizes. Recommended reading.}
> }

Thanks. I haven't seen this before, I'll have a look.

> @InProceedings{hrishikesh+02,
> author = {M. S. Hrishikesh and Norman P. Jouppi and Keith
> I. Farkas and Doug Burger and Stephen W. Keckler and
> Premkishore Shivakumar},
> title = {The Optimal Logic Depth per Pipeline Stage is 6 to 8
> FO4 Inverter Delays},
> crossref = {isca02},
> pages = {14--24},
> annote = {This paper takes a low-level simulator of the 21264,
> varies the number of pipeline stages, uses this to
> run a number of workloads (actually only traces from
> them), and reports performance results for
> them. With a latch overhead of about 2 FO4
> inverters, the optimal pipeline stage length is
> about 8 FO4 inverters (with work-load-dependent
> variations). Discusses various issues involved in
> quite some depth. In particular, this paper
> discusses how to pipeline the instruction window
> design (which has been identified as a bottleneck in
> earlier papers).}
> }
>
> @Proceedings{isca02,
> title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
> booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
> year = "2002",
> key = "ISCA 29",
> }
>
> - anton

Thanks. I had read the second paper long ago and just reread it.
It says exactly what I was saying and puts some numbers to it
with SPEC benchmarks running on an in-order Alpha.

"This in-order pipeline is similar to the Alpha 21264 pipeline except
that it issues instructions in-order. It has seven stages —
fetch, decode, issue, register read, execute, write back and commit.
The issue stage of the processor is capable of issuing up to four
instructions in each cycle. The execution stage consists of four
integer units and two floating-point units. All functional units
are fully pipelined, so new instructions can be assigned to them
at every clock cycle."

"In this experiment, when (stage gates) is reduced from 10(+2) to 6(+2)
FO4 the improvement in performance is only about 9% compared
to a clock frequency improvement of 50%."

Figure 4b shows the plot for the integer benchmarks as the stage size varies,
taking into account an extra latch overhead of 1.8 FO4 per stage.
It has a slight hump, peaking at 6 FO4 gates per stage,
but the curve is really not that pronounced.
The difference across all stage sizes from 2 to 16 FO4 gates
looks to be under 30% max.
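The frequency side of the quoted experiment is simple arithmetic over the latch overhead; a toy model (the ~9% performance figure comes from the paper's simulations and is not reproduced here):

```python
def freq_gain(new_logic, old_logic=10.0, latch=2.0):
    """Relative clock gain from shrinking per-stage logic depth,
    with cycle time = (logic + latch) in FO4 delays, per the
    "10(+2) to 6(+2)" figures quoted above."""
    return (old_logic + latch) / (new_logic + latch)

print(freq_gain(6))   # -> 1.5, the 50% frequency improvement quoted
```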

Re: Squeezing Those Bits: Concertina II

<d23521a2-8ea8-46f9-8675-a8b1b3239c75n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17486&group=comp.arch#17486

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 18:51:27 -0700 (PDT)
Message-ID: <d23521a2-8ea8-46f9-8675-a8b1b3239c75n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 01:51 UTC

On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>
> > Taking the light weight nature of VLIW and crushing it with added baggage.
>
> As I've noted, just because the ISA spec includes _both_ VLIW _and_
> added baggage, that doesn't mean that implementations are required to
> include both; they can omit either one. So the architecture can have two
> subsets that make sense, even if the whole thing doesn't.

Also, most of the time, the VLIW as designed will indeed be useless. Real-world
code hardly _ever_ has an ILP of 8.

And yet I've designed things so that not even a single gate delay is spent determining
the length of an instruction, because if I did that, the number of gate delays would be
multiplied by 8 over the length of a block.

What am I up to?

The VLIW is intended, on larger implementations at least, for _occasional_ use, in
specially crafted subroutines which are designed to have a high ILP while doing a
specialized task. Not with the expectation that it is capable of providing much
general assistance, even if in lightweight implementations it might _approach_ the
benefits of OoO. (It would do a better job of _that_ had I encumbered the architecture,
as I did in some earlier attempts, with banks of 128 registers instead of 32, but you've
noted that would create issues in accessing them - and I don't have the opcode space
any more, at least in regular 32-bit long instructions.)

John Savard

Re: Squeezing Those Bits: Concertina II

<15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17487&group=comp.arch#17487

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 19:00:39 -0700 (PDT)
Message-ID: <15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Mon, 7 Jun 2021 02:00 UTC

On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
> On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
> > On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
> >
> > > Taking the light weight nature of VLIW and crushing it with added baggage.
> >
> > As I've noted, just because the ISA spec includes _both_ VLIW _and_
> > added baggage, that doesn't mean that implementations are required to
> > include both; they can omit either one. So the architecture can have two
> > subsets that make sense, even if the whole thing doesn't.
> Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> code hardly _ever_ has an ILP of 8.
>
> And yet I've designed things so that not even a single gate delay is spent determining
> the length of an instruction, because if I did that, the number of gate delays would be
> multiplied by 8 over the length of a block.
<
log2(8) = 3.
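The tree Mitch alludes to can be illustrated with a Kogge-Stone-style parallel prefix over hypothetical per-slot instruction lengths: 3 combining levels replace an 8-long serial chain of length decodes.

```python
def prefix_offsets(lengths):
    """Start offsets of each slot via a Kogge-Stone parallel prefix:
    log2(N) combining levels instead of an N-long serial chain."""
    n = len(lengths)
    acc = list(lengths)
    d = 1
    levels = 0
    while d < n:
        # each level combines with the partial sum d positions back
        acc = [acc[i] + (acc[i - d] if i >= d else 0) for i in range(n)]
        d *= 2
        levels += 1
    # offsets are the exclusive prefix sums
    return [0] + acc[:-1], levels

# hypothetical slot lengths in bytes, purely for illustration
offs, levels = prefix_offsets([4, 2, 4, 4, 2, 2, 4, 4])
print(offs)    # [0, 4, 6, 10, 14, 16, 18, 22]
print(levels)  # 3 levels for an 8-slot block
```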
>
> What am I up to?
<
{with only mild sarcasm::}
It appears you are wandering around in the dark without a candle.
>
> The VLIW is intended, on larger implementations at least, for _occasional_ use, in
> specially crafted subroutines which are designed to have a high ILP while doing a
> specialized task. Not with the expectation that it is capable of providing much
> general assistance, even if in lightweight implementations it might _approach_ the
> benefits of OoO. (It would do a better job of _that_ had I encumbered the architecture,
> as I did in some earlier attempts, with banks of 128 registers instead of 32, but you've
> noted that would create issues in accessing them - and I don't have the opcode space
> any more, at least in regular 32-bit long instructions.)
<
The way I vectorize loops, an implementation that is 1-wide In-Order can run the same
code as optimally as a machine that is 16-wide Out-of-Order. The 1-wide machine can
get 3-4 IPC in the C string and Mem library subroutines compiled as they are, yet the
GBOoO machine can get 48-64 I/C from the same code.
<
So, yes, we both let the implementations choose for themselves the proper balance
of memory-to-calculation-to-branch; My 66000 can run these optimally from a single
code base compiled for the 1-wide machine.
<
Also note, by being able to get 3-4 IPC out of 1-wide implementations in the small loops
found "lots of places" in code, one gets a majority of the speed of GBOoO machines
in significantly smaller packages (cores) !!
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<189552d4-41c6-4239-b0dc-bef993ed2841n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17489&group=comp.arch#17489

Newsgroups: comp.arch
Date: Sun, 6 Jun 2021 20:33:51 -0700 (PDT)
Message-ID: <189552d4-41c6-4239-b0dc-bef993ed2841n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 03:33 UTC

On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:

> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

I certainly don't have anything approaching your level of expertise.

The main bit of 'progress' this design has achieved is: now that
having two 16-bit instructions starting with 0 in a 32-bit word
uses up 1/4 instead of 1/2 of the opcode space of 32-bit instructions,
all the 32-bit memory-reference instructions can use the standard
memory model and the standard set of base registers.

Given that GPU-based floating-point accelerators are a thing, and
the Cray-like NEC SX-9 connected its processors to memory via
a 16-channel wide bus... and AMD's EPYC and Threadripper Pro
processors have an 8-channel memory bus... I still believe that
Cray-style vector processing is worthwhile.

The advantage is that it's more general and flexible than any
GPU-based solution.

Other than that point, though, I'm not inclined to differ much
with your criticisms. I'm trying to attain high code density,
and the option of VLIW, and your eminently positive ideas
about immediates, in ways that are simple enough for me
to understand... so naturally they're clumsy.

Throwing in everything but the kitchen sink doesn't mean that
every implementation has to include it all. But if an implementor
does want a particular feature - a standardized opcode for it is
already defined. I think that can be helpful.

John Savard

Re: Squeezing Those Bits: Concertina II

<9GgvI.515493$J_5.262004@fx46.iad>


https://www.novabbs.com/devel/article-flat.php?id=17492&group=comp.arch#17492

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Message-ID: <9GgvI.515493$J_5.262004@fx46.iad>
Date: Sun, 06 Jun 2021 23:48:08 -0400
 by: EricP - Mon, 7 Jun 2021 03:48 UTC

MitchAlsup wrote:
> On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
>> What am I up to?
> <
> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

You are likely to be eaten by a grue.

Re: Squeezing Those Bits: Concertina II

<s9kb78$39d$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17495&group=comp.arch#17495

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Sun, 6 Jun 2021 22:35:35 -0700
Message-ID: <s9kb78$39d$1@dont-email.me>
 by: Ivan Godard - Mon, 7 Jun 2021 05:35 UTC

On 6/6/2021 6:51 PM, Quadibloc wrote:
> On Sunday, June 6, 2021 at 9:16:49 AM UTC-6, Quadibloc wrote:
>> On Wednesday, June 2, 2021 at 12:27:23 PM UTC-6, MitchAlsup wrote:
>>
>>> Taking the light weight nature of VLIW and crushing it with added baggage.
>>
>> As I've noted, just because the ISA spec includes _both_ VLIW _and_
>> added baggage, that doesn't mean that implementations are required to
>> include both; they can omit either one. So the architecture can have two
>> subsets that make sense, even if the whole thing doesn't.
>
> Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> code hardly _ever_ has an ILP of 8.

True in open code, false in pipelined loops.

Re: Squeezing Those Bits: Concertina II

<s9kmba$2b9$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17498&group=comp.arch#17498

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Mon, 7 Jun 2021 10:45:29 +0200
Organization: A noiseless patient Spider
Lines: 232
Message-ID: <s9kmba$2b9$1@dont-email.me>
In-Reply-To: <s9e5r2$6j3$1@dont-email.me>
 by: Marcus - Mon, 7 Jun 2021 08:45 UTC

On 2021-06-04, BGB wrote:
> On 6/4/2021 2:10 PM, Marcus wrote:
>> On 2021-06-04, BGB wrote:
>>>

[snip]

>>>
>>
>> I would guess that part of what you're measuring is the compiler
>> maturity level. On my in-order CPU that does not even have I$ nor D$
>> (but a shared single cycle 32-bit BRAM bus), I got something like
>> 0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
>> C library functions (memcpy etc). Before optimizing the libc routines
>> I got 0.5 DMIPS/MHz.
>>
>> With a proper I$ (that I'm currently working on) I expect to get closer
>> to 1 DMIPS/MHz.
>>

BTW, my point here is that Dhrystone is very much a compiler and libc
test, not just a CPU test. AFAICT it's perfectly OK to hand-optimize
libc functions (memcpy, memset, strcmp, ...), and using MRISC32 vector
operations in the hot libc functions helped to bump the Dhrystone score
from ~0.5 to ~0.7 IIRC.
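The sort of libc hand-optimization being described can be sketched in portable C. This word-at-a-time memset is a generic illustration only, not the actual MRISC32 library code (which additionally uses vector operations in the bulk loop):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic sketch of a hand-optimized memset: fill one aligned 32-bit
 * word per iteration instead of one byte.  Not the MRISC32 code, which
 * uses vector stores for the bulk loop. */
void *memset_wide(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint32_t pat = 0x01010101u * (unsigned char)c;  /* replicate the byte */

    /* byte-at-a-time until p is 4-byte aligned */
    while (n && ((uintptr_t)p & 3)) { *p++ = (unsigned char)c; n--; }
    /* bulk fill, one 32-bit store per iteration */
    for (; n >= 4; n -= 4, p += 4)
        memcpy(p, &pat, 4);
    /* trailing bytes */
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}
```

The byte-wise head and tail keep the stores aligned, so the bulk loop moves four times as much data per store as the naive loop.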

>
> In terms of stats at present:
>   I have an L1 I$ and D$ (both 16K, direct-mapped)
>     Memcpy (L1): 250MB/s
>     Memset (L1): 320MB/s
>   L2 Cache is 128K, 2-way set-associative
>     Memcpy (L2): ~ 50MB/s
>     Memset (L2): ~ 90MB/s
>   RAM (DDR2, 50MHz):
>     Memcpy: ~ 9MB/s
>     Memset: ~ 17MB/s
>
> Memory access, if properly pipelined, is 1 cycle for an L1 hit.
> It is 2 or 3 cycles if an interlock stall occurs (eg: trying to use a
> value directly following a load).
>
> There is also a branch-predictor and similar, ...

My machine is similar (except for the caches).

>
> My compiler does have a few weaknesses:
> It isn't really able to use VLIW capabilities effectively, so most of
> what it produces is scalar code;
> It isn't super great at avoiding things like needless MOV instructions
> or Load/Store ops;
> It always creates stack frames, even for trivial leaf functions (could
> be changed but would add a lot of complexity to the C compiler, and
> would require the codegen to first prove that the function doesn't
> contain any hidden function calls or similar);
> Doesn't perform inlining, at all;
> A certain subset of operators are implemented effectively using
> "call-threading" (eg, rather than the compiler doing it itself, it spits
> out hidden calls into the C runtime, *1);
> ...
>
> *1: This generally happens for operators which don't exist natively in
> the ISA and which can't be implemented effectively within a short
> instruction sequence. Things like integer divide, modulo, large
> multiply, etc, generally fall into this category. Large arrays, VLAs,
> large struct variables, copying or returning a struct by value, ..., may
> also generate hidden runtime calls. Some vector ops also involve runtime
> calls, and some extensions (such as the __variant type, __float128, ...)
> are implemented almost entirely via runtime calls.
>
>
> As noted, its code generation tends to go through a stack model, which
> in turn uses temporary variables.
>
> So:
>   z=3*x+y;
> Might be compiled as (pseudocode):
>   PUSH 3
>   LOAD x
>   BINOP '*'
>   LOAD y
>   BINOP '+'
>   STORE z
> Or, in something closer to its original notation:
>   3 $x * $y + =z
>
> Which might become, effectively (C-like pseudocode):
>   _t0_1i = 3;
>   _t1_1i = x;
>   _t0_2i = _t0_1i * _t1_1i;
>   _t1_2i = y;
>   _t0_3i = _t0_2i + _t1_2i;
>   z = _t0_3i;
> But, then "optimized" back to:
>   _t0_2i = x * 3;
>   z = _t0_2i + y;
>
> But, not always. If the types don't match exactly, there might be
> left-over type-conversion ops, say:
>   _t0_1 = (int)3;
>   _t1_1 = (int)x;
>
> Or, say, the values are computed as "int" but the destination is "long":
>   _t0_3i = _t0_2i + _t1_2i;
>   _t0_4l = (long)_t0_3i;
>   z = _t0_4l;
>
> These cases may prevent the forwarding, but don't otherwise actually
> change the value. In this case, these result in the occasional needless
> MOV or EXTS.L instruction (and also increases register pressure).
>
>
> The codegen backend then does more or less a direct translation of this
> into machine-code instructions, with a register allocator and similar
> which maps variables temporarily onto CPU registers (in which case they
> are loaded on demand from memory, and written back to memory at the end
> of the current basic block). The register allocator may also evict (and
> write back) values for registers if it needs to access another variable
> and no unassigned registers are left.
>
> A variable may be statically assigned to a CPU register in which case no
> memory write-back occurs, and the same variable maps to the same
> register throughout the entire function. This only works for local
> variables within a certain range of primitive types, and up to a certain
> maximum number of variables in any given function. If the "register"
> keyword is used, it adds a fairly big weight to the variable being
> picked for this.
>
>
> For better or for worse, it also translates fairly directly from the 3AC
> IR to machine-code, though it "would be better" had it first gone into a
> sort of "high-level ASM" which was then emitted as machine code. This
> would give more room for things like reorganizing instructions or
> performing peephole optimization (the instruction shuffling and bundling
> done by the WEXifier is actually done on the machine-code after it had
> already been emitted, which is kinda, not exactly the most ideal way to
> do this).
>
>
> Similarly, one doesn't always know for certain when a temporary value
> goes out of scope, or whether its value may reappear later. In these cases the
> value may end up written back to memory even if it isn't going to be
> needed again (since a loss of efficiency due to needless stores is "less
> bad" than the program misbehaving because a needed value wasn't retained).
>
> ...
>
>
> I wouldn't exactly be surprised, though, that if GCC had a BJX2 backend, it
> could beat out BGBCC pretty solidly. However, as noted, BGBCC is being
> used partly because writing a new GCC backend looked like a pretty big
> project.

It _is_ a pretty big project (and the GCC code isn't always a joy to
work with either). If at some point you want to approach that project
you could have a look at the MRISC32 patches for inspiration (a handful
of commits on top of the upstream GCC Git repo):

https://github.com/mrisc32/gcc-mrisc32

It is not complete (e.g. some C++ features are missing, like
exceptions), but it can compile a lot of different things and generates
decent code.

A deficiency right now is that I have not implemented relaxation in
binutils, so I always get two-instruction sequences for calls/tail-calls
and PC-relative loads/stores, where usually a single instruction would
suffice.

>
>>> Though, can note that the benchmark does depend a bit on integer
>>> division and strcmp, neither of which are "particularly" fast in my
>>> case (there are not any specialized instructions for these cases).
>>>
>>>
>>>
>>> Then again, I had noted one time though, that when I tried to do a
>>> port of BGBCC to generate code for ARM32, its performance (relative
>>> to GCC or Clang) was pretty much atrocious...
>>>
>>> Though, its generated code isn't *that* awful, so it is unclear what
>>> the main factor is, apart from ARM's relative lack of register space
>>> meaning that the generated code consists mostly of LD/ST ops... (*)
>>>
>>> So, it is also possible that this could be a factor as well.
>>>
>>>
>>> *: I suspect a factor here is using registers for temporaries, where
>>> cases in which a variable's value goes through a temporary register
>>> rather than being used directly are not ideal for register
>>> pressure. Combined with a compiler which isn't really smart enough to
>>> realize when temporary values are no longer needed and can be
>>> discarded (so these intermediate values from temporaries tend to
>>> frequently end up being stored back to the stack frame in the off
>>> chance they are needed later, ...).
>>>
>>> With BJX2 having roughly 27 (generic/usable) GPRs, it is able to keep
>>> a lot more stuff in registers, vs ARM32 only having 11.
>>>
>>> But, there isn't really a good/easy way to fix some of this.
>>>
>
> Side note:
> My current backend design really does not deal well with not having a
> lot of spare registers available...
>
>
>>>
>>>>> And, it appears this is not entirely recent: these sorts of 2-wide
>>>>> superscalar cores seem to have been dominant in phones and consumer
>>>>> electronics for roughly the past 15-20 years or so.
>>>>
>>>> Not sure what you mean with dominant.  OoO cores have been used on
>>>> smartphones since the Cortex-A9, used in, e.g., the Apple A5 (2011).
>>>>
>>>> As for other consumer electronics: If you don't need much performance,
>>>> no need for an expensive OoO core.
>>>>
>>>
>>> Dominant, as-in, the vast majority are using 2-wide superscalar,
>>> rather than OoO cores. While OoO isn't exactly new, and presumably
>>> not that much more expensive (if it is competitive in terms of area,
>>> ...), only a relative minority of devices use it.
>>>
>>> It seems like in the late 90s, consumer electronics / phones / ...
>>> mostly went from single-issue cores to dual-issue, and then just sort
>>> of sat there...
>>>
>>
>


Re: Squeezing Those Bits: Concertina II

<08408032-5fac-41ba-b857-571da323b9fen@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17503&group=comp.arch#17503

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 07:32:06 -0700 (PDT)
In-Reply-To: <9GgvI.515493$J_5.262004@fx46.iad>
Message-ID: <08408032-5fac-41ba-b857-571da323b9fen@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
Lines: 7
 by: Quadibloc - Mon, 7 Jun 2021 14:32 UTC

On Sunday, June 6, 2021 at 9:48:24 PM UTC-6, EricP wrote:

> You are likely to be eaten by a grue.

I would be, if my efforts to design a new instruction set
were taking place within the Great Underground Empire.

John Savard

Re: Squeezing Those Bits: Concertina II

<70140cf3-9f7a-494c-b810-92105f9db4a8n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17504&group=comp.arch#17504

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:09:23 -0700 (PDT)
In-Reply-To: <s9kb78$39d$1@dont-email.me>
Message-ID: <70140cf3-9f7a-494c-b810-92105f9db4a8n@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 17:09 UTC

On Sunday, June 6, 2021 at 11:35:38 PM UTC-6, Ivan Godard wrote:
> On 6/6/2021 6:51 PM, Quadibloc wrote:

> > Also, most of the time, the VLIW as designed will indeed be useless. Real-world
> > code hardly _ever_ has an ILP of 8.

> True in open code, false in pipelined loops.

Which, of course, is exactly why I think it's worth having the VLIW capability
at all, as it is designed for that level of ILP.

John Savard

Re: Squeezing Those Bits: Concertina II

<f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17506&group=comp.arch#17506

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:22:44 -0700 (PDT)
In-Reply-To: <15ea2fc6-c4d6-4950-b0ef-505a204c7dc4n@googlegroups.com>
Message-ID: <f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Mon, 7 Jun 2021 17:22 UTC

On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:

> > What am I up to?

> {with only mild sarcasm::}
> It appears you are wandering around in the dark without a candle.

And why am I doing this, since I _have_ Computer Architecture: A
Quantitative Approach readily at hand? Surely that means I'm engaging
in _intentional_ perversity?

One way my design can be thought of is... as a *marketing-driven*
design.

You're used to programming an IBM 360 or one of its successors?

Well, just like a System/360, we have full base-index addressing in
only 32 bits, and memory-to-memory string and packed decimal
instructions in only 48 bits!

You're used to programming a Motorola 68000, or an x86, or
other recent microprocessors?

Our instructions have full 16-bit memory displacements!

You're used to programming a Cray I, or similar vector machine?

Our architecture includes a set of 64 vector registers, each containing
64 scalar values, offering operations similar to those of classic
Cray-style vector architectures.

You're used to programming a RISC architecture?

Our integer and floating-point register banks have 32 registers each.
Our register operate instructions include a bit to enable changing the
condition codes, so you can place instructions between an operate
instruction and a conditional branch based on its result, to reduce the
impact of this dependency.

You've made use of VLIW DSP processors, like the TI TMS320C6000?

We offer the ability to explicitly indicate when instructions can execute
in parallel, and which instructions depend on the results of which other
instructions.

So no matter what computer you had been working with before,
you'll find the features you were familiar with, and want to see in
the next computer you use here in our architecture!

John Savard

Re: Squeezing Those Bits: Concertina II

<0e09f9ec-5649-4051-899c-53bdf0e9247fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17508&group=comp.arch#17508

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 10:34:56 -0700 (PDT)
In-Reply-To: <f958da10-5f97-4cf4-92d7-f696f31ad24bn@googlegroups.com>
Message-ID: <0e09f9ec-5649-4051-899c-53bdf0e9247fn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: MitchAl...@aol.com (MitchAlsup)
Lines: 55
 by: MitchAlsup - Mon, 7 Jun 2021 17:34 UTC

On Monday, June 7, 2021 at 12:22:45 PM UTC-5, Quadibloc wrote:
> On Sunday, June 6, 2021 at 8:00:40 PM UTC-6, MitchAlsup wrote:
> > On Sunday, June 6, 2021 at 8:51:29 PM UTC-5, Quadibloc wrote:
>
> > > What am I up to?
>
> > {with only mild sarcasm::}
> > It appears you are wandering around in the dark without a candle.
> And why am I doing this, since I _have_ Computer Architecture: A
> Quantitative Approach readily at hand? Surely that means I'm engaging
> in _intentional_ perversity?
>
> One way my design can be thought of is... as a *marketing-driven*
> design.
>
> You're used to programming an IBM 360 or one of its successors?
>
> Well, just like a System/360, we have full base-index addressing in
> only 32 bits, and memory-to-memory string and packed decimal
> instructions in only 48 bits!
>
> You're used to programming a Motorola 68000, or an x86, or
> other recent microprocessors?
>
> Our instructions have full 16-bit memory displacements!
>
> You're used to programming a Cray I, or similar vector machine?
>
> Our architecture includes a set of 64 vector registers, each containing
> 64 scalar values, offering operations similar to those of classic
> Cray-style vector architectures.
>
> You're used to programming a RISC architecture?
>
> Our integer and floating-point register banks have 32 registers each.
> Our register operate instructions include a bit to enable changing the
> condition codes, so you can place instructions between an operate
> instruction and a conditional branch based on its result, to reduce the
> impact of this dependency.
>
> You've made use of VLIW DSP processors, like the TI TMS320C6000?
>
> We offer the ability to explicitly indicate when instructions can execute
> in parallel, and which instructions depend on the results of which other
> instructions.
>
> So no matter what computer you had been working with before,
> you'll find the features you were familiar with, and want to see in
> the next computer you use here in our architecture!
<
But the vast majority of computer users see the computer through a
high level language and its associated compiler and libraries.

Also note: My 66000 code density seems to be "on par" with x86-64.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<s9lna7$f6e$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17510&group=comp.arch#17510

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Squeezing Those Bits: Concertina II
Date: Mon, 7 Jun 2021 13:06:49 -0500
Organization: A noiseless patient Spider
Lines: 367
Message-ID: <s9lna7$f6e$1@dont-email.me>
In-Reply-To: <s9kmba$2b9$1@dont-email.me>
 by: BGB - Mon, 7 Jun 2021 18:06 UTC

On 6/7/2021 3:45 AM, Marcus wrote:
> On 2021-06-04, BGB wrote:
>> On 6/4/2021 2:10 PM, Marcus wrote:
>>> On 2021-06-04, BGB wrote:
>>>>
>
> [snip]
>
>>>>
>>>
>>> I would guess that part of what you're measuring is the compiler
>>> maturity level. On my in-order CPU that does not even have I$ nor D$
>>> (but a shared single cycle 32-bit BRAM bus), I got something like
>>> 0.7-0.8 DMIPS/MHz - but that's using GCC 11 and some hand-optimized
>>> C library functions (memcpy etc). Before optimizing the libc routines
>>> I got 0.5 DMIPS/MHz.
>>>
>>> With a proper I$ (that I'm currently working on) I expect to get closer
>>> to 1 DMIPS/MHz.
>>>
>
> BTW, my point here is that Dhrystone is very much a compiler and libc
> test, not just a CPU test. AFAICT it's perfectly OK to hand-optimize
> libc functions (memcpy, memset, strcmp, ...), and using MRISC32 vector
> operations in the hot libc functions helped to bump the Dhrystone score
> from ~0.5 to ~0.7 IIRC.
>

Yeah.

In my case, memcpy, memset, and strcmp are semi-optimized.

Memcpy and memset are transformed mostly into bigger operations, namely
mostly MOV.X operations (when properly aligned), or MOV.Q (otherwise).

The original code in the C library (*) mostly used loops which copied /
set data one byte at a time, which was not particularly good for
performance.

*: Mostly a fork of an old version of PDPCLIB, which was originally
written for some bizarre mix of IBM mainframes and an MS-DOS clone. Some
of the code in the C library was "not particularly high quality", but
not quite to the level to where "abandon all of it and start over"
seemed necessary. Had mostly been rewriting parts of it when stumbling
onto things that "particularly sucked".
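The rewrite described above (replacing byte-at-a-time loops with wider moves) looks roughly like this in portable C. This is a sketch only, not the actual BGBCC library code, which emits MOV.X/MOV.Q directly and handles alignment:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of widening a byte-copy loop: move 8 bytes per iteration (the
 * software analogue of emitting MOV.Q rather than MOV.B) and finish the
 * tail byte-wise.  memcpy() on the 8-byte chunks keeps the accesses
 * legal C; a compiler turns each one into a 64-bit load/store pair. */
void *memcpy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (; n >= 8; n -= 8, d += 8, s += 8)
        memcpy(d, s, 8);
    while (n--)
        *d++ = *s++;
    return dst;
}
```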

Strcmp is harder to optimize, mostly because one needs to
detect the presence of a zero byte and end the loop. I don't have a
dedicated instruction for this task (eg: "Set SR.T if QWord contains a
zero byte"), as it is fairly specialized.

Then again, such an instruction could help with strcpy/strcmp/strlen/...
so could maybe be justified.
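For reference, the zero-byte test that such an instruction would replace can be done in software with the well-known SWAR trick, at the cost of a few extra ALU operations per qword:

```c
#include <stdint.h>

/* Classic SWAR zero-byte detector: (v - 0x01..01) borrows into the high
 * bit of every byte that was 0x00; masking with ~v rejects bytes whose
 * own high bit was already set, so the result is nonzero iff v contains
 * a zero byte.  A "Set SR.T if QWord contains a zero byte" instruction
 * would collapse this into one operation. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ull)
            & ~v
            & 0x8080808080808080ull) != 0;
}
```

A strlen/strcpy-style loop can then scan eight bytes at a time and only drop to byte-wise scanning once the test fires.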

As noted, a few things I am doing on my CPU core seem to be beyond the
abilities of a vintage 486.

>>
>> In terms of stats at present:
>>    I have an L1 I$ and D$ (both 16K, direct-mapped)
>>      Memcpy (L1): 250MB/s
>>      Memset (L1): 320MB/s
>>    L2 Cache is 128K, 2-way set-associative
>>      Memcpy (L2): ~ 50MB/s
>>      Memset (L2): ~ 90MB/s
>>    RAM (DDR2, 50MHz):
>>      Memcpy: ~ 9MB/s
>>      Memset: ~ 17MB/s
>>
>> Memory access, if properly pipelined, is 1 cycle for an L1 hit.
>> It is 2 or 3 cycles if an interlock stall occurs (eg: trying to use a
>> value directly following a load).
>>
>> There is also a branch-predictor and similar, ...
>
> My machine is similar (except for the caches).
>

Caches are useful as the bare RAM isn't super fast.

Though, during this whole process, I do seem to lose a fair bit of the
potential bandwidth of the RAM due to overheads (seemingly mostly state
transitions within the L2 cache and similar). I was working on this, then
got distracted with other stuff.

I am mostly using 16-byte cache lines in this case for the ring-bus
(16-byte transfers, and sending the whole cache line in a single message).

It is possible to get some more speed here by using 32-byte cache lines
in the L2 cache (with 32B transfers to/from DRAM), but this is still
pretty buggy ATM.

For a single core, it is possible to boost the L1 caches up to 32K, which
improves performance slightly (say, going from a 95% hit rate to 98%).

At 2K or 4K, there are considerably more cache misses relative to 8K or
16K, so the smaller caches are a fair bit worse on this front.
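To put rough numbers on why those hit rates matter: average memory access time is hit_time + miss_rate × miss_penalty. A sketch (the 1-cycle hit time matches the figures above; the 20-cycle miss penalty is an assumed placeholder, not a measured value for this core):

```c
/* Average memory access time, in cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* With a 1-cycle hit and an assumed 20-cycle miss penalty:
 *   95% hits: amat(1.0, 0.05, 20.0) = 2.0 cycles
 *   98% hits: amat(1.0, 0.02, 20.0) = 1.4 cycles
 * so a 3-point hit-rate gain cuts the average access time by ~30%
 * under these assumptions. */
```

The exact penalty depends on whether the miss is served by L2 or DRAM, but the shape of the trade-off is the same.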

In the past, I had also experimented with large L1's but no L2, though
this tended to perform worse than L1+L2.

....

>>
>> My compiler does have a few weaknesses:
>> It isn't really able to use VLIW capabilities effectively, so most of
>> what it produces is scalar code;
>> It isn't super great at avoiding things like needless MOV instructions
>> or Load/Store ops;
>> It always creates stack frames, even for trivial leaf functions (could
>> be changed but would add a lot of complexity to the C compiler, and
>> would require the codegen to first prove that the function doesn't
>> contain any hidden function calls or similar);
>> Doesn't perform inlining, at all;
>> A certain subset of operators are implemented effectively using
>> "call-threading" (eg, rather than the compiler doing it itself, it
>> spits out hidden calls into the C runtime, *1);
>> ...
>>
>> *1: This generally happens for operators which don't exist natively in
>> the ISA and which can't be implemented effectively within a short
>> instruction sequence. Things like integer divide, modulo, large
>> multiply, etc, generally fall into this category. Large arrays, VLAs,
>> large struct variables, copying or returning a struct by value, ...,
>> may also generate hidden runtime calls. Some vector ops also involve
>> runtime calls, and some extensions (such as the __variant type,
>> __float128, ...) are implemented almost entirely via runtime calls.
>>
>>
>> As noted, its code generation tends to go through a stack model, which
>> in turn uses temporary variables.
>>
>> So:
>>    z=3*x+y;
>> Might be compiled as (pseudocode):
>>    PUSH 3
>>    LOAD x
>>    BINOP '*'
>>    LOAD y
>>    BINOP '+'
>>    STORE z
>> Or, in something closer to its original notation:
>>    3 $x * $y + =z
>>
>> Which might become, effectively (C-like pseudocode):
>>    _t0_1i = 3;
>>    _t1_1i = x;
>>    _t0_2i = _t0_1i * _t1_1i;
>>    _t1_2i = y;
>>    _t0_3i = _t0_2i + _t1_2i;
>>    z = _t0_3i;
>> But, then "optimized" back to:
>>    _t0_2i = x * 3;
>>    z = _t0_2i + y;
>>
>> But, not always. If the types don't match exactly, there might be
>> left-over type-conversion ops, say:
>>    _t0_1 = (int)3;
>>    _t1_1 = (int)x;
>>
>> Or, say, the values are computed as "int" but the destination is "long":
>>    _t0_3i = _t0_2i + _t1_2i;
>>    _t0_4l = (long)_t0_3i;
>>    z = _t0_4l;
>>
>> These cases may prevent the forwarding, but don't otherwise actually
>> change the value. In this case, these result in the occasional
>> needless MOV or EXTS.L instruction (and also increases register
>> pressure).
>>
>>
>> The codegen backend then does more or less a direct translation of
>> this into machine-code instructions, with a register allocator and
>> similar which maps variables temporarily onto CPU registers (in which
>> case they are loaded on demand from memory, and written back to memory
>> at the end of the current basic block). The register allocator may
>> also evict (and write back) values for registers if it needs to access
>> another variable and no unassigned registers are left.
>>
>> A variable may be statically assigned to a CPU register in which case
>> no memory write-back occurs, and the same variable maps to the same
>> register throughout the entire function. This only works for local
>> variables within a certain range of primitive types, and up to a
>> certain maximum number of variables in any given function. If the
>> "register" keyword is used, it adds a fairly big weight to the
>> variable being picked for this.
>>
>>
>> For better or for worse, it also translates fairly directly from the
>> 3AC IR to machine-code, though it "would be better" had it first gone
>> into a sort of "high-level ASM" which was then emitted as machine
>> code. This would give more room for things like reorganizing
>> instructions or performing peephole optimization (the instruction
>> shuffling and bundling done by the WEXifier is actually done on the
>> machine-code after it had already been emitted, which is kinda, not
>> exactly the most ideal way to do this).
>>
>>
>> Similarly, one doesn't always know for certain when a temporary value
>> goes out of scope, or whether its value may reappear later. In these cases the
>> value may end up written back to memory even if it isn't going to be
>> needed again (since a loss of efficiency due to needless stores is
>> "less bad" than the program misbehaving because a needed value wasn't
>> retained).
>>
>> ...
>>
>>
>> I wouldn't exactly be surprised though, if GCC had a BJX2 backend, if
>> it could beat out BGBCC pretty solidly. However, as noted, BGBCC is
>> being used partly because writing a new GCC backend looked like a
>> pretty big project.
>
> It _is_ a pretty big project (and the GCC code isn't always a joy to
> work with either). If at some point you want to approach that project
> you could have a look at the MRISC32 patches for inspiration (a handful
> of commits on top of the upstream GCC Git repo):
>
>   https://github.com/mrisc32/gcc-mrisc32
>
> It is not complete (e.g. some C++ features are missing, like
> exceptions), but it can compile a lot of different things and generates
> decent code.
>
> A deficiency right now is that I have not implemented relaxation in
> binutils, so I always get two-instruction sequences for calls/tail-calls
> and PC-relative loads/stores, where usually a single instruction would
> suffice.
>


Re: Squeezing Those Bits: Concertina II

<db879b20-33b6-4d2e-b775-86c4e8692181n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17512&group=comp.arch#17512
 by: MitchAlsup - Mon, 7 Jun 2021 19:07 UTC

On Monday, June 7, 2021 at 1:08:10 PM UTC-5, BGB wrote:
> On 6/7/2021 3:45 AM, Marcus wrote:
> > [snip]
> >
> [snip]
>
> Strcmp is harder to optimize mostly for the reason that one needs to
> detect the presence of a zero byte and end the loop. I don't have a
> dedicated instruction for this task (eg: "Set SR.T if QWord contains a
> zero byte"), as it is fairly specialized.
<
Strcmp is only hard if you attempt to run it wide with SIMD; it is
brain-dead easy to run it as wide as your cache port using virtual
vectors. Thus, SIMD is the problem, not C strings.
<
Secondarily, if instead of SET.?? you had a compare instruction that
delivered bit vectors of "just about anything you would want to know"
about the comparands, this would fall out for free. As far back
as the 88110, the compare instruction was augmented to contain
"any byte zero" and "any halfword zero" results. This is the added utility
of CMP delivering a bit vector rather than a single true or false.
<
But of course, you then need an efficient means to convert a bit in the
vector into a branch condition, or to convert a bit into a true or false.
The 88K and My 66000 both have these.
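A rough C model of a CMP that delivers a bit vector rather than a single boolean might look like the following (the flag names and bit positions here are hypothetical, not the actual 88110 or My 66000 encodings; the zero-byte tests use the standard SWAR trick):

```c
#include <stdint.h>

/* Hypothetical flag-bit assignments; real encodings differ. */
enum {
    CMP_EQ        = 1 << 0, /* a == b */
    CMP_LT        = 1 << 1, /* a < b (signed) */
    CMP_ANY_BYTE0 = 1 << 2, /* some byte of a is zero */
    CMP_ANY_HALF0 = 1 << 3, /* some halfword of a is zero */
};

/* Nonzero iff some 8-bit byte of x is zero. */
static int any_byte_zero(uint64_t x)
{
    return ((x - 0x0101010101010101ULL) &
            ~x & 0x8080808080808080ULL) != 0;
}

/* Nonzero iff some 16-bit halfword of x is zero. */
static int any_half_zero(uint64_t x)
{
    return ((x - 0x0001000100010001ULL) &
            ~x & 0x8000800080008000ULL) != 0;
}

/* A CMP that delivers a bit vector of results rather than a single
 * true/false; a conditional branch then just tests one bit. */
static uint32_t cmp_bits(int64_t a, int64_t b)
{
    uint32_t r = 0;
    if (a == b) r |= CMP_EQ;
    if (a < b)  r |= CMP_LT;
    if (any_byte_zero((uint64_t)a)) r |= CMP_ANY_BYTE0;
    if (any_half_zero((uint64_t)a)) r |= CMP_ANY_HALF0;
    return r;
}
```

A strcmp inner loop would then branch on CMP_ANY_BYTE0 to find the terminator without a separate zero-detection pass.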
>
> Then again, such an instruction could help with strcpy/strcmp/strlen/...
> so could maybe be justified.
<
I explicitly made the LOOP instruction in My 66000 deal with the loop
control of strncpy and strncmp, where there are both counted loop
terminations and data-conditional loop terminations.
<
Basically every leaf level subroutine in the str and mem libraries vectorizes
under VVM.

Re: Squeezing Those Bits: Concertina II

<19893fe5-a53f-4549-b9ef-4b95bab1bc04n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17514&group=comp.arch#17514
 by: Quadibloc - Mon, 7 Jun 2021 20:02 UTC

On Monday, June 7, 2021 at 11:34:58 AM UTC-6, MitchAlsup wrote:

> Also note: My 66000 code density seems to be "on par" with x86-64.

Oh, that's good. And, of course, comparing the lengths of _individual
instructions_, as I'm doing, is hardly the right way to assess code
density. If I *really* wanted to maximize code density, obviously I'd
use as the starting point the computer with the greatest code density
ever made, at least by some accounts: the PDP-8.

But a computer that does everything in memory would be very slow,
and no one would want to work with memory organized by 128-word
pages.

John Savard

Re: Squeezing Those Bits: Concertina II

<3612a40f-a638-43d2-b203-cf8d320d7eaan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17517&group=comp.arch#17517
 by: MitchAlsup - Mon, 7 Jun 2021 21:06 UTC

On Monday, June 7, 2021 at 3:02:48 PM UTC-5, Quadibloc wrote:
> On Monday, June 7, 2021 at 11:34:58 AM UTC-6, MitchAlsup wrote:
>
> > Also note: My 66000 code density seems to be "on par" with x86-64.
<
> Oh, that's good. And, of course, comparing the lengths of _individual
> instructions_, as I'm doing, is hardly the right way to assess code
> density. If I *really* wanted to maximize code density, obviously I'd
> use as the starting point the computer with the greatest code density
> ever made, at least by some accounts: the PDP-8.
<
You should only consider 64-bit machines, as the PDP-8 is at a serious
deficit: loading a value into a register requires clearing the register
and then adding memory to it (2 instructions), a 12×12 multiply is
24 instructions, ... and an indexed memory LD/ST is 3 instructions ...
<
I think PDP-11 had better code density.
>
> But a computer that does everything in memory would be very slow,
> and no one would want to work with memory organized by 128-word
> pages.
>
> John Savard

Re: Squeezing Those Bits: Concertina II

<c52dcbef-8079-4315-811c-b02589f23bf8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17518&group=comp.arch#17518
 by: Quadibloc - Mon, 7 Jun 2021 21:11 UTC

On Monday, June 7, 2021 at 3:06:29 PM UTC-6, MitchAlsup wrote:

> You should only consider 64-bit machines as the PDP-8 is at a serious
> deficit when you consider loading a value into a register requires clearing
> the registers and then adding memory to it: 2-instructions, 12×12 multiply
> is 24 instructions,......indexing memory LD/ST: 3 instructions......

That is, indeed, such a no-brainer that I had not failed to see that point.

So while the instructions might _look_ like PDP-8 instructions, they would still
perform 64-bit operations. However, it's still unworkable for many other reasons.

> I think PDP-11 had better code density.

I have a design for a 16-bit mode that looks a lot like the PDP-11. By shrinking
the mode bits to only two, as used in the 9900, but keeping only 8 registers, I
get an opcode field of six bits. Suddenly, I can include floating-point opcodes.

Only thing is, it breaks down if one wants to use more than 64K of memory with
it.

John Savard

Re: Squeezing Those Bits: Concertina II

<jwvv96pmj9v.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=17519&group=comp.arch#17519
 by: Stefan Monnier - Mon, 7 Jun 2021 21:56 UTC

> Only thing is, it breaks down if one wants to use more than 64K of memory with
> it.

That's OK. 64kB should be plenty for any reasonable use.

Stefan

Re: Squeezing Those Bits: Concertina II

<fddb3129-96e2-4c85-902a-46a6f14a9db8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17520&group=comp.arch#17520
 by: MitchAlsup - Mon, 7 Jun 2021 22:09 UTC

On Monday, June 7, 2021 at 4:56:38 PM UTC-5, Stefan Monnier wrote:
> > Only thing is, it breaks down if one wants to use more than 64K of memory with
> > it.
> That's OK. 64kB should be plenty for any reasonable use.
<
Why would anyone need more than 640KB ??
<
Well, we blew right through 4GB, and are within spitting distance of not needing
to run the swapper on many/most home computers !
>
>
> Stefan

Re: Squeezing Those Bits: Concertina II

<25438a23-3deb-4068-b9d6-ddabed54e4dbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17521&group=comp.arch#17521
 by: MitchAlsup - Mon, 7 Jun 2021 22:12 UTC

On Monday, June 7, 2021 at 4:11:45 PM UTC-5, Quadibloc wrote:
> On Monday, June 7, 2021 at 3:06:29 PM UTC-6, MitchAlsup wrote:
>
> > You should only consider 64-bit machines, as the PDP-8 is at a serious
> > deficit: loading a value into a register requires clearing the register
> > and then adding memory to it (2 instructions), a 12×12 multiply is
> > 24 instructions, ... and an indexed memory LD/ST is 3 instructions ...
> That is, indeed, such a no-brainer that I had not failed to see that point.
>
> So while the instructions might _look_ like PDP-8 instructions, they would still
> perform 64-bit operations. However, it's still unworkable for many other reasons.
> > I think PDP-11 had better code density.
> I have a design for a 16-bit mode that looks a lot like the PDP-11. By shrinking
> the mode bits to only two, as used in the 9900, but keeping only 8 registers, I
> get an opcode field of six bits. Suddenly, I can include floating-point opcodes.
>
> Only thing is, it breaks down if one wants to use more than 64K of memory with
> it.
<
Give a PDP-11 64-bit registers and you will be surprised at how good the
code density is, right up until you have to branch/call/jump farther than
±32KB. But I bet there is some kind of "accommodation" to get larger
constants that would work fairly well.
<
And then there is that "pipelining" problem.
>
> John Savard

Re: Squeezing Those Bits: Concertina II


https://www.novabbs.com/devel/article-flat.php?id=17522&group=comp.arch#17522

Newsgroups: comp.arch
Date: Mon, 7 Jun 2021 17:39:40 -0700 (PDT)
In-Reply-To: <25438a23-3deb-4068-b9d6-ddabed54e4dbn@googlegroups.com>
Message-ID: <5f9c04c7-0c4d-431c-8678-d112b7e1e51cn@googlegroups.com>
Subject: Re: Squeezing Those Bits: Concertina II
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Tue, 08 Jun 2021 00:39:41 +0000
 by: Quadibloc - Tue, 8 Jun 2021 00:39 UTC

On Monday, June 7, 2021 at 4:12:51 PM UTC-6, MitchAlsup wrote:
> On Monday, June 7, 2021 at 4:11:45 PM UTC-5, Quadibloc wrote:

> > Only thing is, it breaks down if one wants to use more than 64K of memory with
> > it.

> Give a PDP-11 64-bit registers and you will be surprised at how good the code density
> is, right up until you have to branch/call/jump farther than ±32KB. But I bet there
> is some kind of "accommodation" to get larger constants that would work fairly well.

Well, the solution to using a larger address space than 64K was sitting right in
front of me, as I'm already making use of it extensively in the Concertina II as it
is.

In addition to combining the PDP-11 (eight registers) with the TI 9900 (but only four
address modes), I just have to put the IBM System/360 Model 20 into the mix. So the
programmer's model includes eight base registers, which are mapped to the ones I
use in the regular instruction set.
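As a sketch of how base registers stretch a 16-bit offset beyond 64K (register widths and the address-formation details here are illustrative, not the actual Concertina II encoding):

```python
# Illustrative base+displacement address formation: a 16-bit displacement
# alone only spans 64K, but adding it to a wide base register reaches
# anywhere in a large address space.

def effective_address(base_regs, b, disp16):
    """b selects one of eight wide base registers; disp16 is an unsigned
    16-bit displacement taken from the short instruction format."""
    assert 0 <= b < 8 and 0 <= disp16 < (1 << 16)
    return (base_regs[b] + disp16) & ((1 << 64) - 1)

bases = [0] * 8
bases[3] = 0x0000_0001_0010_0000   # a segment placed well above 64K
print(hex(effective_address(bases, 3, 0x0042)))  # 0x100100042
```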

> And then there is that "pipelining" problem.

In my opinion, the cure for that is SMT. That is, yes, this particular dense code format
gives poor performance - if it's the only thing running on the machine. If it alternates with
other code, then even in the absence of OoO to fix things, there's no real problem.

So if you're happy to be one of sixteen concurrent threads, use the PDP-11-like format
for higher code density. If you want to hog the machine all to yourself, make efficient
use of it by using the VLIW block formats instead.
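The SMT argument can be illustrated with a toy barrel-processor model: if issue slots rotate round-robin among N threads, each thread's dependent instruction only stalls when its slot comes around before the result is ready, so a long result latency is hidden once N is large enough. (The formula and the latency figure below are my illustration, not a claim about any particular design.)

```python
# Toy barrel-processor model: round-robin issue from n_threads threads
# hides a result latency of `latency` cycles whenever n_threads >= latency.

def stall_cycles_per_instr(n_threads, latency):
    """Extra cycles a thread waits beyond its natural issue slot for a
    dependent result. A thread issues every n_threads cycles; the result
    arrives after `latency` cycles."""
    return max(0, latency - n_threads)

print(stall_cycles_per_instr(1, 10))   # 9: a single thread eats the latency
print(stall_cycles_per_instr(16, 10))  # 0: sixteen threads hide it entirely
```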

John Savard
