devel / comp.arch / Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Subject  Author
* Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
+* Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
|`* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| +* Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |+- Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |+* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| ||`- Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |`* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| | `* Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |  +* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  |`* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | +* Re: Encoding 20 and 40 bit instructions in 128 bits  JimBrakefield
| |  | |+* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | ||`- Re: Encoding 20 and 40 bit instructions in 128 bits  JimBrakefield
| |  | |`* Re: Encoding 20 and 40 bit instructions in 128 bits  JimBrakefield
| |  | | +* Re: Encoding 20 and 40 bit instructions in 128 bits  EricP
| |  | | |+* Re: Encoding 20 and 40 bit instructions in 128 bits  JimBrakefield
| |  | | ||+* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |||`- Re: Encoding 20 and 40 bit instructions in 128 bits  EricP
| |  | | ||`- Re: Encoding 20 and 40 bit instructions in 128 bits  EricP
| |  | | |`* Re: Encoding 20 and 40 bit instructions in 128 bits  EricP
| |  | | | `* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |  `* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |   `* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    +* Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    |`* Re: Encoding 20 and 40 bit instructions in 128 bits  Brett
| |  | | |    | `* Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    |  `* Re: Encoding 20 and 40 bit instructions in 128 bits  Brett
| |  | | |    |   +* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |`* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   | `* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |  `* Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bits  Stefan Monnier
| |  | | |    |   |   |`- Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |`* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |   | `* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bits  Stefan Monnier
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bits  Bernd Linsel
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bits  Anton Ertl
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bits  Brian G. Lucas
| |  | | |    |   |   |  |`- Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bits  Anton Ertl
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |   |  | `- Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bits  EricP
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    |   |   |  | `* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |  |  `* Re: Encoding 20 and 40 bit instructions in 128 bits  Ivan Godard
| |  | | |    |   |   |  |   `* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |   |  |    `* Re: Encoding 20 and 40 bit instructions in 128 bits  Ivan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | | |    |   |   |  |     |`* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |   |  |     | +- Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |  | | |    |   |   |  |     | `- Re: Encoding 20 and 40 bit instructions in 128 bits  Ivan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bits  Stefan Monnier
| |  | | |    |   |   |  |     |`- Re: Encoding 20 and 40 bit instructions in 128 bits  Ivan Godard
| |  | | |    |   |   |  |     +* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  John Levine
| |  | | |    |   |   |  |     |+* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Stefan Monnier
| |  | | |    |   |   |  |     |||+- Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| +* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Stefan Monnier
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bit  BGB
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| ||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| || `* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||| ||  +- Re: instruction set binding time, was Encoding 20 and 40 bit  John Levine
| |  | | |    |   |   |  |     ||| ||  `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| ||   `* Re: instruction set binding time, was Encoding 20 and 40 bit  Terje Mathisen
| |  | | |    |   |   |  |     ||| ||    `* Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     ||| ||     +* Re: instruction set binding time, was Encoding 20 and 40 bit  BGB
| |  | | |    |   |   |  |     ||| ||     |`- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     ||| ||     `- Re: instruction set binding time, was Encoding 20 and 40 bit  Terje Mathisen
| |  | | |    |   |   |  |     ||| |`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| | +* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| | |+* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||| | ||`* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| | || `* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||| | ||  `* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| | ||   `* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||| | ||    +- Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     ||| | ||    `- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     ||| | |+- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| | |`- Re: instruction set binding time, was Encoding 20 and 40 bit  John Levine
| |  | | |    |   |   |  |     ||| | `* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     ||| |  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||| `* Re: instruction set binding time, was Encoding 20 and 40 bit  Quadibloc
| |  | | |    |   |   |  |     |||  +* Re: instruction set binding time, was Encoding 20 and 40 bit  BGB
| |  | | |    |   |   |  |     |||  |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bit  Scott Smader
| |  | | |    |   |   |  |     |||  |||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Stefan Monnier
| |  | | |    |   |   |  |     |||  ||||`* Re: instruction set binding time, was Encoding 20 and 40 bit  Scott Smader
| |  | | |    |   |   |  |     |||  |||| +* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     |||  |||| |+- Re: instruction set binding time, was Encoding 20 and 40 bit  Anton Ertl
| |  | | |    |   |   |  |     |||  |||| |`* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     |||  |||| | +- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     |||  |||| | +* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     |||  |||| | `* Re: instruction set binding time, was Encoding 20 and 40 bit  Anton Ertl
| |  | | |    |   |   |  |     |||  |||| +- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  James Van Buskirk
| |  | | |    |   |   |  |     |||  |||| `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     |||  |||+* Statically scheduled plus run ahead.  Brett
| |  | | |    |   |   |  |     |||  |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bit  BGB
| |  | | |    |   |   |  |     |||  ||+- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     |||  ||`* Re: instruction set binding time, was Encoding 20 and 40 bit  Thomas Koenig
| |  | | |    |   |   |  |     |||  |`* Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     |||  +- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     |||  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128  Anton Ertl
| |  | | |    |   |   |  |     ||`* Re: instruction set binding time, was Encoding 20 and 40 bit  Ivan Godard
| |  | | |    |   |   |  |     |+- Re: instruction set binding time, was Encoding 20 and 40 bit  MitchAlsup
| |  | | |    |   |   |  |     |`* Re: instruction set binding time, was Encoding 20 and 40 bit  Stephen Fuld
| |  | | |    |   |   |  |     `* Re: Encoding 20 and 40 bit instructions in 128 bits  Anton Ertl
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |   |  `- Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| |  | | |    |   |   `- Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    |   `- Re: Encoding 20 and 40 bit instructions in 128 bits  BGB
| |  | | |    `* Re: Encoding 20 and 40 bit instructions in 128 bits  Stephen Fuld
| |  | | `- Re: Encoding 20 and 40 bit instructions in 128 bits  Thomas Koenig
| |  | `* Re: Encoding 20 and 40 bit instructions in 128 bits  Quadibloc
| |  `- Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
| `- Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
+- Re: Encoding 20 and 40 bit instructions in 128 bits  Ivan Godard
+* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
+* Re: Encoding 20 and 40 bit instructions in 128 bits  MitchAlsup
`- Re: Encoding 20 and 40 bit instructions in 128 bits  Paul A. Clayton

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<f4c105cf-2116-4c94-980f-c178a22da76dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23578&group=comp.arch#23578

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:32:59 -0800 (PST)
In-Reply-To: <suigkc$nkv$1@dont-email.me>
Message-ID: <f4c105cf-2116-4c94-980f-c178a22da76dn@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 18:32 UTC

On Wednesday, February 16, 2022 at 3:37:20 AM UTC-6, BGB wrote:
> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
> > Thomas Koenig wrote:
> >> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>> And because compiler
> >>> branch prediction (~10% miss rate)
> >>
> >> That seems optimistic.
> >>
> >>> is much worse than dynamic branch
> >>> prediction (~1% miss rate, both numbers vary strongly with the
> >>> application, so take them with a grain of salt),
> >>
> >> What is the branch miss rate on a binary search, or a sort?
> >> Should be close to 50%, correct?
> >>
> > Or closer to zero, if you implement your qsort with predicated
> > left/right pointer updates and data swaps.
> >
> > Same for a binary search, the left/right boundary updates can be
> > predicated allowing you to blindly run log2(N) iterations and pick the
> > remaining item. I.e. change it from a code state machine to a data state
> > machine because dependent loads can run at 2-3 cycles/iteration while
> > branch misses cost you 5-20 cycles.
> >
> Yeah, if the ISA does predicated instructions, and the compiler uses
> them, pretty much all of the internal "if()" branches in a typical
> sorting function can be expressed branch-free.
>
Brian's compiler pretty much predicates any if-then-else where the count
of instructions in the then and else clauses sums to less than 8.
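As an illustration (generic C of my own, not output of Brian's compiler), here is the kind of short if-then-else such a heuristic would if-convert, and its predicated, branch-free equivalent:

```c
/* Original form: two one-instruction arms, a natural candidate for
 * if-conversion under a "then + else < 8 instructions" heuristic. */
int clamp_step(int x, int limit)
{
    if (x > limit)
        x = limit;      /* then-arm */
    else
        x = x + 1;      /* else-arm */
    return x;
}

/* If-converted form, expressed at the C level: one predicate selects
 * between the two results, so no conditional branch remains. */
int clamp_step_pred(int x, int limit)
{
    int p = (x > limit);
    return p ? limit : x + 1;
}
```

Both functions compute the same result; the second trades a hard-to-predict branch for straight-line predicated work.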
>
> One can potentially modulo schedule them as well, though this requires
> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
> for predicated ops using SR.S is an optional feature, though 3 or 4 bits
> could be better).
>
> Did also recently add another hack, where now stuff like:
> ADD?T R44, R53, R19 | ADD?F R37, 123, R13
> As well as ST/SF predicates:
> ADD?ST R44, R53, R19 | ADD?SF R37, 123, R13
> ...
>
>
> Can (in theory) be encoded using a 96-bit bundle with an Op64 prefix
> being split in half between two logical instructions. This is kind of an
> ugly hack, but alas.
>
> One downside is that the encoding scheme does not currently allow for, say:
> ADD?T R44, R53, R19 | MOV.W (R37, R6*8, 6), R13
>
> As well, as it also being effectively limited to encoding 2-wide bundles.
>
> But, alas...
>
>
> > Terje
> >
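The predicated binary search Terje describes can be sketched in generic C (my illustration, not code from the thread): the boundary update is a conditional pointer advance rather than a taken/not-taken branch, and the loop runs a trip count that depends only on n, leaving only the perfectly predictable loop-back branch.

```c
#include <stddef.h>

/* Branch-free lower bound over a sorted array: returns the index of the
 * first element >= key. The base pointer is conditionally advanced each
 * step; compilers typically lower the if-body to a conditional move, so
 * no data-dependent branch is executed. */
size_t branchfree_lower_bound(const int *a, size_t n, int key)
{
    const int *base = a;
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;
        if (base[half - 1] < key)   /* data-dependent, but a cmov, not a branch */
            base += half;
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```

The cost model is as the post says: a chain of dependent loads at a few cycles per iteration, instead of roughly 50%-mispredicted branches at 5-20 cycles per miss.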

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<dd90c21d-9f28-4235-a490-81ed5672d526n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23579&group=comp.arch#23579

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:33:45 -0800 (PST)
In-Reply-To: <suijiu$giq$1@dont-email.me>
Message-ID: <dd90c21d-9f28-4235-a490-81ed5672d526n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 18:33 UTC

On Wednesday, February 16, 2022 at 4:27:46 AM UTC-6, Ivan Godard wrote:
> On 2/16/2022 1:37 AM, BGB wrote:
> > On 2/16/2022 2:48 AM, Terje Mathisen wrote:
> >> Thomas Koenig wrote:
> >>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>>> And because compiler
> >>>> branch prediction (~10% miss rate)
> >>>
> >>> That seems optimistic.
> >>>
> >>>> is much worse than dynamic branch
> >>>> prediction (~1% miss rate, both numbers vary strongly with the
> >>>> application, so take them with a grain of salt),
> >>>
> >>> What is the branch miss rate on a binary search, or a sort?
> >>> Should be close to 50%, correct?
> >>>
> >> Or closer to zero, if you implement your qsort with predicated
> >> left/right pointer updates and data swaps.
> >>
> >> Same for a binary search, the left/right boundary updates can be
> >> predicated allowing you to blindly run log2(N) iterations and pick the
> >> remaining item. I.e. change it from a code state machine to a data
> >> state machine because dependent loads can run at 2-3 cycles/iteration
> >> while branch misses cost you 5-20 cycles.
> >>
> >
> > Yeah, if the ISA does predicated instructions, and the compiler uses
> > them, pretty much all of the internal "if()" branches in a typical
> > sorting function can be expressed branch-free.
> >
> >
> > One can potentially modulo schedule them as well, though this requires
> > an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
> > for predicated ops using SR.S is an optional feature, though 3 or 4 bits
> > could be better).
> Or just one predicate bit and a belt.
<
Or cast a shadow over several subsequent instructions.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<9021fd8d-91d0-4c81-8786-9ee0b6bfe3bcn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23580&group=comp.arch#23580

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:36:16 -0800 (PST)
In-Reply-To: <vG8PJ.15176$GjY3.1981@fx01.iad>
Message-ID: <9021fd8d-91d0-4c81-8786-9ee0b6bfe3bcn@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 18:36 UTC

On Wednesday, February 16, 2022 at 9:24:15 AM UTC-6, EricP wrote:
> BGB wrote:

> >
> > One can potentially modulo schedule them as well, though this requires
> > an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
> > for predicated ops using SR.S is an optional feature, though 3 or 4 bits
> > could be better).
> Mitch's My66K predicate value state flags are implied by the predicate
> instruction shadow and tracked internally, not architectural ISA
> registers like Itanium. To have multiple predicates in flight at once
> just use multiple PRED instructions.
<
And the only important use of multiple PREDs is to cover && and || semantics.
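What covering && and || with predication buys can be sketched at the C level (my illustration; actual PRED instruction syntax is not shown): the short-circuit operators, which would otherwise each need a branch, collapse into a single combined predicate guarding the update.

```c
/* Branch-free form of: if (a && b) x = y;
 * Bitwise & on normalized booleans builds one combined predicate with
 * no short-circuit branch; the update then executes under it. */
int pred_and_assign(int a, int b, int x, int y)
{
    int p = (a != 0) & (b != 0);   /* predicate for a && b */
    return p ? y : x;              /* predicated update */
}

/* Branch-free form of: if (a || b) x = y; */
int pred_or_assign(int a, int b, int x, int y)
{
    int p = (a != 0) | (b != 0);   /* predicate for a || b */
    return p ? y : x;
}
```

Note this only preserves C semantics when evaluating `b` has no side effects, which is exactly the case if-conversion targets.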

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sujgdl$rac$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23581&group=comp.arch#23581

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 16 Feb 2022 12:39:48 -0600
Message-ID: <sujgdl$rac$1@dont-email.me>
In-Reply-To: <suijiu$giq$1@dont-email.me>
 by: BGB - Wed, 16 Feb 2022 18:39 UTC

On 2/16/2022 4:27 AM, Ivan Godard wrote:
> On 2/16/2022 1:37 AM, BGB wrote:
>> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>>> Thomas Koenig wrote:
>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>> And because compiler
>>>>> branch prediction (~10% miss rate)
>>>>
>>>> That seems optimistic.
>>>>
>>>>> is much worse than dynamic branch
>>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>>> application, so take them with a grain of salt),
>>>>
>>>> What is the branch miss rate on a binary search, or a sort?
>>>> Should be close to 50%, correct?
>>>>
>>> Or closer to zero, if you implement your qsort with predicated
>>> left/right pointer updates and data swaps.
>>>
>>> Same for a binary search, the left/right boundary updates can be
>>> predicated allowing you to blindly run log2(N) iterations and pick
>>> the remaining item. I.e. change it from a code state machine to a
>>> data state machine because dependent loads can run at 2-3
>>> cycles/iteration while branch misses cost you 5-20 cycles.
>>>
>>
>> Yeah, if the ISA does predicated instructions, and the compiler uses
>> them, pretty much all of the internal "if()" branches in a typical
>> sorting function can be expressed branch-free.
>>
>>
>> One can potentially modulo schedule them as well, though this requires
>> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where
>> support for predicated ops using SR.S is an optional feature, though 3
>> or 4 bits could be better).
>
> Or just one predicate bit and a belt.
>

One issue with a single bit in this case is that, when wrapping around
the end of a loop, the loop conditional (if done via CMPxx+BT/BF) would
stomp the predicate bit.

Though, this could be sidestepped by using BEQ or BGT or similar, which
don't require stomping the SR.T bit.

Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)

<5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23582&group=comp.arch#23582

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:46:27 -0800 (PST)
In-Reply-To: <jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
Message-ID: <5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com>
Subject: Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 18:46 UTC

On Wednesday, February 16, 2022 at 12:24:43 PM UTC-6, Stefan Monnier wrote:
> > B. Your conclusion is correct based on assuming that single-thread
> > performance is all that matters, ignoring power and die area. What if the
> > A73 design had been limited to the number of transistors (or more
> > importantly die area since bigger transistors could give an illusory
> > performance improvement if just the quantity is constant) in an A53?
<
> So, maybe the better question is: what kind of future process
> constraints could bring the trade-offs back in favor of
> in-order designs?
<
A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
as much power performing only 2× faster.
<
If software could figure out how to use 12 cores, it could get 6× as much work done
at the same CPU power. But because calculation is rather inexpensive and memory
access rather expensive, the gain might only be 3× if you got limited by power.
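The arithmetic behind those figures, taking Mitch's rough estimates (12× area/power for 2× single-thread speed) as inputs rather than as measured data:

```c
/* Back-of-the-envelope model: a GBOoO core costs `big_power` times the
 * power of a little in-order core and delivers `big_perf` times its
 * single-thread performance. At equal total power you can run
 * `big_power` little cores; with perfectly parallel software their
 * combined throughput relative to the one big core is the ratio below. */
double equal_power_speedup(double big_power, double big_perf)
{
    double n_little = big_power / 1.0;          /* little cores per big-core power budget */
    double little_throughput = n_little * 1.0;  /* ideal linear scaling */
    return little_throughput / big_perf;        /* e.g. 12 / 2 = 6x */
}
```

Halving the ideal scaling for memory-bound work gives the 3× figure quoted in the post.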
<
Right now, due to SW's inability to consume multiple CPUs, you don't have the
option of choosing 6× or even 3×.
<
This is not so much a HW problem; to a large extent SW gets the blame.
But SW is not as much to blame as the conversation indicates.
<
What is missing is a von Neumann model of parallelism where HW provides a
few instructions to implement this new model and, because of its simplicity,
elegance, and functionality, SW can easily find ways to utilize that new model.
<
Right now HW does not know what to build and SW does not know exactly
what to ask for. HW stumbles around adding TS, CAS, DCAW, LL-SC, ad
infinitum. SW tries to use each new feature and finds it very difficult to use
in practice. This CLEARLY shows that what HW is trying to supply is not what
SW wants to consume; yet SW does not know, at an intimate level, what to
ask HW to build, so the circle continues.
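As a concrete taste of what SW is asked to consume, here is a minimal model of the CAS retry idiom. This is a sketch, not real hardware: the lock merely stands in for the hardware's atomicity guarantee, and real uses face subtleties (ABA, livelock under contention) that this toy hides.

```python
import threading

class AtomicCell:
    """A model of a hardware word supporting compare-and-swap (CAS)."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Atomically: if value == expected, store new and report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    # The classic CAS retry loop: read, compute, retry if someone raced us.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return

cell = AtomicCell(0)
threads = [threading.Thread(target=lambda: [increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())  # 4000
```

Even this simplest use already demands the read-modify-retry discipline; composing several CAS operations correctly is where SW typically finds the primitive "very difficult to use in practice".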
>
>
> Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb16.190159@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23583&group=comp.arch#23583

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 16 Feb 2022 18:01:59 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 107
Message-ID: <2022Feb16.190159@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me> <2022Feb15.104639@mips.complang.tuwien.ac.at> <sug7bd$b9l$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="7f3a0ccd01852a267d8735337f79dd12";
logging-data="5401"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1931030hdSC4EeHU4pW3J55"
Cancel-Lock: sha1:1bUCKgOWqhOZQ0zvxMD535Yz+gc=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Wed, 16 Feb 2022 18:01 UTC

Ivan Godard <ivan@millcomputing.com> writes:
>On 2/15/2022 1:46 AM, Anton Ertl wrote:
>> Ivan Godard <ivan@millcomputing.com> writes:
[...]
>I don't know of any JIT that does PGO or LTO or other analysis-heavy
>optimization. Perhaps you could enlighten me.

Profile-guided optimization (PGO): Have you ever heard of HotSpot? I
think it was an early one, but the general approach of performing
heavier (re)compilation for frequently-executed code has been used by
many others, with some working on method granularity, while others
work on "traces" (actually superblocks), pioneered by Dynamo.
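The HotSpot-style idea — count executions, then recompile hot code more aggressively — can be sketched as below. This is a toy: the `optimize` step here is a hypothetical stand-in (simple memoization), not HotSpot's actual recompilation machinery.

```python
def make_tiered(baseline, optimize, threshold=100):
    """Wrap `baseline` so that after `threshold` calls it is replaced by
    optimize(baseline) -- the hot-spot recompilation idea in miniature."""
    state = {"count": 0, "impl": baseline}

    def call(*args):
        state["count"] += 1
        if state["count"] == threshold:
            state["impl"] = optimize(state["impl"])  # "recompile" hot code
        return state["impl"](*args)
    return call

# Hypothetical "optimizer": just memoizes the function.
def memoize(f):
    cache = {}
    def g(n):
        if n not in cache:
            cache[n] = f(n)
        return cache[n]
    return g

square = make_tiered(lambda n: n * n, memoize, threshold=3)
print([square(i) for i in range(5)])  # [0, 1, 4, 9, 16]
```

The first `threshold - 1` calls run the cheap baseline; only code proven hot pays for the heavier compilation, which is why the approach suits JITs.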

Link-time optimization (LTO): That's a natural for JITs because JIT
compilers work at run-time. The only case where "linking" can happen
after the JIT compiler runs is if (in a JVM) a class file is loaded
dynamically. This is indeed a problem in the JVM context, but it can
be worked around with recompilation after the class file is loaded
(and recompilation is natural for JITs like HotSpot).

>Once you ignore the target-independent stuff, the only difference
>between a OOO target and an IO target is instruction scheduling, which
>is a small part of the cost of a back end for either kind of target.

For an ahead-of-time compiler, typically yes. For a JIT compiler, not
necessarily; these compilers (especially the first stage that is used
for everything, not just for hot spots) tend to be designed for
compilation speed, and instruction scheduling can cost significantly
in this context.

>It's a common misapprehension that OOO targets don't need to be
>scheduled. They do. A simplistic scheduler needs to schedule producers
>before consumers, and so must track dependency relations regardless of
>target.

Nonsense. The schedule is typically already there in the source,
which tends to be in some imperative form, so the compiler gets a
correct schedule by sticking to source order. In particular for
source code like JVM code, the order is given absolutely. If instead
the source code has some infix expression syntax, there may be some
freedom to the compiler, but you typically can just deal with it in a
left-to-right way, no complex dependency tracking needed.
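The point that stack-oriented source like JVM code already carries a correct schedule can be shown with a toy translator (hypothetical, not taken from any real JIT): walking the instructions in source order and allocating a fresh register per result automatically places every producer before its consumers.

```python
def compile_stack_code(ops):
    """Translate JVM-style stack code to register code by walking the
    instructions in source order -- no dependency analysis needed."""
    stack, code, next_reg = [], [], 0
    for op in ops:
        if isinstance(op, int):          # push constant
            r = f"r{next_reg}"; next_reg += 1
            code.append(f"{r} = {op}")
            stack.append(r)
        else:                            # binary operator
            b, a = stack.pop(), stack.pop()
            r = f"r{next_reg}"; next_reg += 1
            code.append(f"{r} = {a} {op} {b}")
            stack.append(r)
    return code

# (2 + 3) * 4 in stack form:
for line in compile_stack_code([2, 3, "+", 4, "*"]):
    print(line)
```

The emitted order is exactly the source order, and it is already a legal schedule; an OoO machine will find whatever parallelism exists in it at run time.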

>A more sophisticated scheduler tracks machine resources such as
>registers, and reorders instructions (statically) to lower resource
>pressure.

I have read some papers about this, and written about it, too, but
there has not been a particularly promising approach, and I am not
aware of a production compiler doing anything in that respect; just
generating the code without such considerations seems to do ok in most
cases. In any case, such a high-cost low-payoff optimization is
exactly what you don't do in a JIT compiler.

>One of our talks was a live demo of the Mill compiler tech. On the spot
>we created a new instruction, defined a new member by copying the spec
>of an existing member and adding the spec for the new instruction,
>rebuilt the tool chain using the new member as a target, wrote an asm
>program that used the new instruction, compiled it, and ran it single
>stepping in the debugger to show the execution of the new instruction,
>correctly scheduled in parallel with the surrounding instructions. All
>that in an hour, including slides, talk, and questions.

Sounds like a cool demo of the flexibility and the integration of
your hardware-software-codesign technology, but I don't think it will
help much with the maintainers of some JIT compiler who have an
established code base and whom you want to convince to add a Mill
target to their JIT.

>>> But for quick and dirty, i.e. JIT, code you can just put one instruction
>>> in each bundle and be done with it, and ignore the fact that you are
>>> wasting all that parallelism. The result will be no worse than code for
>>> any other one-instruction-at-a-time code.
>>
>> Except that Mill will actually execute this code slowly, while even
>> the in-order Pentium or Cortex-A53 can execute 2 independent
>> instructions in parallel (but not all pairs are independent), and, as
>> mentioned, on OoO implementations we see good IPC even from very
>> simple JITs.
>
>You and your straw men. If you want to compare with a two-wide x86 then
>you should compare with a two-wide Mill. Actually, there is no such
>thing as a two wide Mill - the minimal width is four - so I suppose that
>you should compare against code that has been deliberately restricted to
>two instructions per bundle and ignore the rest of the width.

Above you wrote "put one instruction in each bundle", so it's your own
straw man (and obviously you attack it, as is normal for a straw-man
argument).

In case of ARM A64 or AMD64, I can generate instructions naively, and
a two-wide in-order implementation will recognize by itself when they
can be executed in parallel, and when they cannot. Will the Mill do
the same if I put one instruction in each bundle?
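The pairing check such a two-wide in-order front end performs can be modeled roughly as follows. This is a deliberate simplification: it checks only RAW/WAW hazards between the candidate pair and ignores structural hazards, WAR, and everything else a real issue stage handles.

```python
def can_dual_issue(i1, i2):
    """Model of a 2-wide in-order pair check: issue i2 alongside i1 only
    if i2 neither reads nor writes anything i1 writes."""
    return not (i1["writes"] & (i2["reads"] | i2["writes"]))

add = {"op": "add", "writes": {"r1"}, "reads": {"r2", "r3"}}
mul = {"op": "mul", "writes": {"r4"}, "reads": {"r5", "r6"}}
use = {"op": "sub", "writes": {"r7"}, "reads": {"r1", "r4"}}

print(can_dual_issue(add, mul))  # True  -- independent, issue together
print(can_dual_issue(add, use))  # False -- `use` needs add's result
```

The hardware does this comparison on every adjacent pair, which is why naively generated code still gets dual-issued whenever neighbors happen to be independent.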

>> That's beside the point. If a virtual machine uses certain pages,
>> in an SAS system these pages need to be free on the target system when
>> you migrate the virtual machine.
>
>You are unacquainted with ASIDs used to isolate VMs?

Yes.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sujhj3$its$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23584&group=comp.arch#23584

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Wed, 16 Feb 2022 12:59:46 -0600
Organization: A noiseless patient Spider
Lines: 57
Message-ID: <sujhj3$its$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me>
<jwv7d9xh0dt.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.114353@mips.complang.tuwien.ac.at>
<sug2j8$bid$1@newsreader4.netcologne.de>
<2022Feb15.191558@mips.complang.tuwien.ac.at> <suiblj$euv$1@gioia.aioe.org>
<f4e11bbb-7a4a-4038-89a3-08d5e5979b0dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 18:59:47 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c2fca33c8e4c63ea697ef74965f89bcd";
logging-data="19388"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lbGjmDpbn3hEh2rlbRdlH"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:UdFNFTIOlTIYzbJ1b11+B4SBDlo=
In-Reply-To: <f4e11bbb-7a4a-4038-89a3-08d5e5979b0dn@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 16 Feb 2022 18:59 UTC

On 2/16/2022 12:28 PM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 2:12:38 AM UTC-6, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>
>>>>> * Before people make assumptions about what I mean with "software
>>>>> crisis": When the software cost is higher than the hardware cost,
>>>>> the software crisis reigns. This has been the case for much of the
>>>>> software for several decades.
>>>>
>>>> Two reasonable dates for that: 1957 (the first Fortran compiler) or
>>>> 1964, when the /360 demonstrated for all to see that software (especially
>>>> compatibility) was more important than any particular hardware.
>>>
>>> Yes, the term "software crisis" is from 1968, but programming language
>>> implementations demonstrated already before Fortran (and actually more
>>> so, because pre-Fortran languages had less implementation investment
>>> and did not utilize the hardware as well) that there is a world where
>>> software cost is more relevant than hardware cost. But of course at
>>
>> This is a very big world actually, i.e. see all the Python code running
>> on huge cloud instances even though they require 10-100 times as much
>> resources as the same algorithms implemented in Rust. The main exception
>> is the usual one, i.e. where the Python code (or some other scripting
>> language) is just a thin glue layer tying together low-level libraries
>> written in C(++), like in numpy.
> <
> This sounds exactly like the cost of medicine problem we have over here.
> <
> to whit::
> <
> Patient does not see what the medical facility billed insurance company.
> Patient only sees monthly insurance bill.
> <
> So, why should patient care if it cost insurance company $10, or $10,000 ?
> Patient only sees monthly insurance bill.

Except when one gets the "bare minimum to satisfy the health-insurance
mandate" insurance, which does not cover the whole "actually going to
the hospital" thing; then one does need to go for some reason, and it
costs $k ...

And there is also the practice of "perma-temps" (keeping employees
permanently as temp workers), where the company does not provide health
insurance or similar, which partly leads to the above situation.

....

>>
>> Terje
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit

<ETbPJ.37567$Y1A7.24920@fx43.iad>


https://www.novabbs.com/devel/article-flat.php?id=23585&group=comp.arch#23585

Path: i2pn2.org!i2pn.org!aioe.org!feeder5.feed.usenet.farm!feeder1.feed.usenet.farm!feed.usenet.farm!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
References: <suj84u$cjv$1@dont-email.me> <memo.20220216170931.7708S@jgd.cix.co.uk> <sujf50$503$1@dont-email.me>
In-Reply-To: <sujf50$503$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 43
Message-ID: <ETbPJ.37567$Y1A7.24920@fx43.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 16 Feb 2022 19:03:00 UTC
Date: Wed, 16 Feb 2022 14:02:27 -0500
X-Received-Bytes: 2783
 by: EricP - Wed, 16 Feb 2022 19:02 UTC

Stephen Fuld wrote:
> On 2/16/2022 9:08 AM, John Dallman wrote:
>> In article <suj84u$cjv$1@dont-email.me>, sfuld@alumni.cmu.edu.invalid
>> (Stephen Fuld) wrote:
>>
>>> Microsoft sort of went with the S/38 model (different user
>>> compatible interfaces for different HW), but didn't do any
>>> automatic migration, e.g. from X86 to Alpha or MIPS)
>>
>> What are you thinking of here? Windows C/C++ and many other languages are
>> straightforwardly compiled to native code.
>
> Sure. The difference is in the "automatic" part. When you upgraded
> from a proprietary CISC IBM S/38 to a Power based AS/400, the first time
> you ran the program, the system automatically recompiled the
> intermediate code (that was included with the object code) for the new
> architecture. This led to the anecdotes mentioned in the AS/400 book
> about customers complaining that their shiny new AS/400 was running much
> slower than their old S/38 and IBM saying just rerun the program and it
> went much faster. What was going on was the first run (unknown to the
> customer) included the recompile time. When they ran it again, it ran
> much faster (native code for the faster processor). BTW, this was part
> of the reason for my suggestion at the end of my last post that the
> source code be transparently kept with the object code. It would allow
> transparent recompiles.

There are also static binary-to-binary translators, which take
the source binary, treat it like a language,
run it through a compiler and linker, and spit out a native EXE.
IIRC DEC did this for VAX->Alpha and x86->Alpha.

A bit of poking around finds

[paywalled]
LLBT: an LLVM-based static binary translator, 2012
https://dl.acm.org/doi/abs/10.1145/2380403.2380419

"... translates source binary into LLVM IR and then retargets
the LLVM IR to various ISAs by using the LLVM compiler...
from ARMv5 to Intel IA32, Intel x64, MIPS, and other ARMs such as ARMv7"
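The table-driven core of such a static translator can be sketched with hypothetical source and target ISAs (the mnemonics below are invented for illustration; real systems like LLBT lift to LLVM IR rather than emitting target text directly):

```python
# Hypothetical ARM-ish source ops mapped to x86-ish target sequences.
TRANSLATION_TABLE = {
    "mov": lambda d, s: [f"MOV {d}, {s}"],
    "add": lambda d, s: [f"ADD {d}, {s}"],
    # one source instruction may expand to several target instructions:
    "rsb": lambda d, s: [f"NEG {d}", f"ADD {d}, {s}"],  # reverse-subtract
}

def translate(source):
    """Statically translate a whole source binary, instruction by
    instruction, producing a target-ISA listing."""
    target = []
    for line in source:
        op, dst, src = line.split()
        target.extend(TRANSLATION_TABLE[op](dst, src))
    return target

print(translate(["mov r0 5", "rsb r0 r1"]))
# ['MOV r0, 5', 'NEG r0', 'ADD r0, r1']
```

The hard parts a real translator faces — discovering code vs. data, indirect branches, condition-code semantics — all live outside this table lookup, which is precisely why lifting to a compiler IR and reusing its back ends is attractive.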

Re: instruction set binding time, was Encoding 20 and 40 bit

<sujjc3$bsi$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23587&group=comp.arch#23587

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Wed, 16 Feb 2022 11:30:11 -0800
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <sujjc3$bsi$1@dont-email.me>
References: <suj84u$cjv$1@dont-email.me>
<memo.20220216170931.7708S@jgd.cix.co.uk> <sujf50$503$1@dont-email.me>
<ETbPJ.37567$Y1A7.24920@fx43.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 16 Feb 2022 19:30:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dfe43dac9de6aed344ed8a533420abc6";
logging-data="12178"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/u2rd3UfD9pxXGoy3BoWuXYzCpgem3G6I="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:uJ01nfhrFiLmo69PorJh2phLsok=
In-Reply-To: <ETbPJ.37567$Y1A7.24920@fx43.iad>
Content-Language: en-US
 by: Stephen Fuld - Wed, 16 Feb 2022 19:30 UTC

On 2/16/2022 11:02 AM, EricP wrote:
> Stephen Fuld wrote:
>> On 2/16/2022 9:08 AM, John Dallman wrote:
>>> In article <suj84u$cjv$1@dont-email.me>, sfuld@alumni.cmu.edu.invalid
>>> (Stephen Fuld) wrote:
>>>
>>>> Microsoft sort of went with the S/38 model (different user
>>>> compatible interfaces for different HW), but didn't do any
>>>> automatic migration, e.g. from X86 to Alpha or MIPS)
>>>
>>> What are you thinking of here? Windows C/C++ and many other languages
>>> are
>>> straightforwardly compiled to native code.
>>
>> Sure.  The difference is in the "automatic" part.  When you upgraded
>> from a proprietary CISC IBM S/38 to a Power based AS/400, the first
>> time you ran the program, the system automatically recompiled the
>> intermediate code (that was included with the object code) for the new
>> architecture.  This led to the anecdotes mentioned in the AS/400 book
>> about customers complaining that their shiny new AS/400 was running
>> much slower than their old S/38 and IBM saying just rerun the program
>> and it went much faster.  What was going on was the first run (unknown
>> to the customer) included the recompile time.  When they ran it again,
>> it ran much faster (native code for the faster processor).  BTW, this
>> was part of the reason for my suggestion at the end of my last post
>> that the source code be transparently kept with the object code.  It
>> would allow transparent recompiles.
>
> There is also static binary-binary translators which
> takes the binary source and treats it like a language,
> runs it through a compiler and linker, and spits out a native EXE.
> IIRC DEC did thins for VAX->Alpha and x86->Alpha.
>
> A bit of poking around finds
>
> [paywalled]
> LLBT: an LLVM-based static binary translator, 2012
> https://dl.acm.org/doi/abs/10.1145/2380403.2380419
>
> "... translates source binary into LLVM IR and then retargets
> the LLVM IR to various ISAs by using the LLVM compiler...
> from ARMv5 to Intel IA32, Intel x64, MIPS, and other ARMs such as ARMv7"
>

Good point. Another alternative strategy. I guess an answer to John's
question is that there are a lot of alternatives, with different
tradeoffs, and it is not clear that HW/SW control is the only factor.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Advantages of in-order execution

<jwvee424mn2.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=23588&group=comp.arch#23588

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution
Date: Wed, 16 Feb 2022 14:54:24 -0500
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <jwvee424mn2.fsf-monnier+comp.arch@gnu.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
<2022Feb16.133409@mips.complang.tuwien.ac.at>
<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>
<jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
<5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="78f761c51fcb7be8c116fec35f477bf1";
logging-data="7356"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Bcam+Q10CIDOPSPzdCDAw"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:rUswyRy5EzF74xO3OVlHqA428Ig=
sha1:Cjj1AZ1eT/DC9MgW4MgOH1nBkgI=
 by: Stefan Monnier - Wed, 16 Feb 2022 19:54 UTC

>> So, maybe the better question is: what kind of future process
>> constraints could bring the trade-offs back in favor of
>> in-order designs?
> A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
> as much power performing only 2× faster.

Do you think this figure could still apply with today's processes?

Is there nowadays an in-order CPU which is about 1/12th the size of an
OoO sibling yet only ~50% (or less) slower?
[ It doesn't have to be 1/12th, of course; just significantly smaller
than the corresponding slowdown. ]

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<501133bc-105a-416b-bac5-f9ec95dbcf1en@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23590&group=comp.arch#23590

X-Received: by 2002:a05:600c:2552:b0:37b:c7dd:d376 with SMTP id e18-20020a05600c255200b0037bc7ddd376mr3222893wma.113.1645045499633;
Wed, 16 Feb 2022 13:04:59 -0800 (PST)
X-Received: by 2002:a05:6870:9108:b0:ce:c0c9:5dc with SMTP id
o8-20020a056870910800b000cec0c905dcmr1269809oae.46.1645045499116; Wed, 16 Feb
2022 13:04:59 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 13:04:58 -0800 (PST)
In-Reply-To: <sujg0c$lf8$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3db1:25d2:322a:440e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3db1:25d2:322a:440e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org>
<suigkc$nkv$1@dont-email.me> <vG8PJ.15176$GjY3.1981@fx01.iad> <sujg0c$lf8$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <501133bc-105a-416b-bac5-f9ec95dbcf1en@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 21:04:59 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 16 Feb 2022 21:04 UTC

On Wednesday, February 16, 2022 at 12:32:47 PM UTC-6, BGB wrote:
> On 2/16/2022 9:23 AM, EricP wrote:
> > BGB wrote:
>
> >
> > Mitch's My66K predicate value state flags are implied by the predicate
> > instruction shadow and tracked internally, not architectural ISA
> > registers like Itanium. To have multiple predicates in flight at once
> > just use multiple PRED instructions.
> >
> >
> Yeah, a PRED prefix/instruction represents a somewhat different approach
> than what I was using.
>
> Something like a PRED instruction would also likely add internal state
<
8-bits in the form of a shift register.
<
> that would need to be dealt with during interrupts or similar, and
> doesn't have an obvious way to "wrap around" in a modulo loop.
>
It gets saved with the rest of Thread Header (PSQW)
>
> OK. In my case, they are the low 2 bits of the status register.
>
> The normal predicated encodings are hard-wired to SR.T (for the ?T / ?F
> modes), and normally, CMPxx and similar also update this bit.
>
> The alternate bit (SR.S) is encoded via Op64 instructions.
> W.w: May select SR.S for Ez encodings (as predicate);
> W.m: May select SR.S as output for "CMPxx Imm, Rn"
> W.i: May select SR.S as output for "CMPxx Rm, Rn"
>
> In other contexts, these bits mostly serve to extend the register fields
> to 6 bits and similar.
>
> But, doing multi-bit predication in a "not quite so bit-soup" strategy
> would require ~ 4 or 5 bits of entropy.
<
If you only support && and || combining of predicate masks, it
requires 0 extra state for multiple predicates!
>
>
> Normally, this prefix effectively glues 24 bits onto the instruction.
> In the newly added bundle case, it merely adds 12, internally
> repacking/padding the bits as-if it were the 24-bit prefix, for each
> instruction.
>
> It adds 5 control bits, and 7 bits which may be used as one of:
> An extension to the immediate (3RI);
> A 4th register field (4R);
> An additional immediate (4RI);
> Potentially, as more opcode bits (2R/3R).
>
> Many of the bits designated for opcode extension are filled with zeroes
> in this case (and, the immediate-field is sign-filled to its length in
> the original Op64 encoding).
>
>
> Indirectly, it allows for bundles where R32..R63 and predicated
> instructions can be used at the same time, otherwise:
> Predicated instructions using R32..R63 require 64-bits to encode;
> They may not be used in bundles.
>
> There is a 32-bit encoding which has R32..R63, but can not encode
> predicated forms of these instructions.
>
>
> No effect on existing code, because previously using an Op64 prefix in a
> bundle encoding wasn't a valid encoding. It also only works for
> encodings in certain ranges, ...
>
>
> This is getting a little ugly though...
>
> And turning into a non-orthogonal mix-and-match game (with hacks layered
> on top of hacks).
>
>
> Though, I am operating within the limits of stuff I can do without
> breaking binary compatibility with existing code.
>
> It is also unclear how I could do things much "better" without moving
> over to a larger instruction size for base encodings (otherwise it is
> seeming like a "mix and match game" may well be inevitable).
>
>
> Then again, arguably I "could" widen fetch/decode to 128 bits, to allow
> for something like:
> Op64 - OpB | Op64 - OpA
> Or (how it would have been encoded in the WEX6W idea):
> Op64 | Op64 | OpB | OpA
>
> This can not currently be handled within the existing pipeline.
>
> ...

Re: Advantages of in-order execution

<sujp45$b1$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23591&group=comp.arch#23591

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution
Date: Wed, 16 Feb 2022 15:08:19 -0600
Organization: A noiseless patient Spider
Lines: 59
Message-ID: <sujp45$b1$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
<2022Feb16.133409@mips.complang.tuwien.ac.at>
<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>
<jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
<5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com>
<jwvee424mn2.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 16 Feb 2022 21:08:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c2fca33c8e4c63ea697ef74965f89bcd";
logging-data="353"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18bPirz1pPwi1Y4oAWj50tJ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:u9n9+vlK7BhMYqa/5a6ctJD1TX0=
In-Reply-To: <jwvee424mn2.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Wed, 16 Feb 2022 21:08 UTC

On 2/16/2022 1:54 PM, Stefan Monnier wrote:
>>> So, maybe the better question is: what kind of future process
>>> constraints could bring the trade-offs back in favor of
>>> in-order designs?
>> A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
>> as much power performing only 2× faster.
>
> Do you think this figure could still apply with today's processes?
>
> Is there nowadays an in-order CPU which is about 1/12th the size of an
> OoO sibling yet only ~50% (or less) slower?
> [ It doesn't have to be 1/12th, of course; just significantly smaller
> than the corresponding slowdown. ]
>

This is a major factor for the potential resurgence of VLIW and friends.

It is unlikely that they will be able to match OoO cores in terms of
per-thread performance, but this may not matter all that much in the
longer run.

While, say, 1.5x or 2x slower on common workloads may seem like a big
issue...

We also commonly see software written using technologies (or frameworks)
which result in considerably larger slowdowns (such as dynamically typed
scripting languages), and there is a lot still that could be done to
improve performance on this front.

It is also possible that people could start writing code better able to
exploit the properties of certain classes of hardware (such as big VLIW
machines) without worrying as much that a "tight and simple loop" which
runs fast on an x86 machine performs poorly on a VLIW machine (which
would much rather that one throw "a dump truck full of variables" at the
problem, *).

*: Where automated loop unrolling and scheduling can fail, manual
unrolling and reordering can still succeed (provided the hardware is
able to make good use of it).
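The manual transformation in question — throwing "a dump truck full of variables" at the loop — looks like the sketch below. Python will not show the speedup; the point is only the shape of the code: the extra accumulators break the single serial dependency chain, handing an in-order or VLIW machine explicit ILP.

```python
def dot_simple(a, b):
    """Naive dot product: one accumulator, one long dependency chain."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled(a, b):
    # Four independent accumulators: each iteration now has four
    # multiply-adds that can execute in parallel.
    s0 = s1 = s2 = s3 = 0.0
    i, n = 0, len(a) - len(a) % 4
    while i < n:
        s0 += a[i]   * b[i]
        s1 += a[i+1] * b[i+1]
        s2 += a[i+2] * b[i+2]
        s3 += a[i+3] * b[i+3]
        i += 4
    for j in range(n, len(a)):   # scalar tail for leftover elements
        s0 += a[j] * b[j]
    return s0 + s1 + s2 + s3

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
print(dot_simple(a, b), dot_unrolled(a, b))  # 35.0 35.0
```

This is exactly the rewrite that needs plenty of architectural registers: each extra accumulator and each unrolled load wants a register of its own, which is where traditional 32-register RISCs start to run out.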

This is an area where more traditional RISCs hit a wall, though: absent
a larger number of registers and the ability to sort and bundle things
as needed, there is ultimately a lower ceiling on the amount of ILP
that can be achieved with an in-order implementation.

Though, in both sorts of ISAs, things could be better if there were
some way to absorb or hide the impact of things like cache misses, which
seem to be a fairly significant source of lost clock cycles IME.

....

>
> Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<334973b1-ccd5-4102-b278-ea60fe8b76f9n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23592&group=comp.arch#23592

X-Received: by 2002:a5d:4292:0:b0:1e4:b7fd:eb84 with SMTP id k18-20020a5d4292000000b001e4b7fdeb84mr3681015wrq.657.1645045836531;
Wed, 16 Feb 2022 13:10:36 -0800 (PST)
X-Received: by 2002:a54:4e85:0:b0:2ce:4cc1:9d82 with SMTP id
c5-20020a544e85000000b002ce4cc19d82mr1453400oiy.50.1645045836024; Wed, 16 Feb
2022 13:10:36 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 13:10:35 -0800 (PST)
In-Reply-To: <sujhj3$its$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3db1:25d2:322a:440e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3db1:25d2:322a:440e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me>
<suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<sudb0g$rq3$1@dont-email.me> <jwv7d9xh0dt.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.114353@mips.complang.tuwien.ac.at> <sug2j8$bid$1@newsreader4.netcologne.de>
<2022Feb15.191558@mips.complang.tuwien.ac.at> <suiblj$euv$1@gioia.aioe.org>
<f4e11bbb-7a4a-4038-89a3-08d5e5979b0dn@googlegroups.com> <sujhj3$its$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <334973b1-ccd5-4102-b278-ea60fe8b76f9n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 21:10:36 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 16 Feb 2022 21:10 UTC

On Wednesday, February 16, 2022 at 12:59:51 PM UTC-6, BGB wrote:
> On 2/16/2022 12:28 PM, MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 2:12:38 AM UTC-6, Terje Mathisen wrote:
> >> Anton Ertl wrote:
> >>> Thomas Koenig <tko...@netcologne.de> writes:
> >>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>>>
> >>>>> * Before people make assumptions about what I mean with "software
> >>>>> crisis": When the software cost is higher than the hardware cost,
> >>>>> the software crisis reigns. This has been the case for much of the
> >>>>> software for several decades.
> >>>>
> >>>> Two reasonable dates for that: 1957 (the first Fortran compiler) or
> >>>> 1964, when the /360 demonstrated for all to see that software (especially
> >>>> compatibility) was more important than any particular hardware.
> >>>
> >>> Yes, the term "software crisis" is from 1968, but programming language
> >>> implementations demonstrated already before Fortran (and actually more
> >>> so, because pre-Fortran languages had less implementation investment
> >>> and did not utilize the hardware as well) that there is a world where
> >>> software cost is more relevant than hardware cost. But of course at
> >>
> >> This is a very big world actually, i.e. see all the Python code running
> >> on huge cloud instances even though they require 10-100 times as much
> >> resources as the same algorithms implemented in Rust. The main exception
> >> is the usual one, i.e. where the Python code (or some other scripting
> >> language) is just a thin glue layer tying together low-level libraries
> >> written in C(++), like in numpy.
> > <
> > This sounds exactly like the cost of medicine problem we have over here.
> > <
> > to wit::
> > <
> > Patient does not see what the medical facility billed insurance company.
> > Patient only sees monthly insurance bill.
> > <
> > So, why should patient care if it cost insurance company $10, or $10,000 ?
> > Patient only sees monthly insurance bill.
<
> Except when one gets the "bare minimum to require health insurance
> mandate" insurance, which then does not cover the whole "actually going
> to the hospital thing", then one does need to go for some reason, and it
> costs $k ...
<
No, let me explain::
<
Insurance is a bet--they bet it won't happen, you bet it does.
<
When you get low-ball insurance you are taking the "it won't happen" side--
in effect acting like your own insurance company!
>
> And, also the practice of "perma-temps" (keeping employees permanently
> as temp workers), where the company does not provide health insurance or
> similar, leading partly to the above situation.
<
The real problem is that we have a medical system that needs to consume
20% of GDP and a population base only willing to spend 15%; and no incentive
for either side to change their position.
>
> ...
> >>
> >> Terje
> >>
> >> --
> >> - <Terje.Mathisen at tmsw.no>
> >> "almost all programming can be viewed as an exercise in caching"

Re: Advantages of in-order execution

<99b2f0ed-a15d-49a6-a346-0257ed466f0dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23593&group=comp.arch#23593

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 13:14:51 -0800 (PST)
In-Reply-To: <jwvee424mn2.fsf-monnier+comp.arch@gnu.org>
Message-ID: <99b2f0ed-a15d-49a6-a346-0257ed466f0dn@googlegroups.com>
Subject: Re: Advantages of in-order execution
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 21:14 UTC

On Wednesday, February 16, 2022 at 1:54:27 PM UTC-6, Stefan Monnier wrote:
> >> So, maybe the better question is: what kind of future process
> >> constraints could bring the trade-offs back in favor of
> >> in-order designs?
> > A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
> > as much power performing only 2× faster.
> Do you think this figure could still apply with today's processes?
<
at handwaving accuracy levels: yes.
>
> Is there nowadays an in-order CPU which is about 1/12th the size of an
> OoO sibling yet only ~50% (or less) slower?
<
Point of Reference: Most of the IO designs do not have access to the
L1 or L2 performance of the GBOoO design, nor of the DRAM bandwidth.
Without comparable "rest of memory" hierarchy you can't exactly say
how they would compare fairly face to face. Performance comes from
the "whole" system.
<
> [ It doesn't have to be 1/12th, of course; just significantly smaller
> than the corresponding slowdown. ]
>
>
> Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sujpuu$292r$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=23594&group=comp.arch#23594

Newsgroups: comp.arch
From: joh...@taugh.com (John Levine)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
 instructions in 128 bits
Date: Wed, 16 Feb 2022 21:22:38 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sujpuu$292r$1@gal.iecc.com>
 by: John Levine - Wed, 16 Feb 2022 21:22 UTC

According to Ivan Godard <ivan@millcomputing.com>:
>I don't know of any JIT that does PGO or LTO or other analysis-heavy
>optimization. Perhaps you could enlighten me.

A long time ago, in the 1970s, APL\3000 for the HP 3000 did JIT and PGO. The
first time you ran a routine, the system compiled it into code that depended
on the shape of the arrays it used. When the statement ran again, the code did
a signature check to see whether those assumptions still held and, if not,
recompiled it into slower but more general code.
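The recompile-on-shape-change scheme John describes can be sketched in a few lines. This is hypothetical Python for illustration only: the names, the cache policy, and the "compiler" stand-ins are all invented, not the actual APL\3000 mechanism.

```python
# Hypothetical sketch of shape-specialized JIT dispatch, loosely modeled
# on the APL\3000 scheme described above. All names are invented.

def make_jit(compile_specialized, compile_general):
    cache = {}  # routine name -> (shape signature or None, compiled code)

    def run(name, arrays):
        sig = tuple(len(a) for a in arrays)
        entry = cache.get(name)
        if entry is None:
            # first run: compile code hard-wired to these array shapes
            cache[name] = (sig, compile_specialized(sig))
        elif entry[0] is not None and entry[0] != sig:
            # signature check failed: recompile into slower general code
            cache[name] = (None, compile_general())
        _, code = cache[name]
        return code(arrays)

    return run

# Stand-in "compilers": tag results so we can see which path ran.
def compile_specialized(sig):
    return lambda arrays: ("specialized", sum(map(sum, arrays)))

def compile_general():
    return lambda arrays: ("general", sum(map(sum, arrays)))

run = make_jit(compile_specialized, compile_general)
```

Calling `run` repeatedly with the same shapes stays on the specialized code; the first shape mismatch permanently demotes the routine to the general version, which is the fallback behavior described above.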

The whole thing was extremely clever for the 1970s. Unfortunately, the 3000 had
a very strong separation between code and data (think of a minicomputer version of
a B5500) so the JIT compiler couldn't emit native code and instead had to use
semi-interpreted code that depended on non-standard microcode and the performance
of APL was poor.

I think some of the optimization techniques have been used in other APL implementations.
I later realized that I had reinvented the subscript calculus in a project for a 1972
compiler class, unaware that someone had gotten a PhD for it in 1970.

Read all about it here: https://aplwiki.com/wiki/APL%5C3000

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Advantages of in-order execution

<94f1e5e3-c5af-4165-8126-db9d6b5a48c8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23595&group=comp.arch#23595

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 13:23:17 -0800 (PST)
In-Reply-To: <jwvee424mn2.fsf-monnier+comp.arch@gnu.org>
Message-ID: <94f1e5e3-c5af-4165-8126-db9d6b5a48c8n@googlegroups.com>
Subject: Re: Advantages of in-order execution
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Feb 2022 21:23 UTC

On Wednesday, February 16, 2022 at 1:54:27 PM UTC-6, Stefan Monnier wrote:
> >> So, maybe the better question is: what kind of future process
> >> constraints could bring the trade-offs back in favor of
> >> in-order designs?
> > A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
> > as much power performing only 2× faster.
> Do you think this figure could still apply with today's processes?
>
> Is there nowadays an in-order CPU which is about 1/12th the size of an
> OoO sibling yet only ~50% (or less) slower?
> [ It doesn't have to be 1/12th, of course; just significantly smaller
> than the corresponding slowdown. ]
<
I might also point out that smaller IO cores work comparatively better
on big-database workloads::
assume low hit rates for both the data (80%) and code (90%) caches.
assume a low hit rate for the data TLB (80%).
A GBOoO core running 2× as fast as an LBIO core generates 2× the
DRAM traffic, 2× the TLB traffic,... so the TLBs and caches have
to be 4× as big in order to have the same number of misses
per cycle as the LBIO core. So everything has to be beefed up
in proportion to performance--and for most kinds of caches you
have to quadruple the capacity to halve the miss rate.
<
Sooner or later managing all this stuff hits its own brick
wall.
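The quadruple-capacity-to-halve-the-miss-rate point follows the usual rule of thumb that miss rate scales roughly as 1/sqrt(capacity). A back-of-envelope check (all numbers illustrative, not measurements):

```python
import math

# Rule-of-thumb model: miss rate ~ base_rate * sqrt(base_capacity/capacity).
def misses_per_cycle(accesses_per_cycle, capacity_kb,
                     base_capacity_kb=32, base_miss_rate=0.10):
    miss_rate = base_miss_rate * math.sqrt(base_capacity_kb / capacity_kb)
    return accesses_per_cycle * miss_rate

lbio = misses_per_cycle(1.0, 32)    # small in-order core, 32 KB cache
gboo = misses_per_cycle(2.0, 128)   # 2x-faster core given 4x the cache
# quadrupling capacity halves the miss rate, cancelling the 2x traffic,
# so both cores see the same misses per cycle under this model
```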
>
>
> Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sujr18$5fl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23597&group=comp.arch#23597

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
 instructions in 128 bits
Date: Wed, 16 Feb 2022 15:40:54 -0600
Organization: A noiseless patient Spider
Message-ID: <sujr18$5fl$1@dont-email.me>
In-Reply-To: <501133bc-105a-416b-bac5-f9ec95dbcf1en@googlegroups.com>
 by: BGB - Wed, 16 Feb 2022 21:40 UTC

On 2/16/2022 3:04 PM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 12:32:47 PM UTC-6, BGB wrote:
>> On 2/16/2022 9:23 AM, EricP wrote:
>>> BGB wrote:
>>
>>>
>>> Mitch's My66K predicate value state flags are implied by the predicate
>>> instruction shadow and tracked internally, not architectural ISA
>>> registers like Itanium. To have multiple predicates in flight at once
>>> just use multiple PRED instructions.
>>>
>>>
>> Yeah, a PRED prefix/instruction represents a somewhat different approach
>> than what I was using.
>>
>> Something like a PRED instruction would also likely add internal state
> <
> 8-bits in the form of a shift register.
> <
>> that would need to be dealt with during interrupts or similar, and
>> doesn't have an obvious way to "wrap around" in a modulo loop.
>>
> It gets saved with the rest of Thread Header (PSQW)

OK.

In my case, the SR bits are saved during an ISR.

The ISR basically starts out with a blob that dumps everything to the stack.

Though, there is an annoyance that dumping the full BJX2 register space
currently takes up roughly 640 bytes of stack. So, a basic ISR entry
frame needs ~ 1K.

This is "mildly concerning" given that the region holding the ISR stack
is only a few kB...

I have gone from having the ISR prolog/epilog sequence capturing only
the scratch registers, to capturing all of the registers. Partly this
was for practical reasons, but partly it was also a result of me trying
to figure out why TLB miss interrupts seem to be (occasionally) mangling
execution state. It does have the property that all of the captured
state is now in the same place.

Could maybe add either a "magic variable" or "pseudo register" (C side)
to allow ISR handlers easier access to this captured register state.

>>
>> OK. In my case, they are the low 2 bits of the status register.
>>
>> The normal predicated encodings are hard-wired to SR.T (for the ?T / ?F
>> modes), and normally, CMPxx and similar also update this bit.
>>
>> The alternate bit (SR.S) is encoded via Op64 instructions.
>> W.w: May select SR.S for Ez encodings (as predicate);
>> W.m: May select SR.S as output for "CMPxx Imm, Rn"
>> W.i: May select SR.S as output for "CMPxx Rm, Rn"
>>
>> In other contexts, these bits mostly serve to extend the register fields
>> to 6 bits and similar.
>>
>> But, doing multi-bit predication in a "not quite so bit-soup" strategy
>> would require ~ 4 or 5 bits of entropy.
> <
> If you only support && and || combining of predicate masks, it
> requires 0 extra state for multiple predicates!

I was thinking of things like:
if(x) { A }
if(y) { B }

And then overlaying blocks A and B on top of each other.

In a few cases, wrangling a single predicate bit was also a hassle when
dealing with combining things like DEPTH_TEST and ALPHA_TEST in a
rasterizer (though, these fall under an && pattern).

if(depth_test_passes && alpha_test_passes && stencil_test_passes)
{ draw_pixel }

An && or || chain can be faked in some of these cases by predicating
subsequent CMPxx instructions (BGBCC doesn't do this, but I have done it
in ASM).
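The &&-chain trick just mentioned — predicating each later compare on the current T bit — can be modeled abstractly like this. The function names are invented and this is a toy simulation, not BJX2 encodings:

```python
# Toy model of an &&-chain built from predicated compares: each later
# "?T CMPxx" only updates T while T is still set, so the chain computes
# the conjunction without branches.

def cmp_set_t(state, cond):
    state["T"] = bool(cond)        # plain CMPxx: always writes T

def cmp_if_t(state, cond):
    if state["T"]:                 # ?T CMPxx: predicated on T
        state["T"] = bool(cond)

def pixel_predicate(depth_ok, alpha_ok, stencil_ok):
    s = {"T": False}
    cmp_set_t(s, depth_ok)         # CMPxx    depth_test
    cmp_if_t(s, alpha_ok)          # ?T CMPxx alpha_test
    cmp_if_t(s, stencil_ok)        # ?T CMPxx stencil_test
    return s["T"]                  # predicate guarding draw_pixel
```

Once any test in the chain clears T, every later predicated compare is skipped and T stays clear, which is exactly the short-circuit && behavior.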

>>
>>
>> Normally, this prefix effectively glues 24 bits onto the instruction.
>> In the newly added bundle case, it merely adds 12, internally
>> repacking/padding the bits as-if it were the 24-bit prefix, for each
>> instruction.
>>
>> It adds 5 control bits, and 7 bits which may be used as one of:
>> An extension to the immediate (3RI);
>> A 4th register field (4R);
>> An additional immediate (4RI);
>> Potentially, as more opcode bits (2R/3R).
>>
>> Many of the bits designated for opcode extension are filled with zeroes
>> in this case (and, the immediate-field is sign-filled to its length in
>> the original Op64 encoding).
>>
>>
>> Indirectly, it allows for bundles where R32..R63 and predicated
>> instructions can be used at the same time, otherwise:
>> Predicated instructions using R32..R63 require 64-bits to encode;
>> They may not be used in bundles.
>>
>> There is a 32-bit encoding which has R32..R63, but can not encode
>> predicated forms of these instructions.
>>
>>
>> No effect on existing code, because previously using an Op64 prefix in a
>> bundle encoding wasn't a valid encoding. It also only works for
>> encodings in certain ranges, ...
>>
>>
>> This is getting a little ugly though...
>>
>> And turning into a non-orthogonal mix-and-match game (with hacks layered
>> on top of hacks).
>>
>>
>> Though, I am operating within the limits of stuff I can do without
>> breaking binary compatibility with existing code.
>>
>> It is also unclear how I could do things much "better" without moving
>> over to a larger instruction size for base encodings (otherwise it is
>> seeming like a "mix and match game" may well be inevitable).
>>
>>
>> Then again, arguably I "could" widen fetch/decode to 128 bits, to allow
>> for something like:
>> Op64 - OpB | Op64 - OpA
>> Or (how it would have been encoded in the WEX6W idea):
>> Op64 | Op64 | OpB | OpA
>>
>> This can not currently be handled within the existing pipeline.
>>
>> ...

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<a021dfe7-8486-41db-8b3f-fa478a748e7an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23604&group=comp.arch#23604

Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 16:37:05 -0800 (PST)
In-Reply-To: <sujr18$5fl$1@dont-email.me>
Message-ID: <a021dfe7-8486-41db-8b3f-fa478a748e7an@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
 instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 17 Feb 2022 00:37 UTC

On Wednesday, February 16, 2022 at 3:40:59 PM UTC-6, BGB wrote:
> On 2/16/2022 3:04 PM, MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 12:32:47 PM UTC-6, BGB wrote:
> >> On 2/16/2022 9:23 AM, EricP wrote:
> >>> BGB wrote:
> >>
> >>>
> >>> Mitch's My66K predicate value state flags are implied by the predicate
> >>> instruction shadow and tracked internally, not architectural ISA
> >>> registers like Itanium. To have multiple predicates in flight at once
> >>> just use multiple PRED instructions.
> >>>
> >>>
> >> Yeah, a PRED prefix/instruction represents a somewhat different approach
> >> than what I was using.
> >>
> >> Something like a PRED instruction would also likely add internal state
> > <
> > 8-bits in the form of a shift register.
> > <
> >> that would need to be dealt with during interrupts or similar, and
> >> doesn't have an obvious way to "wrap around" in a modulo loop.
> >>
> > It gets saved with the rest of Thread Header (PSQW)
>
> OK.
> In my case, the SR bits are saved during an ISR.
>
> The ISR basically starts out with a blob that dumps everything to the stack.
<
By the time control arrives at handler, everything that needs to be saved
is already in memory, and everything that needs to be loaded is already
present, including a stack for ISR use.
>
>
> Though, there is an annoyance that dumping the full BJX2 register space
> to the stack currently takes up roughly 640 bytes on the stack. So, a
> basic ISR entry frame needs ~ 1K.
<
My 66000 ISR stacks can be as small as a few dozen DoubleWords.....
>
> This is "mildly concerning" given that the region holding the ISR stack
> is only a few kB...
>
>
> I have gone from having the ISR prolog/epilog sequence capturing only
> the scratch registers, to capturing all of the registers. Partly this
> was for practical reasons, but partly it was also a result of me trying
> to figure out why TLB miss interrupts seem to be (occasionally) mangling
> execution state. It does have the property that all of the captured
> state is now in the same place.
<
lipstick on a pig.
>
> Could maybe add either a "magic variable" or "pseudo register" (C side)
> to allow ISR handlers easier access to this captured register state.
> >>
> >> OK. In my case, they are the low 2 bits of the status register.
> >>
> >> The normal predicated encodings are hard-wired to SR.T (for the ?T / ?F
> >> modes), and normally, CMPxx and similar also update this bit.
> >>
> >> The alternate bit (SR.S) is encoded via Op64 instructions.
> >> W.w: May select SR.S for Ez encodings (as predicate);
> >> W.m: May select SR.S as output for "CMPxx Imm, Rn"
> >> W.i: May select SR.S as output for "CMPxx Rm, Rn"
> >>
> >> In other contexts, these bits mostly serve to extend the register fields
> >> to 6 bits and similar.
> >>
> >> But, doing multi-bit predication in a "not quite so bit-soup" strategy
> >> would require ~ 4 or 5 bits of entropy.
> > <
> > If you only support && and || combining of predicate masks, it
> > requires 0 extra state for multiple predicates!
<
> I was thinking of things like:
> if(x) { A }
> if(y) { B }
>
> And then overlaying blocks A and B on top of each other.
>
I started out that way, but Brian did not find enough instances where this
would work in his compiler port; so I (with his guidance) dropped back to
&& and || only.
>
> In a few cases, wrangling a single predicate bit was also a hassle when
> dealing with combining things like DEPTH_TEST and ALPHA_TEST in a
> rasterizer (though, these fall under an && pattern).
>
> if(depth_test_passes && alpha_test_passes && stencil_test_passes)
> { draw_pixel }
>
> An && or || chain can be faked in some of these cases by predicating
> subsequent CMPxx instructions (BGBCC doesn't do this, but I have done it
> in ASM).
<
Since my CMP instructions deliver (about) 30 bits to a register, the code
leading to a Predicated region can perform all the CMPs first, and then
use them to PRED whatever then or else clause is supposed to get
control. A second CMP does not destroy the "value" of the first CMP !!
> >>
> >>
> >> Normally, this prefix effectively glues 24 bits onto the instruction.
> >> In the newly added bundle case, it merely adds 12, internally
> >> repacking/padding the bits as-if it were the 24-bit prefix, for each
> >> instruction.
> >>
> >> It adds 5 control bits, and 7 bits which may be used as one of:
> >> An extension to the immediate (3RI);
> >> A 4th register field (4R);
> >> An additional immediate (4RI);
> >> Potentially, as more opcode bits (2R/3R).
> >>
> >> Many of the bits designated for opcode extension are filled with zeroes
> >> in this case (and, the immediate-field is sign-filled to its length in
> >> the original Op64 encoding).
> >>
> >>
> >> Indirectly, it allows for bundles where R32..R63 and predicated
> >> instructions can be used at the same time, otherwise:
> >> Predicated instructions using R32..R63 require 64-bits to encode;
> >> They may not be used in bundles.
> >>
> >> There is a 32-bit encoding which has R32..R63, but can not encode
> >> predicated forms of these instructions.
> >>
> >>
> >> No effect on existing code, because previously using an Op64 prefix in a
> >> bundle encoding wasn't a valid encoding. It also only works for
> >> encodings in certain ranges, ...
> >>
> >>
> >> This is getting a little ugly though...
> >>
> >> And turning into a non-orthogonal mix-and-match game (with hacks layered
> >> on top of hacks).
> >>
> >>
> >> Though, I am operating within the limits of stuff I can do without
> >> breaking binary compatibility with existing code.
> >>
> >> It is also unclear how I could do things much "better" without moving
> >> over to a larger instruction size for base encodings (otherwise it is
> >> seeming like a "mix and match game" may well be inevitable).
> >>
> >>
> >> Then again, arguably I "could" widen fetch/decode to 128 bits, to allow
> >> for something like:
> >> Op64 - OpB | Op64 - OpA
> >> Or (how it would have been encoded in the WEX6W idea):
> >> Op64 | Op64 | OpB | OpA
> >>
> >> This can not currently be handled within the existing pipeline.
> >>
> >> ...

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suke20$dt5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23607&group=comp.arch#23607

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
 instructions in 128 bits
Date: Wed, 16 Feb 2022 21:05:35 -0600
Organization: A noiseless patient Spider
Message-ID: <suke20$dt5$1@dont-email.me>
In-Reply-To: <a021dfe7-8486-41db-8b3f-fa478a748e7an@googlegroups.com>
 by: BGB - Thu, 17 Feb 2022 03:05 UTC

On 2/16/2022 6:37 PM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 3:40:59 PM UTC-6, BGB wrote:
>> On 2/16/2022 3:04 PM, MitchAlsup wrote:
>>> On Wednesday, February 16, 2022 at 12:32:47 PM UTC-6, BGB wrote:
>>>> On 2/16/2022 9:23 AM, EricP wrote:
>>>>> BGB wrote:
>>>>
>>>>>
>>>>> Mitch's My66K predicate value state flags are implied by the predicate
>>>>> instruction shadow and tracked internally, not architectural ISA
>>>>> registers like Itanium. To have multiple predicates in flight at once
>>>>> just use multiple PRED instructions.
>>>>>
>>>>>
>>>> Yeah, a PRED prefix/instruction represents a somewhat different approach
>>>> than what I was using.
>>>>
>>>> Something like a PRED instruction would also likely add internal state
>>> <
>>> 8-bits in the form of a shift register.
>>> <
>>>> that would need to be dealt with during interrupts or similar, and
>>>> doesn't have an obvious way to "wrap around" in a modulo loop.
>>>>
>>> It gets saved with the rest of Thread Header (PSQW)
>>
>> OK.
>> In my case, the SR bits are saved during an ISR.
>>
>> The ISR basically starts out with a blob that dumps everything to the stack.
> <
> By the time control arrives at handler, everything that needs to be saved
> is already in memory, and everything that needs to be loaded is already
> present, including a stack for ISR use.

Mine is basically the other extreme:
It swaps a few registers, copies some bits around, and then performs a
computed branch into a jump table (using the interrupt category to
calculate where to jump in the jump table, *);
The ISR is responsible for pretty much everything else.

It starts out as a sort of puzzle: how to get the user-visible state
saved without stomping any of that state in the process.

An earlier design would have banked out a few registers (as in the
SuperH / SH-4 ISA), but it seemed like kind of a waste to have a bunch
of registers which would only ever be used by ISRs, but then again, this
would have made some things a little easier.

I at one point considered leaving out the automatic SP<->SSP swap (SP
and SSP switch places when running inside an ISR), but this would make
things a bit harder as now the ISR would need to set up the ISR stack
without having *any* free registers to work with.

Whereas, with the SP swap, it is at least possible to use the ISR stack
as initial scratch space to save off a few registers, which can then be
used for getting the other important stuff saved.

The process is made slightly harder by my generally omitting an
instruction to load/store control registers directly, so they need to be
moved through GPRs.

*: Interrupt Entry:
    Copy PC to SPC;
    Copy SR(31:0) to EXSR(63:32);
    Set SR(30:28) to 111 (SV,RB,BL), Clear SR(27:24), ...;
    Swap SP and SSP (via RB being Set);
    Branch relative to VBR (which is required to have a 256B alignment
    for "reasons").

And, RTE:
    Restore SR(31:0) from EXSR;
    Unswap SP and SSP (via RB being Cleared);
    Branch to SPC.

Within an ISR, the flags also serve to effectively disable the MMU/TLB
(as long as SP and SSP remain swapped). Also, ISRs may not overlap (a
fault within an ISR is instantly fatal).

Or, basically, bordering on the cheapest ISR mechanism I could come up
with at the time...

There is also an ISR handler intended for system calls, which can
effectively perform a task-switch into the kernel (it isn't really
viable to handle a system call from within the ISR handler itself).

>>
>>
>> Though, there is an annoyance that dumping the full BJX2 register space
>> to the stack currently takes up roughly 640 bytes on the stack. So, a
>> basic ISR entry frame needs ~ 1K.
> <
> My 66000 ISR stacks can be as small as a few dozen DoubleWords.....

ISRs, in my case, are written in C, so one needs enough space for the C
ABI to do its thing. The notation and mechanism (on this side of things)
has some superficial similarity to interrupts on the MSP430 (with some
architectural hardware registers being mapped to "magic" C variables).

Partly for legacy reasons, the ISR stack is put into a dedicated SRAM
area, which is split between ISR related variables / scratch-space and
the ISR stack. Unlike the DRAM areas, this space will not experience an
L2 miss.

>>
>> This is "mildly concerning" given that the region holding the ISR stack
>> is only a few kB...
>>
>>
>> I have gone from having the ISR prolog/epilog sequence capturing only
>> the scratch registers, to capturing all of the registers. Partly this
>> was for practical reasons, but partly it was also a result of me trying
>> to figure out why TLB miss interrupts seem to be (occasionally) mangling
>> execution state. This does have the property that all of the captured
>> state is now in the same place.
> <
> lipstick on a pig.

FWIW: It isn't designed to be elegant...

>>
>> Could maybe add either a "magic variable" or "pseudo register" (C side)
>> to allow ISR handlers easier access to this captured register state.
>>>>
>>>> OK. In my case, they are the low 2 bits of the status register.
>>>>
>>>> The normal predicated encodings are hard-wired to SR.T (for the ?T / ?F
>>>> modes), and normally, CMPxx and similar also update this bit.
>>>>
>>>> The alternate bit (SR.S) is encoded via Op64 instructions.
>>>> W.w: May select SR.S for Ez encodings (as predicate);
>>>> W.m: May select SR.S as output for "CMPxx Imm, Rn"
>>>> W.i: May select SR.S as output for "CMPxx Rm, Rn"
>>>>
>>>> In other contexts, these bits mostly serve to extend the register fields
>>>> to 6 bits and similar.
>>>>
>>>> But, doing multi-bit predication in a "not quite so bit-soup" strategy
>>>> would require ~ 4 or 5 bits of entropy.
>>> <
>>> If you only support && and || combining of predicate masks, it
>>> requires 0 extra state for multiple predicates!
> <
>> I was thinking of things like:
>> if(x) { A }
>> if(y) { B }
>>
>> And then overlaying blocks A and B on top of each other.
>>
> I started out that way, but Brian did not find enough instances where this
> would work in his compiler port; so I (with his guidance) dropped back to
> && and || only.

OK.

>>
>> In a few cases, wrangling a single predicate bit was also a hassle when
>> dealing with combining things like DEPTH_TEST and ALPHA_TEST in a
>> rasterizer (though, these fall under an && pattern).
>>
>> if(depth_test_passes && alpha_test_passes && stencil_test_passes)
>> { draw_pixel }
>>
>> An && or || chain can be faked in some of these cases by predicating
>> subsequent CMPxx instructions (BGBCC doesn't do this, but I have done it
>> in ASM).
> <
> Since my CMP instructions deliver (about) 30 bits to a register, the code
> leading to a Predicated region can perform all the CMPs first, and then
> use them to PRED whatever then or else clause is supposed to get
> control. A second CMP does not destroy the "value" of the first CMP !!

Yeah.

With a shared flag bit, each CMP overwrites the flag bit.

>>>>
>>>>
>>>> Normally, this prefix effectively glues 24 bits onto the instruction.
>>>> In the newly added bundle case, it merely adds 12, internally
>>>> repacking/padding the bits as-if it were the 24-bit prefix, for each
>>>> instruction.
>>>>
>>>> It adds 5 control bits, and 7 bits which may be used as one of:
>>>> An extension to the immediate (3RI);
>>>> A 4th register field (4R);
>>>> An additional immediate (4RI);
>>>> Potentially, as more opcode bits (2R/3R).
>>>>
>>>> Many of the bits designated for opcode extension are filled with zeroes
>>>> in this case (and, the immediate-field is sign-filled to its length in
>>>> the original Op64 encoding).
>>>>
>>>>
>>>> Indirectly, it allows for bundles where R32..R63 and predicated
>>>> instructions can be used at the same time, otherwise:
>>>> Predicated instructions using R32..R63 require 64-bits to encode;
>>>> They may not be used in bundles.
>>>>
>>>> There is a 32-bit encoding which has R32..R63, but can not encode
>>>> predicated forms of these instructions.
>>>>
>>>>
>>>> No effect on existing code, because previously using an Op64 prefix in a
>>>> bundle encoding wasn't a valid encoding. It also only works for
>>>> encodings in certain ranges, ...
>>>>
>>>>
>>>> This is getting a little ugly though...
>>>>
>>>> And turning into a non-orthogonal mix-and-match game (with hacks layered
>>>> on top of hacks).
>>>>
>>>>
>>>> Though, I am operating within the limits of stuff I can do without
>>>> breaking binary compatibility with existing code.
>>>>
>>>> It is also unclear how I could do things much "better" without moving
>>>> over to a larger instruction size for base encodings (otherwise it is
>>>> seeming like a "mix and match game" may well be inevitable).
>>>>
>>>>
>>>> Then again, arguably I "could" widen fetch/decode to 128 bits, to allow
>>>> for something like:
>>>> Op64 - OpB | Op64 - OpA
>>>> Or (how it would have been encoded in the WEX6W idea):
>>>> Op64 | Op64 | OpB | OpA
>>>>
>>>> This can not currently be handled within the existing pipeline.
>>>>
>>>> ...


Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sukuuk$1qnd$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=23609&group=comp.arch#23609

From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 08:53:57 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sukuuk$1qnd$1@gioia.aioe.org>
 by: Terje Mathisen - Thu, 17 Feb 2022 07:53 UTC

EricP wrote:
> BGB wrote:
>> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>>> Thomas Koenig wrote:
>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>> And because compiler
>>>>> branch prediction (~10% miss rate)
>>>>
>>>> That seems optimistic.
>>>>
>>>>> is much worse than dynamic branch
>>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>>> application, so take them with a grain of salt),
>>>>
>>>> What is the branch miss rate on a binary search, or a sort?
>>>> Should be close to 50%, correct?
>>>>
>>> Or closer to zero, if you implement your qsort with predicated
>>> left/right pointer updates and data swaps.
>>>
>>> Same for a binary search, the left/right boundary updates can be
>>> predicated allowing you to blindly run log2(N) iterations and pick
>>> the remaining item. I.e. change it from a code state machine to a
>>> data state machine because dependent loads can run at 2-3
>>> cycles/iteration while branch misses cost you 5-20 cycles.
>>>
>>
>> Yeah, if the ISA does predicated instructions, and the compiler uses
>> them, pretty much all of the internal "if()" branches in a typical
>> sorting function can be expressed branch-free.
>>
>>
>> One can potentially modulo schedule them as well, though this requires
>> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where
>> support for predicated ops using SR.S is an optional feature, though 3
>> or 4 bits could be better).
>
> Mitch's My66K predicate value state flags are implied by the predicate
> instruction shadow and tracked internally, not architectural ISA
> registers like Itanium. To have multiple predicates in flight at once
> just use multiple PRED instructions.

Mitch's baby is the poster child for such branchless code; it fits this
model of avoiding poorly-predicted branches very well indeed:

while (l + 1 < r) {
    pivot = (l + r) >> 1;
    if (arr[pivot] < target) l = pivot;
    else r = pivot;
}

Notice that there is no attempt to detect an early exact match, so the
number of iterations will always be log2(arr.size), and therefore (close
to) perfectly predictable, while the loop body fits the predicate shadow
very nicely.

Written in this form, even an x86 can use the same approach, just taking
a cycle more per iteration due to the slow CMOV operations:

next:
lea ebx,[esi+edi]
shr ebx,1
mov eax,arr[ebx*4]
cmp eax,edx ;; EDX == target value
CMOVB esi,ebx ;; Opposite predicates, so
CMOVAE edi,ebx ;; only one will execute!

lea eax,[esi+1]
cmp eax,edi ;; loop while l+1 < r
jb next

This will take 6-8 clock cycles/iteration (assuming the entire arr[] in
$L1 cache, add miss cycles otherwise), so ~90 cycles to find an entry in
an 8K array.

Using branching code instead would add half a branch mispredict per
iteration for random searches, but much less if you would search
repeatedly for close target values.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sukvfp$1qnd$2@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=23610&group=comp.arch#23610

From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 09:03:05 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sukvfp$1qnd$2@gioia.aioe.org>
 by: Terje Mathisen - Thu, 17 Feb 2022 08:03 UTC

MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 2:12:38 AM UTC-6, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>
>>>>> * Before people make assumptions about what I mean with "software
>>>>> crisis": When the software cost is higher than the hardware cost,
>>>>> the software crisis reigns. This has been the case for much of the
>>>>> software for several decades.
>>>>
>>>> Two reasonable dates for that: 1957 (the first Fortran compiler) or
>>>> 1964, when the /360 demonstrated for all to see that software (especially
>>>> compatibility) was more important than any particular hardware.
>>>
>>> Yes, the term "software crisis" is from 1968, but programming language
>>> implementations demonstrated already before Fortran (and actually more
>>> so, because pre-Fortran languages had less implementation investment
>>> and did not utilize the hardware as well) that there is a world where
>>> software cost is more relevant than hardware cost. But of course at
>>
>> This is a very big world actually, i.e. see all the Python code running
>> on huge cloud instances even though they require 10-100 times as much
>> resources as the same algorithms implemented in Rust. The main exception
>> is the usual one, i.e. where the Python code (or some other scripting
>> language) is just a thin glue layer tying together low-level libraries
>> written in C(++), like in numpy.
> <
> This sounds exactly like the cost of medicine problem we have over here.
> <
> to wit::
> <
> Patient does not see what the medical facility billed insurance company.
> Patient only sees monthly insurance bill.
> <
> So, why should patient care if it cost insurance company $10, or $10,000 ?
> Patient only sees monthly insurance bill.

The US health system is an aberration, not a model to be emulated by
anyone else?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb17.110809@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23615&group=comp.arch#23615

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Thu, 17 Feb 2022 10:08:09 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 88
Message-ID: <2022Feb17.110809@mips.complang.tuwien.ac.at>
 by: Anton Ertl - Thu, 17 Feb 2022 10:08 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 2/15/2022 3:49 AM, Anton Ertl wrote:
>> Also: OoO allows scheduling across many hundreds of instructions (512
>> in Alder Lake) rather than typically a few or a few dozen for static
>> scheduling,
>
>I am not suggesting that isn't true, but I do question why it is true.
>That is, if it is beneficial, I presume a compiler could do its
>scheduling across a window at least as big as HW. The compiler can use
>more memory and time than is available to the HW. As a minimum, it
>could emulate what the HW does so should be equal (excepting for
>variable length delays).
>
>So is such a big window not beneficial to the compiler?

The problem is that a 512-instruction window contains ~70 conditional
branches (taking our LaTeX benchmark as example):

/home/anton/tmp/pmu-tools/ocperf.py stat -e br_inst_retired.conditional -e instructions latex bench >/dev/null

Performance counter stats for 'latex bench':

     400088535      br_inst_retired_conditional
    3010329693      instructions

[Unfortunately, Skylake does not have a separate event counter for
indirect branches, but they would also have to be predicted by the
compiler.]

So the compiler would have to branch-predict across ~70 branches to
schedule across such a window.

Why not just perform non-speculative scheduling, i.e., move
instructions further down in the execution rather than up? Because
that means that in many cases the dependencies of conditional branches
limit the IPC very much (e.g., see Section 6 and Figure 9 of
<http://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.gz>).

So you want speculative execution in a statically scheduled system if
you want scheduling windows that compete with hardware scheduling windows.

Now, architectures up to now have no way for letting the compiler make
use of dynamic branch prediction. Instead, they are limited to static
branch prediction, which has a much higher miss rate than dynamic
branch prediction (some very rough numbers are ~10% for static with
profile data (20% without), 1% for dynamic).

joshua.landau.ws@gmail.com has proposed (in the thread including
<389485e7-b6dc-4b10-ac49-7883ae9fff0e@googlegroups.com>) an
architectural mechanism for letting compilers make use of dynamic
branch prediction accuracy: He splits the branch into a branch predict
(brp) part, where the branch is taken based on the dynamic prediction,
and branch-verify (brv) where the condition is actually available and
the direction is verified; if the prediction was wrong, execution
takes a recovery path and eventually continues on the alternative
branch of the corresponding brp instruction.

So how would a compiler make use of this architectural idea? It would
place a brp before the first speculative instruction the compiler may
want to schedule from a block controlled by a branch (brv), maybe 70
brvs earlier. I expect a lot of code duplication; the worst case
would be a factor of 2^70 code duplication, but realistically the
compiler would limit the code duplication to the maybe 100 most
probable paths (as statically predicted), with very limited
speculation on the other paths. The questions in this respect are:

1) How much do these code duplication limitations reduce the average
IPC?

2) Even with such limitations, what is the effect of the code
duplications on I-cache misses and L2-cache misses?

If the predicted (and verified) path is always the same (and the
static prediction actually included this path), the limitations will
not reduce the IPC, and the I-cache misses will not become a problem,
but if there are thousands of paths (that dynamic branch prediction
nevertheless can predict well), things look less rosy.

Despite these issues, if static scheduling wants to have any chance
against OoO at all, IMO it needs this architectural mechanism.

Other problems limiting static scheduling effectiveness are scheduling
limitations from dynamic binding of method calls and dynamic linking.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sulkbd$it1$3@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23622&group=comp.arch#23622

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 05:59:09 -0800
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <sulkbd$it1$3@dont-email.me>
 by: Ivan Godard - Thu, 17 Feb 2022 13:59 UTC

On 2/16/2022 10:39 AM, BGB wrote:
> On 2/16/2022 4:27 AM, Ivan Godard wrote:
>> On 2/16/2022 1:37 AM, BGB wrote:
>>> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>>>> Thomas Koenig wrote:
>>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>>> And because compiler
>>>>>> branch prediction (~10% miss rate)
>>>>>
>>>>> That seems optimistic.
>>>>>
>>>>>> is much worse than dynamic branch
>>>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>>>> application, so take them with a grain of salt),
>>>>>
>>>>> What is the branch miss rate on a binary search, or a sort?
>>>>> Should be close to 50%, correct?
>>>>>
>>>> Or closer to zero, if you implement your qsort with predicated
>>>> left/right pointer updates and data swaps.
>>>>
>>>> Same for a binary search, the left/right boundary updates can be
>>>> predicated allowing you to blindly run log2(N) iterations and pick
>>>> the remaining item. I.e. change it from a code state machine to a
>>>> data state machine because dependent loads can run at 2-3
>>>> cycles/iteration while branch misses cost you 5-20 cycles.
>>>>
>>>
>>> Yeah, if the ISA does predicated instructions, and the compiler uses
>>> them, pretty much all of the internal "if()" branches in a typical
>>> sorting function can be expressed branch-free.
>>>
>>>
>>> One can potentially modulo schedule them as well, though this
>>> requires an ISA with multiple predicate bits (BJX2 has 1 or 2 bits,
>>> where support for predicated ops using SR.S is an optional feature,
>>> though 3 or 4 bits could be better).
>>
>> Or just one predicate bit and a belt.
>>
>
> One issue with a single bit in this case is that, when wrapping around
> the end of a loop, the loop conditional (if done via CMPxx+BT/BF) would
> stomp the predicate bit.
>
> Though, this could be sidestepped by using BEQ or BGT or similar, which
> don't require stomping the SR.T bit.
>

The predicate is on the belt too; unstompable :-)

Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)

<sull0p$7dj$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23623&group=comp.arch#23623

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution (was: instruction set binding
time, was Encoding 20 and 40 bit instructions in 128 bits)
Date: Thu, 17 Feb 2022 06:10:33 -0800
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <sull0p$7dj$1@dont-email.me>
 by: Ivan Godard - Thu, 17 Feb 2022 14:10 UTC

On 2/16/2022 10:46 AM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 12:24:43 PM UTC-6, Stefan Monnier wrote:
>>> B. Your conclusion is correct based on assuming that single-thread
>>> performance is all that matters, ignoring power and die area. What if the
>>> A73 design had been limited to the number of transistors (or more
>>> importantly die area since bigger transistors could give an illusory
>>> performance improvement if just the quantity is constant) in an A53?
> <
>> So, maybe the better question is: what kind of future process
>> constraints could bring the trade-offs back in favor of
>> in-order designs?
> <
> A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
> as much power performing only 2× faster.
> <
> If software could figure out how to use 12 cores it could get 6× as much work done
> at the same CPU power. But because calculation is rather inexpensive and memory
> access rather expensive, the gain might only be 3× if you got limited by power.
> <
> Right now, due to SW inability to consume multiple CPUs, you don't have the
> option of choosing 6× or even 3×.
> <
> This is not so much a HW problem, and to a large extent SW gets the blame.
> But SW is not as much to blame as the conversation indicates.
> <
> What is missing is a von Neumann model of parallelism where HW provides a
> few instructions to support this new model and, because of its simplicity,
> elegance, and functionality, SW can easily find ways to utilize that new model.
> <
> Right now HW does not know what to build and SW does not know exactly
> what to ask for. HW stumbles around adding TS, CAS, DCAW, LL-SC, ad
> infinitum. SW tries to use the new feature and finds it very difficult to use in
> practice. This CLEARLY shows that what HW is trying to supply is not what
> SW wants to consume; yet SW does not know at an intimate level what to
> ask HW to build, so the circle continues.

Amen!

Re: Advantages of in-order execution

<sulneg$rq0$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23625&group=comp.arch#23625

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution
Date: Thu, 17 Feb 2022 06:51:58 -0800
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <sulneg$rq0$1@dont-email.me>
 by: Ivan Godard - Thu, 17 Feb 2022 14:51 UTC

On 2/16/2022 11:54 AM, Stefan Monnier wrote:
>>> So, maybe the better question is: what kind of future process
>>> constraints could bring the trade-offs back in favor of
>>> in-order designs?
>> A GBOoO core (Opteron because I am familiar) is 12× as big and burns 12×
>> as much power performing only 2× faster.
>
> Do you think this figure could still apply with today's processes?
>
> Is there nowadays an in-order CPU which is about 1/12th the size of an
> OoO sibling yet only ~50% (or less) slower?
> [ It doesn't have to be 1/12th, of course; just significantly smaller
> than the corresponding slowdown. ]

The fundamental Mill technical assumption is that the 12x cost for 2x
gain is only true for equal width.

However, widths differ. For pipelinable loops, 2x width is 2x
performance. The conventional wisdom is that 80% of dynamic instructions
are in loops, so if those can be piped then overall 2x width is 1.6x
gain. However, the extra hardware for width is only part of the total
cost, so if doubling the width increases cost by 80% or less, then a
gain of N achieved through width alone costs less than N.
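A quick Amdahl-style sanity check of that 1.6x figure (a sketch, assuming doubling the width doubles throughput only for the pipelinable fraction):

```python
# Amdahl-style overall speedup: a fraction f of dynamic instructions is
# in pipelinable loops that speed up by the width factor w; the rest
# runs at the old rate.
def speedup(f, w):
    return 1.0 / ((1.0 - f) + f / w)

print(speedup(0.8, 2))  # ~1.67x overall for 2x width, close to the 1.6x above
```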

However, the validity of this inequality is range-bounded: width cannot
be less than one, and there is an upper bound imposed by data routing.
It seems agreed that an OOO cannot have a width greater than eight; our
preliminary experience suggests that a belt architecture hits
diminishing returns somewhere between 32 and 64 width.

However, a larger constraint in the numbers above is the assumption that
all loops are pipeable, which is clearly false. In fact, on a legacy
architecture very few loops can be piped, mostly due to control flow or
the possibility of exceptions. The few that can be are the
embarrassingly parallel ones found in micro-benchmarks and marketing
departments.

The Mill market proposition from the beginning is that far more loops
can be piped if the ISA is properly designed. Not all, but more. Taking
advantage of the architecture, our preliminary results suggest that some
60-70% of dynamic instructions are in loops that a Mill can pipe.

Putting those numbers together, we conclude that increasing width gives
more bang for the buck (up to the implementation limit) so long as
width is no more than 60% of the overall cost. As there is 10x
available going from 12x to 2x, it appears that we can get that 2x back
by increasing width 3x at an additional cost of 2x. 3x an 8-wide OOO is
24 wide, which is well within the 32-64 widening limit.

The cost (in area and power) has doubled while going to 24 wide. The
result: a Mill can match a GBOOO's performance at a 4x cost vs. 12x. Or
it can exceed the GBOOO's performance by pushing into that 32-64 width
region, while still staying well below the GBOOO cost.
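Plugging the numbers above into a performance-per-cost comparison (the figures are the ones above, relative to a simple in-order baseline; the arithmetic is just a sanity check):

```python
# Performance and cost relative to a baseline in-order core at (1, 1),
# using the figures above.
designs = {
    "GBOoO":   {"perf": 2.0, "cost": 12.0},  # 2x faster at 12x area/power
    "Mill-24": {"perf": 2.0, "cost": 4.0},   # matches GBOoO perf at 4x cost
}
for name, d in designs.items():
    print(name, "perf/cost =", d["perf"] / d["cost"])
# The 24-wide configuration works out to 3x the performance per unit cost.
```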

Yes, I know that there are massive assumptions in all this. YMMV.
