Rocksolid Light



devel / comp.arch / Re: instruction set binding time, was Encoding 20 and 40 bit

SubjectAuthor
* Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
+* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
|`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| +* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |+- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| | `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | +* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | ||+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    +* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    | `* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsBernd Linsel
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsBrian G. Lucas
| |  | | |    |   |   |  |`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  | `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |     |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  |     | +- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   |  |     | `- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  |     |`- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128John Levine
| |  | | |    |   |   |  |     |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| ||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| ||  +- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| ||  `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| ||    `* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||     |`- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     `- Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| |`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||  `* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||    +- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||    `- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| | |+- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | |`- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| |  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| `* Re: instruction set binding time, was Encoding 20 and 40 bitQuadibloc
| |  | | |    |   |   |  |     |||  +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||  ||||`* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| |+- Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| |`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  |||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| +- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128James Van Buskirk
| |  | | |    |   |   |  |     |||  |||| `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  |||+* Statically scheduled plus run ahead.Brett
| |  | | |    |   |   |  |     |||  |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  ||+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  ||`* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     |||  |`* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |`* Re: instruction set binding time, was Encoding 20 and 40 bitStephen Fuld
| |  | | |    |   |   |  |     `* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | `- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
`- Re: Encoding 20 and 40 bit instructions in 128 bitsPaul A. Clayton

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwvr18392lk.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=23543&group=comp.arch#23543

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 17:54:26 -0500
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <jwvr18392lk.fsf-monnier+comp.arch@gnu.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suh8vh$g7r$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="f5eb5aba23f1f57d35d02e72b57d6302";
logging-data="28273"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1821HAmh/JJE/mswDP4af0Z"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:WGaD6R1NgAv3RB3tl/Ggxp8tNWc=
sha1:ayG/ZDJPeJqDB8q10KvhfQRVCn4=
 by: Stefan Monnier - Tue, 15 Feb 2022 22:54 UTC

>> Profile-driven optimization is used, of course, but the problem remains:
>> the compiler needs to generate code that works for all possible
>> situations and it can't freely duplicate code all over the place.
>> A bit of code duplication (to specialize a code path to a few different
>> scenarios) can be done, but only within fairly strict limits otherwise
>> code size will explode and performance goes down the drain again.
> But, with a profiler, you can know *where* it likely matters, so it could be
> done without exploding the code size. The compiler would realize that
> a rarely-used path is rarely used, and thus not bother unrolling or
> modulo-scheduling its loops, ...

Exploding code size doesn't have to mean "increase from a couple megs to
a hundred megs". It can just as well mean that you grow your inner loop
outside the bounds of your L1 because profiling indicated that it's "hot"
and so you generated say 3 different versions but all 3 scenarios
occur often.
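The trade-off can be sketched concretely (a hypothetical C example, not taken from any particular compiler): profile-guided cloning specializes a hot loop into versions that are each individually faster, but every clone enlarges the code footprint, and if several versions are all hot, the inner loop's code can overflow the L1 I-cache.

```c
#include <stddef.h>

/* Generic loop: handles any stride, hard to vectorize. */
static long sum_generic(const long *a, size_t n, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

/* Profile-guided cloning, done by hand: specialize the hot stride==1
 * case so the compiler can unroll/vectorize it, keeping the generic
 * version as a fallback.  Each clone adds code; if *both* paths turn
 * out to be hot, the enlarged loop body can spill out of the I-cache
 * and the "optimization" backfires. */
static long sum(const long *a, size_t n, size_t stride) {
    if (stride == 1) {                 /* specialized clone */
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }
    return sum_generic(a, n, stride);  /* fallback path */
}
```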

It's hard to gain even a few % of speed up in a compiler, and it's easy
to lose a lot more when it backfires. And users notice the "backfires"
case a lot more. Also, when speed really matters, the programmer can
usually help a "naive" compiler by doing manual inlining/specializing,
so in the end compilers tend to focus on "safe" optimizations where the
risk of seeing a noticeable slowdown is hopefully very small.

In theory, a compiler could do wonders. But in practice it's terribly
hard, because that compiler has to work sanely for any code you throw at
it. That's why the failure of Itanium was easy to foresee for people in
the compiler business (to be honest, I thought it would still come out
ahead simply because "the ISA doesn't matter" and Intel would just end
up doing the same translation to a big OoO dataflow core as they do with
the x86, and win by virtue of their financial and engineering muscle).

> Part of the issue is pulling off profile-driven optimization in a way
> that is not so annoying that programmers don't bother with it.

That too, indeed.

JITs solve this problem nicely ;-)

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suhbil$1gq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23544&group=comp.arch#23544

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 15:04:52 -0800
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <suhbil$1gq$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me>
<2022Feb15.104639@mips.complang.tuwien.ac.at> <sug7bd$b9l$1@dont-email.me>
<suh3kq$3pd$3@newsreader4.netcologne.de> <suh5d3$po5$1@dont-email.me>
<suh8l5$7e8$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 23:04:53 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4a03ab46b81509913aef66e0a10295b0";
logging-data="1562"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zWdzGzGjLvZFI2KqiyF+C"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:Hx3NcWz5TSUyfoBp+esvoQwB/LY=
In-Reply-To: <suh8l5$7e8$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Ivan Godard - Tue, 15 Feb 2022 23:04 UTC

On 2/15/2022 2:15 PM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>> On 2/15/2022 12:49 PM, Thomas Koenig wrote:
>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>> On 2/15/2022 1:46 AM, Anton Ertl wrote:
>>>>> Ivan Godard <ivan@millcomputing.com> writes:
>>>>>> The
>>>>>> specializer does quite extensive optimization - bundle-packing,
>>>>>> scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>>>>>> - which is too expensive for a JIT, but the passes that do the
>>>>>> optimizations are optionally bypassed.
>>>>>
>>>>> By contrast, an OoO microarchitecture does not care whether the native
>>>>> code has been generated by a JIT compiler or not, it will
>>>>> branch-predict and schedule the instructions all the same.
>>>>
>>>> So will a Mill.
>>>>
>>>> The specializer passes that a JIT would skip are target independent, and
>>>> give better code for any target: OOO, IO, or sideways.
>>>
>>> If it is target independent, why is it in the specializer?
>>
>> The specializer is the back-end of the compiler, and does back-end stuff
>> in addition to member-targeting stuff. The functionality could be split,
>> but that would require a communication format across the split.
>
> That really depends on how you defined your model-independent
> language, GenAsm. I would have expected this to be defined
> in such a way that the specializer really only had to do the
> target-dependent stuff, so most of the heavy lifting is done in
> the normal compilation step.
>
> You must have had your reasons, I just don't understand them.
>
>>
>> Actually there are three kinds of work done: family independent (done in
>> both Mill and x86, like inlining);
>
> In the specializer? That sounds a lot like what a compiler middle end
> should do (maybe guided by some information about the back end,
> like number of registers - a thorny issue).
>
>> member independent (done in all Mill
>> members but not other ISAs,
>
> That sounds like the task of a more or less traditional back
> end to me (like the nvptx "back end", which also generates
> intermediate code).
>
>> like ganging op/compare instructions, or
>> injecting pseudo-ops for load retires), and member dependent (done
>> uniquely for each member, like bit stuffing into the binary).
>
> This is the part that I would probably like to keep as small
> as possible, if it were my project, which it isn't :-)

We have deliberately eschewed LLVM-hacking, and put all the
Mill-family-specific stuff in the specializer. For example, other ISAs
can't safely speculate FP ops due to the possibility of traps, but Mill
can. Sure, LLVM could be taught about that possibility, but it was much
easier to put that into a smarter back end.

Similarly, Mill has different cost functions for many standard
optimizations. It's a safe bet that anything good for an x86 is also good
for Mill, so we let LLVM apply everything it has, but then we look over the
result for additional Mill-specific opportunities.

The true member-dependent parts are small. We're not worried about slow
JITting.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<129fdd82-8128-4c47-b0ca-611ad12ca046n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23546&group=comp.arch#23546

X-Received: by 2002:a05:622a:1349:b0:2d7:fd8c:4263 with SMTP id w9-20020a05622a134900b002d7fd8c4263mr405690qtk.556.1644970915045;
Tue, 15 Feb 2022 16:21:55 -0800 (PST)
X-Received: by 2002:a05:6870:aa8d:b0:c6:db43:22db with SMTP id
gr13-20020a056870aa8d00b000c6db4322dbmr157049oab.314.1644970914558; Tue, 15
Feb 2022 16:21:54 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 15 Feb 2022 16:21:54 -0800 (PST)
In-Reply-To: <jwvr18392lk.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:1da9:46e3:e9c4:35aa;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:1da9:46e3:e9c4:35aa
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me>
<suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suh8vh$g7r$1@dont-email.me> <jwvr18392lk.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <129fdd82-8128-4c47-b0ca-611ad12ca046n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Wed, 16 Feb 2022 00:21:55 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 31
 by: Quadibloc - Wed, 16 Feb 2022 00:21 UTC

On Tuesday, February 15, 2022 at 3:54:31 PM UTC-7, Stefan Monnier wrote:

> In theory, a compiler could do wonders. But in practice it's terribly
> hard, because that compiler has to work sanely for any code you throw at
> it. That's why the failure of Itanium was easy to foresee for people in
> the compiler business (to be honest, I thought it would still come out
> ahead simply because "the ISA doesn't matter" and Intel would just end
> up doing the same translation to a big OoO dataflow core as they do with
> the x86, and win by virtue of their financial and engineering muscle).

The way I've always understood this issue, which seems to be flawed, is
this:

There is no technical obstacle to replacing an out-of-order design with
a design that has a lot of explicit registers, with the compiler allocating
the registers in such a way to make for piles of independent instructions
that can be marked for simultaneous execution.

None at all - it can be done. But this only deals with register hazards,
and it exacts a price in code density, which may be an issue since memory
bandwidth is a constraint.

However, I've been told that a computer being out-of-order *also* helps
with the effects of cache misses. These are basically unpredictable, and
so the compiler can't help you with them. And that is the reason why attempts
to replace an out-of-order design with explicit parallelism and co-operating
register allocation fail.

Now, maybe the compiler can help a little with prefetches, but it's nowhere
near as predictable and effective as in the case of register hazards.
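To make the "help a little with prefetches" point concrete, here is a minimal sketch using the GCC/Clang `__builtin_prefetch` intrinsic (the prefetch distance of 8 elements is a made-up tuning parameter, and the payoff depends entirely on the machine): the compiler or programmer can issue prefetches for a predictable indirect access pattern, but unlike register hazards, whether each access actually misses is unknown at compile time.

```c
#include <stddef.h>

/* Sum data[] through an index array.  The addresses data[idx[i]] are
 * unpredictable to the hardware prefetcher, so we prefetch a fixed
 * distance ahead by hand.  This only hides latency when the distance
 * roughly matches the miss latency -- which the compiler cannot know. */
static long sum_indirect(const long *data, const size_t *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)  /* prefetch 8 iterations ahead (guessed distance) */
            __builtin_prefetch(&data[idx[i + 8]], /*rw=*/0, /*locality=*/1);
        s += data[idx[i]];
    }
    return s;
}
```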

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit

<19f11711-b1f7-4aef-9c50-de683e708090n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23547&group=comp.arch#23547

X-Received: by 2002:a05:620a:13a9:b0:475:c8ee:31d9 with SMTP id m9-20020a05620a13a900b00475c8ee31d9mr233793qki.286.1644971609682;
Tue, 15 Feb 2022 16:33:29 -0800 (PST)
X-Received: by 2002:aca:a9c5:0:b0:2d4:373d:98c8 with SMTP id
s188-20020acaa9c5000000b002d4373d98c8mr203486oie.272.1644971609416; Tue, 15
Feb 2022 16:33:29 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 15 Feb 2022 16:33:29 -0800 (PST)
In-Reply-To: <suh8bp$ce8$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:19e6:c3ce:1a76:25c3;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:19e6:c3ce:1a76:25c3
References: <sufpjo$oij$1@dont-email.me> <memo.20220215170019.7708M@jgd.cix.co.uk>
<suh8bp$ce8$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <19f11711-b1f7-4aef-9c50-de683e708090n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 00:33:29 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 72
 by: MitchAlsup - Wed, 16 Feb 2022 00:33 UTC

On Tuesday, February 15, 2022 at 4:10:05 PM UTC-6, Ivan Godard wrote:
> On 2/15/2022 8:59 AM, John Dallman wrote:
> >> Itanium was not a VLIW, it was an EPIC architecture; granted, both
> >> were wide issue, which is commonly confused with VLIW.
> >>
> >> The Mill is not a VLIW (or EPIC) either, although it is closer than
> >> the Itanium.
> >
> > Can you enlarge on the differences? The basic concept of presenting
> > several instructions to the processor at once, with the compiler ensuring
> > that there are no data hazards between those instructions, seems to be
> > much the same.
> >
> > Itanium had fixed-size bundles that sometimes needed to be padded with
> > no-ops, and stop bits for indicating inter-bundle dependencies. I know
> > Mill has variable-size bundles. What else is different in this aspect?
> >
> >> You probably have a VLIW in your pocket: Qualcomm Hexagon.
> >
> > On the table, but yes. Not that I use it very hard.
> >
> > John
> These differ in which hazards are shifted from the hardware to the
> compiler when dealing with a wide bundle.
>
> EPIC hardware assumed that there are no intra-bundle hazards, so the
> issue queue and its associated hazard check can be omitted. However,
> there were no such guarantees for inter-bundle hazards, so either the
> entire bundle had to run to completion, or, in later Itaniums, the
> hardware did OOO-like retire hazard checking to deal with varying
> instruction latency.
<
This is a big difference between VLIWs and Mill. the unit of fetch and
issue is not the unit of data-flow ignorance.
>
> A classic VLIW assumes that there are neither issue nor retire hazards
> in the schedule, and hence the latency of everything is statically
> known. Neither issue nor retire hardware checking is needed. These days
> the fixed latency requirement restricts VLIWs to applications which do
> not present variable instruction latencies: DSPs, mini-engines like
> crypto blocks, and the like. Usually there is no latency variability
> physically possible, but if the core is connected to external devices
> with unpredictable latency (such as DRAM) the hardware stalls if
> external data is not available.
>
> Mill is VLIW-like for all instructions but loads: everything has static
> fixed latency, and neither issue nor retire hazards need hardware hazard
> checking or scheduling queues. Like a VLIW, it does not check for or
> take advantage of early-out such as multiply-by-one. For loads, which
> are the only source of significant latency variability, Mill splits the
> issue and retire into two different instructions which are independently
> static scheduled (patented). A load-retire instruction stalls the core
> if it has no data yet.
>
> The distinction among these categories is not a matter of encoding - all
> are wide issue, and each has its own scheme. The Itanium used fixed size
> bundles; classic VLIW uses variable-size bundles; and Mill uses a
> bundle-of-six-bundles encoding. However, any of the categories could use
> any of the encoding schemes. It's a matter of what is done by the
> compiler vs. hardware, not a matter of the bitsy representation.
>
> This makes the Mill suitable for general-purpose work that would cause a
> classic VLIW to spend too much time in stall, yet still take advantage
> of any schedule variability to do other work while (possibly) waiting
> for external data, and without needing OOO retire hazard hardware.
>
> The gain from Mill's split load is limited by the maximal gap (in time)
> between the load-issue and load-retire instructions, which is in turn
> determined by how much work is available to do that neither depends on
> the load result nor is depended on by the load issue. That's easily
> determined by conventional dataflow analysis in the compiler, and is
> exactly the same as the amount of work that an OOO can do while waiting
> for a load to retire.
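The scheduling idea behind the split load can be hinted at in plain C (a hypothetical illustration of the technique, not Mill code or its actual semantics): the load is started as soon as the address is known, independent work is placed into the gap, and the value is first consumed as late as possible. On Mill, issue and retire are two separate statically scheduled instructions; in C we can only mimic that by hoisting the load above the independent work.

```c
/* Split-load scheduling sketch: hoist the load (the "load-issue"),
 * fill the latency gap with work that depends on neither the load's
 * result nor its address, then consume the value (the "load-retire",
 * where a Mill core would stall if the data had not arrived yet). */
static long fused(const long *p, long a, long b) {
    long v = *p;          /* load-issue: start the load early        */
    long t = a * b + a;   /* independent work fills the load latency */
    return v + t;         /* load-retire: first use of the value     */
}
```

The amount of independent work available between the two points is exactly what conventional dataflow analysis computes, as the quoted text notes.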

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<1a1b0126-ff07-42cb-9d1a-a55c70697a10n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23548&group=comp.arch#23548

X-Received: by 2002:ac8:5f4b:0:b0:2ce:27e8:7a7d with SMTP id y11-20020ac85f4b000000b002ce27e87a7dmr444556qta.464.1644971834152;
Tue, 15 Feb 2022 16:37:14 -0800 (PST)
X-Received: by 2002:a05:6870:1c8:b0:d3:6d9a:8fd8 with SMTP id
n8-20020a05687001c800b000d36d9a8fd8mr181390oad.333.1644971833885; Tue, 15 Feb
2022 16:37:13 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 15 Feb 2022 16:37:13 -0800 (PST)
In-Reply-To: <suh97n$hng$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:19e6:c3ce:1a76:25c3;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:19e6:c3ce:1a76:25c3
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh97n$hng$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1a1b0126-ff07-42cb-9d1a-a55c70697a10n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 00:37:14 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 45
 by: MitchAlsup - Wed, 16 Feb 2022 00:37 UTC

On Tuesday, February 15, 2022 at 4:24:58 PM UTC-6, Ivan Godard wrote:
> On 2/15/2022 10:43 AM, Anton Ertl wrote:
> > Stefan Monnier <mon...@iro.umontreal.ca> writes:
> >>> Concerning speculation, yes, it does waste power, but rarely, because
> >>> branch mispredictions are rare.
> >>
> >> Of course in-order cores also speculate, so this is only
> >> tangentially related. But as a side-note I'll point out that statically
> >> scheduled processors encourage the compiler to move long-latency
> >> instructions such as loads to "as soon as it's safe to do it" rather
> >> than "as soon as we know we will need it", and that sometimes
> >> requires adding yet more "compensation code" in other branches.
> >
> > Loads are a bad example, because you typically don't move the above a
> > branch they control-depend on, because loads can trap. Instead one
> > tends to use prefetches. IA-64 had a mechanism that allowed moving
> > loads up, however. Non-trapping instructions such as multiplications
> > can be moved up speculatively by the compiler. And because compiler
> > branch prediction (~10% miss rate) is much worse than dynamic branch
> > prediction (~1% miss rate, both numbers vary strongly with the
> > application, so take them with a grain of salt), a static scheduling
> > speculating compiler will tend to waste more energy for the same
> > degree of speculation.
<
> Not in modern processes, or so I'm told by the hardware guys. Leakage is
> of the same order of cost as execution these days, so an idle ALU might
> as well do something potentially useful. Consequently it is worthwhile
> for the compiler to if-convert everything until it runs out of FUs.
<
Pipelining data from the register file to the ALU and back with forwarding
consumes WAY more power than the ALU itself (until you get to integer
multiplication or floating point). Register file access consumes way more
power than the ALU. SRAM cache access consumes ~10× ALU power. Pins cost 100× ALU.
>
> Incidentally, in-order != static-prediction. No in-order core much above
> a Z80 will use static branch prediction, for the reason you give. Well,
> no competent core, anyway.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<5a38b1db-3a00-470c-b4c7-21799423827bn@googlegroups.com>

 by: MitchAlsup - Wed, 16 Feb 2022 00:40 UTC

On Tuesday, February 15, 2022 at 4:46:14 PM UTC-6, Ivan Godard wrote:
> On 2/15/2022 8:58 AM, Stefan Monnier wrote:
> >> Thank you. That makes perfect sense. BTW, does that present an opportunity
> >> for some sort of profile driven optimization where the run time branch
> >> history is fed back to a future compilation for better optimization?
> >> Probably not as useful for an OoO machine.
> >
> > Profile-driven optimization is used, of course, but the problem remains:
> > the compiler needs to generate code that works for all possible
> > situations and it can't freely duplicate code all over the place.
> > A bit of code duplication (to specialize a code path to a few different
> > scenarios) can be done, but only within fairly strict limits otherwise
> > code size will explode and performance goes down the drain again.
> >
> > In contrast, an OoO is free to use a different schedule each time
> > a chunk of code is run (and similarly the branch predictor is free to
> > provide wildly different predictions each time that chunk of code is
> > run) without any downside.
> >
> > The OoO works on the trace of the actual execution, where the main limit
> > is the size of the window it can consider (linked to the accuracy of the
> > branch predictor), whereas the compiler is not limited to such a window
> > but instead it's limited to work on the non-unrolled code.
> Actually the limit to useful window size in GP code is dataflow
> dependencies. All the fancy numbers bandied about for big windows are
> for embarrassingly parallel apps - walk a huge array doing the same
> thing for every element for example. Those (typically HPC) are
> important, but engender a supercomputer bias in design.
>
> GP code - the classic payroll app, or the great bulk of real work once
> the embarrassingly parallel code has been moved off to special purpose
> engines - hits a dataflow dependence within a few tens of instructions.
> The rest of the window can be filled with instructions awaiting issue
> resolution - but you might as well have left them in the icache.
>
> "The dirty little secret about OOO is how little OOO there really is."
> - Andy Glew
<
If you had a reorder window that was 20 instructions deep, but had an
execution window that was 500 instructions deep with rather serial
ordering between reorder windows, you are already well above the knee
of the curve as to what Great-Big-OoO buys.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<82b70246-f42e-4882-9bd4-6b2fafa6f251n@googlegroups.com>

 by: Quadibloc - Wed, 16 Feb 2022 00:44 UTC

On Tuesday, February 15, 2022 at 2:07:56 PM UTC-7, MitchAlsup wrote:

> Until SW figures out how to utilize "craploads" of cores, the market will
> continue to demand GBOoO--------it really is that simple.

If it were possible to make computer chips out of some exotic new
material, with the following properties:

- yields aren't as good as with silicon, so the best you can economically
make with it is a single in-order core (plus the L1 cache it needs) on a
single die;

- gate and signal delays are so much lower that this in-order core runs
faster than a silicon GBOoO core

*then* the demand for GBOoO would _also_ collapse, even without a
parallel programming breakthrough, because now there would be a
better way to deliver what the market wants: faster single-thread performance.

However, of course silicon CMOS has now advanced to the point where
the search for such an exotic material is, presumably, chimerical.
But it is still being _conducted_; this is why we've heard headlines about
how researchers at the University of Illinois have made the "world's
fastest transistor" out of Indium Arsenide and Indium Gallium Phosphide.

Looking up information on the properties of semiconductors, I found these
materials had very high electron mobility. The trouble is, of course, that
long ago, the increasing density of microchips led to ECL and NMOS being
abandoned in favor of CMOS as the only practical way to make today's
very dense chips.

So we need a material that has a much higher hole mobility than silicon
as well as a much higher electron mobility, so we can make really fast CMOS
circuits - so they can be shrunk down to have wire delays competitive
with silicon, without which faster transistors won't do you any good.

I looked, and, indeed, there is one: lead telluride. It is used today to make
diodes... for Peltier Effect refrigeration. There was even a news item not so
long ago about how someone managed to make a transistor out of it.

Tellurium happens to be _very_ weakly radioactive; some have said that this
is a fatal obstacle, but apparently the radioactivity is _so_ weak that the MTBF
would be fairly long, and so a design with decent RAS features might be
workable; I haven't done the numbers.

Ah, here we are: the longest-lived isotope, Tellurium-128, has a half-life of 2.2 * 10^24 years.

Now, Avogadro's Number is 6.02 * 10^23.

A microprocessor die weighs, what, one gram?

So we're talking less than one radioactive decay in a thousand years. I think that's
a tolerable error rate. But we would have to use semiconductor-on-insulator
technology, apparently.
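
That back-of-the-envelope estimate is easy to check directly. Here is a minimal sketch (assuming a 1 g die of pure Te-128 and ignoring the lead in PbTe, so it is an upper bound on the tellurium contribution):

```python
import math

N_A = 6.022e23        # Avogadro's number, atoms/mol
half_life = 2.2e24    # Te-128 half-life in years, as quoted above
molar_mass = 128.0    # g/mol for Te-128

atoms = N_A / molar_mass                        # atoms in one gram, ~4.7e21
decay_rate = atoms * math.log(2) / half_life    # decays per year
print(decay_rate)     # ~1.5e-3 decays/year
```

That works out to on the order of one decay per millennium per gram, the same order of magnitude as the estimate above.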

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<71916e54-a402-4de4-b3f5-310acbcb839bn@googlegroups.com>

 by: MitchAlsup - Wed, 16 Feb 2022 00:46 UTC

On Tuesday, February 15, 2022 at 6:21:56 PM UTC-6, Quadibloc wrote:
> On Tuesday, February 15, 2022 at 3:54:31 PM UTC-7, Stefan Monnier wrote:
>
> > In theory, a compiler could do wonders. But in practice it's terribly
> > hard, because that compiler has to work sanely for any code you throw at
> > it. That's why the failure of Itanium was easy to foresee for people in
> > the compiler business (to be honest, I thought it would still come out
> > ahead simply because "the ISA doesn't matter" and Intel would just end
> > up doing the same translation to a big OoO dataflow core as they do with
> > the x86, and win by virtue of their financial and engineering muscle).
> The way I've always understood this issue, which seems to be flawed, is
> this:
>
> There is no technical obstacle to replacing an out-of-order design with
> a design that has a lot of explicit registers, with the compiler allocating
> the registers in such a way to make for piles of independent instructions
> that can be marked for simultaneous execution.
>
> None at all - it can be done. But this only deals with register hazards,
> and it exacts a price in code density, which may be an issue since memory
> bandwidth is a constraint.
>
> However, I've been told that a computer being out-of-order *also* helps
> with the effects of cache misses. These are basically unpredictable, and
> so the compiler can't help you with them. And that is the reason why attempting
> to replace an out-of-order design with explicit parallelism and co-operating
> register allocation fails.
<
Imagine trying to schedule a "loop" that takes a miss on one load once every 7
loops and takes a store miss once every 13 loops, but the two misses almost
never occur on the same loops.
<
This is child's play for OoO.
>
> Now, maybe the compiler can help a little with prefetches, but it's nowhere
> near as predictable and effective as in the case of register hazards.
>
> John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suiblj$euv$1@gioia.aioe.org>

 by: Terje Mathisen - Wed, 16 Feb 2022 08:12 UTC

Anton Ertl wrote:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>
>>> * Before people make assumptions about what I mean with "software
>>> crisis": When the software cost is higher than the hardware cost,
>>> the software crisis reigns. This has been the case for much of the
>>> software for several decades.
>>
>> Two reasonable dates for that: 1957 (the first Fortran compiler) or
>> 1964, when the /360 demonstrated for all to see that software (especially
>> compatibility) was more important than any particular hardware.
>
> Yes, the term "software crisis" is from 1968, but programming language
> implementations demonstrated already before Fortran (and actually more
> so, because pre-Fortran languages had less implementation investment
> and did not utilize the hardware as well) that there is a world where
> software cost is more relevant than hardware cost. But of course at

This is a very big world actually, i.e. see all the Python code running
on huge cloud instances even though they require 10-100 times as many
resources as the same algorithms implemented in Rust. The main exception
is the usual one, i.e. where the Python code (or some other scripting
language) is just a thin glue layer tying together low-level libraries
written in C(++), like in numpy.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suidol$1b8d$1@gioia.aioe.org>

 by: Terje Mathisen - Wed, 16 Feb 2022 08:48 UTC

Thomas Koenig wrote:
> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> And because compiler
>> branch prediction (~10% miss rate)
>
> That seems optimistic.
>
>> is much worse than dynamic branch
>> prediction (~1% miss rate, both numbers vary strongly with the
>> application, so take them with a grain of salt),
>
> What is the branch miss rate on a binary search, or a sort?
> Should be close to 50%, correct?
>
Or closer to zero, if you implement your qsort with predicated
left/right pointer updates and data swaps.

Same for a binary search, the left/right boundary updates can be
predicated allowing you to blindly run log2(N) iterations and pick the
remaining item. I.e. change it from a code state machine to a data state
machine because dependent loads can run at 2-3 cycles/iteration while
branch misses cost you 5-20 cycles.
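
As a minimal sketch of that idea (Python standing in for predicated machine code; the boolean multiply is the software stand-in for a cmov-style pointer update, and the array length is assumed to be a power of two):

```python
def branchless_search(a, key):
    """Return the last index i with a[i] <= key, or 0 if none.

    Runs exactly log2(len(a)) iterations regardless of the data, so
    there is no data-dependent branch for the predictor to miss.
    """
    lo = 0
    step = len(a) // 2
    while step:
        # predicated boundary update: advance lo only when the probe
        # element is <= key; compilers turn this select into cmov
        lo += step * (a[lo + step] <= key)
        step //= 2
    return lo
```

For example, `branchless_search([1, 3, 5, 7, 9, 11, 13, 15], 7)` yields index 3.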

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suifjt$uuo$2@newsreader4.netcologne.de>

 by: Thomas Koenig - Wed, 16 Feb 2022 09:19 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:

> The true member-dependent parts are small. We're not worried about slow
> JITting.

It will exclude the Mill from many applications where JIT
is important, such as browsers, programs built on browser
infrastructure such as Electron (used for Teams, IIRC), and
commercial applications which use a lot of Java.

That is, of course, your (business) decision to make.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suigkc$nkv$1@dont-email.me>

 by: BGB - Wed, 16 Feb 2022 09:37 UTC

On 2/16/2022 2:48 AM, Terje Mathisen wrote:
> Thomas Koenig wrote:
>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>> And because compiler
>>> branch prediction (~10% miss rate)
>>
>> That seems optimistic.
>>
>>> is much worse than dynamic branch
>>> prediction (~1% miss rate, both numbers vary strongly with the
>>> application, so take them with a grain of salt),
>>
>> What is the branch miss rate on a binary search, or a sort?
>> Should be close to 50%, correct?
>>
> Or closer to zero, if you implement your qsort with predicated
> left/right pointer updates and data swaps.
>
> Same for a binary search, the left/right boundary updates can be
> predicated allowing you to blindly run log2(N) iterations and pick the
> remaining item. I.e. change it from a code state machine to a data state
> machine because dependent loads can run at 2-3 cycles/iteration while
> branch misses cost you 5-20 cycles.
>

Yeah, if the ISA does predicated instructions, and the compiler uses
them, pretty much all of the internal "if()" branches in a typical
sorting function can be expressed branch-free.
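
A minimal sketch of the branch-free form (Python, not BJX2 code; the 0/1 arithmetic select plays the role of the ?T/?F predicated moves):

```python
def compare_exchange(a, i, j):
    # branch-free compare-and-swap: the comparison yields 0 or 1, and
    # the arithmetic select replaces the "if (a[i] > a[j]) swap" branch
    c = int(a[i] > a[j])
    a[i], a[j] = c * a[j] + (1 - c) * a[i], c * a[i] + (1 - c) * a[j]

def sort4(a):
    # 5-comparator sorting network for 4 elements; the comparator
    # sequence is fixed, so there are no data-dependent branches at all
    for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
        compare_exchange(a, i, j)
    return a
```

Because the comparator sequence never depends on the data, such a network is also a natural candidate for modulo scheduling across predicated lanes.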

One can potentially modulo schedule them as well, though this requires
an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
for predicated ops using SR.S is an optional feature, though 3 or 4 bits
could be better).

Did also recently add another hack, where now stuff like:
ADD?T R44, R53, R19 | ADD?F R37, 123, R13
As well as ST/SF predicates:
ADD?ST R44, R53, R19 | ADD?SF R37, 123, R13
....

Can (in theory) be encoded using a 96-bit bundle with an Op64 prefix
being split in half between two logical instructions. This is kind of an
ugly hack, but alas.

One downside is that the encoding scheme does not currently allow for, say:
ADD?T R44, R53, R19 | MOV.W (R37, R6*8, 6), R13

As well, it is effectively limited to encoding 2-wide bundles.

But, alas...

> Terje
>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suijiu$giq$1@dont-email.me>

 by: Ivan Godard - Wed, 16 Feb 2022 10:27 UTC

On 2/16/2022 1:37 AM, BGB wrote:
> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>> And because compiler
>>>> branch prediction (~10% miss rate)
>>>
>>> That seems optimistic.
>>>
>>>> is much worse than dynamic branch
>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>> application, so take them with a grain of salt),
>>>
>>> What is the branch miss rate on a binary search, or a sort?
>>> Should be close to 50%, correct?
>>>
>> Or closer to zero, if you implement your qsort with predicated
>> left/right pointer updates and data swaps.
>>
>> Same for a binary search, the left/right boundary updates can be
>> predicated allowing you to blindly run log2(N) iterations and pick the
>> remaining item. I.e. change it from a code state machine to a data
>> state machine because dependent loads can run at 2-3 cycles/iteration
>> while branch misses cost you 5-20 cycles.
>>
>
> Yeah, if the ISA does predicated instructions, and the compiler uses
> them, pretty much all of the internal "if()" branches in a typical
> sorting function can be expressed branch-free.
>
>
> One can potentially modulo schedule them as well, though this requires
> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
> for predicated ops using SR.S is an optional feature, though 3 or 4 bits
> could be better).

Or just one predicate bit and a belt.
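
Terje's branch-free binary search can be sketched in portable C. The conditional update below is exactly the predicated left/right boundary update he describes, and compilers commonly lower it to a conditional move; the function name and exact shape are illustrative, not anyone's posted code:

```c
#include <stddef.h>

/* Branch-free lower bound over a sorted array: always runs the full
 * ~log2(n) iterations, moving the base pointer with a conditional
 * select instead of a data-dependent branch. Returns a pointer to
 * the first element >= key (or one past the end). */
static const int *branchless_lower_bound(const int *base, size_t n, int key)
{
    while (n > 1) {
        size_t half = n / 2;
        /* Predicated update: typically compiled to cmov/csel. */
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    return (n == 1 && base[0] < key) ? base + 1 : base;
}
```

On a 7-element array this executes the same three iterations whatever the key, so there is nothing for a branch predictor to miss; only the dependent loads bound the throughput, as Terje notes.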

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suik1j$lf7$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23563&group=comp.arch#23563

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Wed, 16 Feb 2022 02:35:32 -0800
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <suik1j$lf7$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me>
<2022Feb15.104639@mips.complang.tuwien.ac.at> <sug7bd$b9l$1@dont-email.me>
<suh3kq$3pd$3@newsreader4.netcologne.de> <suh5d3$po5$1@dont-email.me>
<suh8l5$7e8$1@newsreader4.netcologne.de> <suhbil$1gq$1@dont-email.me>
<suifjt$uuo$2@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 10:35:33 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4a03ab46b81509913aef66e0a10295b0";
logging-data="21991"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ftTPj+NS7AO9xIdOXOGqb"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:vmwztX6LV6OmU+5Jl2+k1y+8TRg=
In-Reply-To: <suifjt$uuo$2@newsreader4.netcologne.de>
Content-Language: en-US
 by: Ivan Godard - Wed, 16 Feb 2022 10:35 UTC

On 2/16/2022 1:19 AM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>
>> The true member-dependent parts are small. We're not worried about slow
>> JITting.
>
> It will exclude the Mill from many applications where JIT
> is important, such as browsers, programs built on browser
> infrastructure such as Electron (used for Teams, IIRC), and
> commercial applications which use a lot of Java.
>
> That is, of course, your (business) decision to make.

Oops - ambiguity alert. Clearing: yes, we would worry about the market
response to a slow JIT, but we are not worried that a Mill JIT might
turn out to be slow. That is, we are confident that a Mill JIT will be
as fast as a JIT for any other ISA.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb16.125800@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23564&group=comp.arch#23564

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 16 Feb 2022 11:58:00 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 67
Distribution: world
Message-ID: <2022Feb16.125800@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <sufiml$2jn$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="7f3a0ccd01852a267d8735337f79dd12";
logging-data="8870"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/LPnHpSbmkVUkwkaKRXuq5"
Cancel-Lock: sha1:073tHD0wuOFWPPykx+ZCjuNfU/s=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Wed, 16 Feb 2022 11:58 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> Will
>> software developers finally get around to designing software that
>> makes the maximum of the hardware, because inefficiency can no longer
>> be papered over with faster hardware?
>
>I have my doubts.
>
>If one trusts a random statistic grabbed off the internet, there
>are more than 25 million software developers at the moment (however
>that is defined, I probably would not qualify). Many of them are
>not very well qualified, and with the advent of trends like "low
>coding", this will be even worse.

The reaction to the software crisis (as in software costs more than
hardware) has always been measures to make software cheaper, and as
hardware gets cheaper/more capable, we see further measures to make
software cheaper.

>Looking at the numerous ransomware attacks and zero-day-exploits,
>the sheer amount and complexity of software already overwhelm
>our capability to write it.

The symptoms were different, but the diagnosis was the same when the
software crisis was first discussed.

>We are already in the middle of a
>software crisis,

We have been for more than 5 decades; remember that the term is from
1968, and it describes phenomena that had already been seen earlier.

>and it will become worse when computers stop
>getting (much) faster.

That's the question. If I can write software that will be relevant for,
say, 50 years instead of 10, because the same or similar hardware will
be built in 50 years, I can invest 5 times as much effort to make the
software better at the same cost per year, or I can lower the cost per
year for the same-quality software by a factor of 5 (that assumes a
rent-per-year model without support, things get more muddled if you
include support or have a pay-once model, or the common gratis
software, and pay for features, support, or pay-to-win models).

Interestingly, we have seen a substantial slowdown in clock rate
increases (and more recently and less markedly, RAM increases, and
mass storage increases), but I have not noticed such effects. Yes,
there is software like World of Warcraft that has been on the market
for more than 17 years, but it requires regular expansions to keep its
player base (but that's in the nature of most entertainment wares,
just like people don't like to watch the same film every day); or one
might consider office software to be feature-complete for most users
for at least two decades (actually independent of hardware advances).

Hmm, maybe I have noticed such effects after all: While in earlier
times software and hardware reviews discussed features, these days the
first thing I read in either is language from fashion magazines, such
as "fresh look". Maybe this points to the hardware and software
industry (and the magazines and review sites that are financed by
them) searching for a reason for you to buy their stuff despite the
lack of substantial advances.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb16.133409@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23565&group=comp.arch#23565

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 16 Feb 2022 12:34:09 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 112
Message-ID: <2022Feb16.133409@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org> <70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="7f3a0ccd01852a267d8735337f79dd12";
logging-data="5581"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/5zNp5Le1U+gib5T5d9OT0"
Cancel-Lock: sha1:Hfb2HsGlcrS4h6k090IjW4IjAQg=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Wed, 16 Feb 2022 12:34 UTC

Scott Smader <yogaman101@yahoo.com> writes:
>On Monday, February 14, 2022 at 7:10:21 PM UTC-8, Stefan Monnier wrote:
>> > Not sure if the assumption of OoO superiority is justified at current
>> > transistor budget, but for the same transistor budget at the same clock
>> > rate, statically scheduled should win because it more efficiently uses
>> > the available transistors to do useful work.
>> What makes you think so? AFAIK OoO cores are better at keeping their
>> FUs busy, and they're probably also better at keeping the memory
>> hierarchy busy. Maybe they do that at the cost of more transistors
>> "wasted" doing "administrative overhead", but it's far from obvious that
>> this overhead is the only thing that matters.
>
>I would appreciate links about OoO's superior use of FUs.

You just need to look at the performance results I posted for Bonnell
vs. Bobcat and Cortex-A53 vs. A73. The OoO machines have a higher IPC
than the in-order machines, and given that they have roughly the same
number of FUs (all these cores are 2-wide), they have a higher FU
utilization.

>Statically scheduled design really seems to require VLIW for the compilers
>to optimally utilize FUs, unrolling loops and vectorizing as convenient.

Why do you think so? There seems very little advantage wrt FU
utilization to VLIW over an in-order superscalar; the only advantage
is swap-type parallelism, but that advantage does not often play a
role (and AFAIK not all VLIWs support it).

>As you note, there haven't been a bunch of popular VLIW machines, although
>Mill stands on the cusp. To compare the relatively mature performance of
>iterated OoO designs against the state of existing statically scheduled
>machines doesn't seem to answer questions about performance in the limit
>when transistor budgets are frozen, which is how I understood your question.

In-order execution already lost to OoO in the less mature state of OoO
of more than 20 years ago: Numbers are seconds run-time for our LaTeX
benchmark:

Celeron 800, PC133 SDRAM, RedHat 7.1 (expi2) 2.89
HP workstation 900MHz Itanium II, Debian Linux 3.528

Note that the Celeron 800 is a low-cost three-wide CPU by Intel from
2001 (based on the Coppermine core from 1999), while the Itanium II is
a high-end 6-wide CPU by Intel from 2002.

>I made a cheating assumption that a statically scheduled machine would
>include dynamic branch prediction hardware.

I expect so, but the scheduler in the compiler will not use it with
current compiler concepts.

>I think that goes a long way toward equalizing cache/memory access times.

Why should it?

>We've seen what OoO hath wrought, but we haven't yet seen a machine like
>Mill put through its paces. My impression is that Mill's innovations will
>dramatically advance the path to improved statically scheduled machines.
>It may not be fair to include all of the Mill's clever inventions into
>what I'm pointing at for static scheduling, but I am.

Given that it's a statically scheduled machine, I think it's fair.
But I have yet to see a reason to believe that there will be a
dramatic improvement.

>OoO works definitely better for programs which depend on conditions that
>can't be known (or accurately guessed) at compile time, provided the metric
>is speed. But OoO lags when better is measured by die area or
>power-per-instruction or when the program's sequencing is predictable.

Best die area is probably something like the b16 or (more mainstream)
the Cortex-M0, and I don't think a Mill can compete with that.
Power-per-instruction also looks good for these CPUs.

But your original claim was that you would consume the same area by
replacing the "speculation transistors" by FU transistors, so no die
size advantage. Looking at the Itanium II, it's 180nm incarnation
(McKinley) uses 421mm^2, and has 130W TDP, compared to the 180nm
Celeron 800 with 90mm^2 die size and 20.8W TDP. The Itanium II still
has a branch predictor, but it does get rid of the OoO transistors and
has more functional units instead, like you suggested. But it is
worse in performance, worse in area, and worse in power consumption.

>I think static scheduling wins when datasets are huge, and I expect
>problem datasets to be really, really huge when transistor budgets stagnate.

For data-parallel problems approaches like GPGPU and SIMD often work
well. For most of the problems where they don't, static scheduling
does not do so great, either. There may be cases where SIMD does not
work, but in-order works well, but I think they are pretty rare (none
come to my mind). Ok, you might say, maybe in-order is great for
doing the control part of programs where SIMD works well; but looking
at the history of Xeon Phi, even for CPUs designed for HPC stuff
(mostly data-parallel) they went from in-order with wide SIMD
Knights Ferry/Corner to narrow OoO with wide SIMD (Knights
Landing/Knights Mill), to eventually using their mainline wide OoO with wide
SIMD instead of having specialized HPC CPUs; my guess is that with
SIMD in play, the OoO overhead consumes a smaller proportion of the
area, and even in HPC there are irregular codes (maybe sparse-matrix
stuff) where OoO pays off.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit

<suj1pu$sou$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23567&group=comp.arch#23567

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Wed, 16 Feb 2022 06:30:22 -0800
Organization: A noiseless patient Spider
Lines: 60
Message-ID: <suj1pu$sou$1@dont-email.me>
References: <suh8bp$ce8$1@dont-email.me>
<memo.20220216134113.7708Q@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 14:30:23 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4a03ab46b81509913aef66e0a10295b0";
logging-data="29470"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+8Mm1jjB7vpy62etbOc78J"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:TIlg6csnb1SbE/dSgHmHzWrpJrs=
In-Reply-To: <memo.20220216134113.7708Q@jgd.cix.co.uk>
Content-Language: en-US
 by: Ivan Godard - Wed, 16 Feb 2022 14:30 UTC

On 2/16/2022 5:40 AM, John Dallman wrote:
> In article <suh8bp$ce8$1@dont-email.me>, ivan@millcomputing.com (Ivan
> Godard) wrote:
>
>> For loads, which are the only source of significant latency
>> variability, Mill splits the issue and retire into two different
>> instructions which are independently static scheduled (patented).
>
> A retire presumably puts the loaded data on the belt?
>
> Where is the data stored after arrival, but before retiring?
>
> I'm asking because I have traumatic memories of the Itanium's failure to
> solve these problems properly for floating-point data, which I had to dig
> out and report. My Intel contact said "Of course that goes wrong, doesn't
> everyone know?" but it wrecked the whole advance loading system for
> floating point.
>
>> A load-retire instruction stalls the core if it has no data yet.
>
> Yup, that's unavoidable.
>
>
> John

The hardware of each Mill family member has a configurable number of
holding buffers which we call "retire stations". Perhaps confusingly,
the term is sometimes used for something else in OOO; ours are more
similar to load/store queue entries.

Each RS contains an operand buffer that accumulates the bytes from a
requested load until either a timeout of a delay given in the load
instruction or an explicit pickup instruction referencing the RS. The RS
snoops the store stream so that the value dropped to the belt at retire
reflects memory state as of the retire time, not the load time. Thus one
can see a Mill load instruction as a sort of value-catching prefetch.

Within a single function frame, it is the responsibility of the
specializer to schedule the lifetimes of in-flight load operations such
that the number simultaneously live does not exceed the number of RSs
configured. Between frames the RSs are lazily and automatically spilled
by the hardware spiller, so that each frame appears to have a complete
set of free RSs at entry. A function return restores the previously
active RSs automatically. The current implementations spill only address
and metadata from RSs, not the buffer content; a spill refill reissues
the load request, which should be satisfied from cache because of the
pre-spill load action.

Countdown ("deferred") loads only count cycles of the load's frame.
While counting, or while awaiting an explicit retire instruction, the
program may do anything it wants including executing calls to
arbitrarily lengthy functions. This provides insulation against cache
misses similar to that provided by very long issue windows in OOO, but
without the hardware overhead or window size limitations.

It works very well in typical GP open code. It does not work so well for
embarrassingly parallel loops that miss every iteration because there
are not enough RSs to hide all the DRAM latency in software pipelines.
OOO has the equivalent problem in codes for which it runs out of window
entries.
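
For readers without Mill hardware, the issue/pickup split can be imitated, very loosely, in plain C with a software prefetch: request the cache line early, cover the latency with other work, and consume the value later. This is only an analogy to the retire-station mechanism described above; `__builtin_prefetch` is a GCC/Clang builtin, not a Mill primitive, and the distance of 8 is an arbitrary illustrative choice:

```c
#include <stddef.h>

/* Software analogy of a deferred load: "issue" the line for a future
 * iteration, then "retire" (consume) the current value. The prefetch
 * distance plays the role of the load instruction's delay count. */
long sum_with_early_issue(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]); /* issue well ahead of use */
        sum += a[i];                       /* pickup: value consumed here */
    }
    return sum;
}
```

Unlike a retire station, the prefetched line is not snooped against the store stream; it merely warms the cache, which is where the analogy to the mechanism Ivan describes breaks down.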

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<vG8PJ.15176$GjY3.1981@fx01.iad>

https://www.novabbs.com/devel/article-flat.php?id=23568&group=comp.arch#23568

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx01.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions
in 128 bits
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at> <suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org> <suigkc$nkv$1@dont-email.me>
In-Reply-To: <suigkc$nkv$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 42
Message-ID: <vG8PJ.15176$GjY3.1981@fx01.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 16 Feb 2022 15:24:11 UTC
Date: Wed, 16 Feb 2022 10:23:53 -0500
X-Received-Bytes: 3041
 by: EricP - Wed, 16 Feb 2022 15:23 UTC

BGB wrote:
> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>> And because compiler
>>>> branch prediction (~10% miss rate)
>>>
>>> That seems optimistic.
>>>
>>>> is much worse than dynamic branch
>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>> application, so take them with a grain of salt),
>>>
>>> What is the branch miss rate on a binary search, or a sort?
>>> Should be close to 50%, correct?
>>>
>> Or closer to zero, if you implement your qsort with predicated
>> left/right pointer updates and data swaps.
>>
>> Same for a binary search, the left/right boundary updates can be
>> predicated allowing you to blindly run log2(N) iterations and pick the
>> remaining item. I.e. change it from a code state machine to a data
>> state machine because dependent loads can run at 2-3 cycles/iteration
>> while branch misses cost you 5-20 cycles.
>>
>
> Yeah, if the ISA does predicated instructions, and the compiler uses
> them, pretty much all of the internal "if()" branches in a typical
> sorting function can be expressed branch-free.
>
>
> One can potentially modulo schedule them as well, though this requires
> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where support
> for predicated ops using SR.S is an optional feature, though 3 or 4 bits
> could be better).

Mitch's My 66000 predicate state flags are implied by the predicate
instruction's shadow and tracked internally, not held in architectural
registers as on Itanium. To have multiple predicates in flight at once
just use multiple PRED instructions.
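
The predicated data swaps that feed such PRED shadows reduce to a branch-free compare-exchange, which is also the building block of sorting networks. A minimal C sketch (illustrative only; a compiler will normally emit two conditional moves or min/max operations for the selects):

```c
/* Branch-free compare-exchange: afterwards *lo <= *hi. The comparison
 * produces a 0/1 predicate and both stores are unconditional selects,
 * so no control flow depends on the data. */
static void cmp_exchange(int *lo, int *hi)
{
    int a = *lo, b = *hi;
    int swap = a > b;      /* predicate value, 0 or 1 */
    *lo = swap ? b : a;
    *hi = swap ? a : b;
}
```

Under an ISA like My 66000 the predicate would live in a PRED instruction's shadow rather than in a register; on Itanium it would occupy an architectural predicate register, which is the contrast drawn above.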

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suj84u$cjv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23569&group=comp.arch#23569

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Wed, 16 Feb 2022 08:18:36 -0800
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <suj84u$cjv$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 16:18:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dfe43dac9de6aed344ed8a533420abc6";
logging-data="12927"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dsoLc7PRSsWSKaFIDryUKm7pH1a1lxTc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:rIjJYBSo7YiihWGS6qiSfXyI+Fk=
In-Reply-To: <subggb$2vj5$1@gal.iecc.com>
Content-Language: en-US
 by: Stephen Fuld - Wed, 16 Feb 2022 16:18 UTC

On 2/13/2022 9:52 AM, John Levine wrote:
> According to Ivan Godard <ivan@millcomputing.com>:
>> In Mill the translation happens once, at install time, via a transparent
>> invocation of the specializer (not a recompile, and the source code is
>> not needed). In a microcoded machine (or a packing/cracking scheme) the
>> translation takes place at execution time, every time. Either way, it
>> gets done, and either way permits ISA extension freely. They differ in
>> power, area, and time; the choice is an engineering design dimension.
>>
>> I assert without proof that Mill-style software translation in fact
>> permits ISA extension that is *more* flexible and powerful than what can
>> be achieved by hardware approaches. YMMV.
>
> If one believes in proof by example, this approach has been wildly successful
> in the IBM S/38, AS/400, and whatever it is called now, something something i.
>
> You can take 30 year TIMI object code and run it on current hardware at full speed,
> supporting their exotic object architecture with 128 bit pointers. I believe the
> architecture had one significant change in the 1990s, with addresses expanding
> from 48 to 64 bits but the old object code still works.
>
> I'm kind of surprised nobody else does this. I suppose only IBM had sufficient
> control over both the hardware and software to make it work.

You have raised an interesting question. I have been thinking about it
for a few days, and I don't think it is as simple as HW/SW control.

First, you have to recognize that there is a perhaps subtle, but very
important difference between what the Mill is going to do, and what the
S/38 etc. did. Specifically, the Mill "respecializes" the OS. I
believe that IBM rewrote the lowermost kernel stuff from scratch when
they went from the proprietary CISC to the Power based systems.

This difference means that the Mill is "limited" to a range of changes,
i.e. number of FUs, belt size, presence or absence of certain features,
etc. It could not be used for a more fundamental change at the OS
level, such as if some hypothetical future Mill abandoned the SAS model
or went to a supervisor state based security (I realize unlikely to
happen for other reasons). So while IBM's approach allows more basic
changes in the HW design, it comes at the cost of non-automatic OS
migration.

So, with that in mind, let's look at some hypothetical vendor considering
a new system. What alternatives does he have?

1. He could go Mill like, with automatic migration of everything at the
cost of limiting his flexibility

2. He could go sort of S/38 like but instead of keeping some
intermediate level, keep the (required to be portable) source code with
the executable, and recompile if needed for a future machine. After
all, if we are talking of an infrequent migration, the extra cost of a
full compile versus the "partial compile" of respecializing would be
lost in the noise.

3. He could expect to limit future HW designs to provide compatibility.

4. He could expect much/most application software to be written to a
defined intermediate language such as the JVM. This is much like the
S/38 approach, but substitutes JIT compiles for install time recompiles.

5. Probably others. . .

Now let's look at what actually happened in some real situations.

While WinTel was two companies, they worked closely together and had a
near monopoly for some years. Intel mostly went with #3 (with the
exception of Itanium), but Microsoft sort of went with the S/38 model
(different user-compatible interfaces for different HW), but didn't do
any automatic migration (e.g. from X86 to Alpha or MIPS).

Apple had complete control over the Mac environment and went with
essentially #2, but without the automatic recompiles. This allowed
major changes to the underlying hardware, mostly without too many problems.

Apple again, but this time with the iPhone. Seems like mostly #4.

While not a "company", Linux and the GCC programs essentially went with
a different variant. They used their version of C as essentially a
public (as opposed to S/38's proprietary) intermediate language, and
expect small changes and recompiles for different underlying architectures.

While most hardware CPU vendors have little interest in making it easy
to migrate to non compatible future hardware, they have great interest
in allowing easy migration to their future CPUs. They accomplish this
by transparently doing some of what the Mill requires a respecialization
for (e.g. increasing the number of FUs), and providing "compatibility"
modes such as what allows you to run a 30 year old S/360 program on a
current system.

So, overall, I think the answer to your question is "It's complicated!"
I think other vendors have evolved different solutions that provide
similar capabilities (not as good in some areas, better in others).
While what S/38 did was revolutionary at the time, and quite "elegant",
the requirements that drove it led to multiple solutions.

However, I do think it might be a good idea for a system to
automatically keep the source code with the object code to allow for
automatic recompiles.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23570&group=comp.arch#23570

X-Received: by 2002:a5d:5983:0:b0:1e5:7dd6:710 with SMTP id n3-20020a5d5983000000b001e57dd60710mr3071233wri.392.1645029962840;
Wed, 16 Feb 2022 08:46:02 -0800 (PST)
X-Received: by 2002:a54:4e85:0:b0:2ce:4cc1:9d82 with SMTP id
c5-20020a544e85000000b002ce4cc19d82mr1033012oiy.50.1645029962195; Wed, 16 Feb
2022 08:46:02 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 08:46:02 -0800 (PST)
In-Reply-To: <2022Feb16.133409@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=162.229.185.59; posting-account=Gm3E_woAAACkDRJFCvfChVjhgA24PTsb
NNTP-Posting-Host: 162.229.185.59
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com> <2022Feb16.133409@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: yogaman...@yahoo.com (Scott Smader)
Injection-Date: Wed, 16 Feb 2022 16:46:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 197
 by: Scott Smader - Wed, 16 Feb 2022 16:46 UTC

On Wednesday, February 16, 2022 at 5:34:57 AM UTC-8, Anton Ertl wrote:
> Scott Smader <yogam...@yahoo.com> writes:
> >On Monday, February 14, 2022 at 7:10:21 PM UTC-8, Stefan Monnier wrote:
> >> > Not sure if the assumption of OoO superiority is justified at current
> >> > transistor budget, but for the same transistor budget at the same
> >> > clock rate, statically scheduled should win because it more
> >> > efficiently uses the available transistors to do useful work.
> >> What makes you think so? AFAIK OoO cores are better at keeping their
> >> FUs busy, and they're probably also better at keeping the memory
> >> hierarchy busy. Maybe they do that at the cost of more transistors
> >> "wasted" doing "administrative overhead", but it's far from obvious that
> >> this overhead is the only thing that matters.
> >
> >I would appreciate links about OoO's superior use of FUs.
>
> You just need to look at the performance results I posted for Bonnell
> vs. Bobcat and Cortex-A53 vs. A73. The OoO machines have a higher IPC
> than the in-order machines, and given that they have roughly the same
> number of FUs (all these cores are 2-wide), they have a higher FU
> utilization.

You just need to look at the relative silicon areas of the A53 and A73: it's the same ratio as the performance increase, i.e. no performance gain per unit area.

>
> >Statically scheduled design really seems to require VLIW for the compilers
> >to optimally utilize FUs, unrolling loops and vectorizing as convenient.
>
> Why do you think so? There seems very little advantage wrt FU
> utilization to VLIW over an in-order superscalar; the only advantage
> is swap-type parallelism, but that advantage does not often play a
> role (and AFAIK not all VLIWs support it).
>

Why? Because a wider instruction can dispatch more functional-unit operations in a single cycle.

> >As you note, there haven't been a bunch of popular VLIW machines, although
> >Mill stands on the cusp. To compare the relatively mature performance of
> >iterated OoO designs against the state of existing statically scheduled
> >machines doesn't seem to answer questions about performance in the limit
> >when transistor budgets are frozen, which is how I understood your question.
>
> In-order execution already lost to OoO in the less mature state of OoO
> of more than 20 years ago: Numbers are seconds run-time for our LaTeX
> benchmark:
>
> Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2) 2.89
> HP workstation 900MHz Itanium II, Debian Linux 3.528
>
> Note that the Celeron 800 is a low-cost three-wide CPU by Intel from
> 2001 (based on the Coppermine core from 1999), while the Itanium II is
> a high-end 6-wide CPU by Intel from 2002.

A. Itanium's failure is not proof that statically scheduled will lose out when transistor budgets freeze.
B. Your conclusion is correct based on assuming that single-thread performance is all that matters, ignoring power and die area. What if the A73 design had been limited to the number of transistors (or more importantly die area since bigger transistors could give an illusory performance improvement if just the quantity is constant) in an A53?

>
> >I made a cheating assumption that a statically scheduled machine would
> >include dynamic branch prediction hardware.
>
> I expect so, but the scheduler in the compiler will not use it with
> current compiler concepts.
>

That is not a criticism of static scheduling. A compiler for a computer is part of the product deliverables; it must understand the machine it's producing code for.

> >I think that goes a long way toward equalizing cache/memory access times.
>
> Why should it?

Look at Ivan's answer about how Mill handles memory accesses from 6:30am today.

>
> >We've seen what OoO hath wrought, but we haven't yet seen a machine like
> >Mill put through its paces. My impression is that Mill's innovations will
> >dramatically advance the path to improved statically scheduled machines.
> >It may not be fair to include all of the Mill's clever inventions into
> >what I'm pointing at for static scheduling, but I am.
>
> Given that it's a statically scheduled machine, I think it's fair.
> But I have yet to see a reason to believe that there will be a
> dramatic improvement.
>
> >OoO works definitely better for programs which depend on conditions that
> >can't be known (or accurately guessed) at compile time, provided the
> >metric is speed. But OoO lags when better is measured by die area or
> >power-per-instruction or when the program's sequencing is predictable.
>
> Best die area is probably something like the b16 or (more mainstream)
> the Cortex-M0, and I don't think a Mill can compete with that.
> Power-per-instruction also looks good for these CPUs.
>
> But your original claim was that you would consume the same area by
> replacing the "speculation transistors" by FU transistors, so no die
> size advantage.

The additional FUs are doing work that isn't discarded.

In this discussion thread at Feb 15, 2022, 1:07:56 PM, Mitch wrote:
"Late in my AMD career, I did a study design of a 1-wide in order x86-64
to see how much the OoO-ness was costing.
<
"We discussed this back when Nick McLaren was still posting.
<
"The bottom line is that the GBOoO design is 12× bigger than the 1-wide
I-O design (including L1 cache and predictors but excluding L2 cache).
The GBOoO core ran 2× faster than LBIO core from the same L2 outwards.
The LBIO cores also operated at 1/12 the power--logic power scales with
area, [SD]RAM and pin power scales with utilization. "

So 12x the area and 12x the power gives 2x the performance.
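Spelling out the arithmetic from Mitch's numbers (a back-of-the-envelope sketch that takes the quoted 12x area, 12x power, 2x performance figures at face value):

```python
# Back-of-the-envelope comparison from the quoted study:
# GBOoO core: 12x area, 12x power, 2x single-thread performance vs. LBIO.
gboo_area, gboo_power, gboo_perf = 12.0, 12.0, 2.0
lbio_area, lbio_power, lbio_perf = 1.0, 1.0, 1.0

# How much better the in-order core is per unit area and per watt:
perf_per_area_ratio = (lbio_perf / lbio_area) / (gboo_perf / gboo_area)
perf_per_watt_ratio = (lbio_perf / lbio_power) / (gboo_perf / gboo_power)

print(perf_per_area_ratio)  # 6.0: LBIO delivers 6x the performance per mm^2
print(perf_per_watt_ratio)  # 6.0: and 6x the performance per watt
```

Of course this says nothing about single-thread latency, which is the axis on which the GBOoO core's 2x matters.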

Also please see Ivan's quote of Andy Glew: "The dirty little secret about OOO is how little OOO there really is."

> Looking at the Itanium II, it's 180nm incarnation
> (McKinley) uses 421mm^2, and has 130W TDP, compared to the 180nm
> Celeron 800 with 90mm^2 die size and 20.8W TDP. The Itanium II still
> has a branch predictor, but it does get rid of the OoO transistors and
> has more functional units instead, like you suggested. But it is
> worse in performance, worse in area, and worse in power consumption.
>
> >I think static scheduling wins when datasets are huge, and I expect
> >problem datasets to be really, really huge when transistor budgets
> >stagnate.
>
> For data-parallel problems approaches like GPGPU and SIMD often work
> well. For most of the problems where they don't, static scheduling
> does not do so great, either. There may be cases where SIMD does not
> work, but in-order works well, but I think they are pretty rare (none
> come to my mind). Ok, you might say, maybe in-order is great for
> doing the control part of programs where SIMD works well; but looking
> at the history of Xeon Phi, even for CPUs designed for HPC stuff
> (mostly data-parallel) they went from in-order with wide SIMD
> Knights-Ferry/Corner to narrow OoO with wide SIMD (Knight's
> Landing/Mill), to eventually using their mainline wide OoO with wide
> SIMD instead of having specialized HPC CPUs; my guess is that with
> SIMD in play, the OoO overhead consumes a smaller proportion of the
> area, and even in HPC there are irregular codes (maybe sparse-matrix
> stuff) where OoO pays off.

Only some large datasets are amenable to SIMD. Agreed that OoO consumes a small proportion of the area in a SIMD-focused machine.

I was actually thinking of problems that require data from multiple sources which each require lots of domain-specific manipulation, ie, perfect for more cores working in parallel. I said large datasets; I meant complex problems with many large, disparate datasets.

And now I'm really feeling redundantly repetitive. It's apparent we disagree. How about we leave it there until there are concrete Mill benchmarks, eh?

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit

<sujf50$503$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23572&group=comp.arch#23572

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Wed, 16 Feb 2022 10:18:08 -0800
Organization: A noiseless patient Spider
Lines: 154
Message-ID: <sujf50$503$1@dont-email.me>
References: <suj84u$cjv$1@dont-email.me>
<memo.20220216170931.7708S@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 18:18:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dfe43dac9de6aed344ed8a533420abc6";
logging-data="5123"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/J1Uf7RtiL96K/8fM2oofRT/ryuAxYEfo="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:RfkcrD6p4vhc194mX66hJqzVcDI=
In-Reply-To: <memo.20220216170931.7708S@jgd.cix.co.uk>
Content-Language: en-US
 by: Stephen Fuld - Wed, 16 Feb 2022 18:18 UTC

On 2/16/2022 9:08 AM, John Dallman wrote:
> In article <suj84u$cjv$1@dont-email.me>, sfuld@alumni.cmu.edu.invalid
> (Stephen Fuld) wrote:
>
>> Microsoft sort of went with the S/38 model (different user
>> compatible interfaces for different HW), but didn't do any
>> automatic migration, e.g. from X86 to Alpha or MIPS)
>
> What are you thinking of here? Windows C/C++ and many other languages are
> straightforwardly compiled to native code.

Sure. The difference is in the "automatic" part. When you upgraded
from a proprietary CISC IBM S/38 to a Power based AS/400, the first time
you ran the program, the system automatically recompiled the
intermediate code (that was included with the object code) for the new
architecture. This led to the anecdotes mentioned in the AS/400 book
about customers complaining that their shiny new AS/400 was running much
slower than their old S/38 and IBM saying just rerun the program and it
went much faster. What was going on was the first run (unknown to the
customer) included the recompile time. When they ran it again, it ran
much faster (native code for the faster processor). BTW, this was part
of the reason for my suggestion at the end of my last post that the
source code be transparently kept with the object code. It would allow
transparent recompiles.
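The S/38/AS/400 scheme described above can be sketched as follows (all names here are hypothetical illustrations of the recompile-on-first-use idea, not IBM's actual implementation):

```python
# Sketch of the AS/400-style "recompile on first use" model.
# Program objects carry both native code (tagged with the ISA it was
# compiled for) and the architecture-neutral intermediate code.

CURRENT_ISA = "power"

class ProgramObject:
    def __init__(self, intermediate_code, native_isa, native_code):
        self.intermediate_code = intermediate_code  # always kept with the object
        self.native_isa = native_isa
        self.native_code = native_code

def translate(intermediate_code, isa):
    # Stand-in for the system translator; returns native code for `isa`.
    return f"<{isa} code for {intermediate_code}>"

def run(program):
    # First run on new hardware pays the (invisible to the user)
    # retranslation cost; later runs use the cached native code.
    if program.native_isa != CURRENT_ISA:
        program.native_code = translate(program.intermediate_code, CURRENT_ISA)
        program.native_isa = CURRENT_ISA
    return program.native_code

old = ProgramObject("payroll", "s38-cisc", "<s38-cisc code for payroll>")
run(old)   # slow: includes the retranslation (the anecdote's first run)
run(old)   # fast: native code is already current
```

This is exactly why the first run after a hardware upgrade looked slow to customers: the translation step was folded into it transparently.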

> .NET code can be portable between architectures if it is "pure" but quite
> a lot of it is not. Making "pure" code that interacts with native code is
> quite hard, and that problem was largely dealt with by waiting for
> Microsoft to add interface libraries that are part of the
> platform-specific .NET runtime.

Yeah, I really didn't consider .NET in my analysis. :-( I guess it is
sort of like using the JVM as a portable intermediate language.

> 32-bit x86 Windows software runs on 64-bit x86 via a system call
> translation layer. x86 software runs on ARM (and did on Itanium) via
> binary-translator-emulators.

Yup. And S/360 and descendants allow old code to run on current
hardware without even that. But such upgrades are limited to "similar"
ISAs, not the radically different ISAs of the S/38-AS/400 or the various
Mac processor transitions.

>
> Porting native-code Windows software from 32-bit to 64-bit, and 64-bit to
> ARM64 isn't especially difficult, but it requires a porting effort, and
> producing separate builds afterwards.

Agreed.

>
>> Apple had complete control over the Mac environment and went with
>> essentially #2, but without the automatic recompiles. This allowed
>> major changes to the underlying hardware, mostly without too many
>> problems.
>
> Again, what are you thinking of here? Objective-C/C++ and Swift are
> compiled to native code, as are most other programming languages on Macs.
> Apple have provided binary-translator-emulators to ease their hardware
> transitions (68000 to PowerPC, PowerPC to Intel, Intel to ARM).

Yes. Again, the difference is the automatic part and the use of source
instead of some intermediate code as the "unit of translation".

I was just trying to point out that various vendors produced alternative
mechanisms than the S/38 did, but perhaps not as clean or automatic.
This was to answer John's original question of why other vendors didn't
adopt the S/38 model.

>
> Porting macOS software from PowerPC-32 to Intel64, and Intel64 to ARM64
> isn't especially difficult, but it requires a porting effort, and
> producing separate builds afterwards.

Agreed.

>
>> Apple again, but this time with the iPhone. Seems like mostly #4.
>
> I suspect you are thinking of Android software, where writing in Kotlin
> or Java and compiling that to a platform-independent binary is encouraged,
> but is not compulsory.
>
> iPhone software is much like Mac software at the binary level, although
> some of the APIs are quite different. The only platform transition has
> been ARM32 to ARM64, and that's required porting efforts.
>
> Apple do require you to provide a platform-independent binary form, based
> on LLVM bitcode, for watchOS and tvOS software, so that they can
> re-compile it for new platforms, but that's confined to those two
> operating systems at present.

Interesting. I didn't know that. Thanks for telling us about it. It
seems they are moving toward the S/38 model, which is some reinforcement
for the argument that they can do this because they have control over
the complete HW/SW environment.

>
>> While not a "company", Linux and the GCC programs essentially went
>> with a different variant. They used their version of C as
>> essentially a public (as opposed to S/38's proprietary)
>> intermediate language, and expect small changes and recompiles for
>> different underlying architectures.
>
> That's how Microsoft and Apple treat most programming languages.

Agreed, of course.

>
>> While most hardware CPU vendors have little interest in making it
>> easy to migrate to non compatible future hardware, they have great
>> interest in allowing easy migration to their future CPUs. They
>> accomplish this by doing transparently some of what the Mill
>> requires a respecialization for (i.e. increasing the number of FUs),
>> and providing "compatibility" modes such as what allows you to run
>> a 30 year old S/360 program on a current system.
>
> To do this, they have to freeze the design of the architecturally visible
> state, at least on a per-mode basis. Does the Mill actually have any
> visible state?

I am not sure what you mean here. In a sense, Mill has more visible
state than X86. On an X86, a vendor can change say the number of FUs
and the same object code will run as before, perhaps even faster.
Adding FUs (going to a larger model) on a Mill requires a (transparent)
respecialization.

>
>> However, I do think it might be a good idea for a system to
>> automatically keep the source code with the object code to allow
>> for automatic recompiles.
>
> That will make it quite difficult to have third-party proprietary
> software. This may be acceptable for many applications, of course.

Agreed. But I am not sure that, aside from the effort to understand it,
having an intermediate language, such as you are saying the Apple Watch
uses, is much better. But that is beyond my expertise, so this is
speculation. Perhaps it is a contributing reason why other vendors
didn't follow the S/38 model.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)

<jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=23573&group=comp.arch#23573

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)
Date: Wed, 16 Feb 2022 13:24:39 -0500
Organization: A noiseless patient Spider
Lines: 12
Message-ID: <jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
<2022Feb16.133409@mips.complang.tuwien.ac.at>
<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="78f761c51fcb7be8c116fec35f477bf1";
logging-data="3391"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19F42U9xl+N7oQbHiuwRqPb"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:fgLMKb8Qv/G4f+kA6DcB27vnUe4=
sha1:xXiHUyfX2IOybyZ1+G/xAxxwMxw=
 by: Stefan Monnier - Wed, 16 Feb 2022 18:24 UTC

> B. Your conclusion is correct based on assuming that single-thread
> performance is all that matters, ignoring power and die area. What if the
> A73 design had been limited to the number of transistors (or more
> importantly die area since bigger transistors could give an illusory
> performance improvement if just the quantity is constant) in an A53?

So, maybe the better question is: what kind of future process
constraints could bring the trade-offs back in favor of
in-order designs?

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<f4e11bbb-7a4a-4038-89a3-08d5e5979b0dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23575&group=comp.arch#23575

X-Received: by 2002:a5d:64ac:0:b0:1e7:1415:2548 with SMTP id m12-20020a5d64ac000000b001e714152548mr3325839wrp.267.1645036134354;
Wed, 16 Feb 2022 10:28:54 -0800 (PST)
X-Received: by 2002:a05:6808:1208:b0:2d4:419d:8463 with SMTP id
a8-20020a056808120800b002d4419d8463mr1280711oil.227.1645036133713; Wed, 16
Feb 2022 10:28:53 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:28:53 -0800 (PST)
In-Reply-To: <suiblj$euv$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3db1:25d2:322a:440e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3db1:25d2:322a:440e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me>
<suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<sudb0g$rq3$1@dont-email.me> <jwv7d9xh0dt.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.114353@mips.complang.tuwien.ac.at> <sug2j8$bid$1@newsreader4.netcologne.de>
<2022Feb15.191558@mips.complang.tuwien.ac.at> <suiblj$euv$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f4e11bbb-7a4a-4038-89a3-08d5e5979b0dn@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 18:28:54 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 41
 by: MitchAlsup - Wed, 16 Feb 2022 18:28 UTC

On Wednesday, February 16, 2022 at 2:12:38 AM UTC-6, Terje Mathisen wrote:
> Anton Ertl wrote:
> > Thomas Koenig <tko...@netcologne.de> writes:
> >> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>
> >>> * Before people make assumptions about what I mean with "software
> >>> crisis": When the software cost is higher than the hardware cost,
> >>> the software crisis reigns. This has been the case for much of the
> >>> software for several decades.
> >>
> >> Two reasonable dates for that: 1957 (the first Fortran compiler) or
> >> 1964, when the /360 demonstrated for all to see that software (especially
> >> compatibility) was more important than any particular hardware.
> >
> > Yes, the term "software crisis" is from 1968, but programming language
> > implementations demonstrated already before Fortran (and actually more
> > so, because pre-Fortran languages had less implementation investment
> > and did not utilize the hardware as well) that there is a world where
> > software cost is more relevant than hardware cost. But of course at
>
> This is a very big world actually, i.e. see all the Python code running
> on huge cloud instances even though they require 10-100 times as much
> resources as the same algorithms implemented in Rust. The main exception
> is the usual one, i.e. where the Python code (or some other scripting
> language) is just a thin glue layer tying together low-level libraries
> written in C(++), like in numpy.
<
This sounds exactly like the cost of medicine problem we have over here.
<
to wit::
<
Patient does not see what the medical facility billed insurance company.
Patient only sees monthly insurance bill.
<
So, why should patient care if it cost insurance company $10, or $10,000 ?
Patient only sees monthly insurance bill.
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
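Terje's exception upthread, scripting as a thin glue layer over compiled libraries, is easy to illustrate; in the second function the loop runs inside numpy's compiled C code rather than the Python interpreter (a sketch; actual speedups vary wildly by workload):

```python
import numpy as np

def py_sum_of_squares(n):
    # Pure-Python loop: every iteration goes through the interpreter.
    total = 0
    for i in range(n):
        total += i * i
    return total

def np_sum_of_squares(n):
    # Same computation, but the inner loop executes in numpy's C code;
    # Python is only the thin glue layer Terje describes.
    a = np.arange(n, dtype=np.int64)
    return int((a * a).sum())

assert py_sum_of_squares(10_000) == np_sum_of_squares(10_000)
```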

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<5b4e9b30-6e45-4e6c-952c-5f4ce5204ed0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23576&group=comp.arch#23576

X-Received: by 2002:a5d:4523:0:b0:1e4:ac79:7c25 with SMTP id j3-20020a5d4523000000b001e4ac797c25mr3208512wra.507.1645036260129;
Wed, 16 Feb 2022 10:31:00 -0800 (PST)
X-Received: by 2002:a05:6870:b017:b0:ce:c0c9:673 with SMTP id
y23-20020a056870b01700b000cec0c90673mr962021oae.197.1645036259510; Wed, 16
Feb 2022 10:30:59 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 10:30:59 -0800 (PST)
In-Reply-To: <suifjt$uuo$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3db1:25d2:322a:440e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3db1:25d2:322a:440e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me>
<suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<sudb0g$rq3$1@dont-email.me> <2022Feb15.104639@mips.complang.tuwien.ac.at>
<sug7bd$b9l$1@dont-email.me> <suh3kq$3pd$3@newsreader4.netcologne.de>
<suh5d3$po5$1@dont-email.me> <suh8l5$7e8$1@newsreader4.netcologne.de>
<suhbil$1gq$1@dont-email.me> <suifjt$uuo$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5b4e9b30-6e45-4e6c-952c-5f4ce5204ed0n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 18:31:00 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 16 Feb 2022 18:30 UTC

On Wednesday, February 16, 2022 at 3:20:00 AM UTC-6, Thomas Koenig wrote:
> Ivan Godard <iv...@millcomputing.com> schrieb:
> > The true member-dependent parts are small. We're not worried about slow
> > JITting.
> It will exclude the Mill from many applications where JIT
> is important, such as browsers, programs built on browser
> infrastructure such (Electron is used for Teams, IIRC) and
> commercial applications which use a lot of Java.
<
I have turned JavaScript off on so many URLs that I practically don't
run JITs in my browser!
>
> That is, of course, your (business) decision to make.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sujg0c$lf8$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23577&group=comp.arch#23577

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Wed, 16 Feb 2022 12:32:42 -0600
Organization: A noiseless patient Spider
Lines: 122
Message-ID: <sujg0c$lf8$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org>
<suigkc$nkv$1@dont-email.me> <vG8PJ.15176$GjY3.1981@fx01.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 16 Feb 2022 18:32:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c2fca33c8e4c63ea697ef74965f89bcd";
logging-data="21992"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ziwpcembNZBukjcDOflBN"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:AWS/aKno+7dffv8BJgK2c/aBVJU=
In-Reply-To: <vG8PJ.15176$GjY3.1981@fx01.iad>
Content-Language: en-US
 by: BGB - Wed, 16 Feb 2022 18:32 UTC

On 2/16/2022 9:23 AM, EricP wrote:
> BGB wrote:
>> On 2/16/2022 2:48 AM, Terje Mathisen wrote:
>>> Thomas Koenig wrote:
>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>> And because compiler
>>>>> branch prediction (~10% miss rate)
>>>>
>>>> That seems optimistic.
>>>>
>>>>> is much worse than dynamic branch
>>>>> prediction (~1% miss rate, both numbers vary strongly with the
>>>>> application, so take them with a grain of salt),
>>>>
>>>> What is the branch miss rate on a binary search, or a sort?
>>>> Should be close to 50%, correct?
>>>>
>>> Or closer to zero, if you implement your qsort with predicated
>>> left/right pointer updates and data swaps.
>>>
>>> Same for a binary search, the left/right boundary updates can be
>>> predicated allowing you to blindly run log2(N) iterations and pick
>>> the remaining item. I.e. change it from a code state machine to a
>>> data state machine because dependent loads can run at 2-3
>>> cycles/iteration while branch misses cost you 5-20 cycles.
>>>
>>
>> Yeah, if the ISA does predicated instructions, and the compiler uses
>> them, pretty much all of the internal "if()" branches in a typical
>> sorting function can be expressed branch-free.
>>
>>
>> One can potentially modulo schedule them as well, though this requires
>> an ISA with multiple predicate bits (BJX2 has 1 or 2 bits, where
>> support for predicated ops using SR.S is an optional feature, though 3
>> or 4 bits could be better).
>
> Mitch's My66K predicate value state flags are implied by the predicate
> instruction shadow and tracked internally, not architectural ISA
> registers like Itanium. To have multiple predicates in flight at once
> just use multiple PRED instructions.
>
>

Yeah, a PRED prefix/instruction represents a somewhat different approach
than what I was using.

Something like a PRED instruction would also likely add internal state
that would need to be dealt with during interrupts or similar, and
doesn't have an obvious way to "wrap around" in a modulo loop.

OK. In my case, they are the low 2 bits of the status register.

The normal predicated encodings are hard-wired to SR.T (for the ?T / ?F
modes), and normally, CMPxx and similar also update this bit.

The alternate bit (SR.S) is encoded via Op64 instructions.
W.w: May select SR.S for Ez encodings (as predicate);
W.m: May select SR.S as output for "CMPxx Imm, Rn"
W.i: May select SR.S as output for "CMPxx Rm, Rn"

In other contexts, these bits mostly serve to extend the register fields
to 6 bits and similar.

But, doing multi-bit predication in a "not quite so bit-soup" strategy
would require ~ 4 or 5 bits of entropy.

Normally, this prefix effectively glues 24 bits onto the instruction.
In the newly added bundle case, it merely adds 12, internally
repacking/padding the bits as-if it were the 24-bit prefix, for each
instruction.

It adds 5 control bits, and 7 bits which may be used as one of:
An extension to the immediate (3RI);
A 4th register field (4R);
An additional immediate (4RI);
Potentially, as more opcode bits (2R/3R).

Many of the bits designated for opcode extension are filled with zeroes
in this case (and, the immediate-field is sign-filled to its length in
the original Op64 encoding).

Indirectly, it allows for bundles where R32..R63 and predicated
instructions can be used at the same time, otherwise:
Predicated instructions using R32..R63 require 64-bits to encode;
They may not be used in bundles.

There is a 32-bit encoding which has R32..R63, but can not encode
predicated forms of these instructions.

No effect on existing code, because previously using an Op64 prefix in a
bundle encoding wasn't a valid encoding. It also only works for
encodings in certain ranges, ...

This is getting a little ugly though...

And turning into a non-orthogonal mix-and-match game (with hacks layered
on top of hacks).

Though, I am operating within the limits of stuff I can do without
breaking binary compatibility with existing code.

It is also unclear how I could do things much "better" without moving
over to a larger instruction size for base encodings (otherwise it is
seeming like a "mix and match game" may well be inevitable).

Then again, arguably I "could" widen fetch/decode to 128 bits, to allow
for something like:
Op64 - OpB | Op64 - OpA
Or (how it would have been encoded in the WEX6W idea):
Op64 | Op64 | OpB | OpA

This can not currently be handled within the existing pipeline.

....
