devel / comp.arch / Re: Hoisting load issue out of functions; was: instruction set

Subject / Author
* Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
+* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
|`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| +* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |+- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| | `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | +* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | ||+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    +* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    | `* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsBernd Linsel
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsBrian G. Lucas
| |  | | |    |   |   |  |`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  | `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |     |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  |     | +- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   |  |     | `- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  |     |`- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128John Levine
| |  | | |    |   |   |  |     |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| ||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| ||  +- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| ||  `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| ||    `* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||     |`- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     `- Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| |`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||  `* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||    +- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||    `- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| | |+- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | |`- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| |  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| `* Re: instruction set binding time, was Encoding 20 and 40 bitQuadibloc
| |  | | |    |   |   |  |     |||  +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||  ||||`* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| |+- Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| |`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  |||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| +- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128James Van Buskirk
| |  | | |    |   |   |  |     |||  |||| `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  |||+* Statically scheduled plus run ahead.Brett
| |  | | |    |   |   |  |     |||  |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  ||+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  ||`* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     |||  |`* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |`* Re: instruction set binding time, was Encoding 20 and 40 bitStephen Fuld
| |  | | |    |   |   |  |     `* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | `- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
`- Re: Encoding 20 and 40 bit instructions in 128 bitsPaul A. Clayton

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 1

<suucd7$uns$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=23718&group=comp.arch#23718

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 1
Date: Sun, 20 Feb 2022 21:38:47 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suucd7$uns$1@newsreader4.netcologne.de>
References: <2022Feb20.192357@mips.complang.tuwien.ac.at>
<memo.20220220201055.7708v@jgd.cix.co.uk>
Injection-Date: Sun, 20 Feb 2022 21:38:47 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="31484"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 20 Feb 2022 21:38 UTC

John Dallman <jgd@cix.co.uk> schrieb:
> In article <2022Feb20.192357@mips.complang.tuwien.ac.at>,
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> BGB <cr88192@gmail.com> writes
>> >One is more limited by how effectively one can parallelize the
>> >codebase, but if one is assuming an extended timeframe where no
>> >other hardware-level performance improvements are viable,
>> >programmers will "make it work" (IOW: if one assumes that the only
>> >other option is multiple decades of stagnation).
>>
>> We have seen ~17 years of stagnation since the introduction of
>> multi-cores, and I see no signs of that changing in the foreseeable
>> future.
>
> It isn't total stagnation, some codebases do get slowly upgraded. But
> it's patchy and slow.

Depends on the application.

CFD codes have been a driver for multi-core applications, and they
are quite good at it. CFD is usually memory-limited, and is unusual
in that its performance actually degrades under hyperthreading,
because of the additional overhead induced by running more processes.

> Re-writing for multi-core has the problem that scaling by core count is
> what people want, but is distinctly harder than, for example, re-writing
> to make good use of two threads but no more. The advent of "asymmetric
> multi-processing" (a mix of large fast and small slower) cores makes it
> harder. Now you need more information about the machine to decide what to
> do for best performance, plus the long-standing assumption that all cores
> were equal has been busted.
>
> Putting in lots of effort to improve multi-core scaling appeals less to
> corporate product managers than adding new features.

It's not easy to do. Parallel programming is much more difficult
than sequential programming, and suffers from Amdahl's law.
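For reference, Amdahl's law: if a fraction p of the runtime can be
parallelized across N cores, the speedup is bounded by

  S(N) = 1 / ((1 - p) + p/N)

so even p = 0.95 caps the speedup at 20x, no matter how many cores
you add.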

One question is the programming model. My favourite one is PGAS,
which Fortran implements (I've written a short and incomplete
tutorial on coarrays because they are cool :-).

Some applications (games, obviously, but also some scientific
applications) profit enormously from graphics cards. These are good
for embarrassingly parallel code like the Discrete Element Method,
and have become much easier to program.

One area that would profit a lot from more parallelization is
compilers, or more specifically, link-time optimization when a whole
humungous program like Firefox is run through the compiler at once.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<ygnmtilrx59.fsf@y.z>


https://www.novabbs.com/devel/article-flat.php?id=23719&group=comp.arch#23719

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx05.iad.POSTED!not-for-mail
From: x...@y.z (Josh Vanderhoof)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
<2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<ygn1qzzzs9p.fsf@y.z>
<3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>
<ygno832ts92.fsf@y.z> <2esQJ.20734$Wwf9.7208@fx23.iad>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
Reply-To: Josh Vanderhoof <jlv@mxsimulator.com>
Message-ID: <ygnmtilrx59.fsf@y.z>
Cancel-Lock: sha1:sRyxnfhQpx/evKLz0xRZymusZkU=
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 56
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Sun, 20 Feb 2022 22:28:03 UTC
Date: Sun, 20 Feb 2022 17:28:02 -0500
X-Received-Bytes: 4092
 by: Josh Vanderhoof - Sun, 20 Feb 2022 22:28 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:

> Josh Vanderhoof wrote:
>> MitchAlsup <MitchAlsup@aol.com> writes:
>>
>>> On Friday, February 18, 2022 at 5:09:29 PM UTC-6, Josh Vanderhoof wrote:
>>>> MitchAlsup <Mitch...@aol.com> writes:
>>>>
>>>>> On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc
>>>>> wrote:
>>>>>> There is no way that a compiler can take the place of
>>>>>> out-of-order circuitry for allowing a computer to deal with
>>>>>> cache misses. These aren't predictable.
>>>>> <
>>>>> Actually, they are predictable in HW.
>>>>> <
>>>>> Say we have a loop that does 2 LDDs and one STD: and we run the
>>>>> loop "lots" of times;
>>>>> LDD[1] takes a miss every 8 cycles in iteration MOD 3
>>>>> LDD[2] takes a miss every 8 cycles in iteration MOD 6
>>>>> ST....... takes a miss every 8 cycles in iteration MOD 1
>>>>> <
>>>>> HW can handle this perfectly after a couple of 16+ loops, and it
>>>>> does not have to be OoO HW to do this. It could be detected between
>>>>> L1<->L2 and just schedule 3 prefetches every 8×latency-of-loop.
>>>> Say you have a loop with unpredictable cache misses like this:
>>>>
>>>> for (i = 0; i < n; i++) d[i] = f(s[get_addr(i)]);
>>> <
>>> MOV Ri,#0
>>> for_loop:
>>> LDD R1,[Rsp+Ri<<2]
>>> CALL f
>>> ADD Ri,Ri,#1
>>> CMP Rt,Ri,Rn
>>> BLT for_loop
>>> <
>>> did you mean to call a function ?
>>
>> Yes, the function doesn't matter and I assume it would have to be inline
>> for this to work. What I mean is the loop creates a hard-to-predict
>> address, "get_addr()", loads a value from that address, does some work,
>> "f()", with that value and then stores the result. Any time the load
>> misses the cache it's going to have to sit and wait for the value,
>> whereas an OoO machine would keep busy working on f() for the next loop
>> iteration that hit the cache.
>>
>> Would it be worthwhile to break it into 2 loops, one that works on data
>> already in the cache and skips/prefetches the uncached data, and then a
>> second loop that runs on the data that the previous loop skipped, which
>> has hopefully now arrived in the cache due to the prefetches in the
>> previous loop?
>
> That's basically what scout threads do.
>
> https://en.wikipedia.org/wiki/Hardware_scout

Not really, since this would be only one thread and wouldn't discard any
results. It's basically just reordering the loop to run the cache-missing
iterations last.
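A minimal C sketch of that reordering, reusing the names from the
example above (would_miss() and MAX_N are hypothetical; there is no
portable way to ask whether a line is cached; __builtin_prefetch is
the GCC/Clang intrinsic):

/* Two-pass version of: for (i = 0; i < n; i++) d[i] = f(s[get_addr(i)]);
   Pass 1 handles elements believed to be cached and prefetches the rest;
   pass 2 revisits the skipped elements once the prefetches have had time
   to complete. */
size_t skipped[MAX_N];
size_t nskipped = 0;

for (size_t i = 0; i < n; i++) {
    size_t a = get_addr(i);
    if (would_miss(&s[a])) {            /* hypothetical predicate  */
        __builtin_prefetch(&s[a]);      /* start the miss early    */
        skipped[nskipped++] = i;        /* defer to pass 2         */
    } else {
        d[i] = f(s[a]);                 /* cache hit: do it now    */
    }
}
for (size_t j = 0; j < nskipped; j++) { /* former misses, hopefully now hits */
    size_t i = skipped[j];
    d[i] = f(s[get_addr(i)]);
}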

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwvh78tp24r.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=23720&group=comp.arch#23720

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Sun, 20 Feb 2022 18:11:33 -0500
Organization: A noiseless patient Spider
Lines: 12
Message-ID: <jwvh78tp24r.fsf-monnier+comp.arch@gnu.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
<2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<ygn1qzzzs9p.fsf@y.z>
<3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>
<ygno832ts92.fsf@y.z>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="fec8af2f0403aa4cf94fab96e5b00f54";
logging-data="31334"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FXvNPXacNISpOJJ8UDDts"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:IW26fQRKxTTmjHv+SqLiO/7nTQs=
sha1:wNM7/3G/ptLvJY0R0XoEgJDFoAM=
 by: Stefan Monnier - Sun, 20 Feb 2022 23:11 UTC

> Would it be worthwhile to break it into 2 loops, one that works on data
> already in the cache and skips/prefetches the uncached data, and then a
> second loop that runs on the data that the previous loop skipped, which
> has hopefully now arrived in the cache due to the prefetches in the
> previous loop?

For OoO CPUs that have a window large enough to withstand the cache
misses' latency, that can already happen without any special effort on
the programmer & compiler's side.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit

<suus16$t33$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23721&group=comp.arch#23721

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Sun, 20 Feb 2022 21:05:04 -0500
Organization: A noiseless patient Spider
Lines: 267
Message-ID: <suus16$t33$1@dont-email.me>
References: <sufpjo$oij$1@dont-email.me>
<memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me>
<2022Feb18.100825@mips.complang.tuwien.ac.at> <suoabu$rfk$1@dont-email.me>
<5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
<suoiqc$8h4$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 21 Feb 2022 02:05:26 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="37b6ac379acde438f6e3d6219ed09064";
logging-data="29795"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+S/nxyXr7y1TwxMiIn9AQ97pylCiZzw9I="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
Thunderbird/60.0
Cancel-Lock: sha1:4lxkV/KvfzWHEkOEF+rIL3S1oeI=
In-Reply-To: <suoiqc$8h4$2@dont-email.me>
 by: Paul A. Clayton - Mon, 21 Feb 2022 02:05 UTC

Ivan Godard wrote:
> On 2/18/2022 8:42 AM, Paul A. Clayton wrote:
>> On Friday, February 18, 2022 at 9:27:14 AM UTC-5, Ivan Godard
>> wrote:
>> [snip]
>>> Still,
>>> in principle an OOO can issue a load from inside a called body
>>> before
>>> the call itself has been issued, and a Mill can't do that.
>>
>> In theory a caller could start a pick up load that is picked up
>> by the callee. This would be similar to function specific
>> interfaces rather than using a generic ABI. It is not obvious
>> how useful this would be.
>
> The architecture does not permit that - each frame has a distinct
> set of load retire stations. While that could be changed, it's not
> clear that, given the limited applicability to calls which were
> statically resolvable (i.e. not polymorphic), the game would be
> worth the candle. That case can be addressed without architecture
> change using out-lining, too.

My proposal would "merely" allow load retire stations to be
treated similarly to callee saved registers. This, as with any
flexibility, would increase complexity, but I have no clue how
much hardware and the compiler/specializer would be affected.

Even for polymorphic calls, there might be some cases where a
load could be started early. If every callee loads a fixed-size
member (if I remember correctly, the Mill only distinguishes by
size and is_pointer, not between bit vector, integer, or FP of a
given size/SIMD count) with a common offset, the caller could
initiate such a load without knowledge of which function will use
the result.

Passing the load result as an additional argument would have a
similar benefit (using a load that drops just before the function
call), but such would not allow the delay of the called function's
use of the value to hide latency — an instruction cache miss could
provide significant extra delay where this memory-level
parallelism could be useful. Even a load value used two cycles
into a function might be a use case for a 3-cycle data cache, as
even initiating the load in the first cycle of the called function
might introduce one all-NOP cycle (or at least reduce the
execution parallelism).

An ordinary deferred load that is immediately passed as an
argument will stall the pipeline if the load delay was excessive
(e.g., an L1 miss) despite actual use potentially being later. A
prefetch would avoid the too-early stall but would not allow the
load itself to be done. (One could support prefetching data smaller
than an L1 cache block into a special very fast cache, but a pick-up
load could be lower latency and lower power [just a present
check versus a tag check] and possibly slightly smaller code size.)

If such highly deferred loads were somewhat common, tracking
resources might be moved farther from the heavy action of the core,
allowing more timing-critical hardware to be closer. Utilization
of such a distanced structure might be increased by also utilizing
it for prefetches (both hardware and software initiated), perhaps.

I have *no idea* if such would be worthwhile. I only observe that
stopping loads at the call boundary prevents using some natural
delay to further hide load latency.

>>> Put all this together and the Mill ISA has eliminated all the OOO
>>> advantages except: OOO will excel on 1) frequently called 2)
>>> polymorphic
>>> functions that 3) immediately execute a load whose address
>>> arguments are
>>> 4) ready and that 5) misses when 6) the hardware is not bandwidth
>>> limited. For that case a Mill will stall in the callee and an
>>> OOO likely
>>> will not.
>>
>> I am not convinced this is the case, but writing on Google
>> groups from a tablet is painful.

I still need to think about what other parallelism opportunities
OoO can exploit that a statically scheduled processor cannot. An
OoO processor can initiate other loads past a load that fails to
complete on schedule. In theory, one could use informing loads — a
cache miss/excessive delay sets a condition which can be used for
branches — to branch to alternative work on a cache miss, but it
is not clear how useful such would be. (On the Mill, a pick-up
load might be slightly extended to provide a branch target on a
pick-up schedule failure, but I suspect that a more orthogonal
branching interface might be better, allowing one alternate path
branch to be used for multiple loads.)

It seems the compiler would have a difficult job in deciding
whether to eagerly execute loads and how far in advance to
schedule loads. Hoisting a load too early wastes load tracking
resources. With respect to eager execution, executing an unused
load is somewhat expensive even on a cache hit (energy and
throughput for loads seem to be better than for multiplications but
much worse than for simple ALU operations); on a cache miss — at
least when this does not effectively become a prefetch for a later
access — this could be especially expensive.

Generating memory level parallelism (such that the latency of one
miss is partially covered by the latency of another) given
indirections and conditional guards seems likely to be
challenging. (In theory, a Mill implementation could cancel a load
that turned out to generate a dead value, perhaps using a simple
runahead to detect whether a NaResult-marked result [for a load
value not available in time] actually commits, in which case an
exception would return to the load-dropping cycle and wait for the
load to return a value. For a single load miss, the runahead could
even commit once guarding conditions have been resolved if the load
was not used; more complex conditions might make such too complex.
I think indicating that such a load is predicated may help hardware.)

>>> We decided that this case was not worth 12x.
>>
>> I doubt the Mill will get a 12x PPA advantage over an OoO
>> implementation of a conventional ISA like AArch64.
>
> 12x is not my number (see Mitch), and in any case we have given up
> some of whatever it is to pay for width.

Mitch's 12x core (including L1) area and power advantage for a
scalar in-order core came with a 2x performance disadvantage. (He
also mentioned that system power more nearly tracks performance.
OoO speculation may generate excessive cache/interconnect and
memory traffic, but such is probably not especially great.
Hardware prefetching may have worse misspeculation energy cost.
How much of the chip power is used by such "system" activity — L2
accesses, memory controllers, etc. — would seem to be workload
dependent. Even including L2 caches in core area, it seems that
more than half of the chip _area_ is used by non-core hardware for
most great-big-OoO designs without any special accelerators and
with significant FP SIMD 'bloating' the cores; Intel's Golden Cove
core's 1.25 MiB L2 seems to be about a quarter of that "core"
area, while AMD's Zen2 L2 looks to use about a sixth of the "core"
area.)

If a Mill's processing hardware is much more area efficient for a
given level of performance, system components will be a larger
fraction of chip area. System hardware activity would be more
difficult to compare. The Mill's fine-grained validity seems
likely to make coherence a little more expensive, particularly
with guarantees of sequential consistency (if such is still
expected to be provided). Avoiding read for ownership will provide
some performance and energy advantage, but the read-for-ownership
advantage of backless stores could — I believe — be provided to
conventional ISAs without application changes (avoiding zero
initialization would require application changes).

In general increasing performance becomes increasingly more
difficult in area and energy costs (as well as design cost);
expecting linear scaling of power-area with performance seems
rather unlikely. Communication costs seem problematic. While
functional units in a single core have considerably more physical
locality than cores in a chip multiprocessor, the typical
expectation of communication is also much higher. Even an 8-wide
processor is expected to be able to route all 8 results to all 8
functional units — the Mill does avoid the additional
communication paths with register storage (though the belt
presumably needs to be a little larger than the normal forwarding
hardware). Data-level parallelism (vector/SIMD) simplifies
communication in avoiding 'lane crossing', but even for ordinary
communication patterns all-to-all forwarding seems considerable
'overkill'.

While the Mill has some features that would provide better
performance such as virtually addressed caches and dual-path
instruction fetch (and Mitch has suggested that with limited
additional hardware a normally scalar processor could dual issue
often enough to significantly improve performance, implying that
the 12x was not necessarily the best case), even getting 6x PPA **with
similar performance** seems unlikely.


Re: instruction set binding time, was Encoding 20 and 40 bit

<suvb01$k6e$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=23723&group=comp.arch#23723

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Mon, 21 Feb 2022 06:20:49 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suvb01$k6e$1@newsreader4.netcologne.de>
References: <sufpjo$oij$1@dont-email.me>
<memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me>
<2022Feb18.100825@mips.complang.tuwien.ac.at> <suoabu$rfk$1@dont-email.me>
<5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
<suoiqc$8h4$2@dont-email.me> <suus16$t33$1@dont-email.me>
Injection-Date: Mon, 21 Feb 2022 06:20:49 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="20686"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 21 Feb 2022 06:20 UTC

Paul A. Clayton <paaronclayton@gmail.com> schrieb:
> Ivan Godard wrote:
>> On 2/18/2022 8:42 AM, Paul A. Clayton wrote:
>>> On Friday, February 18, 2022 at 9:27:14 AM UTC-5, Ivan Godard
>>> wrote:
>>> [snip]
>>>> Still,
>>>> in principle an OOO can issue a load from inside a called body
>>>> before
>>>> the call itself has been issued, and a Mill can't do that.
>>>
>>> In theory a caller could start a pick up load that is picked up
>>> by the callee. This would be similar to function specific
>>> interfaces rather than using a generic ABI. It is not obvious
>>> how useful this would be.
>>
>> The architecture does not permit that - each frame has a distinct
>> set of load retire stations. While that could be changed, it's not
>> clear that, given the limited applicability to calls which were
> >> statically resolvable (i.e. not polymorphic), the game would be
>> worth the candle. That case can be addressed without architecture
>> change using out-lining, too.
>
> My proposal would "merely" allow load retire stations to be
> treated similarly to callee saved registers. This, as with any
> flexibility, would increase complexity, but I have no clue how
> much hardware and the compiler/specializer would be affected.

Maybe I'm confused, but how would this (in a conventional
architecture) be different from issuing a load into an argument
register prior to the call, then calling the function?
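In C terms, a minimal sketch of that alternative (all names here are
illustrative):

/* Callee does the load itself: the miss is discovered inside f(). */
int f_ptr(const int *p) { return *p + 1; }

/* Caller hoists the load: the miss starts before the call. */
int f_val(int v) { return v + 1; }

int caller(const int *p) {
    int v = *p;          /* load issues here, ahead of the call */
    return f_val(v);     /* callee receives the value by value  */
}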

Re: instruction set binding time, was Encoding 20 and 40 bit

<83317115-3895-443c-999b-607edb61b52an@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23724&group=comp.arch#23724

Newsgroups: comp.arch
X-Received: by 2002:adf:bc14:0:b0:1e2:b035:9c46 with SMTP id s20-20020adfbc14000000b001e2b0359c46mr14345907wrg.386.1645424734780;
Sun, 20 Feb 2022 22:25:34 -0800 (PST)
X-Received: by 2002:a05:6808:138f:b0:2d5:1ef0:7987 with SMTP id
c15-20020a056808138f00b002d51ef07987mr622892oiw.261.1645424734119; Sun, 20
Feb 2022 22:25:34 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 20 Feb 2022 22:25:33 -0800 (PST)
In-Reply-To: <suus16$t33$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:c98:f6c1:c627:a11d;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:c98:f6c1:c627:a11d
References: <sufpjo$oij$1@dont-email.me> <memo.20220215170019.7708M@jgd.cix.co.uk>
<suh8bp$ce8$1@dont-email.me> <2022Feb18.100825@mips.complang.tuwien.ac.at>
<suoabu$rfk$1@dont-email.me> <5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
<suoiqc$8h4$2@dont-email.me> <suus16$t33$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <83317115-3895-443c-999b-607edb61b52an@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 21 Feb 2022 06:25:34 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Mon, 21 Feb 2022 06:25 UTC

On Sunday, February 20, 2022 at 7:05:29 PM UTC-7, Paul A. Clayton wrote:

> Mitch's 12x core (including L1) area and power advantage for a
> scalar in-order core came with a 2x performance disadvantage.

And, since parallel programming is so difficult, and more single-thread
performance is more valuable than gold, people are gladly paying the 12x
cost of die area and power consumption for a GBOoO core.

I agree that being able to do parallel programming better would be
very useful. I disagree that it's likely or even possible that we will
ever manage to improve our ability to do parallel programming enough
to be so useful as to dim our ardor for the GBOoO core. I think the
problem here is a *fundamental* one; sometimes you need to know A
before you can do B.

The only possibility I see is if we started making microprocessors out
of some material that let us build faster transistors and shorter
interconnects, so that we could take an in-order core, and make it 2x
or more faster more cheaply than by building an OoO core. Then we
wouldn't have wasteful GBOoO cores -- until such time as our yields
in the new material improved enough to let us build them.

So I guess there is truly no escape.

John Savard

Hoisting load issue out of functions; was: instruction set binding time

<suvjue$gep$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23725&group=comp.arch#23725

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Hoisting load issue out of functions; was: instruction set binding
time
Date: Mon, 21 Feb 2022 00:53:34 -0800
Organization: A noiseless patient Spider
Lines: 170
Message-ID: <suvjue$gep$1@dont-email.me>
References: <sufpjo$oij$1@dont-email.me>
<memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me>
<2022Feb18.100825@mips.complang.tuwien.ac.at> <suoabu$rfk$1@dont-email.me>
<5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
<suoiqc$8h4$2@dont-email.me> <suus16$t33$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 21 Feb 2022 08:53:35 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="292ea7d621bd2f9ee11298ee908c5ea5";
logging-data="16857"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+2BmBZAkauwztNmNxV/t5J"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:tRzEr7hhnxjbN6HObIawbiE2sQ4=
In-Reply-To: <suus16$t33$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Mon, 21 Feb 2022 08:53 UTC

Due to the length of Paul's screed I have split my response into several
sections.

On 2/20/2022 6:05 PM, Paul A. Clayton wrote:
> Ivan Godard wrote:
>> On 2/18/2022 8:42 AM, Paul A. Clayton wrote:
>>> On Friday, February 18, 2022 at 9:27:14 AM UTC-5, Ivan Godard wrote:
>>> [snip]
>>>> Still,
>>>> in principle an OOO can issue a load from inside a called body before
>>>> the call itself has been issued, and a Mill can't do that.
>>>
>>> In theory a caller could start a pick up load that is picked up by
>>> the callee. This would be similar to function specific interfaces
>>> rather than using a generic ABI. It is not obvious how useful this
>>> would be.
>>
>> The architecture does not permit that - each frame has a distinct set
>> of load retire stations. While that could be changed, it's not clear
>> that, given the limited applicability to calls which were statically
>> resolvable (i.e. not polymorphic), the game would be worth the
>> candle. That case can be addressed without architecture change using
>> out-lining, too.
>
> My proposal would "merely" allow load retire stations to be treated
> similarly to callee saved registers. This, as with any
> flexibility, would increase complexity, but I have no clue how
> much hardware and the compiler/specializer would be affected.
>
> Even for polymorphic calls, there might be some cases where a
> load could be started early. If every callee loads a fixed sized member
> (if I remember correctly, the Mill only distinguished by size and
> is_pointer not between bit vector, integer, or FP of a given size/SIMD
> count) with a common offset, the caller could
> initiate such a load without knowledge of which function will use the
> result.
>
> Passing the load result as an additional argument would have a similar
> benefit (using a load that drops just before the function call), but
> such would not allow the delay of the called function's use of the value
> to hide latency — an instruction cache miss could
> provide significant extra delay where this memory-level parallelism
> could be useful but even load value use two cycles into a function might
> be a use-case for a 3-cycle data cache as even initiating the load in
> the first cycle of the called function might introduce one all-nop cycle
> (or at least reduced the execution parallelism).
>
> An ordinary deferred load that is immediately passed as an argument will
> stall the pipeline if the load delay was excessive (e.g., an L1 miss)
> despite actual use potentially being later. A prefetch would avoid the
> too-early stall but would not allow the
> load itself to be done. (One could support 'prefetch' data smaller than
> an L1 cache block into a special very fast cache, but a pick up load
> could be lower latency and lower power [just a present check versus a
> tag check] and possibly slightly smaller code size.)
>
> If such highly deferred loads were somewhat common, tracking resources
> might be moved farther from the heavy action of the core, allowing more
> timing-critical hardware to be closer. Utilization of such a distanced
> structure might be increased by also utilizing it for prefetches (both
> hardware and software initiated), perhaps.
>
> I have *no idea* if such would be worthwhile. I only observe that
> stopping loads at the call boundary prevents using some natural delay to
> further hide load latency.
>
>>>> Put all this together and the Mill ISA has eliminated all the OOO
>>>> advantages except: OOO will excel on 1) frequently called 2)
>>>> polymorphic
>>>> functions that 3) immediately execute a load whose address arguments
>>>> are
>>>> 4) ready and that 5) misses when 6) the hardware is not bandwidth
>>>> limited. For that case a Mill will stall in the callee and an OOO
>>>> likely
>>>> will not.
>>>
>>> I am not convinced this is the case, but writing on Google groups
>>> from a tablet is painful.
>
> I still need to think about what other parallelism opportunities OoO can
> exploit that a statically scheduled processor cannot. An OoO processor
> can initiate other loads past a load that fails to complete on schedule.
> In theory, one could use informing loads — a cache miss/excessive delay
> sets a condition which can be used for branches — to branch to
> alternative work on a cache miss, but it is not clear how useful such
> would be. (On the Mill, a pick-up load might be slightly extended to
> provide a branch target on a pick-up schedule failure, but I suspect
> that a more orthogonal branching interface might be better, allowing one
> alternate path branch to be used for multiple loads.)
>
> It seems the compiler would have a difficult job in deciding whether to
> eagerly execute loads and how far in advance to schedule loads. Hoisting
> a load too early wastes load tracking resources. With respect to eager
> execution, executing an unused load is somewhat expensive even on a
> cache hit (energy and throughput for loads seem to be better than for
> multiplications but much worse than for simple ALU operations); on a
> cache miss — at least when this does not effectively become a prefetch
> for a later access — this could be especially expensive.
>
> Generating memory level parallelism (such that the latency of one miss
> is partially covered by the latency of another) given indirections and
> conditional guards seems likely to be challenging. (In theory, a Mill
> implementation could cancel a load that turned out to generate a dead
> value, perhaps using a simple runahead to detect if a NaResult-marked
> [for load value not available in time] actually commits, in which case
> an exception would return to the load dropping cycle and wait for the
> load to return a value. For a single load miss, the runahead could even
> commit once guarding conditions have been resolved if the load was not
> used; more complex conditions might make such too complex. I think
> indicating that such a load is predicated may help hardware.)
>

Your proposal is one we have explored, though perhaps not far enough.
I'll explain the difficulties we found, in hopes that you might see ways
around them.

First, there are presently two ways for a Mill load to retire,
implicitly via timeout and explicitly via the pickup instruction. Each
presents its own problems.

An explicit load encodes a tag# argument, an arbitrary small constant
drawn from a set the size of the set of hardware retire stations. It is
not a physical RS#, but is mapped to one at issue in a way essentially
equivalent to a rename register in OOO. The corresponding pickup
instruction carries the same tag, and the mapping says which RS to
retire. Each call frame has its own tag namespace. The specializer's
schedule and hardware checks assure that loads don't issue an
in-use tag and that pickups don't try to retire one that is not in
flight, and so on.

To use your suggestion with explicit retire would require that there be
a tag namespace that crossed frame boundaries so a load could be loaded
in one and retired in a different one. Such a notion is very un-Mill
because we define interrupts, faults and traps as being mere ordinary
functions that are involuntarily called. As these can occur at any time,
a function cannot know whose frame is adjacent to its own. This makes
for a very easy call interface - you can only see your own stuff and
your explicit arguments - but makes sharing namespaces across frames
problematic.

The alternative to sharing is of course argument passing, as you
suggest: the call instruction could contain load tags from its caller
tag set, to be picked up by some kind of callee signature instruction
that maps the argument tags to the callee tag namespace, whence a normal
pickup can retire them. The problem is one of encoding: the call
instruction already has a variable length list of belt arguments, and
this would require a variable length list of tag arguments as well. We
never found a satisfactory way to encode two lists in one instruction.

The implicit load method has its own issues. For a load to time out in a
callee it must be counting in that callee. However, there may be other
calls to irrelevant functions between the load issue and the intended
target call, and the load should not count during those. Note that the
intervening non-counting calls may include interrupts and exceptions
that are not statically schedulable. Consequently, a call must be able
to indicate which in-flight loads should continue counting across the
call - and we are back in the encode-two-lists problem again.

The second list can be avoided if a callee is permitted to reference
caller tags ordinally, sort of a tag belt or tag RF. The callee would
then have an instruction that says in effect "retire the 3rd most recent
load issued by the caller" or "retire what the caller calls tag5". There
are obvious security holes in this because caller and callee may be in
different mutually untrusting turfs.


Re: Hoisting load issue out of functions; was: instruction set binding

<suvne8$8uj$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23728&group=comp.arch#23728

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Hoisting load issue out of functions; was: instruction set
binding
Date: Mon, 21 Feb 2022 01:53:12 -0800
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <suvne8$8uj$1@dont-email.me>
References: <suvjue$gep$1@dont-email.me>
<memo.20220221092755.7708x@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 21 Feb 2022 09:53:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="292ea7d621bd2f9ee11298ee908c5ea5";
logging-data="9171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/2xo+pm7lb7shmR4FzsJK/"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:w/U16kMQitGOCurZMdRU84p6tRM=
In-Reply-To: <memo.20220221092755.7708x@jgd.cix.co.uk>
Content-Language: en-US
 by: Ivan Godard - Mon, 21 Feb 2022 09:53 UTC

On 2/21/2022 1:26 AM, John Dallman wrote:
> In article <suvjue$gep$1@dont-email.me>, ivan@millcomputing.com (Ivan
> Godard) wrote:
>
>> ... Each call frame has its own tag namespace. The specializer
>> schedule and hardware checks assures that loads don't issue an
>> in-use tag, nor pickups try to retire one that is not in flight,
>> and so on.
>
> How does this work with C longjmp(), C++ exception throwing, and other
> forced transitions between call frames?
>
> John

Frame exit recovers all resources of the frame: belt, scratchpad, and
yes, retire stations. It's a bit too complicated to explain here, and
some of the details are NYF, but you can think of it as being as if each
load requests carries a frame identification, so when a response comes
back from the uncore carrying the frame id of an exited frame then the
response is discarded.

This behavior falls out naturally from the HW means used to attach
responses to the original load instruction and its retire. Remember that
loads can be in flight over calls (including handlers), so at any time
the uncore may have in flight loads from several different frames. Any
mechanism that attaches a response to its originating frame's set of RSs
also serves to detach a response from an already exited frame.
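Extending the earlier toy model with that idea (again, all names are
invented; this is a sketch of the described behavior, not the actual
hardware):

typedef struct {
    int frame_id;     /* id of the frame that issued the load */
    int rs;           /* destination retire station           */
    long value;       /* data returned by the uncore          */
} Response;

#define MAX_FRAMES 64
static int live_frames[MAX_FRAMES];    /* 1 while the frame is on the stack */

void rs_write(int rs, long value);     /* hypothetical: latch value into RS */

/* Called when the uncore delivers a load response. */
void deliver(Response r) {
    if (!live_frames[r.frame_id])
        return;                        /* issuing frame has exited: discard */
    rs_write(r.rs, r.value);           /* otherwise retire normally */
}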

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb21.111158@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23729&group=comp.arch#23729

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Mon, 21 Feb 2022 10:11:58 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 61
Message-ID: <2022Feb21.111158@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at> <suh97n$hng$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="cad4a7781742e91bd2914ef7c17cdc7f";
logging-data="25105"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX196tWJWvwdBghouBF+Uk1zu"
Cancel-Lock: sha1:lOxeIzd9DPCBcdZZaN+MYDv8q1A=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Mon, 21 Feb 2022 10:11 UTC

Ivan Godard <ivan@millcomputing.com> writes:
>On 2/15/2022 10:43 AM, Anton Ertl wrote:
>> And because compiler
>> branch prediction (~10% miss rate) is much worse than dynamic branch
>> prediction (~1% miss rate, both numbers vary strongly with the
>> application, so take them with a grain of salt), a static scheduling
>> speculating compiler will tend to waste more energy for the same
>> degree of speculation.
>
>Not in modern processes, or so I'm told by the hardware guys. Leakage is
>of the same order of cost as execution these days, so an idle ALU might
>as well do something potentially useful.

If true, this means that replacing the "speculation transistors" with
FUs (as someone proposed) would not save any power.

>Consequently it is worth while
>for the compiler to if-convert everything until it runs out of FUs.

If you if-convert a branch that is very predictable (at least for
dynamic branch prediction; say, 0.1% miss rate), yes, you have
utilized the FU, but you have not increased performance.
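For readers unfamiliar with the term, if-conversion replaces a branch
with unconditional computation of both arms plus a select; a minimal
C illustration:

/* branchy form: one arm executes; the branch may mispredict */
if (c) x = a + 1; else x = b * 2;

/* if-converted form: both arms execute; a select picks the result
   (compilers emit a conditional move or predicated ops for this) */
int t1 = a + 1;               /* can run on an otherwise idle FU */
int t2 = b * 2;
x = c ? t1 : t2;

If c is predicted correctly 99.9% of the time, the branchy form
almost never pays a misprediction, so the converted form's extra
work buys no speed.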

>Incidentally, in-order != static-prediction.

Nobody claimed it is. However, I have yet to see a paper on
speculation by the compiler that uses dynamic branch prediction for
the speculation. As far as I am aware, there is only Joshua Landau's
brp/brv idea in that direction, and nobody has pursued it further.

>No in-order core much above
>a Z80 will use static branch prediction

The 486 has no branch prediction (or, you might say that it statically
always predicts fall-through). IIRC SPARC has a way to suppress the
delay slot if the branch is not taken (which allows predicting that
a branch is taken). IIRC Alpha has hint bits in the branch
instruction that allow predicting whether the branch is taken. Some
implementations of architectures without such hint bits use the
backward-taken/forward-not-taken heuristic.

But generally, static branch prediction is more a compiler thing than
an architecture thing: The compiler predicts the direction statically,
and (simplest case) arranges the code such that the predicted path is
where the hardware executes it fastest. And for speculation, that's
certainly a compiler thing. The compiler moves instructions from a
likely-executed block up across a branch to be executed speculatively;
some architectures (particularly IA-64) have architectural support
that allows compilers to speculate more.
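In C, the usual hook for this is the GCC/Clang __builtin_expect
annotation (function names below are illustrative), which steers
both the static prediction and the block layout:

/* Mark the error path as unlikely; the compiler lays out the hot
   path as straight-line fall-through code and moves the handler
   out of the way. */
if (__builtin_expect(err != 0, 0)) {
    handle_error(err);        /* cold path, placed out of line */
}
fast_path();                  /* likely path falls through     */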

>, for the reason you give. Well,
>no competent core, anyway.

486, R2000 and basically everything else from the 1980s (Z80 was
released in 1976, the Pentium with dynamic branch prediction in 1993),
all incompetent?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb21.115543@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23730&group=comp.arch#23730

 by: Anton Ertl - Mon, 21 Feb 2022 10:55 UTC

Ivan Godard <ivan@millcomputing.com> writes:
>On 2/15/2022 8:58 AM, Stefan Monnier wrote:
>> In contrast, an OoO is free to use a different schedule each time
>> a chunk of code is run (and similarly the branch predictor is free to
>> provide wildly different predictions each time that chunk of code is
>> run) without any downside.
>>
>> The OoO works on the trace of the actual execution, where the main limit
>> is the size of the window it can consider (linked to the accuracy of the
>> branch predictor), whereas the compiler is not limited to such a window
>> but instead it's limited to work on the non-unrolled code.
>
>Actually the limit to useful window size in GP code is dataflow
>dependencies.

If that were the case, we would not need branch prediction. But
indeed, as we have shown in our limit study (see Section 6 and Figure
9 of <http://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.gz>),
which considers only data flow (RAW) dependencies, control
dependencies for stores, and branch prediction, branch prediction
increases potential IPC a lot.

>All the fancy numbers bandied about for big windows are
>for embarrassingly parallel apps - walk a huge array doing the same
>thing for every element for example.

What makes you think so? Embarrassingly parallel problems are easy to
parallelize and don't need deep reorder buffers. You can handle that
with many small cores, and/or depending on the thing you do for every
element, with SIMD instructions, GPGPUs, or software pipelined wide
in-order cores.

Do you have cases in mind where the thing being done takes up to ~500
instructions, and is too irregular for SIMD, GPGPUs, and software
pipelining? Yes, OoO would help there, but many small cores would as
well. I doubt that there are that many problems of that kind to
justify deep OoO, because there are not enough problems of that kind
to justify many small cores (or Tilera and the UltraSPARC T1000 would
have been more successful).

>GP code - the classic payroll app, or the great bulk of real work once
>the embarrassingly parallel code has been moved off to special purpose
>engines - hits a dataflow dependence within a few tens of instructions.

What do you mean by that?

>The rest of the window can be filled with instructions awaiting issue
>resolution - but you might as well have left them in the icache.

You seem to be thinking about in-order execution. Just because one
instruction waits on the result of another instruction does not mean
that all instructions do. Even looking 500 instructions ahead, an
instruction coming from the front end might be ready (e.g., it loads a
constant, or it loads a value from the stack, and the stack pointer
has not been updated in the last 500 dynamic instructions).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit

<jwvk0domc8x.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23731&group=comp.arch#23731

 by: Stefan Monnier - Mon, 21 Feb 2022 16:17 UTC

Thomas Koenig [2022-02-21 06:20:49] wrote:
> Paul A. Clayton <paaronclayton@gmail.com> schrieb:
>> My proposal would "merely" allow load retire stations to be
>> treated similarly to callee saved registers. This, as with any
>> flexibility, would increase complexity, but I have no clue how
>> much hardware and the compiler/specializer would be affected.
> Maybe I'm confused, but how would this (in a conventional
> architecture) be different from issuing a load into an argument
> register prior to a function, then calling the function?

I don't think the intention is to be different, but on the contrary to
reproduce the behavior of OoO in the case you describe.

An OoO can start evaluating the function's body before the load
has finished. Whereas with the way loads work on the Mill currently,
the whole "issue the load ... get the result on the belt" can either
happen 100% while executing instructions from the caller, or 100% while
executing instructions of the callee.

So currently on the Mill if you do the load on the caller's side, that
will delay the call until the load completes, and if you do the call on
the callee's side that means you can't start the load before you enter
the function.

Paul's proposal is to allow doing it half and half. Whether it's worth
the added complexity in the ISA, I can't judge. It might be worth
trying it out.
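
For a conventional architecture, the case described above looks roughly
like the following C sketch (function names are hypothetical). The point
is that an OoO core overlaps the load's latency with the load-independent
start of the callee, which is the overlap the proposal tries to let a
statically scheduled machine express:

extern int g(int x);           /* hypothetical callee */

int caller(int *p)
{
    int x = *p;                /* load issued on the caller's side */
    return g(x);               /* an OoO core can begin the parts of g()
                                  that do not use x while the load is
                                  still in flight; an in-order split of
                                  issue and retire needs ISA support */
}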

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit

<sv0i51$gs3$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23733&group=comp.arch#23733

 by: Ivan Godard - Mon, 21 Feb 2022 17:29 UTC

On 2/21/2022 9:09 AM, John Dallman wrote:
> In article <suj1pu$sou$1@dont-email.me>, ivan@millcomputing.com (Ivan
> Godard) wrote:
>
>> Within a single function frame, it is the responsibility of the
>> specializer to schedule the lifetimes of in-flight load operations
>> such that the number simultaneously live does not exceed the number
>> of RSs configured.
>
> OK. If you unavoidably have more loads live than that, there will
> presumably be stalls waiting for memory?

There cannot be more than that; it's statically scheduled: the code cannot
initiate more loads than there are stations, and an attempt to do so is
a HW detected fault. Hence no stall. If you have a lot of loads to
schedule, then always schedule enough retires that there are free RSs to
schedule more loads onto. The retiring loads drop to the belt, and if
that makes you run out of belt too then you spill belt values to the
scratchpad and refill them when you need the value.

>> Between frames the RSs are lazily and
>> automatically spilled by the hardware spiller, so that each frame
>> appears to have a complete set of free RSs at entry. A function
>> return restores the previously active RSs automatically. The
>> current implementations spill only address and metadata from RSs,
>> not the buffer content; a spill refill reissues the load request,
>> which should be satisfied from cache because of the pre-spill load
>> action.
>
> Do spilled RS go on the control stack or the data stack?

Control. All spiller stuff goes (eventually) on the control stack.

>
>> Countdown ("deferred") loads only count cycles of the load's frame.
>> While counting, or while awaiting an explicit retire instruction,
>> the program may do anything it wants including executing calls to
>> arbitrarily lengthy functions. This provides insulation against
>> cache misses similar to that provided by very long issue windows in
>> OOO, but without the hardware overhead or window size limitations.
>
> How are the counts values set up? Establishing "correct" values is not an
> obviously simple problem. What happens when a count runs out before data
> arrives?

The specializer actually schedules the load issues and load retires as
distinct instructions, with retire depending on issue with a minimum
latency determined by the D$1 round trip. This is no different than
scheduling a multiply's result into an add, and the issue can be remote
from the retire just as the mul can be remote from the add. The retire
is a pseudo-instruction used for scheduling but which does not appear in
the generated binary.

The cycle count between the as-scheduled issue's bundle and the retire's
bundle, biased to zero, is the deferral count put in the issue
instruction. If that count is too big for the encoding then the
instruction is changed from using countdown to one using explicit
pickup, which has no encoding limit on the gap between issue and retire.
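
A hedged C sketch of that choice; the field width, the exact bias, and
all names here are assumptions for illustration, not actual Mill
encodings:

#include <stdbool.h>

#define DEFER_MAX 63               /* assumed countdown-field limit */

struct load_form { bool explicit_pickup; int defer_count; };

static struct load_form pick_form(int issue_cycle, int retire_cycle)
{
    struct load_form f;
    /* cycle count between the issue bundle and the retire bundle,
       biased to zero (assuming the minimum 1-cycle gap encodes as 0) */
    int biased = (retire_cycle - issue_cycle) - 1;
    if (biased <= DEFER_MAX) {
        f.explicit_pickup = false; /* countdown ("deferred") load */
        f.defer_count = biased;
    } else {
        f.explicit_pickup = true;  /* pickup: no limit on the gap */
        f.defer_count = 0;
    }
    return f;
}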

> John

Re: instruction set binding time, was Encoding 20 and 40 bit

<sv0i9b$gs3$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23734&group=comp.arch#23734

 by: Ivan Godard - Mon, 21 Feb 2022 17:31 UTC

On 2/21/2022 8:17 AM, Stefan Monnier wrote:
> Thomas Koenig [2022-02-21 06:20:49] wrote:
>> Paul A. Clayton <paaronclayton@gmail.com> schrieb:
>>> My proposal would "merely" allow load retire stations to be
>>> treated similarly to callee saved registers. This, as with any
>>> flexibility, would increase complexity, but I have no clue how
>>> much hardware and the compiler/specializer would be affected.
>> Maybe I'm confused, but how would this (in a conventional
>> architecture) be different from issuing a load into an argument
>> register prior to a function, then calling the function?
>
> I don't think the intention is to be different, but on the contrary to
> reproduce the behavior of OoO in the case you describe.
>
> An OoO can start evaluating the function's body before the load
> has finished. Whereas with the way loads work on the Mill currently,
> the whole "issue the load ... get the result on the belt" can either
> happen 100% while executing instructions from the caller, or 100% while
> executing instructions of the callee.
>
> So currently on the Mill if you do the load on the caller's side, that
> will delay the call until the load completes, and if you do the call on
> the callee's side that means you can't start the load before you enter
> the function.
>
> Paul's proposal is to allow doing it half and half. Whether it's worth
> the added complexity in the ISA, I can't judge. It might be worth
> trying it out.

As described in my lengthy response to Paul, we did try it out, but
never found a way to encode it that didn't muck up everything.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv0irq$mmc$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23735&group=comp.arch#23735

 by: Ivan Godard - Mon, 21 Feb 2022 17:41 UTC

On 2/21/2022 2:11 AM, Anton Ertl wrote:
> Ivan Godard <ivan@millcomputing.com> writes:
>> On 2/15/2022 10:43 AM, Anton Ertl wrote:
>>> And because compiler
>>> branch prediction (~10% miss rate) is much worse than dynamic branch
>>> prediction (~1% miss rate, both numbers vary strongly with the
>>> application, so take them with a grain of salt), a static scheduling
>>> speculating compiler will tend to waste more energy for the same
>>> degree of speculation.
>>
>> Not in modern processes, or so I'm told by the hardware guys. Leakage is
>> of the same order of cost as execution these days, so an idle ALU might
>> as well do something potentially useful.
>
> If true, this means that replacing the "speculation transistors" with
> FUs (as someone proposed) would not save any power.
>
>> Consequently it is worth while
>> for the compiler to if-convert everything until it runs out of FUs.
>
> If you if-convert a branch that is very predictable (at least for
> dynamic branch prediction; say, 0.1% miss rate), yes, you have
> utilized the FU, but you have not increased performance.
>
>> Incidentally, in-order != static-prediction.
>
> Nobody claimed it is. However, I have yet to see a paper on
> speculation by the compiler that uses dynamic branch prediction for
> the speculation. As far as I am aware, there is only Joshua Landau's
> brp/brv idea in that direction, and nobody has pursued it further.
>
>> No in-order core much above
>> a Z80 will use static branch prediction
>
> The 486 has no branch prediction (or, you might say that it statically
> always predicts fall-through). IIRC SPARC has a way to suppress the
> delay slot if the branch is not taken (allows predicting that a branch
> is taken). IIRC Alpha has hint bits in the branch instruction that
> allow predicting that the branch is taken or not. Some
> implementations of architectures without such hint bits use the
> backward-taken/forward-not-taken heuristic.
>
> But generally, static branch prediction is more a compiler thing than
> an architecture thing: The compiler predicts the direction statically,
> and (simplest case) arranges the code such that the predicted path is
> where the hardware executes it fastest. And for speculation, that's
> certainly a compiler thing. The compiler moves instructions from a
> likely-executed block up across a branch to be executed speculatively;
> some architectures (particularly IA-64) have architectural support
> that allows compilers to speculate more.
>
>> , for the reason you give. Well,
>> no competent core, anyway.
>
> 486, R2000 and basically everything else from the 1980s (Z80 was
> released in 1976, the Pentium with dynamic branch prediction in 1993),
> all incompetent?

By today's standard, yes. You have been (ISTM) asserting that an ISA
from 40 years ago cannot match a modern OOO (to which I agree), and that
a modern ISA will necessarily suffer the constraints of that old
architecture (to which I don't).

Just because you haven't seen it in a paper doesn't mean that one cannot
use a HW runtime branch predictor with a statically scheduled
architecture. I have explained how that works here more than I care to
have done (and no doubt more than the denizens here care for me to have
done). I suggest that you read how it actually works, here or in our doc
at millcomputing.com, and not assume what the words "static schedule"
meant forty years ago.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv0k1f$ju$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23736&group=comp.arch#23736

 by: Ivan Godard - Mon, 21 Feb 2022 18:01 UTC

On 2/21/2022 2:55 AM, Anton Ertl wrote:

<snip>

.
>
> You seem to be thinking about in-order execution. Just because one
> instruction waits on the result of another instruction does not mean
> that all instructions do. Even looking 500 instructions ahead, an
> instruction coming from the front end might be ready (e.g., it loads a
> constant, or it loads a value from the stack, and the stack pointer
> has not been updated in the last 500 dynamic instructions).

1) Like My66, Mill does not load constants from memory.
2) The stack will be in the D$1 and so is three cycles away. The
schedule can fill those three cycles with other instructions at full
width if the ILP is available; a 500 instruction issue queue buys nothing.
3) If the load is dataflow dependent on other loads (pointer chasing for
example) then no ISA will be faster than the back-to-back latency, and
again the other instructions can be fit in the width between the loads.
4) If the load has no dataflow dependency on other loads (iterating over
an array with independent computation at each element, for example) then
the load can be hoisted arbitrarily high by increasing the iteration
interval of the loop pipeline. How high to hoist is a cost choice -
bigger hides more latency, but increases wasted work in pipe setup and
teardown.
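
A minimal C sketch of that hoisting, with one iteration of lookahead (a
real pipeliner may choose more; the function is made up for illustration):

/* Each iteration issues the next element's load, so the current
   iteration's compute overlaps that load's latency. Deeper hoisting
   hides more latency but lengthens the setup and teardown code. */
void scale(const int *a, int *out, int n, int k)
{
    if (n <= 0) return;
    int cur = a[0];                /* setup: prime the pipeline */
    for (int i = 0; i + 1 < n; i++) {
        int next = a[i + 1];       /* hoisted load, one iteration early */
        out[i] = cur * k;          /* compute overlaps the load */
        cur = next;
    }
    out[n - 1] = cur * k;          /* teardown: drain the pipeline */
}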

I do not say that static schedules will always be better than OOO
schedules. I do not even say that they will always be as good as OOO
schedules. I do say that they will usually be as good as, and never much
worse, than OOO schedules, and cost much much less.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv12jf$j8h$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23738&group=comp.arch#23738

 by: BGB - Mon, 21 Feb 2022 22:09 UTC

On 2/21/2022 12:01 PM, Ivan Godard wrote:
> On 2/21/2022 2:55 AM, Anton Ertl wrote:
>
> <snip>
>
> .
>>
>> You seem to be thinking about in-order execution.  Just because one
>> instruction waits on the result of another instruction does not mean
>> that all instructions do.  Even looking 500 instructions ahead, an
>> instruction coming from the front end might be ready (e.g., it loads a
>> constant, or it loads a value from the stack, and the stack pointer
>> has not been updated in the last 500 dynamic instructions).
>
> 1) Like My66, Mill does not load constants from memory.

Nor does BJX2, FWIW.
In its current primary profiles, large constants are typically composed
via Jumbo prefixes (which in effect horizontally paste together an
immediate from multiple instruction words within a single bundle).

My core also has a branch predictor.

The branch predictor can be made cheap enough, and is accurate enough,
that one may as well have it. It is in effect a simple lookup (based on
low-order address bits and recent branch history), which is used as a
3-bit state machine.
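
Something in the spirit of that predictor fits in a few lines of C; the
table size, the hash, and the taken threshold below are illustrative
assumptions, not BJX2's actual parameters:

#include <stdbool.h>
#include <stdint.h>

#define PRED_BITS 10
#define PRED_SIZE (1u << PRED_BITS)

static uint8_t  counters[PRED_SIZE];   /* 3-bit saturating counters, 0..7 */
static uint32_t history;               /* recent taken/not-taken outcomes */

static bool predict_taken(uint32_t pc)
{
    uint32_t idx = ((pc >> 1) ^ history) & (PRED_SIZE - 1);
    return counters[idx] >= 4;         /* upper half of range => taken */
}

static void train(uint32_t pc, bool taken)
{
    uint32_t idx = ((pc >> 1) ^ history) & (PRED_SIZE - 1);
    if (taken  && counters[idx] < 7) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    history = (history << 1) | (taken ? 1u : 0u);
}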

> 2) The stack will be in the D$1 and so is three cycles away. The
> schedule can fill those three cycles with other instructions at full
> width if the ILP is available; a 500 instruction issue queue buys nothing.

Likewise.

L1 access latency is fixed (*), can be scheduled around assuming one has
something else to do in the meantime. This is a drawback of "small tight
loops", namely that they have little else to fill the space with, so
cycles will get wasted here.

L2 latency is modest, but in theory a lot of this could be absorbed with
a prefetch mechanism. DRAM latency is harder, would require a different
sort of prefetch. In some cases, prefetching into L2 can be handled by
logic in the L1 cache.

Say: We store into a line in the L1, and the L1 signals the L2 that this
has happened, and the L2 (asynchronously) gets the target cache-line
ready to accept the result when it is written back.

*: In my case, it is 3 cycles, so, say:
MOV.Q (SP, 8), R4
Op1 //R4 not available yet
Op2 //R4 still not available
ADD R4, R7 //Now R4 is available

You can try to use it sooner than this; the CPU will detect this case
and behave as if there were NOP instructions present.

To some extent, the WEXifier will also try to shuffle things to avoid
this, but tends to be fairly limited in what it can do.

Partly it is the "great register tradeoff":
One can reuse registers quickly, fewer registers needed;
But, then one can't really shuffle anything;
One could use all the registers and then allocate them round-robin;
But, for small functions this costs more than the former.

Current register allocator uses a recency counter and heuristics to try
to decide if we could allocate more registers, and prioritizes replacing
old/less-used variables before newer or more heavily used variables.
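
A hedged sketch of that victim choice in C; the scoring is hypothetical,
meant only to show "old and lightly used loses to new and heavily used":

#include <stdint.h>

#define NREGS 32

struct reg_state {
    uint32_t last_use;    /* recency counter value at most recent use */
    uint32_t use_count;   /* how often the variable has been touched  */
};

static int pick_spill_victim(const struct reg_state regs[NREGS],
                             uint32_t now)
{
    int best = 0;
    uint64_t best_score = 0;
    for (int i = 0; i < NREGS; i++) {
        uint32_t age = now - regs[i].last_use;
        /* older and less-used => higher score => spilled first */
        uint64_t score = (uint64_t)age * 16 / (regs[i].use_count + 1);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}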

> 3) If the load is dataflow dependent on other loads (pointer chasing for
> example) then no ISA will be faster than the back-to-back latency, and
> again the other instructions can be fit in the width between the loads.

Yep.

Some types of tasks, like loops iterating over linked lists, are hard to
make fast.

Usually about the only way to "fix" this is to rewrite the code to not
have a tight loop operating over a linked list.

> 4) If the load has no dataflow dependency on other loads (iterating over
> an array with independent computation at each element, for example) then
> the load can be hoisted arbitrarily high by increasing the iteration
> interval of the loop pipeline. How high to hoist is a cost choice -
> bigger hides more latency, but increases wasted work in pipe setup and
> teardown.
>

This is where a coding style like "throwing a bunch of variables and
expressions at the problem" helps, and writing code in a style where it
is fairly easy to run expressions in parallel.

OoO: Goes fast, because it allows CPU to better use its OoO capabilities
(and also modern x86 CPUs are mostly able to work around memory spills
and reloads).

VLIW: Also helps, the CPU core can run all this stuff in parallel.

The main cases this doesn't help:
Scalar cores, which can't run things in parallel;
An ISA with insufficient registers to keep everything in registers.

Some amount of older code tries for the "small tight loops which don't
do all that much" style, since it can be noted that this was generally
the fastest strategy on the 386 and 486.

Likewise, code on 32-bit ARM also seems to prefer the "small tight
loops" strategy over the "boatload of manually unrolled and interleaved
expressions" strategy.

> I do not say that static schedules will always be better than OOO
> schedules. I do not even say that they will always be as good as OOO
> schedules. I do say that they will usually be as good as, and never much
> worse, than OOO schedules, and cost much much less.
>

Generally seems true.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv13ah$sor$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23739&group=comp.arch#23739

 by: Thomas Koenig - Mon, 21 Feb 2022 22:22 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:
> On 2/21/2022 2:55 AM, Anton Ertl wrote:
>
><snip>
>
> .
>>
>> You seem to be thinking about in-order execution. Just because one
>> instruction waits on the result of another instruction does not mean
>> that all instructions do. Even looking 500 instructions ahead, an
>> instruction coming from the front end might be ready (e.g., it loads a
>> constant, or it loads a value from the stack, and the stack pointer
>> has not been updated in the last 500 dynamic instructions).
>
> 1) Like My66, Mill does not load constants from memory.

You have to watch out for one thing - for code like

void foo (double *a, double *b, double *c)
{
  *a = 42.;
  *b = 42.;
  *c = 42.;
}

the compiler should not put the same constant in the instruction
stream three times :-)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<19acb0f0-0b06-4b0c-aacb-54bee6e7f4e3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23740&group=comp.arch#23740

 by: MitchAlsup - Mon, 21 Feb 2022 22:36 UTC

On Monday, February 21, 2022 at 2:09:55 PM UTC-8, BGB wrote:
> On 2/21/2022 12:01 PM, Ivan Godard wrote:
> > On 2/21/2022 2:55 AM, Anton Ertl wrote:
> >
> > <snip>
> >
> > .
> >>
> >> You seem to be thinking about in-order execution. Just because one
> >> instruction waits on the result of another instruction does not mean
> >> that all instructions do. Even looking 500 instructions ahead, an
> >> instruction coming from the front end might be ready (e.g., it loads a
> >> constant, or it loads a value from the stack, and the stack pointer
> >> has not been updated in the last 500 dynamic instructions).
> >
> > 1) Like My66, Mill does not load constants from memory.
> Nor does BJX2, FWIW.
> In its current primary profiles, large constants are typically composed
> via Jumbo prefixes (which in effect horizontally paste together an
> immediate from multiple instruction words within a single bundle).
>
> My core also has a branch predictor.
>
> The branch predictor can be made cheap enough, and is accurate enough,
> that one may as well have it. It is in effect a simple lookup (based on
> low-order address bits and recent branch history), which is used as a
> 3-bit state machine.
> > 2) The stack will be in the D$1 and so is three cycles away. The
> > schedule can fill those three cycles with other instructions at full
> > width if the ILP is available; a 500 instruction issue queue buys nothing.
> Likewise.
>
> L1 access latency is fixed (*), can be scheduled around assuming one has
> something else to do in the meantime. This is a drawback of "small tight
> loops", namely that they have little else to fill the space with, so
> cycles will get wasted here.
<
That decision is implementation dependent.
<
Say we have a speed demon design with a small 3 cycle D$ 16KB-32KB
Then
Say we have a server design with a large 4 cycle L1 128KB-256KB
<
Both designs use the same blocks to build the pipeline
Both are designed to hit the same FAB technology.
<
Speed demon might have 3 levels of cache while the server has 2 of
capacity equal to the speed demon's.
<
Now you want to run the same code optimally on both implementations.
Mill addresses this with the Specializer.
My 66000 addresses this with small O
<-----------
> > 4) If the load has no dataflow dependency on other loads (iterating over
> > an array with independent computation at each element, for example) then
> > the load can be hoisted arbitrarily high by increasing the iteration
> > interval of the loop pipeline. How high to hoist is a cost choice -
> > bigger hides more latency, but increases wasted work in pipe setup and
> > teardown.
> >
> This is where a coding style like "throwing a bunch of variables and
> expressions at the problem" helps, and writing code in a style where it
> is fairly easy to run expressions in parallel.
>
> OoO: Goes fast, because it allows CPU to better use its OoO capabilities
> (and also modern x86 CPUs are mostly able to work around memory spills
> and reloads).
<
Ahem: because it allows instructions to have whatever latencies the machine,
in that instant, produces.
>
> VLIW: Also helps, the CPU core can run all this stuff in parallel.
>
>
> The main cases this doesn't help:
> Scalar cores, which can't run things in parallel;
> An ISA with insufficient registers to keep everything in registers.
>
>
> Some amount of older code tries for the "small tight loops which don't
> do all that much" style, since it can be noted that this was generally
> the fastest strategy on the 386 and 486.
>
> Likewise, code on 32-bit ARM also seems to prefer the "small tight
> loops" strategy over the "boatload of manually unrolled and interleaved
> expressions" strategy.
> > I do not say that static schedules will always be better than OOO
> > schedules. I do not even say that they will always be as good as OOO
> > schedules. I do say that they will usually be as good as, and never much
> > worse, than OOO schedules, and cost much much less.
> >
> Generally seems true.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23741&group=comp.arch#23741

 by: MitchAlsup - Mon, 21 Feb 2022 22:41 UTC

On Monday, February 21, 2022 at 2:22:12 PM UTC-8, Thomas Koenig wrote:
> Ivan Godard <iv...@millcomputing.com> schrieb:
> > On 2/21/2022 2:55 AM, Anton Ertl wrote:
> >
> ><snip>
> >
> > .
> >>
> >> You seem to be thinking about in-order execution. Just because one
> >> instruction waits on the result of another instruction does not mean
> >> that all instructions do. Even looking 500 instructions ahead, an
> >> instruction coming from the front end might be ready (e.g., it loads a
> >> constant, or it loads a value from the stack, and the stack pointer
> >> has not been updated in the last 500 dynamic instructions).
> >
> > 1) Like My66, Mill does not load constants from memory.
> You have to watch out for one thing - for code like
>
> void foo (double *a, double *b, double *c)
> {
> *a = 42.;
> *b = 42.;
> *c = 42.;
> }
Compiler is not restricted from doing::
<
MOV Rt,#42
STD Rt,[Ra]
STD Rt,[Rb]
STD Rt,[Rc]
<
but it does not HAVE to.
<
Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
STD #42,[Ra]
STD #42,[Rb]
STD #42,[Rc]
may be less than the above example.
>
> the compiler should not put the same constant in the instruction
> stream three times :-)
<
This is a code density argument, not a performance argument. Most of the time
having immediates and displacements of all sizes in the instruction stream
improves code density.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<ygnpmnf4zbp.fsf@y.z>

https://www.novabbs.com/devel/article-flat.php?id=23742&group=comp.arch#23742

 by: Josh Vanderhoof - Mon, 21 Feb 2022 22:41 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Would it be worthwhile to break it into 2 loops, one that works on data
>> already in the cache and skips/prefetches the uncached data, and then a
>> second loop that runs on the data that the previous loop skipped, which
>> has hopefully now arrived in the cache due to the prefetches in the
>> previous loop.
>
> For OoO CPUs that have a window large enough to withstand the cache
> misses's latency, that can already happen without any special effort on
> the programmer&compiler's side.

Yes, that would be the point. It'd be trying to run the loop out of
order on an in-order CPU and hopefully achieve similar performance to
OoO. (Admittedly only on a very specific loop template.)
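
A hedged C sketch of that loop template: in_cache() is hypothetical
(real ISAs rarely expose such a query, which is part of what makes this
speculative), and __builtin_prefetch is the GCC/Clang builtin:

extern int  in_cache(const void *p);      /* hypothetical query */
extern void process(int *elem);

/* First pass: handle hits, prefetch and defer the misses. Second pass:
   handle the deferred elements, hopefully cached by now. 'deferred' is
   caller-provided scratch with room for n indices. */
void two_pass(int *a, int n, int *deferred)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (in_cache(&a[i])) {
            process(&a[i]);
        } else {
            __builtin_prefetch(&a[i]);    /* start the miss early */
            deferred[m++] = i;
        }
    }
    for (int j = 0; j < m; j++)
        process(&a[deferred[j]]);
}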

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv1e9o$gg0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23744&group=comp.arch#23744

 by: Brian G. Lucas - Tue, 22 Feb 2022 01:29 UTC

On 2/21/22 16:41, MitchAlsup wrote:
> On Monday, February 21, 2022 at 2:22:12 PM UTC-8, Thomas Koenig wrote:
>> Ivan Godard <iv...@millcomputing.com> schrieb:
>>> On 2/21/2022 2:55 AM, Anton Ertl wrote:
>>>
>>> <snip>
>>>
>>> .
>>>>
>>>> You seem to be thinking about in-order execution. Just because one
>>>> instruction waits on the result of another instruction does not mean
>>>> that all instructions do. Even looking 500 instructions ahead, an
>>>> instruction coming from the front end might be ready (e.g., it loads a
>>>> constant, or it loads a value from the stack, and the stack pointer
>>>> has not been updated in the last 500 dynamic instructions).
>>>
>>> 1) Like My66, Mill does not load constants from memory.
>> You have to watch out for one thing - for code like
>>
>> void foo (double *a, double *b, double *c)
>> {
>> *a = 42.;
>> *b = 42.;
>> *c = 42.;
>> }
> Compiler is not restricted from doing::
> <
> MOV Rt,#42
> STD Rt,[Ra]
> STD Rt,[Rb]
> STD Rt,[Rc]
> <
> but it does not HAVE to.
> <
> Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
> STD #42,[Ra]
> STD #42,[Rb]
> STD #42,[Rc]
> may be less than the above example.
>>
>> the compiler should not put the same constant in the instruction
>> stream three times :-)
> <
> This is a code density argument not a performance argument. Most of the time
> having immediates and displacements of all sizes in the instruction stream
> improves code density.

Thomas's example used 64-bit floating-point constants, which moves the code
density pointer somewhat. At some point one needs to burn a register load and
do register stores. I don't know where that trade-off should be.

brian

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv1fku$a4n$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23745&group=comp.arch#23745

 by: BGB - Tue, 22 Feb 2022 01:52 UTC

On 2/21/2022 4:36 PM, MitchAlsup wrote:
> On Monday, February 21, 2022 at 2:09:55 PM UTC-8, BGB wrote:
>> On 2/21/2022 12:01 PM, Ivan Godard wrote:
>>> On 2/21/2022 2:55 AM, Anton Ertl wrote:
>>>
>>> <snip>
>>>
>>> .
>>>>
>>>> You seem to be thinking about in-order execution. Just because one
>>>> instruction waits on the result of another instruction does not mean
>>>> that all instructions do. Even looking 500 instructions ahead, an
>>>> instruction coming from the front end might be ready (e.g., it loads a
>>>> constant, or it loads a value from the stack, and the stack pointer
>>>> has not been updated in the last 500 dynamic instructions).
>>>
>>> 1) Like My66, Mill does not load constants from memory.
>> Nor does BJX2, FWIW.
>> In its current primary profiles, large constants are typically composed
>> via Jumbo prefixes (which in effect horizontally paste together an
>> immediate from multiple instruction words within a single bundle).
>>
>> My core also has a branch predictor.
>>
>> The branch predictor can be made cheap enough, and is accurate enough,
>> that one may as well have it. It is in effect a simple lookup (based on
>> low-order address bits and recent branch history), which is used as a
>> 3-bit state machine.
>>> 2) The stack will be in the D$1 and so is three cycles away. The
>>> schedule can fill those three cycles with other instructions at full
>>> width if the ILP is available; a 500 instruction issue queue buys nothing.
>> Likewise.
>>
>> L1 access latency is fixed (*), can be scheduled around assuming one has
>> something else to do in the meantime. This is a drawback of "small tight
>> loops", namely that they have little else to fill the space with, so
>> cycles will get wasted here.
> <
> That decision is implementation dependent.
> <
> Say we have a speed demon design with a small 3 cycle D$ 16KB-32KB
> Then
> Say we have a server design with a large 4 cycle L1 128KB-256KB
> <
> Both designs use the same blocks to build the pipeline
> Both are designed to hit the same FAB technology.
> <
> Speed demon might have 3 levels of cache while the server has 2 of
> capacity equal to the speed demon's.
> <
> Now you want to run the same code optimally on both implementations.
> Mill addresses this with the Specializer.
> My 66000 addresses this with small O
> <-----------

OK.

I have 3-cycle L1, also 16 or 32K.

From what I can gather, the idea is that a classic RISC would have a fixed
2-cycle load latency, but this is an issue for timing.

Eg:
IF ID EX MA WB

My cores have typically ended up with a few more stages than this...

It is possible that code could be compiled with fine-tuning options for
the hardware it is to be run on (so, things like Load, ALU, FPU, ...
latency could be specified as part of the target processor profile).

Interlocks are good in that they still allow code to run correctly (with
reduced performance) when it assumes different instruction latency values
(some DSPs and similar skipped out on this).

>>> 4) If the load has no dataflow dependency on other loads (iterating over
>>> an array with independent computation at each element, for example) then
>>> the load can be hoisted arbitrarily high by increasing the iteration
>>> interval of the loop pipeline. How high to hoist is a cost choice -
>>> bigger hides more latency, but increases wasted work in pipe setup and
>>> teardown.
>>>
>> This is where a coding style like "throwing a bunch of variables and
>> expressions at the problem" helps, and writing code in a style where it
>> is fairly easy to run expressions in parallel.
>>
>> OoO: Goes fast, because it allows CPU to better use its OoO capabilities
>> (and also modern x86 CPUs are mostly able to work around memory spills
>> and reloads).
> <
> Ahem: because it allows instructions to have whatever latencies the machine,
> in that instant, produces.

Granted.

In any case, one can have spill-heavy code without suffering a big
performance hit.

I suspect this was a big source of some of my performance woes on the
ARM11 and Cortex-A53. Cases which have a lot of "load op spill load op
spill ..." kinda perform like crap.

The latency for loads is also kind of an issue on my BJX2 core, but I
have enough GPRs to partially work around this. Any L1 misses will still
stall the pipeline though.

FWIW:
In my case, it would appear that Load interlocks are 2nd place (for
wasted clock cycles) after L1 misses (based on the debug 'LED'
indicators), but they move into 1st place in some cases.

I guess it could make sense to start gathering up stats for the
percentage of the clock-cycle budget that is being spent on interlocks.

>>
>> VLIW: Also helps, the CPU core can run all this stuff in parallel.
>>
>>
>> The main cases this doesn't help:
>> Scalar cores, which can't run things in parallel;
>> An ISA with insufficient registers to keep everything in registers.
>>
>>
>> Some amount of older code tries for the "small tight loops which don't
>> do all that much" style, since it can be noted that this was generally
>> the fastest strategy on the 386 and 486.
>>
>> Likewise, code on 32-bit ARM also seems to prefer the "small tight
>> loops" strategy over the "boatload of manually unrolled and interleaved
>> expressions" strategy.
>>> I do not say that static schedules will always be better than OOO
>>> schedules. I do not even say that they will always be as good as OOO
>>> schedules. I do say that they will usually be as good as, and never much
>>> worse, than OOO schedules, and cost much much less.
>>>
>> Generally seems true.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<0cfcf36b-31fe-4991-9ea3-87c5023c8747n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23748&group=comp.arch#23748

 by: MitchAlsup - Tue, 22 Feb 2022 03:14 UTC

On Monday, February 21, 2022 at 7:52:34 PM UTC-6, BGB wrote:
> On 2/21/2022 4:36 PM, MitchAlsup wrote:
> > On Monday, February 21, 2022 at 2:09:55 PM UTC-8, BGB wrote:
> >> On 2/21/2022 12:01 PM, Ivan Godard wrote:
> >>> On 2/21/2022 2:55 AM, Anton Ertl wrote:
> >>>
> >>> <snip>
> >>>
> >>> .
> >>>>
> >>>> You seem to be thinking about in-order execution. Just because one
> >>>> instruction waits on the result of another instruction does not mean
> >>>> that all instructions do. Even looking 500 instructions ahead, an
> >>>> instruction coming from the front end might be ready (e.g., it loads a
> >>>> constant, or it loads a value from the stack, and the stack pointer
> >>>> has not been updated in the last 500 dynamic instructions).
> >>>
> >>> 1) Like My66, Mill does not load constants from memory.
> >> Nor does BJX2, FWIW.
> >> In its current primary profiles, large constants are typically composed
> >> via Jumbo prefixes (which in effect horizontally paste together an
> >> immediate from multiple instruction words within a single bundle).
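
(For illustration, a sketch of what that pasting amounts to; the field
widths here are assumed for the example, not taken from the actual BJX2
encoding.)

#include <stdint.h>

/* Hypothetical: build a 64-bit immediate from two jumbo prefixes plus
   the base instruction's immediate field (24 + 24 + 16 bits assumed). */
uint64_t compose_imm64(uint32_t jumbo_hi, uint32_t jumbo_lo, uint32_t base)
{
    uint64_t hi  = jumbo_hi & 0xFFFFFF;    /* 24 bits from prefix 1  */
    uint64_t mid = jumbo_lo & 0xFFFFFF;    /* 24 bits from prefix 2  */
    uint64_t lo  = base     & 0xFFFF;      /* 16 bits in the base op */
    return (hi << 40) | (mid << 16) | lo;  /* pasted horizontally    */
}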
> >>
> >> My core also has a branch predictor.
> >>
> >> The branch predictor can be made cheap enough, and is accurate enough,
> >> that one may as well have it. It is in effect a simple lookup (based on
> >> low-order address bits and recent branch history), which is used as a
> >> 3-bit state machine.
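
(A minimal sketch of a predictor of that general shape; the table size,
hash, and update rule below are illustrative assumptions, not BGB's
actual design.)

#include <stdint.h>

#define PRED_BITS 12
#define PRED_SIZE (1u << PRED_BITS)

static uint8_t  ctr[PRED_SIZE];  /* 3-bit saturating counters, 0..7 */
static uint32_t hist;            /* recent branch outcome history   */

static uint32_t pred_idx(uint32_t pc)
{
    return ((pc >> 2) ^ hist) & (PRED_SIZE - 1);
}

int predict_taken(uint32_t pc)
{
    return ctr[pred_idx(pc)] >= 4;   /* upper half of range => taken */
}

void predictor_update(uint32_t pc, int taken)
{
    uint32_t i = pred_idx(pc);
    if (taken  && ctr[i] < 7) ctr[i]++;
    if (!taken && ctr[i] > 0) ctr[i]--;
    hist = (hist << 1) | (taken != 0);   /* shift in the outcome */
}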
> >>> 2) The stack will be in the D$1 and so is three cycles away. The
> >>> schedule can fill those three cycles with other instructions at full
> >>> width if the ILP is available; a 500 instruction issue queue buys nothing.
> >> Likewise.
> >>
> >> L1 access latency is fixed (*), can be scheduled around assuming one has
> >> something else to do in the meantime. This is a drawback of "small tight
> >> loops", namely that they have little else to fill the space with, so
> >> cycles will get wasted here.
> > <
> > That decision is implementation dependent.
> > <
> > Say we have a speed demon design with a small 3-cycle D$, 16KB-32KB.
> > Then
> > Say we have a server design with a large 4-cycle L1, 128KB-256KB.
> > <
> > Both designs use the same blocks to build the pipeline
> > Both are designed to hit the same FAB technology.
> > <
> > Speed demon might have 3 levels of cache while the server has 2, of
> > capacity equal to the speed demon's total.
> > <
> > Now you want to run the same code optimally on both implementations.
> > Mill addresses this with the Specializer.
> > My 66000 addresses this with small O
> > <-----------
> OK.
>
> I have 3-cycle L1, also 16 or 32K.
>
>
> From what I can gather, the idea would be that RISCs would have a fixed
> 2-cycle latency, but this is an issue for timing.
<
We have not been able to use 2-cycle loads for quite some time. In order
to get 2-cycle loads:
a) result delivery, wire delay, and forwarding take no more than ½ cycle
b) AGEN takes no more than ½ cycle
c) SRAM read takes no more than ½ cycle
d) Load Alignment takes no more than ½ cycle.
<
We do not have ½ cycle SRAM unless we are only clocking the pipeline
at DIV 2. Current SRAMs have a setup time before the rising edge of
a clock, and a delay time after the rising edge of the subsequent cycle.
So, SRAM is about 1¼ cycles.
<
A fast clock will have AGEN at ¾ clock and wire delay to the SRAMs of
more than ¼ clock.
<
Load Align is closer to ½ clock than ¼ clock due to fanout and wire delay.
<
That is just the rules of the game right now.
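
(Summing those figures: AGEN ¾ + wire ¼ + SRAM 1¼ + align ½ + result
delivery/forwarding ½ comes to roughly 3¼ cycles, which is why a 3-cycle
load-to-use latency is about the practical floor at a fast clock, and
2-cycle loads are out of reach.)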
>
> Eg:
> IF ID EX MA WB
>
> My cores have typically ended up with a few more stages than this...
>
NO fewer than 6, more likely no fewer than 7.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv1o7b$ddv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23751&group=comp.arch#23751

 by: BGB - Tue, 22 Feb 2022 04:18 UTC

On 2/21/2022 9:14 PM, MitchAlsup wrote:
> On Monday, February 21, 2022 at 7:52:34 PM UTC-6, BGB wrote:
>> On 2/21/2022 4:36 PM, MitchAlsup wrote:
>>>> <snip>
>>> <
>>> That decision is implementation dependent.
>>> <
>>> Say we have a speed demon design with a small 3-cycle D$, 16KB-32KB.
>>> Then
>>> Say we have a server design with a large 4-cycle L1, 128KB-256KB.
>>> <
>>> Both designs use the same blocks to build the pipeline
>>> Both are designed to hit the same FAB technology.
>>> <
>>> Speed demon might have 3 levels of cache while the server has 2, of
>>> capacity equal to the speed demon's total.
>>> <
>>> Now you want to run the same code optimally on both implementations.
>>> Mill addresses this with the Specializer.
>>> My 66000 addresses this with small O
>>> <-----------
>> OK.
>>
>> I have 3-cycle L1, also 16 or 32K.
>>
>>
>> From what I can gather, the idea would be that RISCs would have a fixed
>> 2-cycle latency, but this is an issue for timing.
> <
> We have not been able to use 2-cycle loads for quite some time. In order
> to get 2-cycle loads:
> a) result delivery, wire delay, and forwarding take no more than ½ cycle
> b) AGEN takes no more than ½ cycle
> c) SRAM read takes no more than ½ cycle
> d) Load Alignment takes no more than ½ cycle.
> <
> We do not have ½ cycle SRAM unless we are only clocking the pipeline
> at DIV 2. Current SRAMs have a setup time before the rising edge of
> a clock, and a delay time after the rising edge of the subsequent cycle.
> So, SRAM is about 1¼ cycles.
> <
> A fast clock will have AGEN at ¾ clock and wire delay to the SRAMs of
> more than ¼ clock.
> <
> Load Align is closer to ½ clock than ¼ clock due to fanout and wire delay.
> <
> That is just the rules of the game right now.

Yeah.
Seems so.

>>
>> Eg:
>> IF ID EX MA WB
>>
>> My cores have typically ended up with a few more stages than this...
>>
> NO fewer than 6, more likely no fewer than 7.

As-is:
IF ID1 ID2 EX1 EX2 EX3 WB
One could include a PF pseudo-stage:
PF IF ID1 ID2 EX1 EX2 EX3 WB

But PF overlaps with the IF of the prior fetch, so whether it is really
a stage is subject to interpretation.

The WB stage is also somewhat weakly defined; all it really does is
represent the registers being written back to the register file (prior
to passing through WB, a register's value is accessed via forwarding).

In some notations, what I call ID2 would be denoted RF:
IF ID RF EX1 EX2 EX3 WB
For 'Register Fetch'.
This stage doesn't actually decode anything; the whole cycle is spent
either fetching values from the register file or forwarding them from
the outputs of the EX stages.
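
For illustration, that RF-stage behavior could be modeled like this (a
sketch with invented names, not the actual implementation):

#include <stdint.h>

/* Output of each EX stage, visible as a forwarding source. */
typedef struct { int valid; int dst; uint64_t val; } Fwd;

extern Fwd fw_ex1, fw_ex2, fw_ex3;   /* youngest ... oldest in flight */
extern uint64_t regfile[64];         /* architectural register file   */

/* RF-stage operand fetch: prefer the youngest in-flight producer,
   falling back to the register file (values already past WB).
   A load's value is not ready until EX3; needing it earlier than
   that is what raises a load interlock instead. */
uint64_t rf_read(int r)
{
    if (fw_ex1.valid && fw_ex1.dst == r) return fw_ex1.val;
    if (fw_ex2.valid && fw_ex2.dst == r) return fw_ex2.val;
    if (fw_ex3.valid && fw_ex3.dst == r) return fw_ex3.val;
    return regfile[r];
}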

For memory access:
EX1:
  AGU runs here;
  Prepare state variables for the next stage of the request;
  Calculate cache-array index values;
  Hash the high-order bits of the request address;
  ...
EX2:
  Cache line fetched;
  Cache-miss handling, ...
EX3:
  Produce the final output (Load);
  Write back dirty lines to the L1 cache (Store).

Internally, the Block-RAM array can be accessed on a single clock edge
at 50 MHz (so the index needs to be calculated and fed in on one cycle,
with the result coming back the next cycle).

....
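
Putting the stages together, the load path above might be modeled
roughly like so (a sketch with invented names and a direct-mapped
array; not the actual Verilog):

#include <stdint.h>

#define NLINES 512   /* e.g. 16K direct-mapped, 32-byte lines */

typedef struct { uint32_t tag; int valid; uint8_t data[32]; } Line;
static Line l1d[NLINES];

int load32(uint32_t base, int32_t disp, uint32_t *out)
{
    /* EX1: AGU plus index/tag computation for the array lookup. */
    uint32_t addr  = base + (uint32_t)disp;
    uint32_t index = (addr >> 5) & (NLINES - 1);
    uint32_t tag   = addr >> 14;

    /* EX2: Block-RAM read of the indexed line, plus the hit check. */
    Line *ln = &l1d[index];
    if (!ln->valid || ln->tag != tag)
        return 0;   /* miss: stall the pipeline and refill */

    /* EX3: extract/align the addressed bytes into the final result. */
    uint32_t off = addr & 0x1C;   /* 4-byte-aligned offset in the line */
    *out =  (uint32_t)ln->data[off]
         | ((uint32_t)ln->data[off + 1] << 8)
         | ((uint32_t)ln->data[off + 2] << 16)
         | ((uint32_t)ln->data[off + 3] << 24);
    return 1;
}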

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv1qhc$o41$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23752&group=comp.arch#23752

 by: Ivan Godard - Tue, 22 Feb 2022 04:58 UTC

On 2/21/2022 2:22 PM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>> On 2/21/2022 2:55 AM, Anton Ertl wrote:
>>
>> <snip>
>>
>> .
>>>
>>> You seem to be thinking about in-order execution. Just because one
>>> instruction waits on the result of another instruction does not mean
>>> that all instructions do. Even looking 500 instructions ahead, an
>>> instruction coming from the front end might be ready (e.g., it loads a
>>> constant, or it loads a value from the stack, and the stack pointer
>>> has not been updated in the last 500 dynamic instructions).
>>
>> 1) Like My66, Mill does not load constants from memory.
>
> You have to watch out for one thing - for code like
>
> void foo (double *a, double *b, double *c)
> {
>     *a = 42.;
>     *b = 42.;
>     *c = 42.;
> }
>
> the compiler should not put the same constant in the instruction
> stream three times :-)

It doesn't.

Compiler common-subexpression optimization is 70 years old. You rebuild
the constant only when you have run out of places to keep the
already-built copy (belt or regs, depending). For machines like My66 and
x86 that have direct constant access in ordinary arith instructions,
there's a cost function in the compiler that decides between direct
access and load-and-reuse; this is the same kind of cost function that
decides whether to inline a call.
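
For illustration, the effect on Thomas's example, expressed back at the
source level (a sketch of what the compiler effectively does, not Mill
or My 66000 output):

void foo (double *a, double *b, double *c)
{
    const double k = 42.;   /* constant built once, kept in a register
                               (or belt position) */
    *a = k;                 /* the three stores reuse the same copy */
    *b = k;
    *c = k;
}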
