novaBBS - comp.arch - Re: Hoisting load issue out of functions; was: instruction set binding time

On 3/29/2022 6:15 AM, Paul A. Clayton wrote:
> Thank you for reading, and especially for responding to my
> post. I apologize for the delay in responding.
>
> I am not entirely sure this response is entirely coherent as
> it was composed at multiple times (with some versions having
> been abandoned) and my final reading before posting was not
> especially careful — I do feel I have delayed too long
> already.

Always welcome whenever :-)

After some thought I think that "load for him" (LFH) loads using the
deferral mechanism are impractical; they would tie the compiler's
instruction scheduler into knots and the resulting bad code would
obviate any gains.

LFH using the pickup form looks more likely, though as noted there are
encoding issues that would cause some bloat. However it looks like a
fairly straightforward implementation can load both for the immediate
callee and for nested callees, including across domain boundaries.

> Ivan Godard wrote:
> > On 2/20/2022 6:05 PM, Paul A. Clayton wrote:
> [snip proposal of caller initiated/callee realized loads]
>
>> Your proposal is one we have explored, though perhaps not far enough.
>> I'll explain the difficulties we found, in hopes that you might see
>> ways around them.
>>
>> First, there are presently two ways for a Mill load to retire,
>> implicitly via timeout and explicitly via the pickup instruction. Each
>> presents its own problems.
>>
>> An explicit load encodes a tag# argument, an arbitrary small constant
>> drawn from a set the size of the set of hardware retire stations. It
>> is not a physical RS#, but is mapped to one at issue in a way
>> essentially equivalent to a rename register in OOO. The corresponding
>> pickup instruction carries the same tag, and the mapping says which RS
>> to retire. Each call frame has its own tag namespace. The specializer
>> schedule and hardware checks ensure that loads don't issue an in-use
>> tag, nor pickups try to retire one that is not in flight, and so on.
>
> Side comment: I am a little surprised that hardware checks
> for trying to use an in-use slot. This seems like something
> that could be done straightforwardly in software once. (The
> software overhead might be undesirable for JIT compilation.)
> Trusting that the compilation system handles this correctly
> seems reasonable, though some have proposed invalid markers
> for registers to catch a similar problem of uninitialized
> register storage. (A pick-up attempt on an inactive station
> would have to error anyway — or perhaps return a not-a-value
> result like a permission violating load can though it is not
> clear when that would be useful. A pick-up attempt on an
> "ordinary", deferred load would be easily recognized anyway.)

By policy the Mill micro-architecture checks for and faults every
nonsensical thing the program does. Mill is extremely focused on RAS
issues, and every bit of uncaught nonsense multiplies the attack
surface. In addition, as different Mill versions might display different
behavior in response to uncaught nonsense, leaving something unchecked
invites customer howls of "but my program worked on your prior chip!",
and we have no wish to impose the curse of bug compatibility on our
corporate future.
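
To make the checking rule concrete, here is a minimal sketch of a per-frame tag namespace that faults on misuse. It is only a toy software model under my own assumptions: the names `Frame`, `StationFault`, `load` and `pickup` are hypothetical, the load completes instantly, and nothing here reflects the actual hardware mapping of tags to retire stations.

```python
class StationFault(Exception):
    """Hypothetical stand-in for an immediate fault on station misuse."""


class Frame:
    """Toy model of one call frame's tag namespace (all names are assumptions)."""

    def __init__(self, num_stations):
        self.num_stations = num_stations
        self.in_flight = {}  # tag -> value held for a later pickup

    def load(self, tag, address, memory):
        # Issuing a load to a tag outside the namespace is nonsense.
        if not 0 <= tag < self.num_stations:
            raise StationFault(f"tag {tag} out of range")
        # Issuing a second load to an in-use tag is nonsense too.
        if tag in self.in_flight:
            raise StationFault(f"tag {tag} already in flight")
        self.in_flight[tag] = memory[address]

    def pickup(self, tag):
        # A pickup with no preceding load faults rather than returning junk.
        if tag not in self.in_flight:
            raise StationFault(f"pickup of tag {tag} with no load in flight")
        return self.in_flight.pop(tag)
```

A second `load` to the same tag, or a `pickup` of a tag that was never loaded, raises `StationFault` in this model instead of producing version-dependent behavior.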

> [Thinking about returning a NaV, such *might* be useful for
> delinquent loads to support runahead or speculating that the
> value is not used. In the latter use, resuming execution at
> the load acceptance point would only be necessary on wrong
> speculation, but both uses would require a means for such
> resumption. If NaV state can be tested by software, this
> would also provide an "informing load", i.e., a load miss can
> be observed by software and an alternative work path used,
> but I suspect such is not the best way to architect informing
> loads.]

That's actually what happens when a load hits a protection violation:
you get a NaR at load-retire time, whether timeout or pickup. But
screwing up the station addressing gets you an immediate fault. The
difference is dynamic vs. static: an untaken speculative load could be
to a bad address, but speculative or no the code should never issue two
loads to the same station, nor pickup without a preceding load.

A NaR'd load that is on the taken path will get a NaR fault when it is
used non-speculatively, but one on an untaken speculative path will
never get used (untaken, after all) and so will harmlessly fall off the
belt.
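
The dynamic case can be sketched in the same toy style. This is an illustration under stated assumptions, not the Mill encoding: `NaR`, `NaRFault`, `retire_load`, `speculative_add` and `store` are hypothetical names, and permissions are modeled as a plain set of readable addresses.

```python
class NaR:
    """Toy not-a-result marker (hypothetical model, not the real metadata)."""

    def __init__(self, reason):
        self.reason = reason


class NaRFault(Exception):
    """Raised when a NaR reaches a non-speculative use."""


def retire_load(address, memory, readable):
    """Retire a load: a protection violation yields a NaR, not a fault."""
    if address not in readable:
        return NaR(f"protection violation at {address:#x}")
    return memory[address]


def speculative_add(a, b):
    """Speculable operation: a NaR operand propagates to the result."""
    for operand in (a, b):
        if isinstance(operand, NaR):
            return operand
    return a + b


def store(memory, address, value):
    """Non-speculative use: a NaR operand faults here."""
    if isinstance(value, NaR):
        raise NaRFault(value.reason)
    memory[address] = value
```

In this model a bad-address load on an untaken path produces a `NaR` that nobody ever stores, so nothing faults; the same value reaching `store` on the taken path raises `NaRFault`.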

>> To use your suggestion with explicit retire would require that there
>> be a tag namespace that crossed frame boundaries so a load could be
>> loaded in one and retired in a different one. Such a notion is very
>> un-Mill because we define interrupts, faults and traps as being mere
>> ordinary functions that are involuntarily called. As these can occur
>> at any time, a function cannot know whose frame is adjacent to its
>> own. This makes for a very easy call interface - you can only see your
>> own stuff and your explicit arguments - but makes sharing namespaces
>> across frames problematic.
>
> Could the following work? A load marks the load retire station
> entry via the load initializing operation/instruction in the
> caller as owned by the callee but *held* by the caller. On an
> interrupt or exception, held stations would be spilled just
> like owned stations. Hardware knows when a 'call' is an
> ordinary call and when it is an exception handler or external
> interrupt.
>
> This would increase the metadata associated with the load retire
> stations and increase the size of the load operation encoding
> (especially if the load is allowed to target calls other than
> the next call — next call might get most of the benefit, if
> there is any, especially since intermediate calls would lose the
> use of that station), but it would seem to handle 'involuntary
> calls' and call operation bloat (the load retire station
> argument is implicit, encoded in the state of the retire
> station, more similar to registers).
>
> Passing loads to calls other than the next one would also
> mean re-issuing the load after each intervening return until
> the targeted call or "wasting" a retire station by reserving
> it through all intermediate calls. (It is not clear that the
> filler engine could be smart enough and be given enough
> information to insert the load just-in-time, as if the
> initial load instruction was a prefetch into L1 — or perhaps
> a small pending-load cache with more capacity than retire
> stations but less latency than L1 — and a second issuing of
> the load would complete the load. Getting the second issue
> to complete such that the first cycle of the load-targeted
> function could contain the pick-up seems challenging.)
>
> Since the value is spilled into the caller's storage, which is
> accessible to the caller's debug system, this would restrict
> the loads to those the caller is allowed to perform.
> Effectively, the caller is loading the value and then passing
> it as an implicit argument.
>
> The load value reception would be checked against the *owner*
> rather than the *holder*, so if the load completed before the
> call it could still receive its value.

I don't think that an owner marker is necessary if tags are explicitly
passed as part of the signature. And if they are not then I don't see a
way to match load issue (in caller) with load retire (in callee).
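
As a rough illustration of what "tags passed as part of the signature" could mean, the sketch below moves named caller loads into a fresh callee namespace at call time. Everything in it is an assumption made for illustration: the function name `call_with_tags`, the dict-based model of in-flight loads, and the choice to renumber passed tags by their position in the argument list.

```python
def call_with_tags(caller_in_flight, passed_tags):
    """Move the named caller loads into a fresh callee tag namespace.

    caller_in_flight maps caller tag -> in-flight load (hypothetical model).
    Returns the callee's map, keyed by position in passed_tags.
    """
    # Validate first so a bad call leaves the caller's namespace untouched.
    if len(set(passed_tags)) != len(passed_tags):
        raise ValueError("a tag may be passed only once")
    for caller_tag in passed_tags:
        if caller_tag not in caller_in_flight:
            raise KeyError(f"caller tag {caller_tag} is not in flight")

    callee_in_flight = {}
    for callee_tag, caller_tag in enumerate(passed_tags):
        # The load leaves the caller's namespace; only the callee can retire it.
        callee_in_flight[callee_tag] = caller_in_flight.pop(caller_tag)
    return callee_in_flight
```

With caller loads in flight on tags 3, 5 and 7, passing `[5, 3]` leaves only tag 7 in the caller and gives the callee tags 0 and 1, which an ordinary pickup could then retire. No owner marker appears anywhere in the model, because the explicit list already says which loads cross the call.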

> [Side thought: I wonder if distinguishing "unaliased" loads,
> possibly aliased loads, and shared memory loads would be
> worthwhile. Loads that are known at compile time not to have
> aliases (or possibly even be guarded by a lock for shared
> memory) would not need to be checked against store addresses
> nor against cache coherence probes. Aliased but "unshared"
> loads would only have to check against local stores. I suspect
> the complexity and resource utilization imbalance would
> argue against such, but it seems wrong to check for an
> impossibility (though always checking can be more efficient
> than extracting special cases). Itanium's ALAT mechanism seemed
> clunky because all advanced loads were cache coherent even if
> the storage was thread local. An unaliased load could, in
> theory, retire early, though such would only seem to help the
> Mill when a load returns and the retire station is spilled —
> currently all loads are presumed aliasable and, if I
> understand correctly, the spiller spills the address and the
> filler reinstalls the load — the load would not have to be
> retried after fill as the value was saved and could not have
> been altered. This seems likely to be a rare case.]

Aliasing is an issue even in monocores where there is no cache coherency
necessary. It's simplest just to have the stations snoop.

> Presumably a return from the function targeted by the load
> without a pick-up operation in that function would free the
> reservation (though one might alternatively require such
> loads to be 'acknowledged' by the function — not necessarily
> dropped on the belt — to clear the reservation).

Best to just have the same "use it or kill it" rule that regular loads have.

> If I understand the Mill's load mechanism adequately, such
> should support cross-call loads without preventing involuntary
> calls from properly preserving state or bloating the call
> operation.
>
> (Supporting loads that cross permission domains seems likely
> to be too challenging. The hardware presumably would not be
> able to easily determine what the permission domain will be
> at the time of the load — in theory, hardware could scan
> ahead and prefetch the permission domain of the callee, but
> that seems a bit much for a likely marginal performance
> benefit.)
>
> I am conceptually aware that complexity adds to design time
> and effort, increases the likelihood of errors — really bad
> for hardware and compilers —, and tends toward fragility of
> performance (e.g., providing a 10% benefit in a 1% case most
> of the time but sometimes causing a 20x slowdown).
>
>> The alternative to sharing is of course argument passing, as you
>> suggest: the call instruction could contain load tags from its caller
>> tag set, to be picked up by some kind of callee signature instruction
>> that maps the argument tags to the callee tag namespace, whence a
>> normal pickup can retire them. The problem is one of encoding: the
>> call instruction already has a variable length list of belt arguments,
>> and this would require a variable length list of tag arguments as
>> well. We never found a satisfactory way to encode two lists in one
>> instruction.
>
> It is not clear why the load operation could not encode a larger
> tag space to include the callee. The retire stations could
> be implicit arguments relative to the call operation. This seems
> to be what you suggest elsewhere.
>
>> The implicit load method has its own issues. For a load to timeout in
>> a callee it must be counting in that callee. However there may be
>> other calls to irrelevant functions between the load issue and the
>> intended target call, and the load should not count during those. Note
>> that the intervening non-counting calls may include interrupts and
>> exceptions that are not statically schedulable. Consequently, a call
>> must be able to indicate which in-flight loads should continue
>> counting across the call - and we are back in the encode-two-lists
>> problem again.
>
> If this was only available for the next call, some of those
> issues would be removed. (A pick-up operation could avoid all the
> counting factors.) If the retire station is saved for all
> intermediate functions (and not just involuntary ones), the
> timing problem would seem to be avoided at the cost of
> excessive reissuing of the load on every return (unless
> something like a return counter was used — a return counter
> might have the interesting use where a load uses a retire
> station in the function before the targeted function, i.e.,
> the return counter is one less than 'expected').
>
>> The second list can be avoided if a callee is permitted to reference
>> caller tags ordinally, sort of a tag belt or tag RF. The callee would
>> then have an instruction that says in effect "retire the 3rd most
>> recent load issued by the caller" or "retire what the caller calls
>> tag5". There are obvious security holes in this because caller and
>> callee may be in different mutually untrusting turfs.
>
> I was only thinking of loads that would not have permission
> differences. Unless one provided an encapsulated address, the
> load operation would need to know the address which seems
> uncommon for untrusting actors.
>
> (I suppose, in theory, one could have a pick-up operation that
> specifies an address check. This might be more like reissuing
> the load but not performing a normal load but checking the
> retire stations — or a special cache, in which case a miss
> would presumably convert the operation into a normal load.
> For such a cross-function load, it seems a returned value
> would have to be stored in protected storage so a debugger
> for the function starting the load could not look in the
> spill area and retrieve secret values. This would seem less
> unreasonable with a cache rather than a retire station.)
>
>> We did not see any compiler issues with your proposal: hoisting loads
>> out of callees into the call sites amounts to partial inlining, which
>> is well known compiler tech. We'd welcome any ideas you might have
>> that address the encoding and security issues we hit, while preserving
>> Mill axioms such as traps being involuntary calls of normal functions.
>
> I am not certain any of the above helps.
>
