"If that makes any sense to you, you have a big problem." -- C. Durance, Computer Science 234

devel / comp.arch / Re: Using vector registers to stack return addresses

Looking at the Phoenix ISA, I got to thinking that letting a call instruction
store the return address in a vector register would trim only one bit from the
target address field. Since instructions are 40 bits wide, that leaves 27 bits
for the target, which is more than enough. (Targets can be extended further with
postfix constants.) The vector register could then be used as a return address
stack if the function call depth were known to be limited. A vector slide
operation in the function prologue / epilogue would take care of being able to
call other functions.

I am looking for more information on the use of vector registers for function
calls. Is this sort of thing supported by compilers?

Call v63,,My_rout
…
My_rout:
VSLLVI v63,v63,1 # stack return address in vector register
…
Call SomeFn # better not go more than 12 deep
…
VSRLVI v63,v63,1 # restore return address
RET
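
To make the slide idea concrete, here is a rough C model of it, not Phoenix code. The lane count (`LANES`), the fake return addresses, and all function names are assumptions made up for the sketch. The call writes lane 0, the prologue slides every lane up by one, and the epilogue slides them back down before the return.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16  /* hypothetical lane count; the post does not give one */

typedef struct { uint64_t lane[LANES]; } vreg_t;

/* CALL v63,,target: the call writes the return address into lane 0. */
static void vcall(vreg_t *v, uint64_t ret_addr) { v->lane[0] = ret_addr; }

/* Prologue slide: lane i moves to lane i+1 and lane 0 is cleared.
   Whatever was in the top lane falls off the end and is lost. */
static void vslide_up(vreg_t *v) {
    memmove(&v->lane[1], &v->lane[0], (LANES - 1) * sizeof v->lane[0]);
    v->lane[0] = 0;
}

/* Epilogue slide: lane i+1 moves back to lane i, the top lane is cleared. */
static void vslide_down(vreg_t *v) {
    memmove(&v->lane[0], &v->lane[1], (LANES - 1) * sizeof v->lane[0]);
    v->lane[LANES - 1] = 0;
}

/* RET: jump to the address held in lane 0. */
static uint64_t vret(const vreg_t *v) { return v->lane[0]; }

/* Simulate `depth` nested calls and then unwind them.
   Returns 1 if every function returned to its own caller, else 0. */
static int nested_calls_ok(int depth) {
    vreg_t v63 = {{0}};
    assert(depth >= 0);
    for (int i = 1; i <= depth; i++) {
        vcall(&v63, 0x1000u + (uint64_t)i);  /* function i is called */
        vslide_up(&v63);                     /* prologue of function i */
    }
    for (int i = depth; i >= 1; i--) {
        vslide_down(&v63);                   /* epilogue of function i */
        if (vret(&v63) != 0x1000u + (uint64_t)i) return 0;
    }
    return 1;
}
```

In this model a register with N lanes survives N-1 nested non-leaf calls. One level deeper and the oldest return address silently slides off the end, which is the overflow case that needs handling somewhere.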

On Saturday, September 10, 2022 at 2:04:19 AM UTC-5, robf...@gmail.com wrote:
> I got to thinking looking at the Phoenix ISA that it would be trimming only one
> bit from the target address of a call instruction to allow storing the return
> address in a vector register. Since instructions are 40-bit that left 27 bits for
> the target which is more than enough. (Targets can be extended further with
> postfix constants). The vector register could then be used as a return address
> stack if the function call depth were known to be limited. A vector slide
> operation in the function prologue / epilogue could take care of being able to
> call other functions.
>
> I was looking for more information on the use of vector registers for function
> calls. And is this sort of thing supported by compilers?
>
> Call v63,,My_rout
> …
> My_rout:
> VSLLVI v63,v63,1 # stack return address in vector register
> …
> Call SomeFn # better not go more than 12 deep
> …
> VSRLVI v63,v63,1 # restore return address
> RET
<
I went the other way:: When safe stack is in use, the return address is not
placed in R0, but on a stack where LD and ST instructions will take Faults,
but ENTER and EXIT work as desired. That is secure from ROP, and buffer
overrun attack strategies.
<
Other than "there are a crap load of individual containers" in CRAY-like vector
machines, I can't see how the boundary cases are handled with aplomb.

robf...@gmail.com wrote:
> I got to thinking looking at the Phoenix ISA that it would be trimming only one
> bit from the target address of a call instruction to allow storing the return
> address in a vector register. Since instructions are 40-bit that left 27 bits for
> the target which is more than enough. (Targets can be extended further with
> postfix constants). The vector register could then be used as a return address
> stack if the function call depth were known to be limited. A vector slide
> operation in the function prologue / epilogue could take care of being able to
> call other functions.
>
> I was looking for more information on the use of vector registers for function
> calls. And is this sort of thing supported by compilers?
>
> Call v63,,My_rout
> …
> My_rout:
> VSLLVI v63,v63,1 # stack return address in vector register
> …
> Call SomeFn # better not go more than 12 deep
> …
> VSRLVI v63,v63,1 # restore return address
> RET

Sounds like a variation on register windows which had good cache-hit
performance but also some clunky parts for particular implementations.
Dealing with asynchronous non-coherence is in part what makes it clunky.

My guess is you would probably find similar performance stats to
register windows, and similar clunky parts like dealing with
overflows, underflows, setjmp and exceptions.

On Saturday, September 10, 2022 at 2:19:29 PM UTC-4, EricP wrote:
> Sounds like a variation on register windows which had good cache-hit
> performance but also some clunky parts for particular implementations.
> Dealing with asynchronous non-coherence is in part what makes it clunky.
>
> My guess is you would probably find similar performance stats to
> register windows, and similar clunky parts like dealing with
> overflows, underflows, setjmp and exceptions.

Back to the drawing board. It was another late-night idea. The next idea is to
use a small stack of registers with lazy writes to memory. Except in leaf
functions, the return address ends up on the stack anyway, so it may be better
to go without a dedicated return address register.
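
A minimal C sketch of what "a small stack of registers with lazy writes to memory" could look like, under my own assumptions: `RS_REGS`, `RS_MEM`, and all names are hypothetical, and real hardware would do the spill and fill in the background. The newest return addresses stay in registers. The oldest entry is written to memory only when a push finds the register stack full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RS_REGS 4   /* hypothetical number of on-chip return-address registers */
#define RS_MEM  64  /* hypothetical size of the in-memory backing stack */

typedef struct {
    uint64_t reg[RS_REGS]; /* circular buffer holding the newest entries */
    size_t   head;         /* index of the oldest entry held in registers */
    size_t   count;        /* entries currently held in registers */
    uint64_t mem[RS_MEM];  /* backing store, written only on overflow */
    size_t   mem_count;    /* entries currently spilled to memory */
    size_t   mem_writes;   /* statistic: number of spills so far */
} ret_stack_t;

/* Push a return address; spill the oldest register entry only if full. */
static void rs_push(ret_stack_t *s, uint64_t addr) {
    if (s->count == RS_REGS) {
        assert(s->mem_count < RS_MEM);
        s->mem[s->mem_count++] = s->reg[s->head];
        s->head = (s->head + 1) % RS_REGS;
        s->count--;
        s->mem_writes++;
    }
    s->reg[(s->head + s->count) % RS_REGS] = addr;
    s->count++;
}

/* Pop the newest return address; read memory only if registers are empty. */
static uint64_t rs_pop(ret_stack_t *s) {
    if (s->count == 0) {
        assert(s->mem_count > 0);
        return s->mem[--s->mem_count];
    }
    s->count--;
    return s->reg[(s->head + s->count) % RS_REGS];
}
```

Call chains no deeper than `RS_REGS` never touch memory in this model; deeper chains pay one memory write per extra level and one read on the way back.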

I like the ENTER / EXIT approach. It makes things simple. But I do not want to
add micro-code to the current project. So, some assembler macros could be
used to implement ENTER and EXIT.

For safe stack, do return addresses have their own stack independent of
function arguments within a memory page that does not allow LD / ST?

On 9/11/2022 1:48 AM, robf...@gmail.com wrote:
> Back to the drawing board. It was another late-night idea. Next idea is to use a
> small stack of regs with lazy writes to memory. Other than leaf function the
> return address ends up on the stack anyway, therefore it may be better to go
> without a dedicated return address register.
>
> I like the ENTER / EXIT approach. It makes things simple. But I do not want to
> add micro-code to the current project. So, some assembler macros could be
> used to implement ENTER and EXIT.
>
> For safe stack, do return addresses have their own stack independent of
> function arguments within a memory page that does not allow LD / ST?
>
>

FWIW: In my case, I just use normal memory for the stack frames (though
this does not allow for "safe stack").

Much of the time though, the compiler does add "stack canaries" which
will trigger a breakpoint if trying to return from a function which does
not have these values intact (no real way to disable these at present;
but they are skipped for small leaf functions and similar).

Well, I could also probably consider adding an option to skip adding
exception-unwinding cruft, since this is detrimental to code density,
and C generally does not use try/catch exceptions (but, if exceptions
could potentially occur anywhere, this unwinding cruft is needed along
pretty much the entire call stack).

Some things, like the Boot ROM though, are not going to use try/catch
exceptions (ever), so it is basically a pure overhead in this case.

But, it is still less overhead than would be needed to support things
like "call with current continuation" or fine-grained "async", which
would effectively also require non-trivial ABI modifications. Currently
BGBCC still uses a more traditional ABI design, which by itself could not
natively support call/cc or async. This assumes both are implemented
internally on top of continuations, which in turn effectively implies
heap allocation for parts of the call frames and similar. Note
that at present, VLAs, lambdas, and "alloca()" are already implemented
via heap allocation, but lambda capture rules are restricted
in a way that keeps them compatible with a traditional C style ABI
(they are not quite as generalized as in Common Lisp or Scheme, but
those semantics would also require ABI tweaks and would have a non-zero
performance impact).

Not really sure if more mainstream compilers have better options for a
lot of this.

Though, I did come up with a trick that (usually) allows
branch-predicting the return:
The return link register is loaded into a register before reloading the
other registers.

So, the idea here is that by the time it reaches the final return
instruction (usually encoded as "JMP R1" in this case), the load has
already finished, and the branch predictor (with side-channel inputs for
LR and R1) can special-case the branch instruction (if the load has not
finished, it falls back to the slower "generic" case used for most other
registers).

Early on, there was a "RET" instruction which would pop a value from the
stack and then branch to it, but it was dropped around the same time I
dropped the PUSH/POP instructions in favor of normal memory Load/Store.

Similarly, the sequence:
Reload original LR into R1;
Reload other registers;
Adjust Stack;
Jump to R1.

is also generally faster than the original RET instruction would have
been (RET took roughly 12 clock cycles).

Partial issue would have also existed with PUSH/POP, which would have
been slower on average than plain Load/Store (as well as also needing
special internal plumbing for the SP adjustments).

Re: Using vector registers to stack return addresses

by: MitchAlsup - Sun, 11 Sep 2022 16:47 UTC

On Sunday, September 11, 2022 at 1:48:02 AM UTC-5, robf...@gmail.com wrote:
> Back to the drawing board. It was another late-night idea. Next idea is to use a
> small stack of regs with lazy writes to memory. Other than leaf function the
> return address ends up on the stack anyway, therefore it may be better to go
> without a dedicated return address register.
>
> I like the ENTER / EXIT approach. It makes things simple. But I do not want to
> add micro-code to the current project. So, some assembler macros could be
> used to implement ENTER and EXIT.
>
> For safe stack, do return addresses have their own stack independent of
> function arguments within a memory page that does not allow LD / ST?
<
Return address and the preserved registers go on the safe-stack, anything
else goes on the normal stack (varargs arguments, for example.)
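
A toy C model of the split described here, to show the rule rather than the actual hardware. The enum names, `mstack_t`, and the 32-slot size are all assumptions for the sketch. Ordinary loads and stores to the safe stack fault, while ENTER/EXIT style accesses go through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACC_LOAD_STORE, ACC_ENTER_EXIT } access_t;  /* who is accessing */
typedef enum { STK_NORMAL, STK_SAFE } stack_kind_t;        /* which stack */

typedef struct {
    uint64_t     slot[32];  /* hypothetical capacity */
    size_t       sp;        /* number of slots in use */
    stack_kind_t kind;
} mstack_t;

/* Returns 1 if the access is permitted, 0 if it would fault. */
static int access_ok(stack_kind_t kind, access_t how) {
    return kind == STK_NORMAL || how == ACC_ENTER_EXIT;
}

/* Try to push a value; returns 1 on success, 0 on fault or overflow. */
static int stack_write(mstack_t *s, access_t how, uint64_t value) {
    if (!access_ok(s->kind, how) || s->sp == 32) return 0;
    s->slot[s->sp++] = value;
    return 1;
}
```

With this rule a buffer overrun, which is made of ordinary stores, cannot reach a saved return address, because that address is never in memory those stores are allowed to touch.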

On Sunday, September 11, 2022 at 2:43:30 AM UTC-5, BGB wrote:
> FWIW: In my case, I just use normal memory for the stack frames (though
> this does not allow for "safe stack").
>
>
> Much of the time though, the compiler does add "stack canaries" which
> will trigger a breakpoint if trying to return from a function which does
> not have these values intact (no real way to disable these at present;
> but they are skipped for small leaf functions and similar).
>
> Well, I could also probably consider adding an option to skip on adding
> exception-unwinding cruft, since this is detrimental to code density,
> and generally C does not use try/catch exception (but, if exceptions
> could potentially occur anywhere, this unwinding cruft is needed along
> pretty much the entire call stack).
>
> Some things, like the Boot ROM though, are not going to use try/catch
> exceptions (ever), so it is basically a pure overhead in this case.
>
> But, they are still less overhead than would be needed to support things
> like "call with current continuation" or fine-grained "async"; which
> would effectively also require non-trivial ABI modifications. Currently
> BGBCC still uses a more traditional ABI design which by itself could not
> natively support call/cc or async. Assuming both are implemented
> internally on top of continuations; which in turn effectively also imply
> the using heap allocation for parts of the call frames and similar. Note
> that at present, VLAs, lambdas, and "alloca()", are already implemented
> via heap-allocating this stuff, but lambda capture rules are restricted
> in a way that they are still compatible with a traditional C style ABI
> (they are not quite as generalized as in Common Lisp or Scheme, but
> these semantics would also require ABI tweaks and would have a non-zero
> performance impact).
>
> Not really sure if more mainstream compilers have better options for a
> lot of this.
>
>
>
> Though, I did come up with a trick that (usually) allows
> branch-predicting the return:
> The return link register is loaded into a register before reloading the
> other registers.
<
EXIT does this:: EXIT reads the return address first, and then proceeds
to reload the preserved registers, so instructions at return target are
available when all the registers have been reloaded. As long as the
first instructions at return target are not memory-references, those
instructions can be performed while register reload is in progress.
<
Due to typical semantics of RET, compilers cannot do this with
code scheduling (movement). Also note: return address does not
pass through (and thereby wasting) a register.
>
> So, the idea here is that by the time it reaches the final return
> instruction (usually encoded as "JMP R1" in this case), the load has
> already finished, and the branch predictor (with side-channel inputs for
> LR and R1) can special-case the branch instruction (if the load has not
> finished, it falls back to the slower "generic" case used for most other
> registers).
>
>
> Early on, there was a "RET" instruction which would pop a value from the
> stack and then branch to it, but it was dropped around the same time I
> dropped the PUSH/POP instructions in favor of normal memory Load/Store.
>
> Similarly:
> Reload original LR into R1;
> Reload other registers;
> Adjust Stack;
> Jump to R1.
>
> Is also generally faster than the original RET instruction would have
> been (roughly 12 clock cycles).
>
> Partial issue would have also existed with PUSH/POP, which would have
> been slower on average than plain Load/Store (as well as also needing
> special internal plumbing for the SP adjustments).

On 9/11/2022 11:55 AM, MitchAlsup wrote:
> On Sunday, September 11, 2022 at 2:43:30 AM UTC-5, BGB wrote:
>> Though, I did come up with a trick that (usually) allows
>> branch-predicting the return:
>> The return link register is loaded into a register before reloading the
>> other registers.
> <
> EXIT does this:: EXIT reads the return address first, and then proceeds
> to reload the preserved registers, so instructions at return target are
> available when all the registers have been reloaded. As long as the
> first instructions at return target are not memory-references, those
> instructions can be performed while register reload is in progress.
> <
> Due to typical semantics of RET, compilers cannot do this with
> code scheduling (movement). Also note: return address does not
> pass through (and thereby wasting) a register.

This is part of why I eliminated things like POP and RET.

Say:
POP.X Rn // 3 cycles every time.
MOV.X (SP, disp), Rn // 1 cycle on average

RET: ~ 12 cycles

And:
MOV.Q (SP, disp), R1 // 1 cycle
...
JMP R1 // 2 cycles

Though, this uses R1 because the MOV.C instruction is not part of the
core ISA, and R1 has ended up being defined as a de-facto "secondary link
register".

Prolog:
ADD -xxx, SP
MOV LR, R1 //1c
MOV.Q R1, (SP, disp) //1c

For prolog compression, it is usually copied to R18.
In the 128-bit ABI, R19:R18 is used for the "this" pointer (in C++ and
BS2 modes), where in the 64-bit ABI, R3 is used for "this".

Stack canary checks will add some cases like:
LDIZ xxxx, R16 // 1c
MOV.Q R16, (SP, disp) // 1c
...
MOV.Q (SP, disp), R16 // 2c
LDIZ xxxx, R17 // 1c
CMPQEQ R16, R17 // 1c
BREAK?F // 1c

So, this adds around 7 cycles to each function call/return when used, with
xxxx being a combination of a pseudo-random number and a hash.

Generally, it is used to protect the saved register area of the stack
from potential damage due to buffer overruns.
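
The same check can be sketched in plain C. This is only an illustration of the idea, not what BGBCC emits: the frame layout, the `CANARY` constant, and the function names are assumptions. The canary sits between a local buffer and the saved return address, so an overrun from the buffer has to trample it first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical canary constant; the post only says it mixes a
   pseudo-random value with a hash. */
#define CANARY 0x5A17C0DEu

typedef struct {
    char     buf[8];    /* local buffer, lowest address in the frame */
    uint32_t canary;    /* guard word between buffer and saved registers */
    uint64_t saved_lr;  /* saved return address */
} frame_t;

/* Prologue: store the canary and the return address. */
static void frame_enter(frame_t *f, uint64_t lr) {
    memset(f->buf, 0, sizeof f->buf);
    f->canary = CANARY;
    f->saved_lr = lr;
}

/* Epilogue check: 1 if the canary is intact, 0 if the function should
   trap (BREAK) instead of returning. */
static int frame_check(const frame_t *f) {
    return f->canary == CANARY;
}

/* Deliberately unchecked copy starting at the buffer, to model an overrun.
   The assert only keeps the demo inside the frame object. */
static void unsafe_copy(frame_t *f, const void *src, size_t n) {
    assert(n <= sizeof *f);
    memcpy((unsigned char *)f, src, n);
}
```

Copying 8 bytes fills the buffer exactly and leaves the canary alone; copying 12 bytes overwrites it and `frame_check` reports the damage before `saved_lr` would be used.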

OTOH:

The exception handling mechanism was derived from a similar design to
that used in WinCE and Win64 (and by MSVC++). It does not add any
(explicit) run-time overhead, but does add code space overhead.
Typically, there is a small blob of instructions after the epilog which interacts
with the exception-unwinding mechanism, and which would be where one would put
any code for dispatching into "catch" blocks.

This is combined with a table that basically encodes ranges of ".text"
addresses used to locate the corresponding unwind/catch handlers for a
given PC address (with some bit-twiddling and trickery used to try to keep
this table reasonably compact, *).
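
For illustration, a lookup over such a table might look like the following C sketch. It uses a plain uncompressed entry format, which is an assumption; the actual table uses a compacted encoding that is not described here. `unwind_entry_t` and `unwind_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per function: a half-open range of .text offsets plus the
   offset of that function's unwind/catch blob (uncompressed form). */
typedef struct {
    uint32_t start;    /* first offset covered by this entry */
    uint32_t end;      /* one past the last offset covered */
    uint32_t handler;  /* offset of the unwind/catch handler blob */
} unwind_entry_t;

/* Binary search over a table sorted by start with non-overlapping ranges.
   Returns the covering entry, or NULL if pc falls in no range. */
static const unwind_entry_t *unwind_lookup(const unwind_entry_t *tab,
                                           size_t n, uint32_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tab[mid].start) {
            hi = mid;
        } else if (pc >= tab[mid].end) {
            lo = mid + 1;
        } else {
            return &tab[mid];
        }
    }
    return NULL;
}
```

The unwinder would call this once per frame with that frame's return PC, so the cost is paid only when an exception is actually in flight.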

This has a non-zero space overhead.

*: For Doom, this table is ~ 3% of the total binary size, with an
additional ~ 7% overhead due to the per-function unwind handling logic,
adding around 10% to the size of the binary.

The alternative would have been to do it more like Win32 SEH, which
would have less average code-footprint overhead for C, but would have a
more significant cost for each try/catch block (which would now
need to do something akin to a "setjmp" along with adding an exception
handler to a linked list).

Say, pseudocode for such a mechanism:
  jmp_buf_ex_t t_eh;
  jmp_buf_ex_t *eh, *old_eh;
  taskinfo_t *tki;

  eh = &t_eh;
  if(__setjmp_ex(eh))
  {
    ... catch handling ...
  }else
  {
    tki = __arch_tbr;
    old_eh = tki->seh_ptr;  // remember the enclosing handler
    eh->next = old_eh;
    tki->seh_ptr = eh;      // link this handler in
    ... try block ...
    tki->seh_ptr = old_eh;  // unlink on normal exit
  }

The argument, though, is that this would make the average-case cost of
each "try" block significantly higher in the "no exception was thrown"
case. Similarly, it would also (significantly) increase the cost of
things like RAII destructors (if throwing an exception is assumed to
also call any destructors along the call graph).
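
A portable approximation of that scheme in standard C, using `setjmp`/`longjmp` in place of `__setjmp_ex` and a global in place of the per-task `seh_ptr`. Every name here (`eh_frame_t`, `eh_top`, `eh_throw`, `guarded_div`) is made up for the sketch. It shows where the per-try cost comes from: the `setjmp` and the list linking run on every entry, thrown or not.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct eh_frame {
    jmp_buf          jb;    /* saved context of the try block */
    struct eh_frame *next;  /* enclosing handler, if any */
} eh_frame_t;

static eh_frame_t *eh_top = NULL;  /* stands in for tki->seh_ptr */
static int eh_code = 0;            /* code of the last thrown exception */
static int try_entries = 0;        /* counts how often a try block was set up */

/* Throw: unlink the innermost handler and transfer control to it. */
static void eh_throw(int code) {
    eh_frame_t *eh = eh_top;
    assert(eh != NULL);
    eh_top = eh->next;
    eh_code = code;
    longjmp(eh->jb, 1);
}

/* Returns a / b, or -1 if the try block threw (division by zero). */
static int guarded_div(int a, int b) {
    eh_frame_t eh;
    if (setjmp(eh.jb) != 0) {
        return -1;              /* catch handling */
    }
    eh.next = eh_top;           /* paid on every entry, thrown or not */
    eh_top = &eh;
    try_entries++;
    if (b == 0) eh_throw(1);    /* try block */
    int q = a / b;
    eh_top = eh.next;           /* unlink on normal exit */
    return q;
}
```

The table-driven approach moves all of this setup out of the normal path, at the price of the code-size overhead noted above.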

But, the (current) vectored mechanism sort of deals with this by making
everything more expensive whether it uses exceptions or not (where it
can be argued that C code has no use for exceptions, and ideally C code
can just use "signal" or similar as a fallback; where an uncaught
exception and/or unwind failure would turn it into a "signal").


Subject	Author
Using vector registers to stack return addresses	robf...@gmail.com
Re: Using vector registers to stack return addresses	MitchAlsup
Re: Using vector registers to stack return addresses	EricP
Re: Using vector registers to stack return addresses	robf...@gmail.com
Re: Using vector registers to stack return addresses	BGB
Re: Using vector registers to stack return addresses	MitchAlsup
Re: Using vector registers to stack return addresses	BGB
Re: Using vector registers to stack return addresses	MitchAlsup