Re: Concertina II Progress

https://www.novabbs.com/devel/article-flat.php?id=35066&group=comp.arch#35066

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Tue, 21 Nov 2023 21:36:30 -0600
Message-ID: <ujjt01$16pav$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
In-Reply-To: <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
 by: BGB - Wed, 22 Nov 2023 03:36 UTC

On 11/21/2023 4:12 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/20/2023 11:31 AM, Stephen Fuld wrote:
>>> On 11/15/2023 1:10 PM, MitchAlsup wrote:
>
>
>> For some ops, the 3rd register (Ro) would instead operate as a 5-bit
>> immediate/displacement field. Which was initially a similar idea, with
>> the 32-bit space mirroring the 16-bit space.
>
> Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
> register specifier into a 5-bit immediate of either positive or negative
> integer value. This makes::
>
>     1<<n
>    ~0<<n
>     container.bitfield = 7;
>
> single instructions.
>

Originally, the pattern depended on the 16-bit operation, IIRC:
   (Rm), Rn      => (Rm, Disp5), Rn
   (Rm, R0), Rn  => (Rm, Ro), Rn
ALU Ops:
   OP Rm, Rn     => OP Rm, Ro, Rn
   OP Rm, R0, Rn => OP Rm, Imm5u, Rn

Initially, BJX2 started out in a similar camp to BJX1, but when it
became obvious that the 16-bit and 32-bit encodings effectively needed
separate encoders, there was no real point keeping up the concept of
32-bit ops being prefix-extended 16-bit ops.

Then some other analysis/testing showed that for "general case
tradeoffs", it was better to have an ISA with primarily 32-bit encodings
with a 16-bit subset, than one with primarily 16-bit encodings with
32-bit extended forms (though, by this point, I had already settled on
the general encoding scheme).

The main practical consequence of this realization was that the ISA did
not need to be able to operate entirely within the limits of the 16-bit
encoding space (but, did need to be able to operate without any of the
16-bit encodings).

After more development, I now have:
   Imm5u/Disp5u, some ops (Baseline)
     Imm6s/Disp6s (XG2)
   Imm9u: Typical ALU ops
     Imm10u (XG2)
   Imm9n: A few ALU ops
     Imm10n (XG2)
   Disp9u: LD/ST ops
     Disp10s (XG2)
       TBD if Disp10u+Disp6s would have been better.
       Since negative displacements are still pretty rare.
       Might have been better to have larger positive displacements.
   Imm10{u/n}: Various 2RI ops
     Imm11{u/n} (XG2)
   Disp11s / Disp12s (XG2), Branch-Compare-Zero
     Effectively uses an opcode bit as the sign bit.
   Imm16u/Imm16n: Some 2RI ops.
   Disp20s: BRA/BSR
     Disp23s (XG2)
   Imm24{u/n}: LDIZ/LDIN ("MOV Imm25s, R0")

However, they are only available in specific combinations.
Imm9u: ADD, ADDS.L, ADDU.L, AND, OR, XOR, SH{A/L}D{L/Q}, MULS, MULU
Imm9n: ADD, ADDS.L, ADDU.L

Which does mean that, say:
   y=x&(~7);
needs either a constant loaded into a register or a jumbo prefix (AND has
an Imm9u form but no Imm9n form, and ~7 does not fit in an unsigned 9-bit
field).
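
As a rough illustration of why that constant falls outside the 32-bit forms, here is a minimal C sketch. It assumes Imm9u is a zero-extended 9-bit field and Imm9n a one-extended (negative) 9-bit field; the function names and the two-entry opcode enum are hypothetical and only mirror the combinations listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: Imm9u is a zero-extended 9-bit field (0..511). */
bool fits_imm9u(int64_t v) { return v >= 0 && v <= 511; }

/* Assumption: Imm9n is a one-extended 9-bit field (-512..-1). */
bool fits_imm9n(int64_t v) { return v >= -512 && v <= -1; }

/* Hypothetical opcode classes, following the combinations listed above. */
enum alu_op { OP_ADD, OP_AND };

/* True if the constant can be encoded directly in a 32-bit instruction. */
bool imm_encodable(enum alu_op op, int64_t v)
{
    if (fits_imm9u(v))
        return true;            /* both ADD and AND have Imm9u forms */
    if (op == OP_ADD)
        return fits_imm9n(v);   /* only the ADD family has Imm9n */
    return false;               /* AND: needs a register or a jumbo prefix */
}
```

Under these assumptions, imm_encodable(OP_AND, ~7) is false while imm_encodable(OP_ADD, -8) is true, which matches the asymmetry described above.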

The Disp9u/Disp10s encoding exists on all basic Load/Store ops, however
"special" ops (like XMOV.x) only have Disp5u/Disp6s encodings (not a
huge loss though).

With a Jumbo-Imm prefix, many of the Disp/Imm cases expand to 33 bits
(except Disp5 which only goes to 29 bits).

>
>> Where, say:
>> Thread: Logical thread of execution within some existing process;
>          has a register file and a stack.
>> Process: Distinct collection of 1 or more threads within a shared
>          has a memory map, a heap, and a vector of threads.
>> address space and shared process identity (may have its own address
>> space, though as-of-yet, TestKern uses a shared global address space);
>> Task: Supergroup that includes Threads, Processes, and other
>> thread-like entities (such as call and method handlers), may be either
>> thread-like or process-like.
>
>> Where, say, the Syscall interrupt handler doesn't generally handle
>> syscalls itself (since the ISRs will only have access to
>> physically-mapped addresses), but effectively instead initiates a
>> context switch to the task that can handle the request (or, to context
>> switch back to the task that made the request, or to yield to another
>> task, ...).
>
> We call these things:: dispatchers.
>

Yeah.

As-is, I have several major interrupt handlers:

Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other
things.

IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).

TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).

SYSCALL: Mostly initiates task switches and similar, and little else.

Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The closest
option is mostly to use the interrupt handlers to trigger context
switches (which is, ironically, slightly less of an issue, as many of
the "hard" parts of a context switch are already performed for sake of
dealing with the "rather minimalist" interrupt mechanism).

Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.

They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked registers
(apart from SP/SSP, which switch places when entering/leaving an ISR).

Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").

So, the interrupt handling is basically a computed branch; which was
basically about the cheapest mechanism I could come up with at the time.

Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are a
few CR's which capture the CPU state at the time the ISR happens (these
registers getting overwritten every time a new interrupt occurs).

So, say:
   Interrupt entry:
     Copy low bits of SR into high bits of EXSR;
     Copy PC into SPC;
     Copy fault address into TEA;
     Swap SP and SSP (*1);
     Set CPU flags to Supervisor+ISR mode;
       CPU Mode bits now copied from high bits of VBR.
     Computed branch relative to VBR.
       Offset depends on interrupt category.
   Interrupt return (RTE):
     Copy EXSR bits back into SR;
     Unswap SP/SSP (*1);
     Branch to SPC.
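
The entry and return steps can be modelled in a few lines of C. This is only a sketch: the struct, the SR flag bit positions, the 32-bit split of EXSR, and the 8-byte vector slot size are all assumptions made for illustration, and the step that copies CPU mode bits from the high bits of VBR is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the control registers named in the post. */
struct cpu {
    uint64_t pc, spc;   /* program counter and its saved copy */
    uint64_t sr, exsr;  /* status register and exception status */
    uint64_t sp, ssp;   /* active and shadow stack pointers */
    uint64_t tea;       /* faulting address */
    uint64_t vbr;       /* vector base (assumed aligned) */
};

/* Assumed bit positions; the real SR layout is not given in the post. */
#define SR_SUPERVISOR (1ull << 30)
#define SR_ISR        (1ull << 29)

static void swap_sp(struct cpu *c)
{
    uint64_t t = c->sp;
    c->sp = c->ssp;
    c->ssp = t;
}

/* Interrupt entry: save state into control registers, then branch. */
void isr_enter(struct cpu *c, unsigned category, uint64_t fault_addr)
{
    c->exsr = (c->exsr & 0xFFFFFFFFull) | (c->sr << 32); /* SR low -> EXSR high */
    c->spc  = c->pc;
    c->tea  = fault_addr;
    swap_sp(c);
    c->sr  |= SR_SUPERVISOR | SR_ISR;
    c->pc   = c->vbr | ((uint64_t)category << 3);        /* assumed 8-byte slots */
}

/* Interrupt return (RTE): undo the entry steps. */
void isr_return(struct cpu *c)
{
    c->sr = (c->sr & ~0xFFFFFFFFull) | (c->exsr >> 32);
    swap_sp(c);
    c->pc = c->spc;
}
```

Because SPC, EXSR and TEA are single registers, a second isr_enter before isr_return would overwrite the saved state, which is the nesting limitation described earlier.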

*1: At the time, couldn't figure a good way to shave more logic off the
mechanism. Though the most obvious candidate now would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).

So, instead, the ISR entry point would do something like:
   MOV    SP, SSP
   MOV    0xDE00, SP     //Designated ISR stack SRAM
   MOV.Q  R0, (SP, 0)
   MOV.Q  R1, (SP, 8)
   ... Now save off everything else ...

But, didn't really think of it at the time.

There is already the trick of requiring VBR to be aligned (currently 64B
in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.
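
To make the bit-slicing point concrete: if the base has its low bits clear and the offset fits entirely in those bits, OR gives the same result as ADD, so no adder is needed. A small C sketch, assuming 64-byte alignment and hypothetical 8-byte slots per interrupt category:

```c
#include <assert.h>
#include <stdint.h>

#define VBR_ALIGN 64u  /* alignment "in practice", per the post */

/* Assumed: each interrupt category gets an 8-byte slot. */
uint64_t vector_addr(uint64_t vbr, unsigned category)
{
    uint64_t offset = (uint64_t)category << 3;
    assert((vbr & (VBR_ALIGN - 1)) == 0);  /* low bits of VBR are zero */
    assert(offset < VBR_ALIGN);            /* offset fits in those bits */
    return vbr | offset;                   /* same result as vbr + offset */
}
```

In hardware this amounts to concatenating the high bits of VBR with the category bits.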

Not sure if many CPUs have a cheaper mechanism here...

Note that in my case, generally the interrupt handlers are written in C,
with the compiler managing all the ISR prolog/epilog stuff (mostly
saving/restoring pretty much the entire CPU state to the ISR stack).

Generally, the ISR's also need to deal with having a comparatively small
stack (with 0.75K already used for the saved CPU state).

Where:
0000..7FFF: Boot ROM
8000..BFFF: (Optional) Extended Boot ROM
C000..DFFF: Boot/ISR SRAM
E000..FFFF: (Optional) Extended SRAM

Generally, much of the work of the context switch is pulled off using
"memcpy" calls (with the compiler providing a special "__arch_regsave"
variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
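
A minimal sketch of what such a memcpy-based switch might look like in C. The task struct and the isr_regsave array are hypothetical stand-ins; in the real system the save area lives on the ISR stack and its address comes from the compiler-provided "__arch_regsave". The 96-word size is chosen only to match the 0.75K figure mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REGSAVE_WORDS 96  /* 96 x 8 bytes = 0.75K, matching the post */

/* Hypothetical task record; TestKern's real layout is not shown. */
struct task {
    uint64_t regs[REGSAVE_WORDS];
};

/* Stand-in for the area the compiler-generated ISR prolog fills. */
uint64_t isr_regsave[REGSAVE_WORDS];

/* Swap the interrupted task's state out and the next task's state in.
   The ISR epilog would then restore registers from isr_regsave. */
void context_switch(struct task *prev, const struct task *next)
{
    memcpy(prev->regs, isr_regsave, sizeof isr_regsave);
    memcpy(isr_regsave, next->regs, sizeof isr_regsave);
}
```

The two copies are the cost that the TBR-relative scheme discussed next would avoid.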

Though, I guess one other possibility would be if the compiler-generated
ISR code assumed TBR to always be valid (and then copied the registers
to a fixed location relative to TBR instead of the ISR stack), which
could in-theory allow for faster context switching (by eliminating the
need for the memcpy calls), but would be a bit more brittle (if TBR is
invalid, stuff is going to break pretty hard as soon as an interrupt
happens).

Would likely need special compiler attributes for this (would not make
sense for interrupts which do not, or are unlikely to, perform a context
switch).


Re: Concertina II Progress

https://www.novabbs.com/devel/article-flat.php?id=35067&group=comp.arch#35067

From: robfi...@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Tue, 21 Nov 2023 22:47:26 -0500
Message-ID: <ujjtkf$16q70$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
In-Reply-To: <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
 by: Robert Finch - Wed, 22 Nov 2023 03:47 UTC

On 2023-11-21 5:12 p.m., MitchAlsup wrote:
> BGB wrote:
>
>> On 11/20/2023 11:31 AM, Stephen Fuld wrote:
>>> On 11/15/2023 1:10 PM, MitchAlsup wrote:
>
>
>> For some ops, the 3rd register (Ro) would instead operate as a 5-bit
>> immediate/displacement field. Which was initially a similar idea, with
>> the 32-bit space mirroring the 16-bit space.
>
> Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit
> register specifier into a 5-bit immediate of either positive or negative
> integer value. This makes::
>
>     1<<n
>    ~0<<n
>     container.bitfield = 7;
>
> single instructions.
>
Q+ CPU allows immediates of any length to be used in place of source
operand register values via postfix instructions. Virtually all
instructions may use immediates instead of registers. There are also
quick immediate form instructions that have the second source operand as
an immediate constant encoded directly in the instruction as this is the
most common use.

The postfix immediate instructions come in four lengths: 23-bit, 39-bit,
71-bit and 135-bit. Currently float values make use of only 32 or 64 bits
out of the 39 and 71-bit formats. I have been pondering having the float
immediates left aligned with additional trailing bits. These bits are
zero for now.

Postfixes are treated as part of the current instruction by the CPU.
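
For illustration, an assembler could pick the shortest postfix that still holds a constant. The C sketch below assumes the quoted lengths are immediate payload widths and that integer immediates are sign-extended; neither is stated above, and the function names are hypothetical. A 64-bit constant never needs the 135-bit form, so it is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* True if v is representable as a sign-extended field of `bits` bits
   (valid for bits < 64). */
static int fits_signed(int64_t v, unsigned bits)
{
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Smallest postfix length (in bits) that holds a 64-bit constant. */
unsigned postfix_bits(int64_t v)
{
    if (fits_signed(v, 23)) return 23;
    if (fits_signed(v, 39)) return 39;
    return 71;
}
```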

>
>> Where, say:
>> Thread: Logical thread of execution within some existing process;
>          has a register file and a stack.
>> Process: Distinct collection of 1 or more threads within a shared
>          has a memory map, a heap, and a vector of threads.
>> address space and shared process identity (may have its own address
>> space, though as-of-yet, TestKern uses a shared global address space);
>> Task: Supergroup that includes Threads, Processes, and other
>> thread-like entities (such as call and method handlers), may be either
>> thread-like or process-like.
>
>> Where, say, the Syscall interrupt handler doesn't generally handle
>> syscalls itself (since the ISRs will only have access to
>> physically-mapped addresses), but effectively instead initiates a
>> context switch to the task that can handle the request (or, to context
>> switch back to the task that made the request, or to yield to another
>> task, ...).
>
> We call these things:: dispatchers.
>
>> Though, will need to probably add more special case handling such that
>> the Syscall task can not yield or try to itself make a syscall (the
>> only valid exit point for this task being where it transfers control
>> back to the caller and awaits the next syscall to arrive; and it is
>> not valid for this task to try to syscall back into itself).
>
> In My 66000, every <effective> SysCall goes deeper into the privilege
> hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
> Guest HV SysCalls real HV. No data structures need maintenance during
> these transitions of the hierarchy.

Does it work the same way for hardware interrupts? I think RISC-V goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.

Re: Concertina II Progress

https://www.novabbs.com/devel/article-flat.php?id=35069&group=comp.arch#35069

Newsgroups: comp.arch
Date: Wed, 22 Nov 2023 18:38:00 +0000
Subject: Re: Concertina II Progress
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me>
Message-ID: <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
 by: MitchAlsup - Wed, 22 Nov 2023 18:38 UTC

BGB wrote:

> On 11/21/2023 4:12 PM, MitchAlsup wrote:
>> BGB wrote:
>
>>> Where, say, the Syscall interrupt handler doesn't generally handle
>>> syscalls itself (since the ISRs will only have access to
>>> physically-mapped addresses), but effectively instead initiates a
>>> context switch to the task that can handle the request (or, to context
>>> switch back to the task that made the request, or to yield to another
>>> task, ...).
>>
>> We call these things:: dispatchers.
>>

> Yeah.

> As-is, I have several major interrupt handlers:

> Fault: Something has gone wrong, current handling is to stall the CPU
> until reset (and/or terminate the emulator). Could in principle do other
> things.

I call these checks:: a page fault is an unanticipated SysCall to the
Guest OS page fault handler; whereas a check is something that should
never happen but did (ECC repair fail): These trap to Real HV.

> IRQ: Deals with timer, may potentially be used for preemptive task
> scheduling (code is in place, but this is not currently enabled). Does
> not currently perform any other "complex" actions (and the "practical"
> use of IRQ's remains limited in my case, due in large part to the
> limitations of interrupt handling).

Every My 66000 process has its own event table which combines exceptions,
interrupts, SysCalls, ... This means there is no table surgery when switching
between Guest OS and Guest Hypervisor and Real Hypervisor.

> TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
> action if a "page fault" style event occurs (or something needs to be
> paged in/paged out from the swapfile).

HW table walking.

> SYSCALL: Mostly initiates task switches and similar, and little else.

Part of Event table.

> Unlike x86, the design of the interrupt mechanisms means it isn't
> practical to hang the whole OS off of an interrupt handler. The closest
> option is mostly to use the interrupt handlers to trigger context
> switches (which is, ironically, slightly less of an issue, as many of
> the "hard" parts of a context switch are already performed for sake of
> dealing with the "rather minimalist" interrupt mechanism).

My 66000 can perform a context switch (user->user) in a single instruction.
Old state goes to memory, new state comes from memory; by the time
state has arrived, you are fetching instructions in the new context
under the new context MMU tables and privileges and priorities.

> Basically, in this design, it isn't possible to enter a new interrupt
> without first returning from the prior interrupt (at least not without
> f*ing the CPU state). And, as-is, interrupts can only operate in
> physically addressed mode.

> They also need to manually save and restore all the registers, since
> unlike either SuperH or RISC-V, BJX2 does not have any banked registers
> (apart from SP/SSP, which switch places when entering/leaving an ISR).

> Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
> 8086 real-mode, it doesn't implicitly push anything to the stack (nor
> have an "interrupt vector table").

> So, the interrupt handling is basically a computed branch; which was
> basically about the cheapest mechanism I could come up with at the time.

> Did create a little bit of a puzzle initially as to how to get the CPU
> state saved off and restored with no free registers. Though, there are a
> few CR's which capture the CPU state at the time the ISR happens (these
> registers getting overwritten every time a new interrupt occurs).

Why not just treat the RF as a cache with a known address in physical memory?
In My 66000 that is what I do and then just push and pull 4 cache lines at a
time.

> So, say:
>    Interrupt entry:
>      Copy low bits of SR into high bits of EXSR;
>      Copy PC into SPC;
>      Copy fault address into TEA;
>      Swap SP and SSP (*1);
>      Set CPU flags to Supervisor+ISR mode;
>        CPU Mode bits now copied from high bits of VBR.
>      Computed branch relative to VBR.
>        Offset depends on interrupt category.
>    Interrupt return (RTE):
>      Copy EXSR bits back into SR;
>      Unswap SP/SSP (*1);
>      Branch to SPC.

Interrupt Entry Point::
// by this point all the old registers have been saved where they
// are supposed to go, and the interrupt dispatcher registers are
// already loaded up and ready to go, and the CPU is running at
// whatever privilege level was specified.
     HR   R1<-WHY
     LD   IP,[IP,R1<<3,InterruptVectorTable] // Call through table
     RTI
//
InterruptHandler0::
// do what is necessary
// note this can all be written in C
     RET
InterruptHandler1::
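
Since the handlers can be written in C, the call through the table has a direct C analogue: index an array of function pointers by the cause code. This sketch uses hypothetical handler names and counters, and the bounds check is an addition not present in the assembly above.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*isr_fn)(void);

int handled[2];  /* hypothetical counters, for illustration only */

static void handler0(void) { handled[0]++; }
static void handler1(void) { handled[1]++; }

/* Stand-in for InterruptVectorTable. */
static const isr_fn interrupt_vector_table[] = { handler0, handler1 };

/* Analogue of "LD IP,[IP,R1<<3,InterruptVectorTable]": index the table
   by the cause code and call through it. Returns 0 on success. */
int dispatch(size_t why)
{
    size_t n = sizeof interrupt_vector_table / sizeof interrupt_vector_table[0];
    if (why >= n)
        return -1;   /* out-of-range cause: not covered by the post */
    interrupt_vector_table[why]();
    return 0;
}
```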

> *1: At the time, couldn't figure a good way to shave more logic off the
> mechanism. Though the most obvious candidate now would be to
> eliminate the implicit SP/SSP swapping (this part is currently handled
> in the instruction decoder).

> So, instead, the ISR entry point would do something like:
>    MOV    SP, SSP
>    MOV    0xDE00, SP     //Designated ISR stack SRAM
>    MOV.Q  R0, (SP, 0)
>    MOV.Q  R1, (SP, 8)
>    ... Now save off everything else ...

> But, didn't really think of it at the time.

> There is already the trick of requiring VBR to be aligned (currently 64B
> in practice; formally 256B), mostly so as to allow the "address
> computation" to be done via bit-slicing.

> Not sure if many CPUs have a cheaper mechanism here...

Treat the CPU state and the register state as cache lines and have
HW shuffle them in and out. You can even start the 5 cache line reads
before you start the CPU state writes, saving latency (which you cannot
do using SW-only methods).
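
One reading consistent with the "4 cache lines" and "5 cache line reads" figures is the arithmetic below. The register count, register width, line size, and the idea that the remaining CPU state occupies one line are all assumptions here, not statements from the post.

```c
#include <assert.h>

enum {
    NUM_REGS    = 32,  /* assumed register count */
    REG_BYTES   = 8,   /* assumed 64-bit registers */
    LINE_BYTES  = 64,  /* assumed cache line size */
    STATE_LINES = 1    /* assumed: remaining CPU state fits one line */
};

/* Lines needed to hold the register file. */
int regfile_lines(void)
{
    return (NUM_REGS * REG_BYTES + LINE_BYTES - 1) / LINE_BYTES;
}

/* Total lines moved per context: register file plus CPU state. */
int context_lines(void)
{
    return regfile_lines() + STATE_LINES;
}
```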

> Note that in my case, generally the interrupt handlers are written in C,
> with the compiler managing all the ISR prolog/epilog stuff (mostly
> saving/restoring pretty much the entire CPU state to the ISR stack).

My 66000 compiler remains blissfully ignorant of ISR prologue and
epilogue and it still works.

> Generally, the ISR's also need to deal with having a comparatively small
> stack (with 0.75K already used for the saved CPU state).

> Where:
> 0000..7FFF: Boot ROM
> 8000..BFFF: (Optional) Extended Boot ROM
> C000..DFFF: Boot/ISR SRAM
> E000..FFFF: (Optional) Extended SRAM

> Generally, much of the work of the context switch is pulled off using
> "memcpy" calls (with the compiler providing a special "__arch_regsave"
> variable giving the address of the location it has dumped the CPU
> registers into; which in turn covers most of the core state that needs
> to be saved/restored for a process context switch).

Why not just make the HW push and pull cache lines?

Re: Concertina II Progress

https://www.novabbs.com/devel/article-flat.php?id=35070&group=comp.arch#35070

Newsgroups: comp.arch
Date: Wed, 22 Nov 2023 19:36:28 +0000
Subject: Re: Concertina II Progress
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjtkf$16q70$1@dont-email.me>
Message-ID: <ca3ca531f631b50e1e7a9d5e11eac2de@news.novabbs.com>
 by: MitchAlsup - Wed, 22 Nov 2023 19:36 UTC

Robert Finch wrote:

> On 2023-11-21 5:12 p.m., MitchAlsup wrote:
>
>> In My 66000, every <effective> SysCall goes deeper into the privilege
>> hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
>> Guest HV SysCalls real HV. No data structures need maintenance during
>> these transitions of the hierarchy.

> Does it follow the same way for hardware interrupts? I think RISCV goes
> to the deepest level first, machine level, then redirects to lower
> levels as needed. I was planning on Q+ operating the same way.

It depends. There is the school of thought that just delivers control to
someone who can always deal with it (Machine level in RISC-V), and there
is the other school of thought that some table should encode which level
of the system control is delivered to. The former allows SW to control
every step of the process; the latter gets rid of all the SW checking
and simplifies the process of getting to and back from interrupt handlers
(and their associated soft IRQs).
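
A minimal sketch of the two delivery models contrasted above. The level names, the routing table entries and both function names are illustrative assumptions, not part of My 66000 or RISC-V; the point is only that the first model adds a software hop that the table-driven model avoids.

```python
# Illustrative sketch only: level names and table contents are assumptions.
LEVELS = ["application", "guest_os", "guest_hv", "real_hv"]  # least to most privileged
MOST_PRIVILEGED = LEVELS[-1]


def deliver_deepest_first(irq, software_redirect):
    """School 1: hardware always enters the most privileged level,
    and software there decides whether to forward the interrupt."""
    path = [MOST_PRIVILEGED]
    target = software_redirect(irq)
    if target != MOST_PRIVILEGED:
        path.append(target)  # extra software hop to the real handler
    return path


def deliver_by_table(irq, routing_table):
    """School 2: a table consulted by hardware names the receiving level."""
    return [routing_table[irq]]


ROUTING = {"timer": "guest_os", "ecc_fail": "real_hv"}  # hypothetical entries

print(deliver_deepest_first("timer", ROUTING.get))
print(deliver_by_table("timer", ROUTING))
```

In the first model the list of levels visited has two entries for the timer; in the second it has one.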

Re: Concertina II Progress

<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=35072&group=comp.arch#35072
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Wed, 22 Nov 2023 17:17:30 -0500
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de>
<uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me>
<uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me>
<4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me>
<uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="0316dd13f68d229b35f02c839c59ec2f";
logging-data="1597559"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18vzLIb4E7iUf/F1eOte4geSNoGykopo+c="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:G+Lr2ZENg4csDM05PvJweABBwNs=
sha1:PSpWYdJR+C7lquziCr0FKtna3Jw=
 by: Stefan Monnier - Wed, 22 Nov 2023 22:17 UTC

> Why not just treat the RF as a cache with a known address in physical memory.
> In MY 66000 that is what I do and then just push and pull 4 cache lines at a

Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
it's not also a pun on the TI 9900.

Stefan

Re: Concertina II Progress

<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35075&group=comp.arch#35075
Date: Wed, 22 Nov 2023 23:58:19 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$TvRLHYqRe7WJd3PL.u40oeAg.5vyOm51lAlCnpf9h96oIIRDparsG
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
Organization: novaBBS
Message-ID: <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
 by: MitchAlsup - Wed, 22 Nov 2023 23:58 UTC

Stefan Monnier wrote:

>> Why not just treat the RF as a cache with a known address in physical memory.
>> In MY 66000 that is what I do and then just push and pull 4 cache lines at a

> Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
> it's not also a pun on the TI 9900.

In reverence to the CDC 6600, not derived from it.

Exchange Jump on the CDC 6600 caused a context switch that took 16+10 processor cycles
(after the scoreboard cleared). And on the 6600, NOS was in the PPs and the CPUs
were there just to crunch numbers.

I have a hard real time version of My 66000 where the lower levels of the OS are
in HW, and if you have fewer than 1024 threads running, you do not expend any
(zero, 0, nada, zilch) cycles in the OS performing context switches or priority
alterations. This system has the property that if an interrupt (or message)
arrives to unblock a waiting thread that is of higher priority than any CPU in
its affinity group of CPUs, then the lowest priority CPU in that group receives the
higher priority thread (without an excursion through the OS, which would damage
cache state).
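
The placement rule in the paragraph above can be sketched as follows. This is a software model of a decision the post says is made in hardware; the function name and the convention that a larger number means higher priority are assumptions made for the example.

```python
def place_unblocked_thread(thread_priority, cpu_priorities):
    """Pick the CPU in an affinity group that should take a newly
    unblocked thread, or None if no CPU should be preempted.

    cpu_priorities[i] is the priority of whatever CPU i is running now.
    Larger numbers mean higher priority (an assumption of this sketch).
    """
    if not cpu_priorities:
        return None
    # Find the CPU currently running the lowest priority work.
    lowest_cpu = min(range(len(cpu_priorities)), key=cpu_priorities.__getitem__)
    if thread_priority > cpu_priorities[lowest_cpu]:
        return lowest_cpu  # this CPU receives the higher priority thread
    return None            # thread stays queued; nobody is preempted


print(place_unblocked_thread(5, [7, 3, 9]))
print(place_unblocked_thread(2, [7, 3, 9]))
```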

I have a Linux friendly version where context switch is a single instruction.
When you write a context pointer, that entire context is now available to support
whatever you want it to support. So, an unprivileged application can context
switch to another unprivileged application by writing a single control register,
leaving Guest OS, Guest HV and Real HV in their original configuration. Guest
OS can context switch to a different Guest OS in a single instruction, and then
the Guest OS receiving control needs to context switch to an application it wants
to run--so 20-ish cycles to perform a Guest OS switch. (This now costs typical
old architectures 10,000 cycles.)

But nowhere does any thread receiving control have to execute any state or
register saving or restoring... Just like Exchange Jump.

> Stefan

Re: Concertina II Progress

<ujmi69$1n5ko$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35076&group=comp.arch#35076
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Wed, 22 Nov 2023 21:50:30 -0600
Organization: A noiseless patient Spider
Lines: 463
Message-ID: <ujmi69$1n5ko$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 23 Nov 2023 03:50:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0810ecf233304c32017e594c3f764a1d";
logging-data="1808024"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18SNCs3b2NTXmk7ZMcDfPAj"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:lamGIdDbeDsrRCyf+Ge2XyzNFyY=
Content-Language: en-US
In-Reply-To: <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
 by: BGB - Thu, 23 Nov 2023 03:50 UTC

On 11/22/2023 12:38 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/21/2023 4:12 PM, MitchAlsup wrote:
>>> BGB wrote:
>>
>>>> Where, say, the Syscall interrupt handler doesn't generally handle
>>>> syscalls itself (since the ISRs will only have access to
>>>> physically-mapped addresses), but effectively instead initiates a
>>>> context switch to the task that can handle the request (or, to
>>>> context switch back to the task that made the request, or to yield
>>>> to another task, ...).
>>>
>>> We call these things:: dispatchers.
>>>
>
>> Yeah.
>
>> As-is, I have several major interrupt handlers:
>
>> Fault: Something has gone wrong, current handling is to stall the CPU
>> until reset (and/or terminate the emulator). Could in premise do other
>> things.
>
> I call these checks:: a page fault is an unanticipated SysCall to the
> Guest OS page fault handler; whereas a check is something that should
> never happen but did (ECC repair fail): These trap to Real HV.
>

A lot of things here are things that could be handled, but are not
currently handled:
Invalid instructions;
Access to invalid memory regions;
Access to memory in a way which violates access protections;
A branch to an invalid address;
Code used the BREAK instruction or similar;
Etc.

Generally at present, if any of these happens, it means that something
has gone badly enough that I want to stall immediately and probably
debug it.

In a "real" OS, if this happens in userland, one would typically turn
this into "SEGFAULT" or similar.

For the emulator, if a BREAK occurs in ISR mode (or any other fault
happens in ISR mode), it causes the emulator to stop execution, dump a
backtrace and registers, and then terminate. Otherwise, exiting the
emulator normally will dump a bunch of profiling information (this part
is not done if the emulator terminates due to a fault).

Stalling the core in the Verilog core causes it to dump the state of the
pipeline and some other things via "$display" (potentially relevant for
debugging). Or, allows seeing the crash PC on the 7-segment display on
the Nexys A7.

>> IRQ: Deals with timer, may potentially be used for preemptive task
>> scheduling (code is in place, but this is not currently enabled). Does
>> not currently perform any other "complex" actions (and the "practical"
>> use of IRQ's remains limited in my case, due in large part to the
>> limitations of interrupt handling).
>
> Every My 66000 process has its own event table which combines exceptions,
> interrupts, SysCalls,... This means there is no table surgery when switching
> between Guest OS and Guest Hypervisor and Real Hypervisor.
>

In my case, the VBR register is global (and set up during boot).

Any per-process event dispatching would need to be handled in software.

I didn't go with an x86-style IDT or similar partly because this would
have been significantly more expensive (in terms of Verilog code and
LUTs) than the existing mechanism. The role of an x86-style IDT could be
faked in software though.

So, VBR is sort of like:
(63:48): Encodes CPU state to use on ISR entry;
(47: 6): Encodes the ISR entry point.
In practice only (28:6) are "actually usable".
( 5: 0): Must be Zero

Where, low-order bits are replaced with an entry offset:
00: RESET
08: FAULT
10: IRQ
18: TLBMISS
20: SYSCALL
28: Reserved

The 8 bytes of space per entry is enough to encode a relative or absolute
branch to the actual entry point (while not being so big as to be
needlessly wasteful).
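
The entry-point computation implied by the VBR layout above can be sketched as below. The field boundaries and offsets are taken from the description; the function names, and reading the offsets as hexadecimal (which the 8-byte spacing suggests), are assumptions of the example.

```python
# Entry offsets from the list above, read as hexadecimal (8 bytes apart).
VECTOR_OFFSETS = {
    "RESET": 0x00,
    "FAULT": 0x08,
    "IRQ": 0x10,
    "TLBMISS": 0x18,
    "SYSCALL": 0x20,
}

LOW_BITS_MASK = 0x3F                                 # bits 5:0, must be zero
ENTRY_MASK = ((1 << 48) - 1) & ~LOW_BITS_MASK        # bits 47:6


def isr_entry_address(vbr, event):
    """Replace the low-order bits of VBR with the offset for this event."""
    if vbr & LOW_BITS_MASK:
        raise ValueError("VBR bits 5:0 must be zero")
    return (vbr & ENTRY_MASK) | VECTOR_OFFSETS[event]


def isr_entry_state(vbr):
    """Bits 63:48: CPU state to use on ISR entry."""
    return (vbr >> 48) & 0xFFFF


vbr = 0x1234_0000_0001_0000  # hypothetical value
print(hex(isr_entry_address(vbr, "SYSCALL")), hex(isr_entry_state(vbr)))
```

With VBR cleared to 0 at reset, the RESET entry works out to address 0, which matches the reset behaviour described in the post.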

During CPU reset, VBR is cleared to 0, and then control is transferred
to 0, which branches to the ROM's entry point.

The use of a computed branch was preferable to a "vector table" as the
vector table would have required some mechanism for the CPU to perform a
memory load to get the address. Computed branch was easier, since no
special memory load is needed, just branch there, and assume this lands
on a branch instruction which takes control where it needs to go.

>> TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
>> action if a "page fault" style event occurs (or something needs to be
>> paged in/paged out from the swapfile).
>
> HW table walking.
>

Yeah, no page-table hardware in my case.

Had on/off considered an "Inverted Page-Table" like in IA-64, but this
still seemed to be annoyingly expensive vs the "Throw a TLB-Miss
Exception" route. Even if I eliminated the TLB-Miss logic, would then
need to have Page-Fault logic, which doesn't really save anything there
either.

There is a designated register though for the page-table: TTB.

With the considered inverted-page-table using a separate VIPT register,
the idea being that VIPT would point to a region of, say, 4096x4x128b
TLBE's (~256K), effectively functioning as a RAM-backed L3 TLB. If this
table lacked the requested TLBE, this would still result in a TLB Miss
fault.
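
As a quick check of the "~256K" figure, the size follows directly from the stated geometry (4096 sets, 4 ways, 128-bit entries). The variable names are just labels for this arithmetic.

```python
SETS = 4096        # entries per way
WAYS = 4           # associativity
TLBE_BITS = 128    # size of one TLB entry

table_bytes = SETS * WAYS * TLBE_BITS // 8
print(table_bytes, "bytes =", table_bytes // 1024, "KiB")
```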

Note that the idea was still that trying to use 96-bit virtual address
mode would require two TLBE's, effectively halving associativity. This
in turn requires plain modulo-addressing as hashing can create a "bad
situation" where a 2-way TLB will get stuck in an infinite loop (but
this infinite loop scenario is narrowly averted with modulo addressing).

Granted, 4-way is still better as it seems to result in a comparably
lower TLB miss rate.

It is still possible though to XOR the TLBE's index with a bit-pattern
derived from the ASID, to slightly reduce the cost of context switches
in some cases (if multiple address spaces were being used).
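
A rough sketch of modulo indexing with an ASID-derived XOR pattern, as described above. The number of sets (256) and the choice to use the low ASID bits directly as the pattern are assumptions; the post does not say how the pattern is derived.

```python
TLB_SETS = 256  # hypothetical size, assumed to be a power of two


def tlb_index(vpn, asid=0):
    """Set index for a virtual page number.

    Plain modulo addressing on the page number, optionally XORed with a
    pattern derived from the ASID so that different address spaces tend
    to land in different sets.
    """
    base = vpn % TLB_SETS       # plain modulo addressing
    pattern = asid % TLB_SETS   # assumed derivation: low ASID bits
    return base ^ pattern


print(hex(tlb_index(0x12345)), hex(tlb_index(0x12345, asid=0xFF)))
```

Because XOR with a fixed pattern is a permutation of the set indices, two pages that map to different sets without the ASID still map to different sets with it.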

Note that the L1 I$ and D$ can get along reasonably well with an
optional 32-entry 1-way "Micro-TLB".

>> SYSCALL: Mostly initiates task switches and similar, and little else.
>
> Part of Event table.
>

All software in my case.

>> Unlike x86, the design of the interrupt mechanisms means it isn't
>> practical to hang the whole OS off of an interrupt handler. The
>> closest option is mostly to use the interrupt handlers to trigger
>> context switches (which is, ironically, slightly less of an issue, as
>> many of the "hard" parts of a context switch are already performed for
>> sake of dealing with the "rather minimalist" interrupt mechanism).
>
> My 66000 can perform a context switch (user->user) in a single instruction.
> Old state goes to memory, new state comes from memory; by the time
> state has arrived, you are fetching instructions in the new context
> under the new context MMU tables and privileges and priorities.
>

Yeah, but that is not exactly minimalist in terms of the hardware.

Granted, burning around 1 kilocycle of overhead per syscall isn't ideal
either...

Eg:
Save registers to ISR stack;
Copy registers to User context;
Copy handler-task registers to ISR stack;
Reload registers from ISR stack;
Handle the syscall;
Save registers to ISR stack;
Copy registers to Syscall context;
Copy User registers to ISR stack;
Reload registers from ISR stack.

Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).

And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).

If the "memcpy's" could be eliminated, this could roughly halve the cost
of doing a syscall.
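
A back-of-the-envelope check of the figures above. If the "around 1 kilocycle" and "20us" numbers describe the same syscall path, they imply a clock of about 50 MHz; that clock rate is an inference from the two numbers, not something stated in the post. The 0.5 fraction for the memcpy share comes from the "roughly halve" estimate.

```python
def implied_clock_mhz(cycles, microseconds):
    """Cycles per microsecond is the clock rate in MHz."""
    return cycles / microseconds


def syscall_cycles_without_memcpy(cycles, memcpy_fraction=0.5):
    """Estimated cost if the register-copy share of the work were removed."""
    return cycles * (1.0 - memcpy_fraction)


print(implied_clock_mhz(1000, 20))
print(syscall_cycles_without_memcpy(1000))
```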

One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).

Worth the cost? Dunno.

Not too much different to modern Windows, where slow syscalls are still
fairly common (and despite the slowness of the mechanism, it seems like
BJX2 syscalls still manage to be around an order of magnitude faster than
Windows syscalls in terms of clock-cycle cost...).

Well, and the seeming absurdity of WaitForSingleObject() on a mutex
generally taking upwards of 1 million clock-cycles IIRC in past
experiments (when the mutex isn't already locked; and, if it is
locked... yeah...).

You could lock a mutex... or you could render an entire frame in Doom,
then checksum the frame image, and use the checksum as a hash key. In a
roughly similar time-scale.

Luckily, at least, the CriticalSection objects were not absurdly slow...

>> Basically, in this design, it isn't possible to enter a new interrupt
>> without first returning from the prior interrupt (at least not without
>> f*ing the CPU state). And, as-is, interrupts can only operate in
>> physically addressed mode.
>
>> They also need to manually save and restore all the registers, since
>> unlike either SuperH or RISC-V, BJX2 does not have any banked
>> registers (apart from SP/SSP, which switch places when
>> entering/leaving an ISR).
>
>> Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
>> 8086 real-mode, it doesn't implicitly push anything to the stack (nor
>> have an "interrupt vector table").
>
>
>> So, the interrupt handling is basically a computed branch; which was
>> basically about the cheapest mechanism I could come up with at the time.
>
>> Did create a little bit of a puzzle initially as to how to get the CPU
>> state saved off and restored with no free registers. Though, there are
>> a few CR's which capture the CPU state at the time the ISR happens
>> (these registers getting overwritten every time a new interrupt occurs).
>
> Why not just treat the RF as a cache with a known address in physical
> memory? In MY 66000 that is what I do and then just push and pull 4 cache
> lines at a time.
>


Re: Concertina II Progress

<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35079&group=comp.arch#35079
Date: Thu, 23 Nov 2023 16:53:04 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$AQAqhQxzZCSE8.gbCkrNu.EL5O08v7rjuhzXbcG8IlBJuceluq4dO
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <ujmi69$1n5ko$1@dont-email.me>
Organization: novaBBS
Message-ID: <b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
 by: MitchAlsup - Thu, 23 Nov 2023 16:53 UTC

BGB wrote:

> On 11/22/2023 12:38 PM, MitchAlsup wrote:
>> BGB wrote:

> Yeah, but that is not exactly minimalist in terms of the hardware.

> Granted, burning around 1 kilocycle of overhead per syscall isn't ideal
> either...

> Eg:
> Save registers to ISR stack;
> Copy registers to User context;
> Copy handler-task registers to ISR stack;
> Reload registers from ISR stack;
> Handle the syscall;
> Save registers to ISR stack;
> Copy registers to Syscall context;
> Copy User registers to ISR stack;
> Reload registers from ISR stack.

> Does mean that one needs to be economical with syscalls (say, doing
> "printf" a whole line at a time, rather than individual characters, ...).

Not at all--I have reduced SysCalls to just a bit slower than an actual CALL,
say around 10 cycles. Use them as often as you like.

> And, did create incentive to allow getting the microsecond-clock value
> and hardware RNG values from CPUID rather than needing a syscall (say,
> don't want to burn 20us to check the microsecond counter, ...).

> If the "memcpy's" could be eliminated, this could roughly halve the cost
> of doing a syscall.

I have MM (memory move) as a 3-operand instruction.

> One other option would be to do like RISC-V's privileged spec and have
> multiple copies of the register file (and likely instructions for
> accessing these alternate register files).

There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines;
There is a 5th cache line that contains all the other PSW stuff.

> Worth the cost? Dunno.

In my opinion--Absolutely worth it.

> Not too much different to modern Windows, where slow syscalls are still
> fairly common (and despite the slowness of the mechanism, it seems like
> BJX2 syscalls still manage to be around an order of magnitude faster than
> Windows syscalls in terms of clock-cycle cost...).

Now, just get it down to a cache missing {L1, L2} instruction fetch.

>
>> Why not just treat the RF as a cache with a known address in physical
>> memory? In MY 66000 that is what I do and then just push and pull 4 cache
>> lines at a time.
>>

> Possible, but poses its own share of problems...

> Not sure how this could be implemented cost-effectively, or for that
> matter, more cheaply than a RISC-V style mode-banked register-file.

1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
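
The cache-line counts above follow from simple arithmetic. The sketch assumes 64-bit registers and 64-byte cache lines; neither size is stated in the post, but together they give the quoted 4 lines for one 32-entry file.

```python
REGS_PER_FILE = 32
REG_BYTES = 8      # assumed 64-bit registers
LINE_BYTES = 64    # assumed cache line size


def cache_lines_for(register_files):
    """Cache lines needed to hold the given number of register files."""
    return register_files * REGS_PER_FILE * REG_BYTES // LINE_BYTES


print(cache_lines_for(1))  # one file backed by memory
print(cache_lines_for(4))  # four mode-banked files
```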

> Though, it could make sense if a context switch had a mechanism
> to dump the whole register file to Block-RAM, along with
> some sort of mechanism to access this RAM via an MMIO interface.

Just put it in DRAM at SW controlled (via TLB) addresses.

> Pros/cons, seems like each possibility would also come with drawbacks:
> As-is: Slowness due to needing to save/reload everything;
> RISC-V: Expensive regfile, only works for limited cases;
> MMIO Backed + RV-like: Faster U<->S, but slower task switching.
> RAM Backed: Cache coherence becomes a critical feature.

> The RISC-V like approach makes sense if one assumes:
> There is a user process;
> There is a kernel running under it;
> We want to call from the user process into the kernel.

So if you are running under a Real OS, you don't need 2 sets of RFs in my
model.

> Doesn't make so much sense, say, for:
> User Process A calls a VTable entry which calls into User Process B;
> Service A uses a VTable to call into the VFS;
> ...

> Say, where one is making use of horizontal context switches for control
> flow between logical tasks. Which would still remain fairly expensive
> under a RISC-V like model.

Yes, but PTHREADing can be done without privilege and in a single instruction.

> One could have enough register banks for N logical tasks, but supporting
> 4 or 8 copies of the register file is going to cost more than 2 or 3.

> Above, I was describing what the hardware was doing.

> The software side is basically more like:
> Branch from VBR-table to ISR entry point;
> Get R0 and R1 saved onto the stack;

Where did you get the address of this stack ??

> Get some of the CRs saved off (we need R0 and R1 free here);
> Get the rest of the GPRs saved onto the stack;
> Call into the main part of the ISR handler (using normal C ABI);
> Restore most of the GPRs;
> Restore most of the CRs;
> Restore R0 and R1;
> Do an RTE.

If HW does register file save/restore the above looks like::

> The software side is basically more like:
> Branch from VBR-table to ISR entry point;
> Call into the main part of the ISR handler (using normal C ABI);
> Do an RTE.

See what it saves ??

Re: Concertina II Progress

<_yN7N.90514$AqO5.65665@fx11.iad>

https://www.novabbs.com/devel/article-flat.php?id=35082&group=comp.arch#35082
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx11.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: sco...@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjtkf$16q70$1@dont-email.me> <ca3ca531f631b50e1e7a9d5e11eac2de@news.novabbs.com>
Lines: 63
Message-ID: <_yN7N.90514$AqO5.65665@fx11.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 23 Nov 2023 19:17:14 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 23 Nov 2023 19:17:14 GMT
X-Received-Bytes: 4257
 by: Scott Lurndal - Thu, 23 Nov 2023 19:17 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Robert Finch wrote:
>
>> On 2023-11-21 5:12 p.m., MitchAlsup wrote:
>>
>>> In My 66000, every <effective> SysCall goes deeper into the privilege
>>> hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
>>> Guest HV SysCalls real HV. No data structures need maintenance during
>>> these transitions of the hierarchy.
>
>> Does it follow the same way for hardware interrupts? I think RISCV goes
>> to the deepest level first, machine level, then redirects to lower
>> levels as needed. I was planning on Q+ operating the same way.
>
>It depends. There is the school of thought that just delivers control to
>someone who can always deal with it (Machine level in RISC-V), and there
>is the other school of thought that some table should encode which level
>of the system control is delivered to. The former allows SW to control
>every step of the process; the latter gets rid of all the SW checking
>and simplifies the process of getting to and back from interrupt handlers
>(and their associated soft IRQs).

ARMv8 allows the interrupt and fast interrupt (IRQ, FIQ) signals to be
delivered to the EL1 (operating system) ring unless system registers at
higher (more privileged) exception levels trap the signal. EL3 (firmware)
level is the most privileged level and generally 'owns' the FIQ signal,
while the IRQ signal is owned by EL1 (bare metal OS) or EL2 (hypervisor).

The destination exception level of each signal is controlled by
bits in system registers (SCR_EL3 to direct them to EL3, HCR_EL2 to
direct them to EL2).
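
A heavily simplified model of that routing decision, for a physical interrupt arriving while the core executes at EL0 or EL1. The post names only the registers; the individual bit names and positions used here (SCR_EL3.IRQ/FIQ, HCR_EL2.IMO/FMO) are my recollection of the architecture and should be checked against the Arm Architecture Reference Manual. Masking, HCR_EL2.TGE, secure EL2 and virtual interrupts are all ignored.

```python
# Bit positions as recalled from the architecture manual (verify before use).
SCR_EL3_IRQ = 1 << 1   # route physical IRQ to EL3
SCR_EL3_FIQ = 1 << 2   # route physical FIQ to EL3
HCR_EL2_FMO = 1 << 3   # route physical FIQ to EL2
HCR_EL2_IMO = 1 << 4   # route physical IRQ to EL2


def target_exception_level(signal, scr_el3, hcr_el2):
    """Return the exception level that takes the interrupt (simplified)."""
    if signal == "IRQ":
        to_el3 = scr_el3 & SCR_EL3_IRQ
        to_el2 = hcr_el2 & HCR_EL2_IMO
    elif signal == "FIQ":
        to_el3 = scr_el3 & SCR_EL3_FIQ
        to_el2 = hcr_el2 & HCR_EL2_FMO
    else:
        raise ValueError("signal must be 'IRQ' or 'FIQ'")
    if to_el3:
        return "EL3"   # EL3 routing takes precedence
    if to_el2:
        return "EL2"
    return "EL1"       # default: delivered to the operating system


print(target_exception_level("IRQ", scr_el3=0, hcr_el2=HCR_EL2_IMO))
print(target_exception_level("FIQ", scr_el3=SCR_EL3_FIQ, hcr_el2=0))
```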

Interrupts can be assigned to one of two groups - group 0 which is
always delivered as an FIQ and group 1 which is delivered as an IRQ.

Group zero interrupts are considered "secure" interrupts and only
secure accesses can modify the configuration of such interrupts.

Group one interrupts can be either non-secure or secure depending on
the security state of the target exception level (secure or non-secure).

The higher priority half of the interrupt priority (8 bits) is considered
a secure range, the rest non-secure, thus secure interrupts will always have
higher priority than non-secure interrupts.

There is no software "checking" required.

Exception return (i.e. context switch) loads the PSR from SPSR_ELx and
the PC from ELR_ELx[*] and that's the entirety of the software visible state
handled by the hardware. Each exception level has its own page table
root registers (TTBR0_ELx, TTBR1_ELx for each half of the VA space), so
there is nothing for software to reload. Hardware manages the TLB entries
which are tagged with both security state and exception level.

[*] Both are system registers (flops, not ram)

[**] The secure flag (!SCR_EL3[NS]) acts like an 'invisible'
address bit at bit N (where N is the number of bits of supported
physical address). This provides two completely distinct N-bit
address spaces - one secure and one non-secure with SCR_EL3[NS]
controlling which space is used by accesses. NS only applies
to EL 0 - 2, EL3 is always considered secure. N is typically 48,
but can be up to 52 in the current versions of the architecture.
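
The "invisible address bit" idea can be sketched as follows. The function name is hypothetical; the default of 48 physical address bits is the typical value mentioned above.

```python
def tagged_physical_address(pa, ns, pa_bits=48):
    """Model the secure flag as an extra address bit at position pa_bits.

    ns is the SCR_EL3[NS] value for the access (1 = non-secure). The
    secure flag is the inverse of NS, so secure accesses set the extra
    bit and non-secure accesses leave it clear.
    """
    if not 0 <= pa < (1 << pa_bits):
        raise ValueError("physical address does not fit in pa_bits")
    secure_bit = 0 if ns else 1
    return (secure_bit << pa_bits) | pa


# The same 48-bit address names two different locations.
print(hex(tagged_physical_address(0x1000, ns=1)))
print(hex(tagged_physical_address(0x1000, ns=0)))
```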

Re: Concertina II Progress

<OSO7N.18767$_z86.2860@fx46.iad>

https://www.novabbs.com/devel/article-flat.php?id=35083&group=comp.arch#35083
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx46.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: sco...@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
Lines: 69
Message-ID: <OSO7N.18767$_z86.2860@fx46.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 23 Nov 2023 20:46:38 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 23 Nov 2023 20:46:38 GMT
X-Received-Bytes: 4498
 by: Scott Lurndal - Thu, 23 Nov 2023 20:46 UTC

mitchalsup@aol.com (MitchAlsup) writes:
>Stefan Monnier wrote:

>
>I have a Linux friendly version where context switch is a single instruction.

The Burroughs B3500 had a single such instruction, called
Branch Reinstate (BRE).

The task context (base register, limit register, accumulator, comparison
and overflow flags) was stored in a small region at absolute address 60,
and BRE would restore that state (and interrupts would save it).
Index registers were mapped to base-relative addresses 8, 16 and 24
(8 digits each).

The V-Series did a complete revamp of the processor architecture to
support larger memory sizes (both per task and systemwide) and
SMP. A segmentation scheme was adopted (for backward compatibility)
and seven additional base-limit pairs were added to support direct
access to 8 segments at any time (called an environment). There
could be up to 1,000,000 environments per task, each with up to
8 active memory areas (and 92 inactive memory areas accessible to
three special instructions for data movement and comparison).

The instruction was renamed Branch Reinstate Virtual (BRV) and would
read the task table entry and load all the relevant state, including
loading the active environment table into the processor base-limit
registers. BRV accessed a table in memory, indexed by task number,
that stored all the state of the task (200 digits worth).

At the same time, we added SMP support including an inter-cpu
communication instruction (my invention) similar to the
mechanism adopted a few years later when Intel added SMP
support for P5.

We also added hardware mutex and condition variable instructions;
the "LOK" instruction would atomically acquire the mutex, if
available, or interrupt to a microkernel scheduler if unavailable.
"UNLK" would interrupt if a higher priority task was waiting
for the lock. There were CAUS and WAIT instructions that
offered capabilities similar to posix condition variables.

Each defined lock had a canonical lock level (a 4 digit
number) and the hardware would fail a lock request where
the new lock's canonical lock number is less than that of the
current lock owned by the task (if any). Unlock enforced the
reverse. This prevented any A-B deadlock situations from
occurring, although with many locks in a large subsystem (e.g.
the MCP OS) it was tricky sometimes to assign lock numbers.
This also implicitly encouraged programmers to minimize
the critical section and avoid nested locking where possible.
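
The lock-level rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Task`, `LockOrderError` and `try_sequence` are hypothetical names, and reading "Unlock enforced the reverse" as "release the most recently acquired lock first" is an interpretation rather than something the post spells out.

```python
class LockOrderError(Exception):
    """Raised when a request violates the canonical lock ordering."""


class Task:
    def __init__(self):
        self.held = []  # levels of locks currently held, in acquisition order

    def lock(self, level):
        # Fail if the new level is lower than the level currently owned.
        if self.held and level < self.held[-1]:
            raise LockOrderError(f"level {level} < held level {self.held[-1]}")
        self.held.append(level)

    def unlock(self, level):
        # Assumed reading of "the reverse": release the most recent lock first.
        if not self.held or self.held[-1] != level:
            raise LockOrderError(f"level {level} is not the most recent lock")
        self.held.pop()


def try_sequence(levels):
    """Return True if a task may acquire the given levels in this order."""
    task = Task()
    try:
        for level in levels:
            task.lock(level)
    except LockOrderError:
        return False
    return True
```

With this rule, two tasks that both need locks at levels 100 and 200 can only take them in the order 100 then 200, so neither can hold one while waiting for the other in the opposite order.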

The microkernel only handled scheduling and interrupts; all
MCP code ran in the context of either the task making the
request, or in an 'independent runner' (a kernel thread)
dispatched from the microkernel. I/O interrupts were dispatched
to two different independent runners, one for normal interrupts
and one for real-time interrupts. Real-time interrupts were
used for document sorters (e.g. MICR reader/sorters processing
checks/cheques/utility bills, etc) in order to be able to
select the destination pocket for each document in the
time interval from the read station to the pocket-select
station (at 2500 documents per minute - 42 per second,
one document every 24 milliseconds). We supported ten
active sorters per host. We even had one host installed
on an L-1011 with reader/sorters that processed
checks on coast-to-coast overnight flights.
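
The real-time budget quoted above follows from simple arithmetic, worked here as a rough check. The per-host figure assumes, as a worst case not stated in the post, that all ten sorters run at full speed at once.

```python
DOCS_PER_MINUTE = 2500  # sorter speed quoted in the post
SORTERS_PER_HOST = 10   # active sorters per host, from the post

docs_per_second = DOCS_PER_MINUTE / 60
ms_per_document = 1000 / docs_per_second

# Assumption: every sorter is running flat out simultaneously.
host_docs_per_second = SORTERS_PER_HOST * docs_per_second
host_ms_per_decision = 1000 / host_docs_per_second

print(f"{docs_per_second:.1f} documents/s per sorter")
print(f"{ms_per_document:.1f} ms per document per sorter")
print(f"{host_ms_per_decision:.1f} ms average per pocket decision per host")
```

Under that assumption the host has, on average, about a tenth of the single-sorter interval for each pocket-select decision.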

Re: Concertina II Progress

<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>


https://www.novabbs.com/devel/article-flat.php?id=35084&group=comp.arch#35084

Date: Thu, 23 Nov 2023 21:08:45 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$oyCtiwEOd33CREoaz7RGqujiCvBf0K13RtZ5tzTmtHA/EA9KOIjzK
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad>
Organization: novaBBS
Message-ID: <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
 by: MitchAlsup - Thu, 23 Nov 2023 21:08 UTC

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup) writes:
>>Stefan Monnier wrote:

>>
>>I have a Linux friendly version where context switch is a single instruction.

> The Burroughs B3500 had a single such instruction, called
> Branch Reinstate (BRE).

My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV};
each privilege level has its own {IP, RF, Root Pointer, CSP, Exception
{Enabled, Raised}, and a few more things}, contained in 5 contiguous cache
lines.

The 4 privilege levels each have a pointer to those 5 cache lines. By
writing the control register (HR instruction) one can change the control
point for each level (of course you have to have appropriate permission),
but I decided that a user should have the ability to context switch to
another user without needing OS intervention--thus pthreads do not need
an excursion through the Guest OS to switch threads under the same memory
map {but do when crossing processes}.

Thus, all 4 privileges are always resident in the privilege hierarchy
at the cost of 4 DoubleWord registers instead of at the cost of 4 RFs.
With these levels all resident simultaneously, no table surgery is needed
to switch levels {Root pointers, MTRR,...} and no RF save/restore is
needed.
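
A toy model may help make the cost argument concrete. It is a sketch only: `Core` and `switch` are hypothetical names, the 32-entry register file and 64-bit registers are taken from other posts in this thread, and the alignment check is an assumption.

```python
REGISTERS = 32        # RF entries, as stated elsewhere in the thread
REGISTER_BYTES = 8    # 64-bit registers
DOUBLEWORD_BYTES = 8
LEVELS = ("application", "guest_os", "guest_hv", "real_hv")

# The RF is said to occupy 4 cache lines, which implies 64-byte lines.
CACHE_LINE_BYTES = REGISTERS * REGISTER_BYTES // 4
THREAD_STATE_BYTES = 5 * CACHE_LINE_BYTES  # 4 lines of RF + 1 of other state


class Core:
    """Hypothetical model: one thread-state pointer per privilege level."""

    def __init__(self):
        self.state_ptr = {level: None for level in LEVELS}

    def switch(self, level, new_ptr):
        """Context switch = overwrite one pointer; returns the old pointer."""
        if new_ptr % CACHE_LINE_BYTES:
            # Assumption: thread state is cache-line aligned.
            raise ValueError("thread state must be cache-line aligned")
        old, self.state_ptr[level] = self.state_ptr[level], new_ptr
        return old


# Resident cost of keeping all four levels "live" at once.
resident_pointer_bytes = len(LEVELS) * DOUBLEWORD_BYTES
banked_rf_bytes = len(LEVELS) * REGISTERS * REGISTER_BYTES
```

In this model the four pointers take 32 bytes of core state, against 1024 bytes for four banked register files.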

Re: Concertina II Progress

<ujohle$209gb$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=35087&group=comp.arch#35087

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Thu, 23 Nov 2023 15:53:47 -0600
Organization: A noiseless patient Spider
Lines: 482
Message-ID: <ujohle$209gb$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 23 Nov 2023 21:53:50 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0810ecf233304c32017e594c3f764a1d";
logging-data="2106891"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Ap27nQqddrwffC48RkAYB"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:taK5tYS9sjDkwsQ6MUGMW73rQVc=
In-Reply-To: <b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
Content-Language: en-US
 by: BGB - Thu, 23 Nov 2023 21:53 UTC

On 11/23/2023 10:53 AM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/22/2023 12:38 PM, MitchAlsup wrote:
>>> BGB wrote:
>
>> Yeah, but that is not exactly minimalist in terms of the hardware.
>
>> Granted, burning around 1 kilocycle of overhead per syscall isn't
>> ideal either...
>
>
>> Eg:
>>    Save registers to ISR stack;
>>    Copy registers to User context;
>>    Copy handler-task registers to ISR stack;
>>    Reload registers from ISR stack;
>>    Handle the syscall;
>>    Save registers to ISR stack;
>>    Copy registers to Syscall context;
>>    Copy User registers to ISR stack;
>>    Reload registers from ISR stack.
>
>
>> Does mean that one needs to be economical with syscalls (say, doing
>> "printf" a whole line at a time, rather than individual characters, ...).
>
> Not at all--I have reduced SysCalls to just a bit slower than actual CALL.
> say around 10-cycles. Use them as often as you like.
>

OK.

Well, they aren't very fast in my case, in any case.

>> And, did create incentive to allow getting the microsecond-clock value
>> and hardware RNG values from CPUID rather than needing a syscall (say,
>> don't want to burn 20us to check the microsecond counter, ...).
>
>
>> If the "memcpy's" could be eliminated, this could roughly halve the
>> cost of doing a syscall.
>
> I have MM (memory move) as a 3-operand instruction.
>

None in my case...

But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.
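
For a rough sense of scale, the figures in this post can be turned into cycle counts. This is back-of-the-envelope arithmetic, not a measurement: it assumes 64-bit registers, takes MB as 10**6 bytes, and uses the 64-GPR count mentioned later in the post.

```python
CLOCK_HZ = 50_000_000        # 50 MHz, from the post
COPY_MB_PER_S = (280, 300)   # L1 memcpy rate range, from the post
SYSCALL_CYCLES = 1000        # "around 1 kilocycle" per syscall
GPR_COUNT = 64               # GPR count mentioned later in the post
GPR_BYTES = 8                # assumption: 64-bit registers


def bytes_per_cycle(mb_per_s):
    # Assumption: MB means 10**6 bytes.
    return mb_per_s * 1_000_000 / CLOCK_HZ


def copy_cycles(nbytes, mb_per_s):
    return nbytes / bytes_per_cycle(mb_per_s)


syscall_us = SYSCALL_CYCLES * 1_000_000 / CLOCK_HZ
context_bytes = GPR_COUNT * GPR_BYTES
slow, fast = (copy_cycles(context_bytes, rate) for rate in COPY_MB_PER_S)

print(f"syscall: {syscall_us:.0f} us")
print(f"one GPR copy: {fast:.0f} to {slow:.0f} cycles")
```

The 20 us result matches the figure quoted above, and one copy of the GPRs comes out on the order of 85 to 91 cycles before cache misses or any other state are counted.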

Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).

>> One other option would be to do like RISC-V's privileged spec and have
>> multiple copies of the register file (and likely instructions for
>> accessing these alternate register files).
>
> There is one CPU register file, and every running thread has an address
> where that file comes from and goes to--just like a block of 4 cache lines;
> There is a 5th cache line that contains all the other PSW stuff.
>

No direct equivalent.

I was thinking sort of like the RISC-V Privileged spec, there are
User/Supervisor/Machine sets, with the mode affecting which of these is
visible.

Obvious drawback in my case is that this would effectively increase the
number of internal GPRs from 64 to 192 (and, at that point, may as well
go to 4 copies and have 256).

If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).
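
The widening can be illustrated with a small sketch. The layout is an assumption: two mode bits are simply prepended to the existing 7-bit selector, and `banked_selector` is a hypothetical name. How the current 7 bits are divided up is not stated here.

```python
SELECTOR_BITS = 7   # current selector width, per the post
MODE_BITS = 2       # enough to pick one of 3 or 4 banks
GPRS_PER_BANK = 64  # from the post


def banked_selector(mode, reg):
    """Prepend the mode (bank) number to the existing register selector."""
    if not 0 <= mode < (1 << MODE_BITS):
        raise ValueError("mode out of range")
    if not 0 <= reg < (1 << SELECTOR_BITS):
        raise ValueError("register selector out of range")
    return (mode << SELECTOR_BITS) | reg


widest = banked_selector((1 << MODE_BITS) - 1, (1 << SELECTOR_BITS) - 1)
print(widest.bit_length(), "bits;", 3 * GPRS_PER_BANK, "or", 4 * GPRS_PER_BANK, "GPRs")
```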

The increase in the number of CRs could be less, since only a few of
them actually need duplication.

But, don't want to go this way, and it would only be a partial solution
that also does not map up well to my current implementation.

Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.

Major differences:
SH-4 banked out R0..R7 when entering an interrupt;
The VBR-relative entry-point offsets were a bit ad hoc.

There were some fairly arbitrary displacements based on the type of
interrupt. Almost like they designed their interrupt mechanism around a
particular chunk of ASM code or something. In my case, I kept a similar
idea, but just used a fixed 8-byte spacing, with the idea of these spots
branching to the actual entry point.
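
With fixed spacing, locating an entry point is a single multiply-add. The sketch below assumes slots are numbered from zero; `entry_address`, the slot numbers and the example VBR value are hypothetical and not taken from the actual design.

```python
ENTRY_SPACING = 8  # bytes per slot, per the post


def entry_address(vbr, slot):
    """Address of the branch stub for a given interrupt slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return vbr + slot * ENTRY_SPACING


# Example with a made-up VBR value and five slots.
for slot in range(5):
    print(slot, hex(entry_address(0x8000, slot)))
```

Each 8-byte slot only needs to hold a branch to the real handler, which is what keeps the spacing uniform.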

Though, one other difference is in my case I ended up adding a dedicated
SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
have gone to the FAULT handler instead.

It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch, but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).

Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.

In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.

I think at one point, I had considered having tasks have both User and
Supervisor state (with two stacks and two copies of all the registers),
but ended up not going this way (and instead giving the syscalls their
own designated task context; which also saves on per-task memory overhead).

>> Worth the cost? Dunno.
>
> In my opinion--Absolutely worth it.
>
>> Not too much different to modern Windows, where slow syscalls are
>> still fairly common (and despite the slowness of the mechanism, it
>> seems like BJX2 syscalls still manage to be around an order of
>> magnitude faster than Windows syscalls in terms of clock-cycle cost...).
>
> Now, just get it down to a cache missing {L1, L2} instruction fetch.
>

Looked into it a little more, realized that "an order of magnitude" may
have actually been a little conservative; seems like Windows syscalls
may be more in the area of 50-100k cycles.

Why exactly? Dunno.

This is still ignoring some of the "slow cases" which may take millions
of clock cycles.

It also seems like fast-ish syscalls may be more of a Linux thing.

>>
>>> Why not just treat the RF as a cache with a known address in physical
>>> memory.
>>> In MY 66000 that is what I do and then just push and pull 4 cache
>>> lines at a
>>> time.
>>>
>
>> Possible, but poses its own share of problems...
>
>> Not sure how this could be implemented cost-effectively, or for that
>> matter, more cheaply than a RISC-V style mode-banked register-file.
>
> 1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
> of having 4 cache lines of state and 1 doubleword of address, you need
> 16 cache lines of state.
>

OK.

Having only 1 set of registers is good...

Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.

Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.

One bit of trickery would be, "what if" the Boot SRAM region were inside
the L1 cache rather than out on the ringbus?...

But, then one would have the cost of keeping 8K of SRAM close to the CPU
core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any
case...).

Though keeping it tied to a specific CPU core (and effectively processor
local) would avoid the ugly "what if" scenario of two CPU cores trying
to service an interrupt at the same time and potentially stepping on
each others' stacks. The main tradeoff vs putting the stacks in DRAM is
mostly that DRAM may have (comparably more expensive) L2 misses.

Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code from
the ISR stack or similar.

>> Though, could make sense if one has a mechanism where a context switch
>> could have a mechanism to dump the whole register file to Block-RAM,
>> and some sort of mechanism to access this RAM via an MMIO interface.
>
> Just put it in DRAM at SW controlled (via TLB) addresses.
>

Possibly.

It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this route
(baking parts of it into the C ABI is at least "slightly" less evil).

Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.

I guess I can probably safely rule out MMIO under the basis that context
switching via moving registers via MMIO would be slower than the current
mechanism (of using a series of Load/Store instructions).

>> Pros/cons, seems like each possibility would also come with drawbacks:
>>    As-is: Slowness due to needing to save/reload everything;
>>    RISC-V: Expensive regfile, only works for limited cases;
>>    MMIO Backed + RV-like: Faster U<->S, but slower task switching.
>>    RAM Backed: Cache coherence becomes a critical feature.
>
>
>
>> The RISC-V like approach makes sense if one assumes:
>>    There is a user process;
>>    There is a kernel running under it;
>>    We want to call from the user process into the kernel.
>
> So if you are running under a Real OS you don't need 2 sets of RFs in my
> model.
>


Re: Concertina II Progress

<ujoiph$20eko$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=35088&group=comp.arch#35088

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Thu, 23 Nov 2023 17:13:03 -0500
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <ujoiph$20eko$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 23 Nov 2023 22:13:05 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="51eaafb1da784ded83345eeb05351d9d";
logging-data="2112152"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19VA/cn1SD9aGUNZ7YqBjF8BIFErdjvwe4="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:GrYq5DKBozQX64uJ9DSALbzyb80=
In-Reply-To: <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
 by: Paul A. Clayton - Thu, 23 Nov 2023 22:13 UTC

On 11/23/23 4:08 PM, MitchAlsup wrote:
[snip]
> The 4 privilege levels each have a pointer to those 5 cache
> lines. By writing the control register (HR instruction) one
> can change the control point for each level (of course you
> have to have appropriate permission), but I decided that a
> user should have the ability to context switch to another
> user without needing OS intervention--thus pthreads do not
> need an excursion through the Guest OS to switch threads
> under the same memory map {but do when crossing processes}.

My 66000 also has Port Holes, which seem to offer some
cross-protection-domain access.

While not significantly helpful, I also wonder if privilege
reducing operations could be lower cost by not involving the
OS. This would require the OS to store the allowed privilege
elsewhere, but this might be done anyway. It would also have
little use (I suspect) and still require OS involvement to
restore privilege. There might be some cases where privilege
is only needed in an initialization stage, but that seems
likely to be rare.

Writing to the accessed and dirty bits of a PTE would also
seem to be something that could, in theory, be allowed to a
user-level process. Clearing the dirty bit could be dangerous
if stale data was from another protection domain. Clearing
the accessed bit would seem to only "strongly hint" that the
page be victimized earlier; setting the dirty bit would not
be different than a "silent store" [not useful it seems since
a load/store instruction pair could accomplish the same] and
setting the accessed bit would seem the same as performing a
non-caching load to any location in the page acting as a
"keep me" hint [probably not useful]. Even with this little
thought, allowing these PTE changes seems not worthwhile.

Re: Concertina II Progress

<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>


https://www.novabbs.com/devel/article-flat.php?id=35089&group=comp.arch#35089

Date: Thu, 23 Nov 2023 23:30:50 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Spam-Level: *
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$.1bBwUDeGoPwjjKSEfmn.O.s1ZV2E0Y.INmnwVVWXQzBiqaZoD4XK
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <ujmi69$1n5ko$1@dont-email.me> <b0e6671f87b1d447b23372180d119e2e@news.novabbs.com> <ujohle$209gb$1@dont-email.me>
Organization: novaBBS
Message-ID: <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
 by: MitchAlsup - Thu, 23 Nov 2023 23:30 UTC

BGB wrote:

> On 11/23/2023 10:53 AM, MitchAlsup wrote:
>> BGB wrote:
>>
>>
>>> If the "memcpy's" could be eliminated, this could roughly halve the
>>> cost of doing a syscall.
>>
>> I have MM (memory move) as a 3-operand instruction.
>>

> None in my case...

> But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
> Still might be better to not do a memcpy in these cases.

> Say, if the ISR handler could "merely" reassign the TBR register to
> switch from one task to another to perform the context switch (still
> ignoring all the loads/stores hidden in the prolog and epilog).

>>> One other option would be to do like RISC-V's privileged spec and have
>>> multiple copies of the register file (and likely instructions for
>>> accessing these alternate register files).
>>
>> There is one CPU register file, and every running thread has an address
>> where that file comes from and goes to--just like a block of 4 cache lines;
>> There is a 5th cache line that contains all the other PSW stuff.
>>

> No direct equivalent.

> I was thinking sort of like the RISC-V Privileged spec, there are
> User/Supervisor/Machine sets, with the mode affecting which of these is
> visible.

> Obvious drawback in my case is that this would effectively increase the
> number of internal GPRs from 64 to 192 (and, at that point, may as well
> go to 4 copies and have 256).

> If this were handled in the decoder, this would mean roughly a 9-bit
> register selector field (vs the current 7 bits).

Decode is not the problem; sensing 1:256 is a big problem. In practice
even SRAMs only have 32-pairs of cells on a bit line using exotic timed
sense amps.
{{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
the other ½ is fan-out buffering--driving the lines into the decoder is more
gates of delay than determining if a given select line should be asserted.}}

> The increase in the number of CRs could be less, since only a few of
> them actually need duplication.

> But, don't want to go this way, and it would only be a partial solution
> that also does not map up well to my current implementation.

> Not sure how an OS on SH-4 would have managed all this, but I suspect
> their interrupt model would have had similar limitations to mine.

> Major differences:
> SH-4 banked out R0..R7 when entering an interrupt;
> The VBR-relative entry-point offsets were a bit ad hoc.

> There were some fairly arbitrary displacements based on the type of
> interrupt. Almost like they designed their interrupt mechanism around a
> particular chunk of ASM code or something. In my case, I kept a similar
> idea, but just used a fixed 8-byte spacing, with the idea of these spots
> branching to the actual entry point.

> Though, one other difference is in my case I ended up adding a dedicated
> SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
> have gone to the FAULT handler instead.

> It is in-theory possible to jump from Interrupt Mode to normal
> Supervisor Mode without a full context switch,

but why ?? the probability that control returns from a given IST to its
softIRQ is less than ½ in a loaded system.

> but the specifics of
> doing so would get a bit more hairy and arcane (which is sort of why I
> just sorta ended up using a context switch).

> Not sure what Linux on SH-4 had done, didn't really investigate this
> part of the code all that much at the time.

> In theory, the ISR handlers could be made to mimic the x86 TSS
> mechanism, but this wouldn't gain much.

Stay away from anything you see in x86 except in using it as a moniker
for what to avoid.

> I think at one point, I had considered having tasks have both User and
> Supervisor state (with two stacks and two copies of all the registers),
> but ended up not going this way (and instead giving the syscalls their
> own designated task context; which also saves on per-task memory overhead).

>>> Worth the cost? Dunno.
>>
>> In my opinion--Absolutely worth it.
>>
>>> Not too much different to modern Windows, where slow syscalls are
>>> still fairly common (and despite the slowness of the mechanism, it
>>> seems like BJX2 syscalls still manage to be around an order of
>>> magnitude faster than Windows syscalls in terms of clock-cycle cost...).
>>
>> Now, just get it down to a cache missing {L1, L2} instruction fetch.
>>

> Looked into it a little more, realized that "an order of magnitude" may
> have actually been a little conservative; seems like Windows syscalls
> may be more in the area of 50-100k cycles.

> Why exactly? Dunno.

> This is still ignoring some of the "slow cases" which may take millions
> of clock cycles.

> It also seems like fast-ish syscalls may be more of a Linux thing.

>>>
>>>> Why not just treat the RF as a cache with a known address in physical
>>>> memory.
>>>> In MY 66000 that is what I do and then just push and pull 4 cache
>>>> lines at a
>>>> time.
>>>>
>>
>>> Possible, but poses its own share of problems...
>>
>>> Not sure how this could be implemented cost-effectively, or for that
>>> matter, more cheaply than a RISC-V style mode-banked register-file.
>>
>> 1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
>> of having 4 cache lines of state and 1 doubleword of address, you need
>> 16 cache lines of state.
>>

> OK.

> Having only 1 set of registers is good...

> Issue is the mechanism for how to get all the contents in/out of the
> register file, in a way that is both cost effective, and faster than
> using a series of Load/Store instructions would have otherwise been.

6R6W RFs are as big as one can practically build. You can get as much
Read BW by duplication, but you only have "so much" Write BW (even when
you know each write is to a different register).

> Short of a pipeline redesign, it is unlikely to exceed a best case of
> around 128 bits per clock cycle, with (in practice) there typically
> being other penalties due to things like L1 misses and similar.

6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
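
To compare the two bandwidth figures, here is a best-case estimate of how many cycles it takes to move a whole register file. It ignores cache misses and pipeline startup; the 64-bit width for BGB's 64 GPRs is an assumption, and `transfer_cycles` is a hypothetical helper.

```python
import math


def transfer_cycles(registers, bits_per_register, bits_per_cycle):
    """Best-case cycles to stream a whole register file in or out."""
    return math.ceil(registers * bits_per_register / bits_per_cycle)


# BGB's estimate: 64 GPRs (assumed 64-bit) over a 128-bit-per-cycle path.
bgb_cycles = transfer_cycles(64, 64, 128)

# The 6-port figure above: 32 registers over 6 * 64 = 384 bits per cycle.
six_port_cycles = transfer_cycles(32, 64, 6 * 64)

print(bgb_cycles, six_port_cycles)
```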

> One bit of trickery would be, "what if" the Boot SRAM region were inside
> the L1 cache rather than out on the ringbus?...

2 things::
a) By giving threadstate an address you gain the ability to load the
initial RF image from ROM as the CPU comes out of reset--it comes out
with a complete RF, a complete thread.header, mapping tables, privilege
and priority.
b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
state (no underlying DRAM address available), so you have ~1MB to play
around with until you find DRAM, configure it, initialize it, and put it
in the free pool.
So, here, you HAVE "enough" storage to program BOOT activities in a HLL
(of your choice).

> But, then one would have the cost of keeping 8K of SRAM close to the CPU
> core that is mostly only ever used during interrupt handling (but,
> probably still cheaper than making the register file 3x bigger, in any
> case...).

Are the Icache and Dcache not close enough? If not, then add L2!

> Though keeping it tied to a specific CPU core (and effectively processor
> local) would avoid the ugly "what if" scenario of two CPU cores trying
> to service an interrupt at the same time and potentially stepping on
> each others' stacks. The main tradeoff vs putting the stacks in DRAM is
> mostly that DRAM may have (comparably more expensive) L2 misses.

The interrupt (re)mapping table takes care of this prior to the CPU being
bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
table associated with the "Originating" thread (IO-MMU). That interrupt
is logged into the table and, if enabled, its priority is used to determine
which set of CPUs should be bothered; the affinity mask of the "Originating"
thread is used to qualify which CPUs from the priority set, and one of these
is selected. The selected CPU is tapped on the shoulder, and sends a
get-Interrupt request to the Interrupt table logic, which sends back the
priority and number of a pending interrupt. If the CPU is still at lower
priority than the returning interrupt, the CPU <at this point> stops running
code from the old thread and begins running code on the new thread.
{{During the sending of the interrupt to the CPU and the receipt of the
claim-Interrupt message, that interrupt will not get handed to any other
CPU.}} So, the CPU continues to run instructions while the CPUs contend
for and claim unique interrupts. There are 512 unique interrupts at each of
64 priority levels, and each process can have its own Interrupt Table.
These tables need no maintenance except when interrupts are created and
destroyed.
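
The flow described above can be modelled roughly as follows. This is a loose sketch, not the actual hardware protocol: `InterruptTable`, `pick_cpu`, and the method names are hypothetical, a higher number is assumed to mean higher priority, and preferring the CPU running the least important work is an assumed selection policy.

```python
NUM_PRIORITIES = 64            # priority levels, from the post
INTERRUPTS_PER_PRIORITY = 512  # unique interrupts per level, from the post


class InterruptTable:
    def __init__(self):
        self.pending = [set() for _ in range(NUM_PRIORITIES)]
        self.offered = set()  # handed to a CPU but not yet claimed

    def post(self, priority, number):
        """Log an interrupt in the table."""
        if not 0 <= number < INTERRUPTS_PER_PRIORITY:
            raise ValueError("interrupt number out of range")
        self.pending[priority].add(number)

    def get_interrupt(self):
        """Return the highest-priority pending interrupt not already offered."""
        for priority in range(NUM_PRIORITIES - 1, -1, -1):
            for number in sorted(self.pending[priority]):
                if (priority, number) not in self.offered:
                    self.offered.add((priority, number))
                    return priority, number
        return None

    def claim(self, priority, number):
        """The CPU accepts the interrupt; it is no longer pending."""
        self.pending[priority].discard(number)
        self.offered.discard((priority, number))


def pick_cpu(cpu_priority, affinity_mask, irq_priority):
    """Pick a CPU allowed by the affinity mask and running below irq_priority."""
    candidates = [
        cpu for cpu, prio in enumerate(cpu_priority)
        if (affinity_mask >> cpu) & 1 and prio < irq_priority
    ]
    # Assumed policy: bother the CPU doing the least important work.
    return min(candidates, key=lambda cpu: cpu_priority[cpu], default=None)
```

The `offered` set stands in for the guarantee that an interrupt already sent to one CPU is not handed to another before it is claimed.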


Re: Concertina II Progress

<ujp27q$229mq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=35090&group=comp.arch#35090

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi...@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Thu, 23 Nov 2023 21:36:41 -0500
Organization: A noiseless patient Spider
Lines: 488
Message-ID: <ujp27q$229mq$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 02:36:42 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c747113e7ff105f906852486f1fad6fc";
logging-data="2172634"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18V53ORyyEfjsfesT86MAT+7p/lqGduXMY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:nn+/Lb1OHHilpooHyB81Kuq05iM=
Content-Language: en-US
In-Reply-To: <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
 by: Robert Finch - Fri, 24 Nov 2023 02:36 UTC

On 2023-11-23 6:30 p.m., MitchAlsup wrote:
> BGB wrote:
>
>> On 11/23/2023 10:53 AM, MitchAlsup wrote:
>>> BGB wrote:
>>>
>>>
>>>> If the "memcpy's" could be eliminated, this could roughly halve the
>>>> cost of doing a syscall.
>>>
>>> I have MM (memory move) as a 3-operand instruction.
>>>
>
>> None in my case...
>
>> But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
>> Still might be better to not do a memcpy in these cases.
>
>> Say, if the ISR handler could "merely" reassign the TBR register to
>> switch from one task to another to perform the context switch (still
>> ignoring all the loads/stores hidden in the prolog and epilog).
>
>
>>>> One other option would be to do like RISC-V's privileged spec and
>>>> have multiple copies of the register file (and likely instructions
>>>> for accessing these alternate register files).
>>>
>>> There is one CPU register file, and every running thread has an address
>>> where that file comes from and goes to--just like a block of 4 cache
>>> lines;
>>> There is a 5th cache line that contains all the other PSW stuff.
>>>
>
>> No direct equivalent.
>
>
>> I was thinking sort of like the RISC-V Privileged spec, there are
>> User/Supervisor/Machine sets, with the mode affecting which of these
>> is visible.
>
>> Obvious drawback in my case is that this would effectively increase
>> the number of internal GPRs from 64 to 192 (and, at that point, may as
>> well go to 4 copies and have 256).
>
>> If this were handled in the decoder, this would mean roughly a 9-bit
>> register selector field (vs the current 7 bits).
>
> Decode is not the problem, sensing 1:256 is a big problem, in practice
> even SRAMs only have 32-pairs of cells on a bit line using exotic timed
> sense amps.
> {{Decode is almost NEVER the logic delay problem:: ½ is situation
> recognition,
> the other ½ is fan-out buffering--driving the lines into the decoder is
> more
> gates of delay than determining if a given select line should be
> asserted.}}
>
>> The increase in the number of CRs could be less, since only a few of
>> them actually need duplication.
>
>
>> But, don't want to go this way, and it would only be a partial
>> solution that also does not map up well to my current implementation.
>
>
>
>> Not sure how an OS on SH-4 would have managed all this, but I suspect
>> their interrupt model would have had similar limitations to mine.
>
>> Major differences:
>>    SH-4 banked out R0..R7 when entering an interrupt;
>>    The VBR relative entry-point offsets were a bit, ad-hoc.
>
>> There were some fairly arbitrary displacements based on the type of
>> interrupt. Almost like they designed their interrupt mechanism around
>> a particular chunk of ASM code or something. In my case, I kept a
>> similar idea, but just used a fixed 8-byte spacing, with the idea of
>> these spots branching to the actual entry point.
>
>> Though, one other difference is in my case I ended up adding a
>> dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
>> which would have gone to the FAULT handler instead.
>
>
>> It is in-theory possible to jump from Interrupt Mode to normal
>> Supervisor Mode without a full context switch,
>
> but why ?? the probability that control returns from a given IST to its
> softIRQ is less than ½ in a loaded system.
>
>>                                                 but the specifics of
>> doing so would get a bit more hairy and arcane (which is sort of why I
>> just sorta ended up using a context switch).
>
>> Not sure what Linux on SH-4 had done, didn't really investigate this
>> part of the code all that much at the time.
>
>
>> In theory, the ISR handlers could be made to mimic the x86 TSS
>> mechanism, but this wouldn't gain much.
>
> Stay away from anything you see in x86 except in using it as a moniker
> for what to avoid.
>
>> I think at one point, I had considered having tasks have both User and
>> Supervisor state (with two stacks and two copies of all the
>> registers), but ended up not going this way (and instead giving the
>> syscalls their designated own task context; which also saves on
>> per-task memory overhead).
>
>
>>>> Worth the cost? Dunno.
>>>
>>> In my opinion--Absolutely worth it.
>>>
>>>> Not too much different to modern Windows, where slow syscalls are
>>>> still fairly common (and despite the slowness of the mechanism, it
>>>> seems like BJX2 syscalls still manage to be around an order of
>>>> magnitude faster than Windows syscalls in terms of clock-cycle
>>>> cost...).
>>>
>>> Now, just get it down to a cache missing {L1, L2} instruction fetch.
>>>
>
>> Looked into it a little more, realized that "an order of magnitude"
>> may have actually been a little conservative; seems like Windows
>> syscalls may be more in the area of 50-100k cycles.
>
>> Why exactly? Dunno.
>
>
>> This is still ignoring some of the "slow cases" which may take
>> millions of clock cycles.
>
>> It also seems like fast-ish syscalls may be more of a Linux thing.
>
>
>>>>
>>>>> Why not just treat the RF as a cache with a known address in
>>>>> physical memory.
>>>>> In MY 66000 that is what I do and then just push and pull 4 cache
>>>>> lines at a
>>>>> time.
>>>>>
>>>
>>>> Possible, but poses its own share of problems...
>>>
>>>> Not sure how this could be implemented cost-effectively, or for that
>>>> matter, more cheaply than a RISC-V style mode-banked register-file.
>>>
>>> 1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
>>> of having 4 cache lines of state and 1 doubleword of address, you need
>>> 16 cache lines of state.
>>>
>
>> OK.
>
>
>> Having only 1 set of registers is good...
>
>> Issue is the mechanism for how to get all the contents in/out of the
>> register file, in a way that is both cost effective, and faster than
>> using a series of Load/Store instructions would have otherwise been.
>
> 6R6W RFs are as big as one can practically build. You can get as much
> Read BW by duplication, but you only have "so much" Write BW (even when
> you know each write is to a different register).
>
>> Short of a pipeline redesign, it is unlikely to exceed a best case of
>> around 128 bits per clock cycle, with (in practice) there typically
>> being other penalties due to things like L1 misses and similar.
>
> 6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
>
>> One bit of trickery would be, "what if" the Boot SRAM region were
>> inside the L1 cache rather than out on the ringbus?...
>
> 2 things::
> a) By giving threadstate an address you gain the ability to load the
> initial RF image from ROM as the CPU comes out of reset--it comes out
> with a complete RF, a complete thread.header, mapping tables, privilege
> and priority.
> b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
> state (no underlying DRAM address available), so you have ~1MB to play
> around with until you find DRAM, configure it, initialize it, and put it
> in the free pool.
> So, here, you HAVE "enough" storage to program BOOT activities in a HLL
> (of your choice).
>
>> But, then one would have the cost of keeping 8K of SRAM close to the
>> CPU core that is mostly only ever used during interrupt handling (but,
>> probably still cheaper than making the register file 3x bigger, in any
>> case...).
>
> Is the Icache and Dcache not close enough ?? If not then add L2 !!
>
>> Though keeping it tied to a specific CPU core (and effectively
>> processor local) would avoid the ugly "what if" scenario of two CPU
>> cores trying to service an interrupt at the same time and potentially
>> stepping on each others' stacks. The main tradeoff vs putting the
>> stacks in DRAM is mostly that DRAM may have (comparably more
>> expensive) L2 misses.
>
> The interrupt (re)mapping table takes care of this prior to the CPU being
> bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
> table associated with the "Originating" thread. (IO/-MMU). That interrupt
> is logged into the table and if enabled its priority is used to determine
> which set of CPUs should be bothered, the affinity mask of the
> "Originating"
> thread is used to qualify which CPU from the priority set, and one of these
> is selected. The selected CPU is tapped on the shoulder, and sends a get-
> Interrupt request to the Interrupt table logic which sends back the
> priority
> and number of a pending interrupt. If the CPU is still at lower priority
> than the returning interrupt, the CPU <at this point> stops running code
> from the old thread and begins running code on the new thread.
> {{During the sending of the interrupt to the CPU and the receipt of the
> claim-Interrupt message, that interrupt will not get handed to any other
> CPU}} So, the CPU continues to run instructions while the CPUs contend
> for and claim unique interrupts. There are 512 unique interrupts at each
> of 64 priority levels, and each process can have its own Interrupt Table.
> These tables need no maintenance except when interrupts are created and
> destroyed.
>
> HV, Guest HV, Guest OS each have their own unique interrupt tables;
> Although it could be arranged such that all could use the same table.
>
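
The claim step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the class name
`InterruptTable`, the methods `post` and `claim`, the rule that a larger
number means higher priority, and the lowest-number-first tie break are all
hypothetical and not taken from the My 66000 definition. Only the 64 levels,
the 512 interrupts per level, and the "claim only if above the CPU's current
priority" rule come from the text.

```python
class InterruptTable:
    LEVELS = 64        # priority levels, from the quoted description
    PER_LEVEL = 512    # unique interrupts per level, from the quoted description

    def __init__(self):
        # One set of pending interrupt numbers per priority level.
        self.pending = [set() for _ in range(self.LEVELS)]

    def post(self, priority, number):
        """Log an interrupt into the table (hypothetical interface)."""
        if not (0 <= priority < self.LEVELS and 0 <= number < self.PER_LEVEL):
            raise ValueError("priority or interrupt number out of range")
        self.pending[priority].add(number)

    def claim(self, cpu_priority):
        """Hand out the highest-priority pending interrupt above cpu_priority.

        The entry is removed as it is claimed, so no second CPU can be
        handed the same interrupt. Returns None if nothing qualifies.
        """
        for priority in range(self.LEVELS - 1, cpu_priority, -1):
            if self.pending[priority]:
                number = min(self.pending[priority])
                self.pending[priority].remove(number)
                return priority, number
        return None
```

A CPU running at priority 10 would claim a pending level-40 interrupt but
leave a level-5 one in the table for a CPU running at lower priority.
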
>> Would add a potential "wonk" factor though, if this SRAM region were
>> only visible for D$ access, but inaccessible from the I$. But, I guess
>> one can argue, there isn't really a valid reason to try to run code
>> from the ISR stack or similar.
>
>
>>>> Though, could make sense if one has a mechanism where a context
>>>> switch could have a mechanism to dump the whole register file to
>>>> Block-RAM, and some sort of mechanism to access this RAM via an MMIO
>>>> interface.
>>>
>>> Just put it in DRAM at SW controlled (via TLB) addresses.
>>>
>
>> Possibly.
>
>> It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
>> stuff could be baked into hardware... But, I don't want to go this
>> route (baking parts of it into the C ABI is at least "slightly" less
>> evil).
>
> My mechanism is taking that struct task.....s (at least the part HW
> needs to understand) and associating each one into a table that points
> at DRAM. Now, when you want this thread to run, you load up the pointer
> set the e-bit (enabled) and write it into the current header at its
> privilege level. Poof--all 5 cache lines of state from the currently
> running thread go back to their permanent home in DRAM, and the CPU
> fetches the 5 cache lines of state of the new thread.
> a) you can start the reads before you start the writes
> b) you can start the writes anytime you have outbound access to "the bus"
> c) the writes can be no later than the ½ cycle before the reads get written.
> Which is a lot faster than you can do in SW with LDs and STs.
>
>> Also possible could be to add another CR for "Dump context registers
>> here", this adds the costs of another CR though.
>
> I config-space mapped all my CRs, so you get an unlimited number of them.
>
>> I guess I can probably safely rule out MMIO under the basis that
>> context switching via moving registers via MMIO would be slower than
>> the current mechanism (of using a series of Load/Store instructions).
> .................
>>> Yes, but PTHREADing can be done without privilege and in a single
>>> instruction.
>>>
>
>> OK.
>
>> Luckily, a thread-switch only needs to go 1-way, reducing it to around
>> 500 cycles as-is in my case.
>
> In my case it is about MemoryLatency+5 cycles.
> Yes, thread switch is a 1-way function--which is the reason you can
> allow a user to preempt himself and allow a compatriot to run in his
> place.....
>
>> Theoretical minimum would be around 150-200 cycles, with most of the
>> savings based on eliminating around 1.5kB worth of "memcpy()"...
>
> My Real Time version of MY 66000 does 10-ish cycle context switch
> (as seen at the CPU) but here a hunk of HW has gathered up those 5 cache
> lines and sent them to the targeted CPU and all the CPU has to do is push
> out the old state (5 cache lines). So the data was heading towards the
> CPU before the CPU even knew it wanted that data !!
>
>> This need not involve an ISA change, could in theory be done by making
>> the SYSCALL ISR mandate that TBR be valid (and the associated compiler
>> changes, likely the main issue here).
>
>
>
>> Well, nevermind any cost of locating the next thread, but at the
>> moment, I am using a fairly simplistic round-robin scheduling
>> strategy, so the scheduler mostly starts at a given PID, and looks for
>> the next PID that holds a valid/running task (wrapping back to PID 1
>> if it hits the end, and stopping the search if it gets back to the
>> original PID).
>
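
A minimal sketch of the round-robin search described above, in Python for
brevity. The task-table layout (a list indexed by PID with slot 0 unused),
the state strings, and the name `next_runnable` are assumptions made for
illustration; this is not the actual scheduler code.

```python
def next_runnable(tasks, current_pid):
    """Return the next PID after current_pid that holds a running task.

    tasks[pid] is None for an empty slot, otherwise a state string.
    The search wraps back to PID 1 at the end of the table and stops
    when it gets back to the PID it started from.
    """
    max_pid = len(tasks) - 1
    pid = current_pid
    while True:
        pid = pid + 1 if pid < max_pid else 1   # wrap back to PID 1
        if pid == current_pid:
            return current_pid                  # nobody else is runnable
        if tasks[pid] == "running":
            return pid


tasks = [None, "running", None, "blocked", "running"]
print(next_runnable(tasks, 1))   # skips the empty and blocked slots
```

The cost of this search is linear in the size of the PID table, which is
the "cost of locating the next thread" that the paragraph sets aside.
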
>
>> The high-level threading model wasn't based on pthreads in my case,
>> but rather C11 threads (and had implemented a lot of the "threads.h"
>> stuff).
>
>> One could potentially mimic pthreads on top of C11 threads though.
>
>> At the moment, I forgot why I decided to go with C11 threads over
>> pthreads, but IIRC I think I had felt at the time like C11 threads
>> were a better fit.
>
>
>>>> One could have enough register banks for N logical tasks, but
>>>> supporting 4 or 8 copies of the register file is going to cost more
>>>> than 2 or 3.
>>>
>>>
>>>> Above, I was describing what the hardware was doing.
>>>
>>>> The software side is basically more like:
>>>>    Branch from VBR-table to ISR entry point;
>>>>    Get R0 and R1 saved onto the stack;
>>>
>>> Where did you get the address of this stack ??
>>>
>
>> SP and SSP swap places on interrupt entry (currently by renumbering
>> the registers in the instruction decoder).
>
> So, in effect, you actually have 33 registers with only 32 visible at
> any instant. I am just so glad not to have gone down that rabbit hole
> this time......
>
>> SSP is initialized early on to the SRAM stack, so when an interrupt
>> happens, the 'SP' register automatically becomes the SRAM stack.
>
>> Essentially, both SP and SSP are SPRs, but:
>>    SP is mapped into R15 in the GPR space;
>>    SSP is mapped into the CR space.
>
>> So, when executing an ISR, it is effectively using SSP as its SP.
>
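
The SP/SSP swap can be pictured as a renaming step in the decoder. The
sketch below is a toy model in Python, not the actual BJX2 decoder: the
function name `rename`, the slot names `SP_SLOT` and `SSP_SLOT`, and the
example addresses are all made up for illustration.

```python
def rename(reg, in_isr):
    """Map an architectural register name to a physical storage slot.

    Outside an ISR, R15 names the SP slot and the SSP control register
    names the SSP slot. Inside an ISR the two swap places.
    """
    if in_isr:
        mapping = {"R15": "SSP_SLOT", "SSP": "SP_SLOT"}
    else:
        mapping = {"R15": "SP_SLOT", "SSP": "SSP_SLOT"}
    return mapping.get(reg, reg)   # every other register is unaffected


# Hypothetical contents: a task stack in DRAM and the ISR stack in SRAM.
slots = {"SP_SLOT": 0x0100_0000, "SSP_SLOT": 0x0000_C000}

print(hex(slots[rename("R15", in_isr=False)]))  # task stack
print(hex(slots[rename("R15", in_isr=True)]))   # SRAM stack
```

The model also shows why the interrupted task's SP stays reachable from the
ISR: it is still there, just under the SSP name.
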
>
>> If I were to eliminate this implicit register-swap mechanism, then the
>> ISR entry would likely need to reload a constant address each time.
>> Though, this change would also break binary compatibility with my
>> existing code.
>
>> But, in theory, eliminating the register swap could allow demoting SP
>> to being a normal GPR.
>
>> Also, things like renumbering parts of the register space based on CPU
>> mode is expensive.
>
>
>> Though, some of my more recent design ideas would have gone over to an
>> ordering slightly more like RISC-V, say:
>>    R0: ZR or PC  (ALU or MEM)
>>    R1: LR or TBR (ALU or MEM)
>>    R2: SP
>>    R3: GP (GBR)
>>    R4 -R15: Scratch
>>    R16-R31: Callee Save
>>    R32-R47: Scratch
>>    R48-R63: Callee Save
>
>> Would likely not adopt RISC-V's C ABI though.
>
> R0::     GPR, Return Address, proxy for IP, proxy for 0
> R1..R9   Arguments and results passed in registers
> R10..R15 Temporary Registers (scratch)
> R16..R29 Callee Save
> R30      FP when in use, Callee Save
> R31      SP
>
>> Though, if one assumes R4..R63 are GPRs, this would allow both this
>> ISA and RISC-V to still use the same register numbering.
>
>> This is already fairly close to the register numbering scheme used in
>> XG2RV, though the assumption was that XG2RV would have used RV's ABI,
>> but this was stalled out mostly due to compiler issues (getting BGBCC
>> to be able to follow RISC-V's C ABI rules would be a non-trivial level
>> of effort; but is rendered moot if one still needs to use call thunking).
>
>
>> The interpretation for R0 and R1 would depend on how they are used:
>>    ALU or similar: ZR and LR (Zero and Link Register)
>>    Load/Store Base: PC and TBR.
>
>> Idea being that in userland, TBR effectively still exists as a
>> Read-Only register (allowing userland to modify TBR would effectively
>> also allow userland to wreck the OS).
>
>
>> Thing is mostly that needing to renumber registers in the decoder
>> based on CPU mode isn't entirely free in terms of LUT cost or timing
>> latency (even if it only applies to a subset of the register space).
>
>> Note that for RV decoding:
>>    X0..X31 -> R0 ..R31 (more or less)
>>    F0..F31 -> R32..R63
>
>> But, RV's FPU instructions don't match up exactly 1:1, and some cases
>> would have semantic differences.
>
>> Though, it seems like most RV code could likely tolerate some
>> deviation in some areas (will it care that the high 32 bits of a
>> Binary32 register don't hold NaN? Will it care about the extra
>> funkiness going on in LR? ...).
>
>
>>>>    Get some of the CRs saved off (we need R0 and R1 free here);
>>>>    Get the rest of the GPRs saved onto the stack;
>>>>    Call into the main part of the ISR handler (using normal C ABI);
>>>>    Restore most of the GPRs;
>>>>    Restore most of the CRs;
>>>>    Restore R0 and R1;
>>>>    Do an RTE.
>>>
>>> If HW does register file save/restore the above looks like::
>>>
>>>> The software side is basically more like:
>>>>    Branch from VBR-table to ISR entry point;
>>>>    Call into the main part of the ISR handler (using normal C ABI);
>>>>    Do an RTE.
>>>
>>> See what it saves ??
>
>> This is fewer instructions.
>
>> But, hardware cost,
>
> the HW cost has already been purchased by the state machine that writes
> out 5-cache lines and waits for 5-cache lines to arrive.
>
>> and clock-cycle savings?...
>
> The reads can arrive before you start the writes, you can go so far as
> to organize your pipeline so the read data being written pushes
> out the write data that needs to return to memory-making the timing
> brain dead easy to achieve.
>
>
>> As-is, I can't come up with much that is both:
>>    Fairly cheap to implement in hardware;
>>    Would save a lot of clock-cycles over software-based options.
>
>> As noted, the former is also why I had thus far mostly rejected the
>> RISC-V strategy (*).
>
> Yet, you seem to be buying insurance as if you might need to head in that
> direction.
>
>> *: Ironically, despite RISC-V having fewer GPRs, to implement the
>> Privileged spec, RISC-V would still end up needing a somewhat bigger
>> register file... Nevermind what exactly is going on with CSRs...
>
> Whereas that special State is only a dozen register <with state>
> in My 66000--the rest being either memory resident or memory mapped.


Re: Concertina II Progress

<c2f644d4318b752714f80a453db90b97@news.novabbs.com>


https://www.novabbs.com/devel/article-flat.php?id=35092&group=comp.arch#35092

Date: Fri, 24 Nov 2023 03:11:17 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Spam-Level: *
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$jKjq/9Qzt.E17XrJK2R0v.mcZ17rZOEuXYhrMRR8IydYOSKnRegV2
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <ujmi69$1n5ko$1@dont-email.me> <b0e6671f87b1d447b23372180d119e2e@news.novabbs.com> <ujohle$209gb$1@dont-email.me> <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com> <ujp27q$229mq$1@dont-email.me>
Organization: novaBBS
Message-ID: <c2f644d4318b752714f80a453db90b97@news.novabbs.com>
 by: MitchAlsup - Fri, 24 Nov 2023 03:11 UTC

Robert Finch wrote:

> On 2023-11-23 6:30 p.m., MitchAlsup wrote:
>> BGB wrote:
>>
>
>>
>> Whereas that special State is only a dozen register <with state>
>> in My 66000--the rest being either memory resident or memory mapped.

> My 68000 CPU core had a couple of task switching instructions added to
> it. I made a dedicated task switch RAM wide enough to load or store all
> the 68k registers in a single clock. Total task switch time was about
> four clocks IIRC. The interrupt vector table was setup to be able to
> automatically task switch on interrupt. The RAM had storage for up to
> 512 tasks, but it was dedicated inside the CPU core rather than storing
> task information in the memory system.

This is headed in the right direction. Make context switching something
easy to pull off.

> Q+ has a 64 register file, so it would take eight or nine cache lines to
> store the context. Q+ register file is 4w18r ATM. Getting from the
> register file to or from a cache line is a challenge. To access groups
> of eight registers at once would mean adding or using eight register
> file ports. The register file has only four write ports so only ½ of a
> cache line could be written to the file in a clock cycle. It is
> appealing to handle multiple registers per clock. Read/write ports are
> dedicated to specific function units, so making use of them for task
> switching may involve additional logic. I called the CSR to store the
> task state address the TS CSR.
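
The context-size arithmetic in the quoted paragraph can be checked with a
few lines of Python. The 64-bit register width and the 64-byte cache line
are assumptions (they are consistent with four write ports covering half a
line per clock); the possible ninth line for non-GPR state is not modelled.

```python
REG_COUNT = 64     # Q+ register file size, from the post
REG_BYTES = 8      # assumption: 64-bit registers
LINE_BYTES = 64    # assumption: 64-byte cache lines
WRITE_PORTS = 4    # from the post (4w18r)

context_bytes = REG_COUNT * REG_BYTES            # bytes of GPR state
cache_lines = context_bytes // LINE_BYTES        # lines needed for the GPRs
regs_per_line = LINE_BYTES // REG_BYTES          # registers held by one line
line_fraction_per_clock = WRITE_PORTS / regs_per_line
restore_clocks = REG_COUNT // WRITE_PORTS        # clocks to write all GPRs

print(context_bytes, cache_lines, line_fraction_per_clock, restore_clocks)
```

Under these assumptions the write ports, not the cache-line fetch, set the
floor on how quickly a context can be loaded.
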

4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
Yet you chose 18. Why ?

This is above and beyond the "typical" operand consumption of a RISC ISA.
Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
12R, allowing 1 FU to consume 3 registers and 1 FU to have only 1 operand,
or forwarding). What are you using the other 5 operands for ??

> As I understand it normally RISCV does not use multiple register files,

RISC-V has a 32 entry GPR and a 32 entry FPR.

> it has only a single file. There may be implementations out there that
> do make use of multiple files, but I think the standard is setup to get
> by with a single file.

Re: Concertina II Progress

<ujp9b3$26sea$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=35094&group=comp.arch#35094

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi...@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Thu, 23 Nov 2023 23:37:54 -0500
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <ujp9b3$26sea$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me>
<uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me>
<4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me>
<uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
<ujp27q$229mq$1@dont-email.me>
<c2f644d4318b752714f80a453db90b97@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 04:37:55 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c747113e7ff105f906852486f1fad6fc";
logging-data="2322890"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+d/QzI4Om4LJC9RnfS7vni2tIwITw/QHA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Yd0gwJFfIXkwhBnLRzq/oXL0SKQ=
In-Reply-To: <c2f644d4318b752714f80a453db90b97@news.novabbs.com>
Content-Language: en-US
 by: Robert Finch - Fri, 24 Nov 2023 04:37 UTC

On 2023-11-23 10:11 p.m., MitchAlsup wrote:
> Robert Finch wrote:
>
>> On 2023-11-23 6:30 p.m., MitchAlsup wrote:
>>> BGB wrote:
>>>
>>
>>>
>>> Whereas that special State is only a dozen register <with state>
>>> in My 66000--the rest being either memory resident or memory mapped.
>
>> My 68000 CPU core had a couple of task switching instructions added to
>> it. I made a dedicated task switch RAM wide enough to load or store
>> all the 68k registers in a single clock. Total task switch time was
>> about four clocks IIRC. The interrupt vector table was setup to be
>> able to automatically task switch on interrupt. The RAM had storage
>> for up to 512 tasks, but it was dedicated inside the CPU core rather
>> than storing task information in the memory system.
>
> This is headed in the right direction. Make context switching something
> easy to pull off.
>
>> Q+ has a 64 register file, so it would take eight or nine cache lines
>> to store the context. Q+ register file is 4w18r ATM. Getting from the
>> register file to or from a cache line is a challenge. To access groups
>> of eight registers at once would mean adding or using eight register
>> file ports. The register file has only four write ports so only ½ of a
>> cache line could be written to the file in a clock cycle. It is
>> appealing to handle multiple registers per clock. Read/write ports are
>> dedicated to specific function units, so making use of them for task
>> switching may involve additional logic. I called the CSR to store the
>> task state address the TS CSR.
>
> 4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
> Yet you chose 18. Why ?
> This is above and beyond the "typical" operand consumption of a RISC ISA.
> Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
> 12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
> (or forwarding). What are you using the other 5-operands for ??
>
>> As I understand it normally RISCV does not use multiple register files,
>
> RISC-V has a 32 entry GPR and a 32 entry FPR.
>
>> it has only a single file. There may be implementations out there that
>> do make use of multiple files, but I think the standard is setup to
>> get by with a single file.

I have 4w1r replicated 18 times. That is enough read ports to supply
three operands each to six functional units. All six functional units
may be scheduled at the same time. I have thought of trying to use fewer
read ports by prioritizing the ports as it is unlikely that all ports
would be needed at the same time. The current design is simple, but not
resource efficient. Six function units are ALU0, ALU1, FPU, FCU, LOAD,
STORE. The FCU really only needs two source operands.
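
As a quick check of the port count, here is the arithmetic in Python. The
dictionary layout and variable names are only for illustration; the unit
names and operand counts are the ones given above.

```python
# Source operands provisioned per function unit in the current design.
provisioned = {"ALU0": 3, "ALU1": 3, "FPU": 3, "FCU": 3, "LOAD": 3, "STORE": 3}

# Each read port is one 4w1r copy of the register file.
read_ports = sum(provisioned.values())

# If the FCU were given only the two source operands it really needs:
trimmed = dict(provisioned, FCU=2)
read_ports_trimmed = sum(trimmed.values())

print(read_ports, read_ports_trimmed)
```

Trimming the FCU saves a single copy, so any larger saving would have to
come from sharing ports between units, as suggested above.
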

There is no forwarding in the design (yet). I have read this costs about
10% in performance. I think this may be made up for by a smaller design
that can operate at a higher fmax. I have found in the past that
forwarding muxes appear on the critical timing path. I have seen another
design eliminating forwarding. It made the difference between operating
at 50 MHz or 60 MHz+. 20% gain in fmax. I think this may be an aspect of
an FPGA implementation.
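
A rough way to weigh the two numbers against each other, in Python. It
assumes the 10% figure is a loss in instructions per clock and that it can
simply be multiplied with the clock-rate gain; the two figures come from
different designs, so this is only a back-of-envelope estimate.

```python
ipc_factor = 0.90             # about 10% lost by not forwarding (as read)
fmax_with_forwarding = 50.0   # MHz, from the other design
fmax_without = 60.0           # MHz, from the other design

fmax_gain = fmax_without / fmax_with_forwarding
net_throughput = ipc_factor * fmax_gain   # relative to the forwarding design

print(round(fmax_gain, 2), round(net_throughput, 2))
```

On these numbers dropping forwarding comes out slightly ahead, but the
margin is small enough that a somewhat larger IPC loss would erase it.
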

Re: Concertina II Progress

<ujpgnm$27omg$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=35096&group=comp.arch#35096

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 00:44:04 -0600
Organization: A noiseless patient Spider
Lines: 823
Message-ID: <ujpgnm$27omg$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 06:44:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b92c6551fc723f135e33c10e8e50bbf7";
logging-data="2351824"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19N2OL/Xd6X0DdxRQV7rSDA"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/T62ICAoxeEvoi+v2VLms8YSvqQ=
In-Reply-To: <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
Content-Language: en-US
 by: BGB - Fri, 24 Nov 2023 06:44 UTC

On 11/23/2023 5:30 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/23/2023 10:53 AM, MitchAlsup wrote:
>>> BGB wrote:
>>>
>>>
>>>> If the "memcpy's" could be eliminated, this could roughly halve the
>>>> cost of doing a syscall.
>>>
>>> I have MM (memory move) as a 3-operand instruction.
>>>
>
>> None in my case...
>
>> But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
>> Still might be better to not do a memcpy in these cases.
>
>> Say, if the ISR handler could "merely" reassign the TBR register to
>> switch from one task to another to perform the context switch (still
>> ignoring all the loads/stores hidden in the prolog and epilog).
>
>
>>>> One other option would be to do like RISC-V's privileged spec and
>>>> have multiple copies of the register file (and likely instructions
>>>> for accessing these alternate register files).
>>>
>>> There is one CPU register file, and every running thread has an address
>>> where that file comes from and goes to--just like a block of 4 cache
>>> lines;
>>> There is a 5th cache line that contains all the other PSW stuff.
>>>
>
>> No direct equivalent.
>
>
>> I was thinking sort of like the RISC-V Privileged spec, there are
>> User/Supervisor/Machine sets, with the mode affecting which of these
>> is visible.
>
>> Obvious drawback in my case is that this would effectively increase
>> the number of internal GPRs from 64 to 192 (and, at that point, may as
>> well go to 4 copies and have 256).
>
>> If this were handled in the decoder, this would mean roughly a 9-bit
>> register selector field (vs the current 7 bits).
>
> Decode is not the problem, sensing 1:256 is a big problem, in practice
> even SRAMs only have 32-pairs of cells on a bit line using exotic timed
> sense amps.
> {{Decode is almost NEVER the logic delay problem:: ½ is situation
> recognition,
> the other ½ is fan-out buffering--driving the lines into the decoder is
> more
> gates of delay than determining if a given select line should be
> asserted.}}
>

I had noted that there is a noticeable LUT cost difference between 32
and 64 GPRs, which seems to go somewhat bigger than the difference
expected from going from 5b/3b LUTRAMs to 6b/2b LUTRAMs.

Like, adding a bit to the internal register ID fields (6b to 7b)
propagated cost across the whole pipeline.

The alternative would be to handle the register banking in the register
file, using the CPU mode to select between the possible register banks.

However, if still using LUTRAMs, the increase in register file size
would likely increase the number of LUTs by roughly 5x.

A theoretical estimate for the core number of LUTRAMs and "array support
LUTs":
   32 GPRs:  396
   64 GPRs:  576
   256 GPRs: 2880

This is ignoring the LUTs going into things like register forwarding, etc.

Based on past experience, I suspect the actual cost difference to be a
bit larger (given, say, the difference between a 32 GPR and 64 GPR
configuration is notably larger than 180 LUTs).
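
The ratios behind these estimates, worked out in Python. Only the three
LUT counts come from the estimate above; the variable names are made up,
and nothing here models how the counts were derived.

```python
# Estimated LUTRAMs plus array-support LUTs, keyed by GPR count.
estimate = {32: 396, 64: 576, 256: 2880}

delta_32_to_64 = estimate[64] - estimate[32]     # extra LUTs for 64 GPRs
growth_64_to_256 = estimate[256] / estimate[64]  # cost multiplier for 256
luts_per_gpr = {n: luts / n for n, luts in estimate.items()}

print(delta_32_to_64, growth_64_to_256, luts_per_gpr)
```

The 180-LUT step and the 5x multiplier are the two figures quoted in the
text; the per-register cost shows that the estimate does not scale
linearly with register count.
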

>> The increase in the number of CRs could be less, since only a few of
>> them actually need duplication.
>
>
>> But, don't want to go this way, and it would only be a partial
>> solution that also does not map up well to my current implementation.
>
>
>
>> Not sure how an OS on SH-4 would have managed all this, but I suspect
>> their interrupt model would have had similar limitations to mine.
>
>> Major differences:
>>    SH-4 banked out R0..R7 when entering an interrupt;
>>    The VBR relative entry-point offsets were a bit, ad-hoc.
>
>> There were some fairly arbitrary displacements based on the type of
>> interrupt. Almost like they designed their interrupt mechanism around
>> a particular chunk of ASM code or something. In my case, I kept a
>> similar idea, but just used a fixed 8-byte spacing, with the idea of
>> these spots branching to the actual entry point.
>
>> Though, one other difference is in my case I ended up adding a
>> dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
>> which would have gone to the FAULT handler instead.
>
>
>> It is in-theory possible to jump from Interrupt Mode to normal
>> Supervisor Mode without a full context switch,
>
> but why ?? the probability that control returns from a given IST to its
> softIRQ is less than ½ in a loaded system.
>

One might want to jump to save the cost of 2 context switches, but the
hair this would involve didn't seem worth it.

It would also result in a few other issues:
System calls would not be interruptible;
System calls could not reschedule the caller.
Effectively, this would hinder things like "usleep()" or "yield()".

Seemed better to go the route I did.

>>                                                 but the specifics of
>> doing so would get a bit more hairy and arcane (which is sort of why I
>> just sorta ended up using a context switch).
>
>> Not sure what Linux on SH-4 had done, didn't really investigate this
>> part of the code all that much at the time.
>
>
>> In theory, the ISR handlers could be made to mimic the x86 TSS
>> mechanism, but this wouldn't gain much.
>
> Stay away from anything you see in x86 except in using it as a moniker
> to avoid.
>

Yeah, not really losing much by not having a TSS...
But, Intel probably thought it was a good idea...

>> I think at one point, I had considered having tasks have both User and
>> Supervisor state (with two stacks and two copies of all the
>> registers), but ended up not going this way (and instead giving the
>> syscalls their designated own task context; which also saves on
>> per-task memory overhead).
>
>
>>>> Worth the cost? Dunno.
>>>
>>> In my opinion--Absolutely worth it.
>>>
>>>> Not too much different to modern Windows, where slow syscalls are
>>>> still fairly common (and despite the slowness of the mechanism, it
>>>> seems like BJX2 sycalls still manage to be around an order of
>>>> magnitude faster than Windows syscalls in terms of clock-cycle
>>>> cost...).
>>>
>>> Now, just get it down to a cache missing {L1, L2} instruction fetch.
>>>
>
>> Looked into it a little more, realized that "an order of magnitude"
>> may have actually been a little conservative; seems like Windows
>> syscalls may be more in the area of 50-100k cycles.
>
>> Why exactly? Dunno.
>
>
>> This is still ignoring some of the "slow cases" which may take
>> millions of clock cycles.
>
>> It also seems like fast-ish syscalls may be more of a Linux thing.
>
>
>>>>
>>>>> Why not just treat the RF as a cache with a known address in
>>>>> physical memory.
>>>>> In MY 66000 that is what I do and then just push and pull 4 cache
>>>>> lines at a
>>>>> time.
>>>>>
>>>
>>>> Possible, but poses its own share of problems...
>>>
>>>> Not sure how this could be implemented cost-effectively, or for that
>>>> matter, more cheaply than a RISC-V style mode-banked register-file.
>>>
>>> 1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
>>> of having 4 cache lines of state and 1 doubleword of address, you need
>>> 16 cache lines of state.
>>>
>
>> OK.
>
>
>> Having only 1 set of registers is good...
>
>> Issue is the mechanism for how to get all the contents in/out of the
>> register file, in a way that is both cost effective, and faster than
>> using a series of Load/Store instructions would have otherwise been.
>
> 6R6W RFs are as big as one can practically build. You can get as much
> Read BW by duplication, but you only have "so much" Write BW (even when
> you know each write is to a different register).
>
>> Short of a pipeline redesign, it is unlikely to exceed a best case of
>> around 128 bits per clock cycle, with (in practice) there typically
>> being other penalties due to things like L1 misses and similar.
>
> 6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
>


Re: Concertina II Progress

<ujq7t8$2b79a$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35106&group=comp.arch#35106

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 08:19:34 -0500
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <ujq7t8$2b79a$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 13:19:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a2817ef162912d8149f08d6bbc2139f2";
logging-data="2465066"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+e2DB7gvU2yJBpMfbyhaJvU0k3qheiUd8="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Cancel-Lock: sha1:iyRjI35jyauKnF0XFY1LLjlguYc=
In-Reply-To: <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
 by: Paul A. Clayton - Fri, 24 Nov 2023 13:19 UTC

On 11/23/23 6:30 PM, MitchAlsup wrote:
[snip]
> Stay away from anything you see in x86 except in using it as a moniker
> to avoid.

Even a stopped (12-hour) clock is right twice a day.

I hope you are not going to remove from My 66000 variable length
instruction encoding, hardware handling of (some for x86, XSAVE/
XRSTOR) context saving and restoring, or even Memory Move.

One could go even further and claim avoiding anything seen in x86
means not having registers (a storage region with simple, compact
addressing that an implementation will optimize as the common case
for operands — the Mill's Belt counts as "registers" in this sense
and even something like a transport-trigger architecture would
likely have storage for values with temporal locality coarser than
immediate use but frequent enough to justify simpler and more
compact addressing).

Yes, x86 messes up even these aspects. VLE does not have to be
byte granular or use multiple prefixes in variable order. Hardware
context save/restore does not have to be limited to extended
state. A memory move instruction does not *need* to have a variant
for each possible/likely chunk size or be implemented as
substantially less performant than a software implementation, even
with compile-time known size and alignment. Registers do not have
to be limited to 8 or be accessed in sub-units.

(Sub-unit access has some attraction to me for more efficiently
using a limited storage space while still trying to keep access
simple by limiting variability of shifting and complexity of
partial write ordering, but less efficient storage use can easily
be better than complexity of accessing the fastest and most
commonly accessed storage. More recent ISAs have implemented
partial register accesses. IBM ZArch, S/360 descendant, split GPRs
into high and low halves to increase the number of values
available in the nominally 16 GPRs. AArch64 has 32-bit compute
operations motivated, I think, for power saving, which do not
increase the number of values and so avoids the shift and partial-
write problems.)

I suspect you could write a multi-volume treatise on x86 about
hardware-software interface design and management (including the
social and economic considerations of project/product management).
Ignoring human factors, including those outside the organization
owning the interface, seems attractive to a certain engineering
mindset but human factors are significant design considerations.

[Yet once more stating what is obvious, especially to one skilled
in the art.]

Re: Concertina II Progress

<ujqcqc$2btrh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35107&group=comp.arch#35107

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: robfi...@gmail.com (Robert Finch)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 09:43:23 -0500
Organization: A noiseless patient Spider
Lines: 68
Message-ID: <ujqcqc$2btrh$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me>
<uij9lt$3054t$1@newsreader4.netcologne.de> <uijjcd$2d9sp$1@dont-email.me>
<uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me>
<uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad>
<uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
<ujq7t8$2b79a$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 24 Nov 2023 14:43:24 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c747113e7ff105f906852486f1fad6fc";
logging-data="2488177"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18F9RULYi/eFWrTUTzvmJuPT9yyl/gEOYk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:P95vLPRYIjwnxMEoC3CKPb9x/QI=
In-Reply-To: <ujq7t8$2b79a$1@dont-email.me>
Content-Language: en-US
 by: Robert Finch - Fri, 24 Nov 2023 14:43 UTC

On 2023-11-24 8:19 a.m., Paul A. Clayton wrote:
> On 11/23/23 6:30 PM, MitchAlsup wrote:
> [snip]
>> Stay away from anything you see in x86 except in using it as a moniker
>> to avoid.
>
> Even a stopped (12-hour) clock is right twice a day.
>
> I hope you are not going to remove from My 66000 variable length
> instruction encoding, hardware handling of (some for x86, XSAVE/
> XRSTOR) context saving and restoring, or even Memory Move.
>
> One could go even further and claim avoiding anything seen in x86
> means not having registers (a storage region with simple, compact
> addressing that an implementation will optimize as the common case
> for operands — the Mill's Belt counts as "registers" in this sense
> and even something like a transport-trigger architecture would
> likely have storage for values with temporal locality coarser than
> immediate use but frequent enough to justify simpler and more
> compact addressing).
>
> Yes, x86 messes up even these aspects. VLE does not have to be
> byte granular or use multiple prefixes in variable order. Hardware
> context save/restore does not have to be limited to extended
> state. A memory move instruction does not *need* to have a variant
> for each possible/likely chunk size or be implemented as
> substantially less performant than a software implementation, even
> with compile-time known size and alignment. Registers do not have
> to be limited to 8 or be accessed in sub-units.
>
> (Sub-unit access has some attraction to me for more efficiently
> using a limited storage space while still trying to keep access
> simple by limiting variability of shifting and complexity of
> partial write ordering, but less efficient storage use can easily
> be better than complexity of accessing the fastest and most
> commonly accessed storage. More recent ISAs have implemented
> partial register accesses. IBM ZArch, S/360 descendant, split GPRs
> into high and low halves to increase the number of values
> available in the nominally 16 GPRs. AArch64 has 32-bit compute
> operations motivated, I think, for power saving, which do not
> increase the number of values and so avoids the shift and partial-
> write problems.)
>
> I suspect you could write a multi-volume treatise on x86 about
> hardware-software interface design and management (including the
> social and economic considerations of project/product management).
> Ignoring human factors, including those outside the organization
> owning the interface, seems attractive to a certain engineering
> mindset but human factors are significant design considerations.
>
> [Yet once more stating what is obvious, especially to one skilled
> in the art.]

There is a lot of value in having a unique architecture. The x86 has had
a lot of things bolted on to it. It has adapted over time. Being able to
see how things have changed is valuable. I suspect just about any
architecture adapted over a 40- or 50-year period would look not so
appealing. I happen to like the segmented approach, not necessarily
because it is a good way to do things, but it was certainly interesting
and challenging. An interesting, challenging, and somewhat mysterious
architecture may be more appealing than the best organized, most
performant, energy efficient one. There is a trade-off between ‘the
best’ and the ‘human factor’. I can imagine that there might be treaties
limiting computer performance somewhere. Just how fast of a CPU is legal?

Re: Concertina II Progress

<ac183cbca2ecffb9b4a3b09be16a697f@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35108&group=comp.arch#35108

Date: Fri, 24 Nov 2023 18:24:00 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
X-Spam-Level: *
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$GOeFN6o7pR1oVlysjb/rK.RwcuVBERU3sGBAtfA2zyMnq8UAJ0L2O
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me> <uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me> <4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me> <uilvki$2vjld$1@dont-email.me> <74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <ujmi69$1n5ko$1@dont-email.me> <b0e6671f87b1d447b23372180d119e2e@news.novabbs.com> <ujohle$209gb$1@dont-email.me> <835e3e7fe735cae6ea0206af6077615a@news.novabbs.com> <ujq7t8$2b79a$1@dont-email.me>
Organization: novaBBS
Message-ID: <ac183cbca2ecffb9b4a3b09be16a697f@news.novabbs.com>
 by: MitchAlsup - Fri, 24 Nov 2023 18:24 UTC

Paul A. Clayton wrote:

> On 11/23/23 6:30 PM, MitchAlsup wrote:
> [snip]
>> Stay away from anything you see in x86 except in using it as a moniker
>> to avoid.

> Even a stopped (12-hour) clock is right twice a day.

> I hope you are not going to remove from My 66000 variable length
> instruction encoding, hardware handling of (some for x86, XSAVE/
> XRSTOR) context saving and restoring, or even Memory Move.

It is these things which allow my architecture to only need 70%
of the instructions RISC-V needs.

> One could go even further and claim avoiding anything seen in x86
> means not having registers (a storage region with simple, compact
> addressing that an implementation will optimize as the common case
> for operands — the Mill's Belt counts as "registers" in this sense
> and even something like a transport-trigger architecture would
> likely have storage for values with temporal locality coarser than
> immediate use but frequent enough to justify simpler and more
> compact addressing).

Having 1 flat set of registers (any register can hold any result or
operand) is a My 66000 requirement. The only things I took from x86-64
are the [base+index<<scale+displacement] memory addressing model and
the 2-level MMU; even here I used the I/O MMU version rather than
the processor version.

> Yes, x86 messes up even these aspects.
> VLE does not have to be byte granular or use multiple prefixes in variable order.
VLE does not need prefixes of any kind.
> Hardware context save/restore does not have to be limited to extended state.
HW S/R is most useful when it deals with ALL the state.
> A memory move instruction does not *need* to have a variant
> for each possible/likely chunk size or be implemented as
> substantially less performant than a software implementation,
One can synthesize SIMD and Vector saving 90% of the OpCode space
> even with compile-time known size and alignment. Registers do not have
> to be limited to 8 or be accessed in sub-units.

> (Sub-unit access has some attraction to me for more efficiently
> using a limited storage space while still trying to keep access
> simple by limiting variability of shifting and complexity of
> partial write ordering, but less efficient storage use can easily
> be better than complexity of accessing the fastest and most
> commonly accessed storage. More recent ISAs have implemented
> partial register accesses. IBM ZArch, S/360 descendant, split GPRs
> into high and low halves to increase the number of values
> available in the nominally 16 GPRs. AArch64 has 32-bit compute
> operations motivated, I think, for power saving, which do not
> increase the number of values and so avoids the shift and partial-
> write problems.)

I suspect this came out of already having to implement HW for
IC (insert Character) instruction from System 360 time.

> I suspect you could write a multi-volume treatise on x86 about
> hardware-software interface design and management (including the
> social and economic considerations of project/product management).
> Ignoring human factors, including those outside the organization
> owning the interface, seems attractive to a certain engineering
> mindset but human factors are significant design considerations.

It would be more beneficial to the world just to build an architecture
without any of those flaws--just to show them how it's done.

> [Yet once more stating what is obvious, especially to one skilled
> in the art.]

Captain Obvious would be proud.

Re: Concertina II Progress

<jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=35111&group=comp.arch#35111

Path: i2pn2.org!i2pn.org!news.furie.org.uk!usenet.goja.nl.eu.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 13:41:26 -0500
Organization: A noiseless patient Spider
Lines: 6
Message-ID: <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org>
References: <uigus7$1pteb$1@dont-email.me> <uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<jwvedghqxha.fsf-monnier+comp.arch@gnu.org>
<6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com>
<OSO7N.18767$_z86.2860@fx46.iad>
<e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="923a7c58556f3548ff39770877d6a143";
logging-data="2563318"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+xjcfbBiZPJ/hO4q0QWp/QJ2F2rH1ZqA="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:IZSFzESfzjQ5V6la6FdOzfQSAHM=
sha1:KnRchgKsV7Ylj8v9dSPMQoapa0g=
 by: Stefan Monnier - Fri, 24 Nov 2023 18:41 UTC

> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

Stefan

Re: Concertina II Progress

<2f9e4efc17534de4a061be88aff92a99@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35113&group=comp.arch#35113

Date: Fri, 24 Nov 2023 22:21:53 +0000
Subject: Re: Concertina II Progress
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on novalink.us
From: mitchal...@aol.com (MitchAlsup)
Newsgroups: comp.arch
X-Rslight-Site: $2y$10$0XYsJfPhVYjihQIM/znP2eANP/NMx5WUvsq1OO33fACRYVfbfwgm2
X-Rslight-Posting-User: 7e9c45bcd6d4757c5904fbe9a694742e6f8aa949
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
References: <uigus7$1pteb$1@dont-email.me> <uj3380$1rnvb$1@dont-email.me> <5412afba176e6044e28a72965f13ac4a@news.novabbs.com> <uj37t1$1sgg4$1@dont-email.me> <063885f383205c854c2387dcea32ba7a@news.novabbs.com> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org>
Organization: novaBBS
Message-ID: <2f9e4efc17534de4a061be88aff92a99@news.novabbs.com>
 by: MitchAlsup - Fri, 24 Nov 2023 22:21 UTC

Stefan Monnier wrote:

>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}

> Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

This seems to mimic the RISC-V set of levels, but done/named differently.

The Guest OS and Guest HV levels are done in such a way that you can have a
stack of Guest OSs of any depth and a stack of Guest HVs of any depth: although
the HW only supports 4 levels, HW with SW intervention supports any number of
levels.

In particular: Guest OS manages faults from Application, Guest HV manages
faults from Guest OS {Which makes it possible to recover from page faults
in the "sticky" places of interrupt and exception handling}, Real HV
manages faults from Guest HV.

Application accesses only 63-bits of virtual address space. If application
makes an access with the HOB of the virtual address set, the access takes
a fault.

Guest OS can reach down into Application by accessing with the HOB clear (0)
or access its own VAS with the HOB set (1).

Guest HV can reach down into Guest OS by accessing with the HOB set (1)
or access its own VAS with the HOB clear (0).

Real HV can reach down into Guest HV by accessing with the HOB clear (0)
or access its own VAS with the HOB set (1).
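
The HOB rules above can be summarized as a small lookup. This is only a sketch of the selection described in this post: the `Level` numbering, the `target_space` name, and the use of `None` to stand for a fault are assumptions made for illustration, not part of the My 66000 definition.

```python
from enum import IntEnum

class Level(IntEnum):
    # Numeric values are arbitrary labels chosen for this sketch.
    APPLICATION = 0
    GUEST_OS = 1
    GUEST_HV = 2
    REAL_HV = 3

HOB_BIT = 1 << 63  # high-order bit of a 64-bit virtual address

# (privilege level, HOB value) -> address space accessed; None means fault.
_TARGET = {
    (Level.APPLICATION, 0): Level.APPLICATION,
    (Level.APPLICATION, 1): None,              # HOB set faults
    (Level.GUEST_OS, 0): Level.APPLICATION,    # reach down
    (Level.GUEST_OS, 1): Level.GUEST_OS,       # own VAS
    (Level.GUEST_HV, 1): Level.GUEST_OS,       # reach down
    (Level.GUEST_HV, 0): Level.GUEST_HV,       # own VAS
    (Level.REAL_HV, 0): Level.GUEST_HV,        # reach down
    (Level.REAL_HV, 1): Level.REAL_HV,         # own VAS
}

def target_space(level, vaddr):
    """Return the address space an access lands in, or None for a fault."""
    hob = (vaddr >> 63) & 1
    return _TARGET[(level, hob)]

print(target_space(Level.GUEST_OS, 0x1000))            # Level.APPLICATION
print(target_space(Level.APPLICATION, HOB_BIT | 0x10)) # None (fault)
```

Note that the polarity of the HOB alternates from level to level, so each level sees exactly two spaces: its own and the one directly below it.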

Assuming we are running with an HV:
Application accesses use 2-level paging through Application Mapping Tables.
Guest OS accesses to Application use 2-level paging through Application Mapping
Tables, and accesses to Guest OS use 2-level paging through Guest OS Tables.
Guest HV accesses to Guest OS use 2-level paging through Guest OS Tables,
and accesses to Guest HV use 1-level paging through Guest HV Tables.
Real HV accesses to Guest HV use 1-level paging through Guest HV Tables,
and accesses to Real HV use 1-level paging through Real HV Tables.
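
The same list, restated as a table keyed by (accessor, target). This is only a sketch: the `PAGING` and `paging_for` names are hypothetical, and the entries simply restate the list above for the case where an HV is running.

```python
# (accessing level, space being accessed) -> (paging levels, tables walked)
PAGING = {
    ("Application", "Application"): (2, "Application Mapping Tables"),
    ("Guest OS", "Application"): (2, "Application Mapping Tables"),
    ("Guest OS", "Guest OS"): (2, "Guest OS Tables"),
    ("Guest HV", "Guest OS"): (2, "Guest OS Tables"),
    ("Guest HV", "Guest HV"): (1, "Guest HV Tables"),
    ("Real HV", "Guest HV"): (1, "Guest HV Tables"),
    ("Real HV", "Real HV"): (1, "Real HV Tables"),
}

def paging_for(accessor, target):
    """Look up how an access is translated; KeyError if not a listed pair."""
    return PAGING[(accessor, target)]

levels, tables = paging_for("Guest HV", "Guest OS")
print(levels, tables)  # 2 Guest OS Tables
```

Read this way, the number of paging levels depends on the space being accessed rather than on who is accessing it: Application and Guest OS spaces get 2-level translation, while Guest HV and Real HV spaces get 1-level translation.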

--------------

When a 2-level Mapping creates an UnCacheable, MMI/O, ROM or config space
access, this intermediate addressed space determines the memory order. So,
Guest OS can make a process address sequentially consistent by making all
the PTEs use MMI/O space accesses. The second level of translation will,
then, translate that access back to <say> cacheable DRAM to be performed.
Likewise, should the second level of translation produce an access other
than cacheable DRAM, memory order is determined by the stronger method
of both translations.
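
One way to picture the "stronger method of both translations" rule is as a maximum over an ordered set of space types. The sketch below assumes a particular strength ranking (cacheable DRAM weakest, MMI/O strongest) and hypothetical names; the post only states that MMI/O accesses yield sequential consistency and that the stronger of the two translations wins.

```python
from enum import IntEnum

class Space(IntEnum):
    # Higher value = stronger memory ordering. This ranking is an
    # assumption for illustration, not taken from the post.
    CACHEABLE_DRAM = 0
    UNCACHEABLE = 1
    MMIO = 2

def effective_order(first_level, second_level):
    """Memory order of an access after two levels of translation:
    the stronger of the two space types decides."""
    return max(first_level, second_level)

# Guest OS maps a page as MMI/O; the second level maps it back to DRAM.
print(effective_order(Space.MMIO, Space.CACHEABLE_DRAM).name)  # MMIO
# The second level alone produces a stronger space than the first.
print(effective_order(Space.CACHEABLE_DRAM, Space.UNCACHEABLE).name)  # UNCACHEABLE
```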

Re: Concertina II Progress

<0Pa8N.2233$PJoc.1323@fx04.iad>

https://www.novabbs.com/devel/article-flat.php?id=35118&group=comp.arch#35118

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: sco...@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Concertina II Progress
Newsgroups: comp.arch
References: <uigus7$1pteb$1@dont-email.me> <ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me> <57b4666649236a3e79cd04773a76f7ee@news.novabbs.com> <ujjt01$16pav$1@dont-email.me> <41d9a7b20ac6da242578c6a53758f625@news.novabbs.com> <jwvedghqxha.fsf-monnier+comp.arch@gnu.org> <6e757154a4c6aa8e070975c534c0fda8@news.novabbs.com> <OSO7N.18767$_z86.2860@fx46.iad> <e95ce095c8dccb161be2b4fad0697aea@news.novabbs.com> <jwvwmu7ouzf.fsf-monnier+comp.arch@gnu.org>
Lines: 13
Message-ID: <0Pa8N.2233$PJoc.1323@fx04.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Sat, 25 Nov 2023 00:01:00 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Sat, 25 Nov 2023 00:01:00 GMT
X-Received-Bytes: 1589
 by: Scott Lurndal - Sat, 25 Nov 2023 00:01 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV}
>
>Hmm... why 4 rather than 1, 2, 3, 5, 42, 2^32, ... ?

Would require priority decoders to differentiate rather
than simple gates, probably.

Although I wonder at the missing firmware privilege level, a la SMM or EL3.

ARM added support for nested hypervisors without adding a
new exception level. Although interesting, there isn't much
evidence of it being used in production. Yet anyway.

Re: Concertina II Progress

<ujrnco$2lqen$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35126&group=comp.arch#35126

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Concertina II Progress
Date: Fri, 24 Nov 2023 20:49:57 -0600
Organization: A noiseless patient Spider
Lines: 150
Message-ID: <ujrnco$2lqen$1@dont-email.me>
References: <uigus7$1pteb$1@dont-email.me> <uijk93$2dc2i$2@dont-email.me>
<uijr5g$2ep8o$1@dont-email.me> <uikc1s$2lh5f$2@dont-email.me>
<4sr3N.17406$AqO5.3263@fx11.iad> <uilskk$2v1d2$1@dont-email.me>
<uilvki$2vjld$1@dont-email.me>
<74fd95a7bc98b42a4c1c8517ab7cdac8@news.novabbs.com>
<uj3380$1rnvb$1@dont-email.me>
<5412afba176e6044e28a72965f13ac4a@news.novabbs.com>
<uj37t1$1sgg4$1@dont-email.me>
<063885f383205c854c2387dcea32ba7a@news.novabbs.com>
<ujg54v$c6r4$1@dont-email.me> <ujgrel$h32p$1@dont-email.me>
<57b4666649236a3e79cd04773a76f7ee@news.novabbs.com>
<ujjt01$16pav$1@dont-email.me>
<41d9a7b20ac6da242578c6a53758f625@news.novabbs.com>
<ujmi69$1n5ko$1@dont-email.me>
<b0e6671f87b1d447b23372180d119e2e@news.novabbs.com>
<ujohle$209gb$1@dont-email.me>
<835e3e7fe735cae6ea0206af6077615a@news.novabbs.com>
<ujq7t8$2b79a$1@dont-email.me>
<ac183cbca2ecffb9b4a3b09be16a697f@news.novabbs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 25 Nov 2023 02:50:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7453bc91ed922c1bfb3262e310d4156c";
logging-data="2812375"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+94CgdaTocvY4sZabvQUw0"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:yNHlD+Ch5PjNMZua7B8IkKgvYhI=
Content-Language: en-US
In-Reply-To: <ac183cbca2ecffb9b4a3b09be16a697f@news.novabbs.com>
 by: BGB - Sat, 25 Nov 2023 02:49 UTC

On 11/24/2023 12:24 PM, MitchAlsup wrote:
> Paul A. Clayton wrote:
>
>> On 11/23/23 6:30 PM, MitchAlsup wrote:
>> [snip]
>>> Stay away from anything you see in x86 except in using it as a moniker
>>> to avoid.
>
>> Even a stopped (12-hour) clock is right twice a day.
>
>> I hope you are not going to remove from My 66000 variable length
>> instruction encoding, hardware handling of (some for x86, XSAVE/
>> XRSTOR) context saving and restoring, or even Memory Move.
>
> It is these things which allow my architecture to only need 70%
> of the instructions RISC-V needs.
>

In some of my tests, the total number of executed instructions tends to
be less than RISC-V as well.

Best I can tell, the main things that save instructions are mostly:
Register-indexed load/store (~ 30% of Ld/St);
MOV.X (~ 12% of Ld/St);
Jumbo prefixes (~ 6%).

Though, apparently, someone posted something recently showing RV64 and
ARM64 to be much closer than expected, which is curious. The main
instructions that seem to have "the most bang for the buck" are ones
that ARM64 has equivalents of.

More testing may be needed.

In other news:
Did add the compiler support to eliminate the "memcpy()" step from the
task-switching (by having the prolog/epilog saving/restoring registers
directly from the task context).

Should roughly halve syscall overhead, along with shaving 480 bytes off
the stack frame (some of the GPRs and CRs still need to be shuffled via
the stack, so it only saves 480 bytes rather than 640).

>> One could go even further and claim avoiding anything seen in x86
>> means not having registers (a storage region with simple, compact
>> addressing that an implementation will optimize as the common case
>> for operands — the Mill's Belt counts as "registers" in this sense
>> and even something like a transport-trigger architecture would
>> likely have storage for values with temporal locality coarser than
>> immediate use but frequent enough to justify simpler and more
>> compact addressing).
>
> Having 1 flat set of registers (any register can hold any result or
> operand) is a My 66000 requirement. The only things I took from x86-64
> are the [base+index<<scale+displacement] memory addressing model and
> the 2-level MMU; even here I used the I/O MMU version rather than the
> processor version.
>

Yeah. Flat registers are good.

Internally, [base+index*scale] or [base+disp*scale] can do "most of it"...

Having [base+index*scale+disp] could do a little more, but seems to be
somewhat rarer. I had experimented with such an encoding, but it didn't
seem like it saw enough use-cases to justify the cost of its existence.

Granted, it might be more useful if it could be encoded like:
JumboImm+JumboOp+LdSt
With a 33-bit displacement (rather than a 9/11-bit displacement), as
this could potentially allow using it to address global arrays.

IOW, potentially allowing:
FEdd-dddd-FFw0-0Vdd-F0nm-0eoZ MOV.x (Rm, Ro*Sc, Disp33s), Rn
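
For reference, a minimal sketch of the address calculation such a form would imply. It assumes the displacement is a signed 33-bit value (reading the "s" in Disp33s as "signed"), that the scale applies only to the index, and that addresses wrap at 64 bits; the function names are made up for illustration and are not part of the ISA.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def effective_address(base, index, scale, disp, disp_bits=33):
    """[base + index*scale + disp], wrapped to 64 bits."""
    return (base + index * scale + sign_extend(disp, disp_bits)) & MASK64

# Element 3 of an array of 8-byte items located 0x10 bytes past the base:
print(hex(effective_address(0x1000, 3, 8, 0x10)))  # 0x1028
# An all-ones 33-bit field is a displacement of -1:
print(sign_extend(0x1FFFFFFFF, 33))                # -1
```

With a signed 33-bit displacement the reachable range is +/- 4 GiB around the base, which is what would make the form usable for addressing global arrays, where a 9- or 11-bit field is not.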

>> Yes, x86 messes up even these aspects. VLE does not have to be byte
>> granular or use multiple prefixes in variable order.
> VLE does not need prefixes of any kind.
>> Hardware context save/restore does not have to be limited to extended
>> state.
> HW S/R is most useful when it deals with ALL the state.
>> A memory move instruction does not *need* to have a variant
>> for each possible/likely chunk size or be implemented as
>> substantially less performant than a software implementation,
> One can synthesize SIMD and Vector saving 90% of the OpCode space
>> even with compile-time known size and alignment. Registers do not have
>> to be limited to 8 or be accessed in sub-units.
>
>> (Sub-unit access has some attraction to me for more efficiently
>> using a limited storage space while still trying to keep access
>> simple by limiting variability of shifting and complexity of
>> partial write ordering, but less efficient storage use can easily
>> be better than complexity of accessing the fastest and most
>> commonly accessed storage. More recent ISAs have implemented
>> partial register accesses. IBM ZArch, an S/360 descendant, split GPRs
>> into high and low halves to increase the number of values
>> available in the nominally 16 GPRs. AArch64 has 32-bit compute
>> operations motivated, I think, by power saving, which do not
>> increase the number of values and so avoid the shift and partial-
>> write problems.)
>
> I suspect this came out of already having to implement HW for the IC
> (Insert Character) instruction from System/360 times.

Seems sane.
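
The partial-write distinction in the quoted text can be sketched as below. This is only a model of 64-bit register values as Python integers; the function names are invented for illustration. The zero-extending case matches how AArch64 32-bit operations clear the upper half of the destination, while the merging cases stand in for any scheme where the halves hold separate values.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def write_low32_zero_extend(old, value):
    """32-bit write that clears the upper half; `old` is never read."""
    return value & MASK32


def write_low32_merge(old, value):
    """32-bit write to the low half that preserves the old upper half."""
    return (old & MASK64 & ~MASK32) | (value & MASK32)


def write_high32_merge(old, value):
    """32-bit write to the high half: needs a shift plus the old low half."""
    return (old & MASK32) | ((value & MASK32) << 32)


old = 0x1111_2222_3333_4444
new = 0xAAAA_BBBB

print(hex(write_low32_zero_extend(old, new)))
print(hex(write_low32_merge(old, new)))
print(hex(write_high32_merge(old, new)))
```

The merging variants depend on the old register contents (and the high-half one also needs a shift), which is the ordering and shifting cost mentioned above; the zero-extending variant has neither, but also gains no extra storage.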

>> I suspect you could write a multi-volume treatise on x86 about
>> hardware-software interface design and management (including the
>> social and economic considerations of project/product management).
>> Ignoring human factors, including those outside the organization
>> owning the interface, seems attractive to a certain engineering
>> mindset but human factors are significant design considerations.
>
> It would be more beneficial to the world just to build an architecture
> without any of those flaws--just to show them how it's done.
>

People can probably debate what is ideal.

There seem to be people around who see RISC-V as the model of perfection.

I disagree: some things seem to cut corners in areas where doing so is a
foot gun, other areas are needlessly expensive (and some things in the
reaches of "extensions land" are just kinda absurd).

In some ways, it is (as I see it) better to define some things and leave
them as optional, rather than define little, and leave everyone else to
make an incoherent mess of things.

Then again, there are likely disagreements as to which sorts of features
are meaningful, wasteful, or needlessly extravagant.

Granted, it does seem like x86 probably needs to be retired at some point...

>> [Yet once more stating what is obvious, especially to one skilled
>> in the art.]
>
> Captain Obvious would be proud.

