Rocksolid Light (novaBBS)

devel / comp.arch / Why My 66000 is and is not RISC

Subject -- Author
* Why My 66000 is and is not RISC -- MitchAlsup
+* Re: Why My 66000 is and is not RISC -- Terje Mathisen
|`* Re: Why My 66000 is and is not RISC -- Marcus
| `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  +* Re: Why My 66000 is and is not RISC -- Brett
|  |+* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||`* Re: Why My 66000 is and is not RISC -- Brett
|  || `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||  +* Re: Why My 66000 is and is not RISC -- Brett
|  ||  |+* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||  ||`* Re: Why My 66000 is and is not RISC -- BGB
|  ||  || +* Re: Why My 66000 is and is not RISC -- Ivan Godard
|  ||  || |`* Re: Why My 66000 is and is not RISC -- Thomas Koenig
|  ||  || | `- Re: Why My 66000 is and is not RISC -- Ivan Godard
|  ||  || `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||  ||  +* Re: Why My 66000 is and is not RISC -- Ivan Godard
|  ||  ||  |`* Re: Why My 66000 is and is not RISC -- Terje Mathisen
|  ||  ||  | `- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||  ||  +* Re: Why My 66000 is and is not RISC -- BGB
|  ||  ||  |`* Code Density Deltas (Re: Why My 66000 is and is not RISC) -- BGB
|  ||  ||  | `* Re: Code Density Deltas (Re: Why My 66000 is and is not RISC) -- MitchAlsup
|  ||  ||  |  `* Re: Code Density Deltas (Re: Why My 66000 is and is not RISC) -- BGB
|  ||  ||  |   `- Re: Code Density Deltas (Re: Why My 66000 is and is not RISC) -- MitchAlsup
|  ||  ||  `* Re: Why My 66000 is and is not RISC -- Thomas Koenig
|  ||  ||   +- Re: Why My 66000 is and is not RISC -- BGB
|  ||  ||   `- Re: Why My 66000 is and is not RISC -- EricP
|  ||  |`- Re: Why My 66000 is and is not RISC -- BGB
|  ||  `* Re: Why My 66000 is and is not RISC -- Brett
|  ||   `* Re: Why My 66000 is and is not RISC -- Brett
|  ||    +* Re: Why My 66000 is and is not RISC -- Stephen Fuld
|  ||    |`- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||    `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||     `* Re: Why My 66000 is and is not RISC -- Brett
|  ||      +* Re: Why My 66000 is and is not RISC -- Brett
|  ||      |`* Re: Why My 66000 is and is not RISC -- Brett
|  ||      | `- Re: Why My 66000 is and is not RISC -- Brett
|  ||      `* Re: Why My 66000 is and is not RISC -- Stephen Fuld
|  ||       +* Re: Why My 66000 is and is not RISC -- BGB
|  ||       |`* Re: Why My 66000 is and is not RISC -- Stefan Monnier
|  ||       | +- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||       | `* Re: Why My 66000 is and is not RISC -- BGB
|  ||       |  `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||       |   `- Re: Why My 66000 is and is not RISC -- BGB
|  ||       `* Re: Why My 66000 is and is not RISC -- luke.l...@gmail.com
|  ||        +- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||        `* Re: Why My 66000 is and is not RISC -- BGB
|  ||         `* Re: Why My 66000 is and is not RISC -- Agner Fog
|  ||          +* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||          |`* Re: Why My 66000 is and is not RISC -- luke.l...@gmail.com
|  ||          | `- Re: Why My 66000 is and is not RISC -- BGB
|  ||          `- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  |+* Re: Why My 66000 is and is not RISC -- Timothy McCaffrey
|  ||+* Re: Why My 66000 is and is not RISC -- John Dallman
|  |||`- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||+* Re: Why My 66000 is and is not RISC -- BGB
|  |||`* Re: Why My 66000 is and is not RISC -- Timothy McCaffrey
|  ||| +* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||| |`* Re: Why My 66000 is and is not RISC -- BGB
|  ||| | `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  ||| |  +- Re: Why My 66000 is and is not RISC -- Ivan Godard
|  ||| |  `- Re: Why My 66000 is and is not RISC -- BGB
|  ||| `- Re: Why My 66000 is and is not RISC -- Ivan Godard
|  ||`- Re: Why My 66000 is and is not RISC -- MitchAlsup
|  |`* Re: Why My 66000 is and is not RISC -- Thomas Koenig
|  | +- Re: Why My 66000 is and is not RISC -- BGB
|  | `* Re: Why My 66000 is and is not RISC -- Brett
|  |  `* Re: Why My 66000 is and is not RISC -- MitchAlsup
|  |   `* Re: Why My 66000 is and is not RISC -- David Brown
|  |    `* Re: Why My 66000 is and is not RISC -- BGB
|  |     `* Re: Why My 66000 is and is not RISC -- David Brown
|  |      `* Re: Why My 66000 is and is not RISC -- BGB
|  |       `* Re: Why My 66000 is and is not RISC -- David Brown
|  |        `- Re: Why My 66000 is and is not RISC -- BGB
|  `* Re: Why My 66000 is and is not RISC -- Thomas Koenig
|   `- Re: Why My 66000 is and is not RISC -- MitchAlsup
`* Re: Why My 66000 is and is not RISC -- BGB
 `* Re: Why My 66000 is and is not RISC -- MitchAlsup
  `- Re: Why My 66000 is and is not RISC -- BGB

Why My 66000 is and is not RISC

<f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26032&group=comp.arch#26032

Newsgroups: comp.arch
Date: Wed, 22 Jun 2022 18:03:21 -0700 (PDT)
Message-ID: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
Subject: Why My 66000 is and is not RISC
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Thu, 23 Jun 2022 01:03 UTC

I could not find the question asking me to make a list of why My 66000
instruction set architecture is like and unlike the tenets of the original
RISC. So I spent some time looking up what the internet is currently saying
about RISCs. There is a short list, but I will start with a few statements
from Hennessey and Patterson::

Hennessey:: The goal of any instruction format should be: 1. simple decode,
2. simple decode, and 3. simple decode. Any attempts at improved code
density at the expense of CPU performance should be ridiculed at every
opportunity.

Patterson:: more is not better -- microcode is bad
Subroutines need low overhead

RISC axioms:
a) the ISA is primarily designed to make the pipeline simple.
b) the ISA is primarily designed as a target for compilers.
c) instructions only exist if they add performance.
d) frequently accessed data is kept in registers.

RISC tenets:
a) 1 word == 1 instruction
b) 1 instruction flows down the pipeline in 1 cycle
c) 1 instruction can cause 0 or 1 exception
d) instruction encoding uses few patterns
e) there is a large uniformly addressable register space

So where does My 66000 ISA stand with respect to these axioms and
tenets::

RISC axioms: My 66000 ISA embodies all of the RISC axioms
RISC tenets: My 66000 ISA rejects ½ of RISC tenets

With minor exceptions to both::

My 66000 contains 32×64-bit general purpose registers. Some might
think this is too few and a FP register file should be added. Looking
at code such as BLAS, Livermore Loops, and Linpack indicates otherwise
-- as long as one assumes some hints of OoO pipelining. Looking at
various C libraries this seems perfectly sufficient.

My 66000 ISA contains 6 decoding patterns; 1 for each of
{instructions with 16-bit immediates, instructions with 12-bit
immediates, scaled memory reference, 2-operand reg-reg,
1-operand reg-reg, 3-operand reg-reg }

The 12-bit immediate format is used for shift instructions and
for Predicate instructions, and is positioned such that predicate
instructions are only 1 bit different from their corresponding
branch instructions. This saves 6×16-bit immediate encodings.

Scaled memory reference, 1-operand, 2-operand, 3-operand
all have access to 32-bit or 64-bit immediates/displacements
in substitution for a register. This eliminates any need to use
instructions or waste registers pasting constants together.

1-operand, 2-operand, 3-operand instructions all have sign control
over their operands. There is no SUB instruction; My 66000 uses
ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
instructions from execution. The 2-operand group allows the
5-bit register specifier to be used as a 6-bit sign extended
immediate, making ADD Rd,#1,-Rs2 easily encoded.

There are Compare instructions that return a bit-vector of everything
the compare circuitry can determine, including range checks like:
0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
+zero, +denormal, +normal, +infinity, SNaN, QNaN}. I remain tempted
to add "any byte equal", "any halfword equal", "any word equal".
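
For illustration only -- the bit assignments below are invented, not My 66000's
actual encoding -- the classification half of such a Compare, plus the range
check, can be sketched in portable C:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical bit positions; the real ISA's layout is not given here. */
enum {
    CMP_NINF = 1 << 0, CMP_NNORM = 1 << 1, CMP_NDENORM = 1 << 2, CMP_NZERO = 1 << 3,
    CMP_PZERO = 1 << 4, CMP_PDENORM = 1 << 5, CMP_PNORM = 1 << 6, CMP_PINF = 1 << 7,
    CMP_NAN = 1 << 8   /* portable C cannot tell SNaN from QNaN */
};

/* Build a classification bit-vector for one double. */
unsigned fclass_bits(double x)
{
    int neg = signbit(x) != 0;
    switch (fpclassify(x)) {
    case FP_INFINITE:  return neg ? CMP_NINF    : CMP_PINF;
    case FP_NORMAL:    return neg ? CMP_NNORM   : CMP_PNORM;
    case FP_SUBNORMAL: return neg ? CMP_NDENORM : CMP_PDENORM;
    case FP_ZERO:      return neg ? CMP_NZERO   : CMP_PZERO;
    default:           return CMP_NAN;
    }
}

/* The integer range check 0 < Rs1 <= Rs2 as a single predicate. */
int in_range(long rs1, long rs2) { return 0 < rs1 && rs1 <= rs2; }
```

In hardware all of these bits come out of one compare; the sketch only shows
what information the result vector carries.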

There are 2 kinds of conditional flow, branching and predication, and
each has 2 principal kinds of instructions:: condition is determined
from a single bit in a register, or condition is determined by comparing
a register with 0. In addition there are conditionless branches, jumps,
and a special addition supporting PIC for method calls and switches.
Compare-to-zero and branch can access certain HW-known information
that cannot be stored in an ISA register--this includes things
like a query to the Memory Unit asking if it has seen any interference
between the start of an ATOMIC sequence and "now". The exception,
interrupt, and std. return instructions are also encoded here.

Memory reference instructions enable building of ATOMIC primitives
that can touch as many as 8 cache lines of data in a single ATOMIC
event. This is equivalent to the MIPS LL and SC except it operates
over much larger chunks of data. This is sufficient to move an entry
of a shared data structure from one place to another place in a single
event. This minimizes the number of ATOMIC events that are needed,
and comes with guarantees of forward progress.
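
For contrast, here is the conventional single-location primitive in portable
C11 (a sketch of the baseline, not My 66000 code): one CAS covers one word, so
moving an entry between two shared structures needs several independent atomic
events -- the cost the multi-line ATOMIC events above collapse into one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A node of a shared LIFO list. */
struct node { int value; struct node *next; };

/* Conventional lock-free push: one atomic event per single location. */
void push(struct node *_Atomic *top, struct node *n)
{
    n->next = atomic_load(top);
    while (!atomic_compare_exchange_weak(top, &n->next, n))
        ;  /* on failure, n->next has been reloaded with the current top */
}
```

Unlinking from one list and inserting into another with this primitive takes
at least two such loops, with an intermediate state visible to other threads.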

The ST instruction can store a constant in either 5-bit sign extended
form, or in 32-bit or 64-bit forms. No need to put a constant into a
register in order to ST it to memory. This comes along with the ability
to use 32-bit or 64-bit displacement constants.

There are 5 "special" memory reference instructions:: ENTER is used
to setup a new stack, and save registers, EXIT is used to tear down the
stack and restore registers, LDM loads multiple registers, STM stores
multiple registers, and MM moves data from memory to memory.
MM has the property that both cached and uncached memory smaller
than a page is moved as a single ATOMIC transfer. {PCIe can do this,
so should CPUs attached to PCIe peripherals.} There is expected to
be a sequencer in the memory unit that performs these out of the
data-path.

The Floating Point group includes Transcendental instructions.
Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
that are only 1 constant different in the calculations. Ln2 takes
only 14 cycles, sin takes 19 cycles. These are included because
they actually do improve performance.

Conversions between FP and FP or FP and INT are provided by
1 instruction (CVT) which has 49 variants to deal with 5 specified
rounding modes and 1 implied rounding mode (current) any time
a rounding could transpire. This falls into the category of "once
you have the HW to do <say> ANINT (of FORTRAN) you have
95% of the logic to do them all".

The exception model is based on message passing (as are SVCs),
rather than wandering through the high-level OS exception
dispatcher. This model supports threads (processes or tasks)
that are paranoid of the OS looking at their data (such as banking
applications running on a home PC), and can indeed restrict the
OS from looking at the address space.

I/O devices are virtualized, and operate on the virtual address
space of the originating requestor. So while the I/O device can DMA
directly into the paranoid application's address space, and while the OS
can verify that the given space and bounds are acceptable, the OS cannot
look into that address space. This gets rid of the need for a secured
mode of operation.

Deferred procedure calls are handled as messages (argument
setup + 1 instruction) with continuation. The messaging sub-system
operates over both HyperVisor and GuestOS domains simultaneously.
Anyone with a "method" can call that method and get a response
even if that method is running under a different GuestOS.

There is a 66-bit remapped address space--any thread can access
64-bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
and ROM}. The address space is configured to efficiently transport
requests over a significant network (ala HyperTransport and the Intel
equivalent). DRAM is cache coherent, configuration is strongly ordered,
MMIO is sequentially consistent, ROM is "lax".

The system repeater transports requests from chip to chip, and
amalgamates coherence requests so that the originator counts
responses from cores on its own chip, and the number of chips in
the system (rather than counting from every core).

Memory management cannot be turned off--My 66000 implementations
come out of reset with the MMUs turned on. HostBridge
is configured with an MMU/TLB that uses exactly the same tables as
CPUs and can share tables as applicable. Levels in the virtual
address space translations can be skipped! So an application as
simple as 'cat' can be managed with a single page of translation
overhead.

Memory management is inherently HyperVisor/GuestOS. Privilege
is determined by the assortment of root pointers in use on a per
invocation basis.

GuestOS can activate a thread (taking it from a waiting state to
running in a core) in a single instruction and remotely. So can
HyperVisor.

Finally, there is no notion of one thread morphing into a different
thread over a series of instructions manipulating control registers
one by one. For example: an ISR cleanup handler takes a thread
off a wait state queue, places it on a run state queue, and signals
GuestOS to see what threads should be running "right now". This
is all 1 instruction and 1 cycle as far as the core performing the
instruction sees.

My 66000 is not just another ISA, it is a rethink of most of the components
that make up a system. A context switch from one thread to another
within a single GuestOS is 10 cycles. A context switch from one thread
to a thread under a different GuestOS remains 10 cycles. The typical
current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
across GuestOSs.

OH, and BTW, The FP transcendentals are patented.

Re: Why My 66000 is and is not RISC

<t90vhb$1m8n$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=26040&group=comp.arch#26040

From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 08:00:15 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t90vhb$1m8n$1@gioia.aioe.org>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 by: Terje Mathisen - Thu, 23 Jun 2022 06:00 UTC

MitchAlsup wrote:
> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".

If you can add the in-reg SIMD compare ops without slowing stuff down,
please do so!

Even having VMM, if you can identify the final \0 byte anywhere in a
64-bit reg, then that's a win for lots of code.
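
The usual software fallback for exactly this -- finding a \0 byte in a 64-bit
register -- is the well-known SWAR zero-byte test, sketched here in C
(little-endian lane numbering assumed):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Non-zero iff some byte of v is zero: each zero byte turns on the high
   bit of its lane. Lanes above the first zero byte may also be flagged
   by borrow propagation, so only the lowest set bit is meaningful. */
static uint64_t haszero(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Index of the first zero byte (little-endian lane order), or -1. */
static int first_zero_byte(uint64_t v)
{
    uint64_t m = haszero(v);
    return m ? __builtin_ctzll(m) / 8 : -1;
}
```

A single "any byte equal" compare would replace the subtract/and/and sequence
with one instruction, which is presumably the win being asked for.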

That said, just having your current VMM setup would obviate the need for
SIMD style ops in almost all programs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Why My 66000 is and is not RISC

<t9134k$3tg$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26045&group=comp.arch#26045

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 09:01:39 +0200
Organization: A noiseless patient Spider
Message-ID: <t9134k$3tg$1@dont-email.me>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org>
 by: Marcus - Thu, 23 Jun 2022 07:01 UTC

On 2022-06-23, Terje Mathisen wrote:
> MitchAlsup wrote:
>> There are Compare instructions that return a bit-vector of everything
>> the compare circuitry can determine, including range checks like:
>> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
>> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
>> to add "any byte equal", "any halfword equal", "any word equal".
>
> If you can add the in-reg SIMD compare ops without slowing stuff down,
> please do so!

In-reg SIMD can be useful. In MRISC32 you can do:

seq.b r2, r1, z   ; Byte-wise compare r1 to zero, "Set if EQual"
bz    r2, foo1    ; Branch if no byte equal (mask zero)
bnz   r2, foo2    ; Branch if any byte equal (mask not zero)
bs    r2, foo3    ; Branch if all bytes equal (mask set)
bns   r2, foo4    ; Branch if any byte not equal (mask not set)

...and similar with seq.h for half-words. There are also inequality and
gt/lt comparisons, for instance.

Note: "Set" means all bits of the byte/half-word/word are 1. The
opposite (false) outcome of the set instructions is that all bits of
the byte/half-word/word are 0.

The cute part is that I did not have to add special "SIMD" branch
instructions, since the same instructions make sense for both packed and
unpacked comparison results.
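
Those semantics can be modeled in C -- seq_b below is a behavioral sketch of
MRISC32's seq.b, not its implementation, and the four branch conditions reduce
to plain integer tests on the resulting mask:

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise "Set if EQual": each result byte is 0xFF where the
   corresponding bytes of a and b match, 0x00 where they differ. */
uint64_t seq_b(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        uint64_t ab = (a >> i) & 0xFF, bb = (b >> i) & 0xFF;
        if (ab == bb)
            r |= 0xFFULL << i;
    }
    return r;
}

/* The branch conditions are ordinary scalar tests on the mask. */
int bz (uint64_t m) { return m == 0; }        /* no byte equal     */
int bnz(uint64_t m) { return m != 0; }        /* any byte equal    */
int bs (uint64_t m) { return m == ~0ULL; }    /* all bytes equal   */
int bns(uint64_t m) { return m != ~0ULL; }    /* any byte unequal  */
```

Which is the "cute part": because the mask is just a register value, the
scalar branch instructions work unchanged on packed results.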

/Marcus

>
> Even having VMM, if you can identify the final \0 byte anywhere in a
> 64-bit reg, then that's a win for lots of code.
>
> That said, just having your current VMM setup would obviate the need for
> SIMD style ops in almost all programs.
>
> Terje
>

Re: Why My 66000 is and is not RISC

<7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26055&group=comp.arch#26055

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 12:30:40 -0700 (PDT)
Message-ID: <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
 by: MitchAlsup - Thu, 23 Jun 2022 19:30 UTC

In today's installment I touch on things about My 66000 not covered above.

My 66000 ISA requires an instruction buffer and a 2-stage instruction
processing pipeline whose stages I call PARSE and DECODE. Hennessey would be booing
at this point. However, using this, I get branch overhead down to 0.03 cycles
per taken branch without having any delay slot. {This also makes a unified
L1 cache feasible. But since Fetch and MemRef are so far apart on the die
My implementations have chosen not to utilize this capability.}

PARSE finds the instruction boundaries (main job), scans ahead for branches,
determines which function units are needed, and looks for CoIssue opportunities.
The scanned-ahead branches are processed in parallel by DECODE to fetch branch
targets even before the branch instruction is executed. So if a taken prediction
is made, the instructions on the taken path are already ready to enter execution.
PARSE identifies immediates and displacements and cancels register port requests,
providing opportunities for ST to read the register file.

DECODE processes the instructions from PARSE, accesses the register file,
computes forwarding, and starts instructions into the execution pipeline.
DECODE routes immediates and displacements to the required instruction.
ST instructions pass through DECODE twice: the 1st time is for AGEN, the
2nd time is for ST.data when a register file port is available.

---------------------------instruction stuff-----------------------------------------------------------

The shift instructions have 2×6-bit fields dealing with the shift amount and
the width of data being shifted. These are used to access odd-sized data
(ala EXTRACT) and to SMASH data calculated at "machine" size back down
into containers of "language" size so the container cannot contain a value
outside of the range of its container. When the width field is 0 it is considered
to be 64-bits. When encoded as an immediate, the 2 fields are back-to-back,
when found in a register there are 26 bits separating the 2 fields; in data<38:32>
both 1000000 and 0000000 are considered to be 64-bits, while 1xxxxxx
with any of the x's non-zero is considered an Operand exception.
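
A behavioral C sketch of the register-form decode as described above (field
placement per the text; the struct and names are mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded shift specification: offset assumed in data<5:0>,
   width in data<38:32> for the register form. */
struct shiftspec { unsigned width, offset; bool fault; };

struct shiftspec decode_shift_reg(uint64_t r)
{
    struct shiftspec s = { (unsigned)((r >> 32) & 0x7F),
                           (unsigned)(r & 0x3F), false };
    if (s.width == 0x40 || s.width == 0x00)
        s.width = 64;       /* both encodings mean the full 64 bits */
    else if (s.width > 0x40)
        s.fault = true;     /* 1xxxxxx with a non-zero x: Operand exception */
    return s;
}
```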

The Multiplex Instruction MPX. MPX basically allows for selecting bits from
a pair of registers based on another register:: ( a & b ) | ( ~a & c ), however
it has other flavors to provide ( !!a & b ) | ( !a & c ) which is CMOV and by
using the immediate encodings in My 66000 provides MOV Rd,#IMM32 and
MOV Rd,#IMM64 along with MOV Rd,Rs1 and MOV Rd,Rs2. They fall out
for free, saving MOV opcodes elsewhere.
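
In C terms (the function names are mine, not My 66000 mnemonics):

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise multiplex: for each bit, take b where the selector bit is 1,
   else take c. */
uint64_t mpx(uint64_t sel, uint64_t b, uint64_t c)
{
    return (sel & b) | (~sel & c);
}

/* The CMOV flavor first collapses the selector to all-ones/all-zeros,
   i.e. (a ? b : c) done branchlessly. */
uint64_t cmov(uint64_t a, uint64_t b, uint64_t c)
{
    return mpx(a ? ~0ULL : 0ULL, b, c);
}
```

With sel forced to all-ones or all-zeros and one operand taken from the
immediate path, the same datapath yields MOV Rd,#IMM and MOV Rd,Rs for free.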

Vectorization: My 66000 ISA contains loop vectorization. This allows
vectorized loops to perform several iterations per cycle; even
1-wide machines can perform at 32+ instructions per cycle. My main
(as yet unproven) hope is that this takes the pressure off of the design
width. The basic argument is as follows:
a) 1-wide machines operate at 0.7 IPC
b) 2-wide SuperScalar machines operate at 1.0 IPC
c) GBOoO machines operate at 2.0 IPC
d) programs spend more than ½ their time in loops.
So, if one can get a 2× performance advantage on the 1-wide machine,
this puts it within spitting distance of the GBOoO machine, which in turn
means the Medium OoO machine can be competitive with the GBOoO
machine at significantly lower {cost, design time, area, power}.

AND while investigating loop vectorization, I discovered that a RISC
pipeline with a 3R-1W register file can perform 1.3 IPC. Branch
instructions (20%) do not use the result register, ST instructions
(10%) can borrow the write port AFTER cache tag and translation
validations, AND in the general code I have seen there is significant
opportunity to perform write-elision in the data path, freeing up even
more ports. This, again, takes pressure off the width of the design.
So, with vectorization, a 3 (or 4)-wide machine is competitive with
a 6-wide machine.

None of this prevents or makes wide GBOoO more difficult.

----------------------instruction modifiers------------------------------------------

CARRY is the first of the Instruction-Modifiers. An instruction-modifier
supplies "bits" for several future instructions so that one does not need
the cartesian product of a given subset encoded in the ISA. Thus, there
are shift instructions, and when used with CARRY these perform shifts
as wide as you like: 128, 256, 512, ... with no need to clog up the encoding
space for lightly used but necessary functionality. Even in the FP arena,
CARRY provides access to exact FP arithmetics.

CARRY provides access to multiprecision arithmetic both integer and FP.
CARRY provides a register which can be used as either/both Input and Output
to a set of instructions. This provides a link from one instruction to another
where data is transmitted but not encoded in the instruction itself.
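
What CARRY provides can be modeled in C as an explicit carry link between two
64-bit adds; this sketch shows the data flow, not the encoding:

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit add built from two 64-bit adds plus a carry link -- the kind
   of instruction-to-instruction data transmission CARRY supplies in
   hardware without a dedicated wide-add opcode. */
struct u128 { uint64_t lo, hi; };

struct u128 add128(struct u128 a, struct u128 b)
{
    struct u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
    return r;
}
```

In software the carry costs a compare and an extra add; with CARRY it rides
along on a register link that the instructions never have to encode.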

Since we are in the realm of power-limited designs, My 66000 ISA has an ABS
instruction. Over in the integer side, this instruction can be performed
by subjugating the sign control built into the data path and be "executed"
without taking any pipeline delay (executes in zero cycles). Over on the
FP side this never adds any latency (executes in zero cycles). ABS always
takes less power than performing the instruction in any other way.

DBLE is an instruction modifier that supplies register encodings and
adds 64-bits to the calculation width of the modified instruction. Applied
to a FP instruction: DBLE Rd1,Rs11,Rs21,Rs31; FMAC Rd2,Rs12,Rs22,Rs32
we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
and presto: we get FP128 by adding exactly 1 instruction, the compiler
can pick any 8 registers it desires alleviating register allocation concerns.
DBLE is a "get by" kind of addition, frowned upon by Hennessey.

I can envision a SIMD instruction modifier that defines the SIMD parameters
of several subsequent instructions and allows 64-bit SIMD to transpire.
I am still thinking about these. What I cannot envision is a wide SIMD
register file--this is what VVM already provides.

These instruction-modifiers, it seems to me, are vastly more efficient
than throwing hundreds to thousands of unique instructions into ISA.
Especially if those unique instructions <on average> are not used
"that much".

-----------------------------Safe Stack--------------------------------------------------------

Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}
When Safe Stack is in use, the return address goes directly to the Safe
Stack, and comes directly off the Safe Stack. Preserved
registers are placed on the Safe Stack {ENTER} and their register values
(conceptually) set to 0. Safe Stack is in normal thread memory but
the PTEs are marked RWE = 000 so any access causes page faults.
EXIT reloads the preserved registers from Safe Stack and transfers
control directly back to the caller. When Safe Stack is not in use, R0
is used to hold the return address. Properly compiled code runs the
same whether Safe Stack is on or off, so one can share dynamic libraries
between modes.

Safe Stack monitors the value in SP and KILLs lines that no longer
need to reach out into the cache hierarchy; Safe Stack can efficiently
use Allocate memory semantics. Much/most of the time, nothing
in the Safe Stack leaves the cache hierarchy.

Buffer overflows on the "stack" do not corrupt the call/return flow of
control. ROP cannot happen, as the application has no access to the Return
Address. The application cannot see the values in the preserved registers,
augmenting safety, and certainly cannot modify them.

-------------------------------ABI----------------------------------------------------------------------

Subroutine Calling Convention {A.K.A. ABI}:
Registers R1..R8 contain the first 8 arguments to the subroutine.
SP points at argument[9]
R9..R15 are considered as temporary registers
R16..R29 are preserved registers
R30=FP is a preserved register but used as a Frame Pointer when
...............language semantics need it.
R31=SP is a preserved register and used as a Stack Pointer. SP must
...............remain doubleword aligned at all times.

The ABI is very RISC.

So, let's say we want to call a subroutine that wants to allocate 1024
bytes on the stack for its own local data, is long running and needs
to preserve all 14 preserved registers, and is using a FP along with a
SP. Let us further complicate the matter by stating this subroutine
takes a variable number of arguments. Entry Prologue:

ENTRY subroutine_name
subroutine_name:
ENTER R16,R8,#(1024 | 2)

At this point the register-passed arguments have been saved with the
memory-passed arguments, FP is pointing at the "other" end of local
data on the stack; after pushing the registers, 1024 bytes have been
allocated onto the SP, the old FP has been saved, and the new FP set up.
{This works both with and without Safe Stack}

Your typical RISC-only ISA would require at least 29 instructions to
do this amount of work getting into the subroutine, and another 17
getting out. If the ISA has both INT and FP register files 29 becomes 37.

The same happens in the Epilogue: 1 instruction.


[Article truncated; the full text is available at the link above.]
Re: Why My 66000 is and is not RISC

<t92sv9$jbb$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26061&group=comp.arch#26061

From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 23:28:49 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <t92sv9$jbb$1@dont-email.me>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org>
 <t9134k$3tg$1@dont-email.me>
 <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
 by: Brett - Thu, 23 Jun 2022 23:28 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> In todays installment I touch on things about My 66000 not covered above.
>
> My 66000 ISA requires an instruction buffer and a 2 stage instruction
> processing pipeline I call PARSE and DECODE. Hennessey would be booing
> at this point. However, using this, I get branch overhead down to 0.03 cycles
> per taken branch without having any delay slot. {This also makes a unified
> L1 cache feasible. But since Fetch and MemRef are so far apart on the die
> My implementations have chosen not to utilize this capability.}
>
> PARSE finds the instruction boundaries (main job), scans ahead for branches,
> determines which function units are needed, and looks for CoIssue opportunities.
> The scanned-ahead branches are processed in parallel by DECODE to fetch branch
> targets even before the branch instruction is executed. So if a taken prediction
> is made, the instructions on the taken path are already ready to enter execution.
> PARSE identifies immediates and displacements and cancels register port requests,
> providing opportunities for ST to read the register file.
>
> DECODE processes the instructions from PARSE, accesses the register file,
> computes forwarding, and starts instructions into the execution pipeline.
> DECODE routes immediates and displacements to the required instruction.
> ST instructions pass through DECODE twice: the 1st time is for AGEN, the
> 2nd time is for ST.data when a register file port is available.
>
> ---------------------------instruction
> stuff-----------------------------------------------------------
>
> The shift instructions have 2×6-bit fields dealing with the shift amount and
> the width of data being shifted. These are used to access odd-sized data
> (à la EXTRACT) and to SMASH data calculated at "machine" size back down
> into containers of "language" size, so the container cannot hold a value
> outside of its range. When the width field is 0 it is considered to be
> 64-bits. When encoded as an immediate, the 2 fields are back-to-back;
> when found in a register there are 26 bits separating the 2 fields. In
> data<38:32>, both 1000000 and 0000000 are considered to be 64-bits, while
> 1xxxxxx with any of the x's non-zero is considered an Operand exception.
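The offset/width pair above can be sketched in C. This is an illustrative model only (function name made up; width 0 treated as 64 per the text; the Operand-exception case and exact field placement are omitted):

```c
#include <stdint.h>

/* Model of a shift/extract with a 6-bit offset and 6-bit width field,
   where a width of 0 is treated as 64 (as described above). Extracts
   'width' bits of v starting at bit 'offset', EXTRACT-style. */
static uint64_t extract_field(uint64_t v, unsigned offset, unsigned width) {
    unsigned w = (width & 63) ? (width & 63) : 64;
    v >>= (offset & 63);
    return (w == 64) ? v : (v & (((uint64_t)1 << w) - 1));
}
```

SMASH is the inverse direction: the result of a full-width computation masked back down into the language-sized container.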
>
> The Multiplex instruction MPX basically allows for selecting bits from
> a pair of registers based on another register:: ( a & b ) | ( ~a & c ); however,
> it has other flavors to provide ( !!a & b ) | ( !a & c ), which is CMOV, and by
> using the immediate encodings in My 66000 provides MOV Rd,#IMM32 and
> MOV Rd,#IMM64 along with MOV Rd,Rs1 and MOV Rd,Rs2. These fall out
> for free, saving MOV opcodes elsewhere.
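The two select flavors just described can be modeled in C; this is an illustrative sketch of the behavior, not the actual My 66000 encoding (function names are made up):

```c
#include <stdint.h>

/* Bitwise multiplex: per bit, take from b where a is 1, from c where
   a is 0 -- the ( a & b ) | ( ~a & c ) form described above. */
static uint64_t mpx(uint64_t a, uint64_t b, uint64_t c) {
    return (a & b) | (~a & c);
}

/* CMOV flavor: (!!a) collapses a to a full-width 0/1 mask, so whole
   registers get selected rather than individual bits. */
static uint64_t mpx_cmov(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t m = a ? ~(uint64_t)0 : 0;
    return (m & b) | (~m & c);
}
```

With b or c supplied as an immediate and a fixed to all-ones or all-zeros, the same dataflow yields the MOV forms mentioned above.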
>
> Vectorization: My 66000 ISA contains loop vectorization. This allows
> for performing vectorized loops at several iterations per cycle; even
> 1-wide machines can perform at 32+ instructions per cycle. My main
> (as yet unproven) hope is that this takes the pressure off of the design
> width. The basic argument is as follows:
> a) 1-wide machines operate at 0.7 IPC
> b) 2-wide SuperScalar machines operate at 1.0 IPC
> c) GBOoO machines operate at 2.0 IPC
> d) programs spend more than ½ their time in loops.
> So, if one can get a 2× performance advantage on the 1-wide machine,
> this puts it in spitting distance of the GBOoO machine, which in turn
> means the Medium OoO machine can be competitive with the GBOoO
> machine at significantly lower {cost, design time, area, power}.
>
> AND while investigating loop vectorization, I discovered that a RISC
> pipeline with a 3R-1W register file can perform at 1.3 IPC. Branch
> instructions (20%) do not use the result register, ST instructions
> (10%) can borrow the write port AFTER cache tag and translation
> validations, AND in the general code I have seen there is significant
> opportunity to perform write-elision in the data path, freeing up even
> more ports. This, again, takes pressure off the width of the design.
> So, with vectorization, a 3 (or 4)-wide machine is competitive with
> a 6-wide machine.
>
> None of this prevents or makes wide GBOoO more difficult.
>
> ----------------------instruction modifiers------------------------------------------
>
> CARRY is the first of the Instruction-Modifiers. An instruction-modifier
> supplies "bits" for several future instructions so that one does not need
> the cartesian product of a given subset encoded in the ISA. Thus, there
> are shift instructions and when used with CARRY these perform shifts
> as wide as you like: 128, 256, 512, ...; no need to clog up the encoding
> space for lightly used but necessary functionality. Even in the FP arena
> CARRY provides access to exact FP arithmetic.
>
> CARRY provides access to multiprecision arithmetic both integer and FP.
> CARRY provides a register which can be used as either/both Input and Output
> to a set of instructions. This provides a link from one instruction to another
> where data is transmitted but not encoded in the instruction itself.
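The multiprecision linkage CARRY provides can be pictured as a wide add with the carry threaded between the pieces; a C sketch under that assumption (dataflow only, not the actual instruction pairing or encoding):

```c
#include <stdint.h>

/* 128-bit add built from two 64-bit adds with an explicit carry link,
   roughly the dataflow a CARRY-modified ADD pair provides: the carry
   is transmitted between instructions without being encoded in them. */
static void add128(uint64_t alo, uint64_t ahi,
                   uint64_t blo, uint64_t bhi,
                   uint64_t *rlo, uint64_t *rhi) {
    *rlo = alo + blo;
    uint64_t carry = (*rlo < alo);  /* unsigned wraparound => carry out */
    *rhi = ahi + bhi + carry;
}
```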
>
> Since we are in the realm of the power-limited, My 66000 ISA has an ABS
> instruction. Over on the integer side, this instruction can be performed
> by subjugating the sign control built into the data path and be "executed"
> without taking any pipeline delay (executes in zero cycles). Over on the
> FP side this never adds any latency (executes in zero cycles). ABS always
> takes less power than performing the instruction in any other way.
>
> DBLE is an instruction modifier that supplies register encodings and
> adds 64-bits to the calculation width of the modified instruction. Applied
> to a FP instruction: DBLE Rd1,Rs11,Rs21,Rs31; FMAC Rd2,Rs12,Rs22,Rs32
> we execute: FMAC {Rd1,Rd2},{Rs11,Rs12},{Rs21,Rs22},{Rs31,Rs32}
> and presto: we get FP128 by adding exactly 1 instruction; the compiler
> can pick any 8 registers it desires, alleviating register allocation concerns.
> DBLE is a "get by" kind of addition, frowned upon by Hennessy.
>
> I can envision a SIMD instruction modifier that defines the SIMD parameters
> of several subsequent instructions and allows 64-bit SIMD to transpire.
> I am still thinking about these. What I cannot envision is a wide SIMD
> register file--this is what VVM already provides.
>
> These instruction-modifiers, it seems to me, are vastly more efficient
> than throwing hundreds to thousands of unique instructions into the ISA.
> Especially if those unique instructions <on average> are not used
> "that much".
>
> -----------------------------Safe
> Stack--------------------------------------------------------
>
> Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
> Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}
> When Safe Stack is in use, the return address goes directly to the Safe
> Stack, and the return address comes directly off the Safe Stack. Preserved
> registers are placed on Safe Stack {ENTER} and their register values
> (conceptually) set to 0. Safe Stack is in normal thread memory but
> the PTEs are marked RWE = 000 so any access causes page faults.
> EXIT reloads the preserved registers from Safe Stack and transfers
> control directly back to caller. When Safe Stack is not in use, R0
> is used to hold the return address. Proper compiled code runs the
> same when safe stack is on or off, so one can share dynamic libraries
> between modes.
>
> Safe Stack monitors the value in SP and KILLs lines that no longer
> need to reach out into the cache hierarchy; Safe Stack can efficiently
> use Allocate memory semantics. Much/most of the time, nothing
> in the safe stack leaves the cache hierarchy.
>
> Buffer overflows on the "stack" do not corrupt the call/return flow of
> control. ROP cannot happen, as the application has no access to the Return
> Address. The application cannot see the values in the preserved registers,
> augmenting safety, and certainly cannot modify them.
>
> -------------------------------ABI----------------------------------------------------------------------
>
> Subroutine Calling Convention {A.K.A. ABI}:
> Registers R1..R8 contain the first 8 arguments to the subroutine.
> SP points at argument[9]
> R9..R15 are considered as temporary registers
> R16..R29 are preserved registers
> R30=FP is a preserved register but used as a Frame Pointer when
> ..............language semantics need.
> R31=SP is a preserved register and used as a Stack Pointer. SP must
> ..............remain doubleword aligned at all times.
>
> The ABI is very RISC.
>
> So, let's say we want to call a subroutine that wants to allocate 1024
> bytes on the stack for its own local data, is long running and needs
> to preserve all 14 preserved registers, and is using a FP along with a
> SP. Let us further complicate the matter by stating this subroutine
> takes a variable number of arguments. Entry Prologue:
>
> ENTRY subroutine_name
> subroutine_name:
> ENTER R16,R8,#(1024 | 2)
>
> At this point the register passed arguments have been saved with the
> memory passed arguments, FP is pointing at the "other" end of local
> data on the stack; after pushing the registers, 1024 bytes have been
> allocated by adjusting the SP, the old FP has been saved, and the new FP set up.
> {This works both with and without Safe Stack}
>
> Your typical RISC-only ISA would require at least 29 instructions to
> do this amount of work getting into the subroutine, and another 17
> getting out. If the ISA has both INT and FP register files 29 becomes 37.
>
> The same happens in the Epilogue: 1 instruction.
>
> While the ABI is very RISC, the ISA of the Prologue and Epilogue is not.
>
> As a side note: My 66000 is achieving code density similar to x86-64.
>
> A few other interesting side
> bits:------------------------------------------------------------
>
> LDM and STM to unCacheable addresses are performed as if ATOMIC::
> that is:: as a single bus transaction. All interested 3rd parties see the
> memory before any writes have been performed or after all writes
> have been performed. A device driver can read several MMIO device
> control registers and know that nobody else in the system has access
> to the device control registers that could cause interference. A device
> driver can store multiple control register locations without interference.
>
> There is a page ¿in ROM? known to contain zeros. A Memory Move
> instruction can cause a page accessing this ¿ROM? data to be zeroed
> without even bothering to access ¿ROM?--and the entire page is zeroed
> at the target. Thus, pages being reclaimed to the free pool are but 1
> instruction away from being in the already zeroed page pool. Zeroing
> pages is performed at the DRAM end of the system (coherently). And
> no <deleterious> bus activity is utilized.


Re: Why My 66000 is and is not RISC
<t932qd$j78$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=26062&group=comp.arch#26062

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 20:08:24 -0500
Message-ID: <t932qd$j78$1@dont-email.me>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
In-Reply-To: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 by: BGB - Fri, 24 Jun 2022 01:08 UTC

On 6/22/2022 8:03 PM, MitchAlsup wrote:
> I could not find the question asking me to make a list of why My 66000
> instruction set architecture is like and unlike the tenets of the original
> RISC. So I spent some time looking up what the internet is currently saying
> about RISCs. There is a short list, but I will start with a few statements
> from Hennessy and Patterson::
>
> Hennessy:: The goal of any instruction format should be: 1. simple decode,
> 2. simple decode, and 3. simple decode. Any attempts at improved code
> density at the expense of CPU performance should be ridiculed at every
> opportunity.
>
> Patterson:: more is not better -- microcode is bad
> Subroutines need low overhead
>
> RISC axioms:
> a) the ISA is primarily designed to make the pipeline simple.
> b) the ISA is primarily designed as a target for compilers.
> c) instructions only exist if they add performance.
> d) frequently accessed data is kept in registers.
>

BJX2 generally upholds the above.

While some instructions are pretty niche, most still tend to have
use-cases, and I am mostly trying to avoid adding stuff that is
(completely) useless.

> RISC tenets:
> a) 1 word == 1 instruction
> b) 1 instructions flows down the pipeline in 1 cycle
> c) 1 instruction can cause 0 or 1 exception
> d) instruction encoding uses few patterns
> e) there is a large uniformly addressable register space
>

My case, 3 out of 5.

a, 16/32 and bundle-encodings break this one.

d, Some extra complexity exists due to the lack of an architectural Zero
Register and similar, and some instructions (early on) which ended up
with both 2R and 3R encodings.

Early on, I wasn't confident, for example, that "ADD R4, R5" and "ADD
R5, R4, R5" would have been semantically equivalent in all cases.

There were some other cases (Mostly 32-bit 2R Load/Store variants) which
were dropped due to being entirely redundant with the 3R encodings (or
which became redundant once predication was added).

Some other parts of the ISA also ended up being dropped and then later
re-added a few times before becoming more-or-less permanent (and some
other features are in limbo due to not really adding enough to
justify their existence).

> So where does My 66000 ISA stand with respect to these axioms and
> tenets::
>
> RISC axioms: My 66000 ISA embodies all of the RISC axioms
> RISC tenets: My 66000 ISA rejects ½ of RISC tenets
>
> With minor exceptions to both::
>
> My 66000 contains 32×64-bit general purpose registers. Some might
> think this is too few and a FP register file should be added. Looking
> at code such as BLASS, Livermore Loops, Linpack indicates otherwise
> -- as long as one assumes some hints of OoO pipelining. Looking at
> various C libraries this seems perfectly sufficient.
>

My case: 32|64 x 64-bit.

I am still on the fence as to whether 32 GPRs is "fully sufficient", or
whether 64 GPRs can offer enough gain (in certain use-cases) to justify
its existence. It "kinda helps" for TKRA-GL but is seemingly kinda moot
for pretty much everything else.

The way the encodings for the 64 GPR case are handled is a bit hacky,
but it was a tradeoff (I came up with something which could be done
without breaking binary compatibility or requiring a separate operating
mode). Ironically, everything still works OK so long as "most of the
code" sticks to only using the low 32 GPRs (otherwise, some of the seams
might start to show).

> My 66000 ISA contains 6 decoding patterns; 1 for each of
> {instructions with 16-bit immediates, instructions with 12-bit
> immediates, scaled memory reference, 2-operand reg-reg,
> 1-operand reg-reg, 3-operand reg-reg }
>

Hmm (8 major for 32-bit):
FZnm_ZeoZ //3R "Rm, Ro, Rn"
FZnm_ZeZZ //2R "Rm, Rn"
FZnm_Zeii //3RI (Imm9/Disp9), "Rm, Imm9, Rn" / "(Rm, Disp9), Rn"
FZnZ_Zeii //2RI (Imm10), "Imm10, Rn"
FZZZ_ZeoZ //1R (Ro treated as Rn for these)
FZZn_iiii //2RI (Imm16), "Imm16, Rn"
FZdd_Zddd //Disp20 (Branch)
FZii_iiii //"LDIz Imm24, R0"

Add a few more if one counts the 16-bit ops:
ZZnm //2R
ZZni //2RI (Imm4)
ZZnZ //1R
Znii //2RI (Imm8)
ZZdd //Disp8 (Branch)

The Jumbo and Op64 encodings may or may not be considered new forms;
however, they don't actually add "new" instruction-forms per se, but
rather modify the existing encodings in predefined ways (and reuse the
existing 32-bit decoder, just with more bits "glued on" to the instruction).

One could potentially also interpret the 32-bit encodings as zero-padded
versions of a longer internal encoding space:
FEii_iiii_FZnm_Zeii //3RI, "Rm, Imm33, Rn"
...

With a few special cases, eg:
FEii_iiii_FAii_iiii //"LDIZ Imm48, R0"
FFii_iiii_FAii_iiii //"BRA Abs48"
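The "glued on" widening can be sketched as plain bit concatenation. Widths follow the text (24 prefix bits plus the base 9-bit immediate gives 33), but the exact BJX2 field layout is not reproduced here, and the sign-extension choice is an assumption:

```c
#include <stdint.h>

/* Hypothetical sketch: a Jumbo prefix contributes 24 high bits that
   are concatenated with the base instruction's 9-bit immediate, and
   the 33-bit result is sign-extended to 64 bits. */
static int64_t widen_imm33(uint32_t jumbo24, uint32_t imm9) {
    uint64_t v = ((uint64_t)(jumbo24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF);
    return (int64_t)(v << 31) >> 31;    /* sign-extend from bit 32 */
}
```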

There are more forms if one considers "minor" patterns, but these don't
really affect instruction encoding, rather how the various parts are
interpreted and mapped to the internal pipeline:
Logically, each instruction is decoded as if it had:
3 read ports, 1 write port;
A 33-bit immediate/displacement field;
Op / Sub-Op;
...

This then combines with an outer stage that deals with the bundle as a
whole, mapping SIMD ops to two lanes, along with Abs48 and Imm64
encodings (where the immediate can't fit into a single pipeline lane).

The output of this then being the configuration for the entire pipeline.

> The 12-bit immediate format is used for shift instructions and
> for Predicate instructions and positioned such that predicate
> instructions are only 1-bit different than their corresponding
> branch instruction. This saves 6×16-bit immediate encodings.
>

Differs in my case:
Shifts and friends use Imm9 forms;
However, because one doesn't need all 9 bits for a typical shift, I had
also kinda shoe-horned SIMD shuffle instructions into the mix as well.

Or, in effect, shuffle can be imagined sort of like a conjoined twin
stuck onto the shift instruction (and a variable shift imagined as
masking off the bit that would cause it to behave like a shuffle).

Or, one can also imagine that there could have been an alternate
universe where passing a sufficiently out-of-range value to the shift
instruction caused it to shuffle the value instead...

Predicate instructions work very differently in my case, having their
own copy of the 32-bit encoding space which mirrors the format of the
normal opcode space (just replacing the WEX bit with a True/False bit),
and the encoding spots that would have normally encoded Imm24 and Jumbo
being repurposed as Predication+WEX / "PrWEX" (but only applying to a
subset of the ISA).

> Scaled memory reference, 1-operand, 2-operand, 3-operand
> all have access to 32-bit or 64-bit immediates/displacements
> in substitution for a register. This eliminates any need to use
> instructions or waste registers pasting constants together.
>

Via Jumbo, these can all expand to 33 bits.

The 64-bit cases are a bit more limited, but not usually a huge issue.

There are also some Imm56 encodings "on paper" (these are in a similar
limbo as the 48-bit instruction encodings).

Seemingly the vast majority of what one needs a larger immediate for can
be handled via Imm33, where, say, only about 4% of the constants
actually go outside of this limit (the vast majority of these being
either MMIO pointers or irrational floating-point constants).

The Imm56 cases look like they would be too rare to really be worth
bothering with at present.

> 1-operand, 2-operand, 3-operand instructions all have sign control
> over their operands. There is no SUB instruction My 66000 uses
> ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
> instructions from execution. The 2-operand group allows the
> 5-bit register specifier to be used as a 6-bit sign extended
> immediate, making ADD Rd,#1,-Rs2 easily encoded.
>

No equivalent in my case.

> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".
>

No equivalent.

I did the same thing as SuperH here:
CMPxx instructions twiddle the SR.T bit;
Branches / Predication / ... all operate off the SR.T bit.
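A toy C model of this single-flag scheme (SuperH-style, as described): compares write one T bit, and control flow reads only that bit. Names here are illustrative, not actual BJX2 or SuperH mnemonics:

```c
#include <stdbool.h>
#include <stdint.h>

/* Single status bit shared by all compares and branches (SR.T). */
static bool sr_t;

static void cmp_gt(int64_t a, int64_t b) { sr_t = (a > b); }  /* CMPGT-style */
static void cmp_eq(int64_t a, int64_t b) { sr_t = (a == b); } /* CMPEQ-style */

/* A branch (or predicated op) tests only SR.T, either directly (BT)
   or inverted (BF). */
static bool take_branch(bool on_true) { return on_true ? sr_t : !sr_t; }
```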

Ironically, because of the way Verilog works, so much stuff hanging off
a single bit causes it to get something like 1000x more expensive.

> There are 2 kinds of conditional flow: branching and predication and
> each has 2 principal kinds of instructions:: condition is determined
> from a single bit in a register, or condition is determined by comparing
> a register with 0. In addition there are conditionless branches, jumps,
> and a special addition supporting PIC for method calls and switches.
> Compare-to-zero and branch can access certain HW know information
> that is not capable of being stored in a ISA register--this includes things
> like a query to the Memory Unit asking if it has seen any interference
> between the start of an ATOMIC sequence and "now". Exception,
> interrupt, and std. returns are also encoded here.
>


Re: Why My 66000 is and is not RISC
<e914214f-4881-448a-94ff-6506dde833c3n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=26063&group=comp.arch#26063

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Thu, 23 Jun 2022 18:17:21 -0700 (PDT)
Message-ID: <e914214f-4881-448a-94ff-6506dde833c3n@googlegroups.com>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
<t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
<7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com> <t92sv9$jbb$1@dont-email.me>
In-Reply-To: <t92sv9$jbb$1@dont-email.me>
 by: MitchAlsup - Fri, 24 Jun 2022 01:17 UTC

On Thursday, June 23, 2022 at 6:28:53 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > In todays installment I touch on things about My 66000 not covered above.
> >
> > My 66000 ISA requires an instruction buffer and a 2 stage instruction
> > processing pipeline I call PARSE and DECODE. Hennessey would be booing
> > at this point. However, using this, I get branch overhead down to 0.03 cycles
> > per taken branch without having any delay slot. {This also makes a unified
> > L1 cache feasible. But since Fetch and MemRef are so far apart on the die
> > My implementations have chosen not to utilize this capability.}
> >
> > PARSE finds the instruction boundaries (main job) and scans ahead for branches,
> > determines which function units, and looks for CoIssue opportunities. The scan
> > ahead branches are processes in parallel by DECODE to fetch branch targets
> > even before the branch instruction is executed. So if a taken prediction is made
> > the instructions on the taken path are already ready to enter execution.. PARSE
> > identifies immediates and displacements and cancels register port requests,
> > providing opportunities for ST to read the register file..........
> >
> > DECODE processes the instructions from PARSE , accesses register file,
> > computes forwarding, and starts instruction into the execution pipeline..
> > DECODE routes immediates and displacements to the required instruction.
> > ST instruction pass through DECODE twice, the 1st time is for AGEN, the
> > 2nd time is for ST.data when a register file port is available.
> >
> > ---------------------------instruction
> > stuff-----------------------------------------------------------
> >
> > The shift instructions have 2×6-bit fields dealing with the shift amount and
> > the width of data being shifted. These are used to access odd-sized data
> > (ala EXTRACT) and to SMASH data calculated at "machine" size back down
> > into containers of "language" size so the container cannot contain a value
> > outside of the range of its container. When the width field is 0 it is considered
> > to be 64-bits. When encoded as an immediate, the 2 fields are back-to-back,
> > when found in a register there is 26-bits separating the 2 fields, in data<38:32>
> > both 1000000 and 0000000 are considered to be 64-bits, while 1xxxxxxx
> > with any of the x's non-zero is considered an Operand exception.
> >
> > The Multiplex Instruction MPX. MPX basically allows for selecting bits from
> > a pair of registers based on another register:: ( a & b ) | ( ~a & c ), however
> > it has other flavors to provide ( !!a & b ) | ( !a & c ) which is CMOV and by
> > using the immediate encodings in My 66000 provides MOV Rd,#IMM32 and
> > MOV Rd,#IMM64 along with MOV Rd, Rs1 and MOV Rd,Rs2. They fall out
> > for free saving MOV opcodes elsewhere.
> >
> > Vectorization: My 66000 ISA contains loop vectorization. This allows
> > for performing vectorized loops are several iterations per cycle even
> > 1-wide machines can perform at 32+ instructions per cycle. My main
> > (as yet unproven) hope is that this takes the pressure off of the design
> > width. The basic argument is as follows:
> > a) 1-wide machines operate at 0.7 IPC
> > b) 2-wide SuperScalar machines operate at 1.0 IPC
> > c) GBOoO machines operate at 2.0 IPC
> > d) programs spend more than ½ their time in loops.
> > So, if one can get a 2× performance advantage of the 1-wide machine
> > this puts it in spitting distance of the GBOoO machine, which in turn
> > means the Medium OoO machine can be competitive with the GBOoO
> > machine are significantly lower {cost, design time, area, power}
> >
> > AND while investigating loop vectorization, I discovered that a RISC
> > pipeline with a 3R-1W register file can perform 1.3 IPC. Branch
> > instructions (20%) do not use the result register, ST instructions
> > (10%) can borrow the write port AFTER cache tag and translation
> > validations, AND in the general code I have seen there is significant
> > opportunity to perform write-elision in the data path, freeing up even
> > more ports. This, again takes pressure of the width of the design.
> > So, with vectorization, a 3 (or 4)-wide machine is competitive with
> > a 6-wide machine,.....
> >
> > None of this prevents or makes wide GBOoO more difficult.
> >
> > ----------------------instruction modifiers------------------------------------------
> >
> > CARRY is the first of the Instruction-Modifiers. An instruction-modifier
> > supplies "bits" for several future instructions so that one does not need
> > the cartesian product of a given subset encoded in the ISA. Thus, there
> > are shift instructions and when used with CARRY these perform shifts
> > as wide as you like 128, 256, 512,.....no need to clog up the encoding
> > space for lightly used but necessary functionality. Oven in the FP arena
> > CARRY provides access to exact FP arithmetics.
> >
> > CARRY provides access to multiprecision arithmetic both integer and FP.
> > CARRY provides a register which can be used as either/both Input and Output
> > to a set of instructions. This provides a link from one instruction to another
> > where data is transmitted but not encoded in the instruction itself.
> >
> > Since we are in the realm of power limited, My 66000 ISA has an ABS
> > instruction. Over in the integer side, this instruction can be performed
> > by subjugating the sign control built into the data path and be "executed"
> > without taking any pipeline delay (executes in zero cycles). Over on the
> > FP side this never adds any latency (executes in zero cycles). ABS always
> > takes less power than performing the instruction in any other way.
> >
> > DBLE is an instruction modifier that supplies register encodings and
> > adds 64-bits to the calculation width of the modified instruction. Applied
> > to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> > we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> > and presto: we get FP128 by adding exactly 1 instruction, the compiler
> > can pick any 8 registers it desires alleviating register allocation concerns.
> > DBLE is a "get by" kind of addition, frowned upon by Hennessey.
> >
> > I can envision a SIMD instruction modifier that defines the SIMD parameters
> > of several subsequent instructions and allows 64-bit SIMD to transpire.
> > I am still thinking about these. What I cannot envision is a wide SIMD
> > register file--this is what VVM already provides.
> >
> > These instruction-modifiers, it seems to me, are vastly more efficient
> > than throwing hundreds to thousands of unique instructions into ISA.
> > Especially if those unique instructions <on average> are not used
> > "that much".
> >
> > -----------------------------Safe
> > Stack--------------------------------------------------------
> >
> > Safe Stack. My 66000 architecture contains the notion of a Safe Stack.
> > Only 3 instructions have access to Safe Stack: {ENTER, EXIT, and RET}
> > When Safe Stack is in use, the return address goes directly to the Safe
> > Stack, and return address comes directly off safe stack. Preserved
> > registers are placed on Safe Stack {ENTER} and their register values
> > (conceptually) set to 0. Safe Stack is in normal thread memory but
> > the PTEs are marked RWE = 000 so any access causes page faults.
> > EXIT reloads the preserved registers from Safe Stack and transfers
> > control directly back to caller. When Safe Stack is not in use, R0
> > is used to hold the return address. Proper compiled code runs the
> > same when safe stack is on or off, so one can share dynamic libraries
> > between modes.
> >
> > Safe Stack monitors the value in SP and KILLs lines that no longer
> > need to reach out into the cache hierarchy, Safe Stack can efficiently
> > use Allocate memory semantics. Much/most of the time, nothing
> > in safe stack leaves the cache hierarchy.
> >
> > Buffer overflows on the "stack" do not corrupt the call/return flow of
> > control. ROP cannot happen as application has no access to Return
> > Address. Application cannot see the values in the preserved registers
> > augmenting safety and certainly cannot modify them.
> >
> > -------------------------------ABI----------------------------------------------------------------------
> >
> > Subroutine Calling Convention {A.K.A. ABI}:
> > Registers R1..R8 contain the first 8 arguments to the subroutine.
> > SP points are argument[9]
> > R9..R15 are considered as temporary registers
> > R16..R29 are preserved registers
> > R30=FP is a preserved registers but used as a Frame Pointer when
> > ..............language semantics need.
> > R31=SP is a preserved register and used as a Stack Pointer. SP must
> > ..............remain doubleword aligned at all times.
> >
> > ABI is very RISC
> >
> > So, let's say we want to call a subroutine that wants to allocate 1024
> > bytes on the stack for its own local data, is long running and needs
> > to preserve all 14 preserved registers, and is using a FP along with a
> > SP. Let us further complicate the mater by stating this subroutine
> > takes variable number of arguments. Entry Prologue:
> >
> > ENTRY subroutine_name
> > subrutine_name:
> > ENTER R16,R8,#(1024 | 2)
> >
> > At this point the register passed arguments have been saved with the
> > memory passed arguments, FP is pointing at the "other" end of local
> > data on the stack, after pushing the registers, 1024 bytes has been
> > allocated onto the SP, the old FP has been saved and the new FP setup.
> > {This works both with and without Safe Stack}
> >
> > Your typical RISC-only ISA would require at least 29 instructions to
> > do this amount of work getting into the subroutine, and another 17
> > getting out. If the ISA has both INT and FP register files, 29 becomes 37.
> >
> > The same happens in the Epilogue: 1 instruction.
> >
> > While the ABI is very RISC, the ISA of the Prologue and Epilogue is not.
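The work that single ENTER instruction performs can be sketched in software. This is a minimal model, not the actual hardware behavior: the register list follows the ABI described above, but the stack layout is an assumption, and the varargs save of R1..R8 implied by the `R8` operand is omitted for brevity.

```python
def enter(regs, stack, local_bytes):
    """Model of ENTER: push R16..R30 and RA, set FP, allocate locals."""
    sp = regs["SP"]
    # Spill the 15 preserved registers (R16..R30, R30 being the old FP)
    # plus the return address, one doubleword each.
    for name in [f"R{i}" for i in range(16, 31)] + ["RA"]:
        sp -= 8
        stack[sp] = regs[name]
    regs["FP"] = sp                   # new FP: the "other" end of local data
    regs["SP"] = sp - local_bytes     # allocate the local data
    return regs, stack

# Frame for the example above: 1024 bytes of locals, all preserved
# registers live, FP in use.
regs = {f"R{i}": i for i in range(16, 31)}
regs.update(SP=0x10000, RA=0xCAFE)
regs, stack = enter(regs, {}, 1024)
```

A matching single-instruction epilogue would simply reverse these steps: free the locals, reload the saved registers, and return through the saved RA.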
> >
> > As a side note: My 66000 is achieving similar code density as x86-64.
> >
> > A few other interesting side bits: ------------------------------------------------------------
> >
> > LDM and STM to unCacheable address are performed as if ATOMIC::
> > that is:: as a single bus transaction. All interested 3rd parties see the
> > memory before any writes have been performed or after all writes
> > have been performed. A device driver can read several MMIO device
> > control registers and know that nobody else in the system has access
> > to the device control registers that could cause interference. A device
> > driver can store multiple control register locations without interference.
> >
> > There is a page ¿in ROM? known to contain zeros. A Memory Move
> > instruction can cause a page accessing this ¿ROM? data to be zeroed
> > without even bothering to access ¿ROM?--and the entire page is zeroed
> > at the target. Thus, pages being reclaimed to the free pool are but 1
> > instruction away from being in the already zeroed page pool. Zeroing
> > pages is performed at the DRAM end of the system (coherently). And
> > no <deleterious> bus activity is utilized.
<
> X86-64 has crap code density, your one instruction stack save restore alone
> should make you significantly better, unless perhaps you have gone 32+32.
<
It is a major contributor to getting as small as it got.
>
> Add some accumulator ops and most instructions will fit in 16 bits ops with
> ease, and you have the extra decode stage to do it anyway.
<
I looked at this a few years ago and the damage to long term ISA growth
was catastrophic. As it is I have nearly ½ of the OpCode space in each
OpCode group left for the future, and can PARSE instructions in 31 gates
with only 4 gates of delay. All that goes out the window with a meaningful
16-bit "extension". I pass.
>
> I would argue that 8 bit opcodes are best when you have an accumulator in
> your 32 register RISC design, but that is a bridge too far for most.
<
My 66000 only has 59 total instructions. What makes you think you need 256 ?
>
> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?
<
I have absolutely no interest in things that small. IoT devices don't need
a HyperVisor, or even that much of a supervisor. I have no interest in
register sizes smaller than 64-bits. And quite frankly, say you did get a
design that small and into production, you would have to sell billions (maybe
trillions) of them at $0.05 to pay for the design team and recurring
engineering expenses.
<
If you do, more power to you.


Re: Why My 66000 is and is not RISC

<87fc640e-0f6b-4522-850c-418d3e5eabc2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26064&group=comp.arch#26064

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Date: Thu, 23 Jun 2022 18:38:43 -0700 (PDT)
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com> <t932qd$j78$1@dont-email.me>
 by: MitchAlsup - Fri, 24 Jun 2022 01:38 UTC

On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
> On 6/22/2022 8:03 PM, MitchAlsup wrote:
<snip>
> > The Floating Point group includes Transcendental instructions.
> > Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> > that are only 1 constant different in the calculations. Ln2 takes
> > only 14 cycles, sin takes 19 cycles. These are included because
> > they actually do improve performance.
> >
> No equivalent, nearly all math functions done in software in my case.
>
> Originally, there were no FDIV or FSQRT instructions either, but these
> exist now.
>
> Current timings are:
> FDIV: 130 cycles
> FSQRT: 384 cycles
>
Mc 88100 did these in: divide = 56 cycles and SQRT in ~66.
Mc 88120 did these in: divide = 17 and SQRT in 22.
>
> The trig functions generally run from around 500 to 1000 cycles or so
> (via unrolled Taylor expansion).
<
You need to use Chebyshev coefficients--more accurate, sometimes fewer
terms, and always better error bounds.
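To illustrate the gap (the degree, interval, and fitting method here are choices made for illustration, not taken from the post): a degree-7 fit of sin at Chebyshev nodes beats the degree-7 truncated Taylor series by a couple of orders of magnitude in worst-case error over [-pi/2, pi/2].

```python
import math

def taylor_sin(x):                  # truncated Taylor series, degree 7
    return x - x**3/6 + x**5/120 - x**7/5040

# Degree-7 interpolation of sin at 8 Chebyshev nodes on [-pi/2, pi/2].
a, b = -math.pi/2, math.pi/2
n = 8
nodes = [(a+b)/2 + (b-a)/2 * math.cos((2*k+1)*math.pi/(2*n)) for k in range(n)]
fvals = [math.sin(x) for x in nodes]

def cheb_sin(x):                    # Lagrange form of the Chebyshev-node fit
    total = 0.0
    for i in range(n):
        term = fvals[i]
        for j in range(n):
            if j != i:
                term *= (x - nodes[j]) / (nodes[i] - nodes[j])
        total += term
    return total

grid = [a + (b-a)*t/1000 for t in range(1001)]
taylor_err = max(abs(taylor_sin(x) - math.sin(x)) for x in grid)
cheb_err   = max(abs(cheb_sin(x)   - math.sin(x)) for x in grid)
```

The Taylor error is concentrated at the ends of the interval (~1.6e-4 at pi/2), while the Chebyshev-node fit spreads its error evenly and stays in the low 1e-6 range at the same degree.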
>
<<snip>
> My case: 48 or 96 bit virtual, 48 bit physical.
>
> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
> and not allow another request to pass until the former request has
> completed.
<
What are you going to do when there are 24 CPUs in a system and
everybody wants to write to the same MMI/O page ?
>
> All MMIO accesses are fully synchronous from the L1 cache down to the
> target device (unlike normal memory), though this does mean that
> accessing MMIO carries a fairly steep performance penalty relative to
> normal memory accesses.
>
The penalty is inherent in the requirements. However, My 66000 can ameliorate
the latency by grouping multiple writes to neighboring MMI/O control registers
into a single bus transaction. In theory, one can write all the necessary stuff
into the control registers to cause a disk drive to DMA a disk sector wherever
desired, in a single write transaction to MMI/O and a single DMA write
transaction when the data returns.
>
<
> > My 66000 is not just another ISA, it is a rethink of most of the components
> > that make up a system. A context switch from one thread to another
> > within a single GuestOS is 10 cycles. A context switch from one thread
> > to a thread under a different GuestOS remains 10 cycles. The typical
> > current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> > across GuestOSs.
> >
> > OH, and BTW, The FP transcendentals are patented.
> I would assume you mean FP transcendentals in hardware (in whatever way
> they are implemented), as opposed to in-general.
<
You might be surprised at what was allowed in the claims.
>
> Their existence in things like "math.h" and so on would likely preclude
> any sort of patent protection in the "in general" sense.
>
Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
different because of what one can do inside a HW function unit compared
to what one can do using only instructions) with several unique features.
They even bother to get the inexact bit set correctly.
>
> Very different, I have doubts about how well a lot of this could be
> pulled off in a low-cost implementation. Best I can come up with at the
> moment would effectively amount to faking it using lots of microcode or
> a software-based emulation layer.
>
Microcode generally refers to a control machine interpreting instructions.
Is a function unit run by ROM sequencer microcode ? What if the ROM got
turned into equivalent gates: Is it still microcode, or just a sequencer ?
In any event there are only 3 different sequences used (reminiscent of
Goldschmidt DIV and SQRT sequences).
>
> I also consider my ISA to be "fairly unique", albeit in different ways
> (and a little more conservative in terms of implementation concerns).
<
And hard to read..............

Re: Why My 66000 is and is not RISC

<t93h92$lm$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26066&group=comp.arch#26066

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Date: Fri, 24 Jun 2022 00:15:10 -0500
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t932qd$j78$1@dont-email.me>
 <87fc640e-0f6b-4522-850c-418d3e5eabc2n@googlegroups.com>
 by: BGB - Fri, 24 Jun 2022 05:15 UTC

On 6/23/2022 8:38 PM, MitchAlsup wrote:
> On Thursday, June 23, 2022 at 8:08:33 PM UTC-5, BGB wrote:
>> On 6/22/2022 8:03 PM, MitchAlsup wrote:
> <snip>
>>> The Floating Point group includes Transcendental instructions.
>>> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
>>> that are only 1 constant different in the calculations. Ln2 takes
>>> only 14 cycles, sin takes 19 cycles. These are included because
>>> they actually do improve performance.
>>>
>> No equivalent, nearly all math functions done in software in my case.
>>
>> Originally, there were no FDIV or FSQRT instructions either, but these
>> exist now.
>>
>> Current timings are:
>> FDIV: 130 cycles
>> FSQRT: 384 cycles
>>
> Mc 88100 did these in / = 56 and SQRT in ~66
> Mc 88120 did these in / = 17 and Sqrt in 22

It is partly based on the strategy used:
Rig the FMUL unit into a feedback loop;
Wait N cycles for answer to converge;
Assume it has converged on the answer.

Generally seems to take roughly this long for the algo to converge on
the answer.
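Assuming the feedback loop implements something like a Newton-Raphson reciprocal refinement (one common choice for multiply-based division; the actual core may use a different recurrence), the convergence behavior can be sketched as:

```python
def fdiv(a, d, iters=5):
    """Compute a/d using only multiplies, a subtract, and a crude seed.

    d is assumed to be a normalized significand in [1, 2); hardware would
    pull the seed from a small table instead of computing 2/(1+d).
    """
    assert 1.0 <= d < 2.0
    x = 2.0 / (1.0 + d)               # seed: relative error at most 1/3
    for _ in range(iters):
        x = x * (2.0 - d * x)         # NR step: the error is squared each pass
    return a * x
```

With the error squared per iteration, 5 passes take a 1/3 initial error down past double precision, which matches the "wait N cycles for the answer to converge" description: the cycle count is just (iterations) x (FMUL latency) plus overhead.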

When I first re-added FDIV, it was using the same basic algo (just with
slightly different inputs), and took a similar number of clock-cycles.

Then I had the idea that I could tweak a few things in the Shift-Add
integer divider, and get it to also do FDIV. Though, the way it was
rigged up still needs ~ 130 cycles, but 130 is still less than 384.

>>
>> The trig functions generally run from around 500 to 1000 cycles or so
>> (via unrolled Taylor expansion).
> <
> You need to use Chebyshev coefficients--more accurate sometimes fewer
> terms, always better error bounds..

Possible.

There are also a few faster algos, such as "lookup and interpolate",
but, while faster, these don't give sufficient precision to really be a
good option for the "math.h" functions (assumed to be accurate, even if
not the fastest possible).
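A sketch of the lookup-and-interpolate approach (the 256-entry table size is an arbitrary choice here), showing why its accuracy falls short of what math.h callers expect:

```python
import math

N = 256
# One period of sine, plus a wrap entry so interpolation never indexes
# past the end of the table.
TABLE = [math.sin(2*math.pi*i/N) for i in range(N+1)]

def sin_lut(x):
    t = (x / (2*math.pi)) % 1.0 * N       # position in table units
    i = int(t)
    frac = t - i
    # Linear interpolation between adjacent table entries.
    return TABLE[i] + frac * (TABLE[i+1] - TABLE[i])

err = max(abs(sin_lut(x) - math.sin(x))
          for x in [k*0.001 for k in range(6284)])
```

The worst-case error lands around 1e-4 territory: fine for an effect like screen warping, but several orders of magnitude short of a correctly rounded double-precision sin.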

There is also CORDIC, but I haven't really messed with it.

In any case, unrolled Taylor expansion is a few orders of magnitude
faster than calculating an exponential and factorial and performing a
floating-point divide and similar every time around the loop...

I was not the person who wrote that code originally; I am not sure why
they wrote it this way.

>>
> <<snip>
>> My case: 48 or 96 bit virtual, 48 bit physical.
>>
>> MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
>> and not allow another request to pass until the former request has
>> completed.
> <
> What are you going to do when there are 24 CPUs in a system and
> everybody wants to write to the same MMI/O page ?

It all gets serialized, with them accessing it one at a time.

Though, ideally, only device drivers and similar should be accessing
MMIO, so this isn't likely to be a huge issue.

By the time I get to 24 cores, I will probably have come up with a
different solution.

There is also the option of putting device stuff on the ringbus. I had
partly already started going this way for VRAM (it is faster to write to
the framebuffer by going through the RAM interface than by going through
the MMIO interface).

However, for accessing hardware devices, in general, one kind of wants
"slower but strictly synchronous" IO over "faster but chaotic" IO.

For VRAM, it is a little different, because generally one is trying to
push several MB/sec out to the screen and don't really care if things
are strictly in-order (if things arrive in the framebuffer in a slightly
different order than they were stored into the L1 cache, who cares?...).

>>
>> All MMIO accesses are fully synchronous from the L1 cache down to the
>> target device (unlike normal memory), though this does mean that
>> accessing MMIO carries a fairly steep performance penalty relative to
>> normal memory accesses.
>>
> The penalty is inherent in the requirements. However, My 66000 can ameliorate
> the latency by grouping multiple writes to neighboring MMI/O control registers
> into a single bus transaction. In theory, one can write all the necessary stuff
> into the control registers to cause a disk drive to DMA a disk sector wherever
> in a single write transaction to MMI/O and a single DMA write transaction
> when data returns.

I was generally accessing MMIO 32 or 64 bits at a time (depending on the
device).

No DMA at present, pretty much everything is still polling IO and similar.

So, for example, for SDcard:
Store a byte to Data register.
Load Control register.
OR a bit in loaded value.
Store modified value to Control register.
Dummy load from Status register (1)
Loop:
Load from Status register.
If BUSY, Continue.
Load byte from Data register.
Repeat until bytes have been moved.
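The polled sequence above, sketched against a mock device. The register names, bit positions, and completion timing are invented for illustration; only the shape of the loop follows the steps described.

```python
BUSY, START = 0x01, 0x02

class MockSPI:
    """Stand-in for the MMIO-mapped SPI interface."""
    def __init__(self):
        self.data, self.ctrl, self.status = 0, 0, 0
        self._countdown = 0
    def write(self, reg, val):
        if reg == "DATA":
            self.data = val
        elif reg == "CTRL":
            self.ctrl = val
            if val & START:               # kick off an 8-bit transfer
                self.status |= BUSY
                self._countdown = 2       # "completes" after 2 status polls
    def read(self, reg):
        if reg == "CTRL":
            return self.ctrl
        if reg == "STATUS":
            if self._countdown:
                self._countdown -= 1
                if self._countdown == 0:  # transfer done: reply arrives
                    self.status &= ~BUSY
                    self.data ^= 0xFF     # pretend the device answered
            return self.status
        return self.data

def xfer_byte(dev, out):
    dev.write("DATA", out)                # store byte to Data register
    ctrl = dev.read("CTRL")               # load Control register
    dev.write("CTRL", ctrl | START)       # OR in a bit, store it back
    dev.read("STATUS")                    # dummy load (hides ring latency)
    while dev.read("STATUS") & BUSY:      # loop: load Status, continue if BUSY
        pass
    return dev.read("DATA")               # load byte from Data register
```

Note how the dummy Status load absorbs the first, almost-certainly-BUSY poll, which is exactly the trick described in footnote *1 below.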

As noted, in the original form, this hit a wall at around 600 K/s.

The modified interface adds a QDATA register (64-bit), and a different
control register bit for "Transfer 8 bytes".

This QDATA version instead hits a wall at around 5 MB/s.

This interface is sufficient for SPI, but if I went to a faster mode,
pretty much as soon as I made the switch, I would be at the bandwidth
limit of this interface (and would then need to come up with something
different).

*1: When operating at "hitting the wall" speeds, the first Status load
will almost invariably be BUSY, but the second load will typically be
"not BUSY", since the SPI transfer will have completed in the time it
takes the request to travel all the way around the ring and back again.
So, a dummy load can make it faster.

Say, 13 MHz SPI gives 1.5 MB/s, but 13 MHz in UHS-I mode would boost
this up to 13 MB/s (basically, pushing 4 bits per clock-edge).
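Working out the arithmetic behind those figures: plain SPI shifts 1 bit per clock (which at 13 MHz actually comes to about 1.6 MB/s, close to the ~1.5 quoted), while the 4-bits-per-clock-edge mode moves 8 bits per clock cycle.

```python
clk_hz = 13_000_000

spi_Bps = clk_hz * 1 // 8        # plain SPI: 1 bit per clock  -> ~1.6 MB/s
uhs_Bps = clk_hz * 4 * 2 // 8    # 4 bits/edge, 2 edges/clock  -> 13 MB/s
```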

Much faster than this, and I may as well consider going "full
hardware" and memory mapping the SDcard...

>>
> <
>>> My 66000 is not just another ISA, it is a rethink of most of the components
>>> that make up a system. A context switch from one thread to another
>>> within a single GuestOS is 10 cycles. A context switch from one thread
>>> to a thread under a different GuestOS remains 10 cycles. The typical
>>> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
>>> across GuestOSs.
>>>
>>> OH, and BTW, The FP transcendentals are patented.
>> I would assume you mean FP transcendentals in hardware (in whatever way
>> they are implemented), as opposed to in-general.
> <
> You might be surprised at what was allowed in the claims.

OK.

>>
>> Their existence in things like "math.h" and so on would likely preclude
>> any sort of patent protection in the "in general" sense.
>>
> Yes, I did not reinvent ancient SW as HW. The algorithms are new (well
> different because of what one can do inside a HW function unit compared
> to what one can do using only instructions) with several unique features.
> They even bother to get the inexact bit set correctly.

OK.

In my case, they don't generally get used that heavily IME, so software
is OK so long as it is not unreasonably slow.

In cases where they would have gotten used more heavily, such as sin/cos
being used for the water-warping effects in Quake, lookup tables had
been used instead to good effect.

It is also possible to reduce these lookup tables to half float
precision, since the water warp effect doesn't seem to mind all that much.

>>
>> Very different, I have doubts about how well a lot of this could be
>> pulled off in a low-cost implementation. Best I can come up with at the
>> moment would effectively amount to faking it using lots of microcode or
>> a software-based emulation layer.
>>
> Microcode generally refers to a control machine interpreting instructions.
> Is a function unit run by ROM sequencer microcode ? What if the ROM got
> turned into equivalent gates: Is it still microcode, or just a sequencer ?
> In any event there are only 3 different sequences used (reminiscent of
> Goldschmidt DIV and SQRT sequences,)

Dunno. I was just sort of imagining doing it as a big ROM on top of a
RISC-style core, with chunks of the ISA being effectively treated like
special function calls into this ROM.

It is likely that parts of the Verilog would be procedurally generated,
such as the entry points into the various functions within this ROM.

I had considered something like this a few times in my case, but
generally ended up taking a different approach:
If I can't do it directly in hardware, I won't do it at all.

Only reason I ended up with the functionality of the RISC-V 'M'
extension was because I had thought up a way to implement it affordably.

Even then, it wasn't until earlier today that I got around to adding
"proper" support for 32-bit integer divide (reducing its latency from 68
to 36 cycles). Mostly because in some cases it was being used often
enough to become significant.


Re: Why My 66000 is and is not RISC

<2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26068&group=comp.arch#26068

Newsgroups: comp.arch
From: timcaff...@aol.com (Timothy McCaffrey)
Date: Fri, 24 Jun 2022 07:27:49 -0700 (PDT)
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
 <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com> <t92sv9$jbb$1@dont-email.me>
 by: Timothy McCaffrey - Fri, 24 Jun 2022 14:27 UTC

On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:

> X86-64 has crap code density, your one instruction stack save restore alone
> should make you significantly better, unless perhaps you have gone 32+32.
>
The X86-64 was left with a lot of baggage because of the bad design decision
to try to reuse the X86 decoder. Most of the remaining 1-byte opcodes are
either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
great if the instruction encoding had been refactored and some other cruft
removed (e.g. only being able to use CL for a dynamic shift count).

It would have also been a great time to set up the encodings so that
the instruction parser could figure out the instruction length from the first
chunk (whatever size that was; I suspect 16-bit chunks make sense).

- Tim

Re: Why My 66000 is and is not RISC

<memo.20220624160142.20756A@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=26069&group=comp.arch#26069

Newsgroups: comp.arch
From: jgd...@cix.co.uk (John Dallman)
Date: Fri, 24 Jun 2022 16:01 +0100 (BST)
References: <2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>
Reply-To: jgd@cix.co.uk
 by: John Dallman - Fri, 24 Jun 2022 15:01 UTC

In article <2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>,
timcaffrey@aol.com (Timothy McCaffrey) wrote:

> The X86-64 was left with a lot of baggage because of bad design
> decision to try and reuse the X86 decoder. Most of the remaining
> 1 byte opcodes are either barely used (STC, CLC) or deprecated
> (PUSH/POP).

Remember that the design was done by AMD, who have to tread carefully to
avoid giving Intel an excuse to claim they're breaching their X86 license
in some way.

At the time, Intel were still under the impression that Itanium was going
to conquer the world. When they realised better, AMD had Opterons on the
market. Intel wanted to build an AMD-incompatible 64-bit x86 to drive AMD
out of the market. They were restrained by Microsoft, who weren't
interested in supporting two different extended x86 ISAs.

Given how we got here, things could be a lot worse.

John

Re: Why My 66000 is and is not RISC

<t94kmo$li0$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=26070&group=comp.arch#26070

Newsgroups: comp.arch
From: tkoe...@netcologne.de (Thomas Koenig)
Date: Fri, 24 Jun 2022 15:19:52 -0000 (UTC)
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
 <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
 <t92sv9$jbb$1@dont-email.me>
 by: Thomas Koenig - Fri, 24 Jun 2022 15:19 UTC

Brett <ggtgp@yahoo.com> schrieb:

> How big is the code store needed for an IOT (Internet Of Things smart
> toaster) code stack? And what is the savings for the next size down?

It will be hard to beat the ARM Cortex-M based microcontrollers,
which are firmly embedded in the market, for which a lot of
software has been written, and which cost a bit more than four
dollars per unit.

And if that's too expensive and you do not need the performance,
you can always use an MSP430-based one for considerably less,
less than a dollar in quantity.

The ROM on the latter is somewhere between 1KB and whatever you're
willing to pay for, and the RAM 256 bytes or more. But of course
you're still getting some analog hardware thrown in, such as an
ADC or a comparator.

Not a lot of savings, I'd say.

Re: Why My 66000 is and is not RISC

<d4c43113-28fe-400c-be49-cb1b1c213119n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26071&group=comp.arch#26071

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Date: Fri, 24 Jun 2022 09:15:54 -0700 (PDT)
References: <2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com> <memo.20220624160142.20756A@jgd.cix.co.uk>
 by: MitchAlsup - Fri, 24 Jun 2022 16:15 UTC

On Friday, June 24, 2022 at 10:01:49 AM UTC-5, John Dallman wrote:
> In article <2f5c8378-de57-4ef2...@googlegroups.com>,
> timca...@aol.com (Timothy McCaffrey) wrote:
>
> > The X86-64 was left with a lot of baggage because of bad design
> > decision to try and reuse the X86 decoder. Most of the remaining
> > 1 byte opcodes are either barely used (STC, CLC) or deprecated
> > (PUSH/POP).
> Remember that the design was done by AMD, who have to tread carefully to
> avoid giving Intel an excuse to claim they're breaching their X86 license
> in some way.
>
> At the time, Intel were still under the impression that Itanium was going
> to conquer the world. When they realised better, AMD had Opterons on the
> market. Intel wanted to build an AMD-incompatible 64-bit x86 to drive AMD
> out of the market. They were restrained by Microsoft, who weren't
> interested in supporting two different extended x86 ISAs.
<
Yes, it was MS that made Intel do x86-64. Intel had a model that was within
spitting distance, and MS told them the x86-64 port was already done. So,
for the first time in its life, Intel complied.
<
But look at how they have diverged after 2 decades of being almost
identical !!
>
> Given how we got here, things could be a lot worse.
>
> John

Re: Why My 66000 is and is not RISC

<t94pu1$21b$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26072&group=comp.arch#26072

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Date: Fri, 24 Jun 2022 11:49:01 -0500
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
 <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
 <t92sv9$jbb$1@dont-email.me> <t94kmo$li0$1@newsreader4.netcologne.de>
 by: BGB - Fri, 24 Jun 2022 16:49 UTC

On 6/24/2022 10:19 AM, Thomas Koenig wrote:
> Brett <ggtgp@yahoo.com> schrieb:
>
>> How big is the code store needed for an IOT (Internet Of Things smart
>> toaster) code stack? And what is the savings for the next size down?
>
> It will be hard to beat an ARM Cortex-M - based microcontroller
> which are firmly embedded in the market, and for which a lot of
> software has been written, and which cost a bit more than four
> dollars per unit.
>

Another ISA which could potentially compete with Cortex-M might be
RISC-V RV32IMC or similar.

Pros/cons with the C extension though: it is 'dog chewed' to a point
(somewhat more so than Thumb) where I wonder about decoding cost.

Something like RV32GC would likely be a bit more expensive, as the A/F/D
extensions do a lot of stuff that I have doubts about being able to pull
off cheaply.

A more cost-effective option might be:
RV32IMZfinxZdinxC
But, not a lot of code is built for this.

> And if that's too expensive and you do not need the performance,
> you can always use a MSP430-based one for considerably less,
> less than a dollar at quantity.
>

For hobbyist use, the ones in DIP packaging (MSP430Gxxxx) were typically
being sold at several $ per chip last I bought any, but dunno about now.

QFP variants were cheaper per-chip, but QFP is much less usable (can't
use it with perfboard or DIP sockets).

They were generally cheaper than AVR8 chips, though the AVR8's typically
had more RAM and ROM space.

Performance per clock seemed to be better on MSP430 than AVR8, IME.
Though, in either case, one is not usually going to be using them for
performance-intensive tasks.

IIRC:
MSP430: 16 registers, each 16 bit, Mem/Mem addressing, Von Neumann
AVR8: 32x8b or 16x16b, Load/Store, Harvard (split code/data spaces)

> The ROM on the latter is somewhere between 1KB and whatever you're
> willing to pay for, and the RAM 256 bytes or more. But of course
> you're still getting some analog hardware thrown in, such as an
> ADC or a comparator.
>
> Not a lot of savings, I'd say.

From what I remember, for 'G' style MSP430 chips:
ROM: ~ 4K to 32K
RAM: ~ 256B to 2K
Address space, something like:
MMIO 0000..01FF
RAM 0200..09FF (Say, 0200..02FF for 256B)
(More MMIO and/or RAM, depending on device)
ROM 8000..FFFF
For smaller ROM sizes, the lower bound moves upward
FFF0..FFFF is reset/interrupt vectors.

The G chips were typically available in DIP16/20/24 packaging IIRC.

Multi-channel ADC/DAC/... are common.
IO pins are typically capable of both In/Out in digital mode;
ADC/DAC is typically limited to certain pins;
...

The 'X' chips have a larger address space, and may have considerably
more RAM and ROM space (within a 20-bit address space). But, typically
only available in QFP packaging or similar.

One can do bit-banged SPI on the MSP430, but practically one would be
limited to fairly slow IO speeds (kHz territory). Low-speed serial is
also possible.

....

Re: Why My 66000 is and is not RISC

<t94tjp$slm$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26073&group=comp.arch#26073

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Date: Fri, 24 Jun 2022 12:51:48 -0500
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
 <t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
 <7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
 <t92sv9$jbb$1@dont-email.me>
 <2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>
 by: BGB - Fri, 24 Jun 2022 17:51 UTC

On 6/24/2022 9:27 AM, Timothy McCaffrey wrote:
> On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:
>
>> X86-64 has crap code density, your one instruction stack save restore alone
>> should make you significantly better, unless perhaps you have gone 32+32.
>>
> The X86-64 was left with a lot of baggage because of bad design decision to
> try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
> either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
> great if the instruction encoding had been refactored, and some other cruft
> removed (e.g. only being able to use CL for a dynamic shift count).
>
> It would have also been a great time to be able set up the encodings so that
> the instruction parser could figure out the instruction length from the first chunk
> (whatever size that was, I suspect 16 bit chunks make sense).
>

Though, more extensive redesign would have made it effectively an
entirely new ISA, just with an x86 backward-compatibility mode.

But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
the baseline case (16/32), though one is mostly limited to 32-bit
encodings for WEX bundles.

As can be noted, x86-64 code density ranges from "kinda meh" to
"spectacularly bad", depending mostly on the compiler.

That said, i386 and Thumb2 are both a bit more competitive, kinda harder
to beat them on the code-density front.

I am not entirely sure what exactly is going on here (for x86-64) to
make the code density so bad (it is pretty bad even in size-optimized
modes). The difference is often somewhat outside of what could easily be
explained just by the REX prefix and similar.

Say, for example:
x86 does an Abs32 load, 6 bytes;
x86-64 does a RIP+Disp32 Load, 7 bytes.
Delta: 17% bigger.

Or:
x86 does a 2R-ADD, 2B
x86-64 does a 2R-ADD (w/ REX), 3B
Delta: 50% bigger.

Then again, things like REX prefix and tending to save/restore more
stack variables and similar could be a factor.

Possibly also an increase in 64 bit constant loads, ...

But oftentimes the expansion is significantly larger than the
theoretically expected 20-50% or so.

Re: Why My 66000 is and is not RISC

<315e7041-51de-4ea0-aa3c-3e7a0cf8bff0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26075&group=comp.arch#26075
From: timcaff...@aol.com (Timothy McCaffrey)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 12:31:48 -0700 (PDT)
 by: Timothy McCaffrey - Fri, 24 Jun 2022 19:31 UTC

On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
> On 6/24/2022 9:27 AM, Timothy McCaffrey wrote:
> > On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:
> >
> >> X86-64 has crap code density, your one instruction stack save restore alone
> >> should make you significantly better, unless perhaps you have gone 32+32.
> >>
> > The X86-64 was left with a lot of baggage because of bad design decision to
> > try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
> > either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
> > great if the instruction encoding had been refactored, and some other cruft
> > removed (e.g. only being able to use CL for a dynamic shift count).
> >
> > It would have also been a great time to be able set up the encodings so that
> > the instruction parser could figure out the instruction length from the first chunk
> > (whatever size that was, I suspect 16 bit chunks make sense).
> >
> Though, more extensive redesign would have made it effectively an
> entirely new ISA, just with an x86 backward-compatibility mode.
>
> But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
> the baseline case (16/32), though one is mostly limited to 32-bit
> encodings for WEX bundles.
>
>
> As can be noted, x86-64 code density ranges from "kinda meh" to
> "spectacularly bad", depending mostly on the compiler.
>
> That said, i386 and Thumb2 are both a bit more competitive, kinda harder
> to beat them on the code-density front.
>
>
> I am not entirely sure what exactly is going on here (for x86-64) to
> make the code density so bad (it is pretty bad even in size-optimized
> modes). The difference is often somewhat outside of what could easily be
> explained just by the REX prefix and similar.
>
>
> Say, for example:
> x86 does an Abs32 load, 6 bytes;
> x86-64 does a RIP+Disp32 Load, 7 bytes.
> Delta: 17% bigger.
>
> Or:
> x86 does an 2R-ADD, 2B
> x86-64 does a 2R-ADD (w/ REX), 3B
> Delta: 50% bigger.
>
> Then again, things like REX prefix and tending to save/restore more
> stack variables and similar could be a factor.
>
> Possibly also an increase in 64 bit constant loads, ...
>
> But, often times, the expansion is significantly larger than the
> theoretically expected 20-50% or so.

You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
you have to waste a register loading the constant and then use it.

The calling ABI is much different from that of the 386, where you usually just
pushed stuff on the stack. Now you have some stuff in registers (which registers
depends on whether you are running Windows or Linux) and some stuff on the stack.

Some registers you are required to save before the call (caller save)
and others after the call (callee save).

Since you are (usually) not saving values with a simple push in the subroutine,
you go from a 1-byte PUSH to a 5- or 6-byte MOV to the stack.

Simple INC/DEC doubled in size, so most code now probably uses ADD instead
(which is probably faster anyway, since there is no partial condition-code update).

Due to stack and structure storage expanding for 8-byte values (e.g. pointers),
you can only store so many of them within reach of an 8-bit offset. Unfortunately,
the next step up is a 32-bit offset.

And probably a bunch of other stuff I've forgotten....

- Tim

Re: Why My 66000 is and is not RISC

<245f464d-0a73-4e08-8e59-f46db97e83e2n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=26076&group=comp.arch#26076

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a1c:f317:0:b0:39c:831c:bd43 with SMTP id q23-20020a1cf317000000b0039c831cbd43mr760636wmq.139.1656100585722;
Fri, 24 Jun 2022 12:56:25 -0700 (PDT)
X-Received: by 2002:a05:6214:500e:b0:470:a553:9d1c with SMTP id
jo14-20020a056214500e00b00470a5539d1cmr676317qvb.16.1656100585003; Fri, 24
Jun 2022 12:56:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 24 Jun 2022 12:56:24 -0700 (PDT)
In-Reply-To: <2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:c0f9:4a5e:187c:bb19;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:c0f9:4a5e:187c:bb19
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
<t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
<7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com> <t92sv9$jbb$1@dont-email.me>
<2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <245f464d-0a73-4e08-8e59-f46db97e83e2n@googlegroups.com>
Subject: Re: Why My 66000 is and is not RISC
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 24 Jun 2022 19:56:25 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 24 Jun 2022 19:56 UTC

On Friday, June 24, 2022 at 9:27:53 AM UTC-5, timca...@aol.com wrote:
> On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:
>
> > X86-64 has crap code density, your one instruction stack save restore alone
> > should make you significantly better, unless perhaps you have gone 32+32.
> >
> The X86-64 was left with a lot of baggage because of bad design decision to
> try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
> either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
> great if the instruction encoding had been refactored, and some other cruft
> removed (e.g. only being able to use CL for a dynamic shift count).
>
> It would have also been a great time to be able set up the encodings so that
> the instruction parser could figure out the instruction length from the first chunk
> (whatever size that was, I suspect 16 bit chunks make sense).
<
I worked on some x86 decode mechanisms while at AMD, and learned a lot about
x86 encoding {which I still consider BETTER than SPARC-Vis}
<
My 66000 ISA format and encoding is a direct result of this, and indeed, follows
your tenet of having everything needed to determine size in the first word.
>
> - Tim

Re: Why My 66000 is and is not RISC

<d28c3207-b7ad-44ed-8614-29eb065aecd6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26077&group=comp.arch#26077
From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 13:02:50 -0700 (PDT)
 by: MitchAlsup - Fri, 24 Jun 2022 20:02 UTC

On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
> >
> > Possibly also an increase in 64 bit constant loads, ...
> >
> > But, often times, the expansion is significantly larger than the
> > theoretically expected 20-50% or so.
<
> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
> you have to waste a register loading the constant and then use it.
<
My 66000 does not have this problem. AND while BGB may be able to get by with
this restriction now, he won't 10 years hence.
>
> The calling ABI is much different than the 386, where you just usually pushed stuff
> on the stack. Now you have some stuff in registers (which registers depends on whether
> you are running Windows or Linux) and some stuff on the stack.
>
> Some registers you are required to save before the call (caller save)
> and others after the call (callee save).
<
With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
like R9-R15 are simply temps used whenever and forgotten.
>
> Since you are not saving values with a simple push (usually) in the subroutine, you go
> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
<
I go to a single instruction that pushes as much stuff as desired (by compiler)
and then allocates a stack frame for the local-variables.

Re: Why My 66000 is and is not RISC

<t957fh$vba$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=26078&group=comp.arch#26078
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 20:40:17 -0000 (UTC)
 by: Thomas Koenig - Fri, 24 Jun 2022 20:40 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

First, thanks again for the good explanations.

> DBLE is an instruction modifier that supplies register encodings and
> adds 64-bits to the calculation width of the modified instruction. Applied
> to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> and presto: we get FP128 by adding exactly 1 instruction,

This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
introduced this; POWER still has it as the only option up to POWER8.
POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
from double double to IEEE FP for POWER 9+ (I helped a bit in that
transition, for gfortran).

> the compiler
> can pick any 8 registers it desires alleviating register allocation concerns.

Eight registers is a lot if there are only 32 to go around...

> DBLE is a "get by" kind of addition, frowned upon by Hennessey.
>
> I can envision a SIMD instruction modifier that defines the SIMD parameters
> of several subsequent instructions and allows 64-bit SIMD to transpire.
> I am still thinking about these. What I cannot envision is a wide SIMD
> register file--this is what VVM already provides.

I think a lot of the use cases could also be covered if the
processor were able to process int8..int64 and fp16..fp64
(with fp128 being an exception) at the width of a SIMD unit,
so something like

MOV R4,#0
VEC {R5}
LDUH R6,[R10+R4] ! Load half float into R6
LDUH R7,[R11+R4] ! Second one
FADD.F2 R7,R7,R6
STH R7,[R12+R4]
ADD R4,R4,#2
LOOP (something)

could be executed at full SIMD width. Is this feasible? Or
would it be better to do this kind of thing via SIMD?

Re: Why My 66000 is and is not RISC

<t958oj$hgq$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26079&group=comp.arch#26079
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 14:02:11 -0700
 by: Ivan Godard - Fri, 24 Jun 2022 21:02 UTC

On 6/24/2022 12:31 PM, Timothy McCaffrey wrote:
> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
>> On 6/24/2022 9:27 AM, Timothy McCaffrey wrote:
>>> On Thursday, June 23, 2022 at 7:28:53 PM UTC-4, gg...@yahoo.com wrote:
>>>
>>>> X86-64 has crap code density, your one instruction stack save restore alone
>>>> should make you significantly better, unless perhaps you have gone 32+32.
>>>>
>>> The X86-64 was left with a lot of baggage because of bad design decision to
>>> try and reuse the X86 decoder. Most of the remaining 1 byte opcodes are
>>> either barely used (STC, CLC) or deprecated (PUSH/POP). It would have been
>>> great if the instruction encoding had been refactored, and some other cruft
>>> removed (e.g. only being able to use CL for a dynamic shift count).
>>>
>>> It would have also been a great time to be able set up the encodings so that
>>> the instruction parser could figure out the instruction length from the first chunk
>>> (whatever size that was, I suspect 16 bit chunks make sense).
>>>
>> Though, more extensive redesign would have made it effectively an
>> entirely new ISA, just with an x86 backward-compatibility mode.
>>
>> But, yeah, 16-bit chunking makes sense, this is what I use in my ISA in
>> the baseline case (16/32), though one is mostly limited to 32-bit
>> encodings for WEX bundles.
>>
>>
>> As can be noted, x86-64 code density ranges from "kinda meh" to
>> "spectacularly bad", depending mostly on the compiler.
>>
>> That said, i386 and Thumb2 are both a bit more competitive, kinda harder
>> to beat them on the code-density front.
>>
>>
>> I am not entirely sure what exactly is going on here (for x86-64) to
>> make the code density so bad (it is pretty bad even in size-optimized
>> modes). The difference is often somewhat outside of what could easily be
>> explained just by the REX prefix and similar.
>>
>>
>> Say, for example:
>> x86 does an Abs32 load, 6 bytes;
>> x86-64 does a RIP+Disp32 Load, 7 bytes.
>> Delta: 17% bigger.
>>
>> Or:
>> x86 does an 2R-ADD, 2B
>> x86-64 does a 2R-ADD (w/ REX), 3B
>> Delta: 50% bigger.
>>
>> Then again, things like REX prefix and tending to save/restore more
>> stack variables and similar could be a factor.
>>
>> Possibly also an increase in 64 bit constant loads, ...
>>
>> But, often times, the expansion is significantly larger than the
>> theoretically expected 20-50% or so.
>
> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
> you have to waste a register loading the constant and then use it.
>
> The calling ABI is much different than the 386, where you just usually pushed stuff
> on the stack. Now you have some stuff in registers (which registers depends on whether
> you are running Windows or Linux) and some stuff on the stack.
>
> Some registers you are required to save before the call (caller save)
> and others after the call (callee save).
>
> Since you are not saving values with a simple push (usually) in the subroutine, you go
> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
>
> Simple INC/DEC doubled in size, so probably most code now uses ADD instead (which
> is probably faster because you don't have a partial CC update).
>
> Due to stack and structure storage expanding for 8 byte values (e.g. pointers),
> you can only store so many of them there using an 8 bit offset. Unfortunately,
> the next step up is a 32 bit offset.
>
> And probably a bunch of other stuff I've forgotten....
>
> - Tim

Binary compatibility is a real bear. Either leave a *lot* of free
entropy (my66), or push the problem to the software and build machinery
(Mill), or suffer bloat and slow decode (x86, RISCV).

Re: Why My 66000 is and is not RISC

<t95bpp$77b$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26080&group=comp.arch#26080
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 21:54:02 -0000 (UTC)
 by: Brett - Fri, 24 Jun 2022 21:54 UTC

Thomas Koenig <tkoenig@netcologne.de> wrote:
> Brett <ggtgp@yahoo.com> schrieb:
>
>> How big is the code store needed for an IOT (Internet Of Things smart
>> toaster) code stack? And what is the savings for the next size down?
>
> It will be hard to beat an ARM Cortex-M - based microcontroller
> which are firmly embedded in the market, and for which a lot of
> software has been written, and which cost a bit more than four
> dollars per unit.
>
> And if that's too expensive and you do not need the performance,
> you can always use a MSP430-based one for considerably less,
> less than a dollar at quantity.
>
> The ROM on the latter is somewhere between 1KB and whatever you're
> willing to pay for, and the RAM 256 bytes or more. But of course
> you're still getting some analog hardware thrown in, such as an
> ADC or a comparator.
>
> Not a lot of savings, I'd say.
>

You are missing the 'I' in Internet: no WiFi that I can find in that chip.

Talking about a network stack to talk to your phone. Smart color changing
lightbulbs and soon all the appliances in your home, washer, dryer, stove,
microwave, thermostat, security cameras, just everything.

Plus your home router, which uses a much more powerful wifi block and CPU.

There are markets here that will pay for better code density, assuming a
network stack is significant?

Re: Why My 66000 is and is not RISC

<cd887041-9f82-4309-a2cb-83760490f610n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26081&group=comp.arch#26081
From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 14:55:08 -0700 (PDT)
 by: MitchAlsup - Fri, 24 Jun 2022 21:55 UTC

On Friday, June 24, 2022 at 3:40:20 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
>
> First, thanks again for the good explanations.
> > DBLE is an instruction modifier that supplies register encodings and
> > adds 64-bits to the calculation width of the modified instruction. Applied
> > to a FP instruction: DBLE Rd1,Rs11,Rs22,Rs33; FMAC Rd2,Rs12,Rs22,Rs33
> > we execute: FMUL {Rd1,Rd2},{Rs11,Rs22},{Rs21,Rs22},{Rs31,Rs32}
> > and presto: we get FP128 by adding exactly 1 instruction,
<
> This means pair-of-doubles 128-bit, not IEEE 128-bit. I think S/360
<
No, this means FP with a 14-bit exponent and 113-bit fraction (if my math is
right); pairs of doubles are available using exact FP arithmetic via CARRY,
not DBLE.
<
> introduced this; POWER still has it as the only option up to POWER8.
> POWER9 has hardware support for IEEE 128-bit, and IBM is moving away
> from double double to IEEE FP for POWER 9+ (I helped a bit in that
> transition, for gfortran).
> > the compiler
> > can pick any 8 registers it desires alleviating register allocation concerns.
<
> Eight registers is a lot if there are only 32 to go around...
<
It is not a machine designed to crunch FP128 all the time.
It is a machine designed so the occasional use is satisfactory.
<
> > DBLE is a "get by" kind of addition, frowned upon by Hennessey.
> >
> > I can envision a SIMD instruction modifier that defines the SIMD parameters
> > of several subsequent instructions and allows 64-bit SIMD to transpire.
> > I am still thinking about these. What I cannot envision is a wide SIMD
> > register file--this is what VVM already provides.
<
> I think a lot of the use cases could also be covered if the
> processor were able to process int8 ... int64 and fp16..fp64
> (with fp128 being an exception) at the width of an SIMD unit,
> so something like
>
> MOV R4,#0
> VEC {R5}
> LDUH R6,[R10+R4] ! Load half float into R6
> LDUH R7,[R11+R4] ! Second one
> FADD.F2 R7,R7,R6
> STH R7,[R12+R4]
> ADD R4,R4,#2
> LOOP (something)
>
> could be executed at full SIMD with. Is this feasible? Or
> would it be better to do this kind of thing via SIMD?

Re: Why My 66000 is and is not RISC

<745d5f3a-201e-4988-9b90-d2db7172c0a2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26082&group=comp.arch#26082
From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 14:57:38 -0700 (PDT)
 by: MitchAlsup - Fri, 24 Jun 2022 21:57 UTC

On Friday, June 24, 2022 at 4:54:05 PM UTC-5, gg...@yahoo.com wrote:
> Thomas Koenig <tko...@netcologne.de> wrote:
> > Brett <gg...@yahoo.com> schrieb:
> >
> >> How big is the code store needed for an IOT (Internet Of Things smart
> >> toaster) code stack? And what is the savings for the next size down?
> >
> > It will be hard to beat an ARM Cortex-M - based microcontroller
> > which are firmly embedded in the market, and for which a lot of
> > software has been written, and which cost a bit more than four
> > dollars per unit.
> >
> > And if that's too expensive and you do not need the performance,
> > you can always use a MSP430-based one for considerably less,
> > less than a dollar at quantity.
> >
> > The ROM on the latter is somewhere between 1KB and whatever you're
> > willing to pay for, and the RAM 256 bytes or more. But of course
> > you're still getting some analog hardware thrown in, such as an
> > ADC or a comparator.
> >
> > Not a lot of savings, I'd say.
> >
> You are missing the I in internet, no wifi I can find in that chip.
>
> Talking about a network stack to talk to your phone. Smart color changing
> lightbulbs and soon all the appliances in your home, washer, dryer, stove,
> microwave, thermostat, security cameras, just everything.
>
> Plus your home router, which uses a much more powerful wifi block and CPU.
>
> There are markets here that will pay for better code density, assuming a
> network stack is significant?
<
I don't see it:: a 10G or 100G network interface already has a large enough memory
footprint (for its own buffering concerns) that skimping on the CPU and ROM seems
a waste.

Re: Why My 66000 is and is not RISC

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Why My 66000 is and is not RISC
Date: Fri, 24 Jun 2022 19:55:44 -0500
Organization: A noiseless patient Spider
Lines: 198
Message-ID: <t95mek$33n$1@dont-email.me>
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
<t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
<7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com>
<t92sv9$jbb$1@dont-email.me>
<2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com>
<t94tjp$slm$1@dont-email.me>
<315e7041-51de-4ea0-aa3c-3e7a0cf8bff0n@googlegroups.com>
<d28c3207-b7ad-44ed-8614-29eb065aecd6n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Jun 2022 00:55:48 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="bc60642288ede9e52880f470535a9ad3";
logging-data="3191"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/+MqTRRIOVVSUXWhAUcON5"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.10.0
Cancel-Lock: sha1:4vPtRVHXCe/ykHFdXrZBe28FGCw=
In-Reply-To: <d28c3207-b7ad-44ed-8614-29eb065aecd6n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 25 Jun 2022 00:55 UTC

On 6/24/2022 3:02 PM, MitchAlsup wrote:
> On Friday, June 24, 2022 at 2:31:52 PM UTC-5, timca...@aol.com wrote:
>> On Friday, June 24, 2022 at 1:51:56 PM UTC-4, BGB wrote:
>>>
>>> Possibly also an increase in 64 bit constant loads, ...
>>>
>>> But, often times, the expansion is significantly larger than the
>>> theoretically expected 20-50% or so.
> <
>> You can't have a 64 bit constant in an instruction, except for immediate load (IIRC), so
>> you have to waste a register loading the constant and then use it.
> <
> My 66000 does not have this problem. AND while BGB may be able to get by with
> this restriction now, you won't in 10 years hence.

Still better in my case than it is in RISC-V where this case would
require a memory load...

As-is (in BJX2), encodings have been defined ("on paper"), e.g., Imm56 forms for
some instructions; they just haven't been put into use yet.

Partly it is a combination of:
Non-zero decoding cost;
It is pretty rare to exceed the existing 33-bit limit for 3RI ops.

Spending 1 extra cycle to load a constant into a register isn't usually
a huge issue.

IME, the vast majority of 64-bit constant loads thus far tend to be
Binary64 constants; usually irrational or repeating 'double' constants
or similar (most other constants will be compacted down to a smaller
format).

Some constant-load stats (from my GLQuake port):
Imm8: 11% (Byte range)
Imm16: 70% (Int16 or UInt16)
Binary16: 12% (Double encoded as Half-Float)
Imm32/33: 4.4%
Imm33s: 2.2% (Int32 or UInt32, Zero/Sign Extend)
Imm32Hi: 1.1% (Int32 into high-order 32 bits, low 32 are 0)
Binary32: 0.8% (Double as a pair of Binary32)
2xBinary16: 0.3% (2xBinary32 as 2xBinary16)
Imm64: 2.5% (Fallback Case)
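The compaction decision above can be sketched roughly as follows (a hypothetical
classifier, not BGB's actual compiler logic; integer categories only, with the
Binary16/Binary32 float re-encodings omitted):

```python
def classify_const(v):
    """Pick the smallest constant-load category a 64-bit value fits in.
    Hypothetical sketch mirroring the categories listed above."""
    v &= (1 << 64) - 1                    # treat as a raw 64-bit pattern
    s = v - (1 << 64) if v >> 63 else v   # signed interpretation
    if 0 <= v <= 0xFF:
        return "Imm8"                     # byte range
    if v <= 0xFFFF or -0x8000 <= s < 0:
        return "Imm16"                    # Int16 or UInt16
    if v <= 0xFFFFFFFF or -0x80000000 <= s < 0:
        return "Imm33s"                   # Int32/UInt32, zero/sign extend
    if v & 0xFFFFFFFF == 0:
        return "Imm32Hi"                  # high 32 bits only, low 32 zero
    return "Imm64"                        # fallback: full 64-bit load
```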

Some "rarely used" types:
Load value into the high 10 bits of target (rare);
Say: zzz0000000000000
Load bits into the middle of a 64-bit value (rare);
Say: 0000zzzzzzzz0000
Load 4xFP16 encoded as 4xFP8;
...
These cases seem to be rare enough to be mostly ignored.

Grouping constant loads by instruction length:
16-bit: 11%
32-bit: 82%
64-bit: 4.4%
96-bit: 2.5%

Note that this is only for discrete constant loads, and does not count
immediate values or displacements.

Constant loads reflect ~ 7.4% of the total Jumbo prefixes, with the rest
going into immediate fields.

Calculating stats:
85% Imm9/Disp9
15% Imm33/Disp33

Rough estimate of the upper bound of overflowed immediate cases:
Less than 3% (Excluding Load/Store ops)
Less than 0.4% (Including Load/Store ops)

Where the Imm/Disp balance is roughly:
89% Disp (Load/Store Displacements)
11% Imm (Immediates for ALU instructions and similar).

However, given that most of the 64-bit constants (dumped into a log) are
fairly obviously either MMIO addresses or floating-point constants, the
actual bound for overflowing the 33-bit immediate range is likely much
smaller.

I don't have a stat for the relative use of Jumbo between Imm and Disp
encodings, however (based on what I typically see in disassembly dumps),
I would estimate Disp to be the dominant case.

It can be noted that a significant chunk of the cases which are being
encoded as Imm33/Disp33 could also be handled by Imm17/Disp17 encodings
(my compiler doesn't typically use these unless the instruction is
*also* using XGPR).

While arguably code "could change" here, such as due to ever-expanding
memory usage, I suspect this is less likely to be an issue in a
statistical sense.

The main thing that would be the "likely existential risk" for this,
would be programs exceeding 4GB in the ".bss" section, which would
require a bigger displacement.

For x86-64, one would run into a similar problem if text+data+bss
exceeds 2GB (thus breaking ones' ability to use RIP-relative addressing).
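That x86-64 limit follows from RIP-relative addressing using a signed 32-bit
displacement; as a quick illustration (a hypothetical helper, not from the
thread):

```python
def rip_reachable(rip, target):
    """True if `target` can be addressed from `rip` with a signed
    32-bit RIP-relative displacement (the x86-64 limit noted above)."""
    disp = target - rip
    return -(1 << 31) <= disp < (1 << 31)
```

Once text+data+bss spans more than 2GB, some (rip, target) pairs necessarily
fail this check, and the compiler has to fall back to absolute or
multi-instruction addressing.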

>>
>> The calling ABI is much different than the 386, where you just usually pushed stuff
>> on the stack. Now you have some stuff in registers (which registers depends on whether
>> you are running Windows or Linux) and some stuff on the stack.
>>
>> Some registers you are required to save before the call (caller save)
>> and others after the call (callee save).
> <
> With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
> I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> like R9-R15 are simply temps used whenever and forgotten.

That is presumably how it is supposed to be...

In my case, it is roughly a 50/50 split between caller save (scratch)
and callee save (preserved) registers.

For leaf functions, one wants a lot of scratch registers, and for
non-leaf functions, a lot of callee-save registers.

But, sadly, no party can be entirely happy:
Leaf functions wishing they could have more registers to play with,
without needing to save them first;
Non-leaf functions wishing they could have more registers for variables
which won't get stomped on the next call;
....

Can note that, IIRC:
Win64 gave a bigger part of this pie to callee-save;
SysV/AMD64 gave a bigger part of the pie to caller-save.

A roughly even split seemed like an easy answer, lacking any good way to
find a general/optimal balance across a range of programs.

Conceivably, it could also be possible to have a certain number of
"flexible" registers which a compiler could use to "fine tune" the
balance in the ABI, but these would be annoying at DLL/SO edges, as it
would require "worst case" handling (treating them like caller-save when
calling an import, and like callee-save for DLL exports).

In such an ABI, likely:
2/3: Nominally Callee Save
1/3: Caller Save / Scratch
With 1/3 of the register space able to be re-balanced from callee-save
to caller-save by the compiler.
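Numerically, the split sketched above works out like this (a hypothetical model
of the 2/3-1/3 partition for an N-register file; `abi_split` and its parameters
are illustrative names, not from any real ABI):

```python
def abi_split(nregs, flex_to_scratch=0):
    """Partition `nregs` registers per the scheme above:
    2/3 nominally callee-save, 1/3 caller-save/scratch, where the
    compiler may rebalance up to 1/3 of the file from callee-save
    to scratch.  At DLL/SO edges the rebalanced registers would need
    worst-case treatment (scratch when calling an import, callee-save
    for exports)."""
    base_callee = (2 * nregs) // 3
    base_scratch = nregs - base_callee
    flex = nregs // 3                          # rebalanceable pool
    flex_to_scratch = min(flex_to_scratch, flex)
    return (base_callee - flex_to_scratch,     # effective callee-save
            base_scratch + flex_to_scratch)    # effective scratch
```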

>>
>> Since you are not saving values with a simple push (usually) in the subroutine, you go
>> from a 1 byte PUSH to a 5 or 6 byte MOV to stack.
> <
> I go to a single instruction that pushes as much stuff as desired (by compiler)
> and then allocates a stack frame for the local-variables.
>

I once had PUSH/POP in BJX2, but then I dropped them (mostly for
cost-saving reasons; after noting that adjusting the stack-pointer and
then using a series of stores, or performing a series of loads and then
adjusting the stack pointer, could be similarly effective).

So, it is basically using Load/Store instructions...

However, in most cases:
MOV.X Rn, (SP, Disp4*8)
Can also be encoded in a 16-bit instruction format...

x86-64 would need 2x as many instructions here, and each instruction
would also need 5-bytes to encode, ...

So, roughly a 500% encoding-cost delta in this case for x86-64 vs BJX2.

Then again, can also note that I am often seeing around a 300% delta
between BJX2 and x86-64 in terms of ".text" sizes and similar.

Though, I still tend to fall a bit short of being able to match Thumb2
or similar at this game...

Re: Why My 66000 is and is not RISC

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 24 Jun 2022 18:36:01 -0700 (PDT)
In-Reply-To: <t95mek$33n$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:c0f9:4a5e:187c:bb19;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:c0f9:4a5e:187c:bb19
References: <f647da4a-0617-4292-9a1b-b3be674150cbn@googlegroups.com>
<t90vhb$1m8n$1@gioia.aioe.org> <t9134k$3tg$1@dont-email.me>
<7dc3d1b2-3c75-4f38-b15e-3e9b5c2cbc0dn@googlegroups.com> <t92sv9$jbb$1@dont-email.me>
<2f5c8378-de57-4ef2-8e24-01209e4c7a20n@googlegroups.com> <t94tjp$slm$1@dont-email.me>
<315e7041-51de-4ea0-aa3c-3e7a0cf8bff0n@googlegroups.com> <d28c3207-b7ad-44ed-8614-29eb065aecd6n@googlegroups.com>
<t95mek$33n$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4cde7f77-e3d5-48bb-bf6d-8329cb91f805n@googlegroups.com>
Subject: Re: Why My 66000 is and is not RISC
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Jun 2022 01:36:01 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 25 Jun 2022 01:36 UTC

On Friday, June 24, 2022 at 7:55:52 PM UTC-5, BGB wrote:
> On 6/24/2022 3:02 PM, MitchAlsup wrote:

> > With <realistically> 30 64-bit registers in use by the compiler and 16 of these preserved,
> > I am not seeing very much caller-save register traffic from Brian's LLVM port. It is more
> > like R9-R15 are simply temps used whenever and forgotten.
> That is presumably how it is supposed to be...
>
>
> In my case, it is roughly a 50/50 split between caller save (scratch)
> and callee save (preserved) registers.
<
I, too, have 50%/50%:: R1-15 are temps, R16-30 are preserved.
R0 receives Return Address, R31 is Stack Pointer. ½ of the temps
can be used to carry arguments and results covering the 98%-ile.
>
> For leaf functions, one wants a lot of scratch registers, and for
> non-leaf functions, a lot of callee-save registers.
>
> But, sadly, no party can be entirely happy:
> Leaf functions wishing they could have more registers to play with,
> without needing to save them first;
> Non-leaf functions wishing they could have more registers for variables
> which wont get stomped on the next call;
> ...
>
>
> Can note that, IIRC:
> Win64 gave a bigger part of this pie to callee-save;
> SysV/AMD64 gave a bigger part of the pie to caller-save.
<
CRAY-1 had only temp registers at the call/return interface. (Lee Higbe circa 1990)
IBM 360 had only preserved registers.
VAX had only preserved registers--both had 16 registers.
>
> A roughly even split seemed like an easy answer, lacking any good way to
> find a general/optimal balance across a range of programs.
>
The choice is a lot easier 50%/50% when you have 32 registers.
>
<snip>
> >
> I once had PUSH/POP in BJX2, but then I dropped them (mostly for
> cost-saving reasons; after noting that adjusting the stack-pointer and
> then using a series of stores, or performing a series of loads and then
> adjusting the stack pointer, could be similarly effective).
<
Push instructions make::
PUSH R1
PUSH R2
PUSH R3
more expensive than:
SUB SP,SP,#12
ST R1,[SP+8]
ST R2,[SP+4]
ST R3,[SP]
due to the serial dependency.
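A toy dependency-depth count illustrates the point (a hypothetical model:
1-cycle ops, unlimited issue width, an op starts once all its inputs are ready;
`critical_path` is an illustrative helper, not from the thread):

```python
def critical_path(ops):
    """ops: list of (name, inputs, outputs).  Returns the dataflow
    critical-path length in cycles under the model above."""
    ready = {}   # register -> cycle its value becomes available
    depth = 0
    for name, ins, outs in ops:
        start = max((ready.get(r, 0) for r in ins), default=0)
        for r in outs:
            ready[r] = start + 1
        depth = max(depth, start + 1)
    return depth

# Three PUSHes: each reads and writes SP, so they serialize.
pushes = [("PUSH R1", ["SP", "R1"], ["SP"]),
          ("PUSH R2", ["SP", "R2"], ["SP"]),
          ("PUSH R3", ["SP", "R3"], ["SP"])]

# SUB then three stores: only the SUB is on the SP dependency chain,
# so the stores can all issue in parallel one cycle later.
stores = [("SUB SP,SP,#12", ["SP"], ["SP"]),
          ("ST R1,[SP+8]",  ["SP", "R1"], []),
          ("ST R2,[SP+4]",  ["SP", "R2"], []),
          ("ST R3,[SP]",    ["SP", "R3"], [])]
```

Under this model the PUSH sequence has depth 3 while the SUB-plus-stores
sequence has depth 2, and the gap widens as more registers are saved.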
<
The peep hole HW optimizer in K9 would perform this transformation.
{Yes, the optimizer was a piece of HW the compiler knew nothing about.}
