novaBBS - comp.arch - Re: Sequencer vs microcode

On 6/25/2021 2:52 PM, MitchAlsup wrote:
> On Friday, June 25, 2021 at 2:17:23 PM UTC-5, BGB wrote:
>> On 6/25/2021 12:51 PM, Marcus wrote:
>>> On 2021-06-24, MitchAlsup wrote:
>>>> On Thursday, June 24, 2021 at 1:56:56 AM UTC-5, Marcus wrote:
>>>>> I guess this is mostly a question for Mitch...
>>>>>
>>>>> I have seen the terms "sequencer" and "microcode" being used, and while
>>>>> I think that I know what microcode is, I'm not sure what the definition
>>>>> of a sequencer is.
>>>>>
>>>>> AFAIK microcode (at least in the context of a traditional multi-CPI
>>>>> machine) sits in the decoder and pumps out "micro instructions" from
>>>>> a microcode ROM - often more than one micro instruction per instruction.
>>>> <
>>>> There are function units which have a myriad of sequences to run and they
>>>> might be implemented with microcode, too. You are restricting microcode
>>>> to the "pipeline control" (Decoder through Writeback).
>>>> <
>>>> But microcode requires a ROM that is programmable after it is designed
>>>> and laid out. Often with a few additional words that are programmable
>>>> even so far as after power on.
>>>> <
>>>>> I guess that in modern CISC implementations (e.g. x86) the microcode is
>>>>> part of the translation front end, and probably only as an optional part
>>>>> to cover for the cases that the fast-path translators can not deal with.
>>>>>
>>>>> Anyway... How about a sequencer? My understanding is that it is a
>>>>> simpler variant that does not need a microcode ROM - e.g. it could be an
>>>>> FSM or just a counter, right?
>>>> <
>>>> A sequencer is a machine that receives various inputs and asserts a
>>>> sequence of control lines that cause the HW to perform "stuff" in
>>>> certain sequences.
>>>> Early RISC machines would have a single sequencer "run" the pipeline.
>>>> Some RISC machines had a sequencer that "ran" the FP ADD unit (FADD,
>>>> FCMP, FDIV,...) or "ran" the FP MUL unit (FMUL, IMUL, UMUL).
>>>> <
>>>> Many people have a mistaken belief that a sequencer necessarily has a
>>>> single point of control. This is not correct: a sequencer for a PDP-11
>>>> will have at least 4 sequencers within the "sequencer" (in-the-large):
>>>> 1 of these sequences out Fetch requests, 2 of these sequence out
>>>> address modes, and 1 sequences out the data path. {This is why HW is
>>>> not SW and never can be}
>>>>>
>>>>> For instance could the vector control unit in my scalar MRISC32-A1
>>>>> implementation, which is basically a counter that stalls the front end
>>>>> while iterating over vector register elements, be considered to be a
>>>>> sequencer? How about the control unit for an iterative divider?
>>>> <
>>>> Both of these are sequencers, and a "sequencer" can be implemented in
>>>> several ways, gates and flip-flops at one end of the spectrum, microcode
>>>> at the other end.
>>>>>
>>>>> When talking about using a sequencer to implement My 66000 instructions
>>>>> such as ENTER, EXIT and MM - where would those sequencers typically be
>>>>> located? E.g. in decode or execute?
>>>> <
>>>> AGEN would have these sequencers, after all, most of what they do is to
>>>> spew addresses at the cache. The "math" the ENTER and EXIT perform
>>>> is easily a subset of AGEN arithmetic.
>>>>>
>>>>> Would they output a sequence of (internal) instructions (which I assume
>>>>> would be the case if they sit in the decoder), or would they act more
>>>>> like controllers (which I assume would be the case if they sit in the
>>>>> EU:s)?
>>>> <
>>>> It's more like the instruction shows up (ENTER) and AGEN spews out cache
>>>> addresses and register read requests to stream registers onto the stack
>>>> and finishes with stores to SP and optionally to FP.
>>>> <
>>>> The front end sees 1 instruction and then the AGEN unit is busy until it
>>>> is no longer busy. The front end is allowed to start executing
>>>> instructions that are not dependent on the register to stack movement
>>>> and do not need the AGEN unit.
>>>
>>> Thanks for taking the time to explain! That cleared things up a great
>>> deal for me.
>>>
>> I had previously considered possibly adding such a "sequencer" to BJX2,
>> which could have allowed a few operations which can't currently be done
>> via a single operation; or possibly allow "MOV.X" operations on a 1-wide
>> core, ...
> <
> When I did this in Mc 88100, we simply made the AGEN unit go busy if it were
> processing a DoubleWord LD or ST. This was the same Busy the main pipeline
> would get if the Cache took a miss or a snoop prevented a LD or ST from
> accessing the cache.

OK. I had considered that ID1 would do something funky and turn it into
a pair of MOV.Q ops ...

But, then it is unclear how much sense it makes in this case, vs just
having the compiler emit a pair of MOV.Q ops if targeting a core which
doesn't have MOV.X ...
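
For concreteness, the "ID1 turns it into a pair of MOV.Q ops" idea could be modeled roughly like this. It is a C sketch rather than Verilog, and everything in it is an assumption made for illustration: the op kinds, the struct layout, and in particular that MOV.X loads the pair Rn/Rn+1 from displacements +0/+8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical op kinds for illustration; not actual BJX2 encodings. */
typedef enum { OP_OTHER, OP_MOVQ_LD, OP_MOVX_LD } opkind_t;

typedef struct {
    opkind_t kind;
    uint8_t  rn;    /* destination register (low half of the pair for MOV.X) */
    uint8_t  rm;    /* base register */
    int32_t  disp;  /* byte displacement */
} op_t;

/* State held in the decode stage between clocks. */
typedef struct {
    bool pending;   /* second MOV.Q still has to be emitted */
    op_t second;
} cracker_t;

/*
 * Model one clock of the decode stage.
 * Returns true if 'in' was consumed this cycle; false means the input
 * side (IF) must stall and present the same op again next cycle.
 */
bool crack_step(cracker_t *c, const op_t *in, op_t *out)
{
    assert(c && in && out);

    if (c->pending) {           /* emit the owed second half, hold the input */
        *out = c->second;
        c->pending = false;
        return false;
    }

    if (in->kind == OP_MOVX_LD) {
        /* Assumption: MOV.X loads Rn from [Rm+disp] and Rn+1 from [Rm+disp+8]. */
        *out = *in;
        out->kind = OP_MOVQ_LD;
        c->second = *out;
        c->second.rn   = (uint8_t)(in->rn + 1);
        c->second.disp = in->disp + 8;
        c->pending = true;
        return true;
    }

    *out = *in;                 /* everything else passes straight through */
    return true;
}
```

Calling crack_step() once per clock with a MOV.X followed by some other op yields two MOV.Q ops and then the other op, with a single stall cycle on the input side. The compiler-side alternative gives the same op stream with no extra hardware, which is the trade-off in question.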

Granted, it does mean that compiler options have to be set to match the
CPU feature set; otherwise, code is either sub-optimal (doesn't use a
feature which exists, or can only use it indirectly via function calls,
...), or doesn't work at all if it tries to use a feature which doesn't
exist.

This is "probably not ideal" for something meant as a general purpose
ISA, but is pretty standard in embedded (where one frequently needs to
tell the compiler the specific model of the specific processor to
generate code for).

But, the other extreme is something like x86, where they try to keep
hardware support for every feature of every previous x86 chip which
ever existed (whether or not the features still make sense).

I suspect, at some point, emulation may make more sense than trying to
keep the same hardware ISA. Though, there is the issue that this puts
this under the control of the OS, and even some OS's which were known
for this (eg, Windows) kinda drag their feet (meaning that one needs to
use 3rd party emulators to run 30 year old binaries).

Granted, I have yet to decide what the eventual "redistributable binary
format" will be, or even if such a thing makes sense for my project.

Would almost assume trying to use something "off the shelf", except that
the existing mainstream options (JVM, .NET, Dalvik, ...), "kinda suck".
Also not really a fan of "just use repurposed LLVM IR".

One of the front-runner options is "Use WAD files holding RIL3 Bytecode
blobs" (doesn't really require much in terms of "new technology" on my
part), but this option "still kinda sucks". However, RIL3 is high-level
enough to at least mostly gloss-over machine-specific issues (it is a
stack machine along vaguely similar lines to .NET bytecode hybridized
with parts of PostScript; and the wackiness that rather than metadata
being structure-based or similar, the entire bytecode image is basically
a single massive blob of bytecode which is linearly interpreted to
rebuild all the various compiler structures and metadata... Or,
basically, "it seemed like a good idea at the time"...).

Also, unlike some of the other options, it is reasonably well adapted to
C and C++ and similar, and also supports inline ASM and a few other
useful features, ...

Taken slightly further, it is possible that the ".ifarch" feature could
be extended to allow inline ASM blobs with multiple instruction sets, or
maybe be extended to functions:
__declspec(ifnarch(BJX2)) void Foo() { ... }
So the function will only generate machine code if not targeting BJX2
(could be further wrapped in preprocessor macros to retain
compatibility with other compilers).

Likewise:
__declspec(ifarch(BJX2)) __asm {
Foo:
...
};

Would only emit the inline ASM blob if generating code for BJX2, but
allow falling back to a C version otherwise (and ".ifarch" can be used
within the ASM code to generate different code based on ISA feature flags).

Possibly:
__declspec(ifarch(BJX2,A32,A64)) ...
Is true if any of these is present.
__declspec(ifnarch(BJX2,A32,A64)) ...
Is true if none of these is present.

Likely with multiple ifarch/ifnarch attributes on the same declaration
combining in an AND relationship.
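
The intended rule (OR within one attribute, AND across attributes) could be sketched as below. This is only an illustration of the combining logic, not BGBCC code; the bit assignments, arch_attr_t, and decl_enabled() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical architecture bits; the names come from the examples above. */
enum { ARCH_BJX2 = 1u << 0, ARCH_A32 = 1u << 1, ARCH_A64 = 1u << 2 };

typedef struct {
    bool     negate;  /* false: ifarch(...), true: ifnarch(...) */
    unsigned mask;    /* OR of the architectures listed in the attribute */
} arch_attr_t;

/*
 * Returns true if a declaration carrying 'n' attributes should be emitted
 * when generating code for 'target' (a set of ARCH_* bits).
 */
bool decl_enabled(const arch_attr_t *attrs, size_t n, unsigned target)
{
    assert(attrs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* ifarch is true if ANY listed arch is present ... */
        bool any = (attrs[i].mask & target) != 0;
        /* ... and ifnarch is true if NONE is; either failing rejects (AND). */
        if (any == attrs[i].negate)
            return false;
    }
    return true;  /* no attributes, or all of them satisfied */
}
```

With this rule, a declaration tagged both ifarch(A32,A64) and ifnarch(A64) would be emitted for A32 only.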

As-is, this sort of thing is mostly so that static libraries don't
necessarily need to be compiled with exactly the same options as those
used for generating the final binary (and things like "sizeof(foo_t)"
are not yet resolved to constant values, ...).

Though, I guess the bigger question here would be the cost (memory cost
and computational overhead) of using the backend from BGBCC as an AOT
compiler (it is either that or look into a lighter-weight AOT backend).

>>
>> This would likely involve adding another split and stall signal to the
>> pipeline, which can stall IF and ID1 (Input Side), but allow ID1 to emit
>> new ops, and ID2 and the EX stages to continue on as normal.
>>
>> Hadn't done so yet, as it seems like a bit of added cost and complexity.
>>
>>
>>
>> Or maybe even a microcode ROM, but this seems like a lot of extra
>> complexity. Could also need banked registers, or reserving that certain
>> registers may be stomped during the operation (eg: "This instruction may
>> leave R0/R1 and R32..R35 in an undefined state" or similar).
> <
> In Mc 88100 we simply used a state machine to cycle off the long FDIV
> sequences--it was a counting problem not a hairy sequencing problem.

OK.

Early on, I had tried using a fairly naive state machine to implement
FDIV and FSQRT, but it was fairly slow to converge, and also kinda
expensive. At the time, decided to cut the losses and just do it
entirely in software.
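
To illustrate what a "counting problem" style of sequencer looks like, here is a minimal C model of a shift-and-subtract (restoring) integer divider that retires one quotient bit per clock. It is not the 88100 design and not what I had tried for FDIV; the struct and function names are made up, and a real FDIV would also need exponent handling and rounding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Divider state; 'count' doubles as the busy indicator (0 means idle). */
typedef struct {
    uint64_t rem;    /* partial remainder */
    uint32_t num;    /* numerator bits not yet shifted in */
    uint32_t den;    /* divisor */
    uint32_t quo;    /* quotient built up MSB first */
    unsigned count;  /* clocks remaining */
} divseq_t;

/* Latch the operands and arm the counter. */
void div_start(divseq_t *s, uint32_t num, uint32_t den)
{
    assert(s && den != 0);
    s->rem = 0;
    s->num = num;
    s->den = den;
    s->quo = 0;
    s->count = 32;
}

/*
 * One clock of the sequencer. Returns true while more clocks are needed,
 * which is the signal that would hold the unit busy.
 */
bool div_step(divseq_t *s)
{
    assert(s);
    if (s->count == 0)
        return false;                        /* idle: nothing to do */

    /* Shift the next numerator bit into the partial remainder. */
    s->rem = (s->rem << 1) | (s->num >> 31);
    s->num <<= 1;
    s->quo <<= 1;

    /* Trial subtract; keep the result only if it does not go negative. */
    if (s->rem >= s->den) {
        s->rem -= s->den;
        s->quo |= 1;
    }

    s->count--;
    return s->count != 0;
}
```

The control here is just the down-counter: there are no data-dependent branches in the sequence, only a fixed 32 steps, after which quo and rem hold the result.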

In my case, there is no microcode, but I want kinda the opposite direction:
You have FADD, FSUB, FMUL, Type Conversion Ops, ...
What more do you "really" need?...

Well, there is SIMD, but this is mostly because doing vectors via scalar
ops is a lot slower than having SIMD ops, but this wasn't really true of
FDIV or FSQRT. And, the part I could easily accelerate with helper ops
is not the part that eats most of the clock cycles.

>>
>> So, say, there is an internal pipeline register 'UXPC' where:
>> UXPC = 0, Instructions come from I$
>> UXPC != 0, Instructions come from UXROM, IF is stalled.
>>
>> Where, say, triggering a microcode op:
>> Copies Rm and Ro to R0 and R1 or similar;
>> Sets UXPC to a non-zero address.
>>
>> Possibly, the UXROM is 108 bits wide:
>> 96 bits: VLIW bundle;
>> 12 bits: Next UXPC + Control Tag (0 for End).
>>
>> As soon as the Next UXPC is 0, whatever is in R0 is redirected to the
>> original Rn. The UXROM code could resemble normal BJX2 code, just
>> hard-wired into 3-wide WEX mode (with Lanes 2/3 padded with NOPs when
>> unused).
>>
>> The 12-bit code would encode control-flow within the UXROM as a sort of
>> special mini-instruction (Branch-Continue, Terminate, Branch-True,
>> Branch-False, Branch-True-Or-Terminate, Branch-False-Or-Terminate). Note
>> that the ROM would likely be fairly small.
>>
>>
>> The banked case would be similar, except that some registers (maybe
>> R16..R23 or similar) would be remapped to special internal registers and
>> not affect externally visible state (possibly Rs/Rt/Ru/Rv/Rx/Ry =>
>> R4/R6/R5/R7/R2/R3, with termination mapping R2/R3 to Rn or Xn).
>>
>> This approach could have a lot less performance overhead than using an
>> interrupt or similar though (it is typically a lot faster/cheaper to use
>> function calls for these sorts of cases than to use an interrupt handler).
>>
>> Cost could be a concern though...
>>
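
The UXPC next-address logic described above could be modeled along these lines. This is a sketch under stated assumptions: the 12-bit field is split here as a 9-bit next address plus a 3-bit tag, the tag encoding is arbitrary, and the not-taken path of Branch-True/Branch-False is assumed to fall through to UXPC+1. None of that is settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Control tags named in the text; the numeric encoding is an assumption. */
typedef enum {
    UX_CONTINUE,      /* always go to 'next' */
    UX_TERMINATE,     /* end of sequence */
    UX_BRANCH_TRUE,   /* T ? next : uxpc+1 */
    UX_BRANCH_FALSE,  /* !T ? next : uxpc+1 */
    UX_BT_OR_TERM,    /* T ? next : end */
    UX_BF_OR_TERM     /* !T ? next : end */
} uxtag_t;

#define UX_ADDR_MASK 0x1FFu  /* assumed 9-bit address, leaving 3 bits of tag */

/* One UXROM word: 96-bit bundle plus the 12-bit control field, unpacked. */
typedef struct {
    uint32_t lane[3];  /* three 32-bit lanes of the VLIW bundle */
    uint16_t next;     /* next UXPC */
    uxtag_t  tag;
} uxword_t;

/*
 * Compute the next UXPC from the current word and the predicate bit 't'.
 * A result of 0 means the sequence has ended and fetch returns to the I$.
 */
uint16_t ux_next(const uxword_t *w, uint16_t uxpc, bool t)
{
    assert(w && uxpc != 0);  /* UXPC == 0 means "not in microcode" */

    uint16_t target = (uint16_t)(w->next & UX_ADDR_MASK);
    uint16_t fall   = (uint16_t)((uxpc + 1u) & UX_ADDR_MASK);

    switch (w->tag) {
    case UX_CONTINUE:     return target;
    case UX_TERMINATE:    return 0;
    case UX_BRANCH_TRUE:  return t ? target : fall;
    case UX_BRANCH_FALSE: return t ? fall : target;
    case UX_BT_OR_TERM:   return t ? target : 0;
    case UX_BF_OR_TERM:   return t ? 0 : target;
    }
    return 0;  /* unknown tag: treat as terminate */
}
```

The IF stage would then select between the I$ and the UXROM based on whether UXPC is zero, and stall normal fetch while it is non-zero.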
>>
>>> /Marcus

