Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<1105bfe4-56ba-42a8-a0ac-691d25e1aadcn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31773&group=comp.arch#31773

by: MitchAlsup - Sun, 23 Apr 2023 01:08 UTC

On Saturday, April 22, 2023 at 7:32:43 PM UTC-5, BGB wrote:
> On 4/22/2023 6:11 PM, MitchAlsup wrote:
> > On Saturday, April 22, 2023 at 4:44:25 PM UTC-5, BGB wrote:
> >> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
> >
> >
> >> Yeah.
> >>
> >> There are a few reasons I went with VLIW rather than superscalar...
> >> Both VLIW and an in-order superscalar require similar logic from the
> >> compiler in order to be used efficiently, and the main cost of VLIW here
> >> reduces to losing 1 bit of instruction entropy and some additional logic
> >> in the compiler to detect if/when instructions can run in parallel.
> >>
> >> Superscalar effectively requires a big glob of pattern recognition early
> >> in the pipeline, which seems like a roadblock.
> > <
> > It is "such a massive roadblock" that my ISA has to look at <gasp> all
> > of 6-bits to determine where an instruction gets routed (its function unit)
> > in unary, one gate later you know if a conflict is present.
> It would require a slightly bigger lookup table for RISC-V or BJX2,
> mostly because instructions are not organized by prefix/suffix class;
> nor by which function-unit handles them.
>
> Also routing to FU's gets a bit ad-hoc (or subject to vary based on
> which features are enabled or disabled), ...
> >>
> >> Had considered support for superscalar RISC-V by using a lookup to
> >> classify instructions as "valid prefix" and "valid suffix" and also
> >> logic to check for register clashes, and then behaving as if there were
> >> a "virtual WEX bit" based on this.
> >>
> >> Hadn't got around to finishing this. RISC-V support on the BJX2 core is
> >> still mostly untested, and will still be limited to in-order operation
> >> for the time being (and, the design of superscalar mechanism would only
> >> be able to give 2-wide operation for 32-bit instructions only).
> >>
> >>
> >> Running POWER ISA code on the BJX2 pipeline would be a bit more of a
> >> stretch though (I added RISC-V as it was already pretty close to a
> >> direct subset of BJX2 at that point).
> >>
> >>
> >>
> >> Things like "modulo scheduling" could in theory help with a VLIW
> >> machine, but in my experience modulo-scheduling could likely also help
> >> with some big OoO machines as well (faking it manually in the C code
> >> being a moderately effective optimization strategy on x86-64 machines as
> >> well).
> > <
> > Modulo scheduling reduces the number of resources in flight to get
> > loops-with-recurrences running smoothly. It helps mono-scalar and larger
> > in-order pipelines, and does no harm to OoO designs whatsoever.
> Yes.
>
> As noted, it seems to help with both x86-64 and BJX2 in my experience.
>
> Not so much with ARM IME, but less sure as to why.
> At least, on ARMv8, theoretically there should not be any detrimental
> effect to modulo scheduling the loops.
> >>
> >> Apparently, clang supports this optimization, but this sort of thing is
> >> currently a bit out of scope of what I can currently manage in BGBCC.
> >>
> >>
> >> Though, have observed that this strategy seems to be counter-productive
> >> on ARM machines (where it seems to often be faster to not try to
> >> manually modulo-schedule the loops). Though, this may depend on the ARM
> >> core (possibly an OoO ARM core might fare better; most of the ones I had
> >> tested on had been in-order superscalar).
> >>
> >> Though, I wouldn't expect there to be all that huge of a difference
> >> between AArch64 and BJX2 on this front, where manual modulo-scheduling
> >> is generally effective on BJX2.
> >>
> >>
> >>
> >> But, as noted in my case on BJX2, latency is sort of like:
> >> 1-cycle:
> >> Basic converter ops, like sign and zero extension;
> >> MOV reg/reg, imm/reg, ...
> > <
> > Are these the zero calculation-cost "calculations"
> > from the set {FABS, IMOV, FMOV, FNEG, FcopySign, INVERT} ?
> > <
> Yes, FABS and FNEG and similar are also 1-cycle ops.
> My list was not exactly exhaustive...
> >> ...
> >> 2-cycle:
> >> Most ALU ops (ADD/SUB/CMPxx/etc);
> >> Some ALU ops could be made 1-cycle, but "worth the cost?".
> > <
> > Warning "Will Robinson":: the slope is very slippery; but candidates
> > are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
> > <
> Yeah.
> 32-bit ADDS.L / SUBS.L
> AND/OR/XOR
> ...
> Could all be turned into 1-cycle ops.
>
> However, most are not high enough on the ranking to show
> significant/obvious benefit from doing so.
>
> It would "maybe" reduce the average interlock penalty from ~ 7% to
> around 5% or 6% of the total clock cycles, but I don't expect much more
> than this (with most of the rest of the interlock penalty being due to
> memory loads).
> >> More complex converter-class instructions ('CONV2').
> >> Many of the FPU and SIMD format converters go here.
> >> 3-cycle:
> >> MUL (32-bit only);
> >> "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
> >> Memory Loads;
> > <
> > Given that integer ADD is 2-cycles {in your definition of the list ordinality}
> > I find it interesting that you get::
> > {
> > route AGEN adder to SRAMs address decoder,
> > SRAM access (1-full-cycle)
> Both happen in EX1, L1 SRAM is accessed on 1 clock-edge.
> Cache is direct-mapped, so only the low order bits need to reach the
> L1's SRAM during this clock cycle.
>
> This can roughly handle a 16K or (maybe) 32K array at 50MHz; 64K is
> basically no-go.
>
>
> Contrast, the L2 cache uses multiple clock-cycles to access the SRAM
> array, so can support a somewhat larger array.
> > SRAM route to data-path; tag address compare;
> > LD align; Set-Selection;
> Mostly happens in EX2.
> The "check for L1 cache miss and stall pipeline" logic is one of the
> major "tight" paths in the core. Anything touching this pathway is
> basically a mine-field.
> > Drive result bus
> > }
> > in 1 more cycle.
> Final result handling is in EX3, say:
> Final sign/zero extension;
> Single -> Double (FMOV.S);
> Pixel Extraction (LDTEX).
>
> > <
>
> The 2-cycle ADD was originally partly an artifact.
> Originally I had not discovered some tricks to make the adder faster,
> and a naive 64-bit add has a fairly high latency.
>
> Though, the trick only really helps with larger adders, so I can *also*
> support 128-bit ADD in 2 clock cycles (effectively by having both the
> Lane 1 and 2 ALUs being able to combine into a larger virtual ALU).
>
> But, I can note that:
> ADD, SUB, CMPxx, PADDx, ...
> Are all basically routed through the same logic.
>
>
>
> But, I can also note that getting the L1 D$ to not blow out timing
> constraints is a semi-constant annoyance...
>
> Poke at something, somewhere random, and once again logic in the L1 D$
> has decided to start failing timing again...
>
> It would be easier here if I made the pipeline longer so I could do
> 4-cycle memory loads.
> >> The newer RGB5MINMAX and RGB5CCENC instructions;
> >> ...
> >>
> >> *: The support for Binary32 is optional, and was pulled off by fiddling
> >> the FPU to "barely" give correct-ish Binary32 results, with the notable
> >> restriction that it is truncate-only rounding.
> >>
> >> RGB5MINMAX was basically:
> >> Cycle 1: Figure out the RGB555 Y values;
> >> Cycle 2: Compare and select based on Y values;
> >> Cycle 3: Deliver output from Cycle 2.
> >>
> >> Initial attempt had routed this through the CONV2 path, but it was bad
> >> for cost and timing, so I had reworked it to share the RGB5CCENC module,
> >> which also had similar logic (both needed to find Y based on RGB555
> >> values, ...).
> >>
> >> RGB5CCENC was basically:
> >> Cycle 1:
> >> Figure out the RGB555 Y values (for pixels);
> >> Figure out the RGB555 Y values (for Mid, Lo-Sel, Hi-Sel);
> >> Cycle 2: Compare and generate selector indices based on Y values;
> >> Cycle 3: Deliver output from Cycle 2.
> >>
> >>
> >> Some longer non-pipelined cases:
> >> 6-cycle: FADD/FMUL/etc (main FPU)
> >> 10-cycle: FP-SIMD via main FPU ("high precision").
> >> 40-cycle: Integer DIVx.L and MODx.L
> >> 80-cycle: Integer DIVx.Q and MODx.Q, 64-bit MULx.Q, ...
> >> 120-cycle: FDIV.
> >> 480-cycle: FSQRT.
> > <
> > If I were to take the machine I am designing AND it happened that I
> > had a calculation unit which could do FADD or FMUL or FMAC in
> > 64-bits and in a 6-cycle pipeline then::
> > IDIV IMOD is 24-28 cycles
> > FDIV is 24-26 cycles
> > SQRT is 27-32 cycles
> > in a pipeline where::
> > LD latency is 4 cycles
> > IMUL is 6 cycles
> Part of the "magic" that allows 3-cycle IMUL and 6-cycle FMUL is the
> DSP48 hard-logic. If not for this, these would not likely be possible.
>
> As noted, DSP48 can effectively do:
> 18s*18s => 36s
> 17u*17u => 34u
> With either optionally being added to a 48-bit accumulator value.
>
> Had noted I could get acceptable 6-cycle FMUL by using 6x DSP48's and
> some extra bit-twiddly for the low-order-bits.
> > <
> > If I stated the latencies appropriate to 4-cycle {FADD, FMUL, FMAC}
> > IDIV IMOD is 17 cycles
> > FDIV is 17 cycles
> > SQRT is 22-cycles
> > in a pipeline where::
> > LD latency is 3-cycles
> > IMUL is 4 cycles.
> OK.
> >>
> >> For integer divide and modulo, it is mostly a toss-up between the ISA
> >> instruction and "just doing it in software".
> >>
> >> For 64-bit integer multiply, doing it in software is still faster.
> >>
> >> For floating-point divide, doing it in software is faster, but the
> >> hardware FDIV is able to give more accurate results (software N-R
> >> seemingly being unable to correctly converge the last few low-order bits).
> > <
> > This is a consequence of not calculating all of the partial product bits
> > and then summing to a correct result. N-R only converges absolutely
> > when the integer parts of the arithmetic are correct. Any error here,
> > which is similar to the uncorrected error of Goldschmidt, prevents
> > convergence to correctness.
> So, possibly a consequence then of my "only calculate high-order bits
> and let all the low-order bits fall off the bottom" FMUL design?...
<
Yes.
>
> In my case, it is only "most of the time" that one will get a
> correctly-rounded FMUL result (where "most of the time" in the sense
> that one has ~ 2.5 bits below the ULP, below which the low parts of the
> multiplier are effectively "hanging into the void").
>
> ...
>
> Well, and the FADD also effectively has the low order bits falling into
> the void as well (shift right into oblivion). But, the mantissa is wide
> enough to handle 64-bit integer conversion, so in this case one
> effectively has around an extra 12 bits below the ULP.
>
>
> ...

On Sunday, April 23, 2023 at 2:48:07 PM UTC-5, BGB wrote:
> On 4/23/2023 9:05 AM, Scott Lurndal wrote:
> > BGB <cr8...@gmail.com> writes:
> >> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
> >

> There is an "easier problem" and a "hard problem":
> "easy problem": Make VLIW performance competitive with in-order superscalar;
> "hard problem": Make VLIW performance competitive with OoO.
>
> The easy problem is "easy" because the compiler still needs to do the
> same work in both cases; getting instructions into the correct order
> that parallel execution is possible, the only real difference being that
> for VLIW, the compiler needs to flag which instructions can be run in
> parallel (and know the width of the target machine and which
> instructions are allowed to run in what combinations).
>
> The "hard" problem is, well, a lot harder. This is the one that Intel
> would have needed to defeat.
>
> Competing with OoO is less of an issue:
> For the most part, OoO cores are too large to fit on Spartan or Artix
> class FPGAs (and, even if they did, the timing situation would likely
> not be good).
>
On the other hand a 6-wide 96-deep GBOoO machine is 2mm^2 in 5nm.
>
> I am also assuming different scenarios:
> Embedded processors;
> Processors in cases where Moore's Law has hit a wall, and it becomes
> more attractive to start trying to shrink the CPUs again (rather than
> trying to maximize single-threaded performance).
>
When you don't need/want that kind of performance you can lose the
G & B parts and still keep the OoO; 2-wide OoO, hit-under-miss caching
and get significantly more than 2-wide IO. Here the CPU is 4× smaller
than the GBOoO but still 3× bigger than the 1-wide IO.
>
> Itanium also suffered from "particularly awful" code density. I have
> managed to avoid this, mostly by sticking with 32-bit instructions and
> variable-length bundles (more like what is common in DSP architectures).
<
Itanium failed primarily because <drum roll> too many cooks in the kitchen.
>

On 4/23/2023 3:37 PM, MitchAlsup wrote:
> On Sunday, April 23, 2023 at 2:48:07 PM UTC-5, BGB wrote:
>> On 4/23/2023 9:05 AM, Scott Lurndal wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
>>>
>
>> There is an "easier problem" and a "hard problem":
>> "easy problem": Make VLIW performance competitive with in-order superscalar;
>> "hard problem": Make VLIW performance competitive with OoO.
>>
>> The easy problem is "easy" because the compiler still needs to do the
>> same work in both cases; getting instructions into the correct order
>> that parallel execution is possible, the only real difference being that
>> for VLIW, the compiler needs to flag which instructions can be run in
>> parallel (and know the width of the target machine and which
>> instructions are allowed to run in what combinations).
>>
>> The "hard" problem is, well, a lot harder. This is the one that Intel
>> would have needed to defeat.
>>
>> Competing with OoO is less of an issue:
>> For the most part, OoO cores are too large to fit on Spartan or Artix
>> class FPGAs (and, even if they did, the timing situation would likely
>> not be good).
>>
> On the other hand a 6-wide 96-deep GBOoO machine is 2mm^2 in 5nm.

Still probably not going to fit into a typical Artix-7...

There are some "really big" FPGA's with 50x as many LUTs, but the cost
for the FPGA is absurd, and they are not going to be supported by the
freeware version of Vivado in any case.

>>
>> I am also assuming different scenarios:
>> Embedded processors;
>> Processors in cases where Moore's Law has hit a wall, and it becomes
>> more attractive to start trying to shrink the CPUs again (rather than
>> trying to maximize single-threaded performance).
>>
> When you don't need/want that kind of performance you can lose the
> G & B parts and still keep the OoO; 2-wide OoO, hit-under-miss caching
> and get significantly more than 2-wide IO. Here the CPU is 4× smaller
> than the GBOoO but still 3× bigger than the 1-wide IO.

OK.

My current core is pretty heavy vs the smaller cores I have pulled off,
but is also pretty feature rich. Much more of the cost seems to be due
to things which are independent of core width (for things like the L1
caches, FPU and SIMD units, ...), than due to parts which are added for
each lane (such as the additional Lane 2/3 ALUs).

Whereas, while a 32-bit integer-only core can be a fair-bit smaller, it
is a lot more limited.

But, I guess it is partly a question of what one compares to.

I suspect a similar core with a double-precision FPU and 4-wide Binary32
SIMD unit would still have a comparable cost even if it were
only 1-wide (given L1 caches + FPU + SIMD unit represents roughly half
of the total LUT cost for the core).

This also represented an issue for the design of a "GPU core", which
would still end up needing most of the "expensive parts" of my main core
(and about the only major part I could drop was the Double-Precision
FPU, but this didn't really save enough to make it viable).

And, throwing a 1-wide integer-only core at the problem wouldn't really
work, ...

Though, I am at least "back in business" for going dual-core on the
XC7A200T...

Some features originally intended to help with the GPU core have ended
up in the main core, as they still sorta help.

Some features, like LDTEX, do technically help in terms of performance,
but are still debatable in terms of their added cost.

And, LDTEX is still mostly effectively limited to square power-of-2 UTX2
compressed textures stored in Morton-order. For non-square textures, it
is still necessary to fall back to using raster order and manually using
the BLKUTX2 instruction or similar instead (texel addressing logic in a
raster-order texture being a more complicated problem than in a
Morton-order texture).
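
As a rough illustration of why Morton order makes texel addressing
simpler for square power-of-2 textures, here is a minimal Python sketch.
It is not the LDTEX logic itself: it ignores the UTX2 block compression,
the function names are made up for the example, and putting X in the
even bit positions is an assumption. The point is only that the Morton
index is a fixed bit interleave, whereas the raster index needs a
multiply by the texture width.

```python
def morton_index(x, y, log2_size):
    """Interleave the low log2_size bits of x and y (x in even positions).

    Hypothetical helper; assumes a square texture of side 2**log2_size.
    """
    idx = 0
    for bit in range(log2_size):
        idx |= ((x >> bit) & 1) << (2 * bit)
        idx |= ((y >> bit) & 1) << (2 * bit + 1)
    return idx


def raster_index(x, y, width):
    """Plain row-major index; needs a multiply (or shift) by the width."""
    return y * width + x


if __name__ == "__main__":
    # Same texel of an 8x8 texture, addressed both ways.
    print(morton_index(3, 5, 3), raster_index(3, 5, 8))
```

In hardware the interleave is just wiring, which is presumably part of
why the Morton case is the cheaper one to support.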

Does also mean one needs alternate versions of span drawing loops for
combinations of texture format and blending mode.

Texture Formats:
UTX2+Morton (Square, Compressed);
UTX2+Raster (Non-Square, Compressed);
Uncompressed Raster (RGB555A, Uncompressed).

Blending Modes:
Direct Texture, No-Alpha-Blend, No Z-Test
Direct Texture, No-Alpha-Blend, Z-Test
Texture + Modulation, No-Alpha-Blend, No Z-Test
Texture + Modulation, No-Alpha-Blend, Z-Test
Texture + Modulation, Alpha-Blend, No Z-Test
Texture + Modulation, Alpha-Blend, Z-Test
Texture + Modulation, Alpha Test, No Z-Test
Texture + Modulation, Alpha Test, Z-Test
...

These being relevant for:
(GL_ONE, GL_ZERO), (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
Whereas, say:
(GL_ZERO, GL_ONE_MINUS_SRC_COLOR)
Requiring a whole different set of span functions.

And, other combinations needing to fall back to "generic" (but somewhat
slower) span-drawing functions (effectively needing to invoke function
pointers for texture-fetch, blending, and drawing the resulting pixel to
the framebuffer).

Where, say, there is a big mess of function pointers that needs to be
updated whenever a non-trivial change is made:
Program has bound a different texture;
Enabling or disabling flags like GL_DEPTH_TEST, GL_ALPHA_TEST, ...
...
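
The rebinding idea can be sketched roughly as below. This is a toy model
in Python rather than the actual rasterizer code: every name in it
(RasterState, FAST_SPANS, the format and blend strings) is hypothetical,
and the "span functions" just return a label instead of drawing. It only
shows the shape of the thing: a table of specialized loops keyed by
state, a generic fallback, and a re-lookup whenever the state changes.

```python
# All names here are hypothetical; they only illustrate the rebinding idea.

def make_span(label):
    """Stand-in for a specialized span-drawing loop."""
    def draw_span():
        return label
    return draw_span


def generic_span():
    """Slow fallback: a real one would call per-pixel function pointers."""
    return "generic"


# Specialized loops exist only for some (format, blend, z_test) combinations.
FAST_SPANS = {
    ("utx2_morton", "direct", False): make_span("morton/direct"),
    ("utx2_morton", "direct", True): make_span("morton/direct/z"),
    ("utx2_raster", "modulate", True): make_span("raster/modulate/z"),
}


class RasterState:
    def __init__(self):
        self.texture_format = "utx2_morton"
        self.blend = "direct"
        self.z_test = False
        self.rebind()

    def rebind(self):
        """Re-select the span function after any non-trivial state change."""
        key = (self.texture_format, self.blend, self.z_test)
        self.draw_span = FAST_SPANS.get(key, generic_span)

    def bind_texture(self, texture_format):
        self.texture_format = texture_format
        self.rebind()

    def set_blend(self, blend):
        self.blend = blend
        self.rebind()

    def set_z_test(self, enabled):
        self.z_test = enabled
        self.rebind()
```

Any state combination missing from the table lands on the generic path,
which matches the "fall back to generic but slower" behavior described
above.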

With only a small sliver of a few of the common texture span-drawing
functions actually using the LDTEX instruction (since, as noted, it only
works for square-Morton-order-compressed-textures).

Note that in the front-end API, textures would be uploaded using DXT1 or
DXT5, but are then quietly transformed into UTX2 internally (mostly
because the UTX2 decoding logic is cheaper than S3TC would have been).

>>
>> Itanium also suffered from "particularly awful" code density. I have
>> managed to avoid this, mostly by sticking with 32-bit instructions and
>> variable-length bundles (more like what is common in DSP architectures).
> <
> Itanium failed primarily because <drum roll> too many cooks in the kitchen.

Probably that too.
The thing was absurdly over-engineered.

It was also implausibly expensive.
Combined with the weak performance, this was a killer.

Itanium might have done better if it were less absurd, and cheap...

Though, it would still not likely have been able to compete in ARM's
original part of the market segment (say, for PDAs and older style
feature phones).

It would more likely have needed to try to compete with PowerPC in terms
of early and mid 2000s game-consoles and "set-top boxes".

But, no one is going to put a CPU that costs $k into a game console...

<8383c3e3-d900-4350-89b4-0d63dd8d1b10n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31779&group=comp.arch#31779

by: luke.l...@gmail.com - Sun, 23 Apr 2023 23:59 UTC

On Sunday, April 23, 2023 at 11:25:20 PM UTC+1, BGB wrote:

> but is also pretty feature rich. Much more of the cost seems to be due
> to things which are independent of core width (for things like the L1
> caches,

CAMs in FPGAs are always expensive to implement, unless there is
a special CAM block (like the DSP block).

> FPU and SIMD units, ...),

yep - now you know why i designed that Dynamic Partitioned SIMD
unit i mentioned a few months ago. estimated 50% extra resources
you can use 64-bit wide single-arithmetical-ALU for everything: 2x32
4x16 8x8 - all dynamically selectable.

add is easy: just add extra bits at the partition points, use carry-rollover
to "bleed" the 7th bit's carry over to the 9th bit (etc.). the rest follow the same
kind of principle.

<5037a641-f409-4a1f-9443-5c23c8f93f76n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31781&group=comp.arch#31781

by: MitchAlsup - Mon, 24 Apr 2023 00:46 UTC

On Sunday, April 23, 2023 at 6:59:50 PM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, April 23, 2023 at 11:25:20 PM UTC+1, BGB wrote:
>
> > but is also pretty feature rich. Much more of the cost seems to be due
> > to things which are independent of core width (for things like the L1
> > caches,
> CAMs in FPGAs are always expensive to implement, unless there is
> a special CAM block (like the DSP block).
> > FPU and SIMD units, ...),
> yep - now you know why i designed that Dynamic Partitioned SIMD
> unit i mentioned a few months ago. estimated 50% extra resources
> you can use 64-bit wide single-arithmetical-ALU for everything: 2x32
> 4x16 8x8 - all dynamically selectable.
>
> add is easy: just add extra bits at the partition points, use carry-rollover
> to "bleed" the 7th bit's carry over to the 9th bit (etc.). the rest follow the same
> kind of principle.
<
Take your 64-bit adder and stretch it to 72 bits.
every 8-bits hardwire the operand-bit inputs to a bit-pair control state.
If this bit-pair is 00 then carry into bit<9> is 0
if this bit-pair is 01 or 10 carry propagates from bit<7> to bit<9>
if this bit-pair is 11 then carry in to bit<9> is 1
<
So if the bit-pair is {00,00,00,00,00,00,00,00} you have eight 8-bit adders SIMD
..... if the bit-pair is {01,00,01,00,01,00,01,00} you have four 16-bit adders SIMD
..... if the bit-pair is {01,01,01,00,01,01,01,00} you have two 32-bit adders SIMD
..... if the bit-pair is {01,01,01,01,01,01,01,00} you have one 64-bit adder..
<
But more exotic things can be constructed::
..... if the bit pair is {01,00,01,01,00,01,01,00} you have one 16-bit add and 2×24-bit adds.
..
..
..
The reason you use a 2-bit control scheme comes into play when you want the
individual terms to insert their own carries. ADD8C-like. The carry in from the
operand is used to gate the bit-pair.
>
> l.
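
The bit-pair scheme above can be modeled in a few lines of Python. This
is only a behavioral sketch under some assumptions: the pair list is read
LSB-first (so the last pair sits above the top byte and its carry out is
simply dropped), the two bits of each pair are wired to the two operand
inputs at the inserted bit position, and the function names are invented
for the example. The add itself is one ordinary 72-bit addition.

```python
def stretch(x, ctl_bits):
    """Spread a 64-bit value to 72 bits, one control bit above each byte."""
    out = 0
    for i in range(8):
        out |= ((x >> (8 * i)) & 0xFF) << (9 * i)
        out |= (ctl_bits[i] & 1) << (9 * i + 8)
    return out


def squeeze(y):
    """Drop the 8 inserted bit positions again, keeping only the data bytes."""
    out = 0
    for i in range(8):
        out |= ((y >> (9 * i)) & 0xFF) << (8 * i)
    return out


def partitioned_add(a, b, pairs):
    """Add a and b with per-byte-boundary 2-bit controls (LSB-first).

    pair 0b00: carry killed at this boundary
    pair 0b01 or 0b10: carry propagates across the boundary
    pair 0b11: carry of 1 forced into the next byte
    """
    sa = stretch(a, [p >> 1 for p in pairs])
    sb = stretch(b, [p & 1 for p in pairs])
    return squeeze((sa + sb) & ((1 << 72) - 1))


if __name__ == "__main__":
    # 0xFF + 0x01 as eight 8-bit lanes, then as four 16-bit lanes.
    print(hex(partitioned_add(0xFF, 0x01, [0] * 8)))
    print(hex(partitioned_add(0xFF, 0x01, [1, 0, 1, 0, 1, 0, 1, 0])))
```

With 00 at a boundary both inserted operand bits are 0, so an incoming
carry is absorbed there; with 01 or 10 exactly one is set, so the carry
ripples through; with 11 a carry is generated regardless, which is the
hook for the ADD8C-like case.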

On 4/23/2023 6:59 PM, luke.l...@gmail.com wrote:
> On Sunday, April 23, 2023 at 11:25:20 PM UTC+1, BGB wrote:
>
>> but is also pretty feature rich. Much more of the cost seems to be due
>> to things which are independent of core width (for things like the L1
>> caches,
>
> CAMs in FPGAs are always expensive to implement, unless there is
> a special CAM block (like the DSP block).
>

I am using direct-mapped caches.

But, forwarding the results of a memory store on one clock-cycle into
the next clock cycle adds significant cost, but can make a big
difference in terms of performance for code involving stores to
consecutive locations in memory.

Have experimented with associative caches, but my results have been
mixed. They are more expensive, but any performance gains tend to be a
bit hit or miss (and programs like Doom and similar tended to perform
better with direct-mapped caches than with associative caches in my
testing).

Though, in a strict hit/miss rate sense, a 2-way or 4-way LRU cache
would compare well against a direct-mapped cache (but, somehow, the DM
cache seems to perform better in Doom despite the slightly worse miss
rate in an absolute sense...).
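
For the "strict hit/miss rate" part, a toy miss counter like the one
below shows the kind of case where a 2-way LRU cache wins on paper. It is
a sketch under assumptions, not a model of the BJX2 caches: the line
size, the line counts, and the function names are all made up, and it
counts misses only, so it says nothing about access latency, timing, or
why the DM cache ends up faster in Doom.

```python
def dm_misses(addrs, n_lines, line_bytes=16):
    """Count misses for a direct-mapped cache with n_lines lines."""
    tags = [None] * n_lines
    misses = 0
    for a in addrs:
        line = a // line_bytes
        idx, tag = line % n_lines, line // n_lines
        if tags[idx] != tag:
            misses += 1
            tags[idx] = tag
    return misses


def lru2_misses(addrs, n_lines, line_bytes=16):
    """Count misses for a 2-way LRU cache with the same total line count."""
    n_sets = n_lines // 2
    sets = [[] for _ in range(n_sets)]  # tags per set, most recent last
    misses = 0
    for a in addrs:
        line = a // line_bytes
        idx, tag = line % n_sets, line // n_sets
        ways = sets[idx]
        if tag in ways:
            ways.remove(tag)
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)  # evict the least recently used way
        ways.append(tag)
    return misses


if __name__ == "__main__":
    # Two addresses that collide in a 4-line direct-mapped cache.
    pattern = [0, 64] * 4
    print(dm_misses(pattern, 4), lru2_misses(pattern, 4))
```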

>> FPU and SIMD units, ...),
>
> yep - now you know why i designed that Dynamic Partitioned SIMD
> unit i mentioned a few months ago. estimated 50% extra resources
> you can use 64-bit wide single-arithmetical-ALU for everything: 2x32
> 4x16 8x8 - all dynamically selectable.
>

I handle packed integer SIMD via the ALUs.

The "expensive" SIMD unit is the one for doing 4x Binary32 operations in
3 clock-cycles...

Without this unit, most of the FPU-SIMD operations would require 10
clock cycles. So, it is basically paying a big chunk of LUTs for fully
pipelined Binary32 SIMD ops...

Or, I can save some cost, by having the low-precision unit only handle
Binary16 and an S.E8.F16 format (but, then by default Binary32 SIMD ops
need to be routed through the main FPU, and thus cost 10 cycles).

Well, and extra cost and timing hassles resulting from another
experimental feature where the SIMD ops can perform an inline vector
shuffle (vs needing to use external vector shuffles).

Maximum performance for the SIMD unit is 200 MFLOP/s at 50 MHz.

However, vector shuffle can eat into this a fair bit in certain cases.
There were some "very experimental" ops, like:
PMULSHX.H Rm, Imm48fv8sh, Rn

Which can combine a vector multiply against an immediate (4x S.E5.F6)
with a 4-element vector shuffle.

Which can allow the SIMD unit to operate "relatively close" to the 200
MFLOP/s hard-limit (and, at the same time, manage to outperform an
otherwise significantly faster early 2000's laptop at running a neural
net via x87 ops...).

Though, it is almost kinda moot given this operation is otherwise
"extremely niche" (and a slightly newer Vista era laptop will stomp on
both of them...).

Also, it is pretty much entirely moot if one wants to be able to train
the net (which requires keeping the weight vectors and similar in RAM).

Some of this could be relevant to GLSL, but I am not yet sure how I
would get usable performance from code generated by a GLSL compiler
(but, uploading shaders as BJX2 ASM would be a bit non-standard...).

> add is easy: just add extra bits at the partition points, use carry-rollover
> to "bleed" the 7th bit's carry over to the 9th bit (etc.). the rest follow the same
> kind of principle.
>

I have both MMX and SSE style.

Packed Integer:
4x Int16
2x Int32
4x Int32

Packed Float:
4x Binary16
2x Binary32
4x Binary32
2x Binary64 (Main FPU)

With converters to/from some additional formats:
3x 20 bits (S.E5.F10:4);
64-bit storage <-> 4x Binary32
3x 10 bits (S.E5.F4)
32-bit storage <-> 4x Binary16
4x 8 bits:
E4.F4, E4.F3.S (FP8U/FP8S)
S.E3.F4 (A-Law; unit range only)
32-bit storage <-> 4x Binary16
3x 42-bit (S.E8.F23:10)
128-bit storage <-> 4x Binary64.
Packed/Unpacked piecewise from 2x Binary64, multi-step conversion.

The 3x formats mostly set W to either 0.0 or 1.0 on unpack, ignoring W
on pack.

And, some "partial" packed Integer <-> Float SIMD converters (to reduce
cost, the mechanism is slightly wonky).

BGB <cr88192@gmail.com> writes:
>On 4/22/2023 3:49 AM, robf...@gmail.com wrote:

>>> ...
>> I have not had much luck getting a superscalar to clock over 20 MHz in
>> an FPGA. One issue is the size of the core slows things down. Even at
>> 20 MHz though performance is like that of a 40 to 50 MHz scalar core.
>> One nice thing about a lower clock speed is that there are fewer issues
>> with some instructions which can be more complex because more
>> clock space is available. I got an in-order PowerPC clone to work a little
>> faster than 20 MHz.
>>
>
>Yeah.
>
>There are a few reasons I went with VLIW rather than superscalar...
>Both VLIW and an in-order superscalar require similar logic from the
>compiler in order to be used efficiently, and the main cost of VLIW here
>reduces to losing 1 bit of instruction entropy and some additional logic
>in the compiler to detect if/when instructions can run in parallel.

I think there is pretty clear evidence (c.f. Itanic) that expecting compilers
to fully utilize the capabilities of a general purpose VLIW instruction
set is a very difficult problem, even with highly skilled (SGI/HP/Intel)
compiler engineers.

On 4/23/2023 9:05 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
>
>>>> ...
>>> I have not had much luck getting a superscalar to clock over 20 MHz in
>>> an FPGA. One issue is the size of the core slows things down. Even at
>>> 20 MHz though performance is like that of a 40 to 50 MHz scalar core.
>>> One nice thing about a lower clock speed is that there are fewer issues
>>> with some instructions which can be more complex because more
>>> clock space is available. I got an in-order PowerPC clone to work a little
>>> faster than 20 MHz.
>>>
>>
>> Yeah.
>>
>> There are a few reasons I went with VLIW rather than superscalar...
>> Both VLIW and an in-order superscalar require similar logic from the
>> compiler in order to be used efficiently, and the main cost of VLIW here
>> reduces to losing 1 bit of instruction entropy and some additional logic
>> in the compiler to detect if/when instructions can run in parallel.
>
> I think there is pretty clear evidence (c.f. Itanic) that expecting compilers
> to fully utilize the capabilities of a general purpose VLIW instruction
> set is a very difficult problem, even with highly skilled (SGI/HP/Intel)
> compiler engineers.
>

There is an "easy problem" and a "hard problem":
  "easy problem": Make VLIW performance competitive with in-order superscalar;
  "hard problem": Make VLIW performance competitive with OoO.

The easy problem is "easy" because the compiler still needs to do the
same work in both cases: getting instructions into the correct order so
that parallel execution is possible. The only real difference is that
for VLIW, the compiler needs to flag which instructions can be run in
parallel (and know the width of the target machine and which
instructions are allowed to run in what combinations).

The "hard" problem is, well, a lot harder. This is the one that Intel
would have needed to defeat.

Competing with OoO is less of an issue:
For the most part, OoO cores are too large to fit on Spartan or Artix
class FPGAs (and, even if they did, the timing situation would likely
not be good).

I am also assuming different scenarios:
  Embedded processors;
  Processors in cases where Moore's Law has hit a wall, and it becomes
  more attractive to start trying to shrink the CPUs again (rather than
  trying to maximize single-threaded performance).

Itanium also suffered from "particularly awful" code density. I have
managed to avoid this, mostly by sticking with 32-bit instructions and
variable-length bundles (more like what is common in DSP architectures).

But, I guess, more like Itanium than DSPs:
  BJX2 does still perform register interlock checking;
  There are no delay slots.

For example, in a more traditional VLIW (such as TMS320C6x), one might
need to NOP pad a Load or similar if there was nothing to do in the
meantime, and there might be two or more delay-slot cycles following a
branch, ... I didn't go this route, and instead the CPU will stall the
pipeline for as many cycles as are needed for the conflicting operation
to finish.

There are similarities and differences in the predication mechanism:
  IA-64 used an array of predicate-bit registers;
    It was effectively "Bit is 1, execute; Bit is 0, no-execute".
  BJX2 generally only uses a single bit (SR.T);
    Instructions are "?T" (Execute if SR.T==1) or "?F" (If SR.T==0).
  ...

The main competition then is with in-order superscalar cores, but most
of these end up with the limitation of typically only running at 25 or
30 MHz (eg: SweRV and friends have this issue).

It still gets impressive looking Dhrystone numbers, but as far as I can
tell, much of this is because GCC tends to be somewhat better at code
generation (particularly in the Dhrystone case).

Things like Doom performance still "kinda suck" though.

OTOH:
Got around to testing my color-cell instructions, and they mostly serve
their purpose.

Running Doom in the 640x400 color-cell mode (in a "Window") goes from
6fps to around 20fps.

With the helper ops, color-cell encoder is roughly 6% of the CPU time
(down from 70%).

However, transforming a 4x4 cell still leaves around 40% of the
clock-cycles in the encoder as interlock penalties. I may consider
reworking it to transform multiple color-cells at a time (say: 8x4
pixels or 16x4 pixels); which could potentially reduce the amount of
cycles spent on interlock penalties.

Note that 640x400 screen redraw at 20fps will require shoving roughly 10
MB/s through the color-cell encoder. Or ~ 5 megapixels/sec.
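
The arithmetic behind those two figures can be checked with a few lines of Python. The 2 bytes per pixel is an assumption (a 16-bit hi-color source image), chosen because it is what makes ~5 megapixels/sec line up with ~10 MB/s; the variable names are purely illustrative.

```python
width, height, fps = 640, 400, 20

pixels_per_sec = width * height * fps      # pixels redrawn per second
bytes_per_pixel = 2                        # assumption: 16-bit hi-color input
bytes_per_sec = pixels_per_sec * bytes_per_pixel
cells_per_sec = pixels_per_sec // (4 * 4)  # 4x4-pixel color cells

print(f"{pixels_per_sec / 1e6:.2f} Mpixel/s, "
      f"{bytes_per_sec / 1e6:.2f} MB/s, "
      f"{cells_per_sec} cells/s")
```

So the encoder has to get through about 320,000 4x4 cells every second at this resolution and frame rate.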

Scaling for CPU use, it seems I am not too far off from the theoretical
limits of the design of the color-cell helper instructions.

Theoretically, the encoder instructions could also be used to help with
encoding the RPZA or CRAM video formats as well. Though, real-time video
capture from the BJX2 core isn't an immediate priority.

It is more a case of color-cell being used to reduce video-memory and
video-memory-bandwidth requirements (a 640x400 or 640x480 hi-color
bitmapped framebuffer would need too much memory bandwidth; and is too
big to fit into the L2 cache).

The helper instructions don't currently directly cover the 800x600 mode
(which uses a color-cell format with 1-bit selectors vs 2-bit), or the
1024x768 mode (1-bit monochrome, with an optional Bayer Matrix sub-mode
to mimic full color output, in addition to a range of fixed-color
options). Though, admittedly, haven't really used these modes much thus far.

IIRC, did test them on actual hardware, and the monitor I was testing
with seemed to accept them (despite the non-standard timings), though
fails to correctly identify the intended resolution. Then again, the
monitor identifies the 320x200 mode as 720x400, so, ...

Still not confirmed whether they would work on an actual CRT, vs an LCD.

Even then, I have had mixed results with the LCDs I have tried; another
LCD had a hard time aligning the image and would generally cut off part
of the screen (effectively "letter boxing" the image after
misidentifying the 320x200 mode as 640x360 ...).

Also with the current monitor, the temporal dithering feature seems to
not work correctly, as the screen will not update the pixels (until a
more major change occurs), causing it to hold an arbitrary combination
of fixed dither patterns.

....

Still don't know yet if I will develop a GUI for this...

by: luke.l...@gmail.com - Mon, 24 Apr 2023 20:49 UTC

On Monday, April 24, 2023 at 1:46:57 AM UTC+1, MitchAlsup wrote:

> So if the bit-pair is {00,00,00,00,00,00,00,00} you have eight 8-bit adders SIMD
> .... if the bit-pair is {01,00,01,00,01,00,01,00} you have four 16-bit adders SIMD
> .... if the bit-pair is {01,01,01,00,01,01,01,00} you have two 32-bit adders SIMD
> .... if the bit-pair is {01,01,01,01,01,01,01,00} you have one 64-bit adder.

ta-daaa. from add you get subtract, from that you get compare.
but equals greater less-than are easy to construct, based on byte-level
analysis. equals is the obvious one, GT and LT require a Carry-Propagation
Cascade.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/eq/

eq0 = a[0:7] == b[0:7]
eq1 = a[8:15] == b[8:15]
eq2 = a[16:23] == b[16:23]
eq3 = a[24:31] == b[24:31]

and then you combine those based on the partitions.
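
As a rough Python model of "combine those based on the partitions" (a sketch only; the `joins` encoding and the function name are assumptions for illustration, not the Libre-SOC API): each byte gets its own equality result, and the results of all bytes in one lane are ANDed together.

```python
def partitioned_eq(a, b, joins):
    """Compare two 32-bit values lane by lane.
    joins[i] == 1 means byte i and byte i+1 are in the same lane
    (i = 0..2). Returns one boolean per lane, lowest lane first."""
    # Byte-level equality: eq0..eq3 in the post above.
    eq = [((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF) for i in range(4)]
    lanes = []
    acc = True
    for i in range(4):
        acc = acc and eq[i]
        if i == 3 or not joins[i]:   # lane ends at this byte
            lanes.append(acc)
            acc = True
    return lanes
```

For example, with `joins = [1, 0, 1]` the two 16-bit halves are compared independently, and with `joins = [1, 1, 1]` the result is a single 32-bit comparison.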

multiply is also possible by doing 8x8->16-bit result multiply
blocks then performing a Wallace Tree reduction using,
ta-daaa, the above adder scheme, and you now have every
permutation of Dynamic SIMD multiply.
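
A much-reduced Python sketch of the block idea, for a 16-bit operand that is either one 16-bit lane or two 8-bit lanes. The function name and the `joined` flag are made up for illustration, and plain integer addition stands in for the Wallace Tree reduction through the partitioned adder.

```python
def dyn_mul16(a, b, joined):
    """Build a multiply out of 8x8->16 partial-product blocks.
    joined=True:  one 16x16->32 multiply, returned as a 1-tuple.
    joined=False: two independent 8x8->16 multiplies (low lane, high lane)."""
    a0, a1 = a & 0xFF, (a >> 8) & 0xFF
    b0, b1 = b & 0xFF, (b >> 8) & 0xFF
    # The four 8x8 blocks; every partitioning draws on the same blocks.
    p00, p01, p10, p11 = a0 * b0, a0 * b1, a1 * b0, a1 * b1
    if joined:
        # The cross terms take part only when the two bytes form one lane.
        return (p00 + ((p01 + p10) << 8) + (p11 << 16),)
    return (p00, p11)
```

The cross-term blocks `p01` and `p10` are simply left out of the sum when the partition is open, which is what makes the same hardware serve both widths.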

shift was hair-raising and there is a caveat: you can't make the
partitions too small, because if the shift amount is *also*
partitioned you lose the ability to correctly express the
amount of shifting required in a SIMD context.

but, it is all doable:
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

> But more exotic things can be constructed::
> .... if the bit pair is {01,00,01,01,00,01,01,00} you have one 16-bit add and 2×24-bit adds.

yes, i ended up accidentally implementing that in the Dynamic
Partitioned SIMD HDL Library.
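
The control patterns quoted above can be modelled in a few lines of Python. This is a simplified sketch: it assumes the patterns list one entry per byte starting at the least significant byte, it models only the 00 (kill carry) and 01 (pass carry) values as a single bit, and it leaves out the carry-insert use of the second bit. The function name is hypothetical.

```python
def gated_carry_add(a, b, ctl):
    """Add two 64-bit values as eight byte adders.
    ctl[i] == 1 passes byte i's carry-out into byte i+1;
    ctl[i] == 0 kills it, which ends a lane at byte i."""
    result = 0
    carry = 0
    for i in range(8):
        s = ((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * i)
        carry = (s >> 8) & ctl[i]   # gate the carry with the control bit
    return result


# The "exotic" pattern {01,00,01,01,00,01,01,00}:
# one 16-bit lane (bytes 0-1) and two 24-bit lanes (bytes 2-4, bytes 5-7).
EXOTIC = [1, 0, 1, 1, 0, 1, 1, 0]
```

Nothing in the loop cares whether lanes are powers of two wide, which is why the 16+24+24 split falls out for free.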

> The reason you use a 2-bit control scheme comes into play when you want the
> individual terms to insert their own carries. ADD8C-like. The carry in from the
> operand is used to gate the bit-pair.

ooo niiice, i missed that. ok i did and i didn't. Power ISA has
carry-in carry-out (XER.CA), and yes that bit needs to go in at one end
and come out at the other.

by: MitchAlsup - Mon, 24 Apr 2023 21:15 UTC

On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, April 24, 2023 at 1:46:57 AM UTC+1, MitchAlsup wrote:
>
> > So if the bit-pair is {00,00,00,00,00,00,00,00} you have eight 8-bit adders SIMD
> > .... if the bit-pair is {01,00,01,00,01,00,01,00} you have four 16-bit adders SIMD
> > .... if the bit-pair is {01,01,01,00,01,01,01,00} you have two 32-bit adders SIMD
> > .... if the bit-pair is {01,01,01,01,01,01,01,00} you have one 64-bit adder.
> ta-daaa. from add you get subtract, from that you get compare.
> but equals greater less-than are easy to construct, based on byte-level
> analysis. equals is the obvious one, GT and LT require a Carry-Propagation
> Cascade.
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/eq/
>
> eq0 = a[0:7] == b[0:7]
> eq1 = a[8:15] == b[8:15]
> eq2 = a[16:23] == b[16:23]
> eq3 = a[24:31] == b[24:31]
>
> and then you combine those based on the partitions.
>
> multiply is also possible by doing 8x8->16-bit result multiply
> blocks then performing a Wallace Tree reduction using,
> ta-daaa, the above adder scheme, and you now have every
> permutation of Dynamic SIMD multiply.
>
> shift was hair-raising and there is a caveat: you can't make the
> partitions too small, because if the shift amount is *also*
> partitioned you lose the ability to correctly express the
> amount of shifting required in a SIMD context.
>
> but, it is all doable:
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/
> > But more exotic things can be constructed::
> > .... if the bit pair is {01,00,01,01,00,01,01,00} you have one 16-bit add and 2×24-bit adds.
> yes, i ended up accidentally implementing that in the Dynamic
> Partitioned SIMD HDL Library.
> > The reason you use a 2-bit control scheme comes into play when you want the
> > individual terms to insert their own carries. ADD8C-like. The carry in from the
> > operand is used to gate the bit-pair.
<
> ooo niiice,
<
grasshopper, why does it amaze you so..........
<
> i missed that. ok i did and i didn't. Power ISA has
> carry-in carry-out (XER.CA), and yes that bit needs to go in at one end
> and come out at the other.
>
> l.

by: Quadibloc - Tue, 25 Apr 2023 01:54 UTC

On Saturday, April 22, 2023 at 5:11:33 PM UTC-6, MitchAlsup wrote:

> Warning "Will Robinson":: the slope is very slippery; but candidates
> are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}

And here I thought that a barrel shifter was standard equipment on
any processor above the tiniest size these days, and so *all* shifts and
rotates are usually single-cycle operations.

John Savard

On Monday, April 24, 2023 at 8:54:19 PM UTC-5, Quadibloc wrote:
> On Saturday, April 22, 2023 at 5:11:33 PM UTC-6, MitchAlsup wrote:
>
> > Warning "Will Robinson":: the slope is very slippery; but candidates
> > are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
> And here I thought that a barrel shifter was standard equipment on
> any processor above the tiniest size these days, and so *all* shifts and
> rotates are usually single-cycle operations.
<
As I have repeatedly stated: Nobody has used a barrel shifter (as seen
in Mead & Conway) since 1990--we all use multiplexer shifters.
<
A barrel shifter has the property that it is n+n wires wide
A multiplexer shifter has the property that it is n+lnk(n) bits wide.
(Where k is 2 or 4) {IBM circa 1989±}
<
This is similar to people giving Wallace credit for how multiplier trees
are constructed, whereas all multiplier trees use Dadda form these
days (Wallace is a peculiar subset of Dadda.) Wallace wrote the
paper, Dadda wrote the book.
>
> John Savard

by: Quadibloc - Tue, 25 Apr 2023 02:26 UTC

On Monday, April 24, 2023 at 7:59:19 PM UTC-6, MitchAlsup wrote:

> As I have repeatedly stated: Nobody has used a barrel shifter (as seen
> in Mead & Conway) since 1990--we all use multiplexer shifters.
> <
> A barrel shifter has the property that it is n+n wires wide
> A multiplexer shifter has the property that it is n+lnk(n) bits wide.
> (Where k is 2 or 4) {IBM circa 1989±}

Oh, dear. What I was thinking of, then, would be a multiplexer shifter.

John Savard

On 4/24/2023 8:54 PM, Quadibloc wrote:
> On Saturday, April 22, 2023 at 5:11:33 PM UTC-6, MitchAlsup wrote:
>
>> Warning "Will Robinson":: the slope is very slippery; but candidates
>> are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
>
> And here I thought that a barrel shifter was standard equipment on
> any processor above the tiniest size these days, and so *all* shifts and
> rotates are usually single-cycle operations.
>

FWIW:

I actually ended up using funnel shifters, with some logic to
reconfigure the input values depending on what type of shift/rotate was
being performed.

Shift Left:
  High Input: Rs
  Low Input: 0
  Offset: 64-Ri
Shift Right:
  High Input: 0 or SExt(Rs)
  Low Input: Rs
  Offset: Ri (or (-Ri) / (1+(~Ri)) )
Rotate Left:
  High Input: Rs
  Low Input: Rs
  Offset: 64-Ri ( or (65+(~Ri)) )

Essentially, the shift extracts a 64-bit value from within a sliding
128-bit window (which can in turn handle everything from 0..63).

With the actual shift being essentially a stack of multiplexers driven
by each bit of the offset.
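
A small Python model of the scheme described above, as a sketch rather than the actual BJX2 logic: the function names are illustrative, and the model lets the offset reach 64 (which `64-Ri` produces when Ri is 0) by using seven multiplexer stages; how the real hardware treats that case is not stated here.

```python
MASK64 = (1 << 64) - 1


def funnel(hi, lo, offset):
    """Return the 64 bits of the 128-bit window {hi, lo} that start at
    bit position `offset`."""
    window = ((hi & MASK64) << 64) | (lo & MASK64)
    # One multiplexer stage per offset bit: stage k either passes the
    # window through unchanged or moves it down by 2**k bits.
    for k in range(7):
        if (offset >> k) & 1:
            window >>= 1 << k
    return window & MASK64


def shl(rs, ri):   # Shift Left:  High = Rs, Low = 0, Offset = 64-Ri
    return funnel(rs, 0, 64 - ri)


def shr(rs, ri):   # Shift Right (logical): High = 0, Low = Rs, Offset = Ri
    return funnel(0, rs, ri)


def sar(rs, ri):   # Shift Right (arithmetic): High = SExt(Rs)
    sext = MASK64 if (rs >> 63) & 1 else 0
    return funnel(sext, rs, ri)


def rotl(rs, ri):  # Rotate Left: High = Rs, Low = Rs, Offset = 64-Ri
    return funnel(rs, rs, 64 - ri)
```

All four operations share the one `funnel` datapath; only the way the two window halves and the offset are set up differs, which matches the table above.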

And, with some additional trickery, two such 64-bit shift units can
combine to perform a 128-bit shift or rotate.

Add some extra special case handling to deal with the shift-values being
signed for SHADx/SHLDx:
  Ri>0: Shift Left
  Ri<0: Shift Right

Well, and 64-Ri => 65+(~Ri), ...

And, add some additional logic to support 32-bit shifts (hint, more
stuff about configuring the values in the input window, ...).

But, yeah, something to this effect...

> John Savard

by: Quadibloc - Tue, 25 Apr 2023 03:23 UTC

On Monday, April 24, 2023 at 8:26:24 PM UTC-6, Quadibloc wrote:
> On Monday, April 24, 2023 at 7:59:19 PM UTC-6, MitchAlsup wrote:
>
> > As I have repeatedly stated: Nobody has used a barrel shifter (as seen
> > in Mead & Conway) since 1990--we all use multiplexer shifters.
> > <
> > A barrel shifter has the property that it is n+n wires wide
> > A multiplexer shifter has the property that it is n+lnk(n) bits wide.
> > (Where k is 2 or 4) {IBM circa 1989±}

> Oh, dear. What I was thinking of, then, would be a multiplexer shifter.

I couldn't find a definition of a "multiplexer shifter" online.

After further searching, I found one reference to what I was thinking of:
something which had a layer that shifted by 1, another layer that shifted by
2, another layer that shifted by 4, and so on.

But the source I saw it in called it a "logarithmic barrel shifter". It noted that
each stage would do a rotate, and a mask would change the rotate into a shift
as desired. Plus the switch from a left shift to a right shift would be done by
optionally reversing all the bits on entry and exit.
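
The structure that source describes can be sketched in Python roughly as follows. This is a behavioural model under assumptions (64-bit width, logical shifts only, illustrative function names), not a reconstruction of any particular paper's circuit.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def bit_reverse(x):
    """Reverse the order of the 64 bits of x."""
    return int(format(x & MASK, "064b")[::-1], 2)


def rotate_right_log(x, n):
    """Rotate right by n (0..63) using log2(64) = 6 layers; layer k
    rotates by 2**k when bit k of n is set."""
    for k in range(6):
        if (n >> k) & 1:
            s = 1 << k
            x = ((x >> s) | (x << (WIDTH - s))) & MASK
    return x


def shift(x, n, left=False):
    """Logical shift built from the rotate: mask off the bits that wrapped
    around, and handle left shifts by reversing on entry and exit."""
    if left:
        x = bit_reverse(x)
    x = rotate_right_log(x, n)
    x &= MASK >> n          # the mask turns the rotate into a shift
    if left:
        x = bit_reverse(x)
    return x
```

Only one direction of rotate is ever built; the reversal and the mask are the extra cost paid for sharing it.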

John Savard

by: Quadibloc - Tue, 25 Apr 2023 03:34 UTC

On Monday, April 24, 2023 at 9:23:32 PM UTC-6, Quadibloc wrote:
> On Monday, April 24, 2023 at 8:26:24 PM UTC-6, Quadibloc wrote:
> > On Monday, April 24, 2023 at 7:59:19 PM UTC-6, MitchAlsup wrote:
> >
> > > As I have repeatedly stated: Nobody has used a barrel shifter (as seen
> > > in Mead & Conway) since 1990--we all use multiplexer shifters.
> > > <
> > > A barrel shifter has the property that it is n+n wires wide
> > > A multiplexer shifter has the property that it is n+lnk(n) bits wide.
> > > (Where k is 2 or 4) {IBM circa 1989±}
>
> > Oh, dear. What I was thinking of, then, would be a multiplexer shifter.
> I couldn't find a definition of a "multiplexer shifter" online.
>
> After further searching, I found one reference to what I was thinking of:
> something which had a layer that shifted by 1, another layer that shifted by
> 2, another layer that shifted by 4, and so on.
>
> But the source I saw it in called it a "logarithmic barrel shifter". It noted that
> each stage would do a rotate, and a mask would change the rotate into a shift
> as desired. Plus the switch from a left shift to a right shift would be done by
> optionally reversing all the bits on entry and exit.

Another paper that I found described both a barrel shifter and a funnel
shifter as having layers that shift by powers of two. With a funnel shifter,
the difference between a shift and a rotate is whether you fill both halves,
or just one, of the double-width input, which is simpler than preparing a
mask.

John Savard

by: luke.l...@gmail.com - Tue, 25 Apr 2023 08:35 UTC

On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
> > > The reason you use a 2-bit control scheme comes into play when you want the
> > > individual terms to insert their own carries. ADD8C-like. The carry in from the
> > > operand is used to gate the bit-pair.
> <
> > ooo niiice,
> <
> grasshopper, why does it amaze you so..........

because relatively speaking this is all entirely new to me! :)
i say "relatively", it's been 4 years...

On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
> On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
>> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
>>>> The reason you use a 2-bit control scheme comes into play when you want the
>>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
>>>> operand is used to gate the bit-pair.
>> <
>>> ooo niiice,
>> <
>> grasshopper, why does it amaze you so..........
>
> because relatively speaking this is all entirely new to me! :)
> i say "relatively", it's been 4 years...
>

I have been poking around with this stuff for around 7 years now...

I am much older now than when I started it seems...

Elsewhere, an online argument came up, where arguably my graphics
hardware is much crappier than some people would expect for the class of
FPGA I am using.

Eg:
320x200 hi-color
640x400 color-cell (or 16-color RGBI)
800x600 color-cell (or 4-color CGA-like)
1024x768 text-cell (or 1-bpp monochrome)

In text-cell mode, one could still sorta do graphics by loading the
graphics tiles into the "Font RAM".

But...
In higher-bpp modes, RAM fetch can't keep up with screen refresh.
Or, if I do Block-RAM... It sorta eats all the Block-RAM.

Still doesn't help that my DDR RAM controller is kinda slow.
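
To put rough numbers on the "RAM fetch can't keep up" and "eats all the Block-RAM" points, here is a small sketch of the arithmetic. The 60 Hz refresh rate, the 16 bits per pixel for hi-color, and the two 16 bpp rows at higher resolutions are assumptions for illustration; the post does not give the exact pixel formats of the color-cell modes, so those are left out.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed to hold one full frame."""
    return width * height * bits_per_pixel // 8


def scanout_bytes_per_second(width, height, bits_per_pixel, refresh_hz=60):
    """Average bytes per second that must be read to refresh the screen."""
    return framebuffer_bytes(width, height, bits_per_pixel) * refresh_hz


if __name__ == "__main__":
    modes = [
        ("320x200 hi-color, 16 bpp (assumed)", 320, 200, 16),
        ("640x400 at 16 bpp (hypothetical)", 640, 400, 16),
        ("1024x768 at 16 bpp (hypothetical)", 1024, 768, 16),
        ("1024x768 monochrome, 1 bpp", 1024, 768, 1),
    ]
    for name, w, h, bpp in modes:
        size = framebuffer_bytes(w, h, bpp)
        rate = scanout_bytes_per_second(w, h, bpp)
        print(f"{name}: {size} bytes/frame, {rate / 1e6:.2f} MB/s at 60 Hz")
```

The frame size is what would have to fit in Block-RAM, and the bytes-per-second figure is what the DRAM path would have to sustain on top of CPU traffic.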

But, the "effort curve" for a lot of this is pretty steep.

....

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<17d74d96-7421-4abc-b4c1-1700e5297cbdn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=31862&group=comp.arch#31862


by: MitchAlsup - Tue, 25 Apr 2023 20:10 UTC

On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
> > On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
> >> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail..com wrote:
> >>>> The reason you use a 2-bit control scheme comes into play when you want the
> >>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
> >>>> operand is used to gate the bit-pair.
> >> <
> >>> ooo niiice,
> >> <
> >> grasshopper, why does it amaze you so..........
> >
> > because relatively speaking this is all entirely new to me! :)
> > i say "relatively", it's been 4 years...
> >
> I have been poking around with this stuff for around 7 years now...
<
If I had only been poking around for 7 years I would only be 10% older...........

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u2ab3f$18ifq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=31870&group=comp.arch#31870


by: BGB - Wed, 26 Apr 2023 04:59 UTC

On 4/25/2023 3:10 PM, MitchAlsup wrote:
> On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
>> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
>>> On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
>>>> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
>>>>>> The reason you use a 2-bit control scheme comes into play when you want the
>>>>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
>>>>>> operand is used to gate the bit-pair.
>>>> <
>>>>> ooo niiice,
>>>> <
>>>> grasshopper, why does it amaze you so..........
>>>
>>> because relatively speaking this is all entirely new to me! :)
>>> i say "relatively", it's been 4 years...
>>>
>> I have been poking around with this stuff for around 7 years now...
> <
> If I had only been poking around for 7 years I would only be 10% older..........
>

Yeah... More around 18% of my lifespan in my case...

Was near the start of my 30s at the start of this project, now I am near
the end...

And I am using BGBCC as my C compiler, which came into existence (as its
own entity) when I was in my early 20s, based on a fork of a VM I wrote
back when I was still a teenager (for a language I would later end up
calling BGBScript); and then shelved for a while, as it turned out C did
not make for a particularly good scripting language (and the VM was a
pain to debug). But then roughly a decade later I had need for a C
compiler, and had a still mostly intact C front-end,...

Well, never mind if BGBCC can still sorta compile my script language,
which was in turn influenced by JavaScript and ActionScript (and
following after a Scheme interpreter I had written earlier during my
time in high-school).

Partly this was in turn because in my youth I had also crossed paths
with things like Adobe Flash and similar.

....

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<c167df4f-729e-4aef-ac14-5083b6d2b920n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=31876&group=comp.arch#31876


by: robf...@gmail.com - Wed, 26 Apr 2023 13:56 UTC

On Wednesday, April 26, 2023 at 12:59:32 AM UTC-4, BGB wrote:
> On 4/25/2023 3:10 PM, MitchAlsup wrote:
> > On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
> >> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
> >>> On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
> >>>> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
> >>>>>> The reason you use a 2-bit control scheme comes into play when you want the
> >>>>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
> >>>>>> operand is used to gate the bit-pair.
> >>>> <
> >>>>> ooo niiice,
> >>>> <
> >>>> grasshopper, why does it amaze you so..........
> >>>
> >>> because relatively speaking this is all entirely new to me! :)
> >>> i say "relatively", it's been 4 years...
> >>>
> >> I have been poking around with this stuff for around 7 years now...
> > <
> > If I had only been poking around for 7 years I would only be 10% older...........
> >
> Yeah... More around 18% of my lifespan in my case...
>
>
> Was near the start of my 30s at the start of this project, now I am near
> the end...
>
> And I am using BGBCC as my C compiler, which came into existence (as its
> own entity) when I was in my early 20s, based on fork of a VM I wrote
> back when I was still a teenager (for a language I would later end up
> calling BGBScript); end then shelved for a while, as it turned out C did
> not make for a particularly good scripting language (and the VM was a
> pain to debug). But then roughly a decade later I had need for a C
> compiler, and had a still mostly intact C front-end,...
>
>
> Well, nevermind if BGBCC can still sorta compile my script language,
> which was in turn influenced by JavaScript and ActionScript (and
> following after a Scheme interpreter I had written earlier during my
> time in high-school).
>
> Partly this was in turn because in my youth I had also crossed paths
> with things like Adobe Flash and similar.
>
> ...
I be in my late 50's now. Into computers for 40+ years and I am still
learning new things. It amazes me the kind of knowledge accumulation
required.

Just on the topic of size of OoO cores, they can be fit into a large Artix
FPGA if one is careful and does not have too many varieties of operations. I
have seen a superscalar core fit into as little as 10k LUTs. I have been
experimenting some with scalar OoO CPUs rather than superscalar.
The idea was to hide some of the memory latency. I have found that
memory access seems to be what is limiting the cores a lot of the
time. It does not matter if it is superscalar if it cannot be fed
instructions and data fast enough; it sits idle. OoO memory
operations can really hide latency.

If it takes four clocks to get data from the cache and 30 clocks to get
external data, using an extra clock or two in the CPU core for
processing has a low impact.
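
A rough sketch of the arithmetic behind that last point. The 4-clock and 30-clock figures come from the paragraph above; the hit rate is a made-up input, and treating 30 clocks as the total latency of an external access (rather than a penalty added to the cache latency) is an assumption.

```python
def average_access_clocks(hit_rate, hit_clocks=4, miss_clocks=30):
    """Average clocks per memory access for a given cache hit rate."""
    return hit_rate * hit_clocks + (1.0 - hit_rate) * miss_clocks


def relative_overhead(base_clocks, extra_clocks):
    """Fraction by which extra core clocks lengthen an access of base_clocks."""
    return extra_clocks / base_clocks


if __name__ == "__main__":
    for hit_rate in (1.0, 0.9, 0.5):    # hypothetical hit rates
        avg = average_access_clocks(hit_rate)
        print(f"hit rate {hit_rate:.0%}: {avg:.1f} clocks average, "
              f"+1 core clock adds {relative_overhead(avg, 1):.1%}")
```

The further the average access drifts toward the 30-clock case, the smaller the share taken by an extra clock or two in the core.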

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<2da88315-93f9-454a-8fb7-4661806d3f84n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=31878&group=comp.arch#31878


by: MitchAlsup - Wed, 26 Apr 2023 15:49 UTC

On Wednesday, April 26, 2023 at 8:56:49 AM UTC-5, robf...@gmail.com wrote:
> On Wednesday, April 26, 2023 at 12:59:32 AM UTC-4, BGB wrote:
> > On 4/25/2023 3:10 PM, MitchAlsup wrote:
> > > On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
> > >> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
> > >>> On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
> > >>>> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
> > >>>>>> The reason you use a 2-bit control scheme comes into play when you want the
> > >>>>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
> > >>>>>> operand is used to gate the bit-pair.
> > >>>> <
> > >>>>> ooo niiice,
> > >>>> <
> > >>>> grasshopper, why does it amaze you so..........
> > >>>
> > >>> because relatively speaking this is all entirely new to me! :)
> > >>> i say "relatively", it's been 4 years...
> > >>>
> > >> I have been poking around with this stuff for around 7 years now...
> > > <
> > > If I had only been poking around for 7 years I would only be 10% older..........
> > >
> > Yeah... More around 18% of my lifespan in my case...
> >
> >
> > Was near the start of my 30s at the start of this project, now I am near
> > the end...
> >
> > And I am using BGBCC as my C compiler, which came into existence (as its
> > own entity) when I was in my early 20s, based on fork of a VM I wrote
> > back when I was still a teenager (for a language I would later end up
> > calling BGBScript); end then shelved for a while, as it turned out C did
> > not make for a particularly good scripting language (and the VM was a
> > pain to debug). But then roughly a decade later I had need for a C
> > compiler, and had a still mostly intact C front-end,...
> >
> >
> > Well, nevermind if BGBCC can still sorta compile my script language,
> > which was in turn influenced by JavaScript and ActionScript (and
> > following after a Scheme interpreter I had written earlier during my
> > time in high-school).
> >
> > Partly this was in turn because in my youth I had also crossed paths
> > with things like Adobe Flash and similar.
> >
> > ...
> I be in my late 50's now. Into computers for 40+ years and I am still
> learning new things. It amazes me the kind of knowledge accumulation
> required.
<
Yes, indeed, just about the time one has been exposed to enough of
what computers are made of to be a good computer architect, it is
time to retire.
>
> Just on the topic of size of OoO cores, they can be fit into a large Artix
> FPGA if one is careful and has not too many varieties of operations. I
> have seen a superscalar core fit into as little as 10k LUTs. I have been
> experimenting some with scalar OoO CPUs rather than superscalar.
> The idea was to hide some of the memory latency. I have found that
> memory access seems to be what is limiting the cores a lot of the
> time. It does not matter if it is superscalar if it cannot be fed
> instructions and data fast enough; it sits idle. OoO memory
> operations can really hide latency.
<
I have been watching what happens when one gets 1 constant per
instruction, and where that constant can be of any size. Many
loops can get unrolled, and all of the memory references eliminated,
leaving only a string of FMAC instructions......
<
If you were to compare to a more typical RISC ISA, the more typical RISC
architecture would be executing 2× to 3× more instructions per clock
but take more time to perform the calculation......These extra instructions
are accessing the constants which were directly in the instruction stream
of the atypical ISA.
>
> If it takes four clocks to get data from the cache and 30 clocks to get
> external data, using an extra clock or two in the CPU core for
> processing has a low impact.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u2cs0l$1ovtv$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=31899&group=comp.arch#31899


by: BGB - Thu, 27 Apr 2023 04:00 UTC

On 4/26/2023 8:56 AM, robf...@gmail.com wrote:
> On Wednesday, April 26, 2023 at 12:59:32 AM UTC-4, BGB wrote:
>> On 4/25/2023 3:10 PM, MitchAlsup wrote:
>>> On Tuesday, April 25, 2023 at 2:39:44 PM UTC-5, BGB wrote:
>>>> On 4/25/2023 3:35 AM, luke.l...@gmail.com wrote:
>>>>> On Monday, April 24, 2023 at 10:15:17 PM UTC+1, MitchAlsup wrote:
>>>>>> On Monday, April 24, 2023 at 3:49:55 PM UTC-5, luke.l...@gmail.com wrote:
>>>>>>>> The reason you use a 2-bit control scheme comes into play when you want the
>>>>>>>> individual terms to insert their own carries. ADD8C-like. The carry in from the
>>>>>>>> operand is used to gate the bit-pair.
>>>>>> <
>>>>>>> ooo niiice,
>>>>>> <
>>>>>> grasshopper, why does it amaze you so..........
>>>>>
>>>>> because relatively speaking this is all entirely new to me! :)
>>>>> i say "relatively", it's been 4 years...
>>>>>
>>>> I have been poking around with this stuff for around 7 years now...
>>> <
>>> If I had only been poking around for 7 years I would only be 10% older..........
>>>
>> Yeah... More around 18% of my lifespan in my case...
>>
>>
>> Was near the start of my 30s at the start of this project, now I am near
>> the end...
>>
>> And I am using BGBCC as my C compiler, which came into existence (as its
>> own entity) when I was in my early 20s, based on fork of a VM I wrote
>> back when I was still a teenager (for a language I would later end up
>> calling BGBScript); end then shelved for a while, as it turned out C did
>> not make for a particularly good scripting language (and the VM was a
>> pain to debug). But then roughly a decade later I had need for a C
>> compiler, and had a still mostly intact C front-end,...
>>
>>
>> Well, nevermind if BGBCC can still sorta compile my script language,
>> which was in turn influenced by JavaScript and ActionScript (and
>> following after a Scheme interpreter I had written earlier during my
>> time in high-school).
>>
>> Partly this was in turn because in my youth I had also crossed paths
>> with things like Adobe Flash and similar.
>>
>> ...
> I be in my late 50's now. Into computers for 40+ years and I am still
> learning new things. It amazes me the kind of knowledge accumulation
> required.
>

I haven't been around that long, but don't think I am doing too horribly
at least...

As can be noted, a lot of my early code wasn't particularly good, and it
wasn't really until my 20s that my coding skills "stopped sucking
enough" that I really started making all that much progress.

Sometimes, it seems like my "biggest achievements" were in the past, but
a lot of my code and engineering back then was pretty terrible (and some
of my current tech is still in the shadow of past design missteps).

But, yeah, back when I was in high-school, some things were still pretty
hyped:
XML
Adobe Flash (and Flash animations)
The Web / websites / ...
Intel Itanium
...

I guess, in the wake of both "the dot-com crash" and "the whole 9/11
thing", ...

Java was still popular, but its initial hype was starting to fade. In
this case, .NET was still pretty new, and didn't reach "full hype"
status until a few years later (as seemingly most of the "Java fandom"
started to jump ship over to .NET).

AMD having released x86-64 around this time also meant "the writing was
on the wall" for Itanium; and XML quickly went from being praised to
being despised; ...

Can't say I was entirely immune to some of this...

*cough* me at the time proceeding to implement a JavaScript/ActionScript
clone on top of an XML DOM implementation for the ASTs (with
DOM-tree-walking logic serving as the interpreter).

Yes... Its performance was epic...

Ironically, it was a fork of this that later became BGBCC...

However, given the initially horrible results with my DOM walking script
interpreter, I re-implemented this VM by writing a parser for my JS/AS
variant and then running this on top of the core of the Scheme
interpreter I had written previously (which had initially worked by
using an S-Expression walking design, but was then modified to compile
the cons-lists / S-Expressions into a stack-oriented bytecode).
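
For readers unfamiliar with the change described above (walking S-Expressions directly versus compiling them to a stack-oriented bytecode), here is a toy sketch. It is not the VM from the post: nested Python lists stand in for cons-lists, and the PUSH/CALL opcode names and the three arithmetic operators are hypothetical.

```python
import operator

# Hypothetical operator table for the toy language.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def compile_expr(expr, code=None):
    """Flatten a nested-list S-Expression into a linear list of stack ops."""
    if code is None:
        code = []
    if isinstance(expr, list):
        op, *args = expr
        assert len(args) >= 1, "this sketch requires at least one argument"
        for arg in args:                 # operands are compiled first...
            compile_expr(arg, code)
        code.append(("CALL", op, len(args)))   # ...then the operator consumes them
    else:
        code.append(("PUSH", expr))      # atoms become constants pushed on the stack
    return code


def run(code):
    """Execute the bytecode on a simple operand stack."""
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            _, op, argc = instr
            args = stack[-argc:]         # pop argc operands, in original order
            del stack[-argc:]
            result = args[0]
            for value in args[1:]:
                result = OPS[op](result, value)
            stack.append(result)
    return stack.pop()


if __name__ == "__main__":
    program = ["+", 1, ["*", 2, 3]]
    bytecode = compile_expr(program)
    print(bytecode)
    print(run(bytecode))
```

The tree is walked once at compile time; after that, the interpreter loop only steps through a flat list, which is the usual reason such a design runs faster than re-walking the tree on every evaluation.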

The above VM survived mostly until the early 2010s, when I wrote a new
VM which has instead used ex-nihilo objects as the main AST building
block in opposition to cons-cell lists; and had otherwise moved to a
"fully static typed" core (but keeping dynamic types as an optional
feature). And, the language had migrated from being "more like
ActionScript" to "more like Java" in the name of better performance.

By this point, I had also moved BGBCC's core over to static typing as
well. Its original type-system was more a mimic of C's type-system built
on top of a dynamic type system.

However, BGBCC still retains various "scars" from what it once was (and
is in a way an unbroken line from some of my earliest code).

And, my C variant still retains support for dynamic types, and my ISA
partly ended up being designed around supporting them, because they are
"still useful sometimes".

Sometimes, it seems like I am always living in the shadow of my past self.

I also sit here with an elderly cat who has also been with me through
much of this journey (and nearly half of my lifespan thus far). Sadly,
neither of us will live forever...

> Just on the topic of size of OoO cores, they can be fit into a large Artix
> FPGA if one is careful and has not too many varieties of operations. I
> have seen a superscalar core fit into as little as 10k LUTs. I have been
> experimenting some with scalar OoO CPUs rather than superscalar.
> The idea was to hide some of the memory latency. I have found that
> memory access seems to be what is limiting the cores a lot of the
> time. It does not matter if it is superscalar if it cannot be fed
> instructions and data fast enough; it sits idle. OoO memory
> operations can really hide latency.
>

Yeah. I can currently fit two BJX2 cores into the XC7A200T I am
currently using (and using ~ 67% of the total LUT budget).

But, yeah, by current models (in my emulator) roughly 44% of the
clock-cycles (in Doom) are spent waiting for memory access (the bulk of
this being L2->DRAM).

But, from what I have seen, I suspect my emulator is still a little
optimistic on this front (my Verilog implementations seem to spend more
clock-cycles waiting for RAM).
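
As a back-of-the-envelope check on what that 44% figure implies, here is an Amdahl-style sketch. The 44% comes from the emulator figure above; the function name and the idea of removing only part of the stalls are assumptions for illustration, and the model ignores any overlap between stalls and useful work.

```python
def speedup_if_stalls_removed(stall_fraction, fraction_removed=1.0):
    """Upper-bound speedup when some share of memory-stall cycles disappears.

    stall_fraction:   share of all clock cycles spent waiting on memory.
    fraction_removed: share of those stall cycles that is eliminated.
    """
    assert 0.0 <= stall_fraction < 1.0
    assert 0.0 <= fraction_removed <= 1.0
    remaining = 1.0 - stall_fraction * fraction_removed
    return 1.0 / remaining


if __name__ == "__main__":
    print(f"all stalls removed:  {speedup_if_stalls_removed(0.44):.2f}x")
    print(f"half of them removed: {speedup_if_stalls_removed(0.44, 0.5):.2f}x")
```

If the Verilog implementation really does stall more than the emulator predicts, the stall fraction is higher and the potential gain from a perfect memory system is correspondingly larger.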

In both the emulator and partial simulation, if one assumes a 100% L2
hit rate, Doom is pegged at 30fps (its internal frame-rate limiter, *1),
and GLQuake is "mostly playable" (typically high single-digit or
double-digit frame rates at 50MHz).

In GLQuake, in the "100% L2 hit" scenario, the majority of CPU time is
also spent in the OpenGL backend rather than in the BSP walk and similar
(seemingly "R_RecursiveWorldNode" and similar causing a *lot* of L2
cache misses).

In the "L2 always hits" scenario, it also seems drawing pixel spans
becomes one of the major "hot spots" in terms of clock-cycle budget.

For my newer FPGA board, it is a partial tradeoff as it can have a
bigger L2 cache (512K vs 256K, due to more block RAM), but has slower
DRAM access (due to the higher minimum CAS latency from the DDR3 chip used).
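
The tradeoff above can be framed as a simple comparison of stall cycles per L2 access. All the miss rates and latencies in this sketch are hypothetical placeholders; the post gives only the cache sizes (512K vs 256K), not measured miss rates or DRAM latencies.

```python
def l2_miss_stall_per_access(miss_rate, dram_clocks):
    """Average stall clocks per L2 access caused by going out to DRAM."""
    return miss_rate * dram_clocks


def bigger_cache_wins(old_miss_rate, old_dram_clocks,
                      new_miss_rate, new_dram_clocks):
    """True if the new board spends fewer clocks per access waiting on DRAM."""
    new_cost = l2_miss_stall_per_access(new_miss_rate, new_dram_clocks)
    old_cost = l2_miss_stall_per_access(old_miss_rate, old_dram_clocks)
    return new_cost < old_cost


def break_even_miss_rate(old_miss_rate, old_dram_clocks, new_dram_clocks):
    """Miss rate the bigger cache must beat to offset the slower DRAM."""
    return old_miss_rate * old_dram_clocks / new_dram_clocks


if __name__ == "__main__":
    # Hypothetical: 10% misses at 30 clocks (old) vs 40-clock DRAM (new).
    print(break_even_miss_rate(0.10, 30, 40))
    print(bigger_cache_wins(0.10, 30, 0.06, 40))
```

In other words, doubling the L2 only pays off if it cuts the miss rate by more than the ratio of the two DRAM latencies.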

Doesn't really affect Dhrystone, which already has a pretty much 100% L2
hit rate.

*1: Though one can exclude Doom in the "GUI Mode" from this, as now the
"UpdateWindowStack" code eats a big chunk of CPU time; seemingly L2
cache misses are not the primary cause of its slowness...

By default, the emulator and partial simulation try to model the full
memory hierarchy; mostly to try to keep performance consistent with the
full implementation.

> If it takes four clocks to get data from the cache and 30 clocks to get
> external data, using an extra clock or two in the CPU core for
> processing has a low impact.

OK.

I can note there are reasons my own code is using formats like Binary16
and similar so often...

And TKRA-GL uses an RGB555 framebuffer, a Z16 / Z12.S4 depth buffer, and
typically texture compression. It isn't just about saving RAM; saving
RAM can also help with performance.
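
As an illustration of the kind of compact format meant here, a small sketch of RGB555 packing. The bit layout used (red in bits 14..10, green in 9..5, blue in 4..0, top bit unused) is a common convention and an assumption here; the post does not say which layout TKRA-GL actually uses, and the function names are hypothetical.

```python
def pack_rgb555(r, g, b):
    """Pack 8-bit R, G, B into 16 bits, keeping the top 5 bits of each channel."""
    return ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3)


def unpack_rgb555(pixel):
    """Expand a 16-bit RGB555 pixel back to approximate 8-bit channels."""
    r5 = (pixel >> 10) & 0x1F
    g5 = (pixel >> 5) & 0x1F
    b5 = pixel & 0x1F
    # Replicate the high bits into the low bits so 31 maps back to 255.
    return ((r5 << 3) | (r5 >> 2),
            (g5 << 3) | (g5 >> 2),
            (b5 << 3) | (b5 >> 2))


if __name__ == "__main__":
    pixel = pack_rgb555(255, 128, 0)
    print(hex(pixel), unpack_rgb555(pixel))
```

At 2 bytes per pixel instead of 4, each cache line and each DRAM burst carries twice as many pixels, which is where the performance benefit comes from.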

Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	luke.l...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	luke.l...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	luke.l...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	robf...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it	BGB
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	Quadibloc
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	luke.l...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	luke.l...@gmail.com
Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?	MitchAlsup