Re: Design a better 16 or 32 bit processor

<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20432&group=comp.arch#20432

X-Received: by 2002:a0c:d804:: with SMTP id h4mr1796636qvj.37.1631570152770;
Mon, 13 Sep 2021 14:55:52 -0700 (PDT)
X-Received: by 2002:a4a:e3cf:: with SMTP id m15mr11059047oov.21.1631570152508;
Mon, 13 Sep 2021 14:55:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 13 Sep 2021 14:55:52 -0700 (PDT)
In-Reply-To: <shocru$pl0$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:49b:5002:45eb:aaea;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:49b:5002:45eb:aaea
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
Subject: Re: Design a better 16 or 32 bit processor
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 13 Sep 2021 21:55:52 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 153
 by: MitchAlsup - Mon, 13 Sep 2021 21:55 UTC

On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
> Brett <gg...@yahoo.com> wrote:
> > John Levine <jo...@taugh.com> wrote:
> >> According to MitchAlsup <Mitch...@aol.com>:
> >>>> Sounds a lot like a PDP-11.
> >>> <
> >>> Which died because its successor was too hard to pipeline.
> >>
> >> The PDP-11 was a really good design for the late 1960s, when memory
> >> was starting to be affordable and microcode ROM was still a lot faster
> >> than core.
> >>
> >> The VAX was also a good design for the late 1960s but unfortunately it
> >> was introduced in the late 1970s. It wasn't just that it was too hard
> >> to pipeline.
> >
> >> The instruction set was full of overcomplicated
> >> microcoded instructions that were often slower than a sequence of
> >> simple instructions,
> >
> > A myth that was promoted by RISC proponents, which has been debunked.
<
> An alternative or addition is wide packed instructions with chaining.
> The example of wide packed is Itanic, but of course they did it wrong.
>
> You would go 256 bits wide and support a variable number of instructions in
> the packet and by supporting chaining you save 10 bits of register
> specifiers for each chain segment, minus a bit to indicate the chain link.
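
The arithmetic in the quoted paragraph can be sketched as follows. This is only an illustration under stated assumptions: 5-bit register specifiers (32 registers), with a chain link letting the producer omit its destination field and the consumer omit one source field. The function name and field widths are hypothetical and not taken from any real ISA.

```python
def chain_bits_saved(reg_bits: int = 5, link_flag_bits: int = 1) -> int:
    """Net encoding bits saved by one chain link.

    Assumption: the producer drops its destination specifier and the
    consumer drops one source specifier; a flag bit marks the link.
    """
    omitted = 2 * reg_bits  # destination field + one source field
    return omitted - link_flag_bits


# With 32 registers the two omitted fields total 10 bits, less 1 flag bit.
print(chain_bits_saved())
```

With 16 registers (4-bit specifiers) the same sketch gives a smaller saving, so the benefit of chaining depends on how wide the register fields are to begin with.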
<
Why is there any romance associated with power of 2 widths in support of
wide {Fetch-Decode-Execute} ?
<
4 in 128-bits
6 in 192-bits
8 in 256-bits
10 in 320-bits
.....
<
It seems to me that powers of 2 in byte count is an artificial boundary that
is not good for the ISA in general.
<
Also this seriously hinders building little implementations and big implementations
at the same time, being semi-optimal only for a few implementations in the middle.
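
The list above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes fixed 32-bit instructions (which is what the 4-in-128 figure implies); the helper names are made up for illustration.

```python
INSN_BITS = 32  # assumption: fixed 32-bit instructions, as 4-in-128 implies


def fetch_width_bits(insns_per_fetch: int, insn_bits: int = INSN_BITS) -> int:
    """Fetch width needed to deliver this many instructions per cycle."""
    return insns_per_fetch * insn_bits


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


for n in (4, 6, 8, 10):
    bits = fetch_width_bits(n)
    # Report whether the byte count of this fetch width is a power of two.
    print(n, bits, is_power_of_two(bits // 8))
```

Only the 4-wide and 8-wide cases land on a power-of-two byte count; the 6-wide and 10-wide cases do not, which is the point being made about artificial boundaries.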
>
> You would use heads and tails encoding and support jumps into and out of
> the packet. Chaining alone should give you the best instruction density and
> add some micro coded instructions and you should get dominating instruction
> density that makes manufacturers take a serious look at your offerings.
<
My 66000 ISA is getting near x86-64 instruction density without any of this.
>
> A variable width instruction set can support chaining today by just adding
> the instructions, I am perplexed as to why no one has.
<
Why don't you give it a go and see what comes out ?
>
> A new architecture needs a hook to get noticed and dominating instruction
> density is one way to get that notice.
<
My guess is that lower context switch overhead would garner more wins
than instruction density; for example, an Ada call to an entry accept point
in a different address space costing only 12 cycles.
<
> > RISC only made sense for a decade back in the ancient history of the
> > 1980’s.
<
RISC made sense in the brief interval when 32-bit ISAs were too complicated
to all be on 1 chip. By shedding the area of the microcode, one got the space
to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
got 1 instruction every clock (less cache miss latency).
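
To put rough numbers on the paragraph above, here is a small sketch of the cycles-per-instruction arithmetic. The 4-6 cycle and 1-per-clock figures come from the post; the miss rate and miss penalty are made-up illustrative values, and the function names are hypothetical.

```python
def effective_cpi(base_cpi: float,
                  misses_per_insn: float = 0.0,
                  miss_penalty_cycles: float = 0.0) -> float:
    """Base cycles per instruction plus average stall cycles from cache misses."""
    return base_cpi + misses_per_insn * miss_penalty_cycles


def speedup(old_cpi: float, new_cpi: float) -> float:
    """Speedup at equal clock rate and equal instruction count."""
    return old_cpi / new_cpi


microcoded = effective_cpi(5.0)            # midpoint of "every 4-6 cycles"
pipelined = effective_cpi(1.0, 0.02, 10)   # assumed: 2% misses, 10-cycle penalty
print(speedup(microcoded, pipelined))
```

The comparison holds instruction count constant, which flatters the pipelined design if it needs more instructions for the same work.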
> >
> > Today if I wanted to build a better 16 or 32 bit processor the first step
> > would be to find what micro coded instructions I could add to improve
> > instruction density, and thus win the lowest cost war.
<
In my case (My 66000) the biggest code density benefit was in creating
ENTER and EXIT instructions, second best was giving every instruction
access to any width immediate.
<
Secondarily, I doubt the 16-bit market is in the search for a new architecture.
> >
> > The 8086 with hard coded registers was quite good for the era, but we can
> > do better today, by micro coding much more complicated sequences.
<
At a different level, transcendental instructions in My 66000 are microcoded
IN the FUNCTION UNIT (not in the Decoder). The Function unit goes "busy"
until the result is ready to be delivered. SIN() COS() Ln() EXP() take 19 cycles,
about the latency of an FDIV.
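
The "function unit goes busy" behaviour can be modelled with a toy scoreboard. The 19-cycle latency is the figure from the post; the class, its method names, and the assumption that the unit is not pipelined are illustrative only and say nothing about how My 66000 is actually built.

```python
class FunctionUnit:
    """Toy model of a non-pipelined unit that is busy until its result is ready."""

    def __init__(self) -> None:
        self.busy_until = 0  # first cycle at which the unit is free again

    def issue(self, cycle: int, latency: int) -> int:
        """Issue an operation at `cycle`; return the cycle its result is delivered."""
        start = max(cycle, self.busy_until)  # wait if the unit is still busy
        self.busy_until = start + latency
        return self.busy_until


fu = FunctionUnit()
first = fu.issue(cycle=0, latency=19)   # a SIN() issued at cycle 0
second = fu.issue(cycle=5, latency=19)  # a COS() issued while the unit is busy
print(first, second)
```

In this model only operations that need the same unit wait; the decoder is not involved in sequencing the operation.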
> >
> > The first instruction I would add is a one instruction memcpy loop, which
> > would use three hard coded registers to make the instruction short. The
> > data register would not be visible so that I could use vector registers if
> > I wanted. And there would be several variants for copy size and alignment.
> > And a bit to decide if the count is part of the instruction.
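
The single-instruction copy loop proposed in the quoted paragraph can be sketched behaviourally. This is a model under assumptions, not a definition of any real instruction: the register names `src`, `dst` and `cnt` are hypothetical, and the copy is byte-wise and forward only. A familiar precedent is x86 `rep movsb`, which uses RSI, RDI and RCX implicitly.

```python
def memcpy_insn(mem: bytearray, regs: dict) -> None:
    """Model of a one-instruction copy loop using three implicit registers.

    The registers are updated after every byte, so the instruction could be
    interrupted and resumed. No architectural data register is involved,
    which leaves an implementation free to move wider chunks internally.
    """
    while regs["cnt"] > 0:
        mem[regs["dst"]] = mem[regs["src"]]
        regs["src"] += 1
        regs["dst"] += 1
        regs["cnt"] -= 1


mem = bytearray(b"abcdefgh" + bytes(8))
regs = {"src": 0, "dst": 8, "cnt": 8}
memcpy_insn(mem, regs)
print(mem[8:16], regs)
```

Because all three registers are implied by the opcode, the encoding needs no register specifier fields at all, which is where the code size saving would come from.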
> >
> > Another instruction I would add is add plus store, etc.
> >
> > The case for load plus add is harder, and might not make the cut due to
> > transistor implementation cost outweighing memory transistor savings.
> > Load compare and branch is in the same boat and has to be added to the
> > total system cost of load compute.
> >
> > I seriously think a new 16 or 32 bit processor with micro coded
> > instructions could win market share, by simple expedient of smaller total
> > size with code included.
<
Not a single RISC architecture got a single design win due to code density !!
> >
> > My template is a pipelined 386 with 32 registers and far more complex and
> > longer micro code instructions.
> >
> > There is a major company that went for an updated 386, but did not improve
> > anything besides instruction encoding and minor fixes. Failed to go big and
> > so failed, go big or go home.
> >
> >> and they somehow didn't notice that memory was
> >> getting a lot cheaper, with a super-dense super-general instruction
> >> set intended for assembler programmers, and tiny 512 byte pages that
> >> even at the time were obviously too small.

Re: Design a better 16 or 32 bit processor

<shoo1r$h8b$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20433&group=comp.arch#20433

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Mon, 13 Sep 2021 23:49:15 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <shoo1r$h8b$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com>
<shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<shog6o$nvj$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 13 Sep 2021 23:49:15 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="208133be679912524de33a5b71bab2cc";
logging-data="17675"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19DrvXFOcfIkHHiLb2rpTMJ"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:z0tI+yxapo7BwnNOpTZlblElt6A=
sha1:LPDouxXqiCF+9adndeTjGYHGahM=
 by: Brett - Mon, 13 Sep 2021 23:49 UTC

Ivan Godard <ivan@millcomputing.com> wrote:
> On 9/13/2021 1:38 PM, Brett wrote:
>> Brett <ggtgp@yahoo.com> wrote:
>>> John Levine <johnl@taugh.com> wrote:
>>>> According to MitchAlsup <MitchAlsup@aol.com>:
>>>>>> Sounds a lot like a PDP-11.
>>>>> <
>>>>> Which died because its successor was too hard to pipeline.
>>>>
>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>> than core.
>>>>
>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>> to pipeline.
>>>
>>>> The instruction set was full of overcomplicated
>>>> microcoded instructions that were often slower than a sequence of
>>>> simple instructions,
>>>
>>> A myth that was promoted by RISC proponents, which has been debunked.
>>
>> An alternative or addition is wide packed instructions with chaining.
>> The example of wide packed is Itanic, but of course they did it wrong.
>>
>> You would go 256 bits wide and support a variable number of instructions in
>> the packet and by supporting chaining you save 10 bits of register
>> specifiers for each chain segment, minus a bit to indicate the chain link.
>>
>> You would use heads and tails encoding and support jumps into and out of
>> the packet. Chaining alone should give you the best instruction density and
>> add some micro coded instructions and you should get dominating instruction
>> density that makes manufacturers take a serious look at your offerings.
>>
>> A variable width instruction set can support chaining today by just adding
>> the instructions, I am perplexed as to why no one has.
>
> And how is this different from Mill belt?

Mill is BETTER than RISC with chaining. ;)

We are discussing current state of the art, not your leap forward. ;)

>> A new architecture needs a hook to get noticed and dominating instruction
>> density is one way to get that notice.
>>
>

Re: Design a better 16 or 32 bit processor

<shoqnc$ub2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20434&group=comp.arch#20434

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 00:34:52 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 158
Message-ID: <shoqnc$ub2$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com>
<shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 00:34:52 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="208133be679912524de33a5b71bab2cc";
logging-data="31074"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18RSXIwnGmBJWZ17hb8upJE"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:wCULqNqvPRxrlpaIT/YK5G3P+Pw=
sha1:GNcZ9c9iSYWld8JbC+ZzsveWBHM=
 by: Brett - Tue, 14 Sep 2021 00:34 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>> Brett <gg...@yahoo.com> wrote:
>>> John Levine <jo...@taugh.com> wrote:
>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>> Sounds a lot like a PDP-11.
>>>>> <
>>>>> Which died because its successor was too hard to pipeline.
>>>>
>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>> than core.
>>>>
>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>> to pipeline.
>>>
>>>> The instruction set was full of overcomplicated
>>>> microcoded instructions that were often slower than a sequence of
>>>> simple instructions,
>>>
>>> A myth that was promoted by RISC proponents, which has been debunked.
> <
>> An alternative or addition is wide packed instructions with chaining.
>> The example of wide packed is Itanic, but of course they did it wrong.
>>
>> You would go 256 bits wide and support a variable number of instructions in
>> the packet and by supporting chaining you save 10 bits of register
>> specifiers for each chain segment, minus a bit to indicate the chain link.
> <
> Why is there any romance associated with power of 2 widths in support of
> wide {Fetch-Decode-Execute} ?
> <
> 4 in 128-bits
> 6 in 192-bits
> 8 in 256-bits
> 10 in 320-bits
> ....
> <
> It seems to me that powers of 2 in byte count is an artificial boundary that
> is not good for the ISA in general.
> <
> Also this seriously hinders building little implementations and big implementations
> at the same time, being semi-optimal only for a few implementations in the middle.

Agree.

>> You would use heads and tails encoding and support jumps into and out of
>> the packet. Chaining alone should give you the best instruction density and
>> add some micro coded instructions and you should get dominating instruction
>> density that makes manufacturers take a serious look at your offerings.
> <
> My 66000 ISA is getting near x86-64 instruction density without any of this.

Um, x86-64 is not much better than RISC and sometimes worse depending on
how much floating point you do as that adds another byte to each
instruction.

If ARM Cortex/Thumb2 crushes you on density then you are in trouble.

>> A variable width instruction set can support chaining today by just adding
>> the instructions, I am perplexed as to why no one has.
> <
> Why don't you give it a go and see what comes out ?

Chaining opcodes is complex homework that every company should have taken a
look at, including you; the results should be somewhere on the internet.

Getting the RISC guys to go variable width was worse than pulling teeth.
Threats of firings and resignations were involved at ARM, and the MIPS
founder did fire people for such suggestions, though most were weeded out
at hiring interviews, leading to brain-dead groupthink that killed the
company when the market changed.

Adding a chaining register dependency is maybe 10 times worse in these
people's minds.

>> A new architecture needs a hook to get noticed and dominating instruction
>> density is one way to get that notice.
> <
> My guess is that lower context switch overhead would garner more wins
> than instruction density; for example, an Ada call to an entry accept point
> in a different address space costing only 12 cycles.

Sounds good.

> <
>>> RISC only made sense for a decade back in the ancient history of the
>>> 1980’s.
> <
> RISC made sense in the brief interval when 32-bit ISAs were too complicated
> to all be on 1 chip. By shedding the area of the microcode, one got the space
> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
> got 1 instruction every clock (less cache miss latency).

Agree.

>>> Today if I wanted to build a better 16 or 32 bit processor the first step
>>> would be to find what micro coded instructions I could add to improve
>>> instruction density, and thus win the lowest cost war.
> <
> In my case (My 66000) the biggest code density benefit was in creating
> ENTER and EXIT instructions, second best was giving every instruction
> access to any width immediate.
> <
> Secondarily, I doubt the 16-bit market is in the search for a new architecture.

The 16 bit market has the worst choices to pick from; the problem is that
these devices are cheap, so everyone ignores this market. Definition of
opportunity.

In the 32 bit market you have to compete with RISC-V which is free. Sure it
is crap compared to the new architectures discussed here, but it’s free.

Pick your poison. ;)

>>> The 8086 with hard coded registers was quite good for the era, but we can
>>> do better today, by micro coding much more complicated sequences.
> <
> At a different level, transcendental instructions in My 66000 are microcoded
> IN the FUNCTION UNIT (not in the Decoder). The Function unit goes "busy"
> until the result is ready to be delivered. SIN() COS() Ln() EXP() take 19 cycles,
> about the latency of an FDIV.
>>>
>>> The first instruction I would add is a one instruction memcpy loop, which
>>> would use three hard coded registers to make the instruction short. The
>>> data register would not be visible so that I could use vector registers if
>>> I wanted. And there would be several variants for copy size and alignment.
>>> And a bit to decide if the count is part of the instruction.
>>>
>>> Another instruction I would add is add plus store, etc.
>>>
>>> The case for load plus add is harder, and might not make the cut due to
>>> transistor implementation cost outweighing memory transistor savings.
>>> Load compare and branch is in the same boat and has to be added to the
>>> total system cost of load compute.
>>>
>>> I seriously think a new 16 or 32 bit processor with micro coded
>>> instructions could win market share, by simple expedient of smaller total
>>> size with code included.
> <
> Not a single RISC architecture got a single design win due to code density !!
>>>
>>> My template is a pipelined 386 with 32 registers and far more complex and
>>> longer micro code instructions.
>>>
>>> There is a major company that went for an updated 386, but did not improve
>>> anything besides instruction encoding and minor fixes. Failed to go big and
>>> so failed, go big or go home.
>>>
>>>> and they somehow didn't notice that memory was
>>>> getting a lot cheaper, with a super-dense super-general instruction
>>>> set intended for assembler programmers, and tiny 512 byte pages that
>>>> even at the time were obviously too small.
>

Re: Design a better 16 or 32 bit processor

<shore9$20n$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20435&group=comp.arch#20435

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 00:47:05 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 162
Message-ID: <shore9$20n$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com>
<shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 00:47:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="208133be679912524de33a5b71bab2cc";
logging-data="2071"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+kAmBx67krsIUJmCq4dimI"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:Z7n3ArSxwgIH825rzXMtjNsWJy8=
sha1:n/pE92/hv6pSG/hJNh7rAG6NcsU=
 by: Brett - Tue, 14 Sep 2021 00:47 UTC

Brett <ggtgp@yahoo.com> wrote:
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>> Brett <gg...@yahoo.com> wrote:
>>>> John Levine <jo...@taugh.com> wrote:
>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>> Sounds a lot like a PDP-11.
>>>>>> <
>>>>>> Which died because its successor was too hard to pipeline.
>>>>>
>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>> than core.
>>>>>
>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>> to pipeline.
>>>>
>>>>> The instruction set was full of overcomplicated
>>>>> microcoded instructions that were often slower than a sequence of
>>>>> simple instructions,
>>>>
>>>> A myth that was promoted by RISC proponents, which has been debunked.
>> <
>>> An alternative or addition is wide packed instructions with chaining.
>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>
>>> You would go 256 bits wide and support a variable number of instructions in
>>> the packet and by supporting chaining you save 10 bits of register
>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>> <
>> Why is there any romance associated with power of 2 widths in support of
>> wide {Fetch-Decode-Execute} ?
>> <
>> 4 in 128-bits
>> 6 in 192-bits
>> 8 in 256-bits
>> 10 in 320-bits
>> ....
>> <
>> It seems to me that powers of 2 in byte count is an artificial boundary that
>> is not good for the ISA in general.
>> <
>> Also this seriously hinders building little implementations and big implementations
>> at the same time, being semi-optimal only for a few implementations in the middle.
>
> Agree.
>
>>> You would use heads and tails encoding and support jumps into and out of
>>> the packet. Chaining alone should give you the best instruction density and
>>> add some micro coded instructions and you should get dominating instruction
>>> density that makes manufacturers take a serious look at your offerings.
>> <
>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>
> Um, x86-64 is not much better than RISC and sometimes worse depending on
> how much floating point you do as that adds another byte to each
> instruction.
>
> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>
>>> A variable width instruction set can support chaining today by just adding
>>> the instructions, I am perplexed as to why no one has.
>> <
>> Why don't you give it a go and see what comes out ?
>
> Chaining opcodes is complex homework that every company should have taken a
> look at, including you; the results should be somewhere on the internet.
>
> Getting the RISC guys to go variable width was worse than pulling teeth.
> Threats of firings and resignations were involved at ARM, and the MIPS
> founder did fire people for such suggestions, though most were weeded out
> at hiring interviews, leading to brain-dead groupthink that killed the
> company when the market changed.
>
> Adding a chaining register dependency is maybe 10 times worse in these
> people's minds.
>
>>> A new architecture needs a hook to get noticed and dominating instruction
>>> density is one way to get that notice.
>> <
>> My guess is that lower context switch overhead would garner more wins
>> than instruction density; for example, an Ada call to an entry accept point
>> in a different address space costing only 12 cycles.
>
> Sounds good.
>
>> <
>>>> RISC only made sense for a decade back in the ancient history of the
>>>> 1980’s.
>> <
>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
>> to all be on 1 chip. By shedding the area of the microcode, one got the space
>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
>> got 1 instruction every clock (less cache miss latency).
>
> Agree.
>
>>>> Today if I wanted to build a better 16 or 32 bit processor the first step
>>>> would be to find what micro coded instructions I could add to improve
>>>> instruction density, and thus win the lowest cost war.
>> <
>> In my case (My 66000) the biggest code density benefit was in creating
>> ENTER and EXIT instructions, second best was giving every instruction
>> access to any width immediate.
>> <
>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>
> The 16 bit market has the worst choices to pick from; the problem is that
> these devices are cheap, so everyone ignores this market. Definition of
> opportunity.
>
> In the 32 bit market you have to compete with RISC-V which is free. Sure it
> is crap compared to the new architectures discussed here, but it’s free.
>
> Pick your poison. ;)

The 64 bit market is where the 32 bit market was in the pre-Thumb days.
There is a market window between the free-but-crap RISC-V and ARM64 for a
Thumb2-style architecture, which My 66000 fits into.
Of course you could get crushed if/when ARM does a Thumb implementation in
response to you.

>>>> The 8086 with hard coded registers was quite good for the era, but we can
>>>> do better today, by micro coding much more complicated sequences.
>> <
>> At a different level, transcendental instructions in My 66000 are microcoded
>> IN the FUNCTION UNIT (not in the Decoder). The Function unit goes "busy"
>> until the result is ready to be delivered. SIN() COS() Ln() EXP() take 19 cycles,
>> about the latency of an FDIV.
>>>>
>>>> The first instruction I would add is a one instruction memcpy loop, which
>>>> would use three hard coded registers to make the instruction short. The
>>>> data register would not be visible so that I could use vector registers if
>>>> I wanted. And there would be several variants for copy size and alignment.
>>>> And a bit to decide if the count is part of the instruction.
>>>>
>>>> Another instruction I would add is add plus store, etc.
>>>>
>>>> The case for load plus add is harder, and might not make the cut due to
>>>> transistor implementation cost outweighing memory transistor savings.
>>>> Load compare and branch is in the same boat and has to be added to the
>>>> total system cost of load compute.
>>>>
>>>> I seriously think a new 16 or 32 bit processor with micro coded
>>>> instructions could win market share, by simple expedient of smaller total
>>>> size with code included.
>> <
>> Not a single RISC architecture got a single design win due to code density !!
>>>>
>>>> My template is a pipelined 386 with 32 registers and far more complex and
>>>> longer micro code instructions.
>>>>
>>>> There is a major company that went for an updated 386, but did not improve
>>>> anything besides instruction encoding and minor fixes. Failed to go big and
>>>> so failed, go big or go home.
>>>>
>>>>> and they somehow didn't notice that memory was
>>>>> getting a lot cheaper, with a super-dense super-general instruction
>>>>> set intended for assembler programmers, and tiny 512 byte pages that
>>>>> even at the time were obviously too small.

Re: vs/pascal

<878s00c4qi.fsf@localhost>


https://www.novabbs.com/devel/article-flat.php?id=20437&group=comp.arch#20437

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: lyn...@garlic.com (Anne & Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: vs/pascal
Date: Mon, 13 Sep 2021 15:42:29 -1000
Organization: Wheeler&Wheeler
Lines: 51
Message-ID: <878s00c4qi.fsf@localhost>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<2021Sep12.141107@mips.complang.tuwien.ac.at>
<shlc98$q3q$1@gal.iecc.com> <87a6kh3arw.fsf@localhost>
<shltt3$24e4$1@gal.iecc.com> <87y281ksfo.fsf@localhost>
<shmpfd$1egb$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="d78e6361a746a7d0785346552323974a";
logging-data="18663"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ICabBTKc/QE68ajyWUq+58FhLQUeNOkg="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:mDGKCGDozGWQPtM/VoqkcCTtfZE=
sha1:KsAnA27N3ZL0Up9wIP+SmFgb7uE=
 by: Anne & Lynn Whee - Tue, 14 Sep 2021 01:42 UTC

Terje Mathisen <terje.mathisen@tmsw.no> writes:
> Wow!
>
> I once used that vs/pascal to implement large packet Kermit, using up
> to a full page (25x80=2000 bytes) of a 3270 emulator as the packet
> size, instead of just a single 80-byte line.
>
> The result was file transfers running at the same speed as the 3270/PC
> (or the /AT 286 version) while using 3270 protocol emulators to allow
> ascii serial port connections.
>
> I remember that Pascal version as being quite nice. :-)
>
> Terje

the first IBM mainframe tcp/ip stack was implemented in VS/pascal
... but the communication group was fighting hard to prevent its release
... when they lost that, they changed their strategy and claimed that
since they had corporate strategic ownership for everything that crossed
datacenter walls ... it had to be released through them. What shipped
got 44kbyte/sec aggregate using a 3090 processor. I then did RFC1044
support and in some tuning tests at Cray Research between 4341 and Cray
... got sustained 4341 channel throughput using only a modest amount of
4341 processor (something like 500 times improvement in bytes moved per
instruction executed).

after leaving IBM ... during its trouble and recovery years ... IBM was
being reorganized into the 13 baby blues in preparation for breaking up
the company ... article gone behind paywall, but still mostly free at
wayback machine
http://web.archive.org/web/20101120231857/http://www.time.com/time/magazine/article/0,9171,977353,00.html

then board brings in new CEO that reverses the breakup ... who also cuts
a lot of stuff. Much of VLSI design tools were being given away to major
industry standard VLSI design tool company ... but condition was that
since much of the industry ran on SUN ... they had to all be ported to
SUN. We had already left the company but get a contract to port a Los
Gatos, 50,000 vs/pascal statement physical layout app to SUN.

In retrospect instead of converting to SUN pascal, it would have been
simpler to rewrite it in C ... it seemed that SUN pascal may have never
been used for much other than simple educational instruction. It was
easy to drop into SUN hdqtrs to discuss problems ... but it didn't do
much good since pascal support had been outsourced to an organization on
the opposite side of the world (let's say rocket scientists ... later I got
some uniform insignias that said "space trooper" and "space command"
... not in English)

--
virtualization experience starting Jan1968, online at home since Mar1970

Re: Design a better 16 or 32 bit processor

<caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20438&group=comp.arch#20438

X-Received: by 2002:ac8:6145:: with SMTP id d5mr2353473qtm.197.1631584145335;
Mon, 13 Sep 2021 18:49:05 -0700 (PDT)
X-Received: by 2002:a05:6808:10c8:: with SMTP id s8mr9877671ois.175.1631584145021;
Mon, 13 Sep 2021 18:49:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 13 Sep 2021 18:49:04 -0700 (PDT)
In-Reply-To: <shoqnc$ub2$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:49b:5002:45eb:aaea;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:49b:5002:45eb:aaea
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com> <shoqnc$ub2$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com>
Subject: Re: Design a better 16 or 32 bit processor
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 14 Sep 2021 01:49:05 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 177
 by: MitchAlsup - Tue, 14 Sep 2021 01:49 UTC

On Monday, September 13, 2021 at 7:34:55 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
> >> Brett <gg...@yahoo.com> wrote:
> >>> John Levine <jo...@taugh.com> wrote:
> >>>> According to MitchAlsup <Mitch...@aol.com>:
> >>>>>> Sounds a lot like a PDP-11.
> >>>>> <
> >>>>> Which died because its successor was too hard to pipeline.
> >>>>
> >>>> The PDP-11 was a really good design for the late 1960s, when memory
> >>>> was starting to be affordable and microcode ROM was still a lot faster
> >>>> than core.
> >>>>
> >>>> The VAX was also a good design for the late 1960s but unfortunately it
> >>>> was introduced in the late 1970s. It wasn't just that it was too hard
> >>>> to pipeline.
> >>>
> >>>> The instruction set was full of overcomplicated
> >>>> microcoded instructions that were often slower than a sequence of
> >>>> simple instructions,
> >>>
> >>> A myth that was promoted by RISC proponents, which has been debunked.
> > <
> >> An alternative or addition is wide packed instructions with chaining.
> >> The example of wide packed is Itanic, but of course they did it wrong.
> >>
> >> You would go 256 bits wide and support a variable number of instructions in
> >> the packet and by supporting chaining you save 10 bits of register
> >> specifiers for each chain segment, minus a bit to indicate the chain link.
> > <
> > Why is there any romance associated with power of 2 widths in support of
> > wide {Fetch-Decode-Execute} ?
> > <
> > 4 in 128-bits
> > 6 in 192-bits
> > 8 in 256-bits
> > 10 in 320-bits
> > ....
> > <
> > It seems to me that powers of 2 in byte count is an artificial boundary that
> > is not good for the ISA in general.
> > <
> > Also this seriously hinders building little implementation and big implementations
> > at the same time, being semi-optimal only for a few implementations in the middle.
> Agree.
> >> You would use heads and tails encoding and support jumps into and out of
> >> the packet. Chaining alone should give you the best instruction density and
> >> add some micro coded instructions and you should get dominating instruction
> >> density that makes manufacturers take a serious look at your offerings.
> > <
> > My 66000 ISA is getting near x86-64 instruction density without any of this.
> Um, x86-64 is not much better than RISC and sometimes worse depending on
> how much floating point you do as that adds another byte to each
> instruction.
>
> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
> >> A variable width instruction set can support chaining today by just adding
> >> the instructions, I am perplexed as to why no one has.
> > <
> > Why don't you give it a go and see what comes out ?
> Chaining opcodes is complex homework that every company should have taken a
> look at, including you, results should be somewhere on the internet.
>
> Getting the RISC guys to go variable width was worse than pulling teeth.
<
It should not have been.
<
I figured it out in the first several weeks of starting an x86-64 microarchitecture.
Variable length instruction sets have a value that you cannot ignore, although
I generally state this in the form that "No instructions should be used in pasting
bits together" rather than VLI.
<
> Threats of firings and resignations were involved at ARM, and the MIPS
> founder did fire people for such suggestions, though most were weeded out
> at hiring interviews leading to brain dead group think that killed the
> company when the market changed.
<
I was not hired by HP back in the Snake days because I was willing to consider
a pipeline where LDs could "short circuit" some piece of the instruction pipeline.
<
I understand the group think going on.
>
> Adding a chaining register dependency is maybe 10 times worse in these
> people's minds.
<
I had specific chaining in my GPU ISA. Overall, it was probably not necessary
and might have been easier to use 5-bit comparators.
<
> >> A new architecture needs a hook to get noticed and dominating instruction
> >> density is one way to get that notice.
> > <
> > My guess is that lower context switch overhead would garner more wins
> > than instruction density; for example, an ADA call to an entry accept point
> > in a different address space costing only 12 cycles.
<
> Sounds good.
> > <
> >>> RISC only made sense for a decade back in the ancient history of the
> >>> 1980’s.
> > <
> > RISC made sense in the brief interval when 32-bit ISAs were too complicated
> > to all be on 1 chip. By shedding the area of the microcode, one got the space
> > to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
> > got 1 instruction every clock (less cache miss latency).
> Agree.
> >>> Today if I wanted to build a better 16 or 32 bit processor the first step
> >>> would be to find what micro coded instructions I could add to reduce
> >>> instruction density, and thus win the lowest cost war.
> > <
> > In My case (My 66000) the biggest code density benefit was in creating
> > ENTER and EXIT instructions, second best was giving every instruction
> > access to any width immediate.
> > <
> > Secondarily, I doubt the 16-bit market is in the search for a new architecture.
<
> The 16 bit market has the worst choices to pick from, the problem is that
> these devices are cheap, so everyone ignores this market. Definition of
> opportunity.
<
The problem in not ignoring this market is that you have to pay for the design
team at $0.01 per chip sold. And since a design team is going to cost you
on-the-order-of $50M/yr, you better be selling a crap-ton(R) of chips.
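To put a number on that, here is a back-of-the-envelope sketch in C. The function name and the choice to work in whole cents are illustrative assumptions, not part of any real costing model. With $0.01 per chip going toward the design team and roughly $50M/yr of team cost, break-even comes out at about 5 billion chips per year.

```c
#include <assert.h>
#include <stdint.h>

/* Chips per year needed to cover a design team's annual cost when each
   chip sold contributes a fixed number of cents toward that cost.
   Works in whole cents to avoid floating-point rounding.
   (Hypothetical helper, for illustration only.) */
uint64_t chips_needed_per_year(uint64_t team_cost_dollars,
                               uint64_t cents_per_chip)
{
    assert(cents_per_chip > 0);
    uint64_t cost_cents = team_cost_dollars * 100u;
    /* Round up: a partial chip still has to be a whole sale. */
    return (cost_cents + cents_per_chip - 1) / cents_per_chip;
}
```

Calling `chips_needed_per_year(50000000, 1)` gives 5,000,000,000, which is the scale of volume the argument above is pointing at.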
>
> In the 32 bit market you have to compete with RISC-V which is free. Sure it
> is crap compared to the new architectures discussed here, but it’s free.
<
My 66000 is/will be available for free.
>
> Pick your poison. ;)

Re: Design a better 16 or 32 bit processor

<jwv7dfjewbd.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=20439&group=comp.arch#20439

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Mon, 13 Sep 2021 22:18:17 -0400
Organization: A noiseless patient Spider
Lines: 13
Message-ID: <jwv7dfjewbd.fsf-monnier+comp.arch@gnu.org>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
<caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="c87ecc51e3d4a78e6fe598ae42400d27";
logging-data="28086"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19CwUpH9iP1NgRMWbU3KiLD"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:a0/7INLDYPdrkOeYjfEH3ch1lRE=
sha1:nfAR0R8feoWCrTEUjajjyNZ/8qs=
 by: Stefan Monnier - Tue, 14 Sep 2021 02:18 UTC

>> The 16 bit market has the worst choices to pick from, the problem is that
>> these devices are cheap, so everyone ignores this market. Definition of
>> opportunity.
> The problem in not ignoring this market is that you have to pay for the design
> team at $0.01 per chip sold. And since a design team is going to cost you
> on-the-order-of $50M/yr, you better be selling a crap-ton(R) of chips.

I do wonder: I can see a market for really tiny 8bit CPUs, but how big
is the market for 16bit CPUs (i.e. those that need more than what 8bit
CPUs can offer but for which a tiny 32bit CPU is already too expensive)?

Stefan

Re: Design a better 16 or 32 bit processor

<shp20d$5ed$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20440&group=comp.arch#20440

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 02:39:09 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 139
Message-ID: <shp20d$5ed$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com>
<shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
<caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 02:39:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="208133be679912524de33a5b71bab2cc";
logging-data="5581"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RkV7ly/K44F/Bna9c8Ows"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:pWSjABInZNs+POJo11O5kRLKP7o=
sha1:IMoNawSgxQikP9t/zqEegReY+8w=
 by: Brett - Tue, 14 Sep 2021 02:39 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Monday, September 13, 2021 at 7:34:55 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>>> Brett <gg...@yahoo.com> wrote:
>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>> <
>>>>>>> Which died because its successor was too hard to pipeline.
>>>>>>
>>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>>> than core.
>>>>>>
>>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>>> to pipeline.
>>>>>
>>>>>> The instruction set was full of overcomplicated
>>>>>> microcoded instructions that were often slower than a sequence of
>>>>>> simple instructions,
>>>>>
>>>>> A myth that was promoted by RISC proponents, which has been debunked.
>>> <
>>>> An alternative or addition is wide packed instructions with chaining.
>>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>>
>>>> You would go 256 bits wide and support a variable number of instructions in
>>>> the packet and by supporting chaining you save 10 bits of register
>>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>>> <
>>> Why is there any romance associated with power of 2 widths in support of
>>> wide {Fetch-Decode-Execute} ?
>>> <
>>> 4 in 128-bits
>>> 6 in 192-bits
>>> 8 in 256-bits
>>> 10 in 320-bits
>>> ....
>>> <
>>> It seems to me that powers of 2 in byte count is an artificial boundary that
>>> is not good for the ISA in general.
>>> <
>>> Also this seriously hinders building little implementation and big implementations
>>> at the same time, being semi-optimal only for a few implementations in the middle.
>> Agree.
>>>> You would use heads and tails encoding and support jumps into and out of
>>>> the packet. Chaining alone should give you the best instruction density and
>>>> add some micro coded instructions and you should get dominating instruction
>>>> density that makes manufacturers take a serious look at your offerings.
>>> <
>>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>> Um, x86-64 is not much better than RISC and sometimes worse depending on
>> how much floating point you do as that adds another byte to each
>> instruction.
>>
>> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>>>> A variable width instruction set can support chaining today by just adding
>>>> the instructions, I am perplexed as to why no one has.
>>> <
>>> Why don't you give it a go and see what comes out ?
>> Chaining opcodes is complex homework that every company should have taken a
>> look at, including you, results should be somewhere on the internet.
>>
>> Getting the RISC guys to go variable width was worse than pulling teeth.
> <
> It should not have been.
> <
> I figured it out in the first several weeks of starting an x86-64 microarchitecture.
> Variable length instruction sets have a value that you cannot ignore, although
> I generally state this in the form that "No instructions should be used in pasting
> bits together" rather than VLI.
> <
>> Threats of firings and resignations were involved at ARM, and the MIPS
>> founder did fire people for such suggestions, though most were weeded out
>> at hiring interviews leading to brain dead group think that killed the
>> company when the market changed.
> <
> I was not hired by HP back in the Snake days because I was willing to consider
> a pipeline where LDs could "short circuit" some piece of the instruction pipeline.
> <
> I understand the group think going on.
>>
>> Adding a chaining register dependency is maybe 10 times worse in these
>> people's minds.
> <
> I had specific chaining in my GPU ISA. Overall, it was probably not necessary
> and might have been easier to use 5-bit comparators.

Did you fix the compiler's annoying habit of interleaving loads and math?
It is hard to chain two adds when there is a needless load in between.

>>>> A new architecture needs a hook to get noticed and dominating instruction
>>>> density is one way to get that notice.
>>> <
>>> My guess is that lower context switch overhead would garner more wins
>>> than instruction density; for example, an ADA call to an entry accept point
>>> in a different address space costing only 12 cycles.
> <
>> Sounds good.
>>> <
>>>>> RISC only made sense for a decade back in the ancient history of the
>>>>> 1980’s.
>>> <
>>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
>>> to all be on 1 chip. By shedding the area of the microcode, one got the space
>>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
>>> got 1 instruction every clock (less cache miss latency).
>> Agree.
>>>>> Today if I wanted to build a better 16 or 32 bit processor the first step
>>>>> would be to find what micro coded instructions I could add to reduce
>>>>> instruction density, and thus win the lowest cost war.
>>> <
>>> In My case (My 66000) the biggest code density benefit was in creating
>>> ENTER and EXIT instructions, second best was giving every instruction
>>> access to any width immediate.
>>> <
>>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
> <
>> The 16 bit market has the worst choices to pick from, the problem is that
>> these devices are cheap, so everyone ignores this market. Definition of
>> opportunity.
> <
> The problem in not ignoring this market is that you have to pay for the design
> team at $0.01 per chip sold. And since a design team is going to cost you
> on-the-order-of $50M/yr, you better be selling a crap-ton(R) of chips.
>>
>> In the 32 bit market you have to compete with RISC-V which is free. Sure it
>> is crap compared to the new architectures discussed here, but it’s free.
> <
> My 66000 is/will be available for free.
>>
>> Pick your poison. ;)
>
>

Re: Design a better 16 or 32 bit processor

<shpieu$mo5$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20442&group=comp.arch#20442

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 02:18:36 -0500
Organization: A noiseless patient Spider
Lines: 349
Message-ID: <shpieu$mo5$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 07:19:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a87b7b2e948ad6ef68b288d5fab3588d";
logging-data="23301"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX181xZWv7HKswBqENvUOUJrV"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:tVKxp1J0oGScuruL4vJrlO5kc08=
In-Reply-To: <shoqnc$ub2$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 14 Sep 2021 07:18 UTC

On 9/13/2021 7:34 PM, Brett wrote:
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>> Brett <gg...@yahoo.com> wrote:
>>>> John Levine <jo...@taugh.com> wrote:
>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>> Sounds a lot like a PDP-11.
>>>>>> <
>>>>>> Which died because its successor was too hard to pipeline.
>>>>>
>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>> than core.
>>>>>
>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>> to pipeline.
>>>>
>>>>> The instruction set was full of overcomplicated
>>>>> microcoded instructions that were often slower than a sequence of
>>>>> simple instructions,
>>>>
>>>> A myth that was promoted by RISC proponents, which has been debunked.
>> <
>>> An alternative or addition is wide packed instructions with chaining.
>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>
>>> You would go 256 bits wide and support a variable number of instructions in
>>> the packet and by supporting chaining you save 10 bits of register
>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>> <
>> Why is there any romance associated with power of 2 widths in support of
>> wide {Fetch-Decode-Execute} ?
>> <
>> 4 in 128-bits
>> 6 in 192-bits
>> 8 in 256-bits
>> 10 in 320-bits
>> ....
>> <
>> It seems to me that powers of 2 in byte count is an artificial boundary that
>> is not good for the ISA in general.
>> <
>> Also this seriously hinders building little implementation and big implementations
>> at the same time, being semi-optimal only for a few implementations in the middle.
>
> Agree.
>
>>> You would use heads and tails encoding and support jumps into and out of
>>> the packet. Chaining alone should give you the best instruction density and
>>> add some micro coded instructions and you should get dominating instruction
>>> density that makes manufacturers take a serious look at your offerings.
>> <
>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>
> Um, x86-64 is not much better than RISC and sometimes worse depending on
> how much floating point you do as that adds another byte to each
> instruction.
>
> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>

IME, in terms of code density x86-64 seems to be pretty weak. Though, it
varies some, for example, "gcc -Os" on Linux does somewhat better than MSVC.

In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
A64 seems to do a bit worse here than Thumb2.

As a small example, I had recently hacked together a small voxel based
3D engine along vaguely similar lines to Minecraft Classic for the BJX2
ISA (*1).

A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
build is 175K. Not strictly apples-to-apples, but still...

*1: Its renderer basically sweeps across the screen doing ray-casts,
building up a list of any blocks hit by a ray-cast, and then drawing the
list of blocks (via software-rasterized OpenGL). It has minimal overdraw
(because a raycast will not pass through a wall), but only really works
effectively at small draw distances.

Its performance still manages to be somehow less awful than I originally
imagined (framerates are a little better than Quake, albeit at a 24
block draw-distance). Ended up using a color-fill sky, mostly as this
improves framerate somewhat if compared with drawing a skybox.

This is still with me discovering and occasionally fixing "crazy bad"
compiler bugs, eg:
Turns out the compiler was very-frequently trying to cast-convert
operands of binary operators to the destination type even when they were
the same (resulting in a lot of extra register MOVs, spills, ...).

Eg, if you did something like:
  int a, b, c;
  c=a+b;
It often tended to compile it as if it were:
  c=(int)a+(int)b;

Which at the ASM level would, instead of, say:
  ADDS.L R8, R9, R14
Result in something like:
  MOV R8, R25
  MOV R9, R28
  ADDS.L R25, R28, R14
And would also result in higher register pressure and a larger number of
spills.
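
A minimal sketch of the kind of check that avoids this, assuming a hypothetical front end with a small `Type` enum (none of these names come from the actual BJX2 compiler, and real conversion rules are more involved): only emit a conversion when the operand type really differs from the destination type.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type tags; a real compiler has a far richer type system. */
typedef enum { TY_INT, TY_LONG, TY_FLOAT, TY_DOUBLE, TY_COUNT } Type;

/* A conversion (and its extra MOV) is needed only when the types differ. */
bool needs_convert(Type src, Type dst)
{
    assert(src < TY_COUNT && dst < TY_COUNT);
    return src != dst;
}

/* Instructions emitted for "dst = a op b" when same-type casts are skipped:
   one for the operator, plus one per operand that really needs converting. */
int binop_insn_count(Type a, Type b, Type dst)
{
    return 1 + (needs_convert(a, dst) ? 1 : 0)
             + (needs_convert(b, dst) ? 1 : 0);
}

/* The buggy behaviour described above: always convert both operands,
   even when they already have the destination type. */
int binop_insn_count_buggy(Type a, Type b, Type dst)
{
    (void)a; (void)b; (void)dst;
    return 3;
}
```

For the all-int case, the fixed count is 1 (the single ADDS.L) and the buggy count is 3 (two MOVs plus the ADDS.L), matching the listings above.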

Fixing this bug gave a roughly 4% reduction in the size of binaries, and
a roughly 20% increase in performance for Doom and similar. This also
caused Dhrystone score to increase from ~51.3k to ~57.1k.

It seemed this was also related to a lot of cases where, say:
  c=a+imm;
Was resulting in things like:
  MOV Imm, R6
  MOV R7, R9
  ADDS.L R12, R9, R13
Rather than, say:
  ADDS.L R12, Imm, R13

...

Then, relatedly, noted that, eg:
  y=x&255;
Was being compiled sorta like:
  MOV R11, R7
  EXTU.B R7, R7
  MOV R7, R28
  MOV R28, R14
Vs, say:
  EXTU.B R11, R14

Turns out this was stumbling on some logic for a "stale" code-path,
where early on, the compiler would handle operators more like:
  Allocate scratch registers;
  Load frame variables into scratch registers;
  Apply operator to scratch registers;
  Store result back to call frame;
  Free said scratch registers.

But, this was later replaced with:
  Fetch the variables as registers;
  Operate on these registers;
  Release the registers.

After I had switched over, trying to load/store a frame variable in the
old way would typically result in register MOVs rather than an actual
memory load/store. These older paths have not been entirely eliminated
though.

But, yeah, fixing these appears to have slightly reduced the level of
"general awfulness" in my C compiler output.

Then added another slight compiler tweak which got it up to ~57.9k
(namely, caching and reusing struct-field loads in certain cases).

Was able to push it up to ~59.0k by assuming less-conservative
semantics ("strict aliasing"), but I decided against enabling this by
default as it seems unsafe. This mostly affects the conditions under which
the cached struct field would be discarded.

At present, it has certain restrictions:
* Does not cross a basic-block boundary;
* Discarded if either of the cached variables is modified;
* Discarded if any sort of explicit memory store happens;
* ...

But, what it will do, is compile an expression like:
  y=foo->x*foo->x;
As if it were:
  t0=foo->x;
  y=t0*t0;

Though, this optimization does not appear to have any real effect on
Doom and similar.
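
The restrictions listed above can be sketched as a one-entry cache. This is only an illustration under my own assumptions: the `FieldCache` layout, the function names, and reading "either of the cached variables" as the base pointer and the temporary are all hypothetical, not the compiler's real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-entry cache of the most recent struct-field load (hypothetical). */
typedef struct {
    bool valid;
    int  base_var;   /* variable holding the struct pointer, e.g. foo */
    int  field_off;  /* byte offset of the field, e.g. offset of x    */
    int  temp_var;   /* variable holding the loaded value, e.g. t0    */
} FieldCache;

/* Called at every basic-block boundary and on any explicit memory store. */
void cache_discard(FieldCache *c)
{
    assert(c != NULL);
    c->valid = false;
}

/* Called when a variable is written: discard if it is a cached variable. */
void cache_on_var_write(FieldCache *c, int var)
{
    assert(c != NULL);
    if (c->valid && (var == c->base_var || var == c->temp_var))
        c->valid = false;
}

/* Returns the variable that holds base->field. Reuses the cached temporary
   on a hit; otherwise counts a new load into fresh_temp and caches it. */
int cache_load_field(FieldCache *c, int base, int off,
                     int fresh_temp, int *loads)
{
    assert(c != NULL && loads != NULL);
    if (c->valid && c->base_var == base && c->field_off == off)
        return c->temp_var;
    (*loads)++;
    c->valid = true;
    c->base_var = base;
    c->field_off = off;
    c->temp_var = fresh_temp;
    return fresh_temp;
}

/* Loads emitted for y = foo->x * foo->x, optionally with a store between
   the two field reads. */
int loads_for_square(bool store_between)
{
    enum { VAR_FOO = 1, VAR_T0 = 2, VAR_T1 = 3, OFF_X = 0 };
    FieldCache c = { false, 0, 0, 0 };
    int loads = 0;
    (void)cache_load_field(&c, VAR_FOO, OFF_X, VAR_T0, &loads);
    if (store_between)
        cache_discard(&c);
    (void)cache_load_field(&c, VAR_FOO, OFF_X, VAR_T1, &loads);
    return loads;
}
```

With no intervening store the expression needs a single load. With a store in between, the conservative rule forces a second load, which is the cost that the "strict aliasing" assumption would sometimes remove.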

It is possible a similar trick could be used for array loads or pointer
derefs.

I also went and recently optionally re-added the "FMOV.S" instruction
(Memory Load/Store combined with a Single<->Double conversion), since
this should be able to help some for code which works with
single-precision floating point values (avoids some common penalty cases).

...

>>> A variable width instruction set can support chaining today by just adding
>>> the instructions, I am perplexed as to why no one has.
>> <
>> Why don't you give it a go and see what comes out ?
>
> Chaining opcodes is complex homework that every company should have taken a
> look at, including you, results should be somewhere on the internet.
>
> Getting the RISC guys to go variable width was worse than pulling teeth.
> Threats of firings and resignations were involved at ARM, and the MIPS
> founder did fire people for such suggestions, though most were weeded out
> at hiring interviews leading to brain dead group think that killed the
> company when the market changed.
>
> Adding a chaining register dependency is maybe 10 times worse in these
> people's minds.
>

It is a balance, to some extent.

16/32, by looking at a few bits, is OK.
Decoding a bundle based on also looking at a few bits and daisy-chaining
is also OK.

Fully variable length encodings which depend on looking at lots of
different bits are less OK.
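
As an illustration of the "few bits" case, here is a small sketch in C. It borrows the RISC-V compressed-extension convention as an assumption (low two bits of the first 16-bit parcel equal to 0b11 means a 32-bit instruction, anything else means 16-bit) and ignores longer encodings. It is not the BJX2 rule, which is not given here, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the instruction starting at this 16-bit parcel,
   decided from just the two low bits (assumed RISC-V-style rule). */
size_t insn_length_bytes(uint16_t first_parcel)
{
    return ((first_parcel & 0x3u) == 0x3u) ? 4 : 2;
}

/* Walk a buffer of 16-bit parcels and count whole instructions.
   Stops early if the last instruction is truncated. */
size_t count_insns(const uint16_t *parcels, size_t n_parcels)
{
    assert(parcels != NULL || n_parcels == 0);
    size_t i = 0, count = 0;
    while (i < n_parcels) {
        size_t len = insn_length_bytes(parcels[i]) / 2; /* in parcels */
        if (i + len > n_parcels)
            break;              /* truncated final instruction */
        i += len;
        count++;
    }
    return count;
}
```

The point is that the length of each instruction depends on a fixed, tiny field of its first parcel, so a decoder can find instruction boundaries without interpreting opcodes. Encodings where the length depends on many scattered bits lose that property.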

>>> A new architecture needs a hook to get noticed and dominating instruction
>>> density is one way to get that notice.
>> <
>> My guess is that lower context switch overhead would garner more wins
>> than instruction density; for example, an ADA call to an entry accept point
>> in a different address space costing only 12 cycles.
>
> Sounds good.
>
>> <
>>>> RISC only made sense for a decade back in the ancient history of the
>>>> 1980’s.
>> <
>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
>> to all be on 1 chip. By shedding the area of the microcode, one got the space
>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
>> got 1 instruction every clock (less cache miss latency).
>
> Agree.
>


Re: Design a better 16 or 32 bit processor

<shpkkc$1mdo$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=20443&group=comp.arch#20443

Path: i2pn2.org!i2pn.org!aioe.org!ppYixYMWAWh/woI8emJOIQ.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 09:56:59 +0200
Organization: Aioe.org NNTP Server
Message-ID: <shpkkc$1mdo$1@gioia.aioe.org>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="55736"; posting-host="ppYixYMWAWh/woI8emJOIQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.9
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 14 Sep 2021 07:56 UTC

Brett wrote:
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> In My case (My 66000) the biggest code density benefit was in creating
>> ENTER and EXIT instructions, second best was giving every instruction
>> access to any width immediate.
>> <
>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>
> The 16 bit market has the worst choices to pick from, the problem is that
> these devices are cheap, so everyone ignores this market. Definition of
> opportunity.

How cheap must a 32-bit cpu be in order to make 16-bit irrelevant?

I was blown away by a conference talk by "Bunnie" Huang several years
ago, telling us that _every_ single USB memory stick contains a 32-bit
cpu, this is what allows the manufacturers to sell all the flash memory
chips they make, no matter how good or bad it turned out: The embedded
cpu does all the testing/verification/remapping/bad block flagging etc
and reports back "This is an 8GB stick", probably also with some
speed/quality markers to allow it to be sold to different markets/price
points.
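
As a rough sketch of the remapping step described above, the following C function builds a logical-to-physical block map that skips blocks flagged bad and returns the usable count, which is what the reported capacity would be derived from. All names here are hypothetical, and a real controller also handles wear leveling, ECC, and blocks that go bad later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

/* Build a logical->physical map that skips blocks flagged bad.
   Returns the number of usable blocks (the capacity to report).
   Unused tail entries of the map are marked INVALID_BLOCK. */
size_t build_block_map(const uint8_t *is_bad, size_t n_physical,
                       uint32_t *map) {
    size_t logical = 0;
    for (size_t phys = 0; phys < n_physical; phys++) {
        if (!is_bad[phys])
            map[logical++] = (uint32_t)phys;
    }
    for (size_t i = logical; i < n_physical; i++)
        map[i] = INVALID_BLOCK;
    return logical;
}
```

With six physical blocks of which two are bad, the function reports four usable blocks, and the host only ever sees logical blocks 0 to 3.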

I can only assume that the cost of those 32-bit memory stick cpus has
gone even further down by now.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Design a better 16 or 32 bit processor

<d9e4b6b9-41de-4227-bb6e-f5924dc65c06n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20449&group=comp.arch#20449

X-Received: by 2002:a05:6214:20eb:: with SMTP id 11mr6694614qvk.52.1631640013786;
Tue, 14 Sep 2021 10:20:13 -0700 (PDT)
X-Received: by 2002:a9d:ecc:: with SMTP id 70mr16154287otj.96.1631640013446;
Tue, 14 Sep 2021 10:20:13 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 14 Sep 2021 10:20:13 -0700 (PDT)
In-Reply-To: <shp20d$5ed$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:19d7:d2c:4498:62d8;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:19d7:d2c:4498:62d8
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com> <shoqnc$ub2$1@dont-email.me>
<caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com> <shp20d$5ed$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d9e4b6b9-41de-4227-bb6e-f5924dc65c06n@googlegroups.com>
Subject: Re: Design a better 16 or 32 bit processor
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 14 Sep 2021 17:20:13 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 202
 by: MitchAlsup - Tue, 14 Sep 2021 17:20 UTC

On Monday, September 13, 2021 at 9:39:12 PM UTC-5, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > On Monday, September 13, 2021 at 7:34:55 PM UTC-5, gg...@yahoo.com wrote:
> >> MitchAlsup <Mitch...@aol.com> wrote:
> >>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
> >>>> Brett <gg...@yahoo.com> wrote:
> >>>>> John Levine <jo...@taugh.com> wrote:
> >>>>>> According to MitchAlsup <Mitch...@aol.com>:
> >>>>>>>> Sounds a lot like a PDP-11.
> >>>>>>> <
> >>>>>>> Which died because its successor was too hard to pipeline.
> >>>>>>
> >>>>>> The PDP-11 was a really good design for the late 1960s, when memory
> >>>>>> was starting to be affordable and microcode ROM was still a lot faster
> >>>>>> than core.
> >>>>>>
> >>>>>> The VAX was also a good design for the late 1960s but unfortunately it
> >>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
> >>>>>> to pipeline.
> >>>>>
> >>>>>> The instruction set was full of overcomplicated
> >>>>>> microcoded instructions that were often slower than a sequence of
> >>>>>> simple instructions,
> >>>>>
> >>>>> A myth that was promoted by RISC proponents, which has been debunked.
> >>> <
> >>>> An alternative or addition is wide packed instructions with chaining.
> >>>> The example of wide packed is Itanic, but of course they did it wrong.
> >>>>
> >>>> You would go 256 bits wide and support a variable number of instructions in
> >>>> the packet and by supporting chaining you save 10 bits of register
> >>>> specifiers for each chain segment, minus a bit to indicate the chain link.
> >>> <
> >>> Why is there any romance associated with power of 2 widths in support of
> >>> wide {Fetch-Decode-Execute} ?
> >>> <
> >>> 4 in 128-bits
> >>> 6 in 192-bits
> >>> 8 in 256-bits
> >>> 10 in 320-bits
> >>> ....
> >>> <
> >>> It seems to me that powers of 2 in byte count is an artificial boundary that
> >>> is not good for the ISA in general.
> >>> <
> >>> Also this seriously hinders building little implementations and big implementations
> >>> at the same time, being semi-optimal only for a few implementations in the middle.
> >> Agree.
> >>>> You would use heads and tails encoding and support jumps into and out of
> >>>> the packet. Chaining alone should give you the best instruction density and
> >>>> add some micro coded instructions and you should get dominating instruction
> >>>> density that makes manufacturers take a serious look at your offerings.
> >>> <
> >>> My 66000 ISA is getting near x86-64 instruction density without any of this.
> >> Um, x86-64 is not much better than RISC and sometimes worse depending on
> >> how much floating point you do as that adds another byte to each
> >> instruction.
> >>
> >> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
> >>>> A variable width instruction set can support chaining today by just adding
> >>>> the instructions, I am perplexed as to why no one has.
> >>> <
> >>> Why don't you give it a go and see what comes out ?
> >> Chaining opcodes is complex homework that every company should have taken a
> >> look at, including you, results should be somewhere on the internet.
> >>
> >> Getting the RISC guys to go variable width was worse than pulling teeth.
> > <
> > It should not have been.
> > <
> > I figured it out in the first several weeks of starting an x86-64 microarchitecture.
> > Variable length instruction sets have a value that you cannot ignore, although
> > I generally state this in the form that "No instructions should be used in pasting
> > bits together" rather than VLI.
> > <
> >> Threats of firings and resignations were involved at ARM, and the MIPS
> >> founder did fire people for such suggestions, though most were weeded out
> >> at hiring interviews leading to brain dead group think that killed the
> >> company when the market changed.
> > <
> > I was not hired by HP back in the Snake days because I was willing to consider
> > a pipeline where LDs could "short circuit" some piece of the instruction pipeline.
> > <
> > I understand the group think going on.
> >>
> >> Adding a chaining register dependency is maybe 10 times worse in these
> >> people's minds.
> > <
> > I had specific chaining in my GPU ISA. Overall, it was probably not necessary
> > and might have been easier to use 5-bit comparators.
<
> Did you fix the compilers annoying habit of interleaving loads and math,
> hard to chain two adds when there is a needless load in between.
<
Reservation stations do this automagically. {And get rid of the need to explicitly
chain one instruction to another.} And to a large extent, so do scoreboards.
<
It is only the statically scheduled architectures that have the problem you suggest.
<
> >>>> A new architecture needs a hook to get noticed and dominating instruction
> >>>> density is one way to get that notice.
> >>> <
> >>> My guess is that lower context switch overhead would garner more wins
> >>> than instruction density; for example, an ADA call to an entry accept point
> >>> in a different address space costing only 12 cycles.
> > <
> >> Sounds good.
> >>> <
> >>>>> RISC only made sense for a decade back in the ancient history of the
> >>>>> 1980’s.
> >>> <
> >>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
> >>> to all be on 1 chip. By shedding the area of the microcode, one got the space
> >>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
> >>> got 1 instruction every clock (less cache miss latency).
> >> Agree.
> >>>>> Today if I wanted to build a better 16 or 32 bit processor the first step
> >>>>> would be to find what micro coded instructions I could add to improve
> >>>>> instruction density, and thus win the lowest cost war.
> >>> <
> >>> In My case (My 66000) the biggest code density benefit was in creating
> >>> ENTER and EXIT instructions, second best was giving every instruction
> >>> access to any width immediate.
> >>> <
> >>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
> > <
> >> The 16 bit market has the worst choices to pick from, the problem is that
> >> these devices are cheap, so everyone ignores this market. Definition of
> >> opportunity.
> > <
> > The problem in not ignoring this market is that you have to pay for the design
> > team at $0.01 per chip sold. And since a design team is going to cost you
> > on-the-order-of $50M/yr, you better be selling a crap-ton(R) of chips.
> >>
> >> In the 32 bit market you have to compete with RISC-V which is free. Sure it
> >> is crap compared to the new architectures discussed here, but it’s free.
> > <
> > My 66000 is/will be available for free.
> >>
> >> Pick your poison. ;)
> >
> >

Re: Design a better 16 or 32 bit processor

<3b32faed-f670-46a7-b706-2787a72f7e83n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20451&group=comp.arch#20451

X-Received: by 2002:a05:622a:15d0:: with SMTP id d16mr5748484qty.185.1631640398667;
Tue, 14 Sep 2021 10:26:38 -0700 (PDT)
X-Received: by 2002:a05:6830:b96:: with SMTP id a22mr15716825otv.282.1631640398421;
Tue, 14 Sep 2021 10:26:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 14 Sep 2021 10:26:38 -0700 (PDT)
In-Reply-To: <shpieu$mo5$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:19d7:d2c:4498:62d8;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:19d7:d2c:4498:62d8
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com> <shoqnc$ub2$1@dont-email.me>
<shpieu$mo5$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3b32faed-f670-46a7-b706-2787a72f7e83n@googlegroups.com>
Subject: Re: Design a better 16 or 32 bit processor
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 14 Sep 2021 17:26:38 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 422
 by: MitchAlsup - Tue, 14 Sep 2021 17:26 UTC

On Tuesday, September 14, 2021 at 2:20:01 AM UTC-5, BGB wrote:
> On 9/13/2021 7:34 PM, Brett wrote:
> > MitchAlsup <Mitch...@aol.com> wrote:
> >> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
> >>> Brett <gg...@yahoo.com> wrote:
> >>>> John Levine <jo...@taugh.com> wrote:
> >>>>> According to MitchAlsup <Mitch...@aol.com>:
> >>>>>>> Sounds a lot like a PDP-11.
> >>>>>> <
> >>>>>> Which died because its successor was too hard to pipeline.
> >>>>>
> >>>>> The PDP-11 was a really good design for the late 1960s, when memory
> >>>>> was starting to be affordable and microcode ROM was still a lot faster
> >>>>> than core.
> >>>>>
> >>>>> The VAX was also a good design for the late 1960s but unfortunately it
> >>>>> was introduced in the late 1970s. It wasn't just that it was too hard
> >>>>> to pipeline.
> >>>>
> >>>>> The instruction set was full of overcomplicated
> >>>>> microcoded instructions that were often slower than a sequence of
> >>>>> simple instructions,
> >>>>
> >>>> A myth that was promoted by RISC proponents, which has been debunked.
> >> <
> >>> An alternative or addition is wide packed instructions with chaining.
> >>> The example of wide packed is Itanic, but of course they did it wrong.
> >>>
> >>> You would go 256 bits wide and support a variable number of instructions in
> >>> the packet and by supporting chaining you save 10 bits of register
> >>> specifiers for each chain segment, minus a bit to indicate the chain link.
> >> <
> >> Why is there any romance associated with power of 2 widths in support of
> >> wide {Fetch-Decode-Execute} ?
> >> <
> >> 4 in 128-bits
> >> 6 in 192-bits
> >> 8 in 256-bits
> >> 10 in 320-bits
> >> ....
> >> <
> >> It seems to me that powers of 2 in byte count is an artificial boundary that
> >> is not good for the ISA in general.
> >> <
> >> Also this seriously hinders building little implementations and big implementations
> >> at the same time, being semi-optimal only for a few implementations in the middle.
> >
> > Agree.
> >
> >>> You would use heads and tails encoding and support jumps into and out of
> >>> the packet. Chaining alone should give you the best instruction density and
> >>> add some micro coded instructions and you should get dominating instruction
> >>> density that makes manufacturers take a serious look at your offerings.
> >> <
> >> My 66000 ISA is getting near x86-64 instruction density without any of this.
> >
> > Um, x86-64 is not much better than RISC and sometimes worse depending on
> > how much floating point you do as that adds another byte to each
> > instruction.
> >
> > If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
> >
> IME, in terms of code density x86-64 seems to be pretty weak. Though, it
> varies some, for example, "gcc -Os" on Linux does somewhat better than MSVC.
>
> In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
> A64 seems to do a bit worse here than Thumb2.
>
>
>
> As a small example, I had recently hacked together a small voxel based
> 3D engine along vaguely similar lines to Minecraft Classic for the BJX2
> ISA (*1).
>
> A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
> build is 175K. Not strictly apples-to-apples, but still...
>
>
> *1: Its renderer basically sweeps across the screen doing ray-casts,
> building up a list of any blocks hit by a ray-cast, and then drawing the
> list of blocks (via software-rasterized OpenGL). It has minimal overdraw
> (because a raycast will not pass through a wall), but only really works
> effectively at small draw distances.
>
> Its performance still manages to be somehow less awful than I originally
> imagined (framerates are a little better than Quake, albeit at a 24
> block draw-distance). Ended up using a color-fill sky, mostly as this
> improves framerate somewhat if compared with drawing a skybox.
>
>
>
> This is still with me discovering and occasionally fixing "crazy bad"
> compiler bugs, eg:
> Turns out the compiler was very-frequently trying to cast-convert
> operands of binary operators to the destination type even when they were
> the same (resulting in a lot of extra register MOVs, spills, ...).
>
> Eg, if you did something like:
> int a, b, c;
> c=a+b;
> It was tending to often compile it like it were:
> c=(int)a+(int)b;
>
> Which at the ASM level would, instead of, say:
> ADDS.L R8, R9, R14
> Result in something like:
> MOV R8, R25
> MOV R9, R28
> ADDS.L R25, R28, R14
> And would also result in higher register pressure and a larger number of
> spills.
>
> Fixing this bug gave a roughly 4% reduction in the size of binaries, and
> a roughly 20% increase in performance for Doom and similar. This also
> caused Dhrystone score to increase from ~51.3k to ~57.1k.
>
> It seemed this was also related to a lot of cases where, say:
> c=a+imm;
> Was resulting in things like:
> MOV Imm, R6
> MOV R7, R9
> ADDS.L R12, R9, R13
> Rather than, say:
> ADDS.L R12, Imm, R13
>
> ...
>
>
> Then, relatedly, noted that, eg:
> y=x&255;
> Was being compiled sorta like:
> MOV R11, R7
> EXTU.B R7, R7
> MOV R7, R28
> MOV R28, R14
> Vs, say:
> EXTU.B R11, R14
>
> Turns out this was stumbling on some logic for a "stale" code-path,
> where early on, the compiler would handle operators more like:
> Allocate scratch registers;
> Load frame variables into scratch registers;
> Apply operator to scratch registers;
> Store result back to call frame;
> Free said scratch registers.
>
> But, this was later replaced with:
> Fetch the variables as registers;
> Operate on these registers;
> Release the registers.
>
> After I had switched over, trying to load/store a frame variable in this
> way would typically result in a register MOV rather than an actual
> memory load/store. These older paths have not been entirely eliminated
> though.
>
> But, yeah, fixing these appears to have slightly reduced the level of
> "general awfulness" in my C compiler output.
>
>
>
> Then added another slight compiler tweak which got it up to ~57.9k
> (namely, caching and reusing struct-field loads in certain cases).
>
> Was able to push it up to ~ 59.0k by assuming less-conservative
> semantics ("strict aliasing"), but I decided against enabling this by
> default as it seems unsafe. This mostly affects under which conditions
> the cached struct field would be discarded.
>
> At present, it has certain restrictions:
> * Does not cross a basic-block boundary;
> * Discarded if either of the cached variables is modified;
> * Discarded if any sort of explicit memory store happens;
> * ...
>
> But, what it will do, is compile an expression like:
> y=foo->x*foo->x;
> As if it were:
> t0=foo->x;
> y=t0*t0;
>
> Though, this optimization does not appear to have any real effect on
> Doom and similar.
>
>
> It is possible a similar trick could be used for array loads or pointer
> derefs.
>
> I also went and recently optionally re-added the "FMOV.S" instruction
> (Memory Load/Store combined with a Single<->Double conversion), since
> this should be able to help some for code which works with
> single-precision floating point values (avoids some common penalty cases).
>
> ...
> >>> A variable width instruction set can support chaining today by just adding
> >>> the instructions, I am perplexed as to why no one has.
> >> <
> >> Why don't you give it a go and see what comes out ?
> >
> > Chaining opcodes is complex homework that every company should have taken a
> > look at, including you, results should be somewhere on the internet.
> >
> > Getting the RISC guys to go variable width was worse than pulling teeth.
> > Threats of firings and resignations were involved at ARM, and the MIPS
> > founder did fire people for such suggestions, though most were weeded out
> > at hiring interviews leading to brain dead group think that killed the
> > company when the market changed.
> >
> > Adding a chaining register dependency is maybe 10 times worse in these
> > people's minds.
> >
> It is balance, some.
>
> 16/32, by looking at a few bits, is OK.
> Decoding a bundle based on also looking at a few bits and daisy-chaining
> is also OK.
>
> Fully variable length encodings which depend on looking at lots of
> different bits are less OK.
<
Which is why all of the "stuff" necessary to perform an instruction is found in the first
word of the My 66000 instruction; the variable remainder is for constants (immediates
and displacements).
<
> >>> A new architecture needs a hook to get noticed and dominating instruction
> >>> density is one way to get that notice.
> >> <
> >> My guess is that lower context switch overhead would garner more wins
> >> than instruction density; for example, an ADA call to an entry accept point
> >> in a different address space costing only 12 cycles.
> >
> > Sounds good.
> >
> >> <
> >>>> RISC only made sense for a decade back in the ancient history of the
> >>>> 1980’s.
> >> <
> >> RISC made sense in the brief interval when 32-bit ISAs were too complicated
> >> to all be on 1 chip. By shedding the area of the microcode, one got the space
> >> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
> >> got 1 instruction every clock (less cache miss latency).
> >
> > Agree.
> >
> FWIW: In theory, my BJX2 core can do ~ 2 or 3 instructions per cycle.
> Though, actual "real-world" results tend to be closer to 0.3 to 0.5 ...
>
> A lot of this is due to cache misses, interlock penalties, and my
> compiler mostly failing to bundle instructions.
>
> Can generally get better results with ASM though.
> >>>> Today if I wanted to build a better 16 or 32 bit processor the first step
> >>>> would be to find what micro coded instructions I could add to improve
> >>>> instruction density, and thus win the lowest cost war.
> >> <
> >> In My case (My 66000) the biggest code density benefit was in creating
> >> ENTER and EXIT instructions, second best was giving every instruction
> >> access to any width immediate.
> >> <
> >> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
> >
> > The 16 bit market has the worst choices to pick from, the problem is that
> > these devices are cheap, so everyone ignores this market. Definition of
> > opportunity.
> >
> > In the 32 bit market you have to compete with RISC-V which is free. Sure it
> > is crap compared to the new architectures discussed here, but it’s free.
> >
> > Pick your poison. ;)
> >
> For 16-bit, one mostly wants "as cheap as possible".
>
> Though, at least for off-the-shelf microcontrollers, it is hard to
> really beat out something like "just use an MSP430 or similar".
>
> One can design a "better" 16-bit ISA, and then, say, run it on an ICE40
> or similar, but then unless one has a strong use-case to justify needing
> FPGA logic, the ICE40 costs more and is more complicated to use than an
> MSP430.
>
> Then there is a certain amount of "use a Cortex-M but treat it like it is a
> 16-bit ISA".
>
> Or, if one does custom silicon, how do they get enough "volume" and
> "momentum" to make it cost-effective vs existing options? ...
>
>
> Then again, I am doing my existing project more because I found it
> interesting than because it necessarily makes sense.
>
> For many things, a Cortex-M would be both faster and cheaper.
> Though, it isn't 1:1, because while a (higher end) Cortex-M dev-board
> can do a pretty decent job running Doom or similar, if trying to do
> something like an OpenGL style software rasterizer or similar on it, it
> falls on its face.
>
> Seemingly, the Thumb ISA does rather poorly on workloads that end up
> consisting almost entirely of memory loads and stores.
> >>>> The 8086 with hard coded registers was quite good for the era, but we can
> >>>> do better today, by micro coding much more complicated sequences.
> >> <
> >> At a different level, transcendental instructions in My 66000 are microcoded
> >> IN the FUNCTION UNIT (not in the Decoder). The Function unit goes "busy"
> >> until the result is ready to be delivered. SIN() COS() Ln() EXP() take 19 cycles
> >> about the latency of an FDIV.
> >>>>
> >>>> The first instruction I would add is a one instruction memcpy loop, which
> >>>> would use three hard coded registers to make the instruction short. The
> >>>> data register would not be visible so that I could use vector registers if
> >>>> I wanted. And there would be several variants for copy size and alignment.
> >>>> And a bit to decide if the count is part of the instruction.
> >>>>
> >>>> Another instruction I would add is add plus store, etc.
> >>>>
> >>>> The case for load plus add is harder, and might not make the cut due to
> >>>> transistor implementation cost outweighing memory transistor savings.
> >>>> Load compare and branch is in the same boat and has to be added to the
> >>>> total system cost of load compute.
> >>>>
> >>>> I seriously think a new 16 or 32 bit processor with micro coded
> >>>> instructions could win market share, by simple expedient of smaller total
> >>>> size with code included.
> >> <
> >> Not a single RISC architecture got a single design win due to code density !!
> >>>>
> >>>> My template is a pipelined 386 with 32 registers and far more complex and
> >>>> longer micro code instructions.
> >>>>
> >>>> There is a major company that went for an updated 386, but did not improve
> >>>> anything besides instruction encoding and minor fixes. Failed to go big and
> >>>> so failed, go big or go home.
> >>>>
> >>>>> and they somehow didn't notice that memory was
> >>>>> getting a lot cheaper, with a super-dense super-general instruction
> >>>>> set intended for assembler programmers, and tiny 512 byte pages that
> >>>>> even at the time were obviously too small.
> >>
> >
> >
> >


Re: Design a better 16 or 32 bit processor

<02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=20452&group=comp.arch#20452

X-Received: by 2002:ac8:58d0:: with SMTP id u16mr5932857qta.189.1631640516049;
Tue, 14 Sep 2021 10:28:36 -0700 (PDT)
X-Received: by 2002:a9d:eea:: with SMTP id 97mr15724507otj.331.1631640515791;
Tue, 14 Sep 2021 10:28:35 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 14 Sep 2021 10:28:35 -0700 (PDT)
In-Reply-To: <shpkkc$1mdo$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:19d7:d2c:4498:62d8;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:19d7:d2c:4498:62d8
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com> <shoqnc$ub2$1@dont-email.me>
<shpkkc$1mdo$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Subject: Re: Design a better 16 or 32 bit processor
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 14 Sep 2021 17:28:36 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 32
 by: MitchAlsup - Tue, 14 Sep 2021 17:28 UTC

On Tuesday, September 14, 2021 at 2:57:06 AM UTC-5, Terje Mathisen wrote:
> Brett wrote:
> > MitchAlsup <Mitch...@aol.com> wrote:
> >> In My case (My 66000) the biggest code density benefit was in creating
> >> ENTER and EXIT instructions, second best was giving every instruction
> >> access to any width immediate.
> >> <
> >> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
> >
> > The 16 bit market has the worst choices to pick from, the problem is that
> > these devices are cheap, so everyone ignores this market. Definition of
> > opportunity.
> How cheap must a 32-bit cpu be in order to make 16-bit irrelevant?
>
> I was blown away by a conference talk by "Bunnie" Huang several years
> ago, telling us that _every_ single USB memory stick contains a 32-bit
> cpu, this is what allows the manufacturers to sell all the flash memory
> chips they make, no matter how good or bad it turned out: The embedded
> cpu does all the testing/verification/remapping/bad block flagging etc
> and reports back "This is an 8GB stick", probably also with some
> speed/quality markers to allow it to be sold to different markets/price
> points.
>
> I can only assume that the cost of those 32-bit memory stick cpus has
> gone even further down by now.
<
Figure the 32-bit CPU gets 2%-5% of the cost of the USB-stick.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Design a better 16 or 32 bit processor

<shqocf$fn4$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20456&group=comp.arch#20456

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 13:05:48 -0500
Organization: A noiseless patient Spider
Lines: 356
Message-ID: <shqocf$fn4$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me> <shpieu$mo5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 14 Sep 2021 18:07:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a87b7b2e948ad6ef68b288d5fab3588d";
logging-data="16100"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+SC/YWnF7R9sDqjNOA7p9j"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:kzevWgw+omjTbegc3FYlbu8GwFA=
In-Reply-To: <shpieu$mo5$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 14 Sep 2021 18:05 UTC

On 9/14/2021 2:18 AM, BGB wrote:
> On 9/13/2021 7:34 PM, Brett wrote:
>> MitchAlsup <MitchAlsup@aol.com> wrote:
>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com
>>> wrote:
>>>> Brett <gg...@yahoo.com> wrote:
>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>> <
>>>>>>> Which died because its successor was too hard to pipeline.
>>>>>>
>>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>>> was starting to be affordable and microcode ROM was still a lot
>>>>>> faster
>>>>>> than core.
>>>>>>
>>>>>> The VAX was also a good design for the late 1960s but
>>>>>> unfortunately it
>>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>>> to pipeline.
>>>>>
>>>>>> The instruction set was full of overcomplicated
>>>>>> microcoded instructions that were often slower than a sequence of
>>>>>> simple instructions,
>>>>>
>>>>> A myth that was promoted by RISC proponents, which has been debunked.
>>> <
>>>> An alternative or addition is wide packed instructions with chaining.
>>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>>
>>>> You would go 256 bits wide and support a variable number of
>>>> instructions in
>>>> the packet and by supporting chaining you save 10 bits of register
>>>> specifiers for each chain segment, minus a bit to indicate the chain
>>>> link.
>>> <
>>> Why is there any romance associated with power of 2 widths in support of
>>> wide {Fetch-Decode-Execute} ?
>>> <
>>> 4 in 128-bits
>>> 6 in 192-bits
>>> 8 in 256-bits
>>> 10 in 320-bits
>>> ....
>>> <
>>> It seems to me that powers of 2 in byte count is an artificial
>>> boundary that
>>> is not good for the ISA in general.
>>> <
>>> Also this seriously hinders building little implementation and big
>>> implementations
>>> at the same time, being semi-optimal only for a few implementations
>>> in the middle.
>>
>> Agree.
>>

FWIW: Power-of-2 makes a lot more sense for fixed length instructions
than for variable-length. For variable-length, what matters more tends
to be the minimum alignment.

For BJX2, the minimum alignment has generally been 2, on account of
there being 16-bit encodings, whereas WEX bundles are typically limited
to 32-bit units.

At one point, I had 48-bit encodings, but the original encodings were
lost in a reorganization of the ISA's encoding space, and new 48 bit
encodings have not been introduced. They would offer little advantage
over 64-bit Jumbo encodings, and statistically these are rare enough
that the effect on code density would be small.

I had experimented with 24-bit encodings for a microcontroller profile,
but within the ISA these added a lot of hair, and their space-savings
were relatively modest at best. Better would be to have them in an ISA
designed specifically for byte-aligned instructions, eg:
16/24/32/48/64/96.

A few bits could select between 16/24/32. The 48/64/96 encodings could
be used via a 32-bit combiner prefix.

>>>> You would use heads and tails encoding and support jumps into and
>>>> out of
>>>> the packet. Chaining alone should give you the best instruction
>>>> density and
>>>> add some micro coded instructions and you should get dominating
>>>> instruction
>>>> density that makes manufacturers take a serious look at your offerings.
>>> <
>>> My 66000 ISA is getting near x86-64 instruction density without any
>>> of this.
>>
>> Um, x86-64 is not much better than RISC and sometimes worse depending on
>> how much floating point you do as that adds another byte to each
>> instruction.
>>
>> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>>
>
> IME, in terms of code density x86-64 seems to be pretty weak. Though, it
> varies some, for example, "gcc -Os" on Linux does somewhat better than
> MSVC.
>
> In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
> A64 seems to do a bit worse here than Thumb2.
>

Add: In my testing, code density for BJX2 tends to be in a similar area
to i386, but generally worse than Thumb2.

Eg, approximate ranking seems to be roughly:
Thumb2 (best)
i386
~ BJX2
SH4
ARM32 + Thumb1 ( ~= SH4 )
A32 / ARM32 (no Thumb)
x86-64 / X64 (GCC / Clang)
A64 / ARM64 / Aarch64
...
x86-64 / X64 (MSVC)

I don't have compilers set up for SPARC/MIPS/IA64/... but, from tables I
had looked at before, I expect them to be somewhat behind x86-64 here.

MSP430 seems to also do pretty well, but given its limited scope, direct
comparison with the others is difficult.

For the most part, a lot of these fall within a fairly small window of
each other. Most seem to fall pretty close to an average of ~ 2.6 to 3.4
bytes per instruction.

I suspect x86-64 takes a pretty big hit relative to i386 due to the REX
prefix (pushes the average closer to ~ 4.2 bytes / instruction or so).

Code density in BJX2 is a little bit of a wildcard, as sometimes
(usually with small examples) it does particularly badly. It does pay a
bit of overhead for small programs as the binaries tend to be statically
linked with their own copy of virtual memory subsystem and the FAT32
filesystem drivers, ... (these are not used if the program is launched
via the shell).

Code density also varies a bit depending on whether WEX is enabled, eg:
WEX = false: 16b = ~60% / 32b = ~40% (avg = ~ 2.8 bytes/op);
WEX = true : 16b = ~20% / 32b = ~80% (avg = ~ 3.6 bytes/op).

The latter could be improved, though doing so would mostly require
changing how the compiler deals with this.

Jumbo encodings are infrequent enough to mostly be ignored here.
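
As a quick check of the averages above, here is a small sketch. It assumes every op is either a 2-byte (16b) or a 4-byte (32b) encoding and ignores Jumbo forms; the function name is only for illustration.

```python
def avg_bytes_per_op(frac_16b, frac_32b):
    """Weighted average encoding size in bytes (Jumbo encodings ignored)."""
    return frac_16b * 2 + frac_32b * 4

# Approximate instruction mixes quoted in the post.
wex_off = avg_bytes_per_op(0.60, 0.40)   # WEX = false
wex_on = avg_bytes_per_op(0.20, 0.80)    # WEX = true

print(f"WEX off: {wex_off:.1f} bytes/op, WEX on: {wex_on:.1f} bytes/op")
```

The difference between the two mixes comes to roughly 0.8 bytes per op.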

The compiler also produces LZ4 compressed binaries, which needs to be
accounted for (directly comparing an LZ compressed binary with an
uncompressed binary is misleading).

>
>
> As a small example, I had recently hacked together a small voxel based
> 3D engine along vaguely similar lines to Minecraft Classic for the BJX2
> ISA (*1).
>
> A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
> build is 175K. Not strictly apples-to-apples, but still...
>

With MSVC, it is fairly common (for many other programs) to see binaries
in the size range of several MB.

In some tests, it seems to depend on the program whether /O2, /O1, or
/Os is "actually" fastest. Mostly I suspect this is because "/O2" tends
to prefer to balloon the binary and use lots of misguided attempts at
autovectorization.

>
> *1: Its renderer basically sweeps across the screen doing ray-casts,
> building up a list of any blocks hit by a ray-cast, and then drawing the
> list of blocks (via software-rasterized OpenGL). It has minimal overdraw
> (because a raycast will not pass through a wall), but only really works
> effectively at small draw distances.
>
> Its performance still manages to be somehow less awful than I originally
> imagined (framerates are a little better than Quake, albeit at a 24
> block draw-distance). Ended up using a color-fill sky, mostly as this
> improves framerate somewhat if compared with drawing a skybox.
>

Note, the SW GL rasterizer was static-linked in both of the above cases.

>
>
> This is still with me discovering and occasionally fixing "crazy bad"
> compiler bugs, eg:
> Turns out the compiler was very-frequently trying to cast-convert
> operands of binary operators to the destination type even when they were
> the same (resulting in a lot of extra register MOVs, spills, ...).
>
> Eg, if you did something like:
>   int a, b, c;
>   c=a+b;
> It was tending to often compile it like it were:
>   c=(int)a+(int)b;
>
> Which at the ASM level would, instead of, say:
>   ADDS.L  R8, R9, R14
> Result in something like:
>   MOV     R8, R25
>   MOV     R9, R28
>   ADDS.L  R25, R28, R14
> And would also result in higher register pressure and a larger number of
> spills.
>
> Fixing this bug gave a roughly 4% reduction in the size of binaries, and
> a roughly 20% increase in performance for Doom and similar. This also
> caused Dhrystone score to increase from ~51.3k to ~57.1k.
>
> It seemed this was also related to a lot of cases where, say:
>   c=a+imm;
> Was resulting in things like:
>   MOV    Imm, R6
>   MOV    R7, R9
>   ADDS.L R12, R9, R13
> Rather than, say:
>   ADDS.L R12, Imm, R13
>
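
The fix described above amounts to not emitting a conversion when an operand already has the destination type. A toy sketch of that check follows; the function, the temporary names T0/T1, and the CONV mnemonic are made up for illustration and are not taken from the actual compiler.

```python
def lower_binop(op, dst, dst_type, operands, skip_same_type_casts):
    """Emit toy three-address code for `dst = a op b`.

    operands: two (register, type) pairs. Temporaries T0, T1 and the
    CONV mnemonic are placeholders, not real BJX2 names.
    """
    code = []
    sources = []
    for index, (reg, typ) in enumerate(operands):
        if skip_same_type_casts and typ == dst_type:
            sources.append(reg)          # already the right type: use as-is
            continue
        tmp = f"T{index}"
        # A same-type "conversion" degenerates into a plain register copy.
        mnemonic = "MOV" if typ == dst_type else "CONV"
        code.append((mnemonic, reg, tmp))
        sources.append(tmp)
    code.append((op, sources[0], sources[1], dst))
    return code

ints = [("R8", "int"), ("R9", "int")]
buggy = lower_binop("ADDS.L", "R14", "int", ints, skip_same_type_casts=False)
fixed = lower_binop("ADDS.L", "R14", "int", ints, skip_same_type_casts=True)
print(len(buggy), "ops before the fix,", len(fixed), "after")
```

In this model the buggy path produces two copies plus the add, matching the MOV/MOV/ADDS.L pattern quoted above, while the fixed path emits only the add.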


Re: Design a better 16 or 32 bit processor

<shqqi6$vjt$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20458&group=comp.arch#20458

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 11:44:20 -0700
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <shqqi6$vjt$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me> <shpkkc$1mdo$1@gioia.aioe.org>
<02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 Sep 2021 18:44:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="728b35c2b6000ff47e06eadc4d9a326b";
logging-data="32381"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Xy2R02qVbe5VWgS7tNCmruqSrNDoyB3Q="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:d12xHVf8zQ+d0Dzi16VeFIGk3TM=
In-Reply-To: <02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 14 Sep 2021 18:44 UTC

On 9/14/2021 10:28 AM, MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 2:57:06 AM UTC-5, Terje Mathisen wrote:
>> Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> In My case (My 66000) the biggest code density benefit was in creating
>>>> ENTER and EXIT instructions, second best was giving every instruction
>>>> access to any width immediate.
>>>> <
>>>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>>>
>>> The 16 bit market has the worst choices to pick from, the problem is that
>>> these devices are cheap, so everyone ignores this market. Definition of
>>> opportunity.
>> How cheap must a 32-bit cpu be in order to make 16-bit irrelevant?
>>
>> I was blown away by a conference talk by "Bunny" Chang several years
>> ago, telling us that _every_ single USB memory stick contains a 32-bit
>> cpu, this is what allows the manufacturers to sell all the flash memory
>> chips they make, no matter how good or bad it turned out: The embedded
>> cpu does all the testing/verification/remapping/bad block flagging etc
>> and reports back "This is an 8GB stick", probably also with some
>> speed/quality markers to allow it to be sold to different markets/price
>> points.
>>
>> I can only assume that the cost of those 32-bit memory stick cpus have
>> gone even further down by now.
> <
> Figure the 32-bit CPU gets 2%-5% of the cost of the USB-stick.

Of course, the percentage depends upon the size of the memory in the
stick. As sticks get larger, the percentage goes down.
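
To put rough arithmetic on that: if the controller is treated as a fixed cost and the flash cost scales with capacity, the controller's share falls as capacity grows. The prices below are placeholders chosen only to show the trend, not real market figures, and the fixed-cost controller is itself an assumption.

```python
def controller_share(controller_cost, flash_cost_per_gb, capacity_gb):
    """Fraction of total stick cost taken by the controller CPU."""
    total = controller_cost + flash_cost_per_gb * capacity_gb
    return controller_cost / total

# Placeholder prices only, not real market data.
CONTROLLER_COST = 0.10
FLASH_COST_PER_GB = 0.25

shares = {gb: controller_share(CONTROLLER_COST, FLASH_COST_PER_GB, gb)
          for gb in (8, 32, 128)}
for gb, share in shares.items():
    print(f"{gb:4d} GB: controller is {share:.1%} of cost")
```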

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Design a better 16 or 32 bit processor

<shqrj0$7kn$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20459&group=comp.arch#20459

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 14:00:30 -0500
Organization: A noiseless patient Spider
Lines: 82
Message-ID: <shqrj0$7kn$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me> <shpieu$mo5$1@dont-email.me>
<3b32faed-f670-46a7-b706-2787a72f7e83n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 Sep 2021 19:01:52 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a87b7b2e948ad6ef68b288d5fab3588d";
logging-data="7831"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+29rK9BWzF3ZHmFNh+Jc2h"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:tbMCJSKFZ8B0bbGBq1HpmaWyJmU=
In-Reply-To: <3b32faed-f670-46a7-b706-2787a72f7e83n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 14 Sep 2021 19:00 UTC

On 9/14/2021 12:26 PM, MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 2:20:01 AM UTC-5, BGB wrote:
>> On 9/13/2021 7:34 PM, Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>> Brett <gg...@yahoo.com> wrote:
>>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>>> <

....

>>
>> ...
>>>>> A variable width instruction set can support chaining today by just adding
>>>>> the instructions, I am perplexed as to why no one has.
>>>> <
>>>> Why don't you give it a go and see what comes out ?
>>>
>>> Chaining opcodes is complex homework that every company should have taken a
>>> look at, including you, results should be somewhere on the internet.
>>>
>>> Getting the RISC guys to go variable width was worse than pulling teeth.
>>> Threats of firings and resignations were involved at ARM, and the MIPS
>>> founder did fire people for such suggestions, though most were weeded out
>>> at hiring interviews leading to brain dead group think that killed the
>>> company when the market changed.
>>>
>>> Adding a chaining register dependency is maybe 10 times worse in these
>>> people's minds.
>>>
>> It is balance, some.
>>
>> 16/32, by looking at a few bits, is OK.
>> Decoding a bundle based on also looking at a few bits and daisy-chaining
>> is also OK.
>>
>> Fully variable length encodings which depend on looking at lots of
>> different bits are less OK.
> <
> Which is why all of the "stuff" necessary to perform an instruction is found in the first
> word of the My 66000 Instruction, the variable remainder is for constants (immediates
> and displacements).
> <

Yeah.

In my case, the 16/32 case can be determined via the top 4 bits of the
first 16-bit word. Originally, it was "top 3 bits == 111", but this got
messed up by adding the XGPR encodings (7, 9, E, F). It could have been
cleaner, but I didn't want to break binary compatibility with existing
parts of the ISA.

WEX Bundles and Jumbo encodings require looking at a few more bits, but
not too much worse.
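
A minimal sketch of that length check, under the assumption that (7, 9, E, F) is the full set of top-nibble values marking a 32-bit encoding. The function names are invented for illustration, and WEX bundles and Jumbo encodings are left out.

```python
# Assumption: these top-nibble values mark a 32-bit encoding.
NIBBLES_32BIT = {0x7, 0x9, 0xE, 0xF}

def encoding_size_bits(first_word):
    """Return 16 or 32 given the first 16-bit word of an instruction."""
    top_nibble = (first_word >> 12) & 0xF
    return 32 if top_nibble in NIBBLES_32BIT else 16

def original_rule_size_bits(first_word):
    """The older rule mentioned above: top 3 bits == 111 means 32-bit."""
    return 32 if (first_word >> 13) & 0x7 == 0x7 else 16

for word in (0x1234, 0x7000, 0x9ABC, 0xE000, 0xF123):
    print(f"{word:04X}: {encoding_size_bits(word)} bits "
          f"(original rule: {original_rule_size_bits(word)})")
```

Under this reading, E and F are the values the original 3-bit test already covered, and 7 and 9 are the additions that force a 4-bit compare.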

Admittedly, the encoding of larger immediate values in BJX2 is a little
bit awful. Not too bad in terms of resource costs, but cases where one
has to deal with them as part of applying relocs are pretty ugly (both in
the compiler, and because some of the PE/COFF relocs need to deal with
instruction encodings to handle certain parts of the ABI).

As can be noted, it can encode a 64-bit constant load, or 33-bit
immediate value, within a single clock cycle.

Though, FWIW, I suspect RISC-V will have a similar issue with immediate
fields being basically "bit confetti". One could do like some
traditional RISC's and load larger immediate values from memory using
PC-rel ops, but this sucks, so alas...

Luckily at least, an FPGA doesn't really care that much if one's
immediate fields are bit confetti.

The ISA encoding does have the property that trying to start decoding in
the middle of an instruction will typically cause decoding to realign
within 1 or 2 instructions, and it is possible to look at a pair of
16-bit words and have a fairly high confidence which one represents the
start of a valid instruction.
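
The realignment property can be illustrated with a toy model that only knows two lengths (one or two 16-bit words) and assumes the top nibble values (7, 9, E, F) mark the 32-bit case. The word stream and function names are invented for the example; the real ISA also has WEX and Jumbo forms, which this ignores.

```python
NIBBLES_32BIT = {0x7, 0x9, 0xE, 0xF}  # assumed 32-bit markers

def boundaries(words, start):
    """Indices a decoder starting at `start` treats as instruction starts."""
    found = []
    i = start
    while i < len(words):
        found.append(i)
        i += 2 if (words[i] >> 12) in NIBBLES_32BIT else 1
    return found

def realign_index(words, bad_start):
    """First index where a decode from bad_start rejoins the decode from 0."""
    true_starts = set(boundaries(words, 0))
    for index in boundaries(words, bad_start):
        if index in true_starts:
            return index
    return None

# Made-up stream of 16-bit words: 32b, 16b, 16b, 32b, 16b.
stream = [0xE100, 0x7234, 0x1000, 0x2000, 0xF000, 0x0042, 0x3000]
print("true starts:      ", boundaries(stream, 0))
print("starting at word 1:", boundaries(stream, 1))
print("rejoins at word:   ", realign_index(stream, 1))
```

Here a decode that starts in the second half of the first instruction misreads one instruction and then lands on a true boundary.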

Re: Design a better 16 or 32 bit processor

<shqujh$t4u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20460&group=comp.arch#20460

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 14:52:00 -0500
Organization: A noiseless patient Spider
Lines: 79
Message-ID: <shqujh$t4u$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me> <shpkkc$1mdo$1@gioia.aioe.org>
<02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 Sep 2021 19:53:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a87b7b2e948ad6ef68b288d5fab3588d";
logging-data="29854"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ATN6LsZejgHV25f620js9"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:MkWuE4MLvNZdbM+rTzmUIcuWs18=
In-Reply-To: <02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 14 Sep 2021 19:52 UTC

On 9/14/2021 12:28 PM, MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 2:57:06 AM UTC-5, Terje Mathisen wrote:
>> Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> In My case (My 66000) the biggest code density benefit was in creating
>>>> ENTER and EXIT instructions, second best was giving every instruction
>>>> access to any width immediate.
>>>> <
>>>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>>>
>>> The 16 bit market has the worst choices to pick from, the problem is that
>>> these devices are cheap, so everyone ignores this market. Definition of
>>> opportunity.
>> How cheap must a 32-bit cpu be in order to make 16-bit irrelevant?
>>
>> I was blown away by a conference talk by "Bunny" Chang several years
>> ago, telling us that _every_ single USB memory stick contains a 32-bit
>> cpu, this is what allows the manufacturers to sell all the flash memory
>> chips they make, no matter how good or bad it turned out: The embedded
>> cpu does all the testing/verification/remapping/bad block flagging etc
>> and reports back "This is an 8GB stick", probably also with some
>> speed/quality markers to allow it to be sold to different markets/price
>> points.
>>
>> I can only assume that the cost of those 32-bit memory stick cpus have
>> gone even further down by now.
> <
> Figure the 32-bit CPU gets 2%-5% of the cost of the USB-stick.
> <

I suspect the cost difference between a small 16-bit and 32-bit ISA will
mostly disappear in the noise.

Even on FPGA's:
Most of the cheaper-end FPGAs one can buy, can handle a 32 or 64 bit CPU
core;
One might be more tempted to use a 16-bit core on an ICE40, since these
tend to be pretty small; however, boards with ICE40 FPGAs tend not
really to be any cheaper than ones with a Spartan or Artix, and the
Spartan and Artix can generally handle somewhat bigger designs.

I have been able to shove a BJX2 core onto an XC7S25, which is near the
cheaper end of readily available FPGAs. Granted, a 32-bit core would
make more sense here than a 64-bit core.

Though, did note recently that someone had managed to get a small
Minecraft clone running on a board with an ICE40 (ICE40UP5K), which was
partly what led to a revival of my "get voxel-based 3D engine working on
BJX2" effort (the XC7A100T is a significantly bigger FPGA than the
ICE40UP5K).

For expediency, mostly reused the TKRA-GL implementation which I had
written for my GLQuake and Quake 3 Arena efforts. Granted, maybe not the
fastest option possible for this. Had originally considered doing
something similar to ROTT, but this was more complicated, and it was
unclear how to draw the tops/bottoms of blocks without effectively
falling back to a full polygonal rasterizer (at which point one may as
well just do everything with a polygon rasterizer).

Though, I guess a question here could be if there were good ways to
further accelerate things like vertex projection and rasterization tasks.

They got better performance than my effort as well, but were also doing
a lot of the rendering tasks using dedicated FPGA logic, as opposed to
me basically using a software renderer (with raycasts mostly being used
for visibility determination).

Apparently they were using a custom 16-bit RISC style core with a lot of
the rest of the FPGA mostly dedicated to rendering tasks, apparently
using a hardware-based raytracer (though apparently limited to only
being able to deal with surfaces aligned on a cubic grid).

....

Re: Design a better 16 or 32 bit processor

<5e70J.28970$6U3.17891@fx43.iad>

https://www.novabbs.com/devel/article-flat.php?id=20461&group=comp.arch#20461

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
References: <sh3evb$n1v$1@newsreader4.netcologne.de> <3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com> <shguv8$2416$1@gal.iecc.com> <2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com> <shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me> <shocru$pl0$1@dont-email.me> <565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com> <shoqnc$ub2$1@dont-email.me> <shpkkc$1mdo$1@gioia.aioe.org> <02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
In-Reply-To: <02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 34
Message-ID: <5e70J.28970$6U3.17891@fx43.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 14 Sep 2021 20:03:13 UTC
Date: Tue, 14 Sep 2021 16:03:13 -0400
X-Received-Bytes: 2780
 by: EricP - Tue, 14 Sep 2021 20:03 UTC

MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 2:57:06 AM UTC-5, Terje Mathisen wrote:
>> Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> In My case (My 66000) the biggest code density benefit was in creating
>>>> ENTER and EXIT instructions, second best was giving every instruction
>>>> access to any width immediate.
>>>> <
>>>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>>> The 16 bit market has the worst choices to pick from, the problem is that
>>> these devices are cheap, so everyone ignores this market. Definition of
>>> opportunity.
>> How cheap must a 32-bit cpu be in order to make 16-bit irrelevant?
>>
>> I was blown away by a conference talk by "Bunny" Chang several years
>> ago, telling us that _every_ single USB memory stick contains a 32-bit
>> cpu, this is what allows the manufacturers to sell all the flash memory
>> chips they make, no matter how good or bad it turned out: The embedded
>> cpu does all the testing/verification/remapping/bad block flagging etc
>> and reports back "This is an 8GB stick", probably also with some
>> speed/quality markers to allow it to be sold to different markets/price
>> points.
>>
>> I can only assume that the cost of those 32-bit memory stick cpus have
>> gone even further down by now.
> <
> Figure the 32-bit CPU gets 2%-5% of the cost of the USB-stick.

I would guess there are some per-stick percents for
Microsoft's FAT file system format patents.

Also there is some fellow who seems to hold most of the
write wear leveling patents. A few percent to him too.

Re: Design a better 16 or 32 bit processor

<jwv35q63of7.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=20462&group=comp.arch#20462

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 16:16:21 -0400
Organization: A noiseless patient Spider
Lines: 9
Message-ID: <jwv35q63of7.fsf-monnier+comp.arch@gnu.org>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me> <shpkkc$1mdo$1@gioia.aioe.org>
<02e405ee-a60b-40c0-a05a-e7402d82acban@googlegroups.com>
<5e70J.28970$6U3.17891@fx43.iad>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="d4ebdc5c8139b93057c30b47d83b516f";
logging-data="3735"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX186zjmvrsT6XmCHgsA6iccj"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:zZOz8wzunv0MuiE/GqJb5ejNcSE=
sha1:3i0ptd0DKx4wL/R60f5G7UFsfso=
 by: Stefan Monnier - Tue, 14 Sep 2021 20:16 UTC

> I would guess there are some per-stick percents for
> Microsoft's FAT file system format patents.

Any patents on FAT would have expired years ago (maybe you meant
exFAT?). But in any case, USB flash sticks usually don't implement any
filesystem at all, they only expose a block-device.

Stefan

Re: Design a better 16 or 32 bit processor

<shr0c5$9f2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20463&group=comp.arch#20463

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 13:23:33 -0700
Organization: A noiseless patient Spider
Lines: 116
Message-ID: <shr0c5$9f2$1@dont-email.me>
References: <sh3evb$n1v$1@newsreader4.netcologne.de>
<3fea318c-62be-4d30-aafb-976eeee14908n@googlegroups.com>
<shguv8$2416$1@gal.iecc.com>
<2d262fce-9363-4360-bfd9-ba4263e8d703n@googlegroups.com>
<shh02c$267q$1@gal.iecc.com> <shhl98$3vo$1@dont-email.me>
<shocru$pl0$1@dont-email.me>
<565c4086-18fc-4c03-87c7-39ffa3943096n@googlegroups.com>
<shoqnc$ub2$1@dont-email.me>
<caf9d3df-9842-470b-930f-8e4b45402c6dn@googlegroups.com>
<shp20d$5ed$1@dont-email.me>
<d9e4b6b9-41de-4227-bb6e-f5924dc65c06n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 Sep 2021 20:23:33 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="714f6370f4a8acec5d56114887c2bbf8";
logging-data="9698"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Vl2H8tbWzIdLYJROEseNv"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:L0ZUrpLO8fckZo3U425Yf4gWcK4=
In-Reply-To: <d9e4b6b9-41de-4227-bb6e-f5924dc65c06n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Tue, 14 Sep 2021 20:23 UTC

On 9/14/2021 10:20 AM, MitchAlsup wrote:
> On Monday, September 13, 2021 at 9:39:12 PM UTC-5, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Monday, September 13, 2021 at 7:34:55 PM UTC-5, gg...@yahoo.com wrote:
>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>>> Brett <gg...@yahoo.com> wrote:
>>>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>>>> <
>>>>>>>>> Which died because its successor was too hard to pipeline.
>>>>>>>>
>>>>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>>>>> than core.
>>>>>>>>
>>>>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>>>>> to pipeline.
>>>>>>>
>>>>>>>> The instruction set was full of overcomplicated
>>>>>>>> microcoded instructions that were often slower than a sequence of
>>>>>>>> simple instructions,
>>>>>>>
>>>>>>> A myth that was promoted by RISC proponents, which has been debunked.
>>>>> <
>>>>>> An alternative or addition is wide packed instructions with chaining.
>>>>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>>>>
>>>>>> You would go 256 bits wide and support a variable number of instructions in
>>>>>> the packet and by supporting chaining you save 10 bits of register
>>>>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>>>>> <
>>>>> Why is there any romance associated with power of 2 widths in support of
>>>>> wide {Fetch-Decode-Execute} ?
>>>>> <
>>>>> 4 in 128-bits
>>>>> 6 in 192-bits
>>>>> 8 in 256-bits
>>>>> 10 in 320-bits
>>>>> ....
>>>>> <
>>>>> It seems to me that powers of 2 in byte count is an artificial boundary that
>>>>> is not good for the ISA in general.
>>>>> <
>>>>> Also this seriously hinders building little implementation and big implementations
>>>>> at the same time, being semi-optimal only for a few implementations in the middle.
>>>> Agree.
>>>>>> You would use heads and tails encoding and support jumps into and out of
>>>>>> the packet. Chaining alone should give you the best instruction density and
>>>>>> add some micro coded instructions and you should get dominating instruction
>>>>>> density that makes manufacturers take a serious look at your offerings.
>>>>> <
>>>>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>>>> Um, x86-64 is not much better than RISC and sometimes worse depending on
>>>> how much floating point you do as that adds another byte to each
>>>> instruction.
>>>>
>>>> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>>>>>> A variable width instruction set can support chaining today by just adding
>>>>>> the instructions, I am perplexed as to why no one has.
>>>>> <
>>>>> Why don't you give it a go and see what comes out ?
>>>> Chaining opcodes is complex homework that every company should have taken a
>>>> look at, including you, results should be somewhere on the internet.
>>>>
>>>> Getting the RISC guys to go variable width was worse than pulling teeth.
>>> <
>>> It should not have been.
>>> <
>>> I figured it out in the first several weeks of starting an x86-64 microarchitecture.
>>> Variable length instruction sets have a value that you cannot ignore, although
>>> I generally state this in the form that "No instructions should be used in pasting
>>> bits together" rather than VLI.
>>> <
>>>> Threats of firings and resignations were involved at ARM, and the MIPS
>>>> founder did fire people for such suggestions, though most were weeded out
>>>> at hiring interviews leading to brain dead group think that killed the
>>>> company when the market changed.
>>> <
>>> I was not hired by HP back in the Snake days because I was willing to consider
>>> a pipeline where LDs could "short circuit" some piece of the instruction pipeline.
>>> <
>>> I understand the group think going on.
>>>>
>>>> Adding a chaining register dependency is maybe 10 times worse in these
>>>> people's minds.
>>> <
>>> I had specific chaining in my GPU ISA. Overall, it was probably not necessary
>>> and might have been easier to use 5-bit comparators.
> <
>> Did you fix the compiler's annoying habit of interleaving loads and math?
>> It is hard to chain two adds when there is a needless load in between.
> <
> Reservation stations do this automagically. {And get rid of the need to explicitly
> chain one instruction to another.} And to a large extent, so do ScoreBoards.
> <
> It is only the statically scheduled architectures that have the problem you suggest.

Only if the architecture restricts chaining to the immediately following
instruction. If chaining has a range of reference then the compiler can
interleave up to the allowable range, at the cost of bigger instructions
to specify which value in the range was desired.

If you think of chaining as an implicit belt, then OP is complaining
about a belt of length one, which is too short for compiled code. At a
guess, useful chaining for a sequential encoding (i.e. not a wide
encoding with bundles) needs a chain range (equivalently belt length) of
four, and two-bit references, with typical compilers.

That's for a compiler that does ordinary scheduling before register and
reference allocation. A compiler that does temporal scheduling so as to
increase the number and length of chains is a non-trivial exercise, but
can double the chaining or alternatively reduce the required reference
range by half.
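
As a rough illustration of the belt analogy above, here is a minimal Python sketch, not anyone's actual ISA. The Belt class, the ref_bits helper, and the convention that distance 0 names the most recent result are all assumptions made for this example. It shows a second add picking up the first add's result across an interleaved load, which a chain range of one could not express.

```python
from collections import deque


class Belt:
    """Hypothetical result belt: keeps the last `length` results, newest first."""

    def __init__(self, length=4):
        self.length = length
        self.slots = deque(maxlen=length)

    def push(self, value):
        # Every instruction drops its result at the front; the oldest falls off.
        self.slots.appendleft(value)

    def ref(self, distance):
        """Return the result produced `distance` results ago (0 = most recent)."""
        if not 0 <= distance < self.length:
            raise IndexError("reference outside chain range")
        return self.slots[distance]


def ref_bits(chain_range):
    """Bits needed to name one of `chain_range` belt positions."""
    return max(chain_range - 1, 0).bit_length()


belt = Belt(length=4)
belt.push(10)                 # first add produces 10
belt.push(99)                 # interleaved load produces 99
belt.push(belt.ref(1) + 5)    # second add chains to the first add, past the load
print(belt.ref(0), ref_bits(4))
```

In this toy model a range of four costs two reference bits per chained operand, while a range of one costs none but forces the consumer to be the very next instruction.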

Re: Design a better 16 or 32 bit processor

<shr1a1$fht$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20464&group=comp.arch#20464

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 15:38:08 -0500
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <shr1a1$fht$1@dont-email.me>
In-Reply-To: <jwv35q63of7.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Tue, 14 Sep 2021 20:38 UTC

On 9/14/2021 3:16 PM, Stefan Monnier wrote:
>> I would guess there are some per-stick percents for
>> Microsoft's FAT file system format patents.
>
> Any patents on FAT would have expired years ago (maybe you meant
> exFAT?). But in any case, USB flash sticks usually don't implement any
> filesystem at all, they only expose a block-device.
>

True enough, but they do usually come already formatted as such, and it
is a question if someone has to pay royalties for the sake of doing so,
such that the "standards compliance" deities are satisfied.

This is unlike HDDs, which are generally sold in an unformatted state.

I suspect exFAT was mostly devised as a scheme for MS to try to continue
squeezing royalties out of this (after the LFN-related patents expired).

MS: FAT32 only works up to 32GB for... Reasons...

Then they (somehow) managed to convince the standards people of this, so
SD cards and USB sticks are generally required to come formatted as exFAT
(and then typically need to be manually reformatted as FAT32 or similar,
because Linux and friends still generally only understand FAT32, well,
and EXT2/3/4 and similar, ...).

Though, I suspect pretty much everyone realizes FAT32 works well past
this point, and the actual limit is closer to 2TB.

There is the 4GB file-size limit, but like, apart from things like
camcorders, this doesn't really matter.
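
For reference, the two limits mentioned above follow from field widths. A back-of-the-envelope sketch in Python, assuming the 32-bit sector-count and file-size fields of FAT32 and 512-byte sectors unless stated otherwise; the function names are made up for illustration:

```python
TIB = 2 ** 40
GIB = 2 ** 30


def fat32_max_volume_bytes(sector_bytes=512):
    """Upper bound on volume size: a 32-bit sector count times the sector size."""
    return (2 ** 32) * sector_bytes


def fat32_max_file_bytes():
    """Largest file: the directory entry stores the size in a 32-bit field."""
    return 2 ** 32 - 1


print(fat32_max_volume_bytes() / TIB)      # volume bound in TiB, 512-byte sectors
print(fat32_max_volume_bytes(4096) / TIB)  # same bound with 4 KiB sectors
print(fat32_max_file_bytes() / GIB)        # file bound in GiB, just under 4
```

Under these assumptions the volume bound comes out at 2 TiB, which is consistent with the 32GB figure not being a property of the on-disk format itself.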

And, exFAT still doesn't really offer much of anything else that is
particularly compelling for use-cases other than camcorders or similar.

>
> Stefan
>

Re: Design a better 16 or 32 bit processor

<shr1iu$hft$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20465&group=comp.arch#20465

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Tue, 14 Sep 2021 13:44:14 -0700
Organization: A noiseless patient Spider
Lines: 247
Message-ID: <shr1iu$hft$1@dont-email.me>
In-Reply-To: <3b32faed-f670-46a7-b706-2787a72f7e83n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Tue, 14 Sep 2021 20:44 UTC

On 9/14/2021 10:26 AM, MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 2:20:01 AM UTC-5, BGB wrote:
>> On 9/13/2021 7:34 PM, Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>> Brett <gg...@yahoo.com> wrote:
>>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>>> <
>>>>>>>> Which died because its successor was too hard to pipeline.
>>>>>>>
>>>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>>>> than core.
>>>>>>>
>>>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>>>> to pipeline.
>>>>>>
>>>>>>> The instruction set was full of overcomplicated
>>>>>>> microcoded instructions that were often slower than a sequence of
>>>>>>> simple instructions,
>>>>>>
>>>>>> A myth that was promoted by RISC proponents, which has been debunked.
>>>> <
>>>>> An alternative or addition is wide packed instructions with chaining.
>>>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>>>
>>>>> You would go 256 bits wide and support a variable number of instructions in
>>>>> the packet and by supporting chaining you save 10 bits of register
>>>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>>>> <
>>>> Why is there any romance associated with power of 2 widths in support of
>>>> wide {Fetch-Decode-Execute} ?
>>>> <
>>>> 4 in 128-bits
>>>> 6 in 192-bits
>>>> 8 in 256-bits
>>>> 10 in 320-bits
>>>> ....
>>>> <
>>>> It seems to me that powers of 2 in byte count is an artificial boundary that
>>>> is not good for the ISA in general.
>>>> <
>>>> Also this seriously hinders building little implementations and big implementations
>>>> at the same time, being semi-optimal only for a few implementations in the middle.
>>>
>>> Agree.
>>>
>>>>> You would use heads and tails encoding and support jumps into and out of
>>>>> the packet. Chaining alone should give you the best instruction density; add
>>>>> some microcoded instructions and you should get a dominating instruction
>>>>> density that makes manufacturers take a serious look at your offerings.
>>>> <
>>>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>>>
>>> Um, x86-64 is not much better than RISC and sometimes worse depending on
>>> how much floating point you do as that adds another byte to each
>>> instruction.
>>>
>>> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>>>
>> IME, in terms of code density x86-64 seems to be pretty weak. Though, it
>> varies some, for example, "gcc -Os" on Linux does somewhat better than MSVC.
>>
>> In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
>> A64 seems to do a bit worse here than Thumb2.
>>
>>
>>
>> As a small example, I had recently hacked together a small voxel based
>> 3D engine along vaguely similar lines to Minecraft Classic for the BJX2
>> ISA (*1).
>>
>> A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
>> build is 175K. Not strictly apples-to-apples, but still...
>>
>>
>> *1: Its renderer basically sweeps across the screen doing ray-casts,
>> building up a list of any blocks hit by a ray-cast, and then drawing the
>> list of blocks (via software-rasterized OpenGL). It has minimal overdraw
>> (because a raycast will not pass through a wall), but only really works
>> effectively at small draw distances.
>>
>> Its performance still manages to be somehow less awful than I originally
>> imagined (framerates are a little better than Quake, albeit at a 24
>> block draw-distance). Ended up using a color-fill sky, mostly as this
>> improves framerate somewhat if compared with drawing a skybox.
>>
>>
>>
>> This is still with me discovering and occasionally fixing "crazy bad"
>> compiler bugs, eg:
>> Turns out the compiler was very frequently trying to cast-convert
>> operands of binary operators to the destination type even when they were
>> the same (resulting in a lot of extra register MOVs, spills, ...).
>>
>> Eg, if you did something like:
>> int a, b, c;
>> c=a+b;
>> It was tending to often compile it like it were:
>> c=(int)a+(int)b;
>>
>> Which at the ASM level would, instead of, say:
>> ADDS.L R8, R9, R14
>> Result in something like:
>> MOV R8, R25
>> MOV R9, R28
>> ADDS.L R25, R28, R14
>> And would also result in higher register pressure and a larger number of
>> spills.
>>
>> Fixing this bug gave a roughly 4% reduction in the size of binaries, and
>> a roughly 20% increase in performance for Doom and similar. This also
>> caused the Dhrystone score to increase from ~51.3k to ~57.1k.
>>
>> It seemed this was also related to a lot of cases where, say:
>> c=a+imm;
>> Was resulting in things like:
>> MOV Imm, R6
>> MOV R7, R9
>> ADDS.L R12, R9, R13
>> Rather than, say:
>> ADDS.L R12, Imm, R13
>>
>> ...
>>
>>
>> Then, relatedly, noted that, eg:
>> y=x&255;
>> Was being compiled sorta like:
>> MOV R11, R7
>> EXTU.B R7, R7
>> MOV R7, R28
>> MOV R28, R14
>> Vs, say:
>> EXTU.B R11, R14
>>
>> Turns out this was stumbling on some logic for a "stale" code-path,
>> where early on, the compiler would handle operators more like:
>> Allocate scratch registers;
>> Load frame variables into scratch registers;
>> Apply operator to scratch registers;
>> Store result back to call frame;
>> Free said scratch registers.
>>
>> But, this was later replaced with:
>> Fetch the variables as registers;
>> Operate on these registers;
>> Release the registers.
>>
>> After I had switched over, trying to load/store a frame variable in this
>> way would typically result in register MOVs rather than an actual
>> memory load/store. These older paths have not been entirely eliminated
>> though.
>>
>> But, yeah, fixing these appears to have slightly reduced the level of
>> "general awfulness" in my C compiler output.
>>
>>
>>
>> Then added another slight compiler tweak which got it up to ~57.9k
>> (namely, caching and reusing struct-field loads in certain cases).
>>
>> Was able to push it up to ~59.0k by assuming less-conservative
>> semantics ("strict aliasing"), but I decided against enabling this by
>> default as it seems unsafe. This mostly affects the conditions under
>> which the cached struct field would be discarded.
>>
>> At present, it has certain restrictions:
>> * Does not cross a basic-block boundary;
>> * Discarded if either of the cached variables is modified;
>> * Discarded if any sort of explicit memory store happens;
>> * ...
>>
>> But, what it will do, is compile an expression like:
>> y=foo->x*foo->x;
>> As if it were:
>> t0=foo->x;
>> y=t0*t0;
>>
>> Though, this optimization does not appear to have any real effect on
>> Doom and similar.
>>
>>
>> It is possible a similar trick could be used for array loads or pointer
>> derefs.
>>
>> I also recently re-added the "FMOV.S" instruction as an optional feature
>> (Memory Load/Store combined with a Single<->Double conversion), since
>> this should be able to help some for code which works with
>> single-precision floating point values (avoids some common penalty cases).
>>
>> ...
>>>>> A variable width instruction set can support chaining today by just adding
>>>>> the instructions, I am perplexed as to why no one has.
>>>> <
>>>> Why don't you give it a go and see what comes out ?
>>>
>>> Chaining opcodes is complex homework that every company should have taken a
>>> look at, including you, results should be somewhere on the internet.
>>>
>>> Getting the RISC guys to go variable width was worse than pulling teeth.
>>> Threats of firings and resignations were involved at ARM, and the MIPS
>>> founder did fire people for such suggestions, though most were weeded out
>>> at hiring interviews leading to brain dead group think that killed the
>>> company when the market changed.
>>>
>>> Adding a chaining register dependency is maybe 10 times worse in these
>>> people's minds.
>>>
>> It is balance, some.
>>
>> 16/32, by looking at a few bits, is OK.
>> Decoding a bundle based on also looking at a few bits and daisy-chaining
>> is also OK.
>>
>> Fully variable length encodings which depend on looking at lots of
>> different bits are less OK.
> <
> Which is why all of the "stuff" necessary to perform an instruction is found in the first
> word of the My 66000 instruction; the variable remainder is for constants (immediates
> and displacements).


Re: Design a better 16 or 32 bit processor

<shrv11$bsv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20470&group=comp.arch#20470

From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Wed, 15 Sep 2021 05:06:41 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <shrv11$bsv$1@dont-email.me>
 by: Brett - Wed, 15 Sep 2021 05:06 UTC

Ivan Godard <ivan@millcomputing.com> wrote:
> On 9/14/2021 10:20 AM, MitchAlsup wrote:
>> On Monday, September 13, 2021 at 9:39:12 PM UTC-5, gg...@yahoo.com wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> On Monday, September 13, 2021 at 7:34:55 PM UTC-5, gg...@yahoo.com wrote:
>>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>>>> On Monday, September 13, 2021 at 3:38:25 PM UTC-5, gg...@yahoo.com wrote:
>>>>>>> Brett <gg...@yahoo.com> wrote:
>>>>>>>> John Levine <jo...@taugh.com> wrote:
>>>>>>>>> According to MitchAlsup <Mitch...@aol.com>:
>>>>>>>>>>> Sounds a lot like a PDP-11.
>>>>>>>>>> <
>>>>>>>>>> Which died because its successor was too hard to pipeline.
>>>>>>>>>
>>>>>>>>> The PDP-11 was a really good design for the late 1960s, when memory
>>>>>>>>> was starting to be affordable and microcode ROM was still a lot faster
>>>>>>>>> than core.
>>>>>>>>>
>>>>>>>>> The VAX was also a good design for the late 1960s but unfortunately it
>>>>>>>>> was introduced in the late 1970s. It wasn't just that it was too hard
>>>>>>>>> to pipeline.
>>>>>>>>
>>>>>>>>> The instruction set was full of overcomplicated
>>>>>>>>> microcoded instructions that were often slower than a sequence of
>>>>>>>>> simple instructions,
>>>>>>>>
>>>>>>>> A myth that was promoted by RISC proponents, which has been debunked.
>>>>>> <
>>>>>>> An alternative or addition is wide packed instructions with chaining.
>>>>>>> The example of wide packed is Itanic, but of course they did it wrong.
>>>>>>>
>>>>>>> You would go 256 bits wide and support a variable number of instructions in
>>>>>>> the packet and by supporting chaining you save 10 bits of register
>>>>>>> specifiers for each chain segment, minus a bit to indicate the chain link.
>>>>>> <
>>>>>> Why is there any romance associated with power of 2 widths in support of
>>>>>> wide {Fetch-Decode-Execute} ?
>>>>>> <
>>>>>> 4 in 128-bits
>>>>>> 6 in 192-bits
>>>>>> 8 in 256-bits
>>>>>> 10 in 320-bits
>>>>>> ....
>>>>>> <
>>>>>> It seems to me that powers of 2 in byte count is an artificial boundary that
>>>>>> is not good for the ISA in general.
>>>>>> <
>>>>>> Also this seriously hinders building little implementations and big implementations
>>>>>> at the same time, being semi-optimal only for a few implementations in the middle.
>>>>> Agree.
>>>>>>> You would use heads and tails encoding and support jumps into and out of
>>>>>>> the packet. Chaining alone should give you the best instruction density; add
>>>>>>> some microcoded instructions and you should get a dominating instruction
>>>>>>> density that makes manufacturers take a serious look at your offerings.
>>>>>> <
>>>>>> My 66000 ISA is getting near x86-64 instruction density without any of this.
>>>>> Um, x86-64 is not much better than RISC and sometimes worse depending on
>>>>> how much floating point you do as that adds another byte to each
>>>>> instruction.
>>>>>
>>>>> If ARM Cortex/Thumb2 crushes you on density then you are in trouble.
>>>>>>> A variable width instruction set can support chaining today by just adding
>>>>>>> the instructions, I am perplexed as to why no one has.
>>>>>> <
>>>>>> Why don't you give it a go and see what comes out ?
>>>>> Chaining opcodes is complex homework that every company should have taken a
>>>>> look at, including you, results should be somewhere on the internet.
>>>>>
>>>>> Getting the RISC guys to go variable width was worse than pulling teeth.
>>>> <
>>>> It should not have been.
>>>> <
>>>> I figured it out in the first several weeks of starting an x86-64 microarchitecture.
>>>> Variable length instruction sets have a value that you cannot ignore, although
>>>> I generally state this in the form that "No instructions should be used in pasting
>>>> bits together" rather than VLI.
>>>> <
>>>>> Threats of firings and resignations were involved at ARM, and the MIPS
>>>>> founder did fire people for such suggestions, though most were weeded out
>>>>> at hiring interviews leading to brain dead group think that killed the
>>>>> company when the market changed.
>>>> <
>>>> I was not hired by HP back in the Snake days because I was willing to consider
>>>> a pipeline where LDs could "short circuit" some piece of the instruction pipeline.
>>>> <
>>>> I understand the group think going on.
>>>>>
>>>>> Adding a chaining register dependency is maybe 10 times worse in these
>>>>> people's minds.
>>>> <
>>>> I had specific chaining in my GPU ISA. Overall, it was probably not necessary
>>>> and might have been easier to use 5-bit comparators.
>> <
>>> Did you fix the compiler's annoying habit of interleaving loads and math?
>>> It is hard to chain two adds when there is a needless load in between.
>> <
>> Reservation stations do this automagically. {And get rid of the need to explicitly
>> chain one instruction to another.} And to a large extent, so do ScoreBoards.
>> <
>> It is only the statically scheduled architectures that have the problem you suggest.
>
> Only if the architecture restricts chaining to the immediately following
> instruction. If chaining has a range of reference then the compiler can
> interleave up to the allowable range, at the cost of bigger instructions
> to specify which value in the range was desired.
>
> If you think of chaining as an implicit belt, then OP is complaining
> about a belt of length one, which is too short for compiled code. At a
> guess, useful chaining for a sequential encoding (i.e. not a wide
> encoding with bundles) needs a chain range (equivalently belt length) of
> four, and two-bit references, with typical compilers.
>
> That's for a compiler that does ordinary scheduling before register and
> reference allocation. A compiler that does temporal scheduling so as to
> increase the number and length of chains is a non-trivial exercise, but
> can double the chaining or alternatively reduce the required reference
> range by half.

I like this idea a LOT. This is so good that 8-bit encodings may be viable
again: 2 bits to select from ABCD and 2 bits of belt chaining from previous
instructions (which could store to any of the 32 registers), leaving 4 bits
for the opcode, which would generally indicate that more bytes follow.
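
To make the bit budget concrete, here is a small Python sketch of one possible packing of such a byte. The field order (opcode in the high nibble, then register select, then belt reference) and the encode/decode helper names are assumptions for illustration only; the post does not specify a layout.

```python
REG_NAMES = ("A", "B", "C", "D")  # the four directly addressable registers


def encode(opcode, reg, belt_ref):
    """Pack a 4-bit opcode, a 2-bit ABCD select and a 2-bit belt reference."""
    if not 0 <= opcode < 16:
        raise ValueError("opcode needs 4 bits")
    if reg not in REG_NAMES:
        raise ValueError("register must be one of A, B, C, D")
    if not 0 <= belt_ref < 4:
        raise ValueError("belt reference needs 2 bits")
    return (opcode << 4) | (REG_NAMES.index(reg) << 2) | belt_ref


def decode(byte):
    """Split an instruction byte back into (opcode, register, belt_ref)."""
    return (byte >> 4) & 0xF, REG_NAMES[(byte >> 2) & 0x3], byte & 0x3


insn = encode(0b1010, "C", 1)
print(hex(insn), decode(insn))
```

With only 16 opcode values in the first byte, most of them would have to act as escapes into longer forms, which matches the remark that the opcode field generally indicates more bytes.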

You could also have no result register specified for most non-addressing
instructions, as you plan to pick up the result in chaining, saving an
opcode byte. Generally only the load ops will actually specify which of the
32 registers to write, if they specify a result register at all. You need a
byte-code extension to specify one of the 28 not covered by ABCD.

To extend this to its logical conclusion, most addressing will be from the
four EFGH registers. Also, since adding a register specifier adds 8 bits, you
could have up to 256 registers; you will likely cap this at 64 and fault on
a larger number. Some customer will ask for more, and you will be able to
provide it for a fee. Supercomputers like large register files.

Most compressed encodings get to 35% compression; this could break the
impossible 50% wall. Almost as good as the Mill arch for opcode
compression.

Software engineers would flock to this architecture, AND the silicon/money
guys will as well for the die savings for code space.

Mission is a GO, have at it. Get the first mover advantage, grab the gold
ring.

Re: Design a better 16 or 32 bit processor

<2021Sep16.121948@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=20477&group=comp.arch#20477

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
Date: Thu, 16 Sep 2021 10:19:48 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 12
Message-ID: <2021Sep16.121948@mips.complang.tuwien.ac.at>
 by: Anton Ertl - Thu, 16 Sep 2021 10:19 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>I do wonder: I can see a market for really tiny 8bit CPUs, but how big
>is the market for 16bit CPUs (i.e. those that need more than what 8bit
>CPUs can offer but for which a tiny 32bit CPU is already too expensive)?

This market is vanishing. E.g., TI is not promoting the MSP430 to new
customers, but presents the ARM-based MSP432 as replacement.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Design a better 16 or 32 bit processor

<K_F0J.35318$ol1.21547@fx42.iad>

https://www.novabbs.com/devel/article-flat.php?id=20478&group=comp.arch#20478

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Design a better 16 or 32 bit processor
In-Reply-To: <2021Sep16.121948@mips.complang.tuwien.ac.at>
Message-ID: <K_F0J.35318$ol1.21547@fx42.iad>
Date: Thu, 16 Sep 2021 07:35:23 -0400
 by: EricP - Thu, 16 Sep 2021 11:35 UTC

Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> I do wonder: I can see a market for really tiny 8bit CPUs, but how big
>> is the market for 16bit CPUs (i.e. those that need more than what 8bit
>> CPUs can offer but for which a tiny 32bit CPU is already too expensive)?
>
> This market is vanishing. E.g., TI is not promoting the MSP430 to new
> customers, but presents the ARM-based MSP432 as replacement.
>
> - anton

Software development is so much more expensive now that it probably does
not make sense to pay someone to shoe-horn things into 16-bit segments,
with all the programming shenanigans needed to manage objects through
handles or to overlay program or data segments.

32 bits doesn't have to mean complicated.
Eliminate the MMU and have a single flat physical address space.
A tiny multi-threaded OS and some heap routines cost maybe a few kB.

And what is the extra hardware cost of 32 bits really?
A few extra register flip flops inside.
It doesn't even need to have a 32-bit data bus
(so you can use byte wide memory chips and save on memory)
and only as many address bus bits as you need.
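
A toy model of the last point, as a hedged Python sketch: a 32-bit register filled through a byte-wide data bus in four cycles, plus the number of address pins a given memory size needs. Little-endian byte order and both function names are assumptions made for the example, not something stated above.

```python
def load32_over_byte_bus(memory, address):
    """Assemble a 32-bit word from four consecutive 8-bit bus cycles."""
    word = 0
    for cycle in range(4):
        # One byte per bus cycle, lowest-addressed byte in the low bits.
        word |= memory[address + cycle] << (8 * cycle)
    return word


def address_pins(memory_bytes):
    """Address lines needed to reach every byte of a memory of this size."""
    return (memory_bytes - 1).bit_length()


ram = bytes([0x78, 0x56, 0x34, 0x12])
print(hex(load32_over_byte_bus(ram, 0)))
print(address_pins(64 * 1024))
```

The trade is four bus cycles per word instead of one, in exchange for narrower memory chips and fewer pins.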
