novaBBS - comp.arch - Re: Encoding 20 and 40 bit instructions in 128 bits

Re: Encoding 20 and 40 bit instructions in 128 bits

<aef1247a-1936-4c59-9af3-67e82056ca5dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23319&group=comp.arch#23319

Newsgroups: comp.arch
Date: Wed, 9 Feb 2022 07:14:14 -0800 (PST)
Message-ID: <aef1247a-1936-4c59-9af3-67e82056ca5dn@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)

by: MitchAlsup - Wed, 9 Feb 2022 15:14 UTC

On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
> Thomas Koenig <tko...@netcologne.de> wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
> >> I am probably missing something, but using your 40% figure
> >>
> >> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
> >
> > Density is the number of instructions per unit, you have to
> > add the inverse of the number of bits.
> >
> > 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
> >
> > (Same as when calculating an average density of a mixture).
> Ah, never mind. You are right, 40% is not quite enough.
> Have to stick in some compare immediate and branch instructions
> as well.
<
What happens if the 20-bit instructions contain the destructive register
model:
<
OP Rd,Rs
instead of the non-destructive model
OP Rd,Rs1,Rs2
<
The destructive model handles "bookkeeping" code well (loop,
index, follow {p=*(p+offset)}).
<
This gives you a ton of instructions in this category and should
improve code density.
<
OP Rd,Immed10
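
For reference, the density arithmetic quoted at the top of this post can be checked with a short Python sketch. It assumes that the 40% figure is a fraction of the instruction count (which is how Stephen's calculation reads it) and that the comparison baseline is the fixed 32-bit POWER instruction. The function names are made up for illustration and do not come from the thread.

```python
def average_bits(frac_short, short_bits=20, long_bits=40):
    """Mean instruction length in bits, where frac_short is the fraction
    of the instruction COUNT that fits the short format."""
    return frac_short * short_bits + (1.0 - frac_short) * long_bits


def bits_if_fraction_is_by_size(frac_short, short_bits=20, long_bits=40):
    """Mean length if frac_short were instead the fraction of code BITS
    occupied by short instructions (the 'density of a mixture' formula)."""
    return 1.0 / (frac_short / short_bits + (1.0 - frac_short) / long_bits)


def break_even_fraction(baseline_bits=32, short_bits=20, long_bits=40):
    """Fraction of short instructions (by count) at which the mix is exactly
    as dense as a fixed-length baseline of baseline_bits."""
    return (long_bits - baseline_bits) / (long_bits - short_bits)


if __name__ == "__main__":
    print(average_bits(0.4))                 # weighted mean, by count
    print(bits_if_fraction_is_by_size(0.4))  # harmonic form, by size
    print(break_even_fraction())             # share needed to match 32 bits
```

Under these assumptions, 40% short instructions by count only breaks even with 32-bit encodings, which is why more instruction classes have to fit the 20-bit format before there is a gain.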
<

Re: Encoding 20 and 40 bit instructions in 128 bits

<su0p0o$65c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23320&group=comp.arch#23320

From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 9 Feb 2022 08:10:00 -0800
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <su0p0o$65c$1@dont-email.me>

by: Stephen Fuld - Wed, 9 Feb 2022 16:10 UTC

On 2/9/2022 7:14 AM, MitchAlsup wrote:
> On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
>> Thomas Koenig <tko...@netcologne.de> wrote:
>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
>>>> I am probably missing something, but using your 40% figure
>>>>
>>>> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
>>>
>>> Density is the number of instructions per unit, you have to
>>> add the inverse of the number of bits.
>>>
>>> 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
>>>
>>> (Same as when calculating an average density of a mixture).
>> Ah, never mind. You are right, 40% is not quite enough.
>> Have to stick in some compare immediate and branch instructions
>> as well.
> <
> What happens if the 20-bit instructions contain the destructive register
> model:
> <
> OP Rd,Rs
> instead of the non-destructive model
> OP Rd,Rs1,Rs2
> <
> The destructive model handles "bookkeeping" code well (loop,
> index, follow {p=*(p+offset)}).
> <
> This gives you a ton of instructions in this category and should
> improve code density.
> <
> OP Rd,Immed10
> <

Yes, but wouldn't you then need more MR instructions to prevent
overwriting the value when it will be needed again? Might be worth it,
but ISTM that is a negative factor.
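
To make the trade-off concrete, here is a small Python sketch that rewrites three-operand operations into destructive two-operand form and counts the extra MR copies this needs. It is only an illustration: the tuple format, the `tmp` scratch register, and the set of commutative mnemonics are assumptions, and a real compiler would choose registers so that many of these copies never appear.

```python
COMMUTATIVE = {"add", "and", "or", "xor"}  # illustrative subset only


def lower_to_two_operand(ops, scratch="tmp"):
    """Rewrite (op, rd, rs1, rs2) tuples, meaning rd = rs1 op rs2, into
    destructive (op, rd, rs) tuples, meaning rd = rd op rs.
    'mr' copies are inserted where rd is not already the first source."""
    out = []
    for op, rd, rs1, rs2 in ops:
        if rd == rs1:
            # Already destructive: the old value of rd is not needed.
            out.append((op, rd, rs2))
        elif rd == rs2 and op in COMMUTATIVE:
            # Swap the sources; still no copy needed.
            out.append((op, rd, rs1))
        elif rd == rs2:
            # Non-commutative and rd aliases the second source:
            # save it in a (hypothetical) scratch register first.
            out.append(("mr", scratch, rs2))
            out.append(("mr", rd, rs1))
            out.append((op, rd, scratch))
        else:
            # General case: one extra move to preserve rs1.
            out.append(("mr", rd, rs1))
            out.append((op, rd, rs2))
    return out


def extra_moves(ops):
    """Number of instructions added by the lowering."""
    return len(lower_to_two_operand(ops)) - len(ops)


if __name__ == "__main__":
    program = [
        ("add", "r3", "r3", "r4"),  # destructive already
        ("add", "r5", "r3", "r4"),  # needs one mr
        ("sub", "r6", "r7", "r6"),  # needs two mr via scratch
    ]
    for insn in lower_to_two_operand(program):
        print(insn)
    print("extra moves:", extra_moves(program))
```

Whether the shorter encodings pay for these copies depends on how often the destination already matches a source in real code, which is what the statistics in this thread try to measure.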

BTW, I was surprised that in Thomas' original data that MR was so high
on the list (Not doubting the correctness of the data). Why are so many
MRs necessary?

If there really is a need for the same value in two registers, how about
adding a "Load to two registers" instruction? It would be just like a
load, but sacrifices some displacement bits to encode a second
destination register. Or, you could use a variant of the LM instruction
that loads two registers like LM, but doesn't increment the storage
address. This would cause the same value to be written to both
registers, hopefully saving a future MR instruction.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
> On 2/9/2022 7:14 AM, MitchAlsup wrote:
>> On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
>>> Thomas Koenig <tko...@netcologne.de> wrote:
>>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
>>>>> I am probably missing something, but using your 40% figure
>>>>>
>>>>> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
>>>>
>>>> Density is the number of instructions per unit, you have to
>>>> add the inverse of the number of bits.
>>>>
>>>> 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
>>>>
>>>> (Same as when calculating an average density of a mixture).
>>> Ah, never mind. You are right, 40% is not quite enough.
>>> Have to stick in some compare immediate and branch instructions
>>> as well.
>> <
>> What happens if the 20-bit instructions contain the destructive register
>> model:
>> <
>> OP Rd,Rs
>> instead of the non-destructive model
>> OP Rd,Rs1,Rs2
>> <
>> The destructive model handles "bookkeeping" code well (loop,
>> index, follow {p=*(p+offset)}).
>> <
>> This gives you a ton of instructions in this category and should
>> improve code density.
>> <
>> OP Rd,Immed10
>> <
>
> Yes, but wouldn't you then need more MR instructions to prevent
> overwriting the value when it will be needed again? Might be worth it,
> but ISTM that is a negative factor.

There are quite a few instructions of the form addi ra, ra, 1234
or add ra,ra,rb, when the compiler determines that the old value
is no longer required.

I have treated these separately in the statistics below, where I have
also treated loads and stores relative to the stack pointer separately:

instruction       count      %  cum. %
mr              5590458  11.06   11.06
bl              3390274   6.71   17.77
ld              3044220   6.02   23.79
ld (stack)      2792897   5.52   29.31
addi (1-reg)    2517847   4.98   34.29
li              2486296   4.92   39.21
std (stack)     2342059   4.63   43.84
addi            2232815   4.42   48.26
addis           1668924   3.30   51.56
b               1534024   3.03   54.60
beq             1507796   2.98   57.58
std             1397405   2.76   60.34
cmpdi           1009923   2.00   62.34
add (2-reg)      962506   1.90   64.25
ori (1-reg)      813062   1.61   65.85
bne              799486   1.58   67.44
stdu (stack)     645407   1.28   68.71
cmpwi            642714   1.27   69.98
lwz              637949   1.26   71.25
mflr             587134   1.16   72.41
stw              562149   1.11   73.52
extsw            465193   0.92   74.44
lbz              453174   0.90   75.34
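
The split into "(1-reg)", "(2-reg)" and "(stack)" categories could be derived with a classifier along the following lines. This is a hedged Python sketch, not the script actually used for the table: it assumes disassembly with bare numeric register operands (for example `addi 9,9,1234` or `ld 9,16(1)`), it assumes r1 is the stack pointer as in the 64-bit POWER ELF ABI, and the list of immediate-form mnemonics is an illustrative subset.

```python
import re

STACK_POINTER = "1"  # assumption: r1 is the stack pointer (POWER ELF ABI)
IMMEDIATE_FORMS = {"addi", "addis", "ori", "oris", "xori", "xoris"}  # subset
MEM_OPERAND = re.compile(r"^(-?\d+)\((\w+)\)$")  # e.g. "16(1)" or "-32(r1)"


def _reg(name):
    """Normalise 'r9' and '9' to the same key."""
    return name.strip().lower().lstrip("r")


def classify(mnemonic, operands):
    """Return the statistics bucket for one instruction.
    operands is the comma-separated operand string from the disassembly."""
    ops = [o.strip() for o in operands.split(",")]
    # Loads and stores whose base register is the stack pointer.
    if len(ops) == 2:
        m = MEM_OPERAND.match(ops[1])
        if m and _reg(m.group(2)) == STACK_POINTER:
            return f"{mnemonic} (stack)"
    # Three-operand forms whose destination equals the first source.
    if len(ops) == 3 and _reg(ops[0]) == _reg(ops[1]):
        if mnemonic in IMMEDIATE_FORMS:
            return f"{mnemonic} (1-reg)"   # like addi ra,ra,1234
        return f"{mnemonic} (2-reg)"       # like add ra,ra,rb
    return mnemonic


if __name__ == "__main__":
    for mnem, opnds in [("addi", "9,9,1234"), ("add", "9,9,10"),
                        ("ld", "9,16(1)"), ("ld", "9,16(3)"), ("mr", "9,3")]:
        print(mnem, opnds, "->", classify(mnem, opnds))
```

The sketch only treats the destination matching the first source as destructive; a commutative `add ra,rb,ra` would need one more check to be counted the same way.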

I could also post the whole analysis including bit counts, but that
would be a bit too long for a Usenet post - 1389 lines.

> BTW, I was surprised that in Thomas' original data that MR was so high
> on the list (Not doubting the correctness of the data). Why are so many
> MRs necessary?

I certainly did not look at 55 million instructions :-) but gave
it a cursory glance. Many of them occurred when calling functions
(chromium, being C++, has a really large number of function calls,
as evidenced by the bl instructions), and keeping a value that is
also passed as a function call is one valid reason for an mr.

Fortran usually uses a different calling convention, and the
stats look different. Here's an overview of the Polyhedron
benchmark, with mr at around 3% instead of 11% for Chromium:

instruction     count      %  cum. %
li              28397   7.30    7.30
addi            27797   7.15   14.45
addis           25875   6.65   21.10
std (stack)     20560   5.29   26.38
addi (1-reg)    18997   4.88   31.27
bl              15428   3.97   35.23
ld (stack)      12714   3.27   38.50
mr              11542   2.97   41.47
add (2-reg)     10986   2.82   44.29
ld              10783   2.77   47.07
sldi            10405   2.67   49.74
lfd              8564   2.20   51.94
ori (1-reg)      7475   1.92   53.87
lfs              7222   1.86   55.72
cmpwi            7195   1.85   57.57
beq              7021   1.80   59.38
std              6702   1.72   61.10
stfd             6663   1.71   62.81
b                6225   1.60   64.41
lwz              6016   1.55   65.96
stw              5372   1.38   67.34
stxvd2x          4301   1.11   68.45
stw (stack)      4157   1.07   69.51
stfs             4038   1.04   70.55
bne              3792   0.97   71.53
lxvd2x           3788   0.97   72.50
extsw            3507   0.90   73.40
add              3295   0.85   74.25
ble              3294   0.85   75.10
lis              3289   0.85   75.94
fmul             3148   0.81   76.75
Re: Encoding 20 and 40 bit instructions in 128 bits

<su10no$usc$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23323&group=comp.arch#23323

From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Wed, 9 Feb 2022 10:21:43 -0800
Organization: A noiseless patient Spider
Lines: 129
Message-ID: <su10no$usc$1@dont-email.me>

by: Stephen Fuld - Wed, 9 Feb 2022 18:21 UTC

On 2/9/2022 10:03 AM, Thomas Koenig wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
>> On 2/9/2022 7:14 AM, MitchAlsup wrote:
>>> On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
>>>> Thomas Koenig <tko...@netcologne.de> wrote:
>>>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
>>>>>> I am probably missing something, but using your 40% figure
>>>>>>
>>>>>> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
>>>>>
>>>>> Density is the number of instructions per unit, you have to
>>>>> add the inverse of the number of bits.
>>>>>
>>>>> 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
>>>>>
>>>>> (Same as when calculating an average density of a mixture).
>>>> Ah, never mind. You are right, 40% is not quite enough.
>>>> Have to stick in some compare immediate and branch instructions
>>>> as well.
>>> <
>>> What happens if the 20-bit instructions contain the destructive register
>>> model:
>>> <
>>> OP Rd,Rs
>>> instead of the non-destructive model
>>> OP Rd,Rs1,Rs2
>>> <
>>> The destructive model handles "bookkeeping" code well (loop,
>>> index, follow {p=*(p+offset)}).
>>> <
>>> This gives you a ton of instructions in this category and should
>>> improve code density.
>>> <
>>> OP Rd,Immed10
>>> <
>>
>> Yes, but wouldn't you then need more MR instructions to prevent
>> overwriting the value when it will be needed again? Might be worth it,
>> but ISTM that is a negative factor.
>
> There are quite a few instructions of the form addi ra, ra, 1234
> or add ra,ra,rb, when the compiler determines that the old value
> is no longer required.
>
> I have treated these separately in the statistics below, where I have
> also treated loads and stores relative to the stack pointer separately:
>
> mr 5590458 11.06 11.06
> bl 3390274 6.71 17.77
> ld 3044220 6.02 23.79
> ld (stack) 2792897 5.52 29.31
> addi (1-reg) 2517847 4.98 34.29
> li 2486296 4.92 39.21
> std (stack) 2342059 4.63 43.84
> addi 2232815 4.42 48.26
> addis 1668924 3.30 51.56
> b 1534024 3.03 54.60
> beq 1507796 2.98 57.58
> std 1397405 2.76 60.34
> cmpdi 1009923 2.00 62.34
> add (2-reg) 962506 1.90 64.25
> ori (1-reg) 813062 1.61 65.85
> bne 799486 1.58 67.44
> stdu (stack) 645407 1.28 68.71
> cmpwi 642714 1.27 69.98
> lwz 637949 1.26 71.25
> mflr 587134 1.16 72.41
> stw 562149 1.11 73.52
> extsw 465193 0.92 74.44
> lbz 453174 0.90 75.34
>
> I could also post the whole analysis including bit counts, but that
> would be a bit too long for a Usenet post - 1389 lines.
>
>> BTW, I was surprised that in Thomas' original data that MR was so high
>> on the list (Not doubting the correctness of the data). Why are so many
>> MRs necessary?
>
> I certainly did not look at 55 million instructions :-) but gave
> it a cursory glance. Many of them occurred when calling functions
> (chromium, being C++, has a really large number of function calls,
> as evidenced by the bl instructions), and keeping a value that is
> also passed as a function call is one valid reason for an mr.
>
> Fortran usually uses a different calling convention, and the
> stats look different. Here's an overview of the Polyhedron
> benchmark, with mr at around 3% instead of 11% for Chromium:
>
> li 28397 7.30 7.30
> addi 27797 7.15 14.45
> addis 25875 6.65 21.10
> std (stack) 20560 5.29 26.38
> addi (1-reg) 18997 4.88 31.27
> bl 15428 3.97 35.23
> ld (stack) 12714 3.27 38.50
> mr 11542 2.97 41.47
> add (2-reg) 10986 2.82 44.29
> ld 10783 2.77 47.07
> sldi 10405 2.67 49.74
> lfd 8564 2.20 51.94
> ori (1-reg) 7475 1.92 53.87
> lfs 7222 1.86 55.72
> cmpwi 7195 1.85 57.57
> beq 7021 1.80 59.38
> std 6702 1.72 61.10
> stfd 6663 1.71 62.81
> b 6225 1.60 64.41
> lwz 6016 1.55 65.96
> stw 5372 1.38 67.34
> stxvd2x 4301 1.11 68.45
> stw (stack) 4157 1.07 69.51
> stfs 4038 1.04 70.55
> bne 3792 0.97 71.53
> lxvd2x 3788 0.97 72.50
> extsw 3507 0.90 73.40
> add 3295 0.85 74.25
> ble 3294 0.85 75.10
> lis 3289 0.85 75.94
> fmul 3148 0.81 76.75

Thank you. Your explanation certainly seems reasonable.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

On 2/8/2022 10:12 PM, Brett wrote:
> BGB <cr88192@gmail.com> wrote:
>> On 2/8/2022 3:54 PM, Thomas Koenig wrote:
>>> I looked a bit at what to include in the 20-bit instruction subset.
>>>
>>> Looking at the biggest piece of bloat^H^H^H^H^Hsoftware I can find
>>> on POWER, chromium plus supporting shared libraries, a bit more
>>> than 58 million instructions, I found the following instruction
>>> frequency. The first column is the instruction, the second the
>>> number of instructions, the third one percentage, and the fourth
>>> one cumulative percentage. The frequency will probably surprise
>>> no one here:
>>>
>>> ld      5837117  11.55  11.55  (load 64-bit with offset)
>>> mr      5590458  11.06  22.61  (move register)
>>> addi    4750662   9.40  32.00  (add with 16-bit constant)
>>> std     3739464   7.40  39.40  (store 64-bit with offset)
>>> bl      3390274   6.71  46.11  (branch and link)
>>> li      2486296   4.92  51.03  (load immediate)
>>> addis   1668924   3.30  54.33  (add immediate shifted)
>>> b       1534024   3.03  57.36  (branch)
>>> beq     1507796   2.98  60.34  (branch if equal)
>>> add     1266393   2.51  62.85  (add)
>>> cmpdi   1009923   2.00  64.85  (compare immediate)
>>> ori      822801   1.63  66.48  (or immediate)
>>> lwz      810375   1.60  68.08  (load word and zero)
>>> bne      799486   1.58  69.66  (branch if not equal)
>>> stdu     691161   1.37  71.03  (store double with update)
>>> stw      690701   1.37  72.39  (store word)
>>> cmpwi    642714   1.27  73.66  (compare word immediate)
>>> mflr     587134   1.16  74.83  (move from link register)
>>> lbz      472965   0.94  75.76  (load byte zero)
>>> extsw    465193   0.92  76.68  (extend sign)
>>> subf     435575   0.86  77.54  (subtract)
>>> sldi     370387   0.73  78.28  (shift left)
>>> blr      358973   0.71  78.99  (branch to link register)
>>> mtctr    350664   0.69  79.68  (move to counter register)
>>> rlwinm   327689   0.65  80.33  (word shifting)
>>> stb      326003   0.64  80.97  (store byte with offset)
>>>
>>> Not all offsets would fit into a 20-bit container, of course.
>>> Looking at an instruction format consisting of
>>>
>>> - one four-bit opcode
>>> - two five-bit registers
>>> - one six-bit constant
>>>
>>> it would be possible to fit (original data too long to post
>>> here)
>>>
>>> ld 9.72
>>> addi 2.26
>>> std 6.50
>>> li 3.87
>>> cmpdi 1.30
>>> lwz 1.02
>>> sldi 0.62
>>>
>>> into that format, close to 25.3% of instructions.
>>>
>>> For a two or three-register format,
>>>
>>> mr 11.06
>>> add 2.51
>>> extsw 0.92
>>> subf 0.86
>>>
>>> would give 15.3% on top.
>>>
>>> For branches, I would say that 5 bit of POWER offset would correspond
>>> to 6 bits of the combined address, which would give a percent or two.
>>> So 40% of half-length instructions sounds reasonable.
>>>
>>> In other words: Recoding POWER into 20 and 40 bit chunks without
>>> using any of the additional freedom gained by 40-bit instructions
>>> would actually be a gain in code density, without any restrictions
>>> in what registers to choose (such as having special instructions
>>> for the stack pointer).
>>>
>>> Trying to compress this into 16 bits would be much more difficult.
>>
>> A decent chunk of the common instructions can be crammed into 16-bit
>> encodings on BJX2, sorta...
>>
>> But, yeah, the overall rankings in my case seem to be vaguely similar.
>>
>>
>> Top ranking instructions from Doom (aggregated by mnemonic):
>> MOV.Q (Load/Store QWord)
>> MOV.L (Load/Store DWord)
>> BF (Branch if False)
>> MOVU.L (Load, Unsigned DWord)
>> BT (Branch if True)
>> MOVU.B (Load, Unsigned Byte)
>> MOV (Move Reg, Reg)
>> MOVU.W (Load, Unsigned Word)
>> MOV.W (Load/Store, Word)
>> ADD (Add, 64-bit)
>> TST (Bit Test, ((A&B)==0))
>> BRA (Unconditional Branch)
>> ADDS.L (Add, 32-bit, Sign-Extending)
>> MOV.X (Load/Store, 128-bit)
>> SHLD (Logical Shift, 32-bit)
>> OR (Bitwise OR)
>> CMPQGT (Compare Greater, 64-bit)
>> LDIZ (Load Immediate, Zero-Extended)
>> ...
>>
>> Or, aggregating by category:
>> Load/Store ops
>> Branch Ops
>> Common ALU ops
>>
>>
>> Within 16-bit encodings, the bulk of the most commonly encoded:
>> MOV (Reg, Reg)
>> Branch ops (Disp8)
>> ADD (Imm8, Reg)
>> LDI (Imm12, R0)
>> LDI (Imm8, Reg)
>> Load/Store with (SP, Disp4), Various
>> CMPEQ (Imm4, Rn)
>>
>> Within 32-bit encodings:
>> Branch Ops (Disp20)
>> Load/Store Ops (Disp9)
>> ALU Ops (Rm, Imm9, Rn)
>> LDI/ADD (Imm16, Rn)
>> ...
>>
>>
>> There is much less of a showing for Load/Store with a non-SP base
>> register in 16-bit land, but the likely reason here is there is a very
>> limited selection of displacement encodings (these encodings are very
>> common in terms of 32-bit encodings).
>>
>> Say, for example, one wants:
>> 4b Base Register
>> 4b Dest Register
>> 3b Disp
>> 3b Format
>>
>> Then, one is already looking at 14 bits of encoding space.
>>
>> Could in theory cram it down to 12 bits of encoding space by using 3-bit
>> register fields, but this would be somewhat limiting.
>
>
> If you do preferred split non-overlapping address-data registers like short
> form 8086 then 16 bits is plenty. Only long form needs access to all
> registers.
>
> This is before you add a short belt, which cuts instruction sizes more.
>

Possible, I was assuming traditional register space more like that in
RISC style ISAs (fairly homogeneous, with roles mostly assigned by the ABI).

Thumb used a lot of 3-bit register fields, with 32-bit ARM (and Thumb2)
using 4-bit registers. However, the relative loss was smaller as a
number of the high-end registers were cut off for other uses anyways
(PC, SP, LR).

Though, I guess one could split up the space several ways:
Data vs Address
Scratch vs Preserved

Then, R0..R7:
R0..R1: Scratch, Data
R2..R3: Scratch, Addr
R4..R5: Preserved, Data
R6..R7: Preserved, Addr

One could have a 2b register field that encodes R0,R1,R4,R5 or
R2,R3,R6,R7; or a 3b "generic" field.
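
As a sketch of how such a 2-bit field could map onto R0..R7 under the layout above, the following Python snippet decodes a field plus a data/address selector into a register number and back. The bit assignment is one possible choice consistent with the table, not something fixed by the post, and the function names are hypothetical.

```python
def decode_reg2(field, is_addr):
    """Map a 2-bit register field to R0..R7.
    Data fields select R0, R1, R4, R5; address fields select R2, R3, R6, R7.
    Bit 1 of the field picks scratch (0) or preserved (1), bit 0 picks
    the register within the pair."""
    if not 0 <= field <= 3:
        raise ValueError("field must fit in 2 bits")
    preserved = (field & 2) << 1       # contributes 0 or 4
    addr = 2 if is_addr else 0         # contributes 0 or 2
    return preserved | addr | (field & 1)


def encode_reg2(reg):
    """Inverse mapping: register number 0..7 -> (field, is_addr)."""
    if not 0 <= reg <= 7:
        raise ValueError("register must be R0..R7")
    field = ((reg & 4) >> 1) | (reg & 1)
    return field, bool(reg & 2)


if __name__ == "__main__":
    print([decode_reg2(f, False) for f in range(4)])  # data registers
    print([decode_reg2(f, True) for f in range(4)])   # address registers
```

With this assignment the data/address class is implied by the operand position in the instruction, so it costs no encoding bits; the 3-bit "generic" field is just the register number itself.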

One major downside of a small register space though is that it can
significantly increase the number of load/store ops, as it is necessary
to frequently evict and reload stack variables.

It sorta worked OK with x86 mostly because it could use memory as an
operand, but with Thumb-1 the situation is a little less ideal.

The situation is worse with a simple in-order machine, since loads may
trigger an interlock penalty if one tries to use the result quickly, and
there are insufficient registers to re-order things such to avoid the
interlocks.

While a belt could likely help with code density, it would likely do
little to help with spill rate due to register pressure (but, it could
potentially be counter-productive if instructions need to be able to
encode either a GPR or a belt position; or, alternately, if one needs to
use instructions to move values between GPRs and a belt).

The need for GPRs is likely unavoidable though if one still needs
somewhere to put their local variables, and (IME) local variables tend
to dominate over intermediate working values in any case (one can
usually put these into scratch registers, and then shuffle the results
back into callee-preserved GPRs or similar).

With 16 registers, it is at least a little better (spill rate can be
somewhat reduced).

With 32 registers, most code fits fairly comfortably... Except when
"nearly everything" takes up 2 registers (as in my experimental 128-bit
ABI), which effectively halves the number of usable registers.

With 64 registers, it is mostly overkill for normal code (won't see much
of any real advantage over 32). However, it can pull ahead for code
which has an unreasonably large amount of inner-loop state (such as one
might deal with in an OpenGL style software-rasterizer).

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
>> Thomas Koenig <tko...@netcologne.de> wrote:
>> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
>> >> I am probably missing something, but using your 40% figure
>> >>
>> >> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
>> >
>> > Density is the number of instructions per unit, you have to
>> > add the inverse of the number of bits.
>> >
>> > 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
>> >
>> > (Same as when calculating an average density of a mixture).
>> Ah, never mind. You are right, 40% is not quite enough.
>> Have to stick in some compare immediate and branch instructions
>> as well.
><
> What happens if the 20-bit instructions contain the destructive register
> model:
><
> OP Rd,Rs
> instead of the non-destructive model
> OP Rd,Rs1,Rs2

Just posted some data to a reply to your post.

> The destructive model handles "bookkeeping" code well (loop,
> index, follow {p=*(p+offset)}).
><
> This gives you a ton of instructions in this category and should
> improve code density.

This certainly warrants a bit more study :-)

BGB <cr88192@gmail.com> wrote:
> On 2/8/2022 10:12 PM, Brett wrote:
>> BGB <cr88192@gmail.com> wrote:
>>> On 2/8/2022 3:54 PM, Thomas Koenig wrote:
>>>> I looked a bit at what to include in the 20-bit instruction subset.
>>>>
>>>> Looking at the biggest piece of bloat^H^H^H^H^Hsoftware I can find
>>>> on POWER, chromium plus supporting shared libraries, a bit more
>>>> than 58 million instructions, I found the following instruction
>>>> frequency. The first column is the instruction, the second the
>>>> number of instructions, the third one percentage, and the fourth
>>>> one cumulative percentage. The frequency will probably surprise
>>>> no one here:
>>>>
>>>> ld      5837117  11.55  11.55  (load 64-bit with offset)
>>>> mr      5590458  11.06  22.61  (move register)
>>>> addi    4750662   9.40  32.00  (add with 16-bit constant)
>>>> std     3739464   7.40  39.40  (store 64-bit with offset)
>>>> bl      3390274   6.71  46.11  (branch and link)
>>>> li      2486296   4.92  51.03  (load immediate)
>>>> addis   1668924   3.30  54.33  (add immediate shifted)
>>>> b       1534024   3.03  57.36  (branch)
>>>> beq     1507796   2.98  60.34  (branch if equal)
>>>> add     1266393   2.51  62.85  (add)
>>>> cmpdi   1009923   2.00  64.85  (compare immediate)
>>>> ori      822801   1.63  66.48  (or immediate)
>>>> lwz      810375   1.60  68.08  (load word and zero)
>>>> bne      799486   1.58  69.66  (branch if not equal)
>>>> stdu     691161   1.37  71.03  (store double with update)
>>>> stw      690701   1.37  72.39  (store word)
>>>> cmpwi    642714   1.27  73.66  (compare word immediate)
>>>> mflr     587134   1.16  74.83  (move from link register)
>>>> lbz      472965   0.94  75.76  (load byte zero)
>>>> extsw    465193   0.92  76.68  (extend sign)
>>>> subf     435575   0.86  77.54  (subtract)
>>>> sldi     370387   0.73  78.28  (shift left)
>>>> blr      358973   0.71  78.99  (branch to link register)
>>>> mtctr    350664   0.69  79.68  (move to counter register)
>>>> rlwinm   327689   0.65  80.33  (word shifting)
>>>> stb      326003   0.64  80.97  (store byte with offset)
>>>>
>>>> Not all offsets would fit into a 20-bit container, of course.
>>>> Looking at an instruction format consisting of
>>>>
>>>> - one four-bit opcode
>>>> - two five-bit registers
>>>> - one six-bit constant
>>>>
>>>> it would be possible to fit (original data too long to post
>>>> here)
>>>>
>>>> ld 9.72
>>>> addi 2.26
>>>> std 6.50
>>>> li 3.87
>>>> cmpdi 1.30
>>>> lwz 1.02
>>>> sldi 0.62
>>>>
>>>> into that format, close to 25.3% of instructions.
>>>>
>>>> For a two or three-register format,
>>>>
>>>> mr 11.06
>>>> add 2.51
>>>> extsw 0.92
>>>> subf 0.86
>>>>
>>>> would give 15.3% on top.
>>>>
>>>> For branches, I would say that 5 bit of POWER offset would correspond
>>>> to 6 bits of the combined address, which would give a percent or two.
>>>> So 40% of half-length instructions sounds reasonable.
>>>>
>>>> In other words: Recoding POWER into 20 and 40 bit chunks without
>>>> using any of the additional freedom gained by 40-bit instructions
>>>> would actually be a gain in code density, without any restrictions
>>>> in what registers to choose (such as having special instructions
>>>> for the stack pointer).
>>>>
>>>> Trying to compress this into 16 bits would be much more difficult.
>>>
>>> A decent chunk of the common instructions can be crammed into 16-bit
>>> encodings on BJX2, sorta...
>>>
>>> But, yeah, the overall rankings in my case seem to be vaguely similar.
>>>
>>>
>>> Top ranking instructions from Doom (aggregated by mnemonic):
>>> MOV.Q (Load/Store QWord)
>>> MOV.L (Load/Store DWord)
>>> BF (Branch if False)
>>> MOVU.L (Load, Unsigned DWord)
>>> BT (Branch if True)
>>> MOVU.B (Load, Unsigned Byte)
>>> MOV (Move Reg, Reg)
>>> MOVU.W (Load, Unsigned Word)
>>> MOV.W (Load/Store, Word)
>>> ADD (Add, 64-bit)
>>> TST (Bit Test, ((A&B)==0))
>>> BRA (Unconditional Branch)
>>> ADDS.L (Add, 32-bit, Sign-Extending)
>>> MOV.X (Load/Store, 128-bit)
>>> SHLD (Logical Shift, 32-bit)
>>> OR (Bitwise OR)
>>> CMPQGT (Compare Greater, 64-bit)
>>> LDIZ (Load Immediate, Zero-Extended)
>>> ...
>>>
>>> Or, aggregating by category:
>>> Load/Store ops
>>> Branch Ops
>>> Common ALU ops
>>>
>>>
>>> Within 16-bit encodings, the bulk of the most commonly encoded:
>>> MOV (Reg, Reg)
>>> Branch ops (Disp8)
>>> ADD (Imm8, Reg)
>>> LDI (Imm12, R0)
>>> LDI (Imm8, Reg)
>>> Load/Store with (SP, Disp4), Various
>>> CMPEQ (Imm4, Rn)
>>>
>>> Within 32-bit encodings:
>>> Branch Ops (Disp20)
>>> Load/Store Ops (Disp9)
>>> ALU Ops (Rm, Imm9, Rn)
>>> LDI/ADD (Imm16, Rn)
>>> ...
>>>
>>>
>>> There is much less of a showing for Load/Store with a non-SP base
>>> register in 16-bit land, but the likely reason here is there is a very
>>> limited selection of displacement encodings (these encodings are very
>>> common in terms of 32-bit encodings).
>>>
>>> Say, for example, one wants:
>>> 4b Base Register
>>> 4b Dest Register
>>> 3b Disp
>>> 3b Format
>>>
>>> Then, one is already looking at 14 bits of encoding space.
>>>
>>> Could in theory cram it down to 12 bits of encoding space by using 3-bit
>>> register fields, but this would be somewhat limiting.
>>
>>
>> If you do preferred split non-overlapping address-data registers like short
>> form 8086 then 16 bits is plenty. Only long form needs access to all
>> registers.
>>
>> This is before you add a short belt, which cuts instruction sizes more.
>>
>
> Possible, I was assuming traditional register space more like that in
> RISC style ISAs (fairly homogeneous, with roles mostly assigned by the ABI).
>
>
> Thumb used a lot of 3-bit register fields, with 32-bit ARM (and Thumb2)
> using 4-bit registers. However, the relative loss was smaller as a
> number of the high-end registers were cut off for other uses anyways
> (PC, SP, LR).
>
>
> Though, I guess one could split up the space several ways:
> Data vs Address
> Scratch vs Preserved
>
> Then, R0..R7:
> R0..R1: Scratch, Data
> R2..R3: Scratch, Addr
> R4..R5: Preserved, Data
> R6..R7: Preserved, Addr
>
> One could have a 2b register field that encodes R0,R1,R4,R5 or
> R2,R3,R6,R7; or a 3b "generic" field.
>
>
> One major downside of a small register space though is that it can
> significantly increase the number of load/store ops, as it is necessary
> to frequently evict and reload stack variables.

You add an extension byte or short when you need more register bits, like
x86.
So you don’t get the downside of evict and reload.

No compromises: have your cake and eat it too.

You can have two extensions if you want to go tiny with the instructions;
2b, 4b, 6b which is 4, 16, and 64 registers. Split 4 is actually 8
registers, address and data, the 16 can be half overlap for 24 registers
and sufficient flexibility for 95% of code. And with 64 you can crush the
competition on FPU code which eats registers for breakfast. As a former
game console programmer I can tell you 32 is not enough.
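
To illustrate the split register fields discussed above, here is a minimal Python sketch of how a 2-bit field plus an implied data/address class could select among R0..R7, using the R0..R7 layout from the quoted text. The function name and the meaning of each bit inside the field are assumptions made for this example, not part of any posted encoding.

```python
def decode_split_reg(field2: int, is_addr: bool) -> int:
    """Map a 2-bit register field to a register number in R0..R7.

    Assumed layout (taken from the quoted R0..R7 table):
      data:    R0, R1, R4, R5
      address: R2, R3, R6, R7
    Bit 0 of the field picks the register within a pair and bit 1 picks
    scratch vs. preserved. The data/address class is implied by the
    operand's role in the instruction, so it costs no encoding bits.
    """
    if not 0 <= field2 <= 3:
        raise ValueError("field must fit in 2 bits")
    low = field2 & 1          # which register of the pair
    preserved = field2 >> 1   # 0 = scratch, 1 = preserved
    return (preserved << 2) | (int(is_addr) << 1) | low


data_regs = [decode_split_reg(f, False) for f in range(4)]
addr_regs = [decode_split_reg(f, True) for f in range(4)]
print("data:", data_regs, "addr:", addr_regs)
```

Under these assumptions a load with one address operand and one data operand spends 4 bits on registers instead of the 6 bits a pair of generic 3-bit fields would need.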

> It sorta worked OK with x86 mostly because it could use memory as an
> operand, but with Thumb-1 the situation is a little less ideal.
>
> The situation is worse with a simple in-order machine, since loads may
> trigger an interlock penalty if one tries to use the result quickly, and
> there are insufficient registers to re-order things such to avoid the
> interlocks.
>
> While a belt could likely help with code density, it would likely do
> little to help with spill rate due to register pressure (but, it could
> potentially be counter-productive if instructions need to be able to
> encode either a GPR or a belt position; or, alternately, if one needs to
> use instructions to move values between GPRs and a belt).
>
> The need for GPRs is likely unavoidable though if one still needs
> somewhere to put their local variables, and (IME) local variables tend
> to dominate over intermediate working values in any case (one can
> usually put these into scratch registers, and then shuffle the results
> back into callee preserve GPRs or similar).
>
>
>
> With 16 registers, it is at least a little better (spill rate can be
> somewhat reduced).
>
> With 32 registers, most code fits fairly comfortably... Except when
> "nearly everything" takes up 2 registers (as in my experimental 128-bit
> ABI), which effectively halves the number of usable registers.
>
>
> With 64 registers, it is mostly overkill for normal code (won't see much
> of any real advantage over 32). However, it can pull ahead for code
> which has an unreasonably large amount of inner-loop state (such as one
> might deal with in an OpenGL style software-rasterizer).
>
> Well, at least, if one does like me and sticks all of these 128-bit
> vectors into GPR pairs.
>
> If one did like x86+SSE, with two separate register spaces (16x64b GPR,
> 16x128b SIMD), then the register situation would be less awful (would
> accomplish similar effect to having 64 GPRs).
>
>
> Relative to clock speed, it seems to do pretty well at these sorts of
> tasks (seems to fare better than either x86-64 or ARM). If, albeit, it
> is worse at most other metrics (does GL OK, but sucks at running Doom
> and similar; whereas the Ryzen is by far the winner if ability to run
> Doom and similar is the primary metric).
>
>
> However, putting SIMD in the GPRs is a little more versatile.
>
> And, for example, in my case I could choose to randomly start
> experimenting with 128-bit pointers with fairly minimal impact on the
> rest of the ISA (and most of what changes I did add were mostly to try
> to deal with also adding dynamically-adjusted bounds-checks, but
> otherwise there is no real "selling point" at present to bother with the
> larger pointers).
>
> This would not likely be quite so true with x64+SSE (try to stick
> pointers into XMM registers and one would have an awful mess).
>
> Granted, it seems like my C compiler codebase and runtime library don't
> entirely agree with this assessment.
>
> I don't yet know the performance impact, this first requires getting
> everything "mostly working".
>
>>> This can fit a little better into a 24-bit encoding, but in a past
>>> experiment, the savings from 24-bit Load/Store encodings and similar
>>> were overall fairly modest vs plain 16/32.
>>>
>>> In the current form of the ISA, the encoding space that was originally
>>> assigned for 24-bit encodings was reused for the XGPR encodings (Namely,
>>> 32-bit instruction encodings with 6-bit register fields, covering a
>>> "common" ISA subset; with cases which don't fit into the 32-bit encoding
>>> falling back to a 64-bit instruction format, which while not necessarily
>>> the most efficient possibility, this is rare enough that it doesn't
>>> really matter).
>>>
>>> But, in general, a 20 or 24 bit encoding could potentially be able to
>>> fit a better range of Load/Store encodings, which could potentially be
>>> worthwhile (though, in this case, I would probably go for a byte-aligned
>>> 16/24/32 encoding, rather than a bundle-based 20/40 encoding).

Re: Encoding 20 and 40 bit instructions in 128 bits

<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23328&group=comp.arch#23328

X-Received: by 2002:a05:622a:514:: with SMTP id l20mr2932352qtx.187.1644443597122;
Wed, 09 Feb 2022 13:53:17 -0800 (PST)
X-Received: by 2002:a05:6808:151e:: with SMTP id u30mr2366684oiw.64.1644443596869;
Wed, 09 Feb 2022 13:53:16 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 9 Feb 2022 13:53:16 -0800 (PST)
In-Reply-To: <su19ub$9hr$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:75c0:2e7a:b394:a627;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:75c0:2e7a:b394:a627
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me> <su19ub$9hr$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Wed, 09 Feb 2022 21:53:17 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 12

by: Quadibloc - Wed, 9 Feb 2022 21:53 UTC

On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
> And with 64 you can crush the
> competition on FPU code which eats registers for breakfast. As a former
> game console programmer I can tell you 32 is not enough.

This is useful to know. Despite Mitch noting that decoding more than 32 registers
is a problem, then, I guess I'll be putting the 128-register bank feature back into
my attempts at a high-performance design.

But extension bytes are not something I will tolerate, as they would complicate
instruction decoding.

John Savard

Re: Encoding 20 and 40 bit instructions in 128 bits

<b02e11f0-2672-4c24-9460-b12518904cedn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23329&group=comp.arch#23329

X-Received: by 2002:a05:6214:dad:: with SMTP id h13mr3043100qvh.7.1644444368693;
Wed, 09 Feb 2022 14:06:08 -0800 (PST)
X-Received: by 2002:a9d:7745:: with SMTP id t5mr1892790otl.254.1644444368416;
Wed, 09 Feb 2022 14:06:08 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 9 Feb 2022 14:06:08 -0800 (PST)
In-Reply-To: <su0p0o$65c$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:9d7b:3ea3:9f29:930d;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:9d7b:3ea3:9f29:930d
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stvi9q$d03$1@dont-email.me>
<stvmm6$s6c$1@newsreader4.netcologne.de> <stvvqd$12h$1@newsreader4.netcologne.de>
<aef1247a-1936-4c59-9af3-67e82056ca5dn@googlegroups.com> <su0p0o$65c$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b02e11f0-2672-4c24-9460-b12518904cedn@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 09 Feb 2022 22:06:08 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 55

by: MitchAlsup - Wed, 9 Feb 2022 22:06 UTC

On Wednesday, February 9, 2022 at 10:10:04 AM UTC-6, Stephen Fuld wrote:
> On 2/9/2022 7:14 AM, MitchAlsup wrote:
> > On Wednesday, February 9, 2022 at 3:00:01 AM UTC-6, Thomas Koenig wrote:
> >> Thomas Koenig <tko...@netcologne.de> wrote:
> >>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:
> >>>> I am probably missing something, but using your 40% figure
> >>>>
> >>>> (.4 * 20) + (.6 * 40) = 8 + 24 = 32
> >>>
> >>> Density is the number of instructions per unit, you have to
> >>> add the inverse of the number of bits.
> >>>
> >>> 1/(0.4 / 20 + 0.6 / 40) = 1/ (0.02 + 0.015) = 28.57...
> >>>
> >>> (Same as when calculating an average density of a mixture).
> >> Ah, never mind. You are right, 40% is not quite enough.
> >> Have to stick in some compare immediate and branch instructions
> >> as well.
> > <
> > What happens if the 20-bit instructions contains the destructive register
> > model:
> > <
> > OP Rd,Rs
> > instead of the non-destructive model
> > OP Rd,Rs1,Rs2
> > <
> > The destructive model handles "bookkeeping" code well (loop,
> > index, follow {p=*(p+offset)}).
> > <
> > This gives you a ton of instructions in this category and should
> > improve code density.
> > <
> > OP Rd,Immed10
> > <
> Yes, but wouldn't you then need more MR instructions to prevent
> overwriting the value when it will be needed again? Might be worth it,
> but ISTM that is a negative factor.
<
Almost all bookkeeping code can use the destructive register model.
>
> BTW, I was surprised that in Thomas' original data that MR was so high
> on the list (Not doubting the correctness of the data). Why are so many
> MRs necessary?
<
Bad compiler? Unlikely, but this is the usual scapegoat.
>
> If there really is a need for the same value in two registers, how about
> adding a "Load to two registers" instruction? It would be just like a
> load, but sacrifices some displacement bits to encode a second
> destination register. Or, you could use a variant of the LM instruction
> that loads two registers like LM, but doesn't increment the storage
> address. This would cause the same value to be written to both
> registers, hopefully saving a future MR instruction.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
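
The two averages quoted above can be checked with a minimal Python sketch. It assumes the 40% figure is a fraction of the instruction count, in which case the plain weighted mean applies; the harmonic form would only apply if the 40% were a fraction of the code bits. The function names are made up for illustration.

```python
def mean_bits(short_fraction, short_bits=20, long_bits=40):
    """Average bits per instruction when `short_fraction` of the
    *instructions* use the short encoding."""
    return short_fraction * short_bits + (1 - short_fraction) * long_bits


def harmonic_bits(short_fraction, short_bits=20, long_bits=40):
    """Average bits per instruction when `short_fraction` of the
    *code bits* are occupied by short instructions."""
    return 1.0 / (short_fraction / short_bits
                  + (1 - short_fraction) / long_bits)


for f in (0.3, 0.4, 0.5):
    print(f, mean_bits(f), round(harmonic_bits(f), 2))
```

With fractions counted per instruction, 40% short instructions only breaks even against a fixed 32-bit encoding; any gain in density needs the short share to exceed 40%, which matches the "not quite enough" conclusion in the quoted exchange.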

Re: Encoding 20 and 40 bit instructions in 128 bits

<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23330&group=comp.arch#23330

X-Received: by 2002:a05:6214:29c8:: with SMTP id gh8mr3130032qvb.126.1644444502861;
Wed, 09 Feb 2022 14:08:22 -0800 (PST)
X-Received: by 2002:a05:6808:13cb:: with SMTP id d11mr2276313oiw.325.1644444502582;
Wed, 09 Feb 2022 14:08:22 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 9 Feb 2022 14:08:22 -0800 (PST)
In-Reply-To: <731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:9d7b:3ea3:9f29:930d;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:9d7b:3ea3:9f29:930d
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me> <su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 09 Feb 2022 22:08:22 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 17

by: MitchAlsup - Wed, 9 Feb 2022 22:08 UTC

On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
> > And with 64 you can crush the
> > competition on FPU code which eats registers for breakfast. As a former
> > game console programmer I can tell you 32 is not enough.
<
> This is useful to know. Despite Mitch noting that decoding more than 32 registers
> is a problem, then, I guess I'll be putting the 128-register bank feature back into
> my attempts at a high-performance design.
<
It is NOT the decoding of registers that is the problem.
It is that the register specifiers eat too many bits in the instruction.
<
>
> But extension bytes are not something I will tolerate, as they would complicate
> instruction decoding.
>
> John Savard
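
To put rough numbers on the point about register specifiers, here is a small Python sketch. It assumes a three-operand instruction and a flat register file, and tries the 32-, 36- and 40-bit instruction widths that come up in this thread; the helper name is hypothetical.

```python
def specifier_bits(num_regs, operands=3):
    """Bits spent on register specifiers for one instruction,
    assuming each operand names any of `num_regs` registers."""
    bits_per_reg = (num_regs - 1).bit_length()
    return operands * bits_per_reg


for width in (32, 36, 40):
    for regs in (32, 64, 128):
        used = specifier_bits(regs)
        print(f"{width}-bit word, {regs:3d} regs: "
              f"{used} specifier bits, {width - used} left over")
```

Under these assumptions, going from 32 to 128 registers in a 32-bit word shrinks the space left for opcode and immediates from 17 bits to 11.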

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Wednesday, February 9, 2022 at 10:10:04 AM UTC-6, Stephen Fuld wrote:

>> BTW, I was surprised that in Thomas' original data that MR was so high
>> on the list (Not doubting the correctness of the data). Why are so many
>> MRs necessary?
><
> Bad compiler? Unlikely, but this is the usual scapegoat.

OK, some code.

Right at the startup of Chromium, code looks like this:

ee63c0: 47 09 4c 3c addis r2,r12,2375
ee63c4: 40 1c 42 38 addi r2,r2,7232
ee63c8: a6 02 08 7c mflr r0
ee63cc: c1 20 c2 49 bl 2b0848c <ChromeMain@@Base+0x1bf4ad0>
ee63d0: 11 fe 21 f8 stdu r1,-496(r1)
ee63d4: 78 1b 7d 7c mr r29,r3

and a bit later

ee6404: 78 eb a3 7f mr r3,r29
ee6408: c5 e1 17 48 bl 10645cc <ChromeMain@@Base+0x150c10>
ee640c: 00 00 00 60 nop

(the nop is there for POWER ABI reasons)

and then, again a bit later

ee645c: 01 00 80 38 li r4,1
ee6460: 78 eb a3 7f mr r3,r29
ee6464: fd e0 17 48 bl 1064560 <ChromeMain@@Base+0x150ba4>
ee6468: 00 00 00 60 nop
ee646c: 00 00 23 2c cmpdi r3,0
ee6470: a8 ff 82 41 beq ee6418 <_init@@Base+0xa0>
ee6474: fd 00 de 73 andi. r30,r30,253
ee6478: 01 00 9e 2f cmpwi cr7,r30,1
ee647c: a0 ff 9e 40 bne cr7,ee641c <_init@@Base+0xa4>
ee6480: 02 00 80 38 li r4,2
ee6484: 78 eb a3 7f mr r3,r29
ee6488: d9 e0 17 48 bl 1064560 <ChromeMain@@Base+0x150ba4>

so whatever was saved in register 29 is copied to register 3 for
passing to different subroutines.

Looks like a reasonable thing to do, and is certainly the cause
for a fair number of mr instructions - this is C++ with many
small functions.
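
For anyone who wants to check the listing by hand: `mr` is an extended mnemonic for `or rA,rS,rS` (primary opcode 31, extended opcode 444). The following Python sketch decodes the byte sequences shown above, assuming the bytes are printed in little-endian order as in the listing; the function names are invented for this example.

```python
def decode_x_form(hex_bytes):
    """Split a 32-bit POWER instruction word into X-form fields.
    `hex_bytes` is the byte string as printed in the listing
    (assumed little-endian)."""
    word = int.from_bytes(bytes.fromhex(hex_bytes), "little")
    return {
        "primary": word >> 26,
        "rs": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
        "xo": (word >> 1) & 0x3FF,
    }


def as_mr(fields):
    """Return (dest, src) if the word is `or rA,rS,rS`, else None."""
    if (fields["primary"] == 31 and fields["xo"] == 444
            and fields["rs"] == fields["rb"]):
        return fields["ra"], fields["rs"]
    return None


print(as_mr(decode_x_form("78 1b 7d 7c")))  # the save into r29
print(as_mr(decode_x_form("78 eb a3 7f")))  # the copy back into r3
print(as_mr(decode_x_form("00 00 00 60")))  # the nop is not an mr
```

Since a register move only needs two 5-bit specifiers, it is a natural candidate for a short encoding.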

On 2/9/2022 2:58 PM, Brett wrote:
> BGB <cr88192@gmail.com> wrote:
>> On 2/8/2022 10:12 PM, Brett wrote:
>>> BGB <cr88192@gmail.com> wrote:
>>>> On 2/8/2022 3:54 PM, Thomas Koenig wrote:
>>>>> I looked a bit at what to include in the 20-bit instruction subset.
>>>>>
>>>>> Looking at the biggest piece of bloat^H^H^H^H^Hsoftware I can find
>>>>> on POWER, chromium plus supporting shared libraries, a bit more
>>>>> than 58 million instructions, I found the following instruction
>>>>> frequency. The first column is the instruction, the second the
>>>>> number of instructions, the third one percentage, and the fourth
>>>>> one cumulative percentage. The frequency will probably surprise
>>>>> no one here:
>>>>>
>>>>> ld      5837117  11.55  11.55  (load 64-bit with offset)
>>>>> mr      5590458  11.06  22.61  (move register)
>>>>> addi    4750662   9.40  32.00  (add with 16-bit constant)
>>>>> std     3739464   7.40  39.40  (store 64-bit with offset)
>>>>> bl      3390274   6.71  46.11  (branch and link)
>>>>> li      2486296   4.92  51.03  (load immediate)
>>>>> addis   1668924   3.30  54.33  (add immediate shifted)
>>>>> b       1534024   3.03  57.36  (branch)
>>>>> beq     1507796   2.98  60.34  (branch if equal)
>>>>> add     1266393   2.51  62.85  (add)
>>>>> cmpdi   1009923   2.00  64.85  (compare immediate)
>>>>> ori      822801   1.63  66.48  (or immediate)
>>>>> lwz      810375   1.60  68.08  (load word and zero)
>>>>> bne      799486   1.58  69.66  (branch if not equal)
>>>>> stdu     691161   1.37  71.03  (store double with update)
>>>>> stw      690701   1.37  72.39  (store word)
>>>>> cmpwi    642714   1.27  73.66  (compare word immediate)
>>>>> mflr     587134   1.16  74.83  (move from link register)
>>>>> lbz      472965   0.94  75.76  (load byte zero)
>>>>> extsw    465193   0.92  76.68  (extend sign)
>>>>> subf     435575   0.86  77.54  (subtract)
>>>>> sldi     370387   0.73  78.28  (shift left)
>>>>> blr      358973   0.71  78.99  (branch to link register)
>>>>> mtctr    350664   0.69  79.68  (move to counter register)
>>>>> rlwinm   327689   0.65  80.33  (word shifting)
>>>>> stb      326003   0.64  80.97  (store byte with offset)
>>>>>
>>>>> Not all offsets would fit into a 20-bit container, of course.
>>>>> Looking at an instruction format consisting of
>>>>>
>>>>> - one four-bit opcode
>>>>> - two five-bit registers
>>>>> - one six-bit constant
>>>>>
>>>>> it would be possible to fit (original data too long to post
>>>>> here)
>>>>>
>>>>> ld 9.72
>>>>> addi 2.26
>>>>> std 6.50
>>>>> li 3.87
>>>>> cmpdi 1.30
>>>>> lwz 1.02
>>>>> sldi 0.62
>>>>>
>>>>> into that format, close to 25,3% of instructions.
>>>>>
>>>>> For a two or three-register format,
>>>>>
>>>>> mr 11.06
>>>>> add 2.51
>>>>> extsw 0.92
>>>>> subf 0.86
>>>>>
>>>>> would give 15,3% on top.
>>>>>
>>>>> For branches, I would say that 5 bit of POWER offset would correspond
>>>>> to 6 bits of the combined address, which would give a percent or two.
>>>>> So 40% of half-length instructions sounds reasonable.
>>>>>
>>>>> In other words: Recoding POWER into 20 and 40 bit chunks without
>>>>> using any of the additional freedom gained by 40-bit instructions
>>>>> would actually be a gain in code density, without any restrictions
>>>>> in what registers to choose (such as having special instructions
>>>>> for the stack pointer).
>>>>>
>>>>> Trying to compress this into 16 bits would be much more difficult.
>>>>
>>>> A decent chunk of the common instructions can be crammed into 16-bit
>>>> encodings on BJX2, sorta...
>>>>
>>>> But, yeah, the overall rankings in my case seem to be vaguely similar.
>>>>
>>>>
>>>> Top ranking instructions from Doom (aggregated by mnemonic):
>>>> MOV.Q (Load/Store QWord)
>>>> MOV.L (Load/Store DWord)
>>>> BF (Branch if False)
>>>> MOVU.L (Load, Unsigned DWord)
>>>> BT (Branch if True)
>>>> MOVU.B (Load, Unsigned Byte)
>>>> MOV (Move Reg, Reg)
>>>> MOVU.W (Load, Unsigned Word)
>>>> MOV.W (Load/Store, Word)
>>>> ADD (Add, 64-bit)
>>>> TST (Bit Test, ((A&B)==0))
>>>> BRA (Unconditional Branch)
>>>> ADDS.L (Add, 32-bit, Sign-Extending)
>>>> MOV.X (Load/Store, 128-bit)
>>>> SHLD (Logical Shift, 32-bit)
>>>> OR (Bitwise OR)
>>>> CMPQGT (Compare Greater, 64-bit)
>>>> LDIZ (Load Immediate, Zero-Extended)
>>>> ...
>>>>
>>>> Or, aggregating by category:
>>>> Load/Store ops
>>>> Branch Ops
>>>> Common ALU ops
>>>>
>>>>
>>>> Within 16-bit encodings, the bulk of the most commonly encoded:
>>>> MOV (Reg, Reg)
>>>> Branch ops (Disp8)
>>>> ADD (Imm8, Reg)
>>>> LDI (Imm12, R0)
>>>> LDI (Imm8, Reg)
>>>> Load/Store with (SP, Disp4), Various
>>>> CMPEQ (Imm4, Rn)
>>>>
>>>> Within 32-bit encodings:
>>>> Branch Ops (Disp20)
>>>> Load/Store Ops (Disp9)
>>>> ALU Ops (Rm, Imm9, Rn)
>>>> LDI/ADD (Imm16, Rn)
>>>> ...
>>>>
>>>>
>>>> There is much less of a showing for Load/Store with a non-SP base
>>>> register in 16-bit land, but the likely reason here is there is a very
>>>> limited selection of displacement encodings (these encodings are very
>>>> common in terms of 32-bit encodings).
>>>>
>>>> Say, for example, one wants:
>>>> 4b Base Register
>>>> 4b Dest Register
>>>> 3b Disp
>>>> 3b Format
>>>>
>>>> Then, one is already looking at 14 bits of encoding space.
>>>>
>>>> Could in theory cram it down to 12 bits of encoding space by using 3-bit
>>>> register fields, but this would be somewhat limiting.
>>>
>>>
>>> If you do preferred split non-overlapping address-data registers like short
>>> form 8086 then 16 bits is plenty. Only long form needs access to all
>>> registers.
>>>
>>> This is before you add a short belt, which cuts instruction sizes more.
>>>
>>
>> Possible, I was assuming traditional register space more like that in
>> RISC style ISAs (fairly homogeneous, with roles mostly assigned by the ABI).
>>
>>
>> Thumb used a lot of 3-bit register fields, with 32-bit ARM (and Thumb2)
>> using 4-bit registers. However, the relative loss was smaller as a
>> number of the high-end registers were cut off for other uses anyways
>> (PC, SP, LR).
>>
>>
>> Though, I guess one could split up the space several ways:
>> Data vs Address
>> Scratch vs Preserved
>>
>> Then, R0..R7:
>> R0..R1: Scratch, Data
>> R2..R3: Scratch, Addr
>> R4..R5: Preserved, Data
>> R6..R7: Preserved, Addr
>>
>> One could have a 2b register field that encodes R0,R1,R4,R5 or
>> R2,R3,R6,R7; or a 3b "generic" field.
>>
>>
>> One major downside of a small register space though is that it can
>> significantly increase the number of load/store ops, as it is necessary
>> to frequently evict and reload stack variables.
>
> You add an extension byte or short when you need more register bits, like
> x86.
> So you don’t get the downside of evict and reload.
>
> No compromises have your cake and eat it too.
>
> You can have two extensions if you want to go tiny with the instructions;
> 2b, 4b, 6b which is 4, 16, and 64 registers. Split 4 is actually 8
> registers, address and data, the 16 can be half overlap for 24 registers
> and sufficient flexibility for 95% of code. And with 64 you can crush the
> competition on FPU code which eats registers for breakfast. As a former
> game console programmer I can tell you 32 is not enough.
>
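
As a rough illustration of the 20-bit format quoted above (4-bit opcode, two 5-bit registers, 6-bit constant), the following Python sketch packs and unpacks such a word and re-adds the quoted coverage percentages. The field order and the choice of a signed constant are assumptions for the example; the quoted post does not specify either.

```python
def pack20(opcode, ra, rb, imm6):
    """Pack opcode(4) | ra(5) | rb(5) | imm(6) into 20 bits.
    The field order and the signed immediate are assumptions."""
    if not (0 <= opcode < 16 and 0 <= ra < 32 and 0 <= rb < 32
            and -32 <= imm6 < 32):
        raise ValueError("field out of range")
    return (opcode << 16) | (ra << 11) | (rb << 6) | (imm6 & 0x3F)


def unpack20(word):
    """Inverse of pack20; sign-extends the 6-bit constant."""
    imm = word & 0x3F
    if imm & 0x20:
        imm -= 64
    return (word >> 16) & 0xF, (word >> 11) & 0x1F, (word >> 6) & 0x1F, imm


# Coverage figures copied from the quoted tables (percent of instructions).
imm_format = [9.72, 2.26, 6.50, 3.87, 1.30, 1.02, 0.62]
reg_format = [11.06, 2.51, 0.92, 0.86]

print(unpack20(pack20(3, 1, 29, -8)))
print(round(sum(imm_format), 2), round(sum(reg_format), 2),
      round(sum(imm_format) + sum(reg_format), 2))
```

The two groups add up to just over 40% before any short branches are counted, which is where the "40% of half-length instructions" estimate comes from.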

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
>> > And with 64 you can crush the
>> > competition on FPU code which eats registers for breakfast. As a former
>> > game console programmer I can tell you 32 is not enough.
><
>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
>> my attempts at a high-performance design.
><
> It is NOT the decoding of registers that is the problem.
> It is that the register specifiers eat too many bits in the instruction.

... which is one reason I started thinking about 40-bit instruction
bundles :-)

Re: Encoding 20 and 40 bit instructions in 128 bits

<su65pg$lc$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23355&group=comp.arch#23355

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 11 Feb 2022 09:18:39 -0800
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <su65pg$lc$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<ssuf80$i60$1@dont-email.me> <ssulkf$7n0$1@newsreader4.netcologne.de>
<ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com>
<stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 11 Feb 2022 17:18:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8ee200518c244ca3a9e534e3b393ede4";
logging-data="684"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19YVWvGFrKQ1C2zESdCqVLcKOJu6OIWK04="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:IbOKQGBj48+gtUuH72qBrB1pjHE=
In-Reply-To: <su59a7$avf$1@newsreader4.netcologne.de>
Content-Language: en-US

by: Stephen Fuld - Fri, 11 Feb 2022 17:18 UTC

On 2/11/2022 1:12 AM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
>>>> And with 64 you can crush the
>>>> competition on FPU code which eats registers for breakfast. As a former
>>>> game console programmer I can tell you 32 is not enough.
>> <
>>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
>>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
>>> my attempts at a high-performance design.
>> <
>> It is NOT the decoding of registers that is the problem.
>> It is that the register specifiers eat too many bits in the instruction.
>
> ... which is one reason I started thinking about 40-bit instruction
> bundles :-)

Mitch has stated that his ideal instruction size is 36 bits, so how
about this?

Encode 9 instructions, each 36 bits long in a 256 bit bundle. This
wastes 4 bits.

Upsides -

Ability to easily handle 64 registers, longer displacements, etc.
Pretty trivial decoding

Downsides -

"wastes" 12.5% of space over eight 32 bit instructions in the same space.

Note that I haven't really thought through having 18 bit instructions as
well in order to reduce space usage. ISTM that has similar issues to
adding 16 bit instructions to 32 bit "primary" instructions.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

> Encode 9 instructions, each 36 bits long in a 256 bit bundle.

Hmm, I think you'd have to use pretty thin bits to make them fit in
a normal 256-bit bundle ;-)

IOW, I'll presume you meant 7 rather than 9.

> This wastes 4 bits.
[...]
> Note that I haven't really thought through having 18 bit instructions as
> well in order to reduce space usage. ISTM that has similar issues to adding
> 16 bit instructions to 32 bit "primary" instructions.

Of course, you could try and use the extra 4 bits to store some of the
extra info about which instructions are 18bit and which are 36bit.

Stefan

Re: Encoding 20 and 40 bit instructions in 128 bits

<su67bq$d94$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23359&group=comp.arch#23359

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 11 Feb 2022 09:45:27 -0800
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <su67bq$d94$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com>
<stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
<jwvy22h2u60.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 11 Feb 2022 17:45:30 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="33543df21ba2633bcb197ca4b44b6b1f";
logging-data="13604"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX186GjSnwCn6i0b9vbMPsEyXdIuOCDfgbR8="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:uTi2cwMEu2P+e//jfmNL8aZEWQo=
In-Reply-To: <jwvy22h2u60.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US

by: Stephen Fuld - Fri, 11 Feb 2022 17:45 UTC

On 2/11/2022 9:27 AM, Stefan Monnier wrote:
>> Encode 9 instructions, each 36 bits long in a 256 bit bundle.
>
> Hmm I think you'd have to use pretty thin bits to make them fit in
> a normal 256bit bundle ;-)

Yeah, I am working on the "thin bits" :-)

> IOW, I'll presume you meant 7 rather than 9.

Right. Sorry about that. :-(

>
>> This wastes 4 bits.
> [...]
>> Note that I haven't really thought through having 18 bit instructions as
>> well in order to reduce space usage. ISTM that has similar issues to adding
>> 16 bit instructions to 32 bit "primary" instructions.
>
> Of course, you could try and use the extra 4 bits to store some of the
> extra info about which instructions are 18bit and which are 36bit.

Yes. Not enough bits for full generality, but some things are possible.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
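
The corrected bundle arithmetic can be checked with a short Python sketch. The last part assumes that each 36-bit slot holds either one 36-bit instruction or a pair of 18-bit instructions; that slot model is an assumption for illustration, not something stated in the thread.

```python
def bundle_fit(bundle_bits, insn_bits):
    """Return (instructions per bundle, leftover bits)."""
    return divmod(bundle_bits, insn_bits)


count, spare = bundle_fit(256, 36)
baseline = 256 // 32                  # eight 32-bit instructions
loss = 1 - count / baseline           # fewer instructions per bundle

# Assumed slot model: every slot is either 1 x 36 bits or 2 x 18 bits.
layouts = 2 ** count                  # distinct short/long layouts
layouts_encodable = 2 ** spare        # what the spare bits can name

print(count, spare, loss, layouts, layouts_encodable)
```

Under that slot model, fully general marking would need one bit per slot, so the spare bits can only distinguish a subset of the possible layouts, in line with the "not enough bits for full generality" remark.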

Re: Encoding 20 and 40 bit instructions in 128 bits

<8511a1a5-48be-4186-9d6e-8831616ba9a5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23365&group=comp.arch#23365

X-Received: by 2002:ad4:5f4c:: with SMTP id p12mr1905192qvg.114.1644604131909;
Fri, 11 Feb 2022 10:28:51 -0800 (PST)
X-Received: by 2002:a05:6870:3813:: with SMTP id y19mr581434oal.282.1644604130181;
Fri, 11 Feb 2022 10:28:50 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 11 Feb 2022 10:28:50 -0800 (PST)
In-Reply-To: <su65pg$lc$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:84f7:6231:3421:a79e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:84f7:6231:3421:a79e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me> <su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com> <e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8511a1a5-48be-4186-9d6e-8831616ba9a5n@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 11 Feb 2022 18:28:51 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 41

by: MitchAlsup - Fri, 11 Feb 2022 18:28 UTC

On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
> >>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
> >>>> And with 64 you can crush the
> >>>> competition on FPU code which eats registers for breakfast. As a former
> >>>> game console programmer I can tell you 32 is not enough.
> >> <
> >>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
> >>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
> >>> my attempts at a high-performance design.
> >> <
> >> It is NOT the decoding of registers that is the problem.
> >> It is that the register specifiers eat too many bits in the instruction.
> >
> > ... which is one reason I started thinking about 40-bit instruction
> > bundles :-)
> Mitch has stated that his ideal instruction size is 36 bits, so how
> about this?
>
> Encode 9 instructions, each 36 bits long in a 256 bit bundle. This
> wastes 4 bits.
>
> Upsides -
>
> Ability to easily handle 64 registers, longer displacements, etc.
> Pretty trivial decoding
>
> Downsides -
>
> "wastes" 12.5% of space over eight 32 bit instructions in the same space.
<
If you can encode all the work of those 8 32-bit instructions in 7 36-bit
instructions, you are ahead of the game.
>
> Note that I haven't really thought through having 18 bit instructions as
> well in order to reduce space usage. ISTM that has similar issues to
> adding 16 bit instructions to 32 bit "primary" instructions.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Encoding 20 and 40 bit instructions in 128 bits

<71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23366&group=comp.arch#23366

X-Received: by 2002:ac8:7e8c:: with SMTP id w12mr2097649qtj.342.1644604503711;
Fri, 11 Feb 2022 10:35:03 -0800 (PST)
X-Received: by 2002:a05:6870:7c10:: with SMTP id je16mr556236oab.267.1644604503468;
Fri, 11 Feb 2022 10:35:03 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 11 Feb 2022 10:35:03 -0800 (PST)
In-Reply-To: <su65pg$lc$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:84f7:6231:3421:a79e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:84f7:6231:3421:a79e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me> <su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com> <e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 11 Feb 2022 18:35:03 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 55

by: MitchAlsup - Fri, 11 Feb 2022 18:35 UTC

On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
> >>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
> >>>> And with 64 you can crush the
> >>>> competition on FPU code which eats registers for breakfast. As a former
> >>>> game console programmer I can tell you 32 is not enough.
> >> <
> >>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
> >>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
> >>> my attempts at a high-performance design.
> >> <
> >> It is NOT the decoding of registers that is the problem.
> >> It is that the register specifiers eat too many bits in the instruction.
> >
> > ... which is one reason I started thinking about 40-bit instruction
> > bundles :-)
<
> Mitch has stated that his ideal instruction size is 36 bits, so how
> about this?
<
Just to bring everyone up to speed::
<
36-bits allows one to encode everything My 66000 encodes in 32-bits
but also to provide room for sized arithmetic {8,16,32,64}×{int,fp},
saturation, overflow, wrap,...along with CARRY {int, fp}, plus
<
>
> Encode 9 instructions, each 36 bits long in a 256 bit bundle. This
> wastes 4 bits.
>
> Upsides -
>
> Ability to easily handle 64 registers, longer displacements, etc.
> Pretty trivial decoding
>
> Downsides -
>
> "wastes" 12.5% of space over eight 32 bit instructions in the same space.
>
> Note that I haven't really thought through having 18 bit instructions as
> well in order to reduce space usage. ISTM that has similar issues to
> adding 16 bit instructions to 32 bit "primary" instructions.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Encoding 20 and 40 bit instructions in 128 bits

<su6ed7$5ke$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23368&group=comp.arch#23368

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 11 Feb 2022 13:45:38 -0600
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <su6ed7$5ke$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<ssuf80$i60$1@dont-email.me> <ssulkf$7n0$1@newsreader4.netcologne.de>
<ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com>
<stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 11 Feb 2022 19:45:43 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="bb025355d156358a87ec912e02455692";
logging-data="5774"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+oa7CAVISoWVJVUHJXAdSQ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:6di822gM98BHUqLopPXXxd6LYI0=
In-Reply-To: <su65pg$lc$1@dont-email.me>
Content-Language: en-US

by: BGB - Fri, 11 Feb 2022 19:45 UTC

On 2/11/2022 11:18 AM, Stephen Fuld wrote:
> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com
>>>> wrote:
>>>>> And with 64 you can crush the
>>>>> competition on FPU code which eats registers for breakfast. As a
>>>>> former
>>>>> game console programmer I can tell you 32 is not enough.
>>> <
>>>> This is useful to know. Despite Mitch noting that decoding more than
>>>> 32 registers
>>>> is a problem, then, I guess I'll be putting the 128-register bank
>>>> feature back into
>>>> my attempts at a high-performance design.
>>> <
>>> It is NOT the decoding of registers that is the problem.
>>> It is that the register specifiers eat too many bits in the instruction.
>>
>> ... which is one reason I started thinking about 40-bit instruction
>> bundles :-)
>
> Mitch has stated that his ideal instruction size is 36 bits, so how
> about this?
>
> Encode 9 instructions, each 36 bits long in a 256 bit bundle. This
> wastes 4 bits.
>
> Upsides -
>
>     Ability to easily handle 64 registers, longer displacements, etc.
>     Pretty trivial decoding
>
> Downsides -
>
>     "wastes" 12.5% of space over eight 32 bit instructions in the same
> space.
>
> Note that I haven't really thought through having 18 bit instructions as
> well in order to reduce space usage. ISTM that has similar issues to
> adding 16 bit instructions to 32 bit "primary" instructions.
>

Or, how about 40 bits, and gain the "feature" that one can still have
all their instructions be byte-aligned?...

Or, a 24/40 ISA:
Short instructions are 24 bits;
Long instructions are 40 bits.

My thinking here being, say:
24 bit instructions use 5-bit register fields;
40 bit instructions use 7-bit register fields;

We have, say, 64 GPRs, but roughly half the register space is left for
non-GPRs (no longer have to cut holes in the register map to try to
stick things like a Zero Register and similar in there).

I guess it could also be possible to encode immediate values by gluing
them onto the instruction, rather than by subtracting from the base
instruction.

Say (2b / 3R):
00: Rt=Reg
01: Rt=Imm7
10: Rt=Imm15 (+8 bits)
11: Rt=Imm31 (+24 bits)

Say, low bits of instruction:
000 (3B)
010 (3B)
100 (4B)
110 (6B)
0001 (5B)
0101 (5B)
1001 (6B)
1101 (8B)
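
A minimal sketch of how the length tags above could be decoded. It assumes the patterns are written most-significant bit first, that the tag sits in the least significant bits of the first instruction word, and that the two bits above the size bit(s) are the Rt mode from the earlier list. The function name, the table name, and the handling of low bits 11 (not assigned in the post) are hypothetical.

```python
# Extra bytes glued on per Rt mode, taken from the 2b list above:
# 00 = Reg, 01 = Imm7 (fits in the base op), 10 = Imm15 (+8 bits), 11 = Imm31 (+24 bits).
EXTRA_BYTES = {0b00: 0, 0b01: 0, 0b10: 1, 0b11: 3}


def instr_length_bytes(word: int) -> int:
    """Return the total instruction length in bytes from the low tag bits.

    Assumed layout (not confirmed by the post):
      xx0  -> 24-bit base op, xx = Rt mode
      xx01 -> 40-bit base op, xx = Rt mode
    """
    if word & 0b1 == 0:
        base = 3                      # 24-bit short form
        mode = (word >> 1) & 0b11
    elif word & 0b11 == 0b01:
        base = 5                      # 40-bit long form
        mode = (word >> 2) & 0b11
    else:
        # Low bits 11 do not appear in the table; treated as unassigned here.
        raise ValueError("low bits 11 are not assigned in this sketch")
    return base + EXTRA_BYTES[mode]
```

Under these assumptions the eight tags map to 3, 3, 4, 6 bytes for the short forms and 5, 5, 6, 8 bytes for the long forms, which matches the sizes listed.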

....

For 2R ops, the immediate bits are either glued onto a combined (larger)
immediate field, or used as opcode bits.

Or, alternatively, 96 GPRs, with 32 left over for non-GPRs and
control registers.
R0..R95: GPR Space
R96..R127 = C0..C31 = Control Registers

Though, 96 is "less ideal" for an FPGA than 32 or 64 (the "sweet spots"
for LUTRAM being 32 and 64, and 512 for BRAM, *).

*: Playing the magic numbers game: 512*2*16B = 16K ...

However, one could assume that all the GPRs are actually 128-bits
internally (48 x 128), but merely appear as 96x64 for 64-bit ops (the
LSB of the GPR in effect serving as a Low/High bit).
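
As a rough illustration of the 48 x 128 view, the mapping from a 64-bit register number to a physical 128-bit register and a half could look like this. The function name and the choice that LSB 0 means the low half are assumptions; the post only says the LSB serves as a Low/High bit.

```python
def physical_slot(reg: int) -> tuple[int, int]:
    """Map a 64-bit GPR number (R0..R95) onto a (128-bit register, half) pair.

    Assumption: half 0 is the low 64 bits, half 1 is the high 64 bits.
    """
    if not 0 <= reg < 96:
        raise ValueError("GPR space in this sketch is R0..R95")
    return reg >> 1, reg & 1
```

With this mapping R0 and R1 share physical register 0, and R94 and R95 share physical register 47, so 96 architectural 64-bit registers need only 48 physical 128-bit entries.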

Though, at least with 96 GPRs, one can consolidate all the "non GPRs"
into a single location.

Examples of non-GPRs being, eg:
PC, GBR, SP, LR, ZR, ...

So, say (allowing for high/low pairing for addresses):
...
C20: TBR.L (Task Base / Context)
C21: TBR.H
C22: GBR.L (Global Base)
C23: GBR.H
C24: SP.L (Stack Pointer)
C25: SP.H
C26: LR.L (Link Register)
C27: LR.H
C28: PC.L (Program Counter)
C29: PC.H
C30: ZZR (Zero)
C31: ZZR (Zero)

The 24-bit ops could use a "compacted" register space, say:
R0..R25: GPRs
R26: TBR
R27: GBR
R28: SP
R29: LR
R30: PC
R31: ZR

With registers being interpreted as either 64 or 128 bit depending on
context (64-bit ops only able to access the low half of each register in
24-bit base encodings).

For 24-bit ops, this would leave:
6b opcode (3R)
11b opcode (2R)

Say:
tttt-tmmm_mmnn-nnnz_zzzz-zppp

Though, this wouldn't go very far, say:
tttt-tmmm_mmnn-nnns_ss00-0ppp (Ld/St)
tttt-tmmm_mmnn-nnns_ss00-1ppp -
tttt-tmmm_mmnn-nnnz_zz01-0ppp (ALU 3R, op=3b)
zzzz-zmmm_mmnn-nnnz_zz01-1ppp (2R, op=8b)
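
A small sketch of pulling the fields out of the generic 24-bit pattern tttt-tmmm_mmnn-nnnz_zzzz-zppp. It assumes the leftmost character is bit 23, and the helper names are hypothetical; the post does not say what the ppp bits hold, so they are returned as-is.

```python
def encode24(t: int, m: int, n: int, z: int, p: int) -> int:
    """Pack three 5-bit register fields, a 6-bit opcode and 3 low bits into 24 bits."""
    assert 0 <= t < 32 and 0 <= m < 32 and 0 <= n < 32
    assert 0 <= z < 64 and 0 <= p < 8
    return (t << 19) | (m << 14) | (n << 9) | (z << 3) | p


def decode24(word: int) -> dict[str, int]:
    """Split a 24-bit word into its t/m/n/z/p fields (MSB-first layout assumed)."""
    return {
        "t": (word >> 19) & 0x1F,  # bits 23..19
        "m": (word >> 14) & 0x1F,  # bits 18..14
        "n": (word >> 9) & 0x1F,   # bits 13..9
        "z": (word >> 3) & 0x3F,   # bits 8..3, the 6b opcode for 3R
        "p": word & 0x7,           # bits 2..0
    }
```

The field widths add up as 5 + 5 + 5 + 6 + 3 = 24, which is where the 6b opcode figure for 3R comes from.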

For 40-bit ops, this would leave:
15b opcode (3R)
22b opcode (2R)

....

Oh well, mostly just kinda coming up with wacky stuff that is kinda
going against the wind in this thread...

Re: Encoding 20 and 40 bit instructions in 128 bits

<su6eji$71k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23369&group=comp.arch#23369

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 11 Feb 2022 11:49:04 -0800
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <su6eji$71k$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com>
<stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
<71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 11 Feb 2022 19:49:06 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="33543df21ba2633bcb197ca4b44b6b1f";
logging-data="7220"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19bV6TCf03yDC5ArsaX41Zz7Lq2aZj/HDo="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:ql7PvxP0r9EamTWY4O3apP+3gVQ=
In-Reply-To: <71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>
Content-Language: en-US

by: Stephen Fuld - Fri, 11 Feb 2022 19:49 UTC

On 2/11/2022 10:35 AM, MitchAlsup wrote:
> On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
>> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>>>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
>>>>>> And with 64 you can crush the
>>>>>> competition on FPU code which eats registers for breakfast. As a former
>>>>>> game console programmer I can tell you 32 is not enough.
>>>> <
>>>>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
>>>>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
>>>>> my attempts at a high-performance design.
>>>> <
>>>> It is NOT the decoding of registers that is the problem.
>>>> It is that the register specifiers eat too many bits in the instruction.
>>>
>>> ... which is one reason I started thinking about 40-bit instruction
>>> bundles :-)
> <
>> Mitch has stated that his ideal instruction size is 36 bits, so how
>> about this?
> <
> Just to bring everyone up to speed::
> <
> 36-bits allows one to encode everything My 66000 encodes in 32-bits
> but also to provide room for sized arithmetic {8,16,32,64}×{int,fp},
> saturation, overflow, wrap,...along with CARRY {int, fp}, plus

Did you prematurely send this, or did you intend it to end there?

Also, it seems you want to use the extra bits for more op-codes or op
code modifiers and not to support 64 GPRs. Is that right? Or some mix
of both?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Encoding 20 and 40 bit instructions in 128 bits

<728b6ea4-bc37-40aa-a505-f7c3e173c358n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23371&group=comp.arch#23371

X-Received: by 2002:a05:620a:40c8:: with SMTP id g8mr1654425qko.706.1644609798064;
Fri, 11 Feb 2022 12:03:18 -0800 (PST)
X-Received: by 2002:a05:6870:7c10:: with SMTP id je16mr653546oab.267.1644609797878;
Fri, 11 Feb 2022 12:03:17 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 11 Feb 2022 12:03:17 -0800 (PST)
In-Reply-To: <su6eji$71k$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:84f7:6231:3421:a79e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:84f7:6231:3421:a79e
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssulkf$7n0$1@newsreader4.netcologne.de>
<ssun38$imq$1@dont-email.me> <c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com>
<stec4m$kg0$1@dont-email.me> <de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com> <4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com> <_IhLJ.4280$0vE9.17@fx17.iad>
<3%tLJ.35102$t2Bb.34664@fx98.iad> <sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com> <stuoqv$97e$1@newsreader4.netcologne.de>
<stv51f$da9$1@dont-email.me> <stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me> <731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com> <su59a7$avf$1@newsreader4.netcologne.de>
<su65pg$lc$1@dont-email.me> <71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>
<su6eji$71k$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <728b6ea4-bc37-40aa-a505-f7c3e173c358n@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 11 Feb 2022 20:03:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 58

by: MitchAlsup - Fri, 11 Feb 2022 20:03 UTC

On Friday, February 11, 2022 at 1:49:10 PM UTC-6, Stephen Fuld wrote:
> On 2/11/2022 10:35 AM, MitchAlsup wrote:
> > On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
> >> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
> >>> MitchAlsup <Mitch...@aol.com> schrieb:
> >>>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
> >>>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
> >>>>>> And with 64 you can crush the
> >>>>>> competition on FPU code which eats registers for breakfast. As a former
> >>>>>> game console programmer I can tell you 32 is not enough.
> >>>> <
> >>>>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
> >>>>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
> >>>>> my attempts at a high-performance design.
> >>>> <
> >>>> It is NOT the decoding of registers that is the problem.
> >>>> It is that the register specifiers eat too many bits in the instruction.
> >>>
> >>> ... which is one reason I started thinking about 40-bit instruction
> >>> bundles :-)
> > <
> >> Mitch has stated that his ideal instruction size is 36 bits, so how
> >> about this?
> > <
> > Just to bring everyone up to speed::
> > <
> > 36-bits allows one to encode everything My 66000 encodes in 32-bits
> > but also to provide room for sized arithmetic {8,16,32,64}×{int,fp},
> > saturation, overflow, wrap,...along with CARRY {int, fp}, plus
> Did you prematurely send this, or did you intend it to end there?
>
> Also, it seems you want to use the extra bits for more op-codes or op
> code modifiers and not to support 64 GPRs. Is that right? Or some mix
> of both?
<
I intended to imply other things can use those extra bits, but that I don't
need to influence those thinking about what they might be used for.
<
I am not a fan of 64 registers,
you gain so little, 3%-ish
you lose in subroutines
you lose in context switch
you lose in memory footprint
you eat up 3-4 of your instruction bits--which is all you added in the first place!
So it is quite likely that you make negative forward progress.
<
A decided choice.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Encoding 20 and 40 bit instructions in 128 bits

<53335b0a-0f90-4836-8cf9-5668308e38a9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23374&group=comp.arch#23374

X-Received: by 2002:a05:620a:94e:: with SMTP id w14mr1736622qkw.485.1644612018677;
Fri, 11 Feb 2022 12:40:18 -0800 (PST)
X-Received: by 2002:a05:6808:2003:: with SMTP id q3mr1063749oiw.133.1644612018379;
Fri, 11 Feb 2022 12:40:18 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 11 Feb 2022 12:40:18 -0800 (PST)
In-Reply-To: <8511a1a5-48be-4186-9d6e-8831616ba9a5n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:1936:c1f:e117:5700;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:1936:c1f:e117:5700
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <ssuf80$i60$1@dont-email.me>
<ssulkf$7n0$1@newsreader4.netcologne.de> <ssun38$imq$1@dont-email.me>
<c9f3b05a-cc34-4ec9-a7da-88c1fb31614dn@googlegroups.com> <stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com> <2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com> <b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de> <9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me> <su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com> <e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me> <8511a1a5-48be-4186-9d6e-8831616ba9a5n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <53335b0a-0f90-4836-8cf9-5668308e38a9n@googlegroups.com>
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Fri, 11 Feb 2022 20:40:18 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 49

by: Quadibloc - Fri, 11 Feb 2022 20:40 UTC

On Friday, February 11, 2022 at 11:28:53 AM UTC-7, MitchAlsup wrote:

> If you can encode all the work of those 8 32-bit instructions in 7 36-bit
> instructions, you are ahead of the game.

That's true. But many people will be skeptical. After all, eight
instructions do eight things, seven instructions do seven things.

Making each instruction a small fraction more flexible doesn't
seem likely to be able to save a whole instruction that frequently.

Your My 66000 architecture, though, includes stuff like inverting
operands as part of the instruction format, so that does add
operations. This has tended not to have been thought of as a
good idea since the PDP-8, so it's not what is usually considered.

In some of my ISA creation endeavors, it looked to me like an
optimal architecture could have 36-bit memory-reference
instructions, and 24-bit register-to-register instructions. Add 18-bit
short instructions... and you need to address six-bit characters,
and instruction decoding becomes a slow serial process. The costs
of making all the instructions exactly 32 bits long seemed to be worth
paying.

*Oh dear.* This discussion has now inspired me with another
crazy idea. While it won't help for immediate values, an ISA where
every instruction is 32 bits long, and yet some instructions are
longer than that, and others are shorter than that, could be pulled
off like _this_ instead of by the scheme I've been using...

Form of short instructions:
the short instruction
a bit indicating one of two instruction piece storage registers of
a given length
the bits to be saved in that register

Form of long instructions:
the first 31 bits of the long instruction
a bit indicating which of the two registers, for the amount of
bits needed, to pull the rest from

Standard instructions are just 32 bits of instruction.

By having two instruction piece storage registers of each length
used (say one pair that's 14 bits long, and one pair that's 7 bits long)
there is a bit of flexibility so that short instructions and long
instructions can almost be freely mixed. Initial decoding is in
parallel, but final decoding of long instructions is serialized.

John Savard

Quadibloc <jsavard@ecn.ab.ca> schrieb:
> On Friday, February 11, 2022 at 11:28:53 AM UTC-7, MitchAlsup wrote:
>
>> If you can encode all the work of those 8 32-bit instructions in 7 36-bit
>> instructions, you are ahead of the game.
>
> That's true. But many people will be skeptical. After all, eight
> instructions do eight things, seven instructions do seven things.

Compare and branch could profit a lot from more bits, 36 or 40.

Assume

- 6 bit primary opcode
- 10 bit for two registers to compare, or for a register and a
constant
- 3 bits for comparison code

you are left with either 13, 17 or 21 bits of offset for a 32,
36 or 40 bit instruction word. Subtract two bits if you have 64
registers / a 6-bit constant. (RISC-V needs to have a 7-bit
primary opcode because of their 16-bit instructions, so it has
one bit less, and loses one bit due to having 16-bit instructions).
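
The bit budget above can be checked with a few lines. This is only the arithmetic from the list, with hypothetical parameter names; it says nothing about how the offset is scaled or sign-extended.

```python
def branch_offset_bits(word_bits: int, reg_bits: int = 5,
                       opcode_bits: int = 6, cond_bits: int = 3) -> int:
    """Bits left for the branch offset after opcode, two operands and condition."""
    return word_bits - opcode_bits - 2 * reg_bits - cond_bits


# 32 registers (5-bit fields) versus 64 registers (6-bit fields)
with_32_regs = [branch_offset_bits(w) for w in (32, 36, 40)]
with_64_regs = [branch_offset_bits(w, reg_bits=6) for w in (32, 36, 40)]
```

With 5-bit operand fields this gives 13, 17 and 21 bits; with 6-bit fields it gives 11, 15 and 19, i.e. two bits fewer in each case.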

It is rather hard to do anything useful with 16 bits with full 32
registers. Does anybody have data how well RISC-V manages to
reduce code size with their 16-bit instructions, and if there
is performance degradation? Unfortunately, I do not
have access to a RISC-V machine to run my own analysis.

Re: Encoding 20 and 40 bit instructions in 128 bits

<su6r6n$tu4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23378&group=comp.arch#23378

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 11 Feb 2022 17:24:03 -0600
Organization: A noiseless patient Spider
Lines: 194
Message-ID: <su6r6n$tu4$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<stec4m$kg0$1@dont-email.me>
<de6ecdef-8e30-40aa-838f-df08d10389e7n@googlegroups.com>
<2fd5e668-fbe5-4399-bf74-f5e509d669ebn@googlegroups.com>
<4fca5742-1815-4b31-8ea9-2da1592f3456n@googlegroups.com>
<b38538d0-7394-439b-a227-ede56b4b4040n@googlegroups.com>
<_IhLJ.4280$0vE9.17@fx17.iad> <3%tLJ.35102$t2Bb.34664@fx98.iad>
<sto41r$4tj$1@newsreader4.netcologne.de>
<9de2cef4-0cfc-4a6b-a96a-fc7cbc836966n@googlegroups.com>
<stuoqv$97e$1@newsreader4.netcologne.de> <stv51f$da9$1@dont-email.me>
<stvf02$v7b$1@dont-email.me> <su10pc$1e5$1@dont-email.me>
<su19ub$9hr$1@dont-email.me>
<731cd4dd-35b7-4509-96d5-bf33c1e8c60fn@googlegroups.com>
<e4234dae-c8de-4ba6-949b-696821e1a15dn@googlegroups.com>
<su59a7$avf$1@newsreader4.netcologne.de> <su65pg$lc$1@dont-email.me>
<71d1dc9e-46e6-4c13-956e-5e90b352984dn@googlegroups.com>
<su6eji$71k$1@dont-email.me>
<728b6ea4-bc37-40aa-a505-f7c3e173c358n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 11 Feb 2022 23:24:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4c0980b42d00c6b1059fddc107918b04";
logging-data="30660"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/grCXmUFwJr/JcVQhxFxTK"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:yX2IIPLd2L21DmOKYsfhKNUtW/Y=
In-Reply-To: <728b6ea4-bc37-40aa-a505-f7c3e173c358n@googlegroups.com>
Content-Language: en-US

by: BGB - Fri, 11 Feb 2022 23:24 UTC

On 2/11/2022 2:03 PM, MitchAlsup wrote:
> On Friday, February 11, 2022 at 1:49:10 PM UTC-6, Stephen Fuld wrote:
>> On 2/11/2022 10:35 AM, MitchAlsup wrote:
>>> On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
>>>> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
>>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>>>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>>>>>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
>>>>>>>> And with 64 you can crush the
>>>>>>>> competition on FPU code which eats registers for breakfast. As a former
>>>>>>>> game console programmer I can tell you 32 is not enough.
>>>>>> <
>>>>>>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
>>>>>>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
>>>>>>> my attempts at a high-performance design.
>>>>>> <
>>>>>> It is NOT the decoding of registers that is the problem.
>>>>>> It is that the register specifiers eat too many bits in the instruction.
>>>>>
>>>>> ... which is one reason I started thinking about 40-bit instruction
>>>>> bundles :-)
>>> <
>>>> Mitch has stated that his ideal instruction size is 36 bits, so how
>>>> about this?
>>> <
>>> Just to bring everyone up to speed::
>>> <
>>> 36-bits allows one to encode everything My 66000 encodes in 32-bits
>>> but also to provide room for sized arithmetic {8,16,32,64}×{int,fp},
>>> saturation, overflow, wrap,...along with CARRY {int, fp}, plus
>> Did you prematurely send this, or did you intend it to end there?
>>
>> Also, it seems you want to use the extra bits for more op-codes or op
>> code modifiers and not to support 64 GPRs. Is that right? Or some mix
>> of both?
> <
> I intended to imply other things can use those extra bits, but that I don't
> need to influence those thinking about what they might be used for.
> <
> I am not a fan of 64-registers,
> you gain so little 3%-ish
> you lose in subroutines
> you lose in context switch
> you lose in memory footprint
> you eat up 3-4 of your instruction bits--which is all you added in the first place!
> So it is quite likely you make negative forward progress.
> <
> A decided choice.

My current experience (for register counts):
8 generally sucks;
16 isn't quite enough (better than 8 at least);
32 is good enough for most stuff;
64 can be better in some cases.
But, yeah, encoding space is an issue.

32 vs 64 GPRs is kind of a toss-up: in many cases one isn't likely to
see a big difference in register spill rate (except when doing nearly
everything in register pairs, in which case a 32 GPR ISA may effectively
function as if it were a 16 GPR ISA).

24 bit instructions:
5b sorta works, 3R is pushing it, 2R is OK.

32 bit instructions:
5b is a sane default;
6b works, but is pushing it (18 out of 32 bits for 3R).

40 bit instructions:
May as well go for 6 or 7 bits.

Granted, 40b (with 7b regs) vs 32b (with 5b regs):
Adds 8 bits to instruction;
Loses 6 bits to registers;
Net gain of 2 bits.

So, the bigger register fields would mostly eat the difference.
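
The bit budget above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a plain 3R format in which every register field has the same width, and the function name opcode_bits is made up for illustration.

```python
def opcode_bits(instr_bits: int, reg_bits: int, num_regs: int = 3) -> int:
    """Bits left for opcode, modifiers, and immediates after the register fields."""
    return instr_bits - reg_bits * num_regs


# Combinations mentioned in the post, all assumed to be 3R encodings.
for instr_bits, reg_bits in [(24, 5), (32, 5), (32, 6), (40, 6), (40, 7)]:
    used = reg_bits * 3
    left = opcode_bits(instr_bits, reg_bits)
    print(f"{instr_bits}b instr, {reg_bits}b regs: {used} register bits, {left} bits left")

# 40b with 7b registers versus 32b with 5b registers.
net_gain = opcode_bits(40, 7) - opcode_bits(32, 5)
print("net gain:", net_gain)
```

With 7-bit fields, 21 of the 40 bits go to register specifiers, so the wider instruction leaves only 2 more bits for everything else than the 32-bit baseline does.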

Multiples of 8 bits allow for conventional byte-oriented memory.

Non-power-of-2 sizes assume freely aligned memory; otherwise, almost
invariably, some space will be wasted in each aligned unit.
If the repeat unit is larger than 256 bits, then this is likely worse
than had free-form bytes been used (a bit-aligned pattern that repeats
after 128 or 256 bits is at least workable).

256 bits would give 7x 36-bit ops, or 6x 40 bit ops.

So, for 8192 bits:
16b: 512 ops (16-bit aligned)
18b: 455 ops (dense packed), 448 ops (256b bundle)
24b: 341 ops (byte-aligned), 320 ops (256b bundle)
32b: 256 ops (32-bit aligned)
36b: 227 ops (dense packed), 224 ops (256b bundle)
40b: 204 ops (byte-aligned), 192 ops (256b bundle)

Of these, 18 or 36 bit would be best if one assumes a 256-bit bundle
packing.
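
For anyone wanting to recompute the op counts, a rough sketch follows. It assumes ops never straddle a bundle boundary in the bundled case and that the 8192 bits are otherwise contiguous; the function names are invented for this example.

```python
TOTAL_BITS = 8192
BUNDLE_BITS = 256


def dense_ops(op_bits: int, total_bits: int = TOTAL_BITS) -> int:
    """Ops that fit when packed back to back with no alignment padding."""
    return total_bits // op_bits


def bundled_ops(op_bits: int, total_bits: int = TOTAL_BITS,
                bundle_bits: int = BUNDLE_BITS) -> int:
    """Ops that fit when each bundle holds a whole number of ops."""
    return (total_bits // bundle_bits) * (bundle_bits // op_bits)


for op_bits in (16, 18, 24, 32, 36, 40):
    per_bundle = BUNDLE_BITS // op_bits
    wasted = BUNDLE_BITS - per_bundle * op_bits  # padding bits per 256b bundle
    print(f"{op_bits}b: dense={dense_ops(op_bits)}, "
          f"bundled={bundled_ops(op_bits)}, wasted/bundle={wasted}")
```

The padding per bundle is what separates the sizes: 18 and 36 bit ops leave 4 bits unused per 256-bit bundle, while 24 and 40 bit ops leave 16.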

And, 24, 32, or 40 makes sense if using byte-aligned memory.

And, 16 or 32 allows for a 16 or 32 bit aligned memory, which saves some
LUTs if compared with byte-aligned.

However, the relative cost/complexity difference between "fetch N 16-bit
words", "fetch N 32-bit words", and "fetch N bytes", isn't that drastic.

One drawback of forsaking byte-aligned memory is that this would have a
(significant) adverse impact on the effectiveness of typical LZ77
compressors.

For byte-aligned encodings, defining an immediate field as part of the
baseline instruction encoding seems like an interesting possibility.

This could allow some of the benefits of a variable-length immediate
field, without the drawbacks of some other approaches (such as encoding
the presence of an immediate field via a special register or addressing
mode).

Eg: Encoding a "Load Immediate" instruction as something like "LD.W
@PC+, Rn" or similar kinda sucks...

Nearly every instruction category ends up needing immediate values, so
it may make sense to make them a baseline feature (much like the
existence of addressing modes in traditional Reg/Mem ISA's).

Current thinking would be to still stick with a Load/Store design with
(Reg,Disp) and (Reg,Index) addressing, as this is sufficient for most
things.

Similarly, a Reg/Mem ISA is not ideal for a pipelined implementation.

I guess an open question is if an ISA can be designed to make it fairly
easy to implement a superscalar decoder without needing to use a "WEX
bit" or other similar approach (or how expensive it would be to apply
the WEX approach to an ISA with free-form byte aligned instructions).

A combination would be to allow for a WEX encoding which allows bundling
of byte-aligned instructions, but only in certain combinations, eg:
24+24, 24+24+24; 32+32, 32+32+32; 40+40.
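
As a toy illustration of such a restriction (not a decoder design), the allowed shapes can be written down as a set and checked directly. The assumption that a lone 24, 32, or 40 bit instruction is always legal is mine, as are all the names below.

```python
# Allowed bundle shapes, as instruction widths in bits (from the list above).
ALLOWED_BUNDLES = {
    (24, 24), (24, 24, 24),
    (32, 32), (32, 32, 32),
    (40, 40),
}
SINGLE_WIDTHS = {24, 32, 40}  # assumption: any lone instruction is legal


def is_legal_group(widths) -> bool:
    """True if this sequence of instruction widths may issue together."""
    widths = tuple(widths)
    if len(widths) == 1:
        return widths[0] in SINGLE_WIDTHS
    return widths in ALLOWED_BUNDLES


def group_bytes(widths) -> int:
    """Total fetch size of the group in bytes."""
    return sum(widths) // 8


widest = max(group_bytes(b) for b in ALLOWED_BUNDLES)
print("widest bundle:", widest, "bytes")
print(is_legal_group((32, 32, 32)), is_legal_group((24, 32)))
```

Under these rules the widest bundle is 96 bits (12 bytes), and mixed-width groups such as 24+32 are simply not bundled.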

Likely sticking with a 3-wide pipeline and 6R+3W register file.

....

Meanwhile, starts wondering if there is any "good compromise" between a
hardware page-walker and software managed TLB.

Like, for however long I have been poking at it (I suspect upwards of a
year by this point), fully achieving stability in the face of TLB miss
handler interrupts remains elusive (there are timing related bugs
crapping all over this that I can't seem to find or fix).

Poking at the L1 D$ to try to fix some other cache coherency issues
apparently made this situation worse (stuff is now a lot more
crash-prone). Though, the associated changes (while working on the epoch
flushing for the L1 caches), did at least point out some other timing
bugs (*1).

*1: For example, there was a bug where the cache would resume execution
if "any" store response arrived, rather than only the "correct" store
response. It could (somehow) get into a state where load responses
were arriving back at the L1 before their associated store responses,
and the cache was moving forward based on the arrival of responses to
previous stores (rather than the one matching the request sent by the L1).
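
A behavioral toy model of the difference between the two release rules is sketched below. It is not the actual L1 logic: the idea that each request carries a sequence number, and every name in the class, are assumptions made only to show the "any response" versus "matching response" distinction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreWaiter:
    """Toy model of an L1 waiting on one outstanding store (hypothetical)."""
    pending_seq: Optional[int] = None

    def send_store(self, seq: int) -> None:
        # Remember which request the cache is currently waiting on.
        self.pending_seq = seq

    def may_proceed_buggy(self, resp_seq: int) -> bool:
        # Buggy rule: any store response releases the cache.
        return self.pending_seq is not None

    def may_proceed_fixed(self, resp_seq: int) -> bool:
        # Fixed rule: only the response matching the outstanding request does.
        return self.pending_seq is not None and resp_seq == self.pending_seq


w = StoreWaiter()
w.send_store(7)
# A late response to an earlier store (seq 6) arrives first.
print(w.may_proceed_buggy(6), w.may_proceed_fixed(6))
print(w.may_proceed_fixed(7))
```

In this model the stale response for store 6 releases the buggy version but not the fixed one, which only moves forward once the response for store 7 arrives.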

Fixing this bug, ironically, caused the L1 to change timing slightly,
and seems to (also) have started causing the extra crashes related to
the TLB miss handlers.

A big issue is that the arrival of the interrupt (at the EX pipeline)
isn't necessarily in lock-step with the interrupt being generated at the
TLB, and the pipeline needs to be able to advance for the "interrupt
dispatch" to take place (so, it tries to grab the state from the oldest
valid instruction still in the pipeline, which is hopefully the one that
resulted in the interrupt, or that the side-effects are no-op if the
same instruction is executed twice).

Though, it could be possible to keep the pipeline stalled until the
interrupt dispatch is "ready to go", and then try to more precisely
flush the pipeline.

I guess, would be ideal if some of this were "less crap"...

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Friday, February 11, 2022 at 1:49:10 PM UTC-6, Stephen Fuld wrote:
>> On 2/11/2022 10:35 AM, MitchAlsup wrote:
>>> On Friday, February 11, 2022 at 11:18:43 AM UTC-6, Stephen Fuld wrote:
>>>> On 2/11/2022 1:12 AM, Thomas Koenig wrote:
>>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>>>> On Wednesday, February 9, 2022 at 3:53:18 PM UTC-6, Quadibloc wrote:
>>>>>>> On Wednesday, February 9, 2022 at 1:58:55 PM UTC-7, gg...@yahoo.com wrote:
>>>>>>>> And with 64 you can crush the
>>>>>>>> competition on FPU code which eats registers for breakfast. As a former
>>>>>>>> game console programmer I can tell you 32 is not enough.
>>>>>> <
>>>>>>> This is useful to know. Despite Mitch noting that decoding more than 32 registers
>>>>>>> is a problem, then, I guess I'll be putting the 128-register bank feature back into
>>>>>>> my attempts at a high-performance design.
>>>>>> <
>>>>>> It is NOT the decoding of registers that is the problem.
>>>>>> It is that the register specifiers eat too many bits in the instruction.
>>>>>
>>>>> ... which is one reason I started thinking about 40-bit instruction
>>>>> bundles :-)
>>> <
>>>> Mitch has stated that his ideal instruction size is 36 bits, so how
>>>> about this?
>>> <
>>> Just to bring everyone up to speed::
>>> <
>>> 36-bits allows one to encode everything My 66000 encodes in 32-bits
>>> but also to provide room for sized arithmetic {8,16,32,64}×{int,fp},
>>> saturation, overflow, wrap,...along with CARRY {int, fp}, plus
>> Did you prematurely send this, or did you intend it to end there?
>>
>> Also, it seems you want to use the extra bits for more op-codes or op
>> code modifiers and not to support 64 GPRs. Is that right? Or some mix
>> of both?
> <
> I intended to imply other things can use those extra bits, but that I don't
> need to influence those thinking about what they might be used for.
> <
> I am not a fan of 64-registers,
> you gain so little 3%-ish

The average C code gains 3%, but averages lie.

The inner loops of game code can get 10%. And 90% of the compute time is in
10% of the code. So that is an overall 9% speedup, while a simple code
profile without benchmarking says 3%.
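
The arithmetic behind the 9% figure can be written out as follows. This is a sketch under the post's own numbers (90% of time in the hot loops, 10% gain there); the second function applies the standard Amdahl's law form, which is my addition and reads "10%" as a 1.10x speedup of the hot code rather than 10% less time.

```python
def overall_time_saved(hot_fraction: float, hot_time_saved: float) -> float:
    """Fraction of total run time saved if the hot code takes that much less time."""
    return hot_fraction * hot_time_saved


def amdahl_speedup(hot_fraction: float, hot_speedup: float) -> float:
    """Overall speedup factor when only the hot fraction runs hot_speedup times faster."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


saved = overall_time_saved(0.90, 0.10)
speedup = amdahl_speedup(0.90, 1.10)
print(f"time saved: {saved:.1%}")
print(f"Amdahl speedup: {speedup - 1:.1%}")
```

Either reading lands close to 9%, well above the 3% that an average over all code would suggest.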

A new processor needs an edge, and this is an edge.

The RISC approach of cutting features until there is nothing left, leaves
no reason to buy.

Transistors are cheap; throw the kitchen sink at an optional variable-width
opcode extension, kind of like ARM did with Thumb-2.

> you lose in subroutines
> you lose in context switch
> you lose in memory footprint
> you eat up 3-4 of your instruction bits--which is all you added in the first place!
> So it is quite likely you make negative forward progress.
> <
> A decided choice.
>> --
>> - Stephen Fuld
>> (e-mail address disguised to prevent spam)
>

No amount of genius can overcome a preoccupation with detail.
