Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

Invest in physics -- own a piece of Dirac!


devel / comp.arch / Canonical recoding

SubjectAuthor
* Canonical recodingThomas Koenig
`* Re: Canonical recodingMitchAlsup
 +* Re: Canonical recodingBGB
 |`* Re: Canonical recodingMitchAlsup
 | `* Re: Canonical recodingBGB
 |  `* Re: Canonical recodingMitchAlsup
 |   `* Re: Canonical recodingBGB
 |    `* Re: Canonical recodingMitchAlsup
 |     `- Re: Canonical recodingBGB
 `* Re: Canonical recodingThomas Koenig
  `* Re: Canonical recodingMitchAlsup
   `- Re: Canonical recodingThomas Koenig

1
Canonical recoding

<t2j9r8$is8$1@newsreader4.netcologne.de>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24607&group=comp.arch#24607

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Canonical recoding
Date: Wed, 6 Apr 2022 05:52:40 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t2j9r8$is8$1@newsreader4.netcologne.de>
Injection-Date: Wed, 6 Apr 2022 05:52:40 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:ec41:0:7285:c2ff:fe6c:992d";
logging-data="19336"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 6 Apr 2022 05:52 UTC

This may be old news to most of you, but it is at least new relative
to the second edition of Koren. He gives a serial algorithm for the
determination of the minimal signed digit representation of a number
(for Booth-style multipliers).

There's a better one given by "H. Prodinger, On binary
representations of integers with digits -1, 0, 1 , INTEGERS 0 (2000)",
https://www.emis.de/journals/INTEGERS/papers/a8/a8.Abstract.html .
Taking the Mathematica code from https://oeis.org/A184615, it is

xh = BitShiftRight[x, 1];
x3 = x + xh;
c = BitXor[xh, x3];
np = BitAnd[x3, c];
nm = BitAnd[xh, c];
Return[{np, nm}]];

so in hardware, it would be one (free) shift, an addition followed
by an xor and two ands for the positive and negative part (which
could of course run in parallel).

Now, I'm ready to be told that a) everybody in the field has known
this for 20 years, and b) nobody uses Booth-style multipliers in
serious work anyway any more ;-)

Re: Canonical recoding

<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24614&group=comp.arch#24614

 copy link   Newsgroups: comp.arch
X-Received: by 2002:ad4:5be3:0:b0:441:2af0:6ea2 with SMTP id k3-20020ad45be3000000b004412af06ea2mr8420315qvc.116.1649264724951;
Wed, 06 Apr 2022 10:05:24 -0700 (PDT)
X-Received: by 2002:a4a:b343:0:b0:324:512e:e340 with SMTP id
n3-20020a4ab343000000b00324512ee340mr3113061ooo.59.1649264724735; Wed, 06 Apr
2022 10:05:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 6 Apr 2022 10:05:24 -0700 (PDT)
In-Reply-To: <t2j9r8$is8$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:94fd:3abc:fc90:f5e8;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:94fd:3abc:fc90:f5e8
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
Subject: Re: Canonical recoding
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 06 Apr 2022 17:05:24 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 10
 by: MitchAlsup - Wed, 6 Apr 2022 17:05 UTC

On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>
> Now, I'm ready to be told that a) everybody in the field has known
> this for 20 years, and b) nobody uses Booth-style multipliers in
> serious work anyway any more ;-)
<
All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
tree. 3-bits-in 5-bits-out skip-2 recoding.
<
(*) or higher recodings.

Re: Canonical recoding

<t2kn99$hel$1@dont-email.me>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24617&group=comp.arch#24617

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Wed, 6 Apr 2022 13:48:02 -0500
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <t2kn99$hel$1@dont-email.me>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Apr 2022 18:48:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="3d6eaf504b53c4fc92707ad2a00146ed";
logging-data="17877"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19mEQkwbjiSkBC2j9wHg8Vu"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:DD24dcBL3UDfDdxKfRxjYIAwLmU=
In-Reply-To: <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 6 Apr 2022 18:48 UTC

On 4/6/2022 12:05 PM, MitchAlsup wrote:
> On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>>
>> Now, I'm ready to be told that a) everybody in the field has known
>> this for 20 years, and b) nobody uses Booth-style multipliers in
>> serious work anyway any more ;-)
> <
> All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
> tree. 3-bits-in 5-bits-out skip-2 recoding.
> <
> (*) or higher recodings.

Apart from the observation that some people use plain shift-add multipliers.

Did recently (as of yesterday) go and add a shift-add multiplier to my
BJX2 core, mostly for sake of being able to (in RISC-V mode) support the
'M' extension, nevermind if it is *significantly* slower than the normal
32-bit multiplier (67 cycles vs 3 cycles).

But, the same basic logic can be used to implement MUL, DIV, and MOD, so
good enough...

Eg:
https://pastebin.com/RNZVEgWz

Some fiddling was mostly to battle with timing and similar...
Granted, this approach is possibly counter-intuitive, but alas...

Not sure if this is the "bast" approach to pull this off, this is just
what I was able to come up with for this.

Cost is reasonable, and it does at least seem to give the expected
results in unit tests.

In this case:
The RV64 'MULW' instruction would still map to the 32-bit multiplier.
A few of the 'new' instructions have been added to BJX2 under the 'MULQ'
extension (I added some instructions for Integer Divide).

Originally I had assumed MULQ would support a "moderately fast" 64-bit
multiplier, but alas, this is not happening at the moment (within
reasonably cost and timing constraints).

So, this will not likely be a good alternative to "doing it in
software". Should be more competitive WRT integer divide though (doing
shift-subtract in software isn't exactly fast either...).

....

Re: Canonical recoding

<7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24618&group=comp.arch#24618

 copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:1707:b0:699:a16a:5ceb with SMTP id az7-20020a05620a170700b00699a16a5cebmr6833799qkb.534.1649271180365;
Wed, 06 Apr 2022 11:53:00 -0700 (PDT)
X-Received: by 2002:a05:6808:113:b0:2ec:b7db:df66 with SMTP id
b19-20020a056808011300b002ecb7dbdf66mr4454666oie.108.1649271180002; Wed, 06
Apr 2022 11:53:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 6 Apr 2022 11:52:59 -0700 (PDT)
In-Reply-To: <t2kn99$hel$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:94fd:3abc:fc90:f5e8;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:94fd:3abc:fc90:f5e8
References: <t2j9r8$is8$1@newsreader4.netcologne.de> <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
Subject: Re: Canonical recoding
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 06 Apr 2022 18:53:00 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 60
 by: MitchAlsup - Wed, 6 Apr 2022 18:52 UTC

On Wednesday, April 6, 2022 at 1:48:12 PM UTC-5, BGB wrote:
> On 4/6/2022 12:05 PM, MitchAlsup wrote:
> > On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
> >>
> >> Now, I'm ready to be told that a) everybody in the field has known
> >> this for 20 years, and b) nobody uses Booth-style multipliers in
> >> serious work anyway any more ;-)
> > <
> > All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
> > tree. 3-bits-in 5-bits-out skip-2 recoding.
> > <
> > (*) or higher recodings.
> Apart from the observation that some people use plain shift-add multipliers.
>
>
> Did recently (as of yesterday) go and add a shift-add multiplier to my
> BJX2 core, mostly for sake of being able to (in RISC-V mode) support the
> 'M' extension, nevermind if it is *significantly* slower than the normal
> 32-bit multiplier (67 cycles vs 3 cycles).
<
If you Booth recoded it, the cycle count would go down to 35 cycles.
>
> But, the same basic logic can be used to implement MUL, DIV, and MOD, so
> good enough...
>
> Eg:
> https://pastebin.com/RNZVEgWz
>
> Some fiddling was mostly to battle with timing and similar...
> Granted, this approach is possibly counter-intuitive, but alas...
>
> Not sure if this is the "bast" approach to pull this off, this is just
> what I was able to come up with for this.
>
> Cost is reasonable, and it does at least seem to give the expected
> results in unit tests.
>
>
>
> In this case:
> The RV64 'MULW' instruction would still map to the 32-bit multiplier.
> A few of the 'new' instructions have been added to BJX2 under the 'MULQ'
> extension (I added some instructions for Integer Divide).
>
>
> Originally I had assumed MULQ would support a "moderately fast" 64-bit
> multiplier, but alas, this is not happening at the moment (within
> reasonably cost and timing constraints).
>
> So, this will not likely be a good alternative to "doing it in
> software". Should be more competitive WRT integer divide though (doing
> shift-subtract in software isn't exactly fast either...).
>
> ...

Re: Canonical recoding

<t2ksp5$3c1$1@dont-email.me>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24619&group=comp.arch#24619

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Wed, 6 Apr 2022 15:21:51 -0500
Organization: A noiseless patient Spider
Lines: 95
Message-ID: <t2ksp5$3c1$1@dont-email.me>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me>
<7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Apr 2022 20:21:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="3d6eaf504b53c4fc92707ad2a00146ed";
logging-data="3457"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/GHBnDI1v/1eww3eI6REsv"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:ll8GYLaRxIrZctYCOXhKf1Iqv4Q=
In-Reply-To: <7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 6 Apr 2022 20:21 UTC

On 4/6/2022 1:52 PM, MitchAlsup wrote:
> On Wednesday, April 6, 2022 at 1:48:12 PM UTC-5, BGB wrote:
>> On 4/6/2022 12:05 PM, MitchAlsup wrote:
>>> On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>>>>
>>>> Now, I'm ready to be told that a) everybody in the field has known
>>>> this for 20 years, and b) nobody uses Booth-style multipliers in
>>>> serious work anyway any more ;-)
>>> <
>>> All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
>>> tree. 3-bits-in 5-bits-out skip-2 recoding.
>>> <
>>> (*) or higher recodings.
>> Apart from the observation that some people use plain shift-add multipliers.
>>
>>
>> Did recently (as of yesterday) go and add a shift-add multiplier to my
>> BJX2 core, mostly for sake of being able to (in RISC-V mode) support the
>> 'M' extension, nevermind if it is *significantly* slower than the normal
>> 32-bit multiplier (67 cycles vs 3 cycles).
> <
> If you Booth recoded it, the cycle count would go down to 35 cycles.

It is possible, though in this case everything here is being done as-if
it were a 64-bit operation.

It is possible I could have also added a mechanism to check for leading
or large runs of zeros and skip over these bits, but this would add
logic cost.

Eg:
Are the next 8-bits all zeroes (and count>8)?
If yes, do an 8-bit shift (no update), count decreases by 8.
Else, do 1-bit logic.
Count decreases by 1.

I guess it is also partly a question of:
Will this logic be used enough to make this worthwhile?
Or, is this some backwater case best kept "as cheap as possible"?...

....

Well, and other questions like, "Will the existence of a hardware
integer divider have a visible effect on the Dhrystone score?", ...

Well, and other options, like whether it would be better to indicate the
existence of a DIV instruction as a compiler option, or leave it as a
runtime feature (say, an internal function pointer which either points
at the software-divide loop, or points to a function which invokes the
'DIV' instruction ?... ), or intermediate (compiler option, but it
mostly just sets a flag which effects the inline ASM, which then uses
the 'DIV' instruction).

....

Had also added the encodings for the 'M' extension to the RV64 dual-ISA
mode decoder.

>>
>> But, the same basic logic can be used to implement MUL, DIV, and MOD, so
>> good enough...
>>
>> Eg:
>> https://pastebin.com/RNZVEgWz
>>
>> Some fiddling was mostly to battle with timing and similar...
>> Granted, this approach is possibly counter-intuitive, but alas...
>>
>> Not sure if this is the "bast" approach to pull this off, this is just
>> what I was able to come up with for this.
>>
>> Cost is reasonable, and it does at least seem to give the expected
>> results in unit tests.
>>
>>
>>
>> In this case:
>> The RV64 'MULW' instruction would still map to the 32-bit multiplier.
>> A few of the 'new' instructions have been added to BJX2 under the 'MULQ'
>> extension (I added some instructions for Integer Divide).
>>
>>
>> Originally I had assumed MULQ would support a "moderately fast" 64-bit
>> multiplier, but alas, this is not happening at the moment (within
>> reasonably cost and timing constraints).
>>
>> So, this will not likely be a good alternative to "doing it in
>> software". Should be more competitive WRT integer divide though (doing
>> shift-subtract in software isn't exactly fast either...).
>>
>> ...

Re: Canonical recoding

<c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24620&group=comp.arch#24620

 copy link   Newsgroups: comp.arch
X-Received: by 2002:ac8:5889:0:b0:2e1:afa2:65a9 with SMTP id t9-20020ac85889000000b002e1afa265a9mr8855222qta.268.1649277534207;
Wed, 06 Apr 2022 13:38:54 -0700 (PDT)
X-Received: by 2002:a9d:6c58:0:b0:5cd:8ccb:c128 with SMTP id
g24-20020a9d6c58000000b005cd8ccbc128mr3798370otq.254.1649277533877; Wed, 06
Apr 2022 13:38:53 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 6 Apr 2022 13:38:53 -0700 (PDT)
In-Reply-To: <t2ksp5$3c1$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1c5a:131a:c7bf:75c6;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1c5a:131a:c7bf:75c6
References: <t2j9r8$is8$1@newsreader4.netcologne.de> <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me> <7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
<t2ksp5$3c1$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>
Subject: Re: Canonical recoding
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 06 Apr 2022 20:38:54 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 117
 by: MitchAlsup - Wed, 6 Apr 2022 20:38 UTC

On Wednesday, April 6, 2022 at 3:22:00 PM UTC-5, BGB wrote:
> On 4/6/2022 1:52 PM, MitchAlsup wrote:
> > On Wednesday, April 6, 2022 at 1:48:12 PM UTC-5, BGB wrote:
> >> On 4/6/2022 12:05 PM, MitchAlsup wrote:
> >>> On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
> >>>>
> >>>> Now, I'm ready to be told that a) everybody in the field has known
> >>>> this for 20 years, and b) nobody uses Booth-style multipliers in
> >>>> serious work anyway any more ;-)
> >>> <
> >>> All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
> >>> tree. 3-bits-in 5-bits-out skip-2 recoding.
> >>> <
> >>> (*) or higher recodings.
> >> Apart from the observation that some people use plain shift-add multipliers.
> >>
> >>
> >> Did recently (as of yesterday) go and add a shift-add multiplier to my
> >> BJX2 core, mostly for sake of being able to (in RISC-V mode) support the
> >> 'M' extension, nevermind if it is *significantly* slower than the normal
> >> 32-bit multiplier (67 cycles vs 3 cycles).
> > <
> > If you Booth recoded it, the cycle count would go down to 35 cycles.
> It is possible, though in this case everything here is being done as-if
> it were a 64-bit operation.
<
Booth causes the central adder to do one of the following {-b,0,+b} then
skip 2-bits. So, in effect, needing to add 3×b is equivalent to adding 4×b-b
over 2 iterations.
>
> It is possible I could have also added a mechanism to check for leading
> or large runs of zeros and skip over these bits, but this would add
> logic cost.
>
> Eg:
> Are the next 8-bits all zeroes (and count>8)?
> If yes, do an 8-bit shift (no update), count decreases by 8.
> Else, do 1-bit logic.
> Count decreases by 1.
>
> I guess it is also partly a question of:
> Will this logic be used enough to make this worthwhile?
> Or, is this some backwater case best kept "as cheap as possible"?...
>
> ...
>
>
> Well, and other questions like, "Will the existence of a hardware
> integer divider have a visible effect on the Dhrystone score?", ...
<
I, personally, do not even look at Dhrystone (or Whetstone) anymore.
>
> Well, and other options, like whether it would be better to indicate the
> existence of a DIV instruction as a compiler option, or leave it as a
> runtime feature (say, an internal function pointer which either points
> at the software-divide loop, or points to a function which invokes the
> 'DIV' instruction ?... ), or intermediate (compiler option, but it
> mostly just sets a flag which effects the inline ASM, which then uses
> the 'DIV' instruction).
>
Hint: If you can "take an exception" (Unimplemented inst) in 10 cycles
You can software implement fast enough not to clue the compiler in.
> ...
>
>
> Had also added the encodings for the 'M' extension to the RV64 dual-ISA
> mode decoder.
> >>
> >> But, the same basic logic can be used to implement MUL, DIV, and MOD, so
> >> good enough...
> >>
> >> Eg:
> >> https://pastebin.com/RNZVEgWz
> >>
> >> Some fiddling was mostly to battle with timing and similar...
> >> Granted, this approach is possibly counter-intuitive, but alas...
> >>
> >> Not sure if this is the "bast" approach to pull this off, this is just
> >> what I was able to come up with for this.
> >>
> >> Cost is reasonable, and it does at least seem to give the expected
> >> results in unit tests.
> >>
> >>
> >>
> >> In this case:
> >> The RV64 'MULW' instruction would still map to the 32-bit multiplier.
> >> A few of the 'new' instructions have been added to BJX2 under the 'MULQ'
> >> extension (I added some instructions for Integer Divide).
> >>
> >>
> >> Originally I had assumed MULQ would support a "moderately fast" 64-bit
> >> multiplier, but alas, this is not happening at the moment (within
> >> reasonably cost and timing constraints).
> >>
> >> So, this will not likely be a good alternative to "doing it in
> >> software". Should be more competitive WRT integer divide though (doing
> >> shift-subtract in software isn't exactly fast either...).
> >>
> >> ...

Re: Canonical recoding

<t2kveu$odp$1@newsreader4.netcologne.de>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24622&group=comp.arch#24622

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Wed, 6 Apr 2022 21:07:42 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t2kveu$odp$1@newsreader4.netcologne.de>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Apr 2022 21:07:42 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:ec41:0:7285:c2ff:fe6c:992d";
logging-data="25017"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 6 Apr 2022 21:07 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>>
>> Now, I'm ready to be told that a) everybody in the field has known
>> this for 20 years, and b) nobody uses Booth-style multipliers in
>> serious work anyway any more ;-)
><
> All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
> tree. 3-bits-in 5-bits-out skip-2 recoding.

So, do you use the algorithm by Prodinger, or something else?

Re: Canonical recoding

<e11985c3-305a-4ac3-afa1-28d5ee8f23f2n@googlegroups.com>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24623&group=comp.arch#24623

 copy link   Newsgroups: comp.arch
X-Received: by 2002:ae9:ed96:0:b0:67e:c89e:480a with SMTP id c144-20020ae9ed96000000b0067ec89e480amr7155516qkg.274.1649283393021;
Wed, 06 Apr 2022 15:16:33 -0700 (PDT)
X-Received: by 2002:a05:6870:9604:b0:de:a876:fbba with SMTP id
d4-20020a056870960400b000dea876fbbamr4719741oaq.239.1649283392900; Wed, 06
Apr 2022 15:16:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 6 Apr 2022 15:16:32 -0700 (PDT)
In-Reply-To: <t2kveu$odp$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1c5a:131a:c7bf:75c6;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1c5a:131a:c7bf:75c6
References: <t2j9r8$is8$1@newsreader4.netcologne.de> <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kveu$odp$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e11985c3-305a-4ac3-afa1-28d5ee8f23f2n@googlegroups.com>
Subject: Re: Canonical recoding
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 06 Apr 2022 22:16:33 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 20
 by: MitchAlsup - Wed, 6 Apr 2022 22:16 UTC

On Wednesday, April 6, 2022 at 4:07:45 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
> >>
> >> Now, I'm ready to be told that a) everybody in the field has known
> >> this for 20 years, and b) nobody uses Booth-style multipliers in
> >> serious work anyway any more ;-)
> ><
> > All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
> > tree. 3-bits-in 5-bits-out skip-2 recoding.
> So, do you use the algorithm by Prodinger, or something else?
<
I don't know the algorithm had a name !? I have been aware of Booth recoding
since about 1972 and have been using it since 1983. The 3-bits-in, 5-bits-out,
skip-2 is simply how one does Booth recoding when you get to play with individual
transistors.

Re: Canonical recoding

<t2l6sg$btd$1@dont-email.me>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24625&group=comp.arch#24625

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Wed, 6 Apr 2022 18:14:17 -0500
Organization: A noiseless patient Spider
Lines: 199
Message-ID: <t2l6sg$btd$1@dont-email.me>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me>
<7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
<t2ksp5$3c1$1@dont-email.me>
<c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 6 Apr 2022 23:14:24 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9ca3105ac9f4738936a7bb96d2fdafb0";
logging-data="12205"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/AtNSKxac9+6qFFpkNfkJL"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:AH4KSJUWLI0aYvO+PkDn1CwcxAc=
In-Reply-To: <c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 6 Apr 2022 23:14 UTC

On 4/6/2022 3:38 PM, MitchAlsup wrote:
> On Wednesday, April 6, 2022 at 3:22:00 PM UTC-5, BGB wrote:
>> On 4/6/2022 1:52 PM, MitchAlsup wrote:
>>> On Wednesday, April 6, 2022 at 1:48:12 PM UTC-5, BGB wrote:
>>>> On 4/6/2022 12:05 PM, MitchAlsup wrote:
>>>>> On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>>>>>>
>>>>>> Now, I'm ready to be told that a) everybody in the field has known
>>>>>> this for 20 years, and b) nobody uses Booth-style multipliers in
>>>>>> serious work anyway any more ;-)
>>>>> <
>>>>> All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
>>>>> tree. 3-bits-in 5-bits-out skip-2 recoding.
>>>>> <
>>>>> (*) or higher recodings.
>>>> Apart from the observation that some people use plain shift-add multipliers.
>>>>
>>>>
>>>> Did recently (as of yesterday) go and add a shift-add multiplier to my
>>>> BJX2 core, mostly for sake of being able to (in RISC-V mode) support the
>>>> 'M' extension, nevermind if it is *significantly* slower than the normal
>>>> 32-bit multiplier (67 cycles vs 3 cycles).
>>> <
>>> If you Booth recoded it, the cycle count would go down to 35 cycles.
>> It is possible, though in this case everything here is being done as-if
>> it were a 64-bit operation.
> <
> Booth causes the central adder to do one of the following {-b,0,+b} then
> skip 2-bits. So, in effect, needing to add 3×b is equivalent to adding 4×b-b
> over 2 iterations.

OK.

As noted, I am using a more naive shift-add here, probably good enough,
and does both MUL and DIV, ...

Something like Booth does not seem like a simple change, and don't
necessarily want to completely rethink the design for this part.

For 64-bit multiply, would also need to be less than around 20 cycles to
beat out the runtime function (which does 64-bit multiply piecewise via
the 32-bit widening multiplier, as an ASM blob).

Trying to do 64-bit multiply via DSP's and adders runs into issues with
timing and resource cost. So, while I can make something faster than the
software version, the costs get pretty steep. The FPGA has plenty of
DSPs, but it is mostly trying to get all the results added together
within a relatively modest number of clock cycles and without wasting an
unreasonable number of LUTs or similar while doing so.

But, something which "technically does the job" and "isn't too
expensive" is "probably OK" (nevermind if it takes 67 cycles to do so...).

>>
>> It is possible I could have also added a mechanism to check for leading
>> or large runs of zeros and skip over these bits, but this would add
>> logic cost.
>>
>> Eg:
>> Are the next 8-bits all zeroes (and count>8)?
>> If yes, do an 8-bit shift (no update), count decreases by 8.
>> Else, do 1-bit logic.
>> Count decreases by 1.
>>
>> I guess it is also partly a question of:
>> Will this logic be used enough to make this worthwhile?
>> Or, is this some backwater case best kept "as cheap as possible"?...
>>
>> ...
>>
>>
>> Well, and other questions like, "Will the existence of a hardware
>> integer divider have a visible effect on the Dhrystone score?", ...
> <
> I, personally, do not even look at Dhrystone (or Whetstone) anymore.

Possibly.

It doesn't really match realistic code all that well, but seems to be a
pretty standard benchmark for this sort of thing.

Most other code doesn't really use integer divide enough to divide
performance to make all that big of a difference (or, where it does, it
can be special-cased...).

This 'DIV' instruction will probably not, in any case, be faster than
using a "lookup table of reciprocals", so it would most likely make
sense to treat it as a fall back case for "divisor falls outside the
range of the reciprocal table" (this is the case where the runtime
library falls back to the shift-subtract loop).

Though, on x86-64, the C compilers typically defaulted to always using
the DIV instruction, despite it typically typically being slower than
what one could pull off in software on these machines.

>>
>> Well, and other options, like whether it would be better to indicate the
>> existence of a DIV instruction as a compiler option, or leave it as a
>> runtime feature (say, an internal function pointer which either points
>> at the software-divide loop, or points to a function which invokes the
>> 'DIV' instruction ?... ), or intermediate (compiler option, but it
>> mostly just sets a flag which effects the inline ASM, which then uses
>> the 'DIV' instruction).
>>
> Hint: If you can "take an exception" (Unimplemented inst) in 10 cycles
> You can software implement fast enough not to clue the compiler in.

At present, with C based ISR handling, typical interrupt times seem to
be more in the area of kilocycles...

A lot of the cost goes into to overheads with things like C
prolog/epilog sequences and similar (the C compiler using ISR entry
points written in C, and generates boilerplate prologs/epilogs which
save and restore all of the GPRs and a chunk of the CRs as well; some
amount of these being saved/restored again by the normal C function
prologs/epilogs).

So, for example, TLB Miss handling is ~ 1-2 kilocycle (*1), and a page
evict/reload for page swap is around 20-50 kilocycle (very rough
estimate, *2).

*1: With a delta of around 800..1000 cycles between using
nested-page-tables and a B-Tree. An ASM page-table walker is a bit
faster than a C page table walker.

I have yet to write an ASM B-Tree walker, which is at least
theoretically possible if it turns out to be too much of an issue.

Did also need to switch the ISR handlers over to a RAM backed stack,
mostly because after adding all this stuff, the ISR's were no longer
able to keep within the limits of the 8K SRAM (kept overflowing the
stack). Seemingly need more around 32K of stack space for this stuff.

*2: Average case is a little better with Zero-skip and QWORD LZ
compression than with a "naive" implementation (writing out one
uncompressed page and then reading in another uncompressed page via the
SPI SDcard interface being around 1 million clock cycles).

With no Z-skip or LZ, with every swap being ~ 1M cycles, the performance
impact of the page swaps is very obvious (though, luckily, zero-skipping
pages seems to be pretty effective in this case).

Though, even despite this, the LZ encoder needs to be "reasonably fast"
in order to not make this part "even slower".

>> ...
>>
>>
>> Had also added the encodings for the 'M' extension to the RV64 dual-ISA
>> mode decoder.
>>>>
>>>> But, the same basic logic can be used to implement MUL, DIV, and MOD, so
>>>> good enough...
>>>>
>>>> Eg:
>>>> https://pastebin.com/RNZVEgWz
>>>>
>>>> Some fiddling was mostly to battle with timing and similar...
>>>> Granted, this approach is possibly counter-intuitive, but alas...
>>>>
>>>> Not sure if this is the "bast" approach to pull this off, this is just
>>>> what I was able to come up with for this.
>>>>
>>>> Cost is reasonable, and it does at least seem to give the expected
>>>> results in unit tests.
>>>>
>>>>
>>>>
>>>> In this case:
>>>> The RV64 'MULW' instruction would still map to the 32-bit multiplier.
>>>> A few of the 'new' instructions have been added to BJX2 under the 'MULQ'
>>>> extension (I added some instructions for Integer Divide).
>>>>
>>>>
>>>> Originally I had assumed MULQ would support a "moderately fast" 64-bit
>>>> multiplier, but alas, this is not happening at the moment (within
>>>> reasonably cost and timing constraints).
>>>>
>>>> So, this will not likely be a good alternative to "doing it in
>>>> software". Should be more competitive WRT integer divide though (doing
>>>> shift-subtract in software isn't exactly fast either...).
>>>>
>>>> ...

Re: Canonical recoding

<25dd7aab-a2f9-4990-bb5a-f0e1c4b1db95n@googlegroups.com>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24626&group=comp.arch#24626

 copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:21d4:b0:67d:6a35:5dff with SMTP id h20-20020a05620a21d400b0067d6a355dffmr7755638qka.747.1649294382590;
Wed, 06 Apr 2022 18:19:42 -0700 (PDT)
X-Received: by 2002:a05:6820:1508:b0:320:f886:9e01 with SMTP id
ay8-20020a056820150800b00320f8869e01mr3701561oob.26.1649294382363; Wed, 06
Apr 2022 18:19:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 6 Apr 2022 18:19:42 -0700 (PDT)
In-Reply-To: <t2l6sg$btd$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1c5a:131a:c7bf:75c6;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1c5a:131a:c7bf:75c6
References: <t2j9r8$is8$1@newsreader4.netcologne.de> <ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me> <7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
<t2ksp5$3c1$1@dont-email.me> <c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>
<t2l6sg$btd$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25dd7aab-a2f9-4990-bb5a-f0e1c4b1db95n@googlegroups.com>
Subject: Re: Canonical recoding
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 07 Apr 2022 01:19:42 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 58
 by: MitchAlsup - Thu, 7 Apr 2022 01:19 UTC

On Wednesday, April 6, 2022 at 6:14:28 PM UTC-5, BGB wrote:
> On 4/6/2022 3:38 PM, MitchAlsup wrote:

>
> Something like Booth does not seem like a simple change, and don't
> necessarily want to completely rethink the design for this part.
>
Booth is basically a MUX in front of your adder choosing between
{~b with carry in, 0, and b}. The cycle before you examine the 3-bits
of a to decide which b input gets added. The 3-bit examiner is 5 gates
(not 5-gates per bit, 5 total gates). This has been std since about the
time of DG Nova.
>

> > I, personally, do not even look at Dhrystone (or Whetstone) anymore.
> Possibly.
>
> It doesn't really match realistic code all that well, but seems to be a
> pretty standard benchmark for this sort of thing.
<
It is std only when you cannot run bigger benchmarks.
>

> > Hint: If you can "take an exception" (Unimplemented inst) in 10 cycles
> > You can software implement fast enough not to clue the compiler in.
> At present, with C based ISR handling, typical interrupt times seem to
> be more in the area of kilocycles...
>
This is a problem (understatement).
>
> A lot of the cost goes into to overheads with things like C
> prolog/epilog sequences and similar (the C compiler using ISR entry
<
I do this in HW (swap thread-state) so when you arrive at handler,
there is no prologue and epilog when you leave.
<
> points written in C, and generates boilerplate prologs/epilogs which
> save and restore all of the GPRs and a chunk of the CRs as well; some
> amount of these being saved/restored again by the normal C function
> prologs/epilogs).
<
When you bite off on doing this in HW you start being able to save/restore
a ½ cache line per cycle.
>
>
> So, for example, TLB Miss handling is ~ 1-2 kilocycle (*1), and a page
<
1 level table walk averages 17 cycles with reasonable TW accelerators.
Nested page table walks take ~40-ish cycles.
<
> evict/reload for page swap is around 20-50 kilocycle (very rough
> estimate, *2).
<
Page on disk should be in the 5 millisecond range.
Page in memory with wrong permissions should be kilocycle range.
>

Re: Canonical recoding

<t2lmdl$skt$1@dont-email.me>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24627&group=comp.arch#24627

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Wed, 6 Apr 2022 22:39:25 -0500
Organization: A noiseless patient Spider
Lines: 224
Message-ID: <t2lmdl$skt$1@dont-email.me>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kn99$hel$1@dont-email.me>
<7755a0c3-db84-4404-9f5a-95735ea72f4fn@googlegroups.com>
<t2ksp5$3c1$1@dont-email.me>
<c8705d94-6157-475c-a7ad-81c72ee73595n@googlegroups.com>
<t2l6sg$btd$1@dont-email.me>
<25dd7aab-a2f9-4990-bb5a-f0e1c4b1db95n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 7 Apr 2022 03:39:33 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9ca3105ac9f4738936a7bb96d2fdafb0";
logging-data="29341"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+5n+HEiteXM/2HIM9bzb+m"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:YoFXXkLLTaMYUmk2PsAt5+KFpEI=
In-Reply-To: <25dd7aab-a2f9-4990-bb5a-f0e1c4b1db95n@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 7 Apr 2022 03:39 UTC

On 4/6/2022 8:19 PM, MitchAlsup wrote:
> On Wednesday, April 6, 2022 at 6:14:28 PM UTC-5, BGB wrote:
>> On 4/6/2022 3:38 PM, MitchAlsup wrote:
>
>>
>> Something like Booth does not seem like a simple change, and don't
>> necessarily want to completely rethink the design for this part.
>>
> Booth is basically a MUX in front of your adder choosing between
> {~b with carry in, 0, and b}. The cycle before you examine the 3-bits
> of a to decide which b input gets added. The 3-bit examiner is 5 gates
> (not 5-gates per bit, 5 total gates). This has been std since about the
> time of DG Nova.

OK.

May need to look more into this.

>>
>
>>> I, personally, do not even look at Dhrystone (or Whetstone) anymore.
>> Possibly.
>>
>> It doesn't really match realistic code all that well, but seems to be a
>> pretty standard benchmark for this sort of thing.
> <
> It is std only when you cannot run bigger benchmarks.

Like CoreMark?...

>>
>
>>> Hint: If you can "take an exception" (Unimplemented inst) in 10 cycles
>>> You can software implement fast enough not to clue the compiler in.
>> At present, with C based ISR handling, typical interrupt times seem to
>> be more in the area of kilocycles...
>>
> This is a problem (understatement).

Probably...

I can also note that there is a reason I am using a 256x 4-way TLB...

Eg, drop TLB size to 32x 4-way and suddenly a good chunk of the CPU time
is going into TLB misses, and much smaller and everything basically
grinds to a halt...

Some of this is why I had originally had some reservations about
software managed TLB, though in practice they seem to work OK enough.

>>
>> A lot of the cost goes into to overheads with things like C
>> prolog/epilog sequences and similar (the C compiler using ISR entry
> <
> I do this in HW (swap thread-state) so when you arrive at handler,
> there is no prologue and epilog when you leave.
> <
>> points written in C, and generates boilerplate prologs/epilogs which
>> save and restore all of the GPRs and a chunk of the CRs as well; some
>> amount of these being saved/restored again by the normal C function
>> prologs/epilogs).
> <
> When you bite off on doing this in HW you start being able to save/restore
> a ½ cache line per cycle.

Dunno.

The ancestor ISA (SH-4) had used a design where it basically banked out
half of the register space whenever an interrupt occurred.

Say:
R0..R7: Get banked-swapped (with S8..S15) on ISR
R8..R14: Left as-is
R15: Swaps with SGR

In an earlier form of the ISA design for BJX2:
R0..R7: get banked-swapped S0..S7
R8..R14: Left as-is
R15: swaps with SSP
R16..R23 get banked-swapped S8..S15
R24..R31: Left as-is

The "Shadow Registers" were mostly dropped from the ISA, mostly for cost
saving reasons.

This did make ISR entry more convoluted, as it basically has to save off
a register to memory before it can have any registers free to save-off
anything else, and ISR exit needs to restore the register state to the
exact state it was on ISR entry.

The start and end of the ISR handling process basically turn into a sort
of juggling game with the registers.

>>
>>
>> So, for example, TLB Miss handling is ~ 1-2 kilocycle (*1), and a page
> <
> 1 level table walk averages 17 cycles with reasonable TW accelerators.
> Nested page table walks take ~40-ish cycles.
> <

This is assuming:
One does the page walk in hardware;
One is using nested page-tables, and not something like a B-Tree.

So, say, here one can spend around 1 kilocycle on "general overheads",
with the actual page-table walking part being "almost negligible".

Or, around 2 kilocycle, with roughly half the cycles going into walking
a 2-level B-Tree. Roughly 2 binary searches, each of which takes ~ 10
steps (or, around 20 binary-search steps in total).

Hybrid tables have a smaller relative cost increase, but a higher
relative memory overhead due to ASLR (and the bottom layer being
sparsely populated).

Well, or do the TLB Miss ISR in ASM or something...

But, this would create complexity since it might not be a simple TLB
miss, but may require dealing with the pagefile or similar, which will
have required all of the GPR state to have been saved.

Well, unless one goes the route of not doing a full GPR save/restore
until we determine that this is not a simple case "fetch something from
page table and put it in TLB".

>> evict/reload for page swap is around 20-50 kilocycle (very rough
>> estimate, *2).
> <
> Page on disk should be in the 5 millisecond range.
> Page in memory with wrong permissions should be kilocycle range.

It is as low as it is because the vast majority of the pages which hit
the page file are Zero-Skipped.

As for SDcard, it is currently clocked ~ 12 MHz (was originally 5 MHz),
and SPI, so theoretical bandwidth is 1.5 MB/s (practical is a little
lower, so say assume 1.0 MB/s).

Transferring a 16K page takes ~ 16ms, or ~ 800 kilocycles (relative to
50 MHz).
Transferring two 16K pages takes ~ 32ms, or ~ 1.6 megacycles.

Z-Skip eliminates pretty much all of the IO cost in the cases it applies
to, so cost is mostly detecting that the page is zero, and other
overheads related to pagefile management: LRU stuff, updating the
page-tables, ...

And, say, both pages not Z-Skipped, but were each LZ Compressed to 1K.
This reduces IO time to 2ms, or ~ 100 kilocycle.

However, given LZ encoding is also expensive, the LZ compressor needs to
be "as fast as possible" to effectively leverage the IO bandwidth savings.

My attempts at "fast" RP2 and LZ4 encoders still weren't really fast
enough to offer a competitive advantage here, but my newer experiment
(using 64-bit QWORDs as the base unit) did a little better here (what it
lacks in decent compression it leverages in being able to be faster than
the IO cost it displaces).

For "loading stuff", RP2 and LZ4 are better options though, given they
have better compression and are fast to decode.

I had another past design (FeLZ32), which was similar in concept, just
building everything around 32-bit DWORDs.

One drawback for both my current design (and to a lesser extent for
FeLZ32), was that their compression will take a pretty big hit for
things like text-files and other free-form byte-aligned data.

For the current scheme, things like ASCII text files and similar are
basically non-compressible...

Though, I guess it is roughly a similar situation on a PC trying to use
an LZ compressor as an accelerator to make HDD IO faster. But, a modern
PC does an OK job at being able to make an LZ4 encoder fast.

Though, one could argue that, probably in an ideal situation, one
wouldn't be LZ compressing memory pages or similar inside of an
interrupt handler (and consequently missing untold number of timer IRQs
in the process).

But, doing this differently would likely require a different strategy,
like doing pagefile stuff offline, and then rescheduling to a different
process or similar whenever a hard-fault occurs.

....

An earlier idea for the TLB miss ISR was that it would effectively
context-switch to a kernel-mode handler to deal with any "Page Fault"
situations, in which case, they would be handled in supervisor mode
rather than inside the TLB-Miss ISR.

However, the current approach was "technically easier" than having a
"context switch into supervisor mode; then eventually context switch
back into the usermode program" strategy.

Some of this may need to be redesigned eventually...

Re: Canonical recoding

<t2lvhq$cav$1@newsreader4.netcologne.de>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=24628&group=comp.arch#24628

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Canonical recoding
Date: Thu, 7 Apr 2022 06:15:23 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t2lvhq$cav$1@newsreader4.netcologne.de>
References: <t2j9r8$is8$1@newsreader4.netcologne.de>
<ca1e5711-1e14-4937-9238-1ede1dfa74f8n@googlegroups.com>
<t2kveu$odp$1@newsreader4.netcologne.de>
<e11985c3-305a-4ac3-afa1-28d5ee8f23f2n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 7 Apr 2022 06:15:23 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:ec41:0:7285:c2ff:fe6c:992d";
logging-data="12639"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Thu, 7 Apr 2022 06:15 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Wednesday, April 6, 2022 at 4:07:45 PM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> > On Wednesday, April 6, 2022 at 12:52:43 AM UTC-5, Thomas Koenig wrote:
>> >>
>> >> Now, I'm ready to be told that a) everybody in the field has known
>> >> this for 20 years, and b) nobody uses Booth-style multipliers in
>> >> serious work anyway any more ;-)
>> ><
>> > All* HW multipliers use Booth-recoding--it gets rid of ½ the layers in the
>> > tree. 3-bits-in 5-bits-out skip-2 recoding.
>> So, do you use the algorithm by Prodinger, or something else?
><
> I don't know the algorithm had a name !? I have been aware of Booth recoding
> since about 1972 and have been using it since 1983. The 3-bits-in, 5-bits-out,
> skip-2 is simply how one does Booth recoding when you get to play with individual
> transistors.

This is a variant which calculates the non-adjacent form, where
digits are separated by at least one zero. It minimizes the
number of non-zero digits, so it is better than standard Booth
encoding. Koren gives a serial algorithm for the calculation of
the non-ajacent form, not the one I quoted above.

1
server_pubkey.txt

rocksolid light 0.9.7
clearnet tor