Re: SHA512 implementation in Forth (debugging)

<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16735&group=comp.lang.forth#16735

by: Krishna Myneni - Wed, 9 Feb 2022 12:23 UTC

On Wednesday, February 9, 2022 at 2:52:33 AM UTC-6, Marcel Hendrix wrote:
> On Wednesday, February 9, 2022 at 5:59:45 AM UTC+1, km361...@gmail.com wrote:
> > On Tuesday, February 8, 2022 at 9:06:03 PM UTC-6, Marcel Hendrix wrote:
> > > On Wednesday, February 9, 2022 at 3:23:05 AM UTC+1, Marcel Hendrix wrote:
> > > [..]
> > > > Maybe I'll try to optimize it some more. The above is for the do-loop version.
> > > FORTH> S" abc" SHA512_Data .SHA512
> > > DDAF35A1 93617ABA CC417349 AE204131 12E6FA4E 89A97EA2
> > > 0A9EEEE6 4B55D39A 2192992A 274FC1A8 36BA3C23 A3FEEBBD
> > > 454D4423 643CE80E 2A9AC94F A54CA49F ok
> > > FORTH> SHAspeed many
> > > Processing 40 Mbytes ... 0.149 seconds elapsed.
> > > Processing 40 Mbytes ... 0.149 seconds elapsed.
> > > Processing 40 Mbytes ... 0.149 seconds elapsed.
> > > Processing 40 Mbytes ... 0.149 seconds elapsed.
> > > Processing 40 Mbytes ... 0.149 seconds elapsed.
> > > Processing 40 Mbytes ... 0.149 seconds elapsed. ok
> > > FORTH> 1000 40 149 */ . 268 ok
> > >
> > > -marcel
> > The word SLIDE_WITH_ADD in the Loop version is actually a shift register
> > operation on a...h, with 'h' being shifted out on the right, and T1 + T2 being
> > shifted in on the left. Additionally, one of the elements, 'd', was summed
> > with T1 before being shifted to the position of 'e' during the shift operation.
> >
> > Your original unrolled code did not reproduce this type of operation. There
> > were additional bugs also.
> Sorry if I was unclear. I consulted your repository, specifically
> the earliest modification that produced correct results.
> That version had a DO-LOOP, (no SLIDE_WITH_ADD, though) and
> I wanted to see how that worked out, so kept it in place.
>
> The version that I originally posted did not have the line-by-line
> debugging code from the C version (too bulky for a post), but
> with that in place I quickly got it working (at least the examples).
>
> T1 and T2 are removed from the present code as they can be kept
> on the stack (that is actually the 35ms speed up in my last result).
> Given the unexpectedly high influence of T1 and T2, looking at
> the fetch and store patterns might be beneficial.
>
> -marcel

The SLIDE_WITH_ADD is just an inlining word for code in SHA_Transform
which was already present in SHA_Transform in the first working version.
SLIDE_WITH_ADD was used in subsequent versions to make the code
more readable, but did not change the efficiency of the computation.
In contrast, the inlining word COMPUTE_T1 was a slightly more efficient
version of the code than the earlier version.

Good idea to keep T1 and T2 on the stack. There might be a more efficient
way to implement the shift register. Congrats on your throughput. It will be
useful if we could benchmark iforth running the same code on the same
system along with the other Forth systems. From Anton's earlier post, it
seems like all that was needed to do that was a way to obtain a time
elapsed with millisecond or higher resolution.

--
Krishna

Re: SHA512 implementation in Forth (debugging)

<3468ddef-44ac-447b-8c23-b5e583829457n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16736&group=comp.lang.forth#16736

by: Marcel Hendrix - Wed, 9 Feb 2022 17:09 UTC

On Wednesday, February 9, 2022 at 1:23:50 PM UTC+1, km361...@gmail.com wrote:
[..]
> > > The word SLIDE_WITH_ADD in the Loop version is actually a shift register
[..]
> The SLIDE_WITH_ADD is just an inlining word for code in SHA_Transform
> which was already present in SHA_Transform in the first working version.

Sorry, I didn't see it there (I copied the full do-loop SHA_Transform).

> It will be
> useful if we could benchmark iforth running the same code on the same
> system along with the other Forth systems.

That can be done by reporting clock ticks instead of wall-clock ticks,
but then the question becomes AMD or Intel, let alone other architectures,
so I generally avoid that (plus that I am always converting these horribly
long numbers to seconds in my head anyway).

> From Anton's earlier post, it
> seems like all that was needed to do that was a way to obtain a time
> elapsed with millisecond or higher resolution.

This is fundamentally a non-portable word, so what is the correct
response?

FORTH> WORDS: MS
.MSECS TICKS>MS .MS MS?
MS ?MS ~MS ticks/ms
ok

And then LOCATE <word> or help <word>
FORTH> help MS?
MS? IFORTH
( -- u )
Fetches the elapsed time in milliseconds since the execution of TIMER-RESET
TIMER-PRESET and others. Updates diff0 .
See also: (.T0) .MS n.ELAPSED diff0
ok
FORTH> help ?MS
?MS IFORTH
( -- time )
time is a time in milliseconds derived from the value of the system
clock. The only property of time is the fact that it represents a time
in milliseconds, there is no 'starting point'. Note that when the system
clock is halted, which is only possible on some processor models, the
time returned by ?MS will not advance anymore. Note also that processes
are free to load the system timer with a new value.

?MS avoids the use of additional non-portable words.

I think I can now use the WSL to also get these fancy portable
1,161,347,179 cycles, 3,410,826,133 instructions # 2.94 insn per cycle
reports, let me try...

-marcel

Re: SHA512 implementation in Forth (debugging)

<ae70e9d0-0548-4dbd-8a9f-f764b966a5c5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16739&group=comp.lang.forth#16739

by: Krishna Myneni - Wed, 9 Feb 2022 18:29 UTC

On Wednesday, February 9, 2022 at 11:09:20 AM UTC-6, Marcel Hendrix wrote:
> On Wednesday, February 9, 2022 at 1:23:50 PM UTC+1, km361...@gmail.com wrote:
> [..]
> > > > The word SLIDE_WITH_ADD in the Loop version is actually a shift register
> [..]
> > The SLIDE_WITH_ADD is just an inlining word for code in SHA_Transform
> > which was already present in SHA_Transform in the first working version.
> Sorry, I didn't see it there (I copied the full do-loop SHA_Transform).
> > It will be
> > useful if we could benchmark iforth running the same code on the same
> > system along with the other Forth systems.
> That can be done by reporting clock ticks instead of wall-clock ticks,
> but then the question becomes AMD or Intel, let alone other architectures,
> so I generally avoid that (plus that I am always converting these horribly
> long numbers to seconds in my head anyway).
> > From Anton's earlier post, it
> > seems like all that was needed to do that was a way to obtain a time
> > elapsed with millisecond or higher resolution.
> This is fundamentally a non-portable word, so what is the correct
> response?
>
> FORTH> WORDS: MS
> .MSECS TICKS>MS .MS MS?
> MS ?MS ~MS ticks/ms
> ok
>
> And then LOCATE <word> or help <word>
> FORTH> help MS?
> MS? IFORTH
> ( -- u )
> Fetches the elapsed time in milliseconds since the execution of TIMER-RESET
> TIMER-PRESET and others. Updates diff0 .
> See also: (.T0) .MS n.ELAPSED diff0
> ok
> FORTH> help ?MS
> ?MS IFORTH
> ( -- time )
> time is a time in milliseconds derived from the value of the system
> clock. The only property of time is the fact that it represents a time
> in milliseconds, there is no 'starting point'. Note that when the system
> clock is halted, which is only possible on some processor models, the
> time returned by ?MS will not advance anymore. Note also that processes
> are free to load the system timer with a new value.
>
> ?MS avoids the use of additional non-portable words.
>
Sounds like ?MS will work as a drop-in replacement for MS@.

The time origin (starting point) is not relevant to the use of MS@ for timing
the execution of Forth code. It is important that the returned time has
a resolution of 1 ms, and that subsequent calls can be differenced to
give an elapsed time which is accurate to +/-1 ms. The accuracy may
depend on the length of the interval between successive calls,
particularly if the system is flooded with hardware interrupts and
interrupt handlers suppress the system clock interrupt. I have only
observed this under deliberate tests to flood a system. For millisecond
accuracy over timing tests of seconds to minutes, I have not found
a problem, at least when there is a stratum 2 time server maintaining
the system time.
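
As a rough illustration of the differencing described above, here is a minimal sketch. MS@ ( -- u ) is the millisecond clock word named in this post; on iForth one would presumably define it in terms of ?MS. The fallback definition assumes Gforth's UTIME ( -- ud ), which returns microseconds, and a 64-bit cell so that the quotient fits. ELAPSED-MS and NAP are hypothetical names used only for this example.

```forth
\ Sketch only. Define MS@ if the system does not already provide it.
\ Assumption: Gforth's UTIME ( -- ud ) returns microseconds, 64-bit cells.
[undefined] ms@ [if]
: ms@ ( -- u ) utime 1000 um/mod nip ;
[then]

\ Hypothetical helper: execute xt and return the elapsed milliseconds.
\ Only the difference of two MS@ readings is used, so the time origin
\ of the clock does not matter.
: elapsed-ms ( xt -- u ) ms@ >r execute ms@ r> - ;

\ Usage: time a 250 ms delay; the result should be close to 250.
: nap ( -- ) 250 ms ;
' nap elapsed-ms .
```

The unsigned subtraction also gives the right answer if the millisecond counter wraps between the two readings, as long as the interval itself fits in one cell.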

--
Krishna

> I think I can now use the WSL to also get these fancy portable
> 1,161,347,179 cycles, 3,410,826,133 instructions # 2.94 insn per cycle
> reports, let me try...
>

Re: SHA512 implementation in Forth (debugging)

<su1u03$1bf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16742&group=comp.lang.forth#16742

by: Krishna Myneni - Thu, 10 Feb 2022 02:41 UTC

On 2/9/22 06:23, Krishna Myneni wrote:
...
> The SLIDE_WITH_ADD is just an inlining word for code in SHA_Transform
> which was already present in SHA_Transform in the first working version.
> SLIDE_WITH_ADD was used in subsequent versions to make the code
> more readable, but did not change the efficiency of the computation.
> In contrast, the inlining word COMPUTE_T1 was a slightly more efficient
> version of the code than the earlier version.
>
> Good idea to keep T1 and T2 on the stack. There might be a more efficient
> way to implement the shift register. ...

I have implemented the shift register idea, now using a contiguous block
of 8 CELLS to hold the a--h values. Now a shift by 1 cell can be
implemented with a single MOVE instead of performing g TO h f TO g ... etc.

The new inline word replacing SLIDE_WITH_ADD is SHIFT_WITH_ADD. It gives
a significant speedup in kforth64 (see below). The version of sha512.4th
has been bumped up to 0.02.
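
For readers following along, the idea can be sketched roughly as below. This is not the code from sha512.4th: ABCDEFGH[], SHIFT-IN and DEMO are hypothetical names, and the sketch assumes 64-bit cells so that one cell holds one SHA-512 working variable.

```forth
\ Sketch only; names are hypothetical, not those used in sha512.4th.
create abcdefgh[] 8 cells allot   \ a at offset 0 ... h at offset 7 cells

: shift-in ( t1 t2 -- )
  over abcdefgh[] 3 cells + +!        \ d := d + T1, so e will receive d + T1
  abcdefgh[] dup cell+ 7 cells move   \ a..g -> b..h in one MOVE; old h is dropped
  + abcdefgh[] ! ;                    \ a := T1 + T2

\ Usage: load a..h with 10 20 ... 80, then shift in T1 = 1 and T2 = 2.
: demo ( -- )
  8 0 do  i 1+ 10 *  abcdefgh[] i cells +  !  loop
  1 2 shift-in
  8 0 do  abcdefgh[] i cells + @ .  loop ;
demo   \ prints 3 10 20 30 41 50 60 70
```

MOVE is specified to behave correctly for overlapping regions, which is what makes the single-call shift safe here.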

--
Krishna

---
shaspeed

Processing 40 Mbytes ... 8625 ms elapsed
ok
40000000e 8.625e f/ f.
4.63768e+06 ok \ Hashing rate of 4.6 MB/s; previous was 4.1 MB/s
---
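
The rate in the transcript above is just bytes divided by elapsed seconds. Where floating point is not loaded, the same figure can be obtained with integer arithmetic; BYTES/S below is a hypothetical helper for illustration, not a word from sha512.4th.

```forth
\ Hypothetical helper: hashing rate in bytes per second from a byte
\ count and an elapsed time in milliseconds. */ keeps a double-cell
\ intermediate product, so nbytes * 1000 does not overflow.
: bytes/s ( nbytes ms -- n ) 1000 swap */ ;

40000000 8625 bytes/s .   \ prints 4637681
```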

Re: SHA512 implementation in Forth (debugging)

<su23hg$i8t$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16743&group=comp.lang.forth#16743

by: Krishna Myneni - Thu, 10 Feb 2022 04:15 UTC

On 2/9/22 20:41, Krishna Myneni wrote:
> On 2/9/22 06:23, Krishna Myneni wrote:
> ...

>> Good idea to keep T1 and T2 on the stack. There might be a more efficient
>> way to implement the shift register. ...
>
> I have implemented the shift register idea, now using a contiguous block
> of 8 CELLS to hold the a--h values. Now a shift by 1 cell can be
> implemented with a single MOVE instead of performing g TO h f TO g ...
> etc.
>
> The new inline word replacing SLIDE_WITH_ADD is SHIFT_WITH_ADD. It gives
> a significant speedup in kforth64 (see below). The version of sha512.4th
> has been bumped up to 0.02.
>
> --
> Krishna
>
>
> ---
> shaspeed
>
> Processing 40 Mbytes ... 8625 ms elapsed
> ok
> 40000000e 8.625e f/ f.
> 4.63768e+06 ok \ Hashing rate of 4.6 MB/s; previous was 4.1 MB/s
> ---
>
>

Also keep T1 and T2 on the stack.

---
shaspeed

Processing 40 Mbytes ... 8496 ms elapsed
ok
4e7 8.496e f/ f.
4.7081e+06 ok
---

Re: SHA512 implementation in Forth (debugging)

<sus9kf$luv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16860&group=comp.lang.forth#16860

by: Krishna Myneni - Sun, 20 Feb 2022 02:39 UTC

On 2/9/22 22:15, Krishna Myneni wrote:
> On 2/9/22 20:41, Krishna Myneni wrote:
>> On 2/9/22 06:23, Krishna Myneni wrote:
>> ...
>
>>> Good idea to keep T1 and T2 on the stack. There might be a more
>>> efficient
>>> way to implement the shift register. ...
>>
>> I have implemented the shift register idea, now using a contiguous
>> block of 8 CELLS to hold the a--h values. Now a shift by 1 cell can be
>> implemented with a single MOVE instead of performing g TO h f TO g
>> ... etc.
>>
>> The new inline word replacing SLIDE_WITH_ADD is SHIFT_WITH_ADD. It
>> gives a significant speedup in kforth64 (see below). The version of
>> sha512.4th has been bumped up to 0.02.
>>
>> --
>> Krishna
>>
>>
>> ---
>> shaspeed
>>
>> Processing 40 Mbytes ...   8625 ms elapsed
>>   ok
>> 40000000e 8.625e f/ f.
>> 4.63768e+06 ok \ Hashing rate of 4.6 MB/s; previous was 4.1 MB/s
>> ---
>>
>>
>
> Also keep T1 and T2 on the stack.
>
> ---
> shaspeed
>
> Processing 40 Mbytes ...   8496 ms elapsed
> ok
> 4e7 8.496e f/ f.
> 4.7081e+06 ok
> ---

While the 64-bit assembler for kForth-64 is still in progress, this does
not impede us from examining efficiency improvements possible in the
present implementation of SHA512. I have put in machine code
replacements for some of the words to see how much improvement will
result. The new sha512.4th may be found here:

https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512.4th

I measure roughly 5.6x increase in speed, now at 26 MB/s. I did not try
to optimize the assembly code -- it's a direct coding. However, when
compared with native code compilers, e.g. VFX, there is still a factor
of about 10 to be obtained with optimization. The present version is
only slightly slower on my system than gforth-fast, which clocks in at
about 28 MB/s.

--
Krishna

Speed Test with kforth64 v 0.2.4
---
s" abc" SHA512_Data type
DDAF35A193617ABACC417349AE20413112E6FA4E89A97EA20A9EEEE64B55D39A2192992A274FC1A836BA3C23A3FEEBBD454D4423643CE80E2A9AC94FA54CA49F
ok
.s
<empty>
ok
SHAspeed

Processing 40 Mbytes ... 1532 ms elapsed
ok
40e6 1.536e f/ f.
2.60417e+07 ok
---

Speed Test with gforth-fast 0.7.9_20220120
---
\ The machine code placement and calling features of the program
\ are not used by Gforth. Only the Forth source is executed.

include sha512-test.fs
sha512-test.fs:11:43: warning: redefined table
table.fs:37:3: warning: original location ok
s" abc" SHA512_Data type
DDAF35A193617ABACC417349AE20413112E6FA4E89A97EA20A9EEEE64B55D39A2192992A274FC1A836BA3C23A3FEEBBD454D4423643CE80E2A9AC94FA54CA49F
ok
ok
SHAspeed
Processing 40 Mbytes ... 1415 ms elapsed
ok
40e6 1.415e f/ f. 28268551.2367491 ok
---

Listing of sha512-test.4th (for kforth64)
---
\ Test file hashes for the Forth SHA512 implementation.
\

include ans-words
include modules
include syscalls
include mc \ Remove this include to benchmark pure Forth code.
include strings
include files
include utils
include dump
include slurp-file
include sha512.4th

also sha512

variable rfileid
create buf[] SHA512_BLOCK_LENGTH ALLOT

\ Read n bytes from input file, store at addr array
: bytes@ ( adr u -- ) rfileid @ READ-FILE 2DROP ;
: block@ buf[] SHA512_BLOCK_LENGTH bytes@ ;

: File_SHA512 ( caddr u -- )
    R/O BIN OPEN-FILE SWAP rfileid !
    ABORT" Invalid input file."
    SHA512_Init  \ Valid file, init transform
    rfileid @ FILE-SIZE DROP ( ud )  \ Get bytesize of input file
    CR ." Bytesize: " 2DUP UD.
    SHA512_BLOCK_LENGTH UM/MOD ( rembytes nblocks )  \ Compute nblocks & rembytes
    0 ?DO
      block@
      buf[] SHA512_BLOCK_LENGTH SHA512_Update
    LOOP  \ Do n full blocks
    buf[] swap 2dup bytes@
    SHA512_Update
    SHA512_End
    CR TYPE CR  \ Show SHA512 hash for file
    rfileid @ CLOSE-FILE DROP
;
---

Listing of sha512-test.fs (for Gforth)
---
\ Test file hashes for the Forth SHA512 implementation.
\

include kforth-compat.fs
\ include strings.fs
\ include utils.fs
include modules.fs

: table ( v1 v2 ... vn n <name> -- | create a table of singles )
create dup cells allot? over 1- cells + swap
0 ?do dup >r ! r> 1 cells - loop drop ;

include sha512.fs \ exact same file as sha512.4th from kForth-64 repo

also sha512

variable rfileid
create buf[] SHA512_BLOCK_LENGTH ALLOT

\ Read n bytes from input file, store at addr array
: bytes@ ( adr u -- ) rfileid @ READ-FILE 2DROP ;
: block@ buf[] SHA512_BLOCK_LENGTH bytes@ ;

Compatibility file for Gforth, kforth-compat.fs, and an
ANS-Forth-compatible modules.fs may be found at this link:

https://github.com/mynenik/kForth-64/tree/master/forth-src/compat

Re: SHA512 implementation in Forth (debugging)

<sv1dh5$2k7$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16868&group=comp.lang.forth#16868

by: Krishna Myneni - Tue, 22 Feb 2022 01:16 UTC

On 2/19/22 20:39, Krishna Myneni wrote:
...
> While the 64-bit assembler for kForth-64 is still in progress, this does
> not impede us from examining efficiency improvements possible in the
> present implementation of SHA512. I have put in machine code
> replacements for some of the words to see how much improvement will
> result. The new sha512.4th may be found here:
>
> https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512.4th
>
> I measure roughly 5.6x increase in speed, now at 26 MB/s. I did not try
> to optimize the assembly code -- it's a direct coding.
>
> ...

I reverted sha512.4th to pure Forth source, and created a separate
version, sha512-x86_64.4th, which is a hybrid of Forth source and
machine code calls used from within the Forth environment. I also
optimized the machine code a bit (v0.04). The current version gives a
hashing rate of about 37 MB/s (using SHASPEED on my system). A link to
the new hybrid file is given below.

--
Krishna

https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512-x86_64.4th

Re: SHA512 implementation in Forth (debugging)

<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16873&group=comp.lang.forth#16873

by: P Falth - Tue, 22 Feb 2022 22:09 UTC

On Tuesday, 22 February 2022 at 02:16:23 UTC+1, Krishna Myneni wrote:
> On 2/19/22 20:39, Krishna Myneni wrote:
> ...
> > While the 64-bit assembler for kForth-64 is still in progress, this does
> > not impede us from examining efficiency improvements possible in the
> > present implementation of SHA512. I have put in machine code
> > replacements for some of the words to see how much improvement will
> > result. The new sha512.4th may be found here:
> >
> > https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512.4th
> >
> > I measure roughly 5.6x increase in speed, now at 26 MB/s. I did not try
> > to optimize the assembly code -- it's a direct coding.
> >
> > ...
>
> I reverted sha512.4th to pure Forth source, and created a separate
> version, sha512-x86_64.4th, which is a hybrid of Forth source and
> machine code calls used from within the Forth environment. I also
> optimized the machine code a bit (v0.04). The current version gives a
> hashing rate of about 37 MB/s (using SHASPEED on my system). A link to
> the new hybrid file is given below.
>
> --
> Krishna
>
>
> https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512-x86_64.4th

I have also been testing out my NTF64/LXF64 on your sha512 implementation.
Specifically I have been testing the code-generator I am developing.
I concentrated on the complete sha512_transform function.
My findings running on a XEON E5-4657L v2 CPU are

Forth     Original    With asm functions
kForth    21.8 sec    2.7 sec
LXF64     17.2 sec    2.4 sec

so quite similar improvements.
Looking at the shift_with_add function, I see it uses MOVE to speed things up.
On LXF64 that translates to CMOVE> in this case, as the move overwrites its own buffer.
In LXF64 this compiles to a rep movsb with the direction flag set. It is copying down
in memory, a slow operation. Instead I tested with:

m: shift_with_add ( T1 T2 -- )
   a b c d >r >r >r >r
   over + &a !   r> &b !   r> &c !   r> &d !
   r> +
   e f g >r >r >r
   &e !   r> &f !   r> &g !   r> &h ! ;

This looks ugly with all its >r operations, but the code-generator translates this to

comp shift_with_add
START:
mov rax , QWORD PTR [0x4078F0]
mov rcx , QWORD PTR [0x4078F8]
mov rdx , QWORD PTR [0x407900]
mov rdi , QWORD PTR [0x407908]
add rbx , QWORD PTR [rbp+0]
mov QWORD PTR [0x4078F0], rbx
mov QWORD PTR [0x4078F8], rax
mov QWORD PTR [0x407900], rcx
mov QWORD PTR [0x407908], rdx
add rdi , QWORD PTR [rbp+0]
mov rbx , QWORD PTR [0x407910]
mov rax , QWORD PTR [0x407918]
mov rcx , QWORD PTR [0x407920]
mov QWORD PTR [0x407910], rdi
mov QWORD PTR [0x407918], rbx
mov QWORD PTR [0x407920], rax
mov QWORD PTR [0x407928], rcx
mov rbx , QWORD PTR [rbp+8]
lea rbp, [rbp+16]
ret
(rbx is the top of stack and rbp is the stack pointer)

The result is that it is now running the 40MB shaspeed in 350 ms!
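
For contrast, the MOVE-based approach described above can be sketched in standard Forth next to an explicit cell-by-cell version. This is only an illustration, assuming a standard Forth system: `ashiftreg` here is a locally defined 8-cell buffer standing in for the values a..h, and the names `reg`, `fill-reg`, `shift-move` and `shift-explicit` are hypothetical, not taken from sha512.4th.

```forth
\ Hypothetical stand-in for the shift register a..h: 8 cells, a first.
create ashiftreg 8 cells allot

: reg ( i -- addr )  cells ashiftreg + ;         \ address of value i (a=0 .. h=7)
: fill-reg ( -- )  8 0 do  i 1+ i reg !  loop ;  \ test pattern a=1 .. h=8

\ Variant 1: one overlapping MOVE shifts a..g up by one cell.
: shift-move ( T1 T2 -- )
   ashiftreg dup cell+ 7 cells move   \ b..h = old a..g
   over + 0 reg !                     \ a = T1 + T2
   4 reg +! ;                         \ e = old d + T1

\ Variant 2: explicit fetches and stores, highest cell first.
: shift-explicit ( T1 T2 -- )
   6 reg @ 7 reg !  5 reg @ 6 reg !  4 reg @ 5 reg !   \ h=g g=f f=e
   over 3 reg @ + 4 reg !                              \ e = old d + T1
   2 reg @ 3 reg !  1 reg @ 2 reg !  0 reg @ 1 reg !   \ d=c c=b b=a
   + 0 reg ! ;                                         \ a = T1 + T2
```

Both words should leave the same contents in the buffer; only the way the old values are moved differs, which is the part a compiler may or may not turn into a slow block copy.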

This was a very good exercise. It took me one week to get the results correct.
I searched for problems in my code-generator, but in the end it was the assembler I
used that miscompiled. The code generator produces assembler text that is sent
to an external assembler, in this case Keystone-engine. I switched to another, FCML-LIB,
but that also had problems. FASM produces correct results. I later found the problems in
Keystone and could recompile a corrected library (but there could of course be other problems).

BR
Peter Fälth

Re: SHA512 implementation in Forth (debugging)

<sv42kq$g83$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16875&group=comp.lang.forth#16875

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Tue, 22 Feb 2022 19:28:56 -0600
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <sv42kq$g83$1@dont-email.me>
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me>
<2022Feb6.195134@mips.complang.tuwien.ac.at> <stpgjo$mut$1@dont-email.me>
<2022Feb6.234434@mips.complang.tuwien.ac.at> <stpksk$l7b$1@dont-email.me>
<2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com>
<60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com>
<0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me>
<sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 23 Feb 2022 01:28:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a3311eb887a5ead665e77bc39d46f539";
logging-data="16643"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19IHqPLUJXRD1xRE9otrs8p"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:Ccj4+UsSONr7qDeJccAnSbH9zvk=
In-Reply-To: <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
Content-Language: en-US

by: Krishna Myneni - Wed, 23 Feb 2022 01:28 UTC

On 2/22/22 16:09, P Falth wrote:
> On Tuesday, 22 February 2022 at 02:16:23 UTC+1, Krishna Myneni wrote:
>> On 2/19/22 20:39, Krishna Myneni wrote:
>> ...
>>> While the 64-bit assembler for kForth-64 is still in progress, this does
>>> not impede us from examining efficiency improvements possible in the
>>> present implementation of SHA512. I have put in machine code
>>> replacements for some of the words to see how much improvement will
>>> result. The new sha512.4th may be found here:
>>>
>>> https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512.4th
>>>
>>> I measure roughly 5.6x increase in speed, now at 26 MB/s. I did not try
>>> to optimize the assembly code -- it's a direct coding.
>>>
>>> ...
>>
>> I reverted sha512.4th to pure Forth source, and created a separate
>> version, sha512-x86_64.4th, which is a hybrid of Forth source and
>> machine code calls used from within the Forth environment. I also
>> optimized the machine code a bit (v0.04). The current version gives a
>> hashing rate of about 37 MB/s (using SHASPEED on my system). A link to
>> the new hybrid file is given below.
>>
>> --
>> Krishna
>>
>>
>> https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512-x86_64.4th
>
> I have also been testing out my NTF64/LXF64 on your sha512 implementation.
> Specifically I have been testing the code-generator I am developing.
> I concentrated on the complete sha512_transform function.
> My findings running on a XEON E5-4657L v2 CPU are
>
> Forth, Original, With asm functions
> kForth, 21.8 sec, 2.7 sec
> LXF64, 17.2 sec, 2.4 sec
>
> so quite similar improvements.
> Looking at the shift_with_add function I see it uses MOVE to to speed up.
> On LXF64 in this case that translates to using CMOVE> as it is overwriting its buffer
> In LXF64 this compiles to a rep movsb with the direction flag set. It is copying down
> in memory, a slow operation. Instead I tested with:
>
> m: shift_with_add ( T1 T2 -- ) a b c d >r >r >r >r over + &a ! r> &b ! r> &c ! r> &d ! r> + e f g >r >r >r &e ! r> &f ! r> &g ! r> &h ! ;
>
> This looks ugly with all its >r operations, but the code-generator translates this to
>
> comp shift_with_add
> START:
> mov rax , QWORD PTR [0x4078F0]
> mov rcx , QWORD PTR [0x4078F8]
> mov rdx , QWORD PTR [0x407900]
> mov rdi , QWORD PTR [0x407908]
> add rbx , QWORD PTR [rbp+0]
> mov QWORD PTR [0x4078F0], rbx
> mov QWORD PTR [0x4078F8], rax
> mov QWORD PTR [0x407900], rcx
> mov QWORD PTR [0x407908], rdx
> add rdi , QWORD PTR [rbp+0]
> mov rbx , QWORD PTR [0x407910]
> mov rax , QWORD PTR [0x407918]
> mov rcx , QWORD PTR [0x407920]
> mov QWORD PTR [0x407910], rdi
> mov QWORD PTR [0x407918], rbx
> mov QWORD PTR [0x407920], rax
> mov QWORD PTR [0x407928], rcx
> mov rbx , QWORD PTR [rbp+8]
> lea rbp, [rbp+16]
> ret
> (rbx is top of stack and rbp stackpointer)
>
> The result is that it is now running the 40MB shaspeed in 350 ms!
>
> This was a very good exercise. It took me one week to get the results correct.
> I searched for problems in my code-generator but in the end it was the assembler I
> used that miss-compiled. The code generator produces assembler text that is sent
> to an external assembler, in this case Keystone-engine. I switched to another FCML-LIB
> but also that had problem. FASM produces correct result. I later found the problems in
> Keystone and could recompile a correct library (but there could of course be other problems)
>

Excellent. Thank you for reporting your results. I was puzzling over
where the bottleneck might be in the code. Although I had improved the
execution speed of the Forth-only source from about 5 MB/s to nearly 40
MB/s by replacing parts of SHA512_TRANSFORM with calls to machine code,
I did not think SHIFT_WITH_ADD would be such a bottleneck. If possible
it might be best to keep the shift register, values a--h, in actual
registers, as Anton had previously suggested, but I'm not sure there are
enough registers available to do it conveniently. I'll revise my x86_64
version presently to see how it fares with a machine code call to
SHIFT_WITH_ADD.

--
Krishna

Re: SHA512 implementation in Forth (debugging)

<sv48sg$k9u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16876&group=comp.lang.forth#16876

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Tue, 22 Feb 2022 21:15:26 -0600
Organization: A noiseless patient Spider
Lines: 92
Message-ID: <sv48sg$k9u$1@dont-email.me>
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me>
<2022Feb6.195134@mips.complang.tuwien.ac.at> <stpgjo$mut$1@dont-email.me>
<2022Feb6.234434@mips.complang.tuwien.ac.at> <stpksk$l7b$1@dont-email.me>
<2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com>
<60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com>
<0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me>
<sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 23 Feb 2022 03:15:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a3311eb887a5ead665e77bc39d46f539";
logging-data="20798"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/oewMLcivMF7IzAxa1oKNd"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:pEaTYwdJwuOfyl6oeH3mQ/2GrgE=
In-Reply-To: <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
Content-Language: en-US

by: Krishna Myneni - Wed, 23 Feb 2022 03:15 UTC

On 2/22/22 16:09, P Falth wrote:
> On Tuesday, 22 February 2022 at 02:16:23 UTC+1, Krishna Myneni wrote:
>> On 2/19/22 20:39, Krishna Myneni wrote:
....
>>
>> https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512-x86_64.4th
>
....
> My findings running on a XEON E5-4657L v2 CPU are
>
> Forth, Original, With asm functions
> kForth, 21.8 sec, 2.7 sec
> LXF64, 17.2 sec, 2.4 sec
>
> so quite similar improvements.
> Looking at the shift_with_add function I see it uses MOVE to to speed up.
> On LXF64 in this case that translates to using CMOVE> as it is overwriting its buffer
> In LXF64 this compiles to a rep movsb with the direction flag set. It is copying down
> in memory, a slow operation. Instead I tested with:
>
> m: shift_with_add ( T1 T2 -- ) a b c d >r >r >r >r over + &a ! r> &b ! r> &c ! r> &d ! r> + e f g >r >r >r &e ! r> &f ! r> &g ! r> &h ! ;
>
> This looks ugly with all its >r operations, but the code-generator translates this to
>
> comp shift_with_add
> START:
> mov rax , QWORD PTR [0x4078F0]
> mov rcx , QWORD PTR [0x4078F8]
> mov rdx , QWORD PTR [0x407900]
> mov rdi , QWORD PTR [0x407908]
> add rbx , QWORD PTR [rbp+0]
> mov QWORD PTR [0x4078F0], rbx
> mov QWORD PTR [0x4078F8], rax
> mov QWORD PTR [0x407900], rcx
> mov QWORD PTR [0x407908], rdx
> add rdi , QWORD PTR [rbp+0]
> mov rbx , QWORD PTR [0x407910]
> mov rax , QWORD PTR [0x407918]
> mov rcx , QWORD PTR [0x407920]
> mov QWORD PTR [0x407910], rdi
> mov QWORD PTR [0x407918], rbx
> mov QWORD PTR [0x407920], rax
> mov QWORD PTR [0x407928], rcx
> mov rbx , QWORD PTR [rbp+8]
> lea rbp, [rbp+16]
> ret
> (rbx is top of stack and rbp stackpointer)
>
> The result is that it is now running the 40MB shaspeed in 350 ms!
>

I used the following equivalent machine code macro for SHIFT_WITH_ADD in
sha512-x86_64.4th

---
\ in: r11 = ashiftreg, rbx points to T2, Forth stack ( T1 T2 ... )
\ out: shift register is modified per algorithm, no change to Forth stack
\ uses: rax, rcx, rdx, rdi, r8
: shift_with_add
49 8b 03 \ 0 [r11] rax mov, \ rax = a
49 8b 4b 08 \ 8 [r11] rcx mov, \ rcx = b
49 8b 53 10 \ 16 [r11] rdx mov, \ rdx = c
49 8b 7b 18 \ 24 [r11] rdi mov, \ rdi = d
4c 8b 43 08 \ 8 [rbx] r8 mov, \ r8 = T1
4c 03 03 \ [rbx] r8 add, \ r8 = T1 + T2
4d 89 03 \ r8 0 [r11] mov, \ new a = T1 + T2
49 89 43 08 \ rax 8 [r11] mov, \ new b = old a
49 89 4b 10 \ rcx 16 [r11] mov, \ new c = old b
49 89 53 18 \ rdx 24 [r11] mov, \ new d = old c
49 8b 43 20 \ 32 [r11] rax mov, \ rax = e
49 8b 4b 28 \ 40 [r11] rcx mov, \ rcx = f
49 8b 53 30 \ 48 [r11] rdx mov, \ rdx = g
48 03 7b 08 \ 8 [rbx] rdi add, \ rdi = old d + T1
49 89 7b 20 \ rdi 32 [r11] mov, \ new e = old d + T1
49 89 43 28 \ rax 40 [r11] mov, \ new f = old e
49 89 4b 30 \ rcx 48 [r11] mov, \ new g = old f
49 89 53 38 \ rdx 56 [r11] mov, \ new h = old g
; ---

The revised version of the hybrid Forth+machine code program, version
0.05, increases the hashing rate from 37 MB/s to 63 MB/s with kforth64
on my system -- SHASPEED reports an elapsed time for 40 MB of 639 ms. A
nice improvement!
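
The rate quoted above follows directly from the elapsed time. A minimal sketch of the arithmetic, assuming that "40 MB" can be treated as exactly 40 units and that a rate rounded to the nearest integer is good enough; the word name `mb/s` is hypothetical and not part of SHASPEED:

```forth
\ Hashing rate in MB/s from a data size in MB and an elapsed time in ms,
\ rounded to the nearest integer (half of ms is added before dividing).
: mb/s ( mbytes ms -- rate )
   dup >r 2/          \ ms/2, with ms saved on the return stack
   swap 1000 * +      \ mbytes*1000 + ms/2
   r> / ;

40 639 mb/s .   \ 63   (the kForth-64 figure reported above)
40 350 mb/s .   \ 114  (the 350 ms LXF64 figure quoted above)
```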

--
Krishna

Re: SHA512 implementation in Forth (debugging)

<232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16877&group=comp.lang.forth#16877

X-Received: by 2002:a05:600c:8a9:b0:380:da47:a911 with SMTP id l41-20020a05600c08a900b00380da47a911mr5219467wmp.102.1645604157352;
Wed, 23 Feb 2022 00:15:57 -0800 (PST)
X-Received: by 2002:a05:622a:11d2:b0:2d6:8a01:66ef with SMTP id
n18-20020a05622a11d200b002d68a0166efmr25222417qtk.38.1645604156894; Wed, 23
Feb 2022 00:15:56 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Wed, 23 Feb 2022 00:15:56 -0800 (PST)
In-Reply-To: <sv48sg$k9u$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:1c05:2f14:600:4147:66bb:1bd8:d0e3;
posting-account=-JQ2RQoAAAB6B5tcBTSdvOqrD1HpT_Rk
NNTP-Posting-Host: 2001:1c05:2f14:600:4147:66bb:1bd8:d0e3
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me> <2022Feb6.195134@mips.complang.tuwien.ac.at>
<stpgjo$mut$1@dont-email.me> <2022Feb6.234434@mips.complang.tuwien.ac.at>
<stpksk$l7b$1@dont-email.me> <2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com> <60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com> <0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com> <su1u03$1bf$1@dont-email.me>
<su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com> <sv48sg$k9u$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: mhx...@iae.nl (Marcel Hendrix)
Injection-Date: Wed, 23 Feb 2022 08:15:57 +0000
Content-Type: text/plain; charset="UTF-8"

by: Marcel Hendrix - Wed, 23 Feb 2022 08:15 UTC

On Wednesday, February 23, 2022 at 4:15:31 AM UTC+1, Krishna Myneni wrote:
> On 2/22/22 16:09, P Falth wrote:
> > On Tuesday, 22 February 2022 at 02:16:23 UTC+1, Krishna Myneni wrote:
> >> On 2/19/22 20:39, Krishna Myneni wrote:
[..]
> > Looking at the shift_with_add function I see it uses MOVE to to speed up.
> > On LXF64 in this case that translates to using CMOVE> as it is overwriting its buffer
> > In LXF64 this compiles to a rep movsb with the direction flag set. It is copying down
> > in memory, a slow operation. Instead I tested with:
[..]
> >
> > The result is that it is now running the 40MB shaspeed in 350 ms!
> >
> I used the following equivalent machine code macro for SHIFT_WITH_ADD in
> sha512-x86_64.4th
> The revised version of the hybrid Forth+machine code program, version
> 0.05, increases the hashing rate from 37 MB/s to 63 MB/s with kforth64
> on my system -- SHASPEED reports an elapsed time for 40 MB of 639 ms. A
> nice improvement!

For iForth64 Hanno Schwalm put a lot of work in CMOVE and all of
its variants. The problem is that OS calls are unbeatable for large
size transfers, but these calls are expensive to set up and clean up
afterwards. We have found that for a certain upper bound, CODE
words of the REP MOV type are useful, but for very small size
(I think it was below 32 bytes) one better uses !+, !- etc. The
problem for the compiler is to figure out the transfer size
at compile time. Shaspeed is a good example of this.

However, optimizing CMOVE with !+ only decreases the run time
by 10 ms or so (currently at 113 ms and really no ideas what to
improve next).
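
The small-transfer case above can be sketched in standard Forth as a copy whose size is fixed at compile time. This is an illustration only and not iForth's implementation: since the exact stack effect of iForth's `!+` is not given here, the sketch uses a hypothetical helper `cell-copy` in its place, and `copy4`, `src4` and `dst4` are likewise made-up names.

```forth
\ Copy one cell and advance both pointers.
: cell-copy ( src dst -- src' dst' )
   over @ over !            \ dst gets the cell at src
   swap cell+ swap cell+ ;

\ Fixed-size copy of exactly 4 cells: no loop, no call to MOVE.
\ Safe when the regions do not overlap, or when dst is below src.
: copy4 ( src dst -- )
   cell-copy cell-copy cell-copy cell-copy 2drop ;

create src4  11 , 22 , 33 , 44 ,
create dst4  4 cells allot
src4 dst4 copy4
```

Because the count is a constant, a compiler that inlines `cell-copy` has a chance to reduce `copy4` to a handful of loads and stores, which is the situation the transfer-size remark is about.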

-marcel

Re: SHA512 implementation in Forth (debugging)

<2022Feb23.111521@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16878&group=comp.lang.forth#16878

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Wed, 23 Feb 2022 10:15:21 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 79
Message-ID: <2022Feb23.111521@mips.complang.tuwien.ac.at>
References: <ste547$ukb$1@dont-email.me> <5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com> <0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com> <2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com> <su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me> <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com> <sv48sg$k9u$1@dont-email.me> <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="01dc13702e2109ac829114b6e87a1b89";
logging-data="6158"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19iZKWB0ARBmApCLVYdCVi7"
Cancel-Lock: sha1:3BXK1vaDs2gxf/I3/myUyK7x3jU=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Wed, 23 Feb 2022 10:15 UTC

Marcel Hendrix <mhx@iae.nl> writes:
>For iForth64 Hanno Schwalm put a lot of work in CMOVE and all of
>its variants. The problem is that OS calls are unbeatable for large
>size transfers,

I was puzzled by that statement, because I don't know a system call
that helps here, at least in the general case; in very special cases
you can mmap the same physical memory to another virtual address
range, but I doubt that you mean that.

So my guess is that you mean library calls, and basically you can do
all the things in your user-level code that the library can do in its
user-level code. It may be that the library author has spent more
time finding out about the twisty ways to get more performance,
however.

I have ventured in that area myself
<https://github.com/AntonErtl/move>, with results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
<2019Dec12.085706@mips.complang.tuwien.ac.at>
<2019Dec14.190943@mips.complang.tuwien.ac.at>

I did not optimize for large blocks (approaching the size of the L1
cache and bigger), though. AMD64 CPUs have "non-temporal" move
instructions (e.g., VMOVNTDQ, VMOVNTDQA), which tell the CPU that the
memory operand is not going to be accessed soon and does not need to
be kept in the cache; this avoids trampling over the old cache
contents, and the read-for-ownership that is needed for stores on some
(all?) microarchitectures. These instructions are weakly ordered and
need a slow fencing instruction for multicore consistency.

Depending on the way the MOVE is used, it may be a good idea to use
non-temporal move instructions already for smaller blocks. E.g., if
you know that you will not use the from block of a 2KB MOVE soon, it
may be worthwhile to use non-temporal loads to avoid the overhead of
reloading the cache lines evicted by ordinary loads. Or if you know
that the to block of a MOVE will not be used soon, the same is true
for stores. However, we probably don't want to introduce MOVE-NT>T
MOVE-T>NT MOVE-NT>NT in addition to the ordinary MOVE (I think
temporal is a good default for block sizes < 1/2 L1).

> We have found that for a certain upper bound, CODE
>words of the REP MOV type are useful,

REP MOVSB has been optimized on some microarchitectures, but is slow
on some others. Even on those where it has been optimized, code
consisting of simpler instructions tends to beat it on many use cases.
E.g., looking at the Skylake results in
<2017Sep23.174313@mips.complang.tuwien.ac.at>, REP MOVSB beats my
avxmemcpy only if both the block size is 16KB and the aligned move is
benchmarked. My guess is that REP MOVSB on Skylake uses non-temporal
memory accesses plus a fence, which causes a slowdown for all smaller
blocks in these benchmarks, but the temporal accesses in avxmemcpy
lead to slowdowns due to cache conflicts with 16KB blocks (16KB from
block, 16KB to block, and likely a virtual-memory mapping that
introduces a conflict).

You need to use wide-memory instructions (SSE, AVX, AVX512) for
optimal performance with simpler instructions, though.

>really no ideas what to
>improve next).

Keep all the a-h values in registers all the time. With lxf64 this
may be possible to achieve with the locals-to-stack-to-locals
technique I used for your (incorrect) implementation of SHA512
<2022Jan30.232519@mips.complang.tuwien.ac.at> (plus some way to work
around the lack of inlining, if that lack still exists). For iForth
you would need to put locals in registers, or rewrite the code to use
only the stacks.
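
One way to read the suggestion above, offered as an assumption about the general idea and not as the code from the referenced article: each step takes a..h from the data stack into locals and pushes the updated values back, so that a compiler which keeps locals and stack items in registers never has to touch memory. The word name `shift-step` is hypothetical, and the sketch assumes a system that supports the Forth-2012 `{: ... :}` locals syntax.

```forth
\ Sketch only: one shift step with all eight values passed on the stack.
\ h is consumed and drops out of the result.
: shift-step ( a b c d e f g h T1 T2 -- a' b' c' d' e' f' g' h' )
   {: a b c d e f g h T1 T2 :}
   T1 T2 +      \ a' = T1 + T2
   a b c        \ b' c' d' = old a b c
   d T1 +       \ e' = old d + T1
   e f g ;      \ f' g' h' = old e f g
```

Whether this actually stays in registers depends entirely on the compiler, and on inlining if the step is called from a loop.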

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Re: SHA512 implementation in Forth (debugging)

<sv5dor$2j1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16879&group=comp.lang.forth#16879

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Wed, 23 Feb 2022 07:44:57 -0600
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <sv5dor$2j1$1@dont-email.me>
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me>
<2022Feb6.195134@mips.complang.tuwien.ac.at> <stpgjo$mut$1@dont-email.me>
<2022Feb6.234434@mips.complang.tuwien.ac.at> <stpksk$l7b$1@dont-email.me>
<2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com>
<60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com>
<0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me>
<sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
<sv48sg$k9u$1@dont-email.me>
<232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 23 Feb 2022 13:44:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a3311eb887a5ead665e77bc39d46f539";
logging-data="2657"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+6oPCufhRtaGK+02bR86YQ"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:m1WuEdZl7ZgyoYNB/vESyuK0MsM=
In-Reply-To: <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
Content-Language: en-US

by: Krishna Myneni - Wed, 23 Feb 2022 13:44 UTC

On 2/23/22 02:15, Marcel Hendrix wrote:
> On Wednesday, February 23, 2022 at 4:15:31 AM UTC+1, Krishna Myneni wrote:
>> On 2/22/22 16:09, P Falth wrote:
>>> On Tuesday, 22 February 2022 at 02:16:23 UTC+1, Krishna Myneni wrote:
>>>> On 2/19/22 20:39, Krishna Myneni wrote:
> [..]
>>> Looking at the shift_with_add function I see it uses MOVE to to speed up.
>>> On LXF64 in this case that translates to using CMOVE> as it is overwriting its buffer
>>> In LXF64 this compiles to a rep movsb with the direction flag set. It is copying down
>>> in memory, a slow operation. Instead I tested with:
> [..]
>>>
>>> The result is that it is now running the 40MB shaspeed in 350 ms!
>>>
>> I used the following equivalent machine code macro for SHIFT_WITH_ADD in
>> sha512-x86_64.4th
>> The revised version of the hybrid Forth+machine code program, version
>> 0.05, increases the hashing rate from 37 MB/s to 63 MB/s with kforth64
>> on my system -- SHASPEED reports an elapsed time for 40 MB of 639 ms. A
>> nice improvement!
>
> For iForth64 Hanno Schwalm put a lot of work in CMOVE and all of
> its variants. The problem is that OS calls are unbeatable for large
> size transfers, but these calls are expensive to set up and clean up
> afterwards. We have found that for a certain upper bound, CODE
> words of the REP MOV type are useful, but for very small size
> (I think it was below 32 bytes) one better uses !+, !- etc. The
> problem for the compiler is to figure out the transfer size
> at compile time. Shaspeed is a good example of this.
>
> However, optimizing CMOVE with !+ only decreases the runspeed
> by 10ms or so (currently at 113 ms and really no ideas what to
> improve next).
>

The utility sha512sum on Linux appears to have a hashing rate in excess
of 500 MB/s on my system, so it may be worthwhile to examine its source
code.

To clarify, my intent with the Forth/machine code hybrid program is
simply to determine how much efficiency can be gained from a
straightforward Forth/assembly code mixed source within a non-optimizing
indirect threaded code Forth compiler. It is not intended to be an
argument against the need to use an optimizing Forth compiler with pure
Forth source. The goal of an optimizing compiler is worthwhile, although
it is not always necessary.

--
Krishna

Re: SHA512 implementation in Forth (debugging)

<f97cd5a2-bd8f-4bfd-b255-8db23a328f7dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16880&group=comp.lang.forth#16880

X-Received: by 2002:a5d:588a:0:b0:1e8:b478:e74f with SMTP id n10-20020a5d588a000000b001e8b478e74fmr106619wrf.210.1645629458145;
Wed, 23 Feb 2022 07:17:38 -0800 (PST)
X-Received: by 2002:a05:6214:29cc:b0:42d:f63c:f3f4 with SMTP id
gh12-20020a05621429cc00b0042df63cf3f4mr124407qvb.87.1645629457537; Wed, 23
Feb 2022 07:17:37 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Wed, 23 Feb 2022 07:17:37 -0800 (PST)
In-Reply-To: <sus9kf$luv$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:3f7a:20d0:cd33:7faa:b180:9fd6;
posting-account=V5nGoQoAAAC_P2U0qnxm2kC0s1jNJXJa
NNTP-Posting-Host: 2600:1700:3f7a:20d0:cd33:7faa:b180:9fd6
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me> <2022Feb6.195134@mips.complang.tuwien.ac.at>
<stpgjo$mut$1@dont-email.me> <2022Feb6.234434@mips.complang.tuwien.ac.at>
<stpksk$l7b$1@dont-email.me> <2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com> <60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com> <0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com> <su1u03$1bf$1@dont-email.me>
<su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f97cd5a2-bd8f-4bfd-b255-8db23a328f7dn@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: sdwjac...@gmail.com (S Jack)
Injection-Date: Wed, 23 Feb 2022 15:17:38 +0000
Content-Type: text/plain; charset="UTF-8"

by: S Jack - Wed, 23 Feb 2022 15:17 UTC

On Saturday, February 19, 2022 at 8:39:14 PM UTC-6, Krishna Myneni wrote:

> s" abc" SHA512_Data type
> >DDAF35A193617ABACC417349AE20413112E6FA4E89A97EA20A9EEEE64B55D39A2192992A274FC1A836BA3C23A3FEEBBD454D4423643CE80E2A9AC94FA54CA49F

That does agree with the document, but I'm getting:
:) echo abc|sha512sum
4f285d0c0cc77286d8731798b7aae2639e28270d4166f40d769cbbdca5230714d848483d364e2f39fe6cb9083c15229b39a33615ebc6d57605f7c43f6906739d -
Any guess as to why?
--
me
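
One likely cause, offered here as an assumption rather than something stated in the post: `echo abc` writes a trailing newline, so sha512sum hashes four bytes where the Forth word hashes three (`printf abc | sha512sum` would send only the three). The sketch below just makes the two inputs explicit; the names `abc-nl`, `forth-input` and `echo-input` are hypothetical, and the commented lines assume the thread's sha512.4th has been loaded.

```forth
\ The bytes 'a' 'b' 'c' followed by a linefeed (10).
create abc-nl  char a c,  char b c,  char c c,  10 c,

: forth-input ( -- c-addr u )  abc-nl 3 ;   \ same bytes as s" abc"
: echo-input  ( -- c-addr u )  abc-nl 4 ;   \ bytes written by: echo abc

\ With sha512.4th loaded (not shown here):
\   forth-input SHA512_Data type   \ should give the DDAF35A1... digest above
\   echo-input  SHA512_Data type   \ hashes the linefeed as well
```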

Re: SHA512 implementation in Forth (debugging)

<63d6b3c9-9210-4bcc-b597-c5a6c8cdfb70n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16881&group=comp.lang.forth#16881

X-Received: by 2002:adf:f041:0:b0:1ed:bdb2:ea8c with SMTP id t1-20020adff041000000b001edbdb2ea8cmr120838wro.99.1645630166761;
Wed, 23 Feb 2022 07:29:26 -0800 (PST)
X-Received: by 2002:ac8:5853:0:b0:2d6:8a16:753c with SMTP id
h19-20020ac85853000000b002d68a16753cmr188407qth.401.1645630166120; Wed, 23
Feb 2022 07:29:26 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Wed, 23 Feb 2022 07:29:25 -0800 (PST)
In-Reply-To: <f97cd5a2-bd8f-4bfd-b255-8db23a328f7dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:3f7a:20d0:cd33:7faa:b180:9fd6;
posting-account=V5nGoQoAAAC_P2U0qnxm2kC0s1jNJXJa
NNTP-Posting-Host: 2600:1700:3f7a:20d0:cd33:7faa:b180:9fd6
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me> <2022Feb6.195134@mips.complang.tuwien.ac.at>
<stpgjo$mut$1@dont-email.me> <2022Feb6.234434@mips.complang.tuwien.ac.at>
<stpksk$l7b$1@dont-email.me> <2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com> <60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com> <0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com> <su1u03$1bf$1@dont-email.me>
<su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <f97cd5a2-bd8f-4bfd-b255-8db23a328f7dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <63d6b3c9-9210-4bcc-b597-c5a6c8cdfb70n@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: sdwjac...@gmail.com (S Jack)
Injection-Date: Wed, 23 Feb 2022 15:29:26 +0000
Content-Type: text/plain; charset="UTF-8"

by: S Jack - Wed, 23 Feb 2022 15:29 UTC

On Wednesday, February 23, 2022 at 9:17:40 AM UTC-6, S Jack wrote:
> On Saturday, February 19, 2022 at 8:39:14 PM UTC-6, Krishna Myneni wrote:
>
> > s" abc" SHA512_Data type
> > >DDAF35A193617ABACC417349AE20413112E6FA4E89A97EA20A9EEEE64B55D39A2192992A274FC1A836BA3C23A3FEEBBD454D4423643CE80E2A9AC94FA54CA49F
> That does agree with the document, but I'm getting:
> :) echo abc|sha512sum
> 4f285d0c0cc77286d8731798b7aae2639e28270d4166f40d769cbbdca5230714d848483d364e2f39fe6cb9083c15229b39a33615ebc6d57605f7c43f6906739d -
> Any guess as to why?
> --
> me

ok found it.
:) echo -n abc|sha512sum
ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f -
--
me
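
The two shell commands differ by one byte: echo appends a line feed, so sha512sum sees 4 bytes rather than the 3 bytes of the Forth string. A small sketch in standard Forth that makes the two byte sequences visible. The names .bytes and abc-nl are made up for this illustration, and the commented-out last line assumes the SHA512_Data word from the thread is loaded.

```forth
\ Print each byte of a string as a decimal number.
: .bytes ( c-addr u -- )
   0 ?DO  DUP I + C@ .  LOOP  DROP ;

\ The bytes `echo abc` sends down the pipe: a, b, c and a line feed.
CREATE abc-nl  CHAR a C,  CHAR b C,  CHAR c C,  10 C,

S" abc" .bytes      \ prints 97 98 99     (same input as echo -n abc)
abc-nl 4 .bytes     \ prints 97 98 99 10  (same input as echo abc)

\ With the thread's code loaded, hashing the 4-byte string should give
\ the 4F28... digest reported by `echo abc|sha512sum`:
\   abc-nl 4 SHA512_Data type
```

On systems where echo does not accept -n, printf 'abc' is the usual way to get the same 3 bytes.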

Re: SHA512 implementation in Forth (debugging)

<25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16883&group=comp.lang.forth#16883

X-Received: by 2002:adf:fa82:0:b0:1e6:34fe:9bf with SMTP id h2-20020adffa82000000b001e634fe09bfmr1084524wrr.43.1645652235874;
Wed, 23 Feb 2022 13:37:15 -0800 (PST)
X-Received: by 2002:ac8:5f0c:0:b0:2de:2dc9:24e5 with SMTP id
x12-20020ac85f0c000000b002de2dc924e5mr1749010qta.535.1645652235258; Wed, 23
Feb 2022 13:37:15 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Wed, 23 Feb 2022 13:37:15 -0800 (PST)
In-Reply-To: <2022Feb23.111521@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=84.30.53.30; posting-account=-JQ2RQoAAAB6B5tcBTSdvOqrD1HpT_Rk
NNTP-Posting-Host: 84.30.53.30
References: <ste547$ukb$1@dont-email.me> <5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com>
<0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com> <2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me>
<sv1dh5$2k7$1@dont-email.me> <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
<sv48sg$k9u$1@dont-email.me> <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
<2022Feb23.111521@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: mhx...@iae.nl (Marcel Hendrix)
Injection-Date: Wed, 23 Feb 2022 21:37:15 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 19

by: Marcel Hendrix - Wed, 23 Feb 2022 21:37 UTC

On Wednesday, February 23, 2022 at 11:17:56 AM UTC+1, Anton Ertl wrote:
> Marcel Hendrix <m...@iae.nl> writes:
[..]
> >really no ideas what to
> >improve next).
> Keep all the a-h values in registers all the time. [..] For iForth
> you would need to put locals in registers, or rewrite the code to use
> only the stacks.

Thank you for cleaning up my terminology.

In your opinion, would it be effective to store/load locals in the
SSE/AVX registers (I only use the FPU in iForth)? There is, or was,
overhead in switching between floating-point modes but I notice
that the Intel C/Fortran compilers can mix e.g. SSE with FPU instructions.
I don't know if this is because of wishing to use the 80-bit extended
floats, or if the FPU can actually run in parallel with wide-memory
instructions.

-marcel

Re: SHA512 implementation in Forth (debugging)

<2022Feb23.230415@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16884&group=comp.lang.forth#16884

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Wed, 23 Feb 2022 22:04:15 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 27
Message-ID: <2022Feb23.230415@mips.complang.tuwien.ac.at>
References: <ste547$ukb$1@dont-email.me> <2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com> <su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me> <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com> <sv48sg$k9u$1@dont-email.me> <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com> <2022Feb23.111521@mips.complang.tuwien.ac.at> <25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="01dc13702e2109ac829114b6e87a1b89";
logging-data="28351"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/94S+nV5xBRBdQrqobgCx3"
Cancel-Lock: sha1:4VbV4rOgItVuquUD20dE4+Eyhf0=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Wed, 23 Feb 2022 22:04 UTC

Marcel Hendrix <mhx@iae.nl> writes:
>In your opinion, would it be effective to store/load locals in the
>SSE/AVX registers (I only use the FPU in iForth)?

I would put cells in the GPRs (rax-r15); should be enough for a-h plus
a few temporary values, plus the stack pointers. I tend to avoid
moving data between GPRs and XMM/YMM registers, so I don't know if
that is fast or slow. For stuff like SHA512 where (I think)
the data is not used as addresses, you could run the whole computation
in XMM/YMM/ZMM registers. But I would design my Forth system to use
GPRs and instructions that work on them, so that's more a CODE word
option for me.

> the Intel C/Fortran compilers can mix e.g. SSE with FPU instructions.
>I don't know if this is because of wishing to use the 80-bit extended
>floats, or if the FPU can actually run in parallel with wide-memory
>instructions.

Running the FPU in parallel with the load/store unit is not a problem
for recent (or not so recent) CPU cores.

Re: SHA512 implementation in Forth (debugging)

<7483ff55-5f38-40a1-9165-e59dff7a6e6dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16885&group=comp.lang.forth#16885

X-Received: by 2002:a05:600c:240b:b0:380:f424:f2be with SMTP id 11-20020a05600c240b00b00380f424f2bemr4123022wmp.16.1645657824860;
Wed, 23 Feb 2022 15:10:24 -0800 (PST)
X-Received: by 2002:a05:6214:21c4:b0:42c:3068:e1cf with SMTP id
d4-20020a05621421c400b0042c3068e1cfmr52380qvh.59.1645657822413; Wed, 23 Feb
2022 15:10:22 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Wed, 23 Feb 2022 15:10:22 -0800 (PST)
In-Reply-To: <2022Feb23.230415@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=84.30.53.30; posting-account=-JQ2RQoAAAB6B5tcBTSdvOqrD1HpT_Rk
NNTP-Posting-Host: 84.30.53.30
References: <ste547$ukb$1@dont-email.me> <2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me>
<sv1dh5$2k7$1@dont-email.me> <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
<sv48sg$k9u$1@dont-email.me> <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com>
<2022Feb23.111521@mips.complang.tuwien.ac.at> <25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com>
<2022Feb23.230415@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7483ff55-5f38-40a1-9165-e59dff7a6e6dn@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: mhx...@iae.nl (Marcel Hendrix)
Injection-Date: Wed, 23 Feb 2022 23:10:24 +0000
Content-Type: text/plain; charset="UTF-8"

by: Marcel Hendrix - Wed, 23 Feb 2022 23:10 UTC

On Wednesday, February 23, 2022 at 11:20:10 PM UTC+1, Anton Ertl wrote:
> Marcel Hendrix <m...@iae.nl> writes:
> >In your opinion, would it be effective to store/load locals in the
> >SSE/AVX registers (I only use the FPU in iForth)?
> I would put cells in the GPRs (rax-r15); should be enough for a-h plus
> a few temporary values, plus the stack pointers.

This is part of the main sha512 loop:
[...]
e f g Ch +
e sigma1_512u + h + ( T1 )
a sigma0_512u a b c Maj + ( T2 )
[ HERE H. ]
g TO h f TO g e TO f
OVER ( T1 ) d + TO e
c TO d b TO c a TO b
( T1 T2 ) + TO a
LOOP
[...]
The HERE H. is to find out the address idis needs:

FORTH> $013480C1 idis
$013480C1 mov r9, $01342540 qword-offset
$013480C8 mov $01342560 qword-offset, r9
$013480CF mov r9, $01342520 qword-offset
$013480D6 mov $01342540 qword-offset, r9
$013480DD mov r9, $01342500 qword-offset
$013480E4 mov $01342520 qword-offset, r9
\ apparently T1 is in rax
$013480EB mov r9, rax
$013480EE add r9, $013424E0 qword-offset
$013480F5 mov $01342500 qword-offset, r9
$013480FC mov r9, $013424C0 qword-offset
$01348103 mov $013424E0 qword-offset, r9
$0134810A mov r9, $013424A0 qword-offset
$01348111 mov $013424C0 qword-offset, r9
$01348118 mov r9, $01342480 qword-offset
$0134811F mov $013424A0 qword-offset, r9
\ looks like T2 is known as the sum of rdi and rdx ( Maj + )
$01348126 lea rdi, [rdi rdx*1] qword
$0134812A lea rcx, [rax rdi*1] qword
$0134812E mov $01342480 qword-offset, rcx
[...]
There is quite an overuse of r9, but isn't AMD doing register
renaming behind our backs by now?

LOOP is still as bad as when you last criticized it
so I won't show it.

-marcel
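
The loop fragment in this post uses Ch, Maj and two sigma words whose definitions are not shown. For reference, here is a minimal sketch of those SHA-512 round functions as FIPS 180-4 defines them, written in standard Forth and assuming 64-bit cells. These are illustrative definitions, not Marcel's code; ror64, Sigma0 and Sigma1 are names chosen here (the post calls the last two sigma0_512u and sigma1_512u).

```forth
\ Rotate a 64-bit cell right by n bits, 0 < n < 64.
: ror64 ( x n -- x' )
   >R  DUP R@ RSHIFT  SWAP 64 R> - LSHIFT  OR ;

\ Ch(e,f,g) = (e AND f) XOR ((NOT e) AND g)
: Ch ( e f g -- x )
   >R  OVER AND  SWAP INVERT R> AND  XOR ;

\ Maj(a,b,c) = (a AND b) XOR (a AND c) XOR (b AND c),
\ computed here in the equivalent form (a AND b) XOR ((a XOR b) AND c).
: Maj ( a b c -- x )
   >R  2DUP AND  ROT ROT XOR  R> AND  XOR ;

\ Sigma0(a) = ROTR28 XOR ROTR34 XOR ROTR39
: Sigma0 ( a -- x )
   DUP 28 ror64  OVER 34 ror64 XOR  SWAP 39 ror64 XOR ;

\ Sigma1(e) = ROTR14 XOR ROTR18 XOR ROTR41
: Sigma1 ( e -- x )
   DUP 14 ror64  OVER 18 ror64 XOR  SWAP 41 ror64 XOR ;
```

With these, T1 is h + Sigma1(e) + Ch(e,f,g) plus the round constant and message word, and T2 is Sigma0(a) + Maj(a,b,c), which matches the ( T1 ) and ( T2 ) comments in the loop.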

Re: SHA512 implementation in Forth (debugging)

<2022Feb24.092412@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16886&group=comp.lang.forth#16886

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Thu, 24 Feb 2022 08:24:12 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 72
Message-ID: <2022Feb24.092412@mips.complang.tuwien.ac.at>
References: <ste547$ukb$1@dont-email.me> <su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me> <1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com> <sv48sg$k9u$1@dont-email.me> <232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com> <2022Feb23.111521@mips.complang.tuwien.ac.at> <25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com> <2022Feb23.230415@mips.complang.tuwien.ac.at> <7483ff55-5f38-40a1-9165-e59dff7a6e6dn@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="352c93ce62a785721801cba54fe768c5";
logging-data="31270"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+eP87lvR0Af2bB+A9Fx6PB"
Cancel-Lock: sha1:jw8ywMW+OcY8fz3OqvhLy9OXIWM=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Thu, 24 Feb 2022 08:24 UTC

Marcel Hendrix <mhx@iae.nl> writes:
>On Wednesday, February 23, 2022 at 11:20:10 PM UTC+1, Anton Ertl wrote:
>> Marcel Hendrix <m...@iae.nl> writes:
>> >In your opinion, would it be effective to store/load locals in the
>> >SSE/AVX registers (I only use the FPU in iForth)?
>> I would put cells in the GPRs (rax-r15); should be enough for a-h plus
>> a few temporary values, plus the stack pointers.
>
>This is part of the main sha512 loop:
>[...]
> e f g Ch +
> e sigma1_512u + h + ( T1 )
> a sigma0_512u a b c Maj + ( T2 )
>[ HERE H. ]
> g TO h f TO g e TO f
> OVER ( T1 ) d + TO e
> c TO d b TO c a TO b
> ( T1 T2 ) + TO a
> LOOP
>[...]
>The HERE H. is to find out the address idis needs:
>
>FORTH> $013480C1 idis
>$013480C1 mov r9, $01342540 qword-offset
>$013480C8 mov $01342560 qword-offset, r9
>$013480CF mov r9, $01342520 qword-offset
>$013480D6 mov $01342540 qword-offset, r9
>$013480DD mov r9, $01342500 qword-offset
>$013480E4 mov $01342520 qword-offset, r9
>\ apparently T1 is in rax
>$013480EB mov r9, rax
>$013480EE add r9, $013424E0 qword-offset
>$013480F5 mov $01342500 qword-offset, r9
>$013480FC mov r9, $013424C0 qword-offset
>$01348103 mov $013424E0 qword-offset, r9
>$0134810A mov r9, $013424A0 qword-offset
>$01348111 mov $013424C0 qword-offset, r9
>$01348118 mov r9, $01342480 qword-offset
>$0134811F mov $013424A0 qword-offset, r9
>\ looks like T2 is known as the sum of rdi and rdx ( Maj + )
>$01348126 lea rdi, [rdi rdx*1] qword
>$0134812A lea rcx, [rax rdi*1] qword
>$0134812E mov $01342480 qword-offset, rcx
>[...]
>There is quite an overuse of r9, but isn't AMD doing register
>renaming behind our backs by now?

Yes, AMD has been using hardware register renaming since the K5 (and
Intel since the Pentium Pro), so I would not expect an improvement if
you used different registers. What I mean is getting rid of all these
stores by keeping a-h in registers.

If you keep them in memory, you can use the AVX instructions to reduce
the 8 64-bit loads to 2 256-bit loads, and the 8 stores to 2 256-bit
stores and 2 64-bit stores (or you merge the new results while the data
is in YMM registers; but I think that would require so many
instructions that it would be a loss). I would arrange the data such
that the 256-bit stores are 256-bit-aligned.

>LOOP is still as bad as when you last criticized it
>so I won't show it.

The loop body takes more than the 5-6 cycles minimum latency caused by
keeping the loop counter in memory, so I don't think it causes a
significant slowdown here.
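
At the Forth source level, the memory arrangement suggested above could look like the sketch below: a-h live in 8 consecutive cells starting on a 32-byte boundary, so the eight TO assignments of the loop become one block move plus two stores. This only illustrates the data layout; whether a given compiler emits 256-bit loads and stores for the MOVE is not claimed. It assumes 64-bit cells, and a-h-raw, a-h and rotate-a-h are hypothetical names.

```forth
\ 8 cells for a..h plus slack, so a 32-byte aligned base can be picked.
CREATE a-h-raw  8 CELLS 31 + ALLOT
a-h-raw 31 +  -32 AND  CONSTANT a-h     \ a at a-h, h at a-h 7 CELLS +

\ End-of-round update: b..h := a..g, e := d + T1, a := T1 + T2.
: rotate-a-h ( T1 T2 -- )
   a-h DUP CELL+  7 CELLS MOVE          \ shift a..g up by one cell
   OVER +  a-h !                        \ a := T1 + T2
   a-h 4 CELLS + +! ;                   \ slot e now holds old d; add T1
```

MOVE is specified to copy correctly when source and destination overlap, which is what makes the one-cell shift safe.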

Re: SHA512 implementation in Forth (debugging)

<svdb6f$k0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16894&group=comp.lang.forth#16894

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: SHA512 implementation in Forth (debugging)
Date: Sat, 26 Feb 2022 07:50:05 -0600
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <svdb6f$k0$1@dont-email.me>
References: <ste547$ukb$1@dont-email.me> <stfml5$fhg$1@dont-email.me>
<stk05g$j70$1@dont-email.me> <stnjbb$3c7$1@dont-email.me>
<2022Feb6.195134@mips.complang.tuwien.ac.at> <stpgjo$mut$1@dont-email.me>
<2022Feb6.234434@mips.complang.tuwien.ac.at> <stpksk$l7b$1@dont-email.me>
<2022Feb8.125849@mips.complang.tuwien.ac.at>
<7a1f10f7-f9ba-4215-bcf5-4386af849da8n@googlegroups.com>
<60781ffe-65a0-40d1-948b-df766219e542n@googlegroups.com>
<5afc366f-d2b3-43d2-a8dc-13e3fb826654n@googlegroups.com>
<0bf525a7-0db0-4533-8839-20ca9454309fn@googlegroups.com>
<2996d1bc-6f53-42ba-b630-53365632e3b0n@googlegroups.com>
<su1u03$1bf$1@dont-email.me> <su23hg$i8t$1@dont-email.me>
<sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com>
<sv48sg$k9u$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 26 Feb 2022 13:50:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8b25c84d762d3df9e6ab47934029ef51";
logging-data="640"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++lF1tCPYXyc0vr8kiRO0V"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:B+7x4sOkE3Wx7zl31SfI/6wNSLg=
In-Reply-To: <sv48sg$k9u$1@dont-email.me>
Content-Language: en-US

by: Krishna Myneni - Sat, 26 Feb 2022 13:50 UTC

On 2/22/22 21:15, Krishna Myneni wrote:
> On 2/22/22 16:09, P Falth wrote:
....
>> The result is that it is now running the 40MB shaspeed in 350 ms!
>>
>
....
> The revised version of the hybrid Forth+machine code program, version
> 0.05, increases the hashing rate from 37 MB/s to 63 MB/s with kforth64
> on my system -- SHASPEED reports an elapsed time for 40 MB of 639 ms. A
> nice improvement!
>

Further improvements include rounds looping within the assembler code,
as well as the accumulation of the a--h constants. My revised
SHA512_TRANSFORM is

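: SHA512_Transform ( addr -- )
    >r
    sreg_acc shiftreg 8 CELLS MOVE      \ copy the 8 cells at sreg_acc to shiftreg
    shiftreg K512[] W512[] r> roundsA   \ rounds loop in assembler; takes addr
    shiftreg K512[] W512[] roundsB      \ rounds loop in assembler
    shiftreg sreg_acc accumulate ;      \ accumulation of a--h, in assembler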

The SHASPEED test (hashing 40 MB) takes 196 ms, giving a hashing rate in
excess of 200 MB/s with kforth64 on my system.

The updated file, sha512-x86_64.4th, may be found at

https://github.com/mynenik/kForth-64/blob/master/forth-src/sha512-x86_64.4th

--
Krishna
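
The hashing rates quoted in this post follow directly from the SHASPEED times for 40 MB. A small sketch of the arithmetic in standard Forth; MB/s is a helper name made up here, not a word from sha512-x86_64.4th.

```forth
\ Hashing rate in MB/s, rounded to nearest, from megabytes and milliseconds.
: MB/s ( mbytes ms -- rate )
   >R  1000 *  R@ 2/ +  R> / ;

40 639 MB/s .   \ prints 63   (version 0.05, 639 ms)
40 196 MB/s .   \ prints 204  (this version, 196 ms)
40 350 MB/s .   \ prints 114  (the 350 ms figure quoted from P Falth)
```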

Re: SHA512 implementation in Forth (debugging)

<e06e759d-fb3d-4beb-8438-916114ec71dcn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16898&group=comp.lang.forth#16898

X-Received: by 2002:a37:644:0:b0:60d:eace:79c1 with SMTP id 65-20020a370644000000b0060deace79c1mr7978297qkg.744.1645916467205;
Sat, 26 Feb 2022 15:01:07 -0800 (PST)
X-Received: by 2002:a37:841:0:b0:478:9e37:96fb with SMTP id
62-20020a370841000000b004789e3796fbmr8044203qki.110.1645916467025; Sat, 26
Feb 2022 15:01:07 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.forth
Date: Sat, 26 Feb 2022 15:01:06 -0800 (PST)
In-Reply-To: <2022Feb24.092412@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:1c05:2f14:600:e516:7527:61c2:fca5;
posting-account=-JQ2RQoAAAB6B5tcBTSdvOqrD1HpT_Rk
NNTP-Posting-Host: 2001:1c05:2f14:600:e516:7527:61c2:fca5
References: <ste547$ukb$1@dont-email.me> <su1u03$1bf$1@dont-email.me>
<su23hg$i8t$1@dont-email.me> <sus9kf$luv$1@dont-email.me> <sv1dh5$2k7$1@dont-email.me>
<1cd581c1-f91f-4fb4-aa71-d4f5915acc41n@googlegroups.com> <sv48sg$k9u$1@dont-email.me>
<232bc19e-f4c0-4aa1-9f4c-67815d3e42c0n@googlegroups.com> <2022Feb23.111521@mips.complang.tuwien.ac.at>
<25f6cbc7-c60a-4317-b6c3-14e2aadb5007n@googlegroups.com> <2022Feb23.230415@mips.complang.tuwien.ac.at>
<7483ff55-5f38-40a1-9165-e59dff7a6e6dn@googlegroups.com> <2022Feb24.092412@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e06e759d-fb3d-4beb-8438-916114ec71dcn@googlegroups.com>
Subject: Re: SHA512 implementation in Forth (debugging)
From: mhx...@iae.nl (Marcel Hendrix)
Injection-Date: Sat, 26 Feb 2022 23:01:07 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 140

by: Marcel Hendrix - Sat, 26 Feb 2022 23:01 UTC

On Thursday, February 24, 2022 at 9:40:56 AM UTC+1, Anton Ertl wrote:
> Marcel Hendrix <m...@iae.nl> writes:
> >On Wednesday, February 23, 2022 at 11:20:10 PM UTC+1, Anton Ertl wrote:
> >> Marcel Hendrix <m...@iae.nl> writes:
> >> >In your opinion, would it be effective to store/load locals in the
> >> >SSE/AVX registers (I only use the FPU in iForth)?
> >> I would put cells in the GPRs (rax-r15); should be enough for a-h plus
> >> a few temporary values, plus the stack pointers.
> >
> >This is part of the main sha512 loop:
> >[...]
> > e f g Ch +
> > e sigma1_512u + h + ( T1 )
> > a sigma0_512u a b c Maj + ( T2 )
> >[ HERE H. ]
> > g TO h f TO g e TO f
> > OVER ( T1 ) d + TO e
> > c TO d b TO c a TO b
> > ( T1 T2 ) + TO a
> > LOOP
> >[...]
> >The HERE H. is to find out the address idis needs:
> >
> >FORTH> $013480C1 idis
> >$013480C1 mov r9, $01342540 qword-offset
> >$013480C8 mov $01342560 qword-offset, r9
> >$013480CF mov r9, $01342520 qword-offset
> >$013480D6 mov $01342540 qword-offset, r9
> >$013480DD mov r9, $01342500 qword-offset
> >$013480E4 mov $01342520 qword-offset, r9
> >\ apparently T1 is in rax
> >$013480EB mov r9, rax
> >$013480EE add r9, $013424E0 qword-offset
> >$013480F5 mov $01342500 qword-offset, r9
> >$013480FC mov r9, $013424C0 qword-offset
> >$01348103 mov $013424E0 qword-offset, r9
> >$0134810A mov r9, $013424A0 qword-offset
> >$01348111 mov $013424C0 qword-offset, r9
> >$01348118 mov r9, $01342480 qword-offset
> >$0134811F mov $013424A0 qword-offset, r9
> >\ looks like T2 is known as the sum of rdi and rdx ( Maj + )
> >$01348126 lea rdi, [rdi rdx*1] qword
> >$0134812A lea rcx, [rax rdi*1] qword
> >$0134812E mov $01342480 qword-offset, rcx
> >[...]
> >There is quite an overuse of r9, but isn't AMD doing register
> >renaming behind our backs by now?
> Yes, AMD has been using hardware register renaming since the K5 (and
> Intel since the Pentium Pro), so I would not expect an improvement if
> you used different registers. What I mean is getting rid of all these
> stores by keeping a-h in registers.
>
> If you keep them in memory, you can use the AVX instructions to reduce
> the 8 64-bit loads to 2 256-bit loads, and the 8 stores to 2 256-bit
> stores and 2 64-bit stores (or you merge the new results while the data
> is in YMM registers; but I think that would require so many
> instructions that it would be a loss). I would arrange the data such
> that the 256-bit stores are 256-bit-aligned.
> >LOOP is still as bad as when you last criticized it
> >so I won't show it.
> The loop body takes more than the 5-6 cycles minimum latency caused by
> keeping the loop counter in memory, so I don't think it causes a
> significant slowdown here.

I hoped for an easier win here, but it is not so simple. I added 8 register
vars to iForth64 (lower 64-bit half of xmm0 ... xmm7). The code now
looks like the below. Unfortunately our disassembler is not good enough
yet for this job, so I can't show everything. This is somewhere around
the #80 #16 DO ... location in SHA512_transform.
....
$013A8856 mov rdi, [rbp 0 +] qword
$013A885A add rdx, [rdi*8 $013A2CA0 +] qword
$013A8862 lea rbx, [rbx #32 +] qword
$013A8866 push rbx
$013A8867 mov rbx, rdx
$013A886A push rbx
$013A886B movq rbx, xmm4
$013A8870 push rbx
$013A8871 movq rbx, xmm5
$013A8876 push rbx
$013A8877 movq rbx, xmm6
$013A887C pop rdi
$013A887D pop rax
$013A887E and rdi, rax
$013A8881 not rax
$013A8884 and rax, rbx
$013A8887 xor rdi, rax
$013A888A pop rbx
$013A888B lea rbx, [rbx rdi*1] qword
$013A888F push rbx
$013A8890 movq rbx, xmm4
$013A8895 mov rdi, rbx
$013A8898 ror rdi, #14 b#
$013A889C mov rax, rbx
$013A889F ror rax, #18 b#
$013A88A3 xor rdi, rax
$013A88A6 ror rbx, #41 b#
$013A88AA xor rdi, rbx
....
etc.

It is actually quite a bit slower than when using ordinary locals.
FORTH> S" abc" SHA512_Data .SHA512
DDAF35A1 93617ABA CC417349 AE204131 12E6FA4E 89A97EA2
0A9EEEE6 4B55D39A 2192992A 274FC1A8 36BA3C23 A3FEEBBD
454D4423 643CE80E 2A9AC94F A54CA49F ok
FORTH> SHAspeed
Processing 40 Mbytes ... 0.142 seconds elapsed. ok

With locals this is 0.113 seconds.

A simple benchmark with the new regvars shows that the SSE
instructions are about 50% slower than instructions between GPRs.
Maybe I should use both halves of the XMM register, per your
suggestion.

: MAKE-REGVARS ( -- )
'i' 'a' DO I 'a' - S" FROM-REG " I CHAR-APPEND EVALUATE
I 'a' - S" TO-REG TO_" I CHAR-APPEND EVALUATE
LOOP ;

MAKE-REGVARS

: tt1 CR ." \ tt1 " TIMER-RESET
0 TO_a 1 TO_b 2 TO_c 3 TO_d 4 TO_e 5 TO_f 6 TO_g 7 TO_h
0 #1000000000
0 DO a + b + c + d + e + f + g + h + LOOP .ELAPSED SPACE . ;

0 VALUE t0 0 VALUE t1 0 VALUE t2 0 VALUE t3
0 VALUE t4 0 VALUE t5 0 VALUE t6 0 VALUE t7

: tt2 CR ." \ tt2 " TIMER-RESET
0 TO t0 1 TO t1 2 TO t2 3 TO t3 4 TO t4 5 TO t5 6 TO t6 7 TO t7
0 #1000000000
0 DO t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + LOOP .ELAPSED SPACE . ;

FORTH> tt1 tt2 ( disappointing ... )
\ tt1 2.485 seconds elapsed. 28000000000
\ tt2 1.730 seconds elapsed. 28000000000 ok

-marcel
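
The numbers in this post can be cross-checked with a few lines of standard Forth. The sketch below assumes 64-bit cells and takes the locals figure as 0.113 seconds, the same unit as the SHAspeed output; expected-sum and percent are names made up for the illustration.

```forth
\ Checksum both loops should print: iterations * (0+1+...+7).
: expected-sum ( iterations -- sum )
   0  8 0 DO  I +  LOOP  * ;

1000000000 expected-sum .   \ prints 28000000000

\ Elapsed time of the first run as a percentage of the second (truncated).
: percent ( t1 t2 -- n )  100 SWAP */ ;

2485 1730 percent .   \ prints 143  (tt1 vs tt2, in milliseconds)
142 113 percent .     \ prints 125  (SHAspeed: regvars vs locals)
```

So the register variables cost roughly 44% more time in the micro-benchmark and roughly 26% more in the full SHAspeed run.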
