Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

"In short, _N is Richardian if, and only if, _N is not Richardian."


computers / comp.arch / Re: Proposal for Single instructions for string library functions on My 66000

Re: Proposal for Single instructions for string library functions on My 66000

<sasg75$cj9$1@gioia.aioe.org>

  copy mid

https://www.novabbs.com/computers/article-flat.php?id=18029&group=comp.arch#18029

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!XpBc3qS8ZZIa50C3RMoQkQ.user.gioia.aioe.org.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Proposal for Single instructions for string library functions on
My 66000
Date: Tue, 22 Jun 2021 13:06:14 +0200
Organization: Aioe.org NNTP Server
Lines: 156
Message-ID: <sasg75$cj9$1@gioia.aioe.org>
References: <sar8dp$d9$1@dont-email.me>
NNTP-Posting-Host: XpBc3qS8ZZIa50C3RMoQkQ.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.7.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 22 Jun 2021 11:06 UTC

Stephen Fuld wrote:
> I was content to assume the use of VVM for most/all of the C string
> library functions on Mitches MY 66000.  The capabilities, especially the
> ones “underneath” the ISA level to stream values from/to memory and
> do multiple byte compares in a single cycle, make a huge improvement in
> the performance of these functions.
>
> However, once Mitch introduced the MM (Memory Move) instruction, which
> makes a single instruction out of what would otherwise be a short VVM
> sequence of instructions, that made me try to think about the issues
> involved in adding single instructions to implement (perhaps some of)
> the other string functions.  This is what I have so far.
>
> A few caveats.
>
> First, IANAHG, and this certainly not a detailed proposal.  I don’t
> know enough to make one.  It is just some thoughts on the issue.  Many
> of these ideas and statements may be wrong, and of course, I welcome
> corrections.
>
> Second, I looked a little but could not find information giving the
> frequency of use of each of the functions, so any ideas there are just
> my, probably poor, intuition. Again, I welcome more information.
>
> Third, I realize that Mitch added the MM instruction to speed up
> structure moves, and this means higher use of the MM instruction than
> strictly as a library function replacement.  But this provides the
> infrastructure making the implementation of similar instructions easier.
> (i.e. arbitrary number of TLB misses per instruction,
> interruptable/resumable instructions, etc.)  And several of the proposed
> instructions have other uses too.
>
> So, is it worth considering?
>
> The big pro to this is performance.  While the VVM solution eliminates
> the cost of fetching, decoding and executing the multiple instructions
> in the loop on other than the first pass, there is still more cost to
> executing the several instructions in the loop once (on the first pass)
> than doing so for a single instruction.    However this is usually a one
> time cost, and may not by huge when talking about a longer running
> function.  But if the function is, in fact, long running, then it it
> more likely to encounter an interrupt, in which case the “first
> pass” cost is encountered again.  Similarly, a single instruction
> takes less space in the instruction stream and the I-cache than a
> sequence of several instructions. (Note that this assumes in-lining the
> function. If it is called, there is even more overhead.)  And functions
> like strcat require two VVM loops to complete, although upon an
> interrupt, only one of them must incur the extra cost.  Again, I
> didn’t think this was huge, but the fact that Mitch thought it was
> worthwhile enough to eliminate made me reconsider.
>
> For some functions, there are opportunities for further substantial
> performance gains.  (see below)
>
> The big cons are the cost of implementing any new instruction.  First,
> there is the cost of the gates to do it, although for many of the
> functions, they are already there, so it adds only the logic to invoke
> them.
>
> Second is the cost of additional op-codes.  While there are twenty some
> functions, I don’t propose anywhere close to that many op codes.  I
> suspect that some of the functions, especially the ones that require an
> additional potential character substitution per character (e.g. the
> localization functions) aren’t good candidates for single instruction
> implementation.  I also probably wouldn’t do the errno lookup
> function, as presumably it is infrequently called and never in the
> critical path.
>
> And many of the functions are essentially small modifications of others,
> e.g. strcmp and strncmp.  These two can be handled using a single op
> code but using the Carry meta instruction to indicate “use the n
> version” and specify the register that contains n.  Applying this
> logic to other similar cases reduces the number of op codes further.
> Even where you don’t need the additional register, you could use the
> presence of the Carry indicator bit to modify the instruction, e.g.
> strchr and strrchr.  There are several choices for how to implement
> this.  I don’t pretend to know the best one.
>
> Lastly, there are a number of functions, mostly the “nested loop”
> ones that would gain substantial benefit from being able to use an
> instruction that implements another string function as a “building
> block” to speed them up, even without a dedicated op code.  See below
> for an example.
>
> Combining all of these, I think you could get down to a single digit
> number of new op codes for most of the desired functionality.
>
> The “nested loop” functions are the ones, such as strpbrk that
> require you to code a nested loop, the outer loop going over the first
> string, the inner loop going over the second string.  The code that is
> in the outer loop, but not in the inner loop is just loading the next
> byte and checking it for a value of zero.  This will work, but there is
> a performance issue.  VVM loops can’t be nested.  So, assuming you use
> VVM for the inner loop,  the outer loop will cost relatively a lot, as
> it can’t use the streaming and multi byte compare capabilities of VVM.
> But, if you have the single instruction that searches for a character
> match in a string (strchr), you can use this single instruction (plus
> perhaps the Carry modifier), as the inner loop, thus enabling you to use
> VVM for the outer loop.  So while you still have essentially a nested
> loop (the strchr instruction is essentially a loop), you have
> substantially sped up the operation.
>
> One last thing.  While thinking about this, I had another idea regarding
> some of these nested loop functions.  I am guessing that for a
> non-trivial percentage of these, the string containing the “to be
> looked for” characters is, in fact, a compile time constant.  An
> example of this is searching text for any “white space” character.
> For these cases, perhaps we can use the features built into the My
> 66000, together with some things that hardware does better than software
> to do better than even the method I outlined in the previous paragraph.
>
> The idea is as follows.  Let’s use strspn as an example. If the
> compiler sees that the second string is a compile time constant, it
> builds a 256 bit map, with a one bit set for the value in each character
> in the string.  It then emits an instruction giving the addresses of the
> first and this newly constructed second string as the two source
> operands. The result operand will be used to hold the count.  When the
> instruction is encountered, the hardware loads the 256 bit (32 byte) bit
> map into one of the available buffers.  It also loads the starting bytes
> of the first string into another streaming buffer.  It then starts going
> through the first string, using the value of each byte as an index into
> the second string (this ability to have a 256 bit index into a 32 byte
> string is the thing that the hardware can do better (faster) than
> software.  The hardware proceeds through the first string, looking up
> each byte and looking for a one bit, incrementing a counter for each
> byte.  The presence of the carry flag could be used to reverse the sense
> of the test, giving strcspn.  The hope is that it can do one byte per
> cycle, or perhaps one byte per two cycles.  In any event, it certainly
> is much faster than a nested loop as it becomes a single loop.    I
> don’t know if the advantages of this are worth the extra
> implementation cost, but I wanted to mention it.
>
> Let me conclude by re-emphasizing that this whole idea (single
> instructions for string functions) might not make sense, or might not be
> worthwhile, or it might be the wrong way to implement the functionality,
> etc.  But I want to present it to get reactions and potential improvements.

This idea tends to break down when all text is utf8, i.e. you can still
handle all the 7-bit US ASCII chars this way but you have to exit out of
the inner loop each time you get to an extended unicode wide char.

I try to write such code that I can simply ignore all the utf8 issues,
but that isn't always possible.

I.e. for my next world count implementation I've already figured out
that I can count characters by skipping (setting the char count
increment to zero) all intermediate utf8 byte values, but counting words
the same way only works if all utf8 sequences that end with a given
final byte value are members of the same set: Either all space/word
separators, or all in-word chars.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

SubjectRepliesAuthor
o Proposal for Single instructions for string library functions on My

By: Stephen Fuld on Mon, 21 Jun 2021

83Stephen Fuld
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor