Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  nodelist  faq  login

Less is more or less more -- Y_Plentyn on #LinuxGER


programming / comp.lang.asm.x86 / Re: Working Boyer-Moore search

SubjectAuthor
* Working Boyer-Moore searchRobert Prins
`* Re: Working Boyer-Moore searchTerje Mathisen
 +- Re: Working Boyer-Moore searchRobert Prins
 `* Re: Working Boyer-Moore searchChristian_Hanné
  `* Re: Working Boyer-Moore searchRobert Prins
   `* Re: Working Boyer-Moore searchBernhard Schornak
    `* Re: Working Boyer-Moore searchRobert Prins
     `- Re: Working Boyer-Moore searchRobert Prins

1
Subject: Working Boyer-Moore search
From: Robert Prins
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Sat, 31 Oct 2020 20:01 UTC
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rob...@nospicedham.prino.org (Robert Prins)
Newsgroups: comp.lang.asm.x86
Subject: Working Boyer-Moore search
Date: Sat, 31 Oct 2020 20:01:07 +0000
Organization: A noiseless patient Spider
Lines: 18
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <rnk8t3$e1p$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="5b6fba51c6005ceaeb892de70017a230";
logging-data="20374"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+FhoEHIFjXEna3pmt/apTgac8qNUKxcmc="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.3.2
Cancel-Lock: sha1:LBuu4avMaR/YKN/upsog26zBo8I=
View all headers
Hi all,

There's one on the masm32 site, I found a Pascal version @ http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html but neither of them seems to work correctly, the masm32 version is unable to find one particular string, the Pascal version only finds 9 occurrences of one particular (other) string.

Does anyone have the mixed source/assembler output of a working 32-bit version written in C?

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html



Subject: Re: Working Boyer-Moore search
From: Terje Mathisen
Newsgroups: comp.lang.asm.x86
Organization: Aioe.org NNTP Server
Date: Sun, 1 Nov 2020 14:35 UTC
References: 1
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: terje.ma...@nospicedham.tmsw.no (Terje Mathisen)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Sun, 1 Nov 2020 15:35:40 +0100
Organization: Aioe.org NNTP Server
Lines: 280
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <rnmh3s$17p1$1@gioia.aioe.org>
References: <rnk8t3$e1p$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="1a043ffbe3bfd2797c655366f66ad5ee";
logging-data="31198"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1864cVCby9ZebpdBMNFb3fuHAaURoc6HwA="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.4
Cancel-Lock: sha1:AFb1wXaykKTasor5lwLRhs0E/ww=
View all headers
Robert Prins wrote:
Hi all,

There's one on the masm32 site, I found a Pascal version @ http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html but neither of them seems to work correctly, the masm32 version is unable to find one particular string, the Pascal version only finds 9 occurrences of one particular (other) string.

Does anyone have the mixed source/assembler output of a working 32-bit version written in C?

Robert
Here's some random code I wrote in 1998, its main calim to fame is that it should allow you to search in a case-independent way inside unicode strings, while still using very little setup/&table memory.

It works by generating two copies of the search string, forcing one to UPPERCASE_ONLY and the other to lowercase_only, then when verifying a match is is OK if at least one of the forms fits at any given position.

This means that it works OK even if the function convert to/from upper/lower case is complicated & slow.

Terje


#include <stddef.h> // wchar_t definition
#include <windows.h> // To get the monocasing functions!

#define _memuprw(buf, len) CharUpperBuffW(buf, len)
#define _memlwrw(buf, len) CharLowerBuffW(buf, len)
#define _hashw(h) (h & 255) // _very_ simplistic hash from Unicode to [0..255]

#define _memupra(buf, len) CharUpperBuffA((char *) buf, len)
#define _memlwra(buf, len) CharLowerBuffA((char *) buf, len)

typedef unsigned char achar_t;

class wboyer_moore {
private:
wchar_t *lPattern;
wchar_t *uPattern;
wchar_t *lPatEnd;
wchar_t *uPatEnd;
unsigned plen;
bool ignorecase;
unsigned char skiptable[256];

bool init(wchar_t *pattern, unsigned len, bool icase);
public:
wboyer_moore(wchar_t *pattern, unsigned len, bool icase);
~wboyer_moore(void);
wchar_t *search(wchar_t *buf, unsigned buflen);
};

bool wboyer_moore::init(wchar_t *pattern, unsigned len, bool icase)
{
plen = len;

ignorecase = icase;
lPattern = uPattern = NULL;
if (len == 0) return true;

lPattern = new wchar_t [len];
if (!lPattern) return false;

memcpy(lPattern, pattern, len * sizeof(wchar_t));
uPattern = lPattern;
if (icase) {
uPattern = new wchar_t [len];
if (!uPattern) return false;

memcpy(lPattern, pattern, len * sizeof(wchar_t));
_memlwrw(lPattern, len);
_memuprw(uPattern, len);
}
if (len > 1) {
unsigned l, skip, p;

lPatEnd = lPattern + len - 1;
uPatEnd = uPattern + len - 1;

l = min(len, 255);
memset(skiptable, l, sizeof(skiptable));

for (skip = l - 1, p = plen - len; skip--; p++) {
skiptable[_hashw(lPattern[p])] = skiptable[_hashw(uPattern[p])] = skip;
}
}
return true;
}

wboyer_moore::wboyer_moore(wchar_t *pattern, unsigned len, bool icase)
{
init(pattern, len, icase);
}

wboyer_moore::~wboyer_moore(void)
{
if (lPattern) delete lPattern;
if (ignorecase && uPattern) delete uPattern;
}

wchar_t *wboyer_moore::search(wchar_t *buf, unsigned buflen)
{
if (buflen < plen) return NULL; // Buffer too shart, cannot match

wchar_t *bufend = buf + buflen;

if (plen <= 1) {
if (plen == 0) return buf; // An empty search pattern will always match

wchar_t l = lPattern[0];
if (ignorecase) {
wchar_t u = uPattern[0];

do {
wchar_t w;

if (((w = *buf) == l) || (w == u)) return buf;
} while (++buf < bufend);
return NULL;
}
// case-sensitive search for a single char, could use _wmemchr()!
do {
if (*buf == l) return buf;
} while (++buf < bufend);
return NULL;
}

// The search pattern is at least two characters long, so use the skip table:

wchar_t *b = buf + plen - 1; // Starting position for end of pattern
do { // Repeat until the end of the search buffer
unsigned skip;
do { // repeat until we find a possible matching endpoint
skip = skiptable[_hashw(*b)];
b += skip; // Guaranteed safe skip length
if (b >= bufend) return NULL;
} while (skip);

// We have a possible match on the endpoint, check the full pattern:

wchar_t *pl = lPatEnd, *pu = uPatEnd, *pb = b, w;
for (unsigned l = plen; ( (w = *pb) == *pl) || (w == *pu); pb--, pl--, pu--) {
if (--l == 0) return pb;
}
} while (++b < bufend); // Increment end pointer and try again
return NULL;
}

class aboyer_moore {
private:
achar_t *lPattern;
achar_t *uPattern;
achar_t *lPatEnd;
achar_t *uPatEnd;
unsigned plen;
bool ignorecase;
unsigned char skiptable[256];

bool init(achar_t *pattern, unsigned len, bool icase);
public:
aboyer_moore(achar_t *pattern, unsigned len, bool icase);
~aboyer_moore(void);
achar_t *search(achar_t *buf, unsigned buflen);
};

bool aboyer_moore::init(achar_t *pattern, unsigned len, bool icase)
{
plen = len;

ignorecase = icase;
lPattern = uPattern = NULL;
if (len == 0) return true;

lPattern = new achar_t [len];
if (!lPattern) return false;

memcpy(lPattern, pattern, len * sizeof(achar_t));
uPattern = lPattern;
if (icase) {
uPattern = new achar_t [len];
if (!uPattern) return false;

memcpy(lPattern, pattern, len * sizeof(achar_t));
_memlwra(lPattern, len);
_memupra(uPattern, len);
}
if (len > 1) {
unsigned l, skip, p;

lPatEnd = lPattern + len - 2;
uPatEnd = uPattern + len - 2;

l = min(len, 255);
memset(skiptable, l, sizeof(skiptable));

for (skip = l - 1, p = plen - len; skip--; p++) {
skiptable[lPattern[p]] = skiptable[uPattern[p]] = skip;
}
}
return true;
}

aboyer_moore::aboyer_moore(achar_t *pattern, unsigned len, bool icase)
{
init(pattern, len, icase);
}

aboyer_moore::~aboyer_moore(void)
{
if (lPattern) delete lPattern;
if (ignorecase && uPattern) delete uPattern;
}

achar_t *aboyer_moore::search(achar_t *buf, unsigned buflen)
{
if (buflen < plen) return NULL; // Buffer too shart, cannot match

achar_t *bufend = buf + buflen;

if (plen <= 1) {
if (plen == 0) return buf; // An empty search pattern will allways match

achar_t l = lPattern[0];
if (ignorecase) {
achar_t u = uPattern[0];

do {
achar_t a;

if (((a = *buf) == l) || (a == u)) return buf;
} while (++buf < bufend);
return NULL;
}
// case-sensitive search for a single char, could use _memchr()!
do {
if (*buf == l) return buf;
} while (++buf < bufend);
return NULL;
}

// The search pattern is at least two characters long, so use the skip table:

achar_t *b = buf + plen - 1; // Starting position for end of pattern
do { // Repeat until the end of the search buffer
unsigned skip;
do { // repeat until we find a matching endpoint
skip = skiptable[*b];
b += skip; // Guaranteed safe skip length
if (b >= bufend) return NULL;
} while (skip);

// We have a match on the endpoint, check the rest of the pattern:

achar_t *pl = lPatEnd, *pu = uPatEnd, *pb = b - 1, a;
for (unsigned l = plen; ( (a = *pb) == *pl) || (a == *pu); pb--, pl--, pu--) {
if (--l == 0) return pb;
}
} while (++b < bufend); // Increment end pointer and try again
return NULL;
}

int main(int, char **)
{
return 0;
}


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"



Subject: Re: Working Boyer-Moore search
From: Robert Prins
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Mon, 2 Nov 2020 01:08 UTC
References: 1 2
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rob...@nospicedham.prino.org (Robert Prins)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Mon, 2 Nov 2020 01:08:14 +0000
Organization: A noiseless patient Spider
Lines: 55
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <rnnf8u$cp5$1@dont-email.me>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="f3e933e45498ac71fefe1b38d774a3f8";
logging-data="15104"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LwNz7LD89wdhoKg8hlRA66xfZ8X9pz8A="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.3.2
Cancel-Lock: sha1:zdL1L1l0jDA6v9j9G1m54jcyiJU=
View all headers
On 2020-11-01 14:35, Terje Mathisen wrote:
Robert Prins wrote:
Hi all,

There's one on the masm32 site, I found a Pascal version @ http://www.algorytm.org/przetwarzanie-tekstu/algorytm-bm-boyer-moorea/bm-d.html but neither of them seems to work correctly, the masm32 version is unable to find one particular string, the Pascal version only finds 9 occurrences of one particular (other) string.

Does anyone have the mixed source/assembler output of a working 32-bit version written in C?

Robert
Here's some random code I wrote in 1998, its main calim to fame is that it should allow you to search in a case-independent way inside unicode strings, while still using very little setup/&table memory.

It works by generating two copies of the search string, forcing one to UPPERCASE_ONLY and the other to lowercase_only, then when verifying a match is is OK if at least one of the forms fits at any given position.

This means that it works OK even if the function convert to/from upper/lower case is complicated & slow.

Hi Terje,

<very big snip>

That's way over my head.

I found a version in the "strutils.pp" unit in the FreePascal source archive, and after a bit of fiddling, it compiles with Virtual Pascal, and unlike the two versions mentioned above it works.

As for the code?

Setting up the bad-character table in all three is the same, the good suffix generation code is different, and I've not compared the tables generated, spending (a bit) more time on doing something about the generated code, which is, as usual with VP, less than optimal, and although my code is better, it's probably far from optimal.

As for the why of rolling my own BM, when "sed" can do what needs to be done, adding RTF tags to files, way faster and easier (and without the silly 255 restriction on string-lengths of VP), I just like my programs to be free of external dependencies. ;)

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html



Subject: Re: Working Boyer-Moore search
From: Christian_Hanné
Newsgroups: comp.lang.asm.x86
Organization: Aioe.org NNTP Server
Date: Sat, 7 Nov 2020 08:56 UTC
References: 1 2
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!3.eu.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!news.swapon.de!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: the.ha...@nospicedham.gmail.com (Christian_Hanné)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Sat, 7 Nov 2020 09:56:33 +0100
Organization: Aioe.org NNTP Server
Lines: 2
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <ro5nfv$1nou$1@gioia.aioe.org>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="939e10cac9a49dbedeb97b4dc0adc709";
logging-data="9912"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Q275Ia5+esSrvhGPgtt/6qSRpoS3MH/0="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.4.1
Cancel-Lock: sha1:IOk2v6nMGM4EG0Bhjy1JEH0gSak=
View all headers
Better in asm, that would be multiple times faster.



Subject: Re: Working Boyer-Moore search
From: Robert Prins
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Sun, 8 Nov 2020 00:16 UTC
References: 1 2 3
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rob...@nospicedham.prino.org (Robert Prins)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Sun, 8 Nov 2020 00:16:37 +0000
Organization: A noiseless patient Spider
Lines: 21
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <ro76gg$o1h$1@dont-email.me>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
<ro5nfv$1nou$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="939e10cac9a49dbedeb97b4dc0adc709";
logging-data="28116"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19kay0QYrzEpfZexnvhzyI16YNsKCsqWGw="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.4.0
Cancel-Lock: sha1:CCkDGcLfVgfl9HHTpiYAmjVJGzI=
View all headers
On 2020-11-07 08:56, Christian Hanné wrote:
Better in asm, that would be multiple times faster.

I've converted the FreePascal version into assembler, and although it's likely a (fair) bit slower than the version on masm32, it works. At some stage I might go back to it and see if I can squeeze a few more nanoseconds out of it, but when it comes back with 360 (all occurrences of "Oostende" in a 1.7Mb) file almost at the moment I release the Enter key, it's probably fast enough, and I expect that the RTF-ification, replacing the string to be replaced, the current year, with "{\cf1\highlight2 2020}" in a 157Kb file will take rather a bit more time, I have to think of a smart way of not moving that amount of data (or probably half of it) 92 times. Then again, moving a total of 15Mb is probably peanuts...


Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html



Subject: Re: Working Boyer-Moore search
From: Bernhard Schornak
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Sun, 8 Nov 2020 12:06 UTC
References: 1 2 3 4
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: schor...@nospicedham.web.de (Bernhard Schornak)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Sun, 8 Nov 2020 13:06:19 +0100
Organization: A noiseless patient Spider
Lines: 17
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <ro8n1s$1d8$2@dont-email.me>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
<ro5nfv$1nou$1@gioia.aioe.org> <ro76gg$o1h$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="ff05b29321c4436e7aa4d208580ef491";
logging-data="6360"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19NjZR58aEDOQX3aPPFQ6QFxrSBQ3uGayY="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.1
Cancel-Lock: sha1:88/z+KWNW1gecv8XFPQpPNjIUdU=
View all headers
Robert Prins wrote:


(...), I have to think of a smart way of not moving that amount of data (or probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...


If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.


Greetings from Augsburg

Bernhard Schornak



Subject: Re: Working Boyer-Moore search
From: Robert Prins
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Sun, 8 Nov 2020 18:50 UTC
References: 1 2 3 4 5
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rob...@nospicedham.prino.org (Robert Prins)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Sun, 8 Nov 2020 18:50:18 +0000
Organization: A noiseless patient Spider
Lines: 31
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <ro97ok$ru3$1@dont-email.me>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
<ro5nfv$1nou$1@gioia.aioe.org> <ro76gg$o1h$1@dont-email.me>
<ro8n1s$1d8$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="ff05b29321c4436e7aa4d208580ef491";
logging-data="437"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Y5R3+NIpWO6zL0Xtym+JA+dj7MsyfmqQ="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.4.1
Cancel-Lock: sha1:9hALDEYnJR6mw/aH2ytvqCx2a6I=
View all headers
On 2020-11-08 12:06, Bernhard Schornak wrote:
Robert Prins wrote:
(...), I have to think of a smart way of not moving that amount of data (or probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...
 >
If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.

Yes, that's the best way, but I might settle for a compromise and scan each line before it's added to the output buffer, and go for a part copy, add replacement, copy more, add another replacement, followed by a final copy. The added replacement just consists of 22 characters, so a single YMM VMOVDQU.

As for the data, there are currently 92 occurrences of 2020, and 86 of them are the only one on a line (in a files with 767 lines). Next year, 2021, will add 2+8 lines, 2022 2+8+50 and the years after that 2+8 again until 2029 or so.

The number of replacements (at the end of the calendar year) is right now maxed out at 94, but as the data is spread over a multitude of tables sorted in various orders, there's no easier way of adding the highlighting, other than of course using a program like "sed".

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html



Subject: Re: Working Boyer-Moore search
From: Robert Prins
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Tue, 10 Nov 2020 21:40 UTC
References: 1 2 3 4 5 6
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: rob...@nospicedham.prino.org (Robert Prins)
Newsgroups: comp.lang.asm.x86
Subject: Re: Working Boyer-Moore search
Date: Tue, 10 Nov 2020 21:40:56 +0000
Organization: A noiseless patient Spider
Lines: 115
Approved: fbkotler@myfairpoint.net - comp.lang.asm.x86 moderation team.
Message-ID: <roeqgb$ns9$1@dont-email.me>
References: <rnk8t3$e1p$1@dont-email.me> <rnmh3s$17p1$1@gioia.aioe.org>
<ro5nfv$1nou$1@gioia.aioe.org> <ro76gg$o1h$1@dont-email.me>
<ro8n1s$1d8$2@dont-email.me> <ro97ok$ru3$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="001654300e8e38438253cea77f188b6c";
logging-data="13958"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lZwtXC85Y64h1rpJknNSRjPjFoopZpro="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.4.2
Cancel-Lock: sha1:GardIm7lf3rnEVrIWAgZHrGJRoU=
View all headers
On 2020-11-08 18:50, Robert Prins wrote:
On 2020-11-08 12:06, Bernhard Schornak wrote:
Robert Prins wrote:
(...), I have to think of a smart way of not moving that amount of data (or probably half of it) 92 times. Then again, moving a total of 15Mb
is probably peanuts...
 >
If you allocate a second block, where you copy already
compared text from the source file, only (with already
replaced patterns!), it is a simple one pass action.

Yes, that's the best way, but I might settle for a compromise and scan each line before it's added to the output buffer, and go for a part copy, add replacement, copy more, add another replacement, followed by a final copy. The added replacement just consists of 22 characters, so a single YMM VMOVDQU.

As for the data, there are currently 92 occurrences of 2020, and 86 of them are the only one on a line (in a files with 767 lines). Next year, 2021, will add 2+8 lines, 2022 2+8+50 and the years after that 2+8 again until 2029 or so.

The number of replacements (at the end of the calendar year) is right now maxed out at 94, but as the data is spread over a multitude of tables sorted in various orders, there's no easier way of adding the highlighting, other than of course using a program like "sed".

In the end I decided not to use BM to find the "202x" strings, and here's why:

The first two tables both contain a single occurrence, both on columns 3..6, so I just write out the replacement with RTF highlighting string (including the additional columns 1..2, "| ", and then use a pointer to virtually move the start of the line to column 7 (with a reduced by 6 bytes) length, and write that one out.

The next five pages contain a table in (up to) four columns, and the "202x" can only be in specific columns, so I check those columns and if there's a "202x" I write out the replacement string followed by the rest of the column, if not I just write out the whole column, and given that it's only 51 characters wide, it's just four "vmovdqu" instructions in both cases.

The next four pages each contain 2x6 tables (Jan-Jun/Jul-Dec) and the code is nearly the same, just taking into account that the columns are a bit narrower, and while I'm writing this, I realize that edx is not used. Using it would allow me to merge routines:

Note that ebx contain the address line to be changed, esi the "202x"

@s0:
   add     ecx, 5

@s1:
   cmp     esi, [ebx + ecx - 4]
   jne     @s2

   vmovdqu [eax], ymm7

   vmovdqu ymm0, [ebx + ecx]
   vmovdqu ymm1, [ebx + ecx + 32]
   vmovdqu [eax + 24], ymm0
   vmovdqu [eax + 24 + 32], ymm1
   add     eax, 69
   jmp     @s3

@s2:
   vmovdqu ymm0, [ebx + ecx - 6]
   vmovdqu ymm1, [ebx + ecx + 26]
   vmovdqu [eax], ymm0
   vmovdqu [eax + 32], ymm1
   add     eax, 51

@s3:
   add     ecx, 46
   cmp     byte ptr [ebx], cl
   ret


@t0:
   add     ecx, 5

@t1:
   cmp     esi, [ebx + ecx - 4]
   jne     @t2

   vmovdqu [eax], ymm7

   vmovdqu ymm0, [ebx + ecx]
   vmovdqu ymm1, [ebx + ecx + 32]
   vmovdqu [eax + 24], ymm0
   vmovdqu [eax + 24 + 32], ymm1
   add     eax, 60
   jmp     @t3

@t2:
   vmovdqu ymm0, [ebx + ecx - 6]
   vmovdqu ymm1, [ebx + ecx + 26]
   vmovdqu [eax], ymm0
   vmovdqu [eax + 32], ymm1
   add     eax, 42

@t3:
   add     ecx, 37
   cmp     byte ptr [ebx], cl
   ret

by changing the last to "add"s in either by a "lea"s.

I'm also pretty sure that my code will be (quite) a bit faster than BM searching each line for "202x".

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/indez.html
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html



1
rocksolid light 0.7.2
clearneti2ptor