checking back chars when scaning - best way ?

<6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21482&group=comp.lang.c#21482

X-Received: by 2002:a37:9ed7:0:b0:69e:a6bf:cc37 with SMTP id h206-20020a379ed7000000b0069ea6bfcc37mr13503872qke.744.1650987325921;
Tue, 26 Apr 2022 08:35:25 -0700 (PDT)
X-Received: by 2002:a05:622a:6114:b0:2f0:ffc8:53f8 with SMTP id
hg20-20020a05622a611400b002f0ffc853f8mr15660098qtb.681.1650987325723; Tue, 26
Apr 2022 08:35:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 08:35:25 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.134; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.134
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
Subject: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 15:35:25 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 15

by: fir - Tue, 26 Apr 2022 15:35 UTC

assume you have a routine that takes a long string/literal char* text
as input

i need to scan it for some patterns like "..."; in this case it is
convenient to check

for(char* c = text; c<text+end; c++)
    if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found

the problem is that it goes outside the string and reads
two chars before its beginning

what is the best and simplest way to do it correctly?

tnx

Re: checking back chars when scaning - best way ?

<t498dv$a3h0$1@paganini.bofh.team>

https://www.novabbs.com/devel/article-flat.php?id=21483&group=comp.lang.c#21483

Path: i2pn2.org!i2pn.org!paganini.bofh.team!not-for-mail
From: inva...@invalid.net (Jack Lemmon)
Newsgroups: comp.lang.c
Subject: Re: checking back chars when scaning - best way ?
Date: Tue, 26 Apr 2022 17:58:17 +0100
Organization: To protect and to server
Message-ID: <t498dv$a3h0$1@paganini.bofh.team>
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 26 Apr 2022 16:59:43 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="331296"; posting-host="xWoSobfAAAMkUmDG6ndrsw.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team";
X-Notice: Filtered by postfilter v. 0.9.1
Content-Language: en-US

by: Jack Lemmon - Tue, 26 Apr 2022 16:58 UTC

On 26/04/2022 16:35, fir wrote:
> assume you have a routine that takes a long string/literal char* text
> as input
>
> i need to scan it for some patterns like "..."; in this case it is
> convenient to check
>
> for(char* c = text; c<text+end; c++)
>     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
>
> the problem is that it goes outside the string and reads
> two chars before its beginning
>
> what is the best and simplest way to do it correctly?
>
> tnx
>
Try using the strstr function:

char *strstr(const char *s1, const char *s2);

<https://www.ibm.com/docs/en/i/7.4?topic=functions-strstr-locate-substring>
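
A minimal sketch of that suggestion, assuming the input is a NUL-terminated string (the original post does not say so). count_ellipses is a hypothetical helper name, not something from the thread. strstr never looks before the start of the buffer, so the look-back problem does not arise:

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of "..." in a NUL-terminated string.
   Hypothetical helper, for illustration only. */
size_t count_ellipses(const char *text)
{
    size_t count = 0;
    const char *p = text;

    while ((p = strstr(p, "...")) != NULL) {
        count++;
        p += 3;   /* skip past this match; use p += 1 to count overlaps */
    }
    return count;
}
```

Note that this only fits when the pattern search is the whole job; it does not help if other per-character work has to happen in the same loop.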

Re: checking back chars when scaning - best way ?

<t498eb$ek5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21484&group=comp.lang.c#21484

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: richard....@gmail.com (Richard Harnden)
Newsgroups: comp.lang.c
Subject: Re: checking back chars when scaning - best way ?
Date: Tue, 26 Apr 2022 17:59:54 +0100
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <t498eb$ek5$1@dont-email.me>
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
Reply-To: nospam.harnden@gmail.com
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 26 Apr 2022 16:59:55 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d4b30662cb5b91fd8adc101ca3c22a94";
logging-data="14981"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/UtSuPduXIVCqJUHI1djWUEQvyjUSz0O4="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.8.1
Cancel-Lock: sha1:UXUDWt1IAe4kEJ+XBc/gY5aMC74=
In-Reply-To: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>

by: Richard Harnden - Tue, 26 Apr 2022 16:59 UTC

What is wrong with: strstr(text, "...") ?

Re: checking back chars when scaning - best way ?

<f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21485&group=comp.lang.c#21485

X-Received: by 2002:a05:622a:13c8:b0:2f3:5421:d64d with SMTP id p8-20020a05622a13c800b002f35421d64dmr16212865qtk.43.1650993238930;
Tue, 26 Apr 2022 10:13:58 -0700 (PDT)
X-Received: by 2002:a05:620a:424e:b0:67e:4c1b:baef with SMTP id
w14-20020a05620a424e00b0067e4c1bbaefmr14112281qko.778.1650993238717; Tue, 26
Apr 2022 10:13:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 10:13:58 -0700 (PDT)
In-Reply-To: <t498eb$ek5$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.226; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.226
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com> <t498eb$ek5$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 17:13:58 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 92

by: fir - Tue, 26 Apr 2022 17:13 UTC

On Tuesday, 26 April 2022 at 19:00:08 UTC+2, Richard Harnden wrote:
> On 26/04/2022 16:35, fir wrote:
> > assume you have a routine that takes a long string/literal char* text
> > as input
> >
> > i need to scan it for some patterns like "..."; in this case it is
> > convenient to check
> >
> > for(char* c = text; c<text+end; c++)
> >     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
> >
> > the problem is that it goes outside the string and reads
> > two chars before its beginning
> >
> > what is the best and simplest way to do it correctly?
> >
> > tnx
> >
> What is wrong with: strstr(text, "...") ?

i generally need this loop to be preserved
for(char* c = text; c<text+end; c++)
{
    //access *c here
}

because i do more things in that loop; besides, this kind of scanning is a more general pattern which i encounter from time to time

if you are curious, for example, this is the pattern of what i call a splitter function, which is a very core function
in the various text-based programming i do (this splitter breaks the input text into, say, logical lines ready for
use by the rest of the code). for various things you could have various splitters, but the basic one is like

chunks SplitOnLogLines4Furia(chunk text )
{

    int current_chunk_number = 0;

    static chunk* ram4chunks = NULL;
    ram4chunks = (chunk*) realloc(ram4chunks, (current_chunk_number+1 )* sizeof(chunk) );
    ram4chunks[current_chunk_number].beg = text.beg;

    int parenthesis_count =0;
    int inside_commentary =0;

    for(char* c = text.beg; c <= text.end; c++ )
    {
        if(*c== '\"') quote_mark_count++;
        if(*c== '/' && *(c-1)== '/') inside_commentary = 1;

        int do_break;
        do_break = 0;

        if(!inside_commentary)
            if(*c==0x0a ||*c==',')
                if(quote_mark_count%2==0)
                    do_break=1;

        if(inside_commentary)
            if(*c==0x0a)
                do_break=1;

        if(do_break)
        {
            //do break
            inside_commentary = 0;
            quote_mark_count =0;

            ram4chunks[current_chunk_number].end = c-1;

            if(c+1<=text.end) // open next
            {
                current_chunk_number ++;
                ram4chunks = (chunk*) realloc(ram4chunks, (current_chunk_number+1) * sizeof(chunk) );
                ram4chunks[current_chunk_number].beg = c + 1;
            }
        }
    }
    ram4chunks[current_chunk_number].end = text.end; //finish last
    //send output
    chunks splited = {ram4chunks, ram4chunks + current_chunk_number} ;
    return splited;

}

Re: checking back chars when scaning - best way ?

<a8227975-a087-4280-89c9-4d072c95ce53n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21487&group=comp.lang.c#21487

X-Received: by 2002:a05:622a:3d3:b0:2e2:1294:5817 with SMTP id k19-20020a05622a03d300b002e212945817mr16233578qtx.638.1650993496087;
Tue, 26 Apr 2022 10:18:16 -0700 (PDT)
X-Received: by 2002:a37:b605:0:b0:69e:6d6f:aea7 with SMTP id
g5-20020a37b605000000b0069e6d6faea7mr13645237qkf.655.1650993495937; Tue, 26
Apr 2022 10:18:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 10:18:15 -0700 (PDT)
In-Reply-To: <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.226; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.226
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t498eb$ek5$1@dont-email.me> <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a8227975-a087-4280-89c9-4d072c95ce53n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 17:18:16 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 93

by: fir - Tue, 26 Apr 2022 17:18 UTC

On Tuesday, 26 April 2022 at 19:14:06 UTC+2, fir wrote:
> On Tuesday, 26 April 2022 at 19:00:08 UTC+2, Richard Harnden wrote:
> > On 26/04/2022 16:35, fir wrote:
> > > assume you have a routine that takes a long string/literal char* text
> > > as input
> > >
> > > i need to scan it for some patterns like "..."; in this case it is
> > > convenient to check
> > >
> > > for(char* c = text; c<text+end; c++)
> > >     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
> > >
> > > the problem is that it goes outside the string and reads
> > > two chars before its beginning
> > >
> > > what is the best and simplest way to do it correctly?
> > >
> > > tnx
> > >
> > What is wrong with: strstr(text, "...") ?
> i generally need this loop to be preserved
> for(char* c = text; c<text+end; c++)
> {
>     //access *c here
> }
>
> because i do more things in that loop; besides, this kind of scanning is a more general pattern which i encounter from time to time
>
> if you are curious, for example, this is the pattern of what i call a splitter function, which is a very core function
> in the various text-based programming i do (this splitter breaks the input text into, say, logical lines ready for
> use by the rest of the code). for various things you could have various splitters, but the basic one is like
>
> chunks SplitOnLogLines4Furia(chunk text )
> {
>
> int current_chunk_number = 0;
>
> static chunk* ram4chunks = NULL;
> ram4chunks = (chunk*) realloc(ram4chunks, (current_chunk_number+1 )* sizeof(chunk) );
> ram4chunks[current_chunk_number].beg = text.beg;
>
> int parenthesis_count =0;

should be int quote_mark_count = 0;

> int inside_commentary =0;
>
> for(char* c = text.beg; c <= text.end; c++ )
> {
> if(*c== '\"') quote_mark_count++;
> if(*c== '/' && *(c-1)== '/') inside_commentary = 1;
>
>
> int do_break;
> do_break = 0;
>
> if(!inside_commentary)
> if(*c==0x0a ||*c==',')
> if(quote_mark_count%2==0)
> do_break=1;
>
> if(inside_commentary)
> if(*c==0x0a)
> do_break=1;
>
> if(do_break)
> {
> //do break
> inside_commentary = 0;
> quote_mark_count =0;
>
> ram4chunks[current_chunk_number].end = c-1;
>
> if(c+1<=text.end) // open next
> {
> current_chunk_number ++;
> ram4chunks = (chunk*) realloc(ram4chunks, (current_chunk_number+1) * sizeof(chunk) );
> ram4chunks[current_chunk_number].beg = c + 1;
> }
> }
> }
> ram4chunks[current_chunk_number].end = text.end; //finish last
> //send output
> chunks splited = {ram4chunks, ram4chunks + current_chunk_number} ;
> return splited;
>
> }

Re: checking back chars when scaning - best way ?

<909c4af1-1634-4d64-a8f5-57367cc67773n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21489&group=comp.lang.c#21489

X-Received: by 2002:a05:620a:d87:b0:67b:311c:ecbd with SMTP id q7-20020a05620a0d8700b0067b311cecbdmr14040720qkl.146.1650994335941;
Tue, 26 Apr 2022 10:32:15 -0700 (PDT)
X-Received: by 2002:a37:a716:0:b0:69f:7e9b:9762 with SMTP id
q22-20020a37a716000000b0069f7e9b9762mr2086961qke.33.1650994335785; Tue, 26
Apr 2022 10:32:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 10:32:15 -0700 (PDT)
In-Reply-To: <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.93; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.93
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t498eb$ek5$1@dont-email.me> <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <909c4af1-1634-4d64-a8f5-57367cc67773n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 17:32:15 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 21

by: fir - Tue, 26 Apr 2022 17:32 UTC

On Tuesday, 26 April 2022 at 19:14:06 UTC+2, fir wrote:
>
> for(char* c = text.beg; c <= text.end; c++ )
> {
> if(*c== '\"') quote_mark_count++;
> if(*c== '/' && *(c-1)== '/') inside_commentary = 1;
>

probably a sufficient way here is

if(c-text.beg>=1) if(*c== '/' && *(c-1)== '/') inside_commentary = 1;

though this is probably not optimal at runtime (and i may put megabyte-long texts through it too)

changing the loop header is strongly unwelcome to me because i like the
clarity and generality of this loop (scan)

in this situation i would prefer the slower form for my typical use, unless someone maybe sees something better?

Re: checking back chars when scaning - best way ?

<25f25e31-e854-4935-adf2-88809ee02753n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21490&group=comp.lang.c#21490

X-Received: by 2002:a05:620a:74b:b0:69b:db1d:f91e with SMTP id i11-20020a05620a074b00b0069bdb1df91emr13925716qki.286.1650994686746;
Tue, 26 Apr 2022 10:38:06 -0700 (PDT)
X-Received: by 2002:ac8:7f95:0:b0:2f3:479d:1c1d with SMTP id
z21-20020ac87f95000000b002f3479d1c1dmr16300260qtj.345.1650994686562; Tue, 26
Apr 2022 10:38:06 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 10:38:06 -0700 (PDT)
In-Reply-To: <909c4af1-1634-4d64-a8f5-57367cc67773n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.93; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.93
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t498eb$ek5$1@dont-email.me> <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
<909c4af1-1634-4d64-a8f5-57367cc67773n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25f25e31-e854-4935-adf2-88809ee02753n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 17:38:06 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 30

by: fir - Tue, 26 Apr 2022 17:38 UTC

On Tuesday, 26 April 2022 at 19:32:23 UTC+2, fir wrote:
> On Tuesday, 26 April 2022 at 19:14:06 UTC+2, fir wrote:
> >
> > for(char* c = text.beg; c <= text.end; c++ )
> > {
> > if(*c== '\"') quote_mark_count++;
> > if(*c== '/' && *(c-1)== '/') inside_commentary = 1;
> >
> probably a sufficient way here is
>
> if(c-text.beg>=1) if(*c== '/' && *(c-1)== '/') inside_commentary = 1;
>
> though this is probably not optimal at runtime (and i may put megabyte-long texts through it too)
>
> changing the loop header is strongly unwelcome to me because i like the
> clarity and generality of this loop (scan)
>
> in this situation i would prefer the slower form for my typical use, unless someone maybe sees something better?
well, this is probably better?

if(*c== '/') if(c-text.beg>=1) if( *(c-1)== '/') inside_commentary = 1;

it looks unnatural but spares many ifs
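
The same guard can be folded into one short-circuit condition, which keeps the loop header untouched. This is only a sketch: the chunk layout below (beg at the first char, end at the last char, inclusive) is an assumption inferred from the `c <= text.end` loop, and has_line_comment is a hypothetical helper name.

```c
#include <stdbool.h>

/* Assumed layout: beg points at the first char, end at the last (inclusive). */
typedef struct { const char *beg; const char *end; } chunk;

/* Returns true if the chunk contains "//". Hypothetical helper. */
bool has_line_comment(chunk text)
{
    for (const char *c = text.beg; c <= text.end; c++) {
        /* The rarely-true test *c == '/' comes first, so the bounds test
           runs only on slashes; c > text.beg guarantees that c[-1] is
           inside the buffer before it is read. */
        if (*c == '/' && c > text.beg && c[-1] == '/')
            return true;
    }
    return false;
}
```

Because && evaluates left to right and stops at the first false operand, c[-1] is never read when c is at the start of the buffer.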

Re: checking back chars when scaning - best way ?

<memcmp-20220426185913@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=21491&group=comp.lang.c#21491

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.c
Subject: Re: checking back chars when scaning - best way ?
Date: 26 Apr 2022 17:59:49 GMT
Organization: Stefan Ram
Lines: 14
Expires: 1 Apr 2023 11:59:58 GMT
Message-ID: <memcmp-20220426185913@ram.dialup.fu-berlin.de>
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com> <t498eb$ek5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de f4o9UJ4JjSt+yeVajLYvIQrfdC8TmlWZdvGtDXIvD7PFHF
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Tue, 26 Apr 2022 17:59 UTC

Richard Harnden <richard.nospam@gmail.com> writes:
>On 26/04/2022 16:35, fir wrote:
....
>>for(char* c = text; c<text+end; c++)
>>if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
....
>What is wrong with: strstr(text, "...") ?

This has a different meaning.

WRT the OP code: c[ -2 ] is permissible only when c >= text + 2.
Under this assumption, one might also use !memcmp( c-2, "...", 3 ).
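
A minimal sketch of the memcmp form with the precondition made explicit, assuming the text is given as a pointer plus a length; find_ellipsis_end is a hypothetical name used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the last char of the first "..." in text[0..len),
   or NULL if there is none. Hypothetical helper. */
const char *find_ellipsis_end(const char *text, size_t len)
{
    for (const char *c = text; c < text + len; c++) {
        /* c - text >= 2 ensures that c - 2 is still inside the buffer. */
        if (c - text >= 2 && memcmp(c - 2, "...", 3) == 0)
            return c;
    }
    return NULL;
}
```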

Re: checking back chars when scaning - best way ?

<89de3679-cf88-4747-842c-bf1664b525a8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21493&group=comp.lang.c#21493

X-Received: by 2002:a05:6214:21c1:b0:450:5583:6595 with SMTP id d1-20020a05621421c100b0045055836595mr16784519qvh.130.1651001746148;
Tue, 26 Apr 2022 12:35:46 -0700 (PDT)
X-Received: by 2002:a05:620a:2416:b0:69f:47fa:595e with SMTP id
d22-20020a05620a241600b0069f47fa595emr7812348qkn.229.1651001745956; Tue, 26
Apr 2022 12:35:45 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 12:35:45 -0700 (PDT)
In-Reply-To: <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.104; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.104
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t498eb$ek5$1@dont-email.me> <f9962476-e7ca-491f-ad09-12faf0b63692n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <89de3679-cf88-4747-842c-bf1664b525a8n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 19:35:46 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 18

by: fir - Tue, 26 Apr 2022 19:35 UTC

btw bartc was once asking about the speed of compilers processing sources

i have now checked how long such a splitter alone takes (though i did not take much care
over careful testing and observation)

100 iterations of splitting 1 MB of some .c code (which has 50k lines before splitting; the
splitting result had 70k entries, as i split on 0xa and ";") took 1.7 seconds,
which is 100 MB in 1.7 seconds, i.e. 58 MB/s and 2.9 M lines per second (however, my
tests were not very careful)
(tested on a 10-year-old pc)
i could say that 58 MB/s does not seem fast, but this is not optimised
this splitter obviously calls realloc on each break, that is, it makes 7M realloc calls in the process
(which are mostly just calls, because only a minority of them do real reallocating)
i have somewhere a version that spares the reallocs but i have not used it lately

wait a second, i will preallocate 100k entries up front and remove the realloc:
0.34 second, 294 MB/s, 14.7 M lines/s - this is more reasonable, especially as it
could probably still be optimised slightly
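
One common middle ground between a realloc per entry and a fixed preallocation is to grow the array geometrically. The sketch below is not the version fir mentions; chunk_list, chunk_list_push, the assumed chunk layout and the starting capacity of 1024 are all made up for illustration.

```c
#include <stdlib.h>

typedef struct { const char *beg; const char *end; } chunk;  /* assumed layout */

typedef struct {
    chunk  *items;
    size_t  count;
    size_t  capacity;
} chunk_list;   /* hypothetical container */

/* Appends one chunk, doubling the capacity when the array is full, so the
   number of realloc calls grows logarithmically with the number of entries.
   Returns 0 on success, -1 if memory could not be obtained. */
int chunk_list_push(chunk_list *list, chunk item)
{
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 1024;
        chunk *p = realloc(list->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;          /* the old block is still valid */
        list->items = p;
        list->capacity = new_cap;
    }
    list->items[list->count++] = item;
    return 0;
}
```

With 70k entries per pass this would call realloc about 7 times instead of 70k times, without having to guess an upper bound in advance.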

Re: checking back chars when scaning - best way ?

<t49kri$mbv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21494&group=comp.lang.c#21494

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: andreyta...@hotmail.com (Andrey Tarasevich)
Newsgroups: comp.lang.c
Subject: Re: checking back chars when scaning - best way ?
Date: Tue, 26 Apr 2022 13:31:45 -0700
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <t49kri$mbv$1@dont-email.me>
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 26 Apr 2022 20:31:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7bce47085e3f486107b30c31d4b3894b";
logging-data="22911"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/PSXvR9XOm5CqOdng7wKlQ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.1
Cancel-Lock: sha1:0k+Hd2MTzSJRy8AXYbJQL8Ib8l0=
In-Reply-To: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
Content-Language: en-US

by: Andrey Tarasevich - Tue, 26 Apr 2022 20:31 UTC

On 4/26/2022 8:35 AM, fir wrote:
> assume you have a routine that takes a long string/literal char* text
> as input
>
> i need to scan it for some patterns like "..."; in this case it is
> convenient to check
>
> for(char* c = text; c<text+end; c++)
>     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
>
> the problem is that it goes outside the string and reads
> two chars before its beginning
>
> what is the best and simplest way to do it correctly?

Um... Start searching from `text + 2`, obviously. And generally from
`pattern_length - 1` position. The pattern cannot possibly occur before
that position, no point in searching there.
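
A minimal sketch of that approach, assuming the text comes as a pointer plus a length; count_triple_dots is a hypothetical name. The length check up front matters, because forming text + 2 for a buffer shorter than two chars is itself not allowed.

```c
#include <stddef.h>

/* Counts positions c where c[-2], c[-1] and c[0] are all '.'.
   The loop starts at text + 2, so no per-iteration bounds check is needed. */
size_t count_triple_dots(const char *text, size_t len)
{
    size_t count = 0;

    if (len < 3)
        return 0;               /* too short to contain the pattern */

    for (const char *c = text + 2; c < text + len; c++) {
        if (c[0] == '.' && c[-1] == '.' && c[-2] == '.')
            count++;
    }
    return count;
}
```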

--
Best regards,
Andrey Tarasevich

Re: checking back chars when scaning - best way ?

<6ff910ed-527e-4697-9bb1-26a350c2cf38n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21495&group=comp.lang.c#21495

X-Received: by 2002:ac8:5789:0:b0:2f3:63d9:62e4 with SMTP id v9-20020ac85789000000b002f363d962e4mr11014803qta.382.1651006267796;
Tue, 26 Apr 2022 13:51:07 -0700 (PDT)
X-Received: by 2002:ad4:594c:0:b0:449:95d6:d715 with SMTP id
eo12-20020ad4594c000000b0044995d6d715mr17590405qvb.115.1651006267581; Tue, 26
Apr 2022 13:51:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 13:51:07 -0700 (PDT)
In-Reply-To: <t49kri$mbv$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.249; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.249
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com> <t49kri$mbv$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6ff910ed-527e-4697-9bb1-26a350c2cf38n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Tue, 26 Apr 2022 20:51:07 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 22

by: fir - Tue, 26 Apr 2022 20:51 UTC

On Tuesday, 26 April 2022 at 22:32:01 UTC+2, Andrey Tarasevich wrote:
> On 4/26/2022 8:35 AM, fir wrote:
> > assume you have a routine that takes a long string/literal char* text
> > as input
> >
> > i need to scan it for some patterns like "..."; in this case it is
> > convenient to check
> >
> > for(char* c = text; c<text+end; c++)
> >     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
> >
> > the problem is that it goes outside the string and reads
> > two chars before its beginning
> >
> > what is the best and simplest way to do it correctly?
> Um... Start searching from `text + 2`, obviously. And generally from
> `pattern_length - 1` position. The pattern cannot possibly occur before
> that position, no point in searching there.
>
it seems most people do it this way, but with respect to what i was saying this seems to me to be kind of a bad pattern

Re: checking back chars when scaning - best way ?

<t49qa1$3dt$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=21497&group=comp.lang.c#21497

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: andreyta...@hotmail.com (Andrey Tarasevich)
Newsgroups: comp.lang.c
Subject: Re: checking back chars when scaning - best way ?
Date: Tue, 26 Apr 2022 15:04:46 -0700
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <t49qa1$3dt$1@dont-email.me>
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t49kri$mbv$1@dont-email.me>
<6ff910ed-527e-4697-9bb1-26a350c2cf38n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 26 Apr 2022 22:04:49 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5cd20dbea4cac0b649a20141e4aae2d2";
logging-data="3517"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18OlO5Zp2JmueaLRGEVyUUI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.1
Cancel-Lock: sha1:NLRUwPa2/BFgbx4UQs4Jhidfolc=
In-Reply-To: <6ff910ed-527e-4697-9bb1-26a350c2cf38n@googlegroups.com>
Content-Language: en-US

by: Andrey Tarasevich - Tue, 26 Apr 2022 22:04 UTC

On 4/26/2022 1:51 PM, fir wrote:
> On Tuesday, 26 April 2022 at 22:32:01 UTC+2, Andrey Tarasevich wrote:
>> On 4/26/2022 8:35 AM, fir wrote:
>>> assume you have a routine that takes a long string/literal char* text
>>> as input
>>>
>>> i need to scan it for some patterns like "..."; in this case it is
>>> convenient to check
>>>
>>> for(char* c = text; c<text+end; c++)
>>>     if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
>>>
>>> the problem is that it goes outside the string and reads
>>> two chars before its beginning
>>>
>>> what is the best and simplest way to do it correctly?
>> Um... Start searching from `text + 2`, obviously. And generally from
>> `pattern_length - 1` position. The pattern cannot possibly occur before
>> that position, no point in searching there.
>>
> it seems most people do it this way, but with respect to what i was saying this seems to me to be kind of a bad pattern

Quite the opposite! This is actually a very elegant and useful pattern,
which I'd personally call "strive to prologue/epilogue the special case"
or something more catchy.

Pretty often when people write cyclic code they run into situations when
the cycle has to watch for a one-off special case, which occurs only
once. Keeping this special processing inside the cycle's body usually
requires checking for it (e.g. with an `if`) on every iteration. Doing
it that way reeks to high heaven: spending an effort on _every_
iteration to catch something that happens _only_ _once_ is a bad pattern.

In many cases, if you put your mind to it, you might discover a way to
pull that one-off special processing into cycle's prologue or epilogue
code, thus relieving the actual cycle from the burden of watching for
that special case. It is a nice, elegant and very useful refactoring
pattern that not only improves efficiency, but also improves readability
of the code.

One example that comes to mind is processing edges of a polygon
represented as an array of vertices. One edge - the one between the last
and the first vertex - is an obvious special case, which can be handled
very nicely by this pattern.
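
One possible reading of the polygon example, sketched with the shoelace formula: the main cycle handles the edges between consecutive vertices, and the closing edge is handled once in the epilogue instead of computing (i + 1) % n or testing for the last vertex on every iteration. The point type and the function names are assumptions made for illustration.

```c
#include <stddef.h>

typedef struct { double x, y; } point;   /* hypothetical vertex type */

/* Cross product term contributed by the edge a -> b (shoelace formula). */
static double edge_term(point a, point b)
{
    return a.x * b.y - b.x * a.y;
}

/* Twice the signed area of the polygon v[0..n); positive when the
   vertices run counter-clockwise. */
double polygon_area2(const point *v, size_t n)
{
    double sum = 0.0;

    if (n < 3)
        return 0.0;             /* not a polygon */

    /* Main cycle: edges between consecutive vertices, no special cases. */
    for (size_t i = 0; i + 1 < n; i++)
        sum += edge_term(v[i], v[i + 1]);

    /* Epilogue: the closing edge from the last vertex back to the first. */
    sum += edge_term(v[n - 1], v[0]);

    return sum;
}
```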

What you have above is a more trivial variation of that pattern as well.

P.S. On a related note: sometimes I see people who seem to be
consciously doing the opposite: for no apparent reason they push cycle's
prologue and/or epilogue _into_ the cycle's body. Something along the
lines of

for (unsigned i = 0; i < n; ++i)
{
    if (i == 0)
    {
        // Initialize some cycle-related stuff
    }

    // The actual "work"

    if (i == n - 1)
    {
        // Cleanup some cycle-related stuff
    }
}

It is weird, but I encounter it often enough to conclude that those
people are "learning" this from someone/someplace and then using this in
a cargo-cult fashion.
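
For contrast, a minimal sketch of the same shape with the set-up and the finishing step moved out of the cycle; mean is a made-up example, not code from the thread.

```c
#include <stddef.h>

/* Arithmetic mean of a[0..n); returns 0.0 for an empty array. */
double mean(const double *a, size_t n)
{
    /* Prologue: set-up that the in-loop form would guard with i == 0. */
    double sum = 0.0;

    /* The actual "work": no per-iteration special-case tests. */
    for (size_t i = 0; i < n; ++i)
        sum += a[i];

    /* Epilogue: the finishing step that the in-loop form would guard
       with i == n - 1. */
    return n ? sum / (double)n : 0.0;
}
```

A side effect worth noting: with the in-loop form, neither the initialization nor the cleanup runs when n is 0, whereas here both always run.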

--
Best regards,
Andrey Tarasevich

Re: checking back chars when scaning - best way ?

<a0382ff0-9b3c-4590-8feb-c2db5adc8a07n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=21500&group=comp.lang.c#21500

X-Received: by 2002:a05:620a:3189:b0:69f:421e:ba00 with SMTP id bi9-20020a05620a318900b0069f421eba00mr10160111qkb.485.1651042314130;
Tue, 26 Apr 2022 23:51:54 -0700 (PDT)
X-Received: by 2002:a05:620a:372a:b0:69f:61e6:f981 with SMTP id
de42-20020a05620a372a00b0069f61e6f981mr6878974qkb.767.1651042313916; Tue, 26
Apr 2022 23:51:53 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Tue, 26 Apr 2022 23:51:53 -0700 (PDT)
In-Reply-To: <t49qa1$3dt$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.92; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.92
References: <6857b639-cde5-422d-bbf3-c9b7500d8ab7n@googlegroups.com>
<t49kri$mbv$1@dont-email.me> <6ff910ed-527e-4697-9bb1-26a350c2cf38n@googlegroups.com>
<t49qa1$3dt$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a0382ff0-9b3c-4590-8feb-c2db5adc8a07n@googlegroups.com>
Subject: Re: checking back chars when scaning - best way ?
From: profesor...@gmail.com (fir)
Injection-Date: Wed, 27 Apr 2022 06:51:54 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 92

by: fir - Wed, 27 Apr 2022 06:51 UTC

On Wednesday, 27 April 2022 at 00:05:02 UTC+2, Andrey Tarasevich wrote:
> On 4/26/2022 1:51 PM, fir wrote:
> > On Tuesday, 26 April 2022 at 22:32:01 UTC+2, Andrey Tarasevich wrote:
> >> On 4/26/2022 8:35 AM, fir wrote:
> >>> assume you have a routine that takes a long string/literal char* txt
> >>> as input
> >>>
> >>> i need to scan it for some patterns like "..."; in this case it
> >>> is convenient to check
> >>>
> >>> for(char* c = txt; c<txt+end; c++)
> >>> if(*c=='.' && *(c-1)=='.' && *(c-2)=='.') ...found
> >>>
> >>> the problem is that it does not stay within the string and reads
> >>> two chars before its beginning
> >>>
> >>> what is the best and simplest way to do it correctly?
> >> Um... Start searching from `text + 2`, obviously. And generally from
> >> `pattern_length - 1` position. The pattern cannot possibly occur before
> >> that position, no point in searching there.
> >>
> > it seems most people do it this way, but with respect to what i was saying this seems to me to be kind of a bad pattern
> Quite the opposite! This is actually a very elegant and useful pattern,
> which I'd personally call "strive to prologue/epilogue the special case"
> or something more catchy.
>
> Pretty often when people write cyclic code they run into situations when
> the cycle has to watch for a one-off special case, which occurs only
> once. Keeping this special processing inside the cycle's body usually
> requires checking for it (e.g. with an `if`) on every iteration. Doing
> it that way reeks to high heaven: spending an effort on _every_
> iteration to catch something that happens _only_ _once_ is a bad pattern.
>
> In many cases, if you put your mind to it, you might discover a way to
> pull that one-off special processing into cycle's prologue or epilogue
> code, thus relieving the actual cycle from the burden of watching for
> that special case. It is a nice, elegant and very useful refactoring
> pattern that not only improves efficiency, but also improves readability
> of the code.
>
> One example that comes to mind is processing edges of a polygon
> represented as an array of vertices. One edge - the one between the last
> and the first vertex - is an obvious special case, which can be handled
> very nicely by this pattern.
>
> What you have above is a more trivial variation of that pattern as well.
>
> P.S. On a related note: sometimes I see people who seem to be
> consciously doing the opposite: for no apparent reason they push cycle's
> prologue and/or epilogue _into_ the cycle's body. Something along the
> lines of
>
> for (unsigned i = 0; i < n; ++i)
> {
>   if (i == 0)
>   {
>     // Initialize some cycle-related stuff
>   }
>
>   // The actual "work"
>
>   if (i == n - 1)
>   {
>     // Cleanup some cycle-related stuff
>   }
> }
>
> It is weird, but I encounter it often enough to conclude that those
> people are "learning" this from someone/someplace and then using this in
> a cargo-cult fashion.
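
The polygon example quoted above could be sketched roughly like this. It is only an illustration under assumptions: `struct point` and `twice_signed_area` are hypothetical names, and the shoelace formula is picked here just because it needs every edge exactly once, including the closing one.

```c
#include <stddef.h>

/* Hypothetical vertex type, integer coordinates to keep the sketch simple. */
struct point { long x, y; };

/* Twice the signed area of a polygon (shoelace formula). */
static long twice_signed_area(const struct point *v, size_t n)
{
    long acc = 0;

    if (n < 3)
        return 0;   /* fewer than three vertices: no area */

    /* Main cycle: edges (v[i], v[i+1]) for i = 0 .. n-2, no wrap-around check. */
    for (size_t i = 0; i + 1 < n; ++i)
        acc += v[i].x * v[i + 1].y - v[i + 1].x * v[i].y;

    /* Epilogue: the one special edge, from the last vertex back to the first. */
    acc += v[n - 1].x * v[0].y - v[0].x * v[n - 1].y;

    return acc;
}
```

The alternative would be to index with `(i + 1) % n` or an `if` inside the cycle, which pays for the wrap-around on every edge even though it happens once.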

ifs are costly, so from an optimisation/efficiency point of view that is
probably right, but from a 'naturalness' point of view it is probably worse
imo, as it complicates the loop. as i pointed out, in such a scan pass you may
need to check for a number of various subwords etc. what i was asking in the
title was to establish maybe some better pattern for doing such things
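
One possible shape for such a pattern, as a rough sketch only: keep the "look back from the current char" style, but route every look-back through a small helper that refuses to read before the start of the text. `ends_at` and `count_matches` are hypothetical names invented here, not something proposed in the thread, and the helper does keep one length check per test, so it trades a little efficiency for a loop that stays simple when several subwords of different lengths are involved.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `pat` (length plen) ends exactly at index `pos` of `txt`. */
static int ends_at(const char *txt, size_t pos, const char *pat, size_t plen)
{
    if (plen == 0 || pos + 1 < plen)
        return 0;   /* not enough chars before pos: never read before txt */
    return memcmp(txt + pos + 1 - plen, pat, plen) == 0;
}

/* Counts the positions at which any of the given patterns ends. */
static size_t count_matches(const char *txt, size_t len,
                            const char *const *pats, size_t npats)
{
    size_t found = 0;

    for (size_t pos = 0; pos < len; ++pos)
        for (size_t k = 0; k < npats; ++k)
            if (ends_at(txt, pos, pats[k], strlen(pats[k])))
                ++found;

    return found;
}
```

With a single fixed pattern, starting the scan at `pattern_length - 1` as suggested earlier in the thread avoids even that check; the helper is mainly useful when the pattern lengths differ.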

Subject	Author
checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	Jack Lemmon
Re: checking back chars when scaning - best way ?	Richard Harnden
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	Stefan Ram
Re: checking back chars when scaning - best way ?	Andrey Tarasevich
Re: checking back chars when scaning - best way ?	fir
Re: checking back chars when scaning - best way ?	Andrey Tarasevich
Re: checking back chars when scaning - best way ?	fir