Get line number for a FILE *

<3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27384&group=comp.lang.c#27384

by: Malcolm McLean - Thu, 10 Aug 2023 11:10 UTC

Just adding some better error reporting to my XML parser.

I started out by keeping track of the line number as I went. But of course that's very fiddly. There's a much better way of doing it, which is to rewind the file and count newlines.

It's slow, of course. But you will only call the function when in error-processing mode, so that shouldn't matter.

Here's the function I wrote.

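#include <stdio.h>

/* Return the 1-based line number of the current position of fp,
   or 0 if the position cannot be determined. */
int flineno(FILE *fp)
{
    long pos;
    int err;
    int ch = 0;      /* initialised: the loop body never runs when pos == 0 */
    int answer = 1;

    pos = ftell(fp);
    if (pos < 0)
        return 0;
    err = fseek(fp, 0, SEEK_SET);
    if (err)
        return 0;
    while (ftell(fp) < pos)
    {
        ch = fgetc(fp);
        if (ch == EOF)
            return 0;
        if (ch == '\n')
            answer++;
    }
    /* A trailing newline means the error is in the line just read. */
    if (ch == '\n')
        answer--;

    return answer;
}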

Note the slightly tricky detail. If the last character read was a newline, that probably means that you read a line of text, parsed it, and found an error in that line. So the function needs to report the previous rather than the current line.
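The trailing-newline rule can be checked with a small harness. This is only a sketch: line_after_reading() is a hypothetical helper that does not appear in the post, and the copy of flineno() here initialises ch so that a stream at position 0 does not test an indeterminate value. It relies on tmpfile(), which opens a binary update stream, so ftell() counts bytes.

```c
#include <assert.h>
#include <stdio.h>

/* Copy of the posted function, with ch initialised. */
static int flineno(FILE *fp)
{
    long pos = ftell(fp);
    int ch = 0;
    int answer = 1;

    if (pos < 0 || fseek(fp, 0, SEEK_SET) != 0)
        return 0;
    while (ftell(fp) < pos) {
        ch = fgetc(fp);
        if (ch == EOF)
            return 0;
        if (ch == '\n')
            answer++;
    }
    if (ch == '\n')
        answer--;
    return answer;
}

/* Hypothetical helper: write text to a temporary file, consume the
   first nread characters, and report what flineno() says there.
   Returns -1 if the temporary file cannot be created. */
static int line_after_reading(const char *text, long nread)
{
    FILE *fp = tmpfile();
    int line;
    long i;

    assert(text != NULL);
    if (!fp)
        return -1;
    fputs(text, fp);
    rewind(fp);
    for (i = 0; i < nread; i++)
        fgetc(fp);
    line = flineno(fp);
    fclose(fp);
    return line;
}
```

With the text "a\nbb\nccc\n", stopping just after the first newline still reports line 1, while stopping one character later reports line 2.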

Re: Get line number for a FILE *

<6df8398b-c01d-4225-bc5e-e7679fa3786dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27389&group=comp.lang.c#27389

by: fir - Thu, 10 Aug 2023 11:58 UTC

On Thursday, 10 August 2023 at 13:10:52 UTC+2, Malcolm McLean wrote:
> Just adding some better error reporting to my XML parser.

I recently switched to a wider, higher-resolution monitor, and since then I dislike
writing such vertical code, and yours is especially vertical here.
(I used to keep things to 50-60 characters wide; I guess now I go to about twice that.)

It seems to depend on the monitor you use, but recently I find the wide style much better,
because
1) you don't need so much visual readability for code you have already written
2) and you need code compactness, because the files you work on become much more compact

So what you wrote strikes me on a stylistic point: it's a terrible waste.
Something like that would be better as 2-4 vertical lines.
C's fault, as far as I recall, is that you can't put the "int" declaration in some places, and that it is not visible
below the loop (it's C's fault, as a loop scope usually works in the context of the function it is in rather than having only its own).
Besides, bad names: imo fp should be file and answer should be lines.
Besides, it looks logically a bit weird, imo. Isn't it better just to count "\n" and then, say, add 1?

Re: Get line number for a FILE *

<585a7a0b-c701-4e81-aa5b-36291bb8b9dbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27391&group=comp.lang.c#27391

by: fir - Thu, 10 Aug 2023 12:03 UTC

On Thursday, 10 August 2023 at 13:58:15 UTC+2, fir wrote:
> Besides, it looks logically a bit weird, imo. Isn't it better just to count "\n" and then, say, add 1?
I mean, if the logical core of this function is "for(int i=0; i<txt_len; i++) if(txt[i]=='\n') lines++",
then all those other things are unnecessary ugliness.
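Spelled out as a function over an in-memory buffer, that core loop might look like the sketch below. The name count_lines() is hypothetical, and the text is assumed to be already loaded into txt. Unlike the posted flineno(), it does not step back a line when the last byte counted is a newline.

```c
#include <assert.h>
#include <stddef.h>

/* Count the newlines in the first txt_len bytes of txt and add 1,
   giving the 1-based number of the line that offset txt_len falls on. */
static int count_lines(const char *txt, size_t txt_len)
{
    int lines = 1;
    size_t i;

    assert(txt != NULL);
    for (i = 0; i < txt_len; i++)
        if (txt[i] == '\n')
            lines++;
    return lines;
}
```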

Re: Get line number for a FILE *

<15105fbe-4227-425c-a78a-4fef4b1a2f86n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27393&group=comp.lang.c#27393

by: Malcolm McLean - Thu, 10 Aug 2023 12:10 UTC

On Thursday, 10 August 2023 at 13:03:51 UTC+1, fir wrote:
> I mean, if the logical core of this function is "for(int i=0; i<txt_len; i++) if(txt[i]=='\n') lines++",
> then all those other things are unnecessary ugliness.
>
ftell() usually, but not always, returns the position of the file pointer in terms of calls to fgetc().
And whilst it is unlikely, you could get an I/O error with the device. And the stream might be
unseekable.
I agree that these are irritating little details. Fundamentally the logic is as you say.

Re: Get line number for a FILE *

<ub2kn9$c6ds$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27395&group=comp.lang.c#27395

by: Bart - Thu, 10 Aug 2023 12:20 UTC

On 10/08/2023 12:10, Malcolm McLean wrote:
> Just adding some better error reporting to my XML parser.

I use something similar (with parsing source code), but I load text
entirely into memory, and traverse it as a string, identifying and
returning tokens.

Each token has associated with it, its offset within the source string.

When an error occurs, it is linked to a specific token (maybe the last
one read, maybe the first of a particular syntactic construct; this is
not an exact science).

Searching backwards from the token's offset determines the line number
and the offset within the line.
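A backwards search of that kind could be sketched roughly as follows. struct srcpos and locate() are hypothetical names, not Bart's code, and the conventions (1-based line, 0-based column) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: 1-based line, 0-based column. */
struct srcpos { int line; size_t col; };

/* Map a byte offset within the in-memory source to a line and column. */
static struct srcpos locate(const char *src, size_t offset)
{
    struct srcpos p = { 1, 0 };
    size_t i = offset;

    assert(src != NULL);
    /* Walk back to the start of the current line to get the column. */
    while (i > 0 && src[i - 1] != '\n')
        i--;
    p.col = offset - i;
    /* Every newline before that point starts another line. */
    while (i > 0) {
        i--;
        if (src[i] == '\n')
            p.line++;
    }
    return p;
}
```

Because each token only has to carry its offset, the cost of the search is paid only when an error is actually reported.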

Re: Get line number for a FILE *

<7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27396&group=comp.lang.c#27396

by: Malcolm McLean - Thu, 10 Aug 2023 12:40 UTC

On Thursday, 10 August 2023 at 13:21:12 UTC+1, Bart wrote:
> I use something similar (with parsing source code), but I load text
> entirely into memory, and traverse it as a string, identifying and
> returning tokens.
>
> Each token has associated with it, its offset within the source string.
>
> When an error occurs, it is linked to a specific token (maybe the last
> one read, maybe the first of a particular syntactic construct; this is
> not an exact science).
>
> Searching backwards from the token's offset determines the line number
> and the offset within the line.
>
There's a strong case for doing that. If a file is read in as a string, then any I/O
errors can be dealt with in one place. Also, it's easy to integrate the parser into
main logic. Just use the Baby X resource compiler to embed the file as a
string and you're ready to go.

However it's traditional to parse input from a stream, probably dating back to
the days when memory was expensive. That's the approach I use in the xml
parser.
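For comparison, reading a whole stream into a string so that I/O errors surface in one place might look like this sketch. read_all() is a hypothetical helper, not part of the XML parser or of Baby X, and the initial 4096-byte buffer is an arbitrary choice.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything remaining on fp into a
   NUL-terminated buffer. Returns NULL on I/O or allocation failure.
   The caller frees the result; *len_out (if non-NULL) gets the length. */
static char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0, n;
    char *buf, *tmp;

    assert(fp != NULL);
    buf = malloc(cap);
    if (!buf)
        return NULL;
    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {       /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (!tmp) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {                   /* the one place errors surface */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (len_out)
        *len_out = len;
    return buf;
}
```

After this call the parser only ever touches memory, so it needs no error paths of its own for the stream.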

Re: Get line number for a FILE *

<3b3428ca-cfdb-46ac-916d-da83a3f09012n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27397&group=comp.lang.c#27397

by: fir - Thu, 10 Aug 2023 12:48 UTC

On Thursday, 10 August 2023 at 14:10:31 UTC+2, Malcolm McLean wrote:
> ftell() usually, but not always, returns the position of the file pointer in terms of calls to fgetc().
> And whilst it is unlikely, you could get an I/O error with the device. And the stream might be
> unseekable.
> I agree that these are irritating little details. Fundamentally the logic is as you say.

If you scan the whole file with fgetc, relying on this hidden file position,
it's imo rather bad design (it's potentially quadratic complexity, though this depends on the use case;
as you say, when it is used solely for reporting an error one can do it). Also, those fgetc calls
are probably all cached in RAM, so one could maybe use fgetc and fseek as an interface for jumping around in RAM, and maybe sometimes it's even a good decision not to duplicate that RAM mirror.
But I would need to know that for sure, and hell knows what actually happens; Windows is
probably able to drop the cached area if it is low on RAM or something, though I don't know
if it does that for open files.
It probably depends on how much someone "loves" the API they use. I typically don't love
this stdio.h API so much (though I don't mean it can't be loved if you know it better ;c
I have no idea).

Overall, with time I think I have learned not to be so paranoid and call it bad (because it uses fgetc, fseek and a hidden pointer, plus the potentially killing quadratic complexity), because it depends on the use case.
I would say that, assuming everything is okay, everything is okay here, but it depends on how much someone wants to check or study the given topic and its assumptions.

I'm not sure studying the details of this stdio API is the most interesting thing for me
(that's why I answered mainly on code style, which is more interesting to me right now).

(Overall I wouldn't write a thing like that, as I use my own "sickle" way (load into RAM and then split),
but you may write it that way if you want, though I would suspect most people would say
it is bad and that you'd do better to keep track of the current line number on the fly or something like that.)
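Keeping track of the line number on the fly could be sketched as a thin wrapper around fgetc(). struct lreader and its functions are hypothetical names; the sketch follows the same convention as the posted flineno(), so a newline only advances the count once a further character has been read. It does not handle ungetc() or seeking, which is part of what makes the approach fiddly.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: a stream plus the line number of the most
   recently read character. */
struct lreader {
    FILE *fp;
    int line;       /* 1-based line of the last character read */
    int pending;    /* nonzero if the last character was '\n' */
};

static void lreader_init(struct lreader *r, FILE *fp)
{
    assert(r != NULL && fp != NULL);
    r->fp = fp;
    r->line = 1;
    r->pending = 0;
}

/* Use in place of fgetc(): bump the line only when a character
   follows a newline, so an error found right after reading "...\n"
   is still attributed to the line just read. */
static int lreader_getc(struct lreader *r)
{
    int ch = fgetc(r->fp);

    if (ch == EOF)
        return EOF;
    if (r->pending) {
        r->line++;
        r->pending = 0;
    }
    if (ch == '\n')
        r->pending = 1;
    return ch;
}
```

Reporting an error is then a matter of reading r.line, with no seeking and no reliance on what ftell() returns.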

Re: Get line number for a FILE *

<898c467a-0ea7-4b5d-94ca-0d4a97c786f2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27398&group=comp.lang.c#27398

by: fir - Thu, 10 Aug 2023 13:03 UTC

On Thursday, 10 August 2023 at 14:40:42 UTC+2, Malcolm McLean wrote:
> On Thursday, 10 August 2023 at 13:21:12 UTC+1, Bart wrote:
> > On 10/08/2023 12:10, Malcolm McLean wrote:
> > > Just adding some better error reporting to my XML parser.
> string and you're ready to go.
>
> However it's traditional to parse input from a stream, probably dating back to
> the days when memory was expensive. That's the approach I use in the xml
> parser.

That was the '60s or '70s, when you had 16 KB of RAM; now various working structures are,
I guess, bigger than that. There are maybe two problems with it: 1) the stylistic dependency
on this stdio API (on Windows, for example, you could use a couple of alternative ways to read
a file, imo), especially as it is not that great; 2) potential quadratic complexity (I know you
said you use it in an error routine, which makes it not quadratic, but as a general "danger": because some function is present, there is a temptation to use it in other cases).

I saw this second problem recently (in the thread on a container for reducing space quadratic complexity). Quadratic complexity is a deadly killer: it kills modern PCs at numbers as low as 1000
elements (as 1000 x 1000 x C gives milliseconds when C is nanoseconds). It's so damn
killing it's just a disaster, and it's absolutely no way to go.

Re: Get line number for a FILE *

<5cb86760-ffe4-48ad-9ecf-f3a2612eabb8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27400&group=comp.lang.c#27400

by: fir - Thu, 10 Aug 2023 14:05 UTC

On Thursday, 10 August 2023 at 13:10:52 UTC+2, Malcolm McLean wrote:

> while (ftell(fp) < pos)
Especially this: I wouldn't write that. Hell knows what it does; it probably doesn't do much
more than a check and a return, but even so there's no need to do it like that. Use while and return on EOF.

Re: Get line number for a FILE *

<e7a5dd2f-1931-40ee-a657-eaf4085f33d8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27401&group=comp.lang.c#27401

by: fir - Thu, 10 Aug 2023 14:13 UTC

On Thursday, 10 August 2023 at 16:05:43 UTC+2, fir wrote:
> On Thursday, 10 August 2023 at 13:10:52 UTC+2, Malcolm McLean wrote:
>
> > while (ftell(fp) < pos)
> Especially this: I wouldn't write that. Hell knows what it does; it probably does not much more
> than a return and an if, but even so, it is better not to do it like that. Use a plain while and return on EOF.

Basic advice: do a time measurement. Write a function that returns the nanoseconds elapsed since its previous call; name it delta_ns() or tic() or something like that.

int main() { tic(); foo(); float t = tic(); printf("\n time of foo %1.f nanoseconds", t); return 0; }

This code will give you timings, and you can see whether such a while (ftell(file)) has an impact compared with a while (1).
I could probably test it here eventually, but I'm not sure it's worth it.
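
For reference, a minimal sketch of such a timer. Only the name tic() comes from the post; the rest is an assumption, including the use of C11 timespec_get, which reads the wall clock and is therefore only good for rough measurements. It returns double rather than float so that large nanosecond counts keep their precision.

```c
#include <time.h>

/* Hypothetical helper: returns the nanoseconds elapsed since the previous
   call, or 0 on the first call. Wall-clock based, so not monotonic. */
double tic(void)
{
    static struct timespec prev;
    static int have_prev = 0;
    struct timespec now;
    double delta = 0.0;

    timespec_get(&now, TIME_UTC);
    if (have_prev)
        delta = (double)(now.tv_sec - prev.tv_sec) * 1e9
              + (double)(now.tv_nsec - prev.tv_nsec);
    prev = now;
    have_prev = 1;
    return delta;
}
```

Used as in the one-liner above: call tic() once to start the clock, run the code under test, and the next tic() gives the elapsed time.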

Re: Get line number for a FILE *

<bD6BM.359407$xMqa.294344@fx12.iad>

https://www.novabbs.com/devel/article-flat.php?id=27408&group=comp.lang.c#27408

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx12.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: sco...@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Get line number for a FILE *
Newsgroups: comp.lang.c
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
Lines: 22
Message-ID: <bD6BM.359407$xMqa.294344@fx12.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 10 Aug 2023 14:37:59 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 10 Aug 2023 14:37:59 GMT
X-Received-Bytes: 1418

by: Scott Lurndal - Thu, 10 Aug 2023 14:37 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>Just adding some better error reporting to my XML parser.
>
>I started out by keeping track of the line number as I went. But of course that's very fiddly. There's a much better way of doing it, which is to rewind the file and count newlines.
>

How so?

line = fgets(buffer, sizeof(buffer), stdin);
if (line != NULL) {
    line_number++;
    /* parse line. If the text exceeds sizeof(buffer) before
       a new line is encountered, fread the remaining into an additional
       buffer and concatenate buffers */
} else {
    /* all done, go home */
}

>It's slow of course.

It's completely unnecessary and a waste of processing cycles.
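
A hedged sketch of the fgets approach written out as a loop. The function name count_lines and the buffer size are assumptions for illustration. Instead of concatenating buffers, it simply avoids counting the continuation pieces of an over-long line, which is enough if only the line number is wanted.

```c
#include <stdio.h>
#include <string.h>

/* Reads fp to the end with fgets and returns the number of lines seen.
   A line longer than the buffer arrives in several pieces; only the first
   piece of each line bumps the counter. */
long count_lines(FILE *fp)
{
    char buffer[64];
    long line_number = 0;
    int in_line = 0;   /* nonzero while inside an unfinished line */

    while (fgets(buffer, sizeof buffer, fp) != NULL) {
        size_t len = strlen(buffer);
        if (!in_line)
            line_number++;      /* first piece of a new line */
        in_line = (len == 0 || buffer[len - 1] != '\n');
    }
    return line_number;
}
```

Because it never seeks, this works on pipes as well as on regular files.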

Re: Get line number for a FILE *

<87ttt7dkrt.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=27410&group=comp.lang.c#27410

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Thu, 10 Aug 2023 15:39:18 +0100
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <87ttt7dkrt.fsf@bsb.me.uk>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="1e9312cf9931c060dc623a19d6d4488d";
logging-data="435270"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+PGM6OJ7QnGcXpghgZaNX/r6R4nhaD8eE="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:pgTvNd98Est2gjTSUBVr6fuFhaU=
sha1:Qbww3P9bhDNPnbU4m7nFvHSXsbw=
X-BSB-Auth: 1.2e2e6d8e51f4390b1846.20230810153918BST.87ttt7dkrt.fsf@bsb.me.uk

by: Ben Bacarisse - Thu, 10 Aug 2023 14:39 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> Just adding some better error reporting to my XML parser.
>
> I started out by keeping track of the line number as I went. But of
> course that's very fiddly.

Why is it very fiddly?

> There's a much better way of doing it,
> which is to rewind the file and count newlines.
>
> It's slow of course. But you will call the function when in error
> processing mode. So that shouldn't matter.
>
> Here's the function I wrote.
>
> int flineno(FILE *fp)
> {
>     long pos;
>     int err;
>     int ch;
>     int answer = 1;
>
>     pos = ftell(fp);
>     if (pos < 0)
>         return 0;
>     err = fseek(fp, 0, SEEK_SET);
>     if (err)
>         return 0;
>     while (ftell(fp) < pos)
>     {
>         ch = fgetc(fp);
>         if (ch == '\n')
>             answer++;
>         if (ch == EOF)
>             return 0;
>     }
>     if (ch == '\n')

ch can be indeterminate at this point.

The fix gives you the option of a bit of a hack: if you initialise ch to
'\n' this test will report an error on line zero (for an empty file)
which you might want.

>         answer--;
>
>     return answer;
> }
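
For illustration, a sketch of the quoted function with only that one change applied: ch starts out as '\n'. Everything else is kept as posted, including the ftell call in the loop condition. With this initialisation a position of 0 (for example, an empty file) yields 0, which is also the function's error value.

```c
#include <stdio.h>

/* Line number of the current position of fp, or 0 on error.
   Sketch only: assumes a seekable stream. */
int flineno(FILE *fp)
{
    long pos;
    int err;
    int ch = '\n';   /* initialised, so the final test is defined when pos == 0 */
    int answer = 1;

    pos = ftell(fp);
    if (pos < 0)
        return 0;
    err = fseek(fp, 0, SEEK_SET);
    if (err)
        return 0;
    while (ftell(fp) < pos)
    {
        ch = fgetc(fp);
        if (ch == '\n')
            answer++;
        if (ch == EOF)
            return 0;
    }
    if (ch == '\n')
        answer--;   /* last character read was a newline: report the previous line */

    return answer;
}
```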

But overall, I would not do this. It won't work when the input is not
seekable and that's just too useful a situation on Linux systems. I
don't think that matters so much to you, but I avoid tools that can't be
part of a pipeline.

The other consideration, for me, is that every lexer I've written ends
up needing state (the file name, the current token and so on) so
counting the lines is not fiddly at all. It's just another small bit of
state to maintain in the reader.
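
A minimal sketch of what such reader state might look like. The struct and function names are hypothetical, not taken from any posted code. Since the parser under discussion relies on ungetc, the push-back wrapper undoes the line increment when a newline is returned to the stream.

```c
#include <stdio.h>

/* Hypothetical reader state; all names here are illustrative. */
struct reader {
    FILE *fp;
    const char *name;  /* file name, for diagnostics */
    int line;          /* 1-based line of the next character to be read */
};

void reader_init(struct reader *r, FILE *fp, const char *name)
{
    r->fp = fp;
    r->name = name;
    r->line = 1;
}

/* fgetc wrapper that counts newlines as they go past. */
int reader_getc(struct reader *r)
{
    int ch = fgetc(r->fp);
    if (ch == '\n')
        r->line++;
    return ch;
}

/* ungetc wrapper that keeps the line count consistent. */
int reader_ungetc(struct reader *r, int ch)
{
    if (ch == '\n')
        r->line--;
    return ungetc(ch, r->fp);
}
```

Nothing here needs fseek or ftell, so it works when the input is a pipe.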

> Note the slightly tricky detail. If the last character read was a
> newline, that probably meant that you read a line of text, parsed it,
> and found an error in that line. So the function needs to report the
> previous rather than the current line.

If I were to do this, I'd consider also counting the characters since
the last newline to give a line offset as well. Mind you UTF-8 inputs
with lots of combining characters makes this hard to do correctly. In
practice, though, I don't think they are common.
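
A rough sketch of the line-plus-offset idea, assuming the text is already in memory as UTF-8. The names position and locate are made up for this example. It counts code points by skipping UTF-8 continuation bytes, so a combining sequence still counts as several columns; that is exactly the hard part mentioned above and is not solved here.

```c
#include <stddef.h>

struct position { int line; int column; };  /* both 1-based; hypothetical type */

/* Line and column of byte offset off in text[0..len).
   Assumes off falls on a code point boundary. */
struct position locate(const char *text, size_t len, size_t off)
{
    struct position p = { 1, 1 };
    size_t i;

    for (i = 0; i < off && i < len; i++) {
        unsigned char c = (unsigned char)text[i];
        if (c == '\n') {
            p.line++;
            p.column = 1;
        } else if ((c & 0xC0) != 0x80) {
            p.column++;   /* count lead bytes only, skip 10xxxxxx bytes */
        }
    }
    return p;
}
```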

--
Ben.

Re: Get line number for a FILE *

<52a69bb8-05ea-46df-ae76-1bc1b6486cbbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27412&group=comp.lang.c#27412

X-Received: by 2002:a37:8645:0:b0:768:3ae0:c178 with SMTP id i66-20020a378645000000b007683ae0c178mr35012qkd.1.1691680298455;
Thu, 10 Aug 2023 08:11:38 -0700 (PDT)
X-Received: by 2002:a17:902:e841:b0:1b5:147f:d8d1 with SMTP id
t1-20020a170902e84100b001b5147fd8d1mr1018971plg.3.1691680298026; Thu, 10 Aug
2023 08:11:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 10 Aug 2023 08:11:37 -0700 (PDT)
In-Reply-To: <e7a5dd2f-1931-40ee-a657-eaf4085f33d8n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.132; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.132
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<5cb86760-ffe4-48ad-9ecf-f3a2612eabb8n@googlegroups.com> <e7a5dd2f-1931-40ee-a657-eaf4085f33d8n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <52a69bb8-05ea-46df-ae76-1bc1b6486cbbn@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: profesor...@gmail.com (fir)
Injection-Date: Thu, 10 Aug 2023 15:11:38 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: fir - Thu, 10 Aug 2023 15:11 UTC

On Thursday, 10 August 2023 at 16:13:59 UTC+2, fir wrote:
> On Thursday, 10 August 2023 at 16:05:43 UTC+2, fir wrote:
> > On Thursday, 10 August 2023 at 13:10:52 UTC+2, Malcolm McLean wrote:
> >
> > > while (ftell(fp) < pos)
> > Especially this: I wouldn't write that. Hell knows what it does; it probably does not much more
> > than a return and an if, but even so, it is better not to do it like that. Use a plain while and return on EOF.
> Basic advice: do a time measurement. Write a function that returns the nanoseconds elapsed since its previous call; name it delta_ns() or tic() or something like that.
>
> int main() { tic(); foo(); float t = tic(); printf("\n time of foo %1.f nanoseconds", t); return 0; }
>
> This code will give you timings, and you can see whether such a while (ftell(file)) has an impact compared with a while (1).
> I could probably test it here eventually, but I'm not sure it's worth it.

lol, it turned out to be worse than I thought (but it was at least worth a test)

Tested on
Les Trois Mousquetaires / The Three Musketeers
Bilingual Edition
Translated by William Robson

My txt file is 4 107 378 bytes,

which both your code and mine say is 39 052 lines.

Your code with ftell in a loop takes 21 670 ms (when I set pos to the end).
Your code with while (pos--) takes 319 ms.

My sickle code using the splitter takes a stable 13 ms, though it does more work.

int flineno(FILE *fp)
{
    long pos;
    int ch = 0;    /* initialised so the final test is defined for an empty file */
    int answer = 1;

    /* for the benchmark, pos is set to the end of the file */
    if (fseek(fp, 0, SEEK_END))
        return 0;
    pos = ftell(fp);
    if (pos < 0)
        return 0;
    if (fseek(fp, 0, SEEK_SET))
        return 0;
    while (ftell(fp) < pos) {    /* one ftell call per character: the slow part */
        ch = fgetc(fp);
        if (ch == '\n')
            answer++;
        if (ch == EOF)
            return 0;
    }
    if (ch == '\n')
        answer--;
    return answer;
}

int flineno2(FILE *fp)
{
    long pos;
    int ch = 0;
    int answer = 1;

    if (fseek(fp, 0, SEEK_END))
        return 0;
    pos = ftell(fp);
    if (pos < 0)
        return 0;
    if (fseek(fp, 0, SEEK_SET))
        return 0;
    while (pos--) {    /* count down instead of calling ftell */
        ch = fgetc(fp);
        if (ch == '\n')
            answer++;
        if (ch == EOF)
            return answer;
    }
    if (ch == '\n')
        answer--;
    return answer;
}

int main()
{
    // TestSpaceDirectory();

    if (0)
    {
        TakeDeltaTimeNS(1);
        chunk d = LoadChunk("les trois.txt");
        chunks lines = SplitAfterCharacter(d, 0x0a);
        int n = ChunksLength(lines);
        float t = TakeDeltaTimeNS(1);

        printf("\n newlines: %d %1.f ms", n, t/1000/1000);
    }
    else
    {
        FILE *fp = fopen("les trois.txt", "rt");
        if (!fp) { printf("error"); return 1; }

        TakeDeltaTimeNS(1);
        int n = flineno2(fp);
        float t = TakeDeltaTimeNS(1);

        printf("\n newlines: %d %1.f ms", n, t/1000/1000);
        fclose(fp);
    }

    // TestDynamicArray();

    return 0;
}
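
For comparison, a hedged sketch of a third variant that was not measured in this thread: reading in blocks with fread and scanning each block for newlines. The function name and block size are assumptions, and no timing is claimed for it.

```c
#include <stdio.h>

/* Counts '\n' bytes in the first limit bytes of fp, reading in blocks.
   Returns -1 on a seek error. Assumes a seekable stream opened in binary
   mode, so that byte offsets and bytes read agree. */
long count_newlines(FILE *fp, long limit)
{
    char buf[4096];
    long count = 0;

    if (fseek(fp, 0, SEEK_SET) != 0)
        return -1;
    while (limit > 0) {
        size_t want = limit < (long)sizeof buf ? (size_t)limit : sizeof buf;
        size_t got = fread(buf, 1, want, fp);
        size_t i;

        if (got == 0)
            break;              /* EOF or read error before limit */
        for (i = 0; i < got; i++)
            if (buf[i] == '\n')
                count++;
        limit -= (long)got;
    }
    return count;
}
```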

Re: Get line number for a FILE *

<9c3ed6ab-5232-4820-94c1-3985bb91e451n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27414&group=comp.lang.c#27414

X-Received: by 2002:ad4:55cf:0:b0:635:e010:970e with SMTP id bt15-20020ad455cf000000b00635e010970emr39470qvb.13.1691681180029;
Thu, 10 Aug 2023 08:26:20 -0700 (PDT)
X-Received: by 2002:a17:902:f68c:b0:1bb:cd10:823f with SMTP id
l12-20020a170902f68c00b001bbcd10823fmr980162plg.5.1691681179640; Thu, 10 Aug
2023 08:26:19 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!glou.org!news.glou.org!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 10 Aug 2023 08:26:18 -0700 (PDT)
In-Reply-To: <52a69bb8-05ea-46df-ae76-1bc1b6486cbbn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.132; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.132
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<5cb86760-ffe4-48ad-9ecf-f3a2612eabb8n@googlegroups.com> <e7a5dd2f-1931-40ee-a657-eaf4085f33d8n@googlegroups.com>
<52a69bb8-05ea-46df-ae76-1bc1b6486cbbn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9c3ed6ab-5232-4820-94c1-3985bb91e451n@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: profesor...@gmail.com (fir)
Injection-Date: Thu, 10 Aug 2023 15:26:20 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: fir - Thu, 10 Aug 2023 15:26 UTC

On Thursday, 10 August 2023 at 17:11:46 UTC+2, fir wrote:
> [...]

If you want this test file: https://fastupload.io/uMmGB2N7bsZx7mf/file
It comes with the test source and my DLL, which I use because it has this timer and the sickle
routines inside (sickle is my set of routines for chunk-based text processing;
it may also be included as a .c file, but I sometimes use it as a DLL too).

Re: Get line number for a FILE *

<20230810081407.945@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=27418&group=comp.lang.c#27418

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Thu, 10 Aug 2023 15:38:47 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 82
Message-ID: <20230810081407.945@kylheku.com>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
Injection-Date: Thu, 10 Aug 2023 15:38:47 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c8ceb3884c73c57508aac2a066fd132c";
logging-data="453175"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18QNV6Y+DOCbMYEMAPsX6tc69Ea1SqfNGU="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:8thfXGbIDNHArS+xt6NEq1yAs8w=

by: Kaz Kylheku - Thu, 10 Aug 2023 15:38 UTC

On 2023-08-10, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> Just adding some better error reporting to my XML parser.
>
> I started out by keeping track of the line number as I went. But of
> course that's very fiddly.

It's only slightly fiddly if you have to recognize tokens that can
contain embedded newlines.

In a language in which a newline only occurs in a comment, or
whitespace, you can have a couple of dedicated tokenizing rules
which increment the line number.

You attach the current line number to all token objects and then
the parser can just refer to them.

If you allow tokens with newlines like

"multi
line
literal"

by a single token match that gobbles newlines, you then need
some slight fiddling, like walking the lexeme of that object after
it is extracted and bumping the line number for each newline
that occurs.
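
A small sketch of that fiddling, with hypothetical names (struct token, stamp_token). The token is stamped with the line on which it starts, and the running line counter is then advanced by the number of newlines inside the lexeme.

```c
#include <stddef.h>

/* Hypothetical token record: each token carries the line it starts on. */
struct token {
    const char *lexeme;
    size_t length;
    int line;
};

/* Stamps tok with the current line, then walks the lexeme and bumps
   *line once for every embedded newline. */
void stamp_token(struct token *tok, const char *lexeme, size_t length, int *line)
{
    size_t i;

    tok->lexeme = lexeme;
    tok->length = length;
    tok->line = *line;
    for (i = 0; i < length; i++)
        if (lexeme[i] == '\n')
            (*line)++;
}
```

The parser can then report tok->line for any token it holds, without touching the input stream.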

You can also have your lexer go through a stream abstraction which
tracks the line number (and possibly column position: popular nowadays).

> There's a much better way of doing it,
> which is to rewind the file and count newlines.

Doesn't work for non-seekable streams, like pipes and sockets.

I suspect that next to no language processors in the wild
get their error-reporting line numbers this way.

Also, it wouldn't be accurate because the current line number in the
underlying input file is not always the best line number for the error
diagnostic. Sometimes a diagnostic benefits from two line numbers.

Here is a trivial example: Lisp parser reporting unbalanced
parens, while reading multi-line input (in this case from a string):

1> (read "(foo\n(bar\n(baz" *stderr*)
string:3: syntax error
string:3: unterminated expression
string:1: while parsing expression starting on this line
** syntax error: read: string: errors encountered

We cannot use your approach to generate the additional helpful diagnostic
about the starting line, unless at the start of every top-level
expression, we invoke the fseek-based logic to record the line number,
even when there is no error.
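
To make the two-line-number point concrete, here is a toy sketch in C (not the Lisp reader shown above; the function name is made up). It checks parenthesis balance while counting lines as it goes, and remembers the line on which the current top-level expression started, so both numbers are available when the error is found.

```c
/* Returns 0 if the parentheses in text balance. Otherwise returns -1 and
   stores the line where the problem was detected and the line on which
   the offending top-level expression started. */
int check_parens(const char *text, int *error_line, int *start_line)
{
    int line = 1, depth = 0, top_start = 0;

    for (; *text != '\0'; text++) {
        if (*text == '\n') {
            line++;
        } else if (*text == '(') {
            if (depth++ == 0)
                top_start = line;   /* a new top-level expression begins */
        } else if (*text == ')') {
            if (depth == 0) {       /* stray closing paren */
                *error_line = line;
                *start_line = line;
                return -1;
            }
            depth--;
        }
    }
    if (depth > 0) {                /* unterminated expression at end of input */
        *error_line = line;
        *start_line = top_start;
        return -1;
    }
    return 0;
}
```

On the input "(foo\n(bar\n(baz" this reports line 3 for the error and line 1 for the start, matching the shape of the diagnostics above.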

You're assuming it's okay for it to be "expensive" to get the
line number, because only the parsing ever needs it and only
when there is a syntax error.

Line numbers can also be usefully retained into later processing stages,
long after parsing is done.

There are errors that can be caught only after a second pass through the
parse tree or abstract syntax tree. E.g. simple example: expanding
macros. A structural macro expander works with the abstract syntax, by
walking the tree. The original stream (if any) from which the code came
is long gone.

1> (defmacro needs-arg (x))
needs-arg
2> (compile-toplevel '(needs-arg))
** expr-2:1: needs-arg: missing arguments for params (x)
2> (compile-toplevel '(list
1
2
(needs-arg)))
** expr-2:4: needs-arg: missing arguments for params (x)

Yet, we can report the error to a line number. Which means that
all expressions have line information attached to them, just in
case there may be some error found in them in a later code walk.

Re: Get line number for a FILE *

<20230810083924.456@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=27419&group=comp.lang.c#27419

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Thu, 10 Aug 2023 15:40:26 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <20230810083924.456@kylheku.com>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me>
<7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
Injection-Date: Thu, 10 Aug 2023 15:40:26 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="c8ceb3884c73c57508aac2a066fd132c";
logging-data="453175"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ua57Ax80K6PDlQj41aO9lWOCNvmfCHlk="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:VTFW8GH8QnISeTPQ/SemOiFOpr8=

by: Kaz Kylheku - Thu, 10 Aug 2023 15:40 UTC

On 2023-08-10, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> On Thursday, 10 August 2023 at 13:21:12 UTC+1, Bart wrote:
>> Searching backwards from the token's offset, determines the line-number,
>> and offset with the line.
>>
> There's a strong case for doing that. If a file is read in as a string, then any IO
> errors can be dealt with in one place.

Only if you have lexical, syntax, type, semantic and all other error
checking that needs a line number all in one place, in one pass.

Re: Get line number for a FILE *

<53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27456&group=comp.lang.c#27456

X-Received: by 2002:a37:2c07:0:b0:76c:69ac:a0f0 with SMTP id s7-20020a372c07000000b0076c69aca0f0mr6217qkh.4.1691722318098;
Thu, 10 Aug 2023 19:51:58 -0700 (PDT)
X-Received: by 2002:a05:6a00:1704:b0:687:41a1:640e with SMTP id
h4-20020a056a00170400b0068741a1640emr231020pfc.6.1691722317524; Thu, 10 Aug
2023 19:51:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 10 Aug 2023 19:51:56 -0700 (PDT)
In-Reply-To: <20230810083924.456@kylheku.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:868:d0a4:2df0:ec16;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:868:d0a4:2df0:ec16
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me> <7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Fri, 11 Aug 2023 02:51:58 +0000
Content-Type: text/plain; charset="UTF-8"

by: Malcolm McLean - Fri, 11 Aug 2023 02:51 UTC

On Thursday, 10 August 2023 at 16:40:40 UTC+1, Kaz Kylheku wrote:
> On 2023-08-10, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > On Thursday, 10 August 2023 at 13:21:12 UTC+1, Bart wrote:
> >> Searching backwards from the token's offset, determines the line-number,
> >> and offset with the line.
> >>
> > There's a strong case for doing that. If a file is read in as a string, then any IO
> > errors can be dealt with in one place.
> Only if you have lexical, syntax, type, semantic and all other error
> checking that needs a line number all in one place, in one pass.
>
The "lexer" is fgetc(). You can tell because the code makes large numbers of ungetc()
calls. I'm rewriting it anyway. Maybe I should do a complete rewrite to have a
proper lexer with a "gettoken()" and "match()" function. Then, as you say, all calls
to fgetc() will be in the same place, and it becomes easy to count lines and columns.

The main problem with a lexer for XML is that the grammar specifies "anything except
a special character" for the data between nodes. So either you have to have a special
mode, so the lexer isn't really a lexer any more, or the tokens have to be single characters
anyway.

IO errors are of a different nature to parse errors. By far the most likely IO error is
failure to open the file because the wrong name was provided. But any call to
fgetc() can return EOF. If you load all the data as a string in a first pass, then
you pick up any IO errors, and further ones are not possible.
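
A sketch of that first pass, under the assumption that the whole input fits in memory. The name load_stream is hypothetical. It reads with fread rather than seeking to find the size, so it does not depend on the stream being seekable, and all I/O failure handling sits in this one function.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads all of fp into a malloc'd, NUL-terminated buffer and stores the
   length in *out_len. Returns NULL on a read error or if memory runs out.
   The caller frees the result. */
char *load_stream(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    for (;;) {
        /* always leave one byte free for the terminating NUL */
        size_t got = fread(buf + len, 1, cap - len - 1, fp);
        len += got;
        if (got == 0)
            break;                       /* EOF or error */
        if (cap - len - 1 == 0) {        /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```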

Re: Get line number for a FILE *

<RGzeoTxZngQiZgoTG@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=27457&group=comp.lang.c#27457

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Fri, 11 Aug 2023 05:38:33 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <RGzeoTxZngQiZgoTG@bongo-ra.co>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com> <20230810081407.945@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 11 Aug 2023 05:38:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3a8c36ec1363fec05f45e9485a21b02f";
logging-data="791953"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/e3Ka4SP6zO+37DIpAlX69"
Cancel-Lock: sha1:UclFZZPMZ8pz7gXmH4V75pTQZm0=
X-Organisation: Weyland-Yutani
In-Reply-To: <20230810081407.945@kylheku.com>
X-Server-Commands: nowebcancel

by: Spiros Bousbouras - Fri, 11 Aug 2023 05:38 UTC

On Thu, 10 Aug 2023 15:38:47 -0000 (UTC)
Kaz Kylheku <864-117-4973@kylheku.com> wrote:
> On 2023-08-10, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> > Just adding some better error reporting to my XML parser.
> >
> > I started out by keeping track of the line number as I went. But of
> > course that's very fiddly.
>
> It's only slightly fiddly if you have to recognize tokens that can
> contain embedded newlines.

I load the whole file into memory and, as I do so, I also recognise ends of
lines and create an array where line_numbers[i] = j means that line
number i starts at byte j. So I can get from a byte number to a line number
by a binary search. All further parsing is done after I have the whole file
in memory.
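
A minimal sketch of this scheme, with assumed names (index_lines, line_of_offset) and one difference in convention: the array is indexed from 0, so starts[i] is the byte offset at which line i + 1 begins. It builds the index in a separate pass over the loaded text rather than while loading.

```c
#include <stdlib.h>

/* Builds the array of line start offsets for text[0..len).
   Stores the number of lines in *n_lines. Returns NULL if malloc fails. */
size_t *index_lines(const char *text, size_t len, size_t *n_lines)
{
    size_t count = 1, i, k = 0;
    size_t *starts;

    for (i = 0; i < len; i++)
        if (text[i] == '\n')
            count++;
    starts = malloc(count * sizeof *starts);
    if (starts == NULL)
        return NULL;
    starts[k++] = 0;
    for (i = 0; i < len; i++)
        if (text[i] == '\n')
            starts[k++] = i + 1;
    *n_lines = count;
    return starts;
}

/* 1-based number of the line containing byte offset off: binary search
   for the last line whose start is <= off. */
size_t line_of_offset(const size_t *starts, size_t n_lines, size_t off)
{
    size_t lo = 0, hi = n_lines;   /* invariant: starts[lo] <= off < starts[hi] */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= off)
            lo = mid;
        else
            hi = mid;
    }
    return lo + 1;
}
```

The column within the line is then off - starts[line - 1], in bytes.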

> In a language in which a newline only occurs in a comment, or
> whitespace, you can have a couple of dedicated tokenizing rules
> which increment the line number.
>
> You attach the current line number to all token objects and then
> the parser can just refer to them.
>
> If you allow tokens with newlines like
>
> "multi
> line
> literal"
>
> by a single token match that gobbles newlines, you then need
> some slight fiddling, like walking the lexeme of that object after
> it is extracted and bumping the line number for each newline
> that occurs.

[...]

Re: Get line number for a FILE *

<878rahesel.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=27465&group=comp.lang.c#27465

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Fri, 11 Aug 2023 12:21:22 +0100
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <878rahesel.fsf@bsb.me.uk>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me>
<7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com>
<53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="ba900a44ace1132ec994257293eba59f";
logging-data="881135"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+tfpSIRVtGYJZM1qsNGisvgA6TW+HHXRw="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:XdyFVQgroFGaePdvjDBF3r/RSIE=
sha1:oA6Ufg4Gf/AquvvrBv1RbAwxOcE=
X-BSB-Auth: 1.312c1ff1cee90b9713ef.20230811122122BST.878rahesel.fsf@bsb.me.uk

by: Ben Bacarisse - Fri, 11 Aug 2023 11:21 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> The main problem with a lexer for XML is that the grammar specifies
> "anything except a special character" for the data between nodes. So
> either you have to have a special mode, so the lexer isn't really a
> lexer any more, or the tokens have to be single characters anyway.

No. It's almost universal to have such tokens. A C string is (to a
first approximation) anything except a '"'. A C comment is anything up
to a '*' followed by a '/'. In the most common kind of lexer, what you
think of as "modes" are states in a finite-state machine recogniser for
a regular language. For example, in C, when it sees a letter or '_' the
lexer enters the ID "mode" and accepts characters until the first
non-alphanumeric. It's just a state machine.
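
A toy sketch of such a recogniser with the states made explicit. All names are made up for this example, and the string rule is the first approximation described above (no escape handling). It shows how the ID and string "modes" are ordinary states of one small machine.

```c
#include <ctype.h>
#include <stddef.h>

enum kind  { TOK_END, TOK_IDENT, TOK_STRING, TOK_OTHER };
enum state { S_START, S_IDENT, S_STRING };

/* Scans one token from *p, advances *p past it, and reports where the
   token starts and how long it is. */
enum kind next_token(const char **p, const char **start, size_t *len)
{
    const char *s = *p;
    enum state st = S_START;
    enum kind kind = TOK_END;
    int done = 0;

    *start = s;
    while (!done) {
        unsigned char c = (unsigned char)*s;
        switch (st) {
        case S_START:
            if (c == '\0') {
                *start = s; kind = TOK_END; done = 1;
            } else if (isspace(c)) {
                s++;                             /* skip whitespace */
            } else {
                *start = s++;
                if (isalpha(c) || c == '_') st = S_IDENT;
                else if (c == '"') st = S_STRING;
                else { kind = TOK_OTHER; done = 1; }
            }
            break;
        case S_IDENT:                            /* letters, digits, '_' */
            if (isalnum(c) || c == '_') s++;
            else { kind = TOK_IDENT; done = 1; }
            break;
        case S_STRING:                           /* anything except '"' */
            if (c != '\0') s++;
            if (c == '"' || c == '\0') { kind = TOK_STRING; done = 1; }
            break;
        }
    }
    *len = (size_t)(s - *start);
    *p = s;
    return kind;
}
```

A state for XML character data would look like S_STRING with '<' as the terminating character instead of '"'.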

--
Ben.

Re: Get line number for a FILE *

<f1b06f9e-2c8d-4d00-8295-72cc40b8fbddn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27469&group=comp.lang.c#27469

X-Received: by 2002:a05:622a:249:b0:40c:84bb:1b09 with SMTP id c9-20020a05622a024900b0040c84bb1b09mr44772qtx.0.1691761776677;
Fri, 11 Aug 2023 06:49:36 -0700 (PDT)
X-Received: by 2002:a05:6a00:cc6:b0:687:a657:e11d with SMTP id
b6-20020a056a000cc600b00687a657e11dmr660372pfv.1.1691761776358; Fri, 11 Aug
2023 06:49:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!glou.org!news.glou.org!fdn.fr!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Fri, 11 Aug 2023 06:49:35 -0700 (PDT)
In-Reply-To: <878rahesel.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.85; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.85
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me> <7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com> <53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f1b06f9e-2c8d-4d00-8295-72cc40b8fbddn@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: profesor...@gmail.com (fir)
Injection-Date: Fri, 11 Aug 2023 13:49:36 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: fir - Fri, 11 Aug 2023 13:49 UTC

On Friday, 11 August 2023 at 13:21:38 UTC+2, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > The main problem with a lexer for XML is that the grammar specifies
> > "anything except a special character" for the data between nodes. So
> > either you have to have a special mode, so the lexer isn't really a
> > lexer any more, or the tokens have to be single characters anyway.
> No. It's almost universal to have such tokens. A C string is (to a
> first approximation) anything except a '"'. A C comment is anything up
> to a '*' followed by a '/'. In the most common kind of lexer, what you
> think of as "modes" are states in a finite-state machine recogniser for
> a regular language. For example, in C, when it sees a letter or '_' the
> lexer enters the ID "mode" and accepts characters until the first
> non-alphanumeric. It's just a state machine.
>
Interesting, I didn't know C works that way (I mean being open to unknown characters, and if so they are legal in C; this is the kind of openness
I probably have in mind when I say C is a better language than I
most probably could invent). However, what is the proof that
C is like that? (And the standard does not convince me, as IMO
the standard is totally not C; better the tradition of early C, which is IMO much more
important.)

Re: Get line number for a FILE *

<87y1ihcv3c.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=27491&group=comp.lang.c#27491

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Fri, 11 Aug 2023 19:06:15 +0100
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <87y1ihcv3c.fsf@bsb.me.uk>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me>
<7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com>
<53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk>
<f1b06f9e-2c8d-4d00-8295-72cc40b8fbddn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: dont-email.me; posting-host="ba900a44ace1132ec994257293eba59f";
logging-data="997292"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/1QVcel4+A6KYDcYaRXM6FY23H/sxIaVY="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:UM4yq2ya7NMkBVz3A0l6lKFYcRU=
sha1:BivT4XOW31sK/UMjpybpWEKvsXc=
X-BSB-Auth: 1.60b06b61c8cc91ab7a57.20230811190615BST.87y1ihcv3c.fsf@bsb.me.uk

by: Ben Bacarisse - Fri, 11 Aug 2023 18:06 UTC

fir <profesor.fir@gmail.com> writes:

> On Friday, 11 August 2023 at 13:21:38 UTC+2, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> > The main problem with a lexer for XML is that the grammar specifies
>> > "anything except a special character" for the data between nodes. So
>> > either you have to have a special mode, so the lexer isn't really a
>> > lexer any more, or the tokens have to be single characters anyway.
>>
>> No. It's almost universal to have such tokens. A C string is (to a
>> first approximation) anything except a '"'. A C comment is anything up
>> to a '*' followed by a '/'. In the most common kind of lexer, what you
>> think of as "modes" are states in a finite-state machine recogniser for
>> a regular language. For example, in C, when it sees a letter or '_' the
>> lexer enters the ID "mode" and accepts characters until the first
>> non-alphanumeric. It's just a state machine.
>>
> Interesting, I didn't know C works that way (I mean being open to unknown
> characters, and if so, that they are legal in C; that is a kind of
> openness. I probably say enough when I say C is a better language than I
> most probably could invent). However, what is the proof that C is like
> that? (And the standard does not convince me, as IMO the standard is totally
> not C; the tradition of early C is better and IMO much more important.)

There's a question about C in there but I don't really know what you are
asking. I think the answer is no, C does not work like that. When I
said "to a first approximation" I meant that I've omitted lots of
details and I think details are what you are asking about.

--
Ben.

Re: Get line number for a FILE *

<bd3f35b4-fd47-451a-8d78-c9ab21075ba5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27504&group=comp.lang.c#27504

X-Received: by 2002:a05:622a:199d:b0:40f:17f1:541 with SMTP id u29-20020a05622a199d00b0040f17f10541mr51274qtc.13.1691820623983;
Fri, 11 Aug 2023 23:10:23 -0700 (PDT)
X-Received: by 2002:a17:90b:914:b0:26b:20f2:c0e7 with SMTP id
bo20-20020a17090b091400b0026b20f2c0e7mr866097pjb.0.1691820623702; Fri, 11 Aug
2023 23:10:23 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Fri, 11 Aug 2023 23:10:23 -0700 (PDT)
In-Reply-To: <878rahesel.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:868:d0a4:2df0:ec16;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:868:d0a4:2df0:ec16
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me> <7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com> <53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bd3f35b4-fd47-451a-8d78-c9ab21075ba5n@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 12 Aug 2023 06:10:23 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2969

by: Malcolm McLean - Sat, 12 Aug 2023 06:10 UTC

On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > The main problem with a lexer for XML is that the grammar specifies
> > "anything except a special character" for the data between nodes. So
> > either you have to have a special mode, so the lexer isn't really a
> > lexer any more, or the tokens have to be single characters anyway.
> No. It's almost universal to have such tokens. A C string is (to a
> first approximation) anything except a '"'. A C comment is anything up
> to a '*' followed by a '/'. In the most common kind of lexer, what you
> think of as "modes" are states in a finite-state machine recogniser for
> a regular language. For example, in C, when it sees a letter or '_' the
> lexer enters the ID "mode" and accepts characters until the first
> non-alphanumeric. It's just a state machine.
>
The difference is that in C, or for that matter, Minibasic, when you hit an
alpha (or underscore), you know that you just have to read off all
the subsequent alnums to get the token. Free form text can only appear
inside quotes or comments. So if you hit a quote you gobble the string,
and if you hit a comment you just pass through until you hit a "close
comment" sequence (or in Minibasic, just skip the REM line).

With XML it's not like that. Free form text is delineated by tags. But you've
got to know whether you are in a tag or not to read it. So if you are to have
a FREEFORMTEXT token, the lexer needs to know whether it has just parsed
an element opening tag or not.
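
The context dependence described above can be sketched in C as a lexer that carries a single flag. This is only an illustrative sketch, not the Baby X parser: the names xtok, xlexer, xlex_next and the X_* token kinds are hypothetical, input is assumed to be a NUL-terminated string, and comments, entities, single-quoted attributes and "<?" declarations are ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, for illustration only. */
enum xtok { X_EOF, X_LT, X_GT, X_SLASH, X_EQ, X_NAME, X_STRING, X_TEXT, X_ERROR };

struct xlexer {
    const char *p;  /* next unread character */
    int in_tag;     /* nonzero between '<' and '>' */
};

/* Return the next token kind; *start and *len describe its text.
   On X_ERROR the position is not advanced, so the caller should stop. */
enum xtok xlex_next(struct xlexer *lx, const char **start, size_t *len)
{
    if (lx->in_tag)                      /* whitespace only separates tokens inside a tag */
        while (isspace((unsigned char)*lx->p))
            lx->p++;
    *start = lx->p;
    *len = 0;
    if (*lx->p == '\0')
        return X_EOF;

    if (!lx->in_tag) {
        if (*lx->p == '<') {
            lx->p++;
            lx->in_tag = 1;
            *len = 1;
            return X_LT;
        }
        *len = strcspn(lx->p, "<");      /* free-form text runs up to the next '<' */
        lx->p += *len;
        return X_TEXT;
    }

    switch (*lx->p) {
    case '>': lx->p++; lx->in_tag = 0; *len = 1; return X_GT;
    case '/': lx->p++; *len = 1; return X_SLASH;
    case '=': lx->p++; *len = 1; return X_EQ;
    case '"': {
        const char *end = strchr(lx->p + 1, '"');
        if (end == NULL)
            return X_ERROR;              /* unterminated string */
        *start = lx->p + 1;              /* value without the quotes */
        *len = (size_t)(end - *start);
        lx->p = end + 1;
        return X_STRING;
    }
    default:
        if (isalpha((unsigned char)*lx->p) || *lx->p == '_') {
            while (isalnum((unsigned char)*lx->p) || *lx->p == '_')
                lx->p++;
            *len = (size_t)(lx->p - *start);
            return X_NAME;
        }
        return X_ERROR;
    }
}
```

With this flag, the word Text comes back as X_NAME inside a tag and as part of an X_TEXT token between tags, which is the distinction the paragraph above is about.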

Re: Get line number for a FILE *

<20230812001343.595@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=27505&group=comp.lang.c#27505

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Get line number for a FILE *
Date: Sat, 12 Aug 2023 07:26:11 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <20230812001343.595@kylheku.com>
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me>
<7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com>
<53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk>
<bd3f35b4-fd47-451a-8d78-c9ab21075ba5n@googlegroups.com>
Injection-Date: Sat, 12 Aug 2023 07:26:11 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="3e40dc164c77a4336b6eaa78fd02fe1b";
logging-data="1325171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+T8liPG11XuSn0F/xELC35M7fyizU3W9A="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:sDQzfU8zZkrrIZFBIXm7G+q0/dA=

by: Kaz Kylheku - Sat, 12 Aug 2023 07:26 UTC

On 2023-08-12, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> > The main problem with a lexer for XML is that the grammar specifies
>> > "anything except a special character" for the data between nodes. So
>> > either you have to have a special mode, so the lexer isn't really a
>> > lexer any more, or the tokens have to be single characters anyway.
>> No. It's almost universal to have such tokens. A C string is (to a
>> first approximation) anything except a '"'. A C comment is anything up
>> to a '*' followed by a '/'. In the most common kind of lexer, what you
>> think of as "modes" are states in a finite-state machine recogniser for
>> a regular language. For example, in C, when it sees a letter or '_' the
>> lexer enters the ID "mode" and accepts characters until the first
>> non-alphanumeric. It's just a state machine.
>>
> The difference is that in C, or for that matter, Minibasic, when you hit an
> alpha (or underscore), you know that you just have to read off all
> the subsequent alnums to get the token. Free form text can only appear
> inside quotes or comments. So if you hit a quote you gobble the string,
> and if you hit a comment you just pass through until you hit a "close
> comment" sequence (or in Minibasic, just skip the REM line).
>
> With XML it's not like that. Free form text is delineated by tags. But you've
> got to know whether you are in a tag or not to read it. So if you are to have
> a FREEFORMTEXT token, the lexer needs to know whether it has just parsed
> an element opening tag or not.

XML isn't HTML, but the things you're getting at are mostly common.

A sequence of characters that is not special can be treated as a token;
it's a repetition of an inverted regex character class.

A decade ago I made a tiny project called hc (HTML Cleaner). It parses
HTML and removes all tags that are not permitted, or certain disallowed
attributes of tags that are permitted.

The use case is this: allowing HTML e-mails into a Web-based mailing
list archive, with the HTML intact (but cleaned).

https://www.kylheku.com/cgit/hc/tree/hc.l

You can see that it's straight Lex rules.

The lexer has two exclusive states in addition to the initial one: ELM
and ATT.

The rule for matching a ream of text is one or more notspecial
characters:

{notspecial}+ { return tok_text; }

where that is defined as a negated character class; not any
of these characters.

notspecial [^"'<>/=& \t\n\r\v\t]

This stops at whitespace; which is returned as a different token.
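
For readers without Lex to hand, a rough hand-written C equivalent of the {notspecial}+ rule might look like the sketch below. It is not taken from hc.l; the name text_run is hypothetical, and the input is assumed to be a NUL-terminated string rather than Lex's own buffer.

```c
#include <stddef.h>
#include <string.h>

/* The same characters that the negated class above excludes. */
static const char special[] = "\"'<>/=& \t\n\r\v";

/* Length of the leading run of non-special characters in s.
   A result of 0 means no text token starts here. */
size_t text_run(const char *s)
{
    return strcspn(s, special);
}
```

A caller would emit a text token of that length when the result is nonzero, and otherwise fall through to the rules for whitespace and the special characters.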

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Get line number for a FILE *

<c4bf357e-66c6-4e2a-aa39-dd67bcd632ccn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27508&group=comp.lang.c#27508

X-Received: by 2002:ad4:5501:0:b0:63c:f853:c8a with SMTP id pz1-20020ad45501000000b0063cf8530c8amr57108qvb.6.1691835241085;
Sat, 12 Aug 2023 03:14:01 -0700 (PDT)
X-Received: by 2002:aa7:88c8:0:b0:687:4a62:f49 with SMTP id
k8-20020aa788c8000000b006874a620f49mr1895054pff.5.1691835240724; Sat, 12 Aug
2023 03:14:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!glou.org!news.glou.org!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 12 Aug 2023 03:14:00 -0700 (PDT)
In-Reply-To: <20230812001343.595@kylheku.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:868:d0a4:2df0:ec16;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:868:d0a4:2df0:ec16
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me> <7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com> <53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk> <bd3f35b4-fd47-451a-8d78-c9ab21075ba5n@googlegroups.com>
<20230812001343.595@kylheku.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c4bf357e-66c6-4e2a-aa39-dd67bcd632ccn@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 12 Aug 2023 10:14:01 +0000
Content-Type: text/plain; charset="UTF-8"

by: Malcolm McLean - Sat, 12 Aug 2023 10:14 UTC

On Saturday, 12 August 2023 at 08:26:29 UTC+1, Kaz Kylheku wrote:
> On 2023-08-12, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
> >> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> >>
> >> > The main problem with a lexer for XML is that the grammar specifies
> >> > "anything except a special character" for the data between nodes. So
> >> > either you have to have a special mode, so the lexer isn't really a
> >> > lexer any more, or the tokens have to be single characters anyway.
> >> No. It's almost universal to have such tokens. A C string is (to a
> >> first approximation) anything except a '"'. A C comment is anything up
> >> to a '*' followed by a '/'. In the most common kind of lexer, what you
> >> think of as "modes" are states in a finite-state machine recogniser for
> >> a regular language. For example, in C, when it sees a letter or '_' the
> >> lexer enters the ID "mode" and accepts characters until the first
> >> non-alphanumeric. It's just a state machine.
> >>
> > The difference is that in C, or for that matter, Minibasic, when you hit an
> > alpha (or underscore), you know that you just have to read off all
> > the subsequent alnums to get the token. Free form text can only appear
> > inside quotes or comments. So if you hit a quote you gobble the string,
> > and if you hit a comment you just pass through until you hit a "close
> > comment" sequence (or in Minibasic, just skip the REM line).
> >
> > With XML it's not like that. Free form text is delineated by tags. But you've
> > got to know whether you are in a tag or not to read it. So if you are to have
> > a FREEFORMTEXT token, the lexer needs to know whether it has just parsed
> > an element opening tag or not.
> XML isn't HTML, but the things you're getting at are mostly common.
>
> A sequence of characters that is not special can be treated as a token;
> it's a repetition of an inverted regex character class.
>
> A decade ago I made a tiny project called hc (HTML Cleaner). It parses
> HTML and removes all tags that are not permitted, or certain disallowed
> attributes of tags that are permitted.
>
> The use case is this: allowing HTML e-mails into a Web-based mailing
> list archive, with the HTML intact (but cleaned).
>
> https://www.kylheku.com/cgit/hc/tree/hc.l
>
> You can see that it's straight Lex rules.
>
> The lexer has two exclusive states in addition to the initial one: ELM
> and ATT.
>
> The rule for matching a ream of text is one or more notspecial
> characters:
>
> {notspecial}+ { return tok_text; }
>
>
> where that is defined as a negated character class; not any
> of these characters.
>
> notspecial [^"'<>/=& \t\n\r\v\t]
>
> This stops at whitespace; which is returned as a different token.
>
Your lexer hardcodes the HTML element and attribute names, which is of
course the core purpose of a lexer: to convert the keywords to single-value
tokens to make the grammar easier to write. But XML doesn't have any keywords,
just a few special tokens.

So I wrote the "vanilla" XML parser without a lexer. But whilst it is OK for Baby X
resource script files, which are very simple XML, it won't stand up to XML
in the wild. Also, it only reports success or fail. The benefit of a lexer is that
you can easily store the number of the line where you encounter a parse error,
which is more user-friendly.

I'm having a go at an XML parser mark 2, with a formal lexer. But the problem is
that this is legal XML:
<Text>Text</Text>
<Text attr="Text">More Text</Text>

Now you could say that "<{identifier pattern}>" matches the token "open tag", with
the value "Text". But that won't work for the second line. You could say that "<"
matches the token "start tag definition", but then "Text" has to match the token
"element name", and you will match it in the freeform text. You can of course get the
attribute value "Text" out relatively easily because it is a quoted string. The token is
"quoted string" with the value "Text".

If you've got a single character = 1 token lexer you can do it all with grammar rules,
however.

Re: Get line number for a FILE *

<355159e3-fb4f-4d20-89fc-19a1831acab5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27513&group=comp.lang.c#27513

X-Received: by 2002:a05:622a:5ce:b0:40f:fc40:87d8 with SMTP id d14-20020a05622a05ce00b0040ffc4087d8mr64544qtb.6.1691852512553;
Sat, 12 Aug 2023 08:01:52 -0700 (PDT)
X-Received: by 2002:a05:6a00:2488:b0:67d:41a8:3e19 with SMTP id
c8-20020a056a00248800b0067d41a83e19mr2101982pfv.3.1691852512272; Sat, 12 Aug
2023 08:01:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 12 Aug 2023 08:01:51 -0700 (PDT)
In-Reply-To: <878rahesel.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:a13e:e92d:ba00:c2b1;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:a13e:e92d:ba00:c2b1
References: <3f16229e-7043-4b52-a3bb-8b1428838a15n@googlegroups.com>
<ub2kn9$c6ds$2@dont-email.me> <7a7790f8-0913-4d42-9376-73e885cd45afn@googlegroups.com>
<20230810083924.456@kylheku.com> <53695cd3-146d-4288-9d84-f38e34bbf3cfn@googlegroups.com>
<878rahesel.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <355159e3-fb4f-4d20-89fc-19a1831acab5n@googlegroups.com>
Subject: Re: Get line number for a FILE *
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 12 Aug 2023 15:01:52 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3446

by: Malcolm McLean - Sat, 12 Aug 2023 15:01 UTC

On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > The main problem with a lexer for XML is that the grammar specifies
> > "anything except a special character" for the data between nodes. So
> > either you have to have a special mode, so the lexer isn't really a
> > lexer any more, or the tokens have to be single characters anyway.
> No. It's almost universal to have such tokens. A C string is (to a
> first approximation) anything except a '"'. A C comment is anything up
> to a '*' followed by a '/'. In the most common kind of lexer, what you
> think of as "modes" are states in a finite-state machine recogniser for
> a regular language. For example, in C, when it sees a letter or '_' the
> lexer enters the ID "mode" and accepts characters until the first
> non-alphanumeric. It's just a state machine.
>
I've more or less rewritten the XML parser. The old ad hoc logic has been abandoned
and the parser written entirely from scratch, on top of a single character lexer.
You might say it's not a lexer at all, but the advantages are huge. Obviously keeping
track of the line number is no longer a problem. It's much easier to add informative
parse errors. And it can be trivially toggled between file or string input, or UTF-16 (you'd
have to have a little UTF-16 to UTF-8 converter in the get character function). The
structure of the program is far better. It's much more maintainable.
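
A minimal sketch of such a get character function, counting lines as it goes, might look like this. It is an illustration under stated assumptions, not the actual Baby X code: struct charsrc and the charsrc_* functions are hypothetical names, string input is assumed to be NUL-terminated, and the UTF-16 conversion is left out.

```c
#include <stdio.h>

/* Hypothetical character source: a FILE * or an in-memory string. */
struct charsrc {
    FILE *fp;         /* used when non-NULL */
    const char *str;  /* used when fp is NULL */
    int lineno;       /* 1-based line of the next character to be read */
};

void charsrc_from_file(struct charsrc *cs, FILE *fp)
{
    cs->fp = fp;
    cs->str = "";
    cs->lineno = 1;
}

void charsrc_from_string(struct charsrc *cs, const char *str)
{
    cs->fp = NULL;
    cs->str = str;
    cs->lineno = 1;
}

/* Return the next character, or EOF at end of input.
   Every newline consumed bumps the line counter. */
int charsrc_get(struct charsrc *cs)
{
    int ch;

    if (cs->fp != NULL)
        ch = fgetc(cs->fp);
    else if (*cs->str != '\0')
        ch = (unsigned char)*cs->str++;
    else
        ch = EOF;

    if (ch == '\n')
        cs->lineno++;
    return ch;
}
```

When the parser hits an error it can report cs->lineno directly, with no need to rewind the file and count newlines.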

The only big disadvantage of a single character lexer is that the tokens "<?" to introduce
the metadata tag, and "<!--" to introduce a comment can't be encoded as single tokens.
So when you hit a "<" you have to match it and parse from the character following it,
which isn't as nice as parsing from the "<".
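
Parsing from the character after the "<" could be sketched roughly as below. The names here (enum markup, classify_after_lt) are hypothetical, the input is assumed to be a NUL-terminated string so that looking ahead is safe, and other "<!" constructs such as DOCTYPE are lumped into one catch-all kind.

```c
#include <string.h>

/* Hypothetical classification of what follows a '<'. */
enum markup {
    M_PI,         /* "<?"   metadata tag (declaration or processing instruction) */
    M_COMMENT,    /* "<!--" comment */
    M_DECL,       /* any other "<!" construct, not handled further here */
    M_CLOSE_TAG,  /* "</"   element closing tag */
    M_OPEN_TAG    /* anything else: element opening tag */
};

/* p points at the character after a '<' that has already been matched. */
enum markup classify_after_lt(const char *p)
{
    if (p[0] == '?')
        return M_PI;
    if (strncmp(p, "!--", 3) == 0)  /* strncmp stops at the NUL, so short input is safe */
        return M_COMMENT;
    if (p[0] == '!')
        return M_DECL;
    if (p[0] == '/')
        return M_CLOSE_TAG;
    return M_OPEN_TAG;
}
```

With FILE input the same decision needs either a small pushback buffer or one grammar rule per leading character, since ungetc only guarantees a single character of pushback.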

Quoted strings and identifiers are all recognised using grammar rules, rather than in the
lexer, which would be more conventional.

Sadly the new XML parser is not going to make it into v1.2 because it's too big a change
too close to release.

Subject	Author
Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	Bart
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	Kaz Kylheku
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	Ben Bacarisse
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	Ben Bacarisse
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	Kaz Kylheku
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	Kaz Kylheku
Re: Get line number for a FILE *	Malcolm McLean
Re: Get line number for a FILE *	Ben Bacarisse
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	fir
Re: Get line number for a FILE *	Scott Lurndal
Re: Get line number for a FILE *	Ben Bacarisse
Re: Get line number for a FILE *	Kaz Kylheku
Re: Get line number for a FILE *	Spiros Bousbouras