novaBBS - comp.lang.c - Re: Get line number for a FILE *

Re: Get line number for a FILE *

<237957a3-7a60-49a4-8376-df10fba035een@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27519&group=comp.lang.c#27519


by: fir - Sat, 12 Aug 2023 18:53 UTC

On Thursday, 10 August 2023 at 17:11:46 UTC+2, fir wrote:
> in my txt file 4 107 378 bytes
>
> which both your code and mine say has 39 052 lines
>
> your code with ftell in a loop takes 21 670 ms (when i set pos to end)
> your code with while (pos--) takes 319 ms
>
> my simple code based on the splitter takes a stable 13, though it does more work

btw, those fgetc calls look unusable overall if reading 4 MB with them takes 300 ms, compared to
fread, which probably takes 3 ms or so (my splitter not only freads but also applies two ifs per byte: a prepass to count newlines, a realloc, and then a pass to store the chunks in a dynamic array)... a slowdown of 0.3 seconds is imo generally not acceptable for anything - i personally would suggest optimising anything that takes more than 0.3 milliseconds or something like that
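
A minimal sketch of the two-pass approach described above, not the poster's actual splitter. The names slurp, split_lines and struct chunk are hypothetical; the sketch assumes the whole file fits in memory and that '\n' is the only line terminator. Here realloc is used only while reading; the line array is sized once from the prepass count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One line of the input: a pointer into the buffer plus a length
   (the newline itself is not included). */
struct chunk { const char *start; size_t len; };

/* Hypothetical helper: read a whole stream with fread in large blocks.
   Returns a malloc'd buffer and stores its size in *n; NULL on failure. */
char *slurp(FILE *fp, size_t *n)
{
    size_t cap = 1 << 16, len = 0, got;
    char *buf = malloc(cap), *tmp;

    if (buf == NULL)
        return NULL;
    while ((got = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += got;
        if (len == cap) {               /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    *n = len;
    return buf;
}

/* Hypothetical helper: split buf[0..n) into lines.
   Returns a malloc'd array and stores its length in *count;
   returns NULL on allocation failure. */
struct chunk *split_lines(const char *buf, size_t n, size_t *count)
{
    size_t lines = 0, i, k = 0, begin = 0;
    struct chunk *out;

    assert(buf != NULL && count != NULL);

    /* Prepass: count newlines; a trailing unterminated line counts too. */
    for (i = 0; i < n; i++)
        if (buf[i] == '\n')
            lines++;
    if (n > 0 && buf[n - 1] != '\n')
        lines++;

    out = malloc((lines ? lines : 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    /* Second pass: record where each line starts and how long it is. */
    for (i = 0; i < n; i++) {
        if (buf[i] == '\n') {
            out[k].start = buf + begin;
            out[k].len = i - begin;
            k++;
            begin = i + 1;
        }
    }
    if (begin < n) {                    /* last line without a newline */
        out[k].start = buf + begin;
        out[k].len = n - begin;
        k++;
    }
    *count = k;
    return out;
}
```

The chunks point into the buffer returned by slurp, so that buffer has to stay allocated for as long as the chunk array is in use.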

Re: Get line number for a FILE *

<87h6p3d8iy.fsf@bsb.me.uk>


https://www.novabbs.com/devel/article-flat.php?id=27525&group=comp.lang.c#27525


by: Ben Bacarisse - Sun, 13 Aug 2023 01:40 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>> > The main problem with a lexer for XML is that the grammar specifies
>> > "anything except a special character" for the data between nodes. So
>> > either you have to have a special mode, so the lexer isn't really a
>> > lexer any more, or the tokens have to be single characters anyway.
>> No. It's almost universal to have such tokens. A C string is (to a
>> first approximation) anything except a '"'. A C comment is anything up
>> to a '*' followed by a '/'. In the most common kind of lexer, what you
>> think of as "modes" are states in a finite-state machine recogniser for
>> a regular language. For example, in C, when it sees a letter or '_' the
>> lexer enters the ID "mode" and accepts characters until the first
>> non-alphanumeric. It's just a state machine.
>>
> I've more or less rewritten the XML parser. The old ad hoc logic has
> been abandoned and the parser written entirely from scratch, on top of
> a single character lexer. You might say it's not a lexer at all,

A lexer recognises tokens. There is nothing to stop these tokens being
single characters. That just pushes more work up into the parser.
Sometimes that is good, sometimes it's a mistake to pass-up the option
of doing the work in a more structured way.

That, however, was not my point but I don't see any point in restating
it.

> but
> the advantages are huge. Obviously keeping track of the line number is
> no longer a problem.

Nor is it with a lexer that recognises bigger tokens than one
character. I can't see why you consider that a problem unless the
tokens are single characters.
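
As a minimal sketch of that point (the names struct reader, rd_getc, rd_ungetc and next_word are hypothetical, and the identifier-like token is only an illustration): if the character source counts newlines, a lexer that returns multi-character tokens can stamp each token with the line it started on.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical reader: wraps a FILE * and counts lines as characters
   are consumed, so every layer above it can ask for the current line.
   Initialise line to 1 before the first read. */
struct reader { FILE *fp; int line; };

static int rd_getc(struct reader *r)
{
    int c = getc(r->fp);
    if (c == '\n')
        r->line++;
    return c;
}

static void rd_ungetc(struct reader *r, int c)
{
    if (c == EOF)
        return;
    if (c == '\n')
        r->line--;          /* undo the count for a pushed-back newline */
    ungetc(c, r->fp);
}

/* A multi-character token that remembers the line it started on. */
struct token { char text[64]; int line; };

/* Read the next identifier-like token (letters, digits, '_'),
   skipping anything else. Over-long tokens are truncated.
   Returns 1 on success, 0 at end of input. */
int next_word(struct reader *r, struct token *t)
{
    int c;
    size_t n = 0;

    assert(r != NULL && t != NULL);
    while ((c = rd_getc(r)) != EOF && !(isalnum(c) || c == '_'))
        ;
    if (c == EOF)
        return 0;
    t->line = r->line;      /* line of the token's first character */
    do {
        if (n + 1 < sizeof t->text)
            t->text[n++] = (char)c;
        c = rd_getc(r);
    } while (c != EOF && (isalnum(c) || c == '_'));
    rd_ungetc(r, c);        /* the terminating character is not ours */
    t->text[n] = '\0';
    return 1;
}
```

The parser never counts lines itself; it reads the line field of whichever token it is complaining about.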

> It's much easier to add informative parse
> errors. And it can be trivially toggled between file or string input,
> or UTF-16 (you'd have to have a little UTF-16 to UTF-8 converter in
> the get character function) The structure of the program is far
> better. It's much more maintainable.

Again, I can't see why. I've written parsers with highly structured
tokens and with single character tokens and I don't recall these
advantages attaching only to the latter.

> The only big disadvantage of a single character lexer is that the
> tokens "<?" to introduce the metadata tag, and "<!--" to introduce a
> comment can't be encoded as single tokens. So when you hit a "<" you
> have to match it and parse from the character following it, which
> isn't as nice as parsing from the "<".

Yes. In general, the simpler the tokens, the more look-ahead your
parser will need.
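
A small sketch of the look-ahead in question, assuming the input is a NUL-terminated string and names are ASCII only; classify and the MK_ constants are hypothetical names, not taken from any parser discussed here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum markup { MK_TEXT, MK_PI, MK_COMMENT, MK_CLOSE, MK_OPEN, MK_OTHER };

/* Classify the construct that starts at s. With single-character tokens
   the parser itself has to look past the '<' (up to three characters for
   "<!--") before it knows which rule applies. */
enum markup classify(const char *s)
{
    assert(s != NULL);
    if (s[0] != '<')
        return MK_TEXT;
    if (s[1] == '?')
        return MK_PI;                       /* "<?"   */
    if (strncmp(s + 1, "!--", 3) == 0)
        return MK_COMMENT;                  /* "<!--" */
    if (s[1] == '/')
        return MK_CLOSE;                    /* "</"   */
    if (isalpha((unsigned char)s[1]) || s[1] == '_' || s[1] == ':')
        return MK_OPEN;                     /* "<name" */
    return MK_OTHER;                        /* anything else after '<' */
}
```

A lexer with richer tokens would do the same comparisons internally and hand the parser a single token for each case.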

--
Ben.

Re: Get line number for a FILE *

<20230812194320.470@kylheku.com>


https://www.novabbs.com/devel/article-flat.php?id=27527&group=comp.lang.c#27527


by: Kaz Kylheku - Sun, 13 Aug 2023 02:54 UTC

On 2023-08-12, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> On Saturday, 12 August 2023 at 08:26:29 UTC+1, Kaz Kylheku wrote:
>> On 2023-08-12, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
>> > On Friday, 11 August 2023 at 12:21:38 UTC+1, Ben Bacarisse wrote:
>> >> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>> >>
>> >> > The main problem with a lexer for XML is that the grammar specifies
>> >> > "anything except a special character" for the data between nodes. So
>> >> > either you have to have a special mode, so the lexer isn't really a
>> >> > lexer any more, or the tokens have to be single characters anyway.
>> >> No. It's almost universal to have such tokens. A C string is (to a
>> >> first approximation) anything except a '"'. A C comment is anything up
>> >> to a '*' followed by a '/'. In the most common kind of lexer, what you
>> >> think of as "modes" are states in a finite-state machine recogniser for
>> >> a regular language. For example, in C, when it sees a letter or '_' the
>> >> lexer enters the ID "mode" and accepts characters until the first
>> >> non-alphanumeric. It's just a state machine.
>> >>
>> > The difference is that in C, or for that matter, Minibasic, when you hit an
>> > alpha (or underscore), you know that you just have to read off all
>> > the subsequent alnums to get the token. Free form text can only appear
>> > inside quotes or comments. So if you hit a quote you gobble the string,
>> > and if you hit a comment you just pass through until you hit a "close
>> > comment" sequence (or in Minibasic, just skip the REM line)
>> >
>> > With XML it's not like that. Free form text is delineated by tags. But you've
>> > got to know whether you are in a tag or not to read it. So if you are to have
>> > a FREEFORMTEXT token, the lexer needs to know whether it has just parsed
>> > an element opening tag or not.
>> XML isn't HTML, but the things you're getting at are mostly common.
>>
>> A sequence of characters that is not special can be treated as a token;
>> it's a repetition of an inverted regex character class.
>>
>> A decade ago I made a tiny project called hc (HTML Cleaner). It parses
>> HTML and removes all tags that are not permitted, or certain disallowed
>> attributes of tags that are permitted.
>>
>> The use case is this: allowing HTML e-mails into a Web-based mailing
>> list archive, with the HTML intact (but cleaned).
>>
>> https://www.kylheku.com/cgit/hc/tree/hc.l
>>
>> You can see that it's straight Lex rules.
>>
>> The lexer has two exclusive states in addition to the initial one: ELM
>> and ATT.
>>
>> The rule for matching a ream of text is one or more notspecial
>> characters:
>>
>> {notspecial}+ { return tok_text; }
>>
>>
>> where that is defined as a negated character class; not any
>> of these characters.
>>
>> notspecial [^"'<>/=& \t\n\r\v\t]
>>
>> This stops at whitespace; which is returned as a different token.
>>
> Your lexer hardcodes the HTML element and attribute names.

Yes it does; but it also has a catch all token rule for other elements
and attributes:

<ELM>{elname} { BEGIN(ATT); return tok_el_unknown; }

The numerous hard-coded elements and attributes are there because this
is designed for a very specific task: which is passing through only
allowed elements, and for each one, passing through only attributes
allowed for it.

Someone reusing this for a different purpose might want to throw away
those hard-coded tokens and just deal with a single token enumeration
for all elements.
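
For readers without Lex to hand, the quoted {notspecial}+ rule amounts to "take the longest run of characters that are not in the special set". A hand-written C equivalent can lean on strcspn. This is only a sketch: text_run and SPECIAL are hypothetical names, the set is copied from the quoted definition, and a NUL-terminated string stands in for the input stream.

```c
#include <assert.h>
#include <string.h>

/* The characters excluded by the quoted notspecial class. */
static const char SPECIAL[] = "\"'<>/=& \t\n\r\v";

/* Length of the longest prefix of s made only of non-special characters;
   0 means no text token starts here. */
size_t text_run(const char *s)
{
    assert(s != NULL);
    return strcspn(s, SPECIAL);
}
```

A caller would emit a text token of that length when the result is nonzero and otherwise fall through to the rules for whitespace and markup characters.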

> Which is of
> course the core purpose of a lexer. To convert the keywords to single value
> tokens to make the grammar easier to write.

In this program, there is in fact no grammar; it will not validate
that tags are matching and nesting. Just when it sees <foo bar="xyzzy">,
it will remove that entire tag if it is not whitelisted; and if it
happens to be whitelisted, but not allowed to have a bar attribute, it
will pass it through as <foo>.

> just a few special tokens.
> So I wrote the "vanilla" XML parser without a lexer. But whilst it is OK for Baby X
> resource script files, which are very simple XML, it won't stand up to XML
> in the wild. Also, it only reports success or fail. The benefit of a lexer is that
> you can easily store the number of the line where you encounter a parse error,
> which is more user-friendly.
>
> I'm having a go at an XML parser mark 2, with a formal lexer. But the problem is
> that this is legal XML:
> <Text>Text</Text>
> <Text attr="Text">More Text</Text>
>
> Now you could say that "<{identifier pattern}>" matches the token "open tag", with
> the value "Text". But that won't work for the second line. You could say that "<"
> matches the token "start tag definition", but then "Text" has to match the token
> "element name", and you will match it in the freeform text. You can of course get the
> attribute value "Text" out relatively easily because it is quoted string. The token is
> "quoted string" with the value "Text".

You need a rule that matches an opening tag, with optional attributes.

The parser can have a stack of states which informs it which tag is
currently open; when it sees a closing tag, it has to match.

I don't think it's much different from those Wirthian languages where a
PROCEDURE Foo must end with END Foo.
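
A minimal sketch of such a stack, with fixed limits chosen only to keep it short; struct tag_stack, tag_open and tag_close are hypothetical names. Set depth to 0 before use.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64
#define MAX_NAME  32

/* Hypothetical stack of currently open element names. */
struct tag_stack { char names[MAX_DEPTH][MAX_NAME]; int depth; };

/* Record an opening tag. Returns 0 if the name is too long or the
   nesting is too deep for this fixed-size sketch. */
int tag_open(struct tag_stack *st, const char *name)
{
    assert(st != NULL && name != NULL);
    if (st->depth >= MAX_DEPTH || strlen(name) >= MAX_NAME)
        return 0;
    strcpy(st->names[st->depth++], name);
    return 1;
}

/* Handle a closing tag: it must name the innermost open element,
   much as END Foo must match PROCEDURE Foo. Returns 0 on mismatch. */
int tag_close(struct tag_stack *st, const char *name)
{
    assert(st != NULL && name != NULL);
    if (st->depth == 0 || strcmp(st->names[st->depth - 1], name) != 0)
        return 0;
    st->depth--;
    return 1;
}
```

A parser would call tag_open for each opening tag the lexer reports, tag_close for each closing tag, and treat a zero return or a nonzero depth at end of input as a nesting error.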

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

No amount of genius can overcome a preoccupation with detail.
