Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

To program is to be.


devel / comp.compilers / An ingenious scanning problem!

SubjectAuthor
* An ingenious scanning problem!Roger L Costello
`- RE: An ingenious scanning problem!Christopher F Clark

1
An ingenious scanning problem!

<22-05-053@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=408&group=comp.compilers#408

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: An ingenious scanning problem!
Date: Sat, 28 May 2022 19:03:09 +0000
Organization: Compilers Central
Lines: 69
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-053@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="42510"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 28 May 2022 15:36:40 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Roger L Costello - Sat, 28 May 2022 19:03 UTC

Hi Folks,

Page 115-116 of the Flex User's Manual [1] has an ingenious problem posed by
Jan Kort on 05 September 1998.

Here's a description of the problem:

The input contains this: TEST1\n

That is, the input contains the string TEST1 followed by a newline.

The Flex scanner contains a rule that matches on the input:

TEST1\n { action }

Here's a description of its action:

Of the text that was matched, I only
want the first 5 characters (i.e., TEST1).

That action is expressed in Flex this way:

TEST1\n { yyless(5); }

In the scanner there are also two newline rules.

One newline rule matches on a newline at the beginning of a line (i.e., empty
line):

^\n { action }

The other newline rule matches on a newline at the end of a line:

\n { action }

Question: After the action for TEST1\n is executed, which newline rule is
executed?

The answer depends on the meaning of yyless(). Here are two possible meanings
for yyless():

(a) The scanner is backing up in the input stream.
(b) The scanner is pushing new characters onto the input stream.

If the meaning of yyless() is (a), then the second newline rule will be
executed.

If the meaning of yyless() is (b), then the first newline rule will be
executed.

So which does Flex do?

.............................

Answer: (b)

Flex treats the newline from TEST1\n as a newline at the beginning of a line
(i.e., empty line). Flex executes this rule:

^\n { action }

Such an ingenious problem!

Comments?

/Roger

[1] https://epaperpress.com/lexandyacc/download/flex.pdf
[This is a good example of the reasons you don't want your lexer to be too clever. -John]

RE: An ingenious scanning problem!

<22-06-001@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=422&group=comp.compilers#422

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: RE: An ingenious scanning problem!
Date: Wed, 1 Jun 2022 13:34:22 +0300
Organization: Compilers Central
Lines: 91
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-001@comp.compilers>
References: <22-05-053@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="17443"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 01 Jun 2022 13:15:31 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Wed, 1 Jun 2022 10:34 UTC

I will skip most of Roger's posting prior to the question:

> Question: After the action for TEST1\n is executed, which newline rule is
> executed?
>
> The answer depends on the meaning of yyless(). Here are two possible meanings
> for yyless():
>
> (a) The scanner is backing up in the input stream.
> (b) The scanner is pushing new characters onto the input stream.

I find the Flex choice of "b" to be semantically incorrect and a bug.
because you cannot get functionality "a" if you do that. However, I
suspect that the bug is relatively intractable. It may be quite hard
to back up the input stream and recreate the proper lexer state,
especially since I suspect that newlines are not part of the lexer
state but an additional "widget" on the side.

Note, that an alternate solution is to have what PCRE calls lookahead
zero-width assertions. I think the original LEX had them for this
case.

That is, you don't use yyless for this case, but instead you write a rule like:

TEST1 / \n

where the "/" is an operator in PCRE it would look something like this

TEST1 (?=\n)

In either case, the \n is "peeked" at but not consumed as part of the
token and not otherwise processed at all. So, you don't have to
"backup" the stream as you never moved it forward over the \n. It
stays in the lookahead part of the input stream.

The other alternative is to not use yyless and simply deal with the \n
as part of the actions for the TEST1\n rule, making the rule do the
actions of both TEST1 and \n. Now in this case where it is just one
character and that character is also a token, that probably works ok.
However, if \n# (for example) is a special token, you have gotten
yourself into a mess as you probably need to make TEST1\n# a rule
also, and you can see where this is headed.

But this all goes to my point about not making the lexer complicated.
It is probably a flaw in the language design if you need lookahead
past the end of the token to determine what to tokenize it as. And,
note several designs by famous people in the compiler field have made
that mistake. Two examples come immediately to mind.

Yacc itself needs identifier: as a special token to deal with the fact
that semicolons are optional at the end of a rule. It makes the
language LR(2) otherwise. (In our (Compiler Resources) version of
Yacc++, we disallowed rules that don't end with semicolons.) Note, I
suspect most versions don't allow comments in between the identifier
and the colon (they probably don't even allow whitespace).

In Pascal 1..2 need lookahead to recognize that it is an int a ..
operator and another int rather than 1.2 a floating point ("real")
number. Better would have been to require spaces (as in 1 .. 2) in
that case.

Note, I even regret the way our version of Yacc++ handles code blocks
in braces. It is the same kind of quagmire. I cannot easily use
braces in other places because of it. I am looking at having to
develop an even more complicated scheme to make that work, making the
corner case even more convoluted.

But when I rail against hand-written recursive descent parsers and not
using tools and living within the limits of what can express simply,
this is the kind of errors I am referring to. Without the guardrails
of a tool telling you that your language design has gone off into the
reeds, you will make those kind of errors and your language will be
harder to evolve and have corner cases that you don't even know that
are there. Enough preaching. I will get off my soapbox.

Kind regards,
Chris

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
[Lex and flex both have trailing context. The manual says that if it
can tell at compile time that the text matched by trailing context is
fixed length, it's cheap, otherwise it's very expensive. It also has
REJECT in an action which tells it to pretend that the pattern didn't
match and try something else. That's also very expsnsive, and
potentially very confusing. -John]

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor