Rocksolid Light



devel / comp.compilers / Re: Interesting paper on regex NFA matching

Subject                                               Author
* Re: Interesting paper on regex NFA matching         Christopher F Clark
`- Re: Interesting paper on regex NFA matching        Kaz Kylheku

Re: Interesting paper on regex NFA matching

https://www.novabbs.com/devel/article-flat.php?id=882&group=comp.compilers#882
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Mon, 29 Jan 2024 10:39:58 +0200
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-008@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Keywords: lex, NFA
Posted-Date: 29 Jan 2024 12:11:14 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

Although our esteemed moderator has already corrected this error:
> [Patterns with alternatives and trailing contexts do have to back up
> and the flex manual warns it's quite expensive. See the
> PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

I still want to write an extended reply to this, where
Kaz Kylheku wrote:

> I think that case can be handled by technology like Lex trailing
> contexts, which doesn't require backtracking.
>
> {digits}/.  {
>   // ... set up semantic value ...
>   return INTEGER;
> }
>
> {digits}.{digits}  {
>   // ... set up semantic value ...
>   return FLOAT;
> }
>
> Another possibility is to recognize {digits}. as one lexeme and call
> unput('.'). That's a minor form of backtracking, that doesn't require us
> to do anything funny to the regex implementation.

You may not recognize it as backtracking, but it is reading two characters
past the end of the integer token to disambiguate it. And while one character
of lookahead is generally "free", two characters require more extensive
buffering and a mechanism to reset the state.
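
As a sketch of that cost (in Python, with a hypothetical classify_number helper; an illustration, not how lex generates its scanner): after the digit run, the lexer must inspect both the "." and the character after it before it can commit to either rule.

```python
def classify_number(text):
    """Decide INTEGER vs FLOAT the way the quoted lex rules must:
    peek up to two characters past the end of the digit run."""
    i = 0
    while i < len(text) and text[i].isdigit():
        i += 1
    # Lookahead 1 (the ".") alone does not settle it;
    # lookahead 2 (a digit after the ".") is what makes it a FLOAT.
    if i + 1 < len(text) and text[i] == '.' and text[i + 1].isdigit():
        j = i + 1
        while j < len(text) and text[j].isdigit():
            j += 1
        return ('FLOAT', text[:j])
    # Otherwise back up: the "." and whatever followed are not ours.
    return ('INTEGER', text[:i])

print(classify_number('123.45'))  # ('FLOAT', '123.45')
print(classify_number('123.x'))   # ('INTEGER', '123')
```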

Thus, what LEX and flex are implementing is not strictly a pure
regular expression matcher, but one with a slight extension to what
can be recognized. While it is an extremely useful and appropriate
convention for the application area, it isn't strictly within the
bounds. The engine which runs such machines needs to have the hooks to
implement that extension.

One can prove this by using the hierarchy of language families where
regular expressions are a subset of LL(1) grammars which are a subset
of LR(1) grammars. If you try to write an LL(1) grammar for that, you
will find yourself unable to construct a proper FIRST set that disambiguates
the two rules. Similarly, if one attempts to make an LR grammar for it
using an algorithm that first builds an LR(0) machine, one will see that it
has a shift/reduce conflict on the "." character, one which cannot be
disambiguated by a single character of lookahead.

Thus, what is implemented in LEX is essentially "longest, lexically first"
matching: the lexer saves the last state in which a token was accepted
and continues attempting to match; on failure it backs up to the last
recognized match (and then applies the lexically first rule that caused
that match).
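
That arbitration can be sketched as follows (a Python illustration; the re module stands in for the automaton, and the RULES table is hypothetical): every rule is tried at the current position, the longest match wins, and a tie goes to the rule listed first.

```python
import re

# Hypothetical rule table; order matters only to break length ties.
RULES = [
    ('FLOAT',   re.compile(r'\d+\.\d+')),
    ('INTEGER', re.compile(r'\d+')),
    ('DOT',     re.compile(r'\.')),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (length, name, lexeme)
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Longest match wins; on equal length, the earlier rule
            # wins because we only replace on a strictly longer match.
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise ValueError(f'no rule matches at position {pos}')
        tokens.append((best[1], best[2]))
        pos += best[0]
    return tokens

print(tokenize('123.45.7'))
# [('FLOAT', '123.45'), ('DOT', '.'), ('INTEGER', '7')]
```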

The only difference between this and what PEG (and PCRE) mechanisms
do is that they implement "lexical first, longest" match. That reorders the
set of matches that are considered. As the paper you dismissed notes,
that can be compared to depth-first versus breadth-first search of the
tree of potential matches. Both strategies have their advantages: areas
where they are the proper choice and places where they are the wrong
choice.

If one finds oneself writing "trailing context" or "unput" in one's lexical
specification, one is using lookahead. And sometimes in lexers one
uses lookahead without even noticing it, as LEX doesn't emit warnings
when it does so. (We had to disable such warnings in our implementation
to get it to approximate LEX behavior.)

-------

One final note: language classes are determined by writing a grammar
which accepts (not lexes or parses) the given language. That is, one
can write a grammar that doesn't disambiguate the cases (i.e. combines
them into one case), but then one hasn't correctly tokenized the input.
I don't know of an automata theory (or compiler) book that mentions
this distinction, much less highlights its implications, so I will
forgive not knowing it, unless one makes an argument that depends upon
it, which you did.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: Interesting paper on regex NFA matching

https://www.novabbs.com/devel/article-flat.php?id=883&group=comp.compilers#883
From: 433-929-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: Interesting paper on regex NFA matching
Date: Mon, 29 Jan 2024 19:28:02 -0000
Organization: Compilers Central
Sender: johnl%iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <24-01-009@comp.compilers>
References: <24-01-008@comp.compilers>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Keywords: lex, performance
Posted-Date: 29 Jan 2024 15:57:01 EST

On 2024-01-29, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> Although, our esteemed moderator already corrected this error:
>> [Patterns with alternatives and trailing contexts do have to back up
>> and the flex manual warns it's quite expensive. See the
>> PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
>
> I still want to write an extended reply to this, where
> Kaz Kylheku wrote:
>
>> I think that case can be handled by technology like Lex trailing
>> contexts, which doesn't require backtracking.
>>
>> {digits}/.  {
>>   // ... set up semantic value ...
>>   return INTEGER;
>> }
>>
>> {digits}.{digits}  {
>>   // ... set up semantic value ...
>>   return FLOAT;
>> }
>>
>> Another possibility is to recognize {digits}. as one lexeme and call
>> unput('.'). That's a minor form of backtracking, that doesn't require us
>> to do anything funny to the regex implementation.
>
> You may not recognize it as backtracking, but it is reading two characters
> past the end of the integer token to disambiguate it.

But some backtracking is required for recognizing the longest matching
prefix of the input, even if we use nothing but basic regex features!

If we consider the trivial pattern (a*b)+:

Given the input "aaabaaabaaac", the prefix which constitutes a token
is "aaabaaab".

In order to extract that token, the machine will have to consume
symbols all the way through the "aaac" which is excluded from the
lexeme. Only at the "c" character does it hit the situation that no
transition is available, and so the matching terminates. Then it has to
return to the most recent position in the input where the automaton had
been in an acceptance state.
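
A minimal sketch of that mechanism (in Python; the two-state table is the DFA for (a*b)+ over {a, b}): the matcher records the most recent position at which it was in an accepting state and, once no transition is available, retraces to it.

```python
# DFA for (a*b)+: state A is the start, state B (just completed an
# (a*b) group) is the only accepting state.
TRANS = {('A', 'a'): 'A', ('A', 'b'): 'B',
         ('B', 'a'): 'A', ('B', 'b'): 'B'}
ACCEPT = {'B'}

def longest_prefix(text):
    state, last_accept = 'A', -1
    for i, ch in enumerate(text):
        state = TRANS.get((state, ch))
        if state is None:          # no transition: stop consuming
            break
        if state in ACCEPT:        # remember the last good position
            last_accept = i + 1
    # The single retrace: back to the most recent acceptance point.
    return text[:last_accept] if last_accept >= 0 else None

print(longest_prefix('aaabaaabaaac'))  # 'aaabaaab'
```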

It is not the same as the backtracking of a graph/tree search algorithm,
which returns to prior points to try alternatives, due to doing a
depth-first search.

At the point where we back up in the input, we are not trying
different branches of a depth-first-search. We have completed
a depth-first-search!

One input retrace at the conclusion of a successful search is not
qualitatively the same situation as multiple, cascaded retraces
integrated into the search algorithm.

> Thus, what is implemented in LEX is essentially "longest, lexically first"
> matching. Where the lexer saves the last state where a token was accepted
> and continues attempting to match, and on failure backs up to the last
> recognized match (and then applies the rule which is lexically first that
> caused that match).

Yes; that is required in any regex matching function that searches
for a prefix or infix.

The fundamental automaton's job is to recognize strings: does a given
input string belong to the language (the set of strings) described by
that automaton/regex?

In lexical analysis, we solve the problem of finding the longest prefix
of the input which belongs to that set of strings; the obvious way to do
so is to match as far as we can and then return to the most recent
good position.

> The only difference between this and what PEG (and PCRE) mechanisms
> do is that they implement "lexical first, longest" match.

I think when you say "lexical first", you might be referring to the
rules themselves: rules that are lexically earlier in the grammar
"win" over later rules.

(In Lex, that still matters. If two rules match tokens of exactly
the same length, the lexically earlier rule gets the token.)

This "lexical first, longest" is why the order of R1|R2|..|RN matters
in some so-called regex implementations.

Unfortunately, as far as regexes go, that is broken.

It may still be a language describing a regular set, or at least
in some cases, but the regex language for doing that requires the
disjunction operator to be commutative.

This is just historically so, but with justification.

The people who invented regular expressions decades ago, as far back as
the 1960s and earlier, and who wrote the papers about them, decided on a
commutative disjunction, for good reasons.

A commutative disjunction is necessary in order to have a direct
relationship between sets and regular expressions.

If R1 describes a set of strings S1, and R2 describes a set of strings
S2, then R1|R2 describes S1 ∪ S2.

Because ∪ commutes, we want | to commute, since it means the same thing.

If | doesn't commute, then it doesn't correspond to the ∪ operator. The
relationship to sets has been severed or muddled.
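
Python's re module, whose alternation is ordered in the PCRE style, happens to illustrate the point: swapping the branches of | changes which match is preferred, even though the set of whole strings the two patterns accept is identical.

```python
import re

# Ordered choice: the left branch is tried first, so 'a|ab'
# prefers the shorter match at the same starting position.
print(re.match(r'a|ab', 'ab').group())   # 'a'
print(re.match(r'ab|a', 'ab').group())   # 'ab'

# Yet as descriptions of sets the two patterns are the same language:
for s in ('a', 'ab', 'b'):
    assert bool(re.fullmatch(r'a|ab', s)) == bool(re.fullmatch(r'ab|a', s))
```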

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
