
Subject                                                           Author
* Has lexing and parsing theory advanced since the 1970's?        Roger L Costello
+- Re: Has lexing and parsing theory advanced since the 1970's?   Anton Ertl
+- Re: Has lexing and parsing theory advanced since the 1970's?   Kaz Kylheku
`- Re: Has lexing and parsing theory advanced since the 1970's?   Ev Drikos

Has lexing and parsing theory advanced since the 1970's?

https://www.novabbs.com/devel/article-flat.php?id=84&group=comp.compilers#84

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: Has lexing and parsing theory advanced since the 1970's?
Date: Tue, 14 Sep 2021 13:16:01 +0000
Organization: Compilers Central
Lines: 35
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <21-09-008@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="98714"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history, question
Posted-Date: 16 Sep 2021 12:56:21 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Accept-Language: en-US
Content-Language: en-US

Lately I have been learning to use Flex & Bison. As I understand it, Flex &
Bison (and other parser generators) are built on solid theory. As a result,
those parser generators were created quickly and cheaply using structured,
well-known techniques. Contrast with parsers developed prior to the theory:
The earliest parsers back in the 1950s used utterly ad hoc techniques to
analyze the syntax of the source code of the programs they were parsing. During
the 1960s, the field got a lot of academic attention and by the early 1970s,
parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
others put parsing techniques solidly on their theoretical feet.

One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
to separate lexing from parsing. Lexing built upon regular expressions, which
built upon Finite Automata (FA) theory and Nondeterministic Finite Automata
(NFA) theory. FA and NFA were brilliantly melded together with the famous
Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context
Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated.
Neat!
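The NFA-to-DFA step at the heart of that theory (the subset construction a lexer generator performs internally) can be sketched in Python. This is a toy illustration over a hand-written NFA, not flex's actual machinery:

```python
from collections import deque

def nfa_to_dfa(alphabet, delta, start, accept):
    """Subset construction. delta maps (state, symbol) -> set of
    successor states; epsilon moves are assumed already eliminated."""
    start_set = frozenset([start])
    dfa_delta, seen, work = {}, {start_set}, deque([start_set])
    while work:
        cur = work.popleft()
        for a in alphabet:
            # DFA state = set of all NFA states reachable on symbol a.
            nxt = frozenset(t for s in cur for t in delta.get((s, a), ()))
            dfa_delta[(cur, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    accepting = {s for s in seen if s & accept}
    return dfa_delta, start_set, accepting

def accepts(dfa_delta, start, accepting, word):
    state = start
    for a in word:
        state = dfa_delta[(state, a)]
    return state in accepting

# NFA over {a, b} for "strings ending in ab": state 0 loops on anything
# and nondeterministically guesses the final "a".
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dd, s0, acc = nfa_to_dfa("ab", nfa, 0, {2})
```

The resulting DFA is deterministic, so recognition is a single table lookup per input character, which is what makes generated scanners fast.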

That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
1970's? If yes, are there parser generators available today which are based on
those advances in lexing/parsing theory? Or does Flex & Bison still represent
the state-of-the-art in terms of the underlying theory it uses?

[Having been there in the 1970s I can report that the linguists and the computer
scientists were dimly aware of each other but didn't work together and used
different terms, e.g., linguists say phrase structure grammars where we would
say CFG.
Flex is a rewrite of lex which was a Bell Labs summer project to make
it easier to turn regular expressions into DFAs (not NFAs) by Eric
Schmidt. The code was awful, flex was a cleaner rewrite that avoided the
AT&T software license. Bison was originally a GPL rewrite of yacc but it
has since added reentrant parsers and GLR parsing for ambiguous grammars.
They knew about GLR in the 1970s, but Yacc had to run on a 16 bit PDP-11
and the data structures were too big. -John]

Re: Has lexing and parsing theory advanced since the 1970's?

https://www.novabbs.com/devel/article-flat.php?id=85&group=comp.compilers#85

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: Has lexing and parsing theory advanced since the 1970's?
Date: Thu, 16 Sep 2021 17:09:02 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 29
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <21-09-009@comp.compilers>
References: <21-09-008@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="9739"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history, code
Posted-Date: 16 Sep 2021 13:56:54 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

Roger L Costello <costello@mitre.org> writes:
>That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
>1970s? If yes, are there parser generators available today which are based on
>those advances in lexing/parsing theory? Or does Flex & Bison still represent
>the state-of-the-art in terms of the underlying theory it uses?

It seems to me that after the success of BNF (context-free grammars)
with Algol 60 people wanted to go further, and Algol 68 used a Van
Wijngaarden grammar (two-level grammars). But this formalism has not
been used for describing and implementing later languages, and even
Algol 60 put more in the grammar (in particular, the type difference
between flags and numbers) than later languages. Most languages use
just BNF and describe the rest of the language in prose, a few use a
formal semantics, but I guess that's not what the question was about.
So on the language definition side, there have been no advances in
grammar formalism, because none are needed.

On the implementation side, there have been many attribute grammar
generators (e.g., Ox), but few uses. In the back end various more or
less formal techniques have been used for instruction selection; e.g.,
lcc uses the tree-parser generator lburg.
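As a loose illustration of the tree-parsing idea (toy IR, invented opcode names, and a greedy maximal-munch match rather than the dynamic programming lburg actually does):

```python
def select(node):
    """Tile a tiny expression tree with toy instructions.
    node: ('const', n) | ('reg', name) | ('add', left, right)."""
    op = node[0]
    if op == "const":
        return [f"li r, {node[1]}"]        # load immediate
    if op == "reg":
        return []                          # already in a register
    if op == "add":
        left, right = node[1], node[2]
        if right[0] == "const":
            # Larger tile: fold the constant into an add-immediate.
            return select(left) + [f"addi r, {right[1]}"]
        return select(left) + select(right) + ["add r"]
    raise ValueError(f"unknown node kind: {op}")
```

The point is that instruction selection becomes pattern matching over trees, with bigger patterns (like add-immediate) preferred when they apply.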

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Re: Has lexing and parsing theory advanced since the 1970's?

https://www.novabbs.com/devel/article-flat.php?id=86&group=comp.compilers#86

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: Has lexing and parsing theory advanced since the 1970's?
Date: Fri, 17 Sep 2021 05:51:32 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 139
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <21-09-010@comp.compilers>
References: <21-09-008@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="96248"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history, comment
Posted-Date: 17 Sep 2021 11:37:26 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

On 2021-09-14, Roger L Costello <costello@mitre.org> wrote:
> Lately I have been learning to use Flex & Bison. As I understand it, Flex &
> Bison (and other parser generators) are built on solid theory. As a result,
> those parser generators were created quickly and cheaply using structured,
> well-known techniques. Contrast with parsers developed prior to the theory:
> The earliest parsers back in the 1950s used utterly ad hoc techniques to
> analyze the syntax of the source code of the programs they were parsing. During
> the 1960s, the field got a lot of academic attention and by the early 1970s,
> parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
> others put parsing techniques solidly on their theoretical feet.
>
> One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
> to separate lexing from parsing.

This is for one particular pragmatic reason. The LALR(1) parsing
technique uses one token of lookahead to make key parsing decisions.

If you combine lexing with parsing, then some of your original lookahead
symbols may turn into sequences of symbols, which share the same initial
element.

E.g. suppose you have a "++" token and a "+=" token, either of which can
appear as the lookahead symbol that determines which way to reduce.

If we don't separate lexing and parsing, such that we have a
character-level grammar, then we no longer have a "++" and "+="
token. Each of these is two grammar symbols.

And so in that same reduce situation, we no longer have two different
lookahead tokens for "++" and "+=". The lookahead symbol is just "+".

Fixing this requires techniques like multiple symbols of lookahead:
LR(k). (Not that that is new or anything.)
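A minimal Python sketch of the point (the token rules are invented for illustration): with a separate lexer, one unit of lookahead is a whole token, but at the character level both operators look like "+":

```python
import re

# One lexer rule set: longest matches "++" and "+=" tried before bare "+".
TOKEN_RE = re.compile(r"\+\+|\+=|\+|\w+|\s+")

def tokens(src):
    """Tokenize, discarding whitespace."""
    return [m.group(0) for m in TOKEN_RE.finditer(src)
            if not m.group(0).isspace()]

def token_lookahead(src):
    return tokens(src)[0]   # one *token* of lookahead

def char_lookahead(src):
    return src[0]           # one *character* of lookahead
```

`token_lookahead` distinguishes "++x" from "+=x"; `char_lookahead` sees "+" for both, which is exactly the decision a character-level LALR(1) parser cannot make.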

Separating out lexing also permits certain efficiencies. A dedicated
lexical analyzer can recognize tokens "blindingly" fast, using buffering
techniques integrated with the recognizer.

> Lexing built upon regular expressions, which
> built upon Finite Automata (FA) theory and Nondeterministic Finite Automata
> (NFA) theory. FA and NFA were brilliantly melded together with the famous
> Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context
> Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated.
> Neat!

However, LR parsing is built on regular expressions again, with
some sophistication thrown in to handle nesting.

You know from ad hoc programming with regexes that you can handle some
forms of nesting if you invoke a regex more than once on a string.

For instance, balanced parentheses can be recognized by a regex-driven
algorithm: use a regex to find every instance of an opening parenthesis,
followed by non-parenthesis characters [^()], followed by a closing
parenthesis; remove each such match from the input; and iterate until
nothing changes.
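That iterated-regex scheme, sketched in Python:

```python
import re

INNER = re.compile(r"\([^()]*\)")   # an innermost (...) group

def balanced(s):
    """Repeatedly delete innermost groups; the input was balanced
    iff no parenthesis survives at the fixpoint."""
    prev = None
    while prev != s:
        prev, s = s, INNER.sub("", s)
    return "(" not in s and ")" not in s
```

Each pass plays the role of a "reduce": an inner group is rewritten away, exposing the next level of nesting to the same regular recognizer.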

That sort of gives us an intuition behind LR parsing. At the core of an
LR parser is a state machine, and if we ignore its push-down aspects --
how its actions maintain a stack -- it's just a finite state machine
that can be described by a regex. The terminating states of the regex
basically rewrite something in the stack of symbols, or push something
onto it, and then choose some state to jump to and iterate.
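The degenerate case makes the intuition concrete: a shift-reduce recognizer for balanced parentheses, where the "table" is trivial (a hand-written sketch, not generated LR machinery):

```python
def shift_reduce_balanced(s):
    """Recognize the toy grammar S -> ( S ) S | empty."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)          # shift
        elif ch == ")":
            if not stack:
                return False          # nothing to reduce against
            stack.pop()               # reduce: S -> ( S )
        else:
            return False              # toy grammar: parentheses only
    return not stack                  # accept iff every shift was reduced
```

A real LR parser is the same shape: a state machine driving shift and reduce actions against a stack, just with a much richer table.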

> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970's? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?

Firstly, Flex and Bison have enough power in them to handle
state-of-the-art *languages*.

Parsing theory has long outstripped the practical realm.

Loosely speaking, if you design a computer language whose syntax is so
complicated, with such difficult ambiguities, that it benefits from being
parsed by some technique only available from a 2020-dated paper, it's
probably an awful language from a syntactic point of view, which
ill serves its users in that regard.

(Look at C++, by the way; it's famous for having a large, awful grammar.
Yet, the GNU C++ implementation is a hand-maintained recursive-descent
parser that's about a megabyte and a half of C++ code, all in one file.
So for all the complexity, C++ can be handled using approaches that take
advantage of approximately zero advances in parsing since 1965.)

Now Bison is actively developed. There are avenues of improvement in
dimensions other than parsing theory. For one thing, Bison does have
some theoretical muscle in it, in the way of handling GLR grammars.

There are other pragmatic improvements in it like support for push
parsers, and re-entrant parsers.
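The push-parser idea inverts control: the caller feeds tokens in one at a time, instead of the parser calling out to the lexer, which suits event loops and incremental input. A toy Python sketch of the interface shape only, not Bison's generated code:

```python
class PushParenParser:
    """Caller pushes tokens one at a time; the parser keeps its own state
    between calls (toy grammar: balanced parentheses only)."""
    def __init__(self):
        self.depth = 0
        self.ok = True

    def push(self, tok):
        if tok == "(":
            self.depth += 1
        elif tok == ")":
            self.depth -= 1
            if self.depth < 0:
                self.ok = False      # closed more than we opened
        else:
            self.ok = False          # unknown token in this toy grammar

    def done(self):
        """Call at end of input: was the whole stream well formed?"""
        return self.ok and self.depth == 0
```

The design point is that all parser state lives in the object, so parsing can be suspended and resumed at any token boundary.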

It has support for multiple programming languages. Its C++ support goes
as far as supporting AST nodes that are objects which can have
destructors.

Fairly recently, support was added to Bison for multiple start symbols
in the same grammar, so you can parse just a subset of a language.

Another thing I believe I saw recently on the mailing list is that
contributions were merged into Bison which allow it to generate
counterexamples for ambiguities, so that it's easier to debug bad
grammars. Previous implementations of Yacc-like parsers just give you
diagnostics like "state 123: reduce-reduce conflict" whose root cause
can be hard to unravel.

> [Having been there in the 1970s I can report that the linguists and the computer
> scientists were dimly aware of each other but didn't work together and used
> different terms, e.g., linguists say phrase structure grammars where we would
> say CFG.
> Flex is a rewrite of lex which was a Bell Labs summer project to make
> it easier to turn regular expressions into DFAs (not NFAs) by Eric
> Schmidt. The code was awful, flex was a cleaner rewrite that avoided the
> AT&T software license. Bison was originally a GPL rewrite of yacc but it
> has since added reentrant parsers and GLR parsing for ambiguous grammars.
> They knew about GLR in the 1970s, but Yacc had to run on a 16 bit PDP-11
> and the data structures were too big. -John]

Bison was originally written by the same person, Robert Corbett, who
went on to write Berkeley Yacc.

https://en.wikipedia.org/wiki/Berkeley_Yacc

Berkeley Yacc is far less actively maintained than Bison and has fewer
features. It does support reentrant parsing in a way that is compatible
with Bison circa version 2.5.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[No coincidence that it was the same guy, since bison was a fork of
byacc.
The lookahead example of ++ vs += isn't very good, since an LALR
parser only needs to look ahead when it reduces, but in general you're
right, it'd be a problem. A much worse problem in combined parsers is
getting rid of noise like comments and white space. In a separate lexer
you can easily discard them between tokens, in a combined lex/parse you
have to include them everywhere they can occur. -John]
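John's point about discarding noise can be sketched in Python: whitespace and comments are dropped once, inside the lexer, and so never appear anywhere in the grammar (toy token set, with '#'-style comments assumed purely for illustration):

```python
import re

TOKEN = re.compile(r"""
      (?P<skip>\s+|\#[^\n]*)     # whitespace and '#' comments: discarded
    | (?P<num>\d+)
    | (?P<id>[A-Za-z_]\w*)
    | (?P<op>[+\-*/=()])
""", re.VERBOSE)

def lex(src):
    """Return (kind, text) pairs, silently dropping 'skip' matches."""
    out, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            out.append((m.lastgroup, m.group(0)))
    return out
```

A character-level grammar would instead need an optional-noise symbol threaded between every pair of grammar symbols.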

Re: Has lexing and parsing theory advanced since the 1970's?

https://www.novabbs.com/devel/article-flat.php?id=90&group=comp.compilers#90

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: driko...@gmail.com (Ev Drikos)
Newsgroups: comp.compilers
Subject: Re: Has lexing and parsing theory advanced since the 1970's?
Date: Wed, 29 Sep 2021 05:07:52 -0700 (PDT)
Organization: Compilers Central
Lines: 34
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <21-09-015@comp.compilers>
References: <21-09-008@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="74987"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history
Posted-Date: 29 Sep 2021 17:10:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <21-09-008@comp.compilers>

On Thursday, September 16, 2021 at 7:56:25 PM UTC+3, Roger L Costello wrote:
>
> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?

Hello,

The routines that recognize tokens are still called scanners and those
that parse the input are still called parsers. That said, this long
list may give a clue as to what the answer to your last question might be:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators

Yet, I haven't used most of them. So, I'll give you an example with
Syntaxis, a tool I've coded that isn't included in the above list.

If we try to parse this erroneous Fortran line with the command 'fcheck'
(binary available at https://github.com/drikosev/Fortran) we see that
the expected tokens in the error message contain both a '=' and a name:

program ? ; end

Note that the command 'fcheck' uses a deterministic parser (built by
Syntaxis) and the expected tokens in an error message are pre-computed.
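A hypothetical sketch of the pre-computed expected-token idea (toy states and token kinds, not Syntaxis itself): each state's table row doubles as its "expected" set, which is why the message for 'program ? ; end' can list both '=' and a name:

```python
# Toy table: state -> {shiftable token kind: next state}. In a
# Fortran-like language 'program' may start a program unit, or be an
# ordinary variable, so after it either a name or '=' is acceptable.
TABLE = {
    "start": {"program": "after_program"},
    "after_program": {"name": "named", "=": "assign_rhs"},
}

def parse(kinds):
    state = "start"
    for kind in kinds:
        row = TABLE.get(state, {})
        if kind not in row:
            # The expected set is just the row's keys: pre-computed,
            # no searching at error time.
            raise SyntaxError("unexpected %r; expected one of: %s"
                              % (kind, ", ".join(sorted(row))))
        state = row[kind]
    return state
```

Because the expected set falls straight out of the table, the error message costs nothing extra to produce.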

To my knowledge, the ability of a parser to shift two distinct terminals
simultaneously in one transition isn't an advancement in theory, but I
guess several tools mentioned in the Wikipedia list above provide
similar or better goodies (e.g., Chris Clark described an ANTLR4
feature that advances theory directly).

Regards,
Ev. Drikos
