Message-ID:

An optimist believes we live in the best world possible; a pessimist fears this is true.

devel / comp.compilers / Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-023@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=443&group=comp.compilers#443

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Thu, 9 Jun 2022 14:52:59 +0000
Organization: Compilers Central
Lines: 35
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="68394"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, question, design
Posted-Date: 09 Jun 2022 12:50:47 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Accept-Language: en-US
Content-Language: en-US

by: Roger L Costello - Thu, 9 Jun 2022 14:52 UTC

Hi Folks,

Page 84-85 of the dragon book [1] says:

There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing.

1. Simpler design is perhaps the most important consideration. The separation
of lexical analysis from syntax analysis often allows us to simplify one or
the other of these phases. For example, a parser embodying the conventions for
comments and white space is significantly more complex than one that can
assume comments and white space have already been removed by a lexical
analyzer. It we are designing a new language, separating the lexical and
syntactic conventions can lead to a cleaner overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to
construct a specialized and potentially more efficient processor for the task.
A large amount of time is spent reading the source program and partitioning it
into tokens. Specialized buffering techniques for reading input characters and
processing tokens can significantly speed up the performance of a compiler.

3. Compiler portability is enhanced. Input alphabet peculiarities and other
device-specific anomalies can be restricted to the lexical analyzer. The
representation of special or non-standard symbols, such as the up-arrow in
Pascal, can be isolated in the lexical analyzer.

Specialized tools have been designed to help automate the construction of
lexical analyzers and parsers when they are separated.
----------------------------
Those seem like compelling reasons for separating the lexical analysis from
parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

/Roger

[1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-027@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=445&group=comp.compilers#445

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: alexfrun...@gmail.com (Alexei A. Frounze)
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Thu, 9 Jun 2022 18:07:46 -0700 (PDT)
Organization: Compilers Central
Lines: 30
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-027@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="49582"; mail-complaints-to="abuse@iecc.com"
Keywords: design, performance, comment
Posted-Date: 09 Jun 2022 21:15:54 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-023@comp.compilers>

by: Alexei A. Frounze - Fri, 10 Jun 2022 01:07 UTC

On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote:
> 2. ...
> A large amount of time is spent reading the source program and partitioning it
> into tokens. Specialized buffering techniques for reading input characters and
> processing tokens can significantly speed up the performance of a compiler.

In any decent compiler this amount of time is relatively small compared to optimizations.
Unless we're talking about JIT, which is a different kind of compiling.

> 3. Compiler portability is enhanced. Input alphabet peculiarities and other
> device-specific anomalies can be restricted to the lexical analyzer. The
> representation of special or non-standard symbols, such as the up-arrow in
> Pascal, can be isolated in the lexical analyzer.

ASCII is supported out of the box nowadays. Unicode is available and simply
parsing and storing Unicode code points is straightforward (processing
Unicode text, especially displaying, is problematic, but that should hardly
affect a programming language or its compiler much).

> Those seem like compelling reasons for separating the lexical analysis from
> parsing,
....
> [1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

The book is old and doesn't quite reflect the current state of things, IMO.

Alex
[My impression is that the lexer can often take significant time since it has to look
at every character in the input, but the parser is fast unless you're doing something
strange like very ambiguous Earley parsing. -John]

Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-030@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=447&group=comp.compilers#447

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Thu, 9 Jun 2022 23:01:04 -0700 (PDT)
Organization: Compilers Central
Lines: 29
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-030@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="88723"; mail-complaints-to="abuse@iecc.com"
Keywords: design, history
Posted-Date: 10 Jun 2022 11:41:35 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-023@comp.compilers>

by: gah4 - Fri, 10 Jun 2022 06:01 UTC

On Thursday, June 9, 2022 at 9:50:50 AM UTC-7, Roger L Costello wrote:

> Page 84-85 of the dragon book [1] says:

> There are several reasons for separating the analysis phase of compiling into
> lexical analysis and parsing.

OK, no-one else has said it yet, so I will.

With small memories and slow computers 70 or 60 years ago, the time
and space were important.

Then we had microcomputers, where we had to relearn everything learned
on mainframes years before, so that was 50 or 40 years ago.

But by 30 or 20 years ago, memories were big and computers fast.
While programs needing compiling have also gotten larger, I am not
convinced that either speed or size is much of a problem now.

The separation of tasks, modular programming, often makes sense,
so it is likely still reasonable to keep them separate. But I don't
believe because of space or size.

(Well, maybe running compilers on an Arduino Nano still have
a size or space limit. But most don't do that.)

There is still the old saying:

"premature optimization is the root of all evil"

Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-032@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=448&group=comp.compilers#448

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Fri, 10 Jun 2022 12:26:38 +0200
Organization: Compilers Central
Lines: 28
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-032@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="89359"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, design, comment
Posted-Date: 10 Jun 2022 11:43:47 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-023@comp.compilers>

by: Hans-Peter Diettrich - Fri, 10 Jun 2022 10:26 UTC

On 6/9/22 4:52 PM, Roger L Costello wrote:

> Those seem like compelling reasons for separating the lexical analysis from
> parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?

I cannot answer the ANTLR question but want to point out why IMO
(traditional) languages based on tokens should have a tokenizer in the
first place.

We encountered a problem with scannerless parsers and traditional
programming languages in Meta§. A tokenizer can use a couple of
whitespace or control characters or tokens as token separators. Without
such a rule it's very hard to flag all places in a grammar where
whitespace is *required* as a token separator.

As an example: in an "else if" sequence whitespace is required between
"else" and "if" while in other context e.g. "else{" whitespace is not
required. This problem may be solved by a regex, but why should a
grammar be inflated by adding a token termination clause to *each* keyword?

Fortran is another example where keyword recognition requires
backtracking or similar means due to ignorance of spaces e.g. in the
well known "DO10I = 1," snippet.

DoDi
[A separate lexer also makes it a lot easier to skip comments. For Fortran
you had to do a prepass to see whether a statement was an assignment or something
else but after that you could tokenize without backtracking. -John]

The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-037@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=452&group=comp.compilers#452

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Sat, 11 Jun 2022 23:45:26 +0300
Organization: Compilers Central
Lines: 48
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-037@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="10937"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse, design
Posted-Date: 11 Jun 2022 19:26:13 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

by: Christopher F Clark - Sat, 11 Jun 2022 20:45 UTC

Since, Terence borrowed that idea from our version of Yacc++, I feel
qualified to answer. (And we in turn borrowed plenty of ideas from
ANTLR, so it's all fair.)

First, they aren't really merged. The Scannerless parser people often
merge the idea, but ANTLR still has them separate and generates a
lexer and a parser as two separate entities and there are slight
differences between them (as there should be). And, you can actually
do them as separate files (and compile them separately) if you want.

All that happens is that you have the two parts using roughly the same
notation (and if you don't need the lexer specific features, it is
essentially a subset of the parsing language, and vice-versa). So,
you learn that notation and you know it for both parts.

Moreover, by combining the two parts in one file, you know that the
parts "go together" and you have less problems with mismatches,
especially not the kind where you update one but then have an "old"
version of the other which doesn't quite match. It also allows you to
introduce "tokens" (especially "keywords") in the parser. (Note you
lose this if you compile the parts separately.)

A slightly more advanced version of that allows you to have tokens
that are only matched in certain contexts. ANTLR has some of this
implemented, so if you have a place in your parser where you want to
match ">" but not worry about ">>", you can use the literal token in
the grammar and it will override the longest match rule, provided it
doesn't introduce a conflict. (You also lose this feature with
separate compilation.)

So, note that by keeping the implementations separate (they are really
two phases), you have kept item 1. Your parser never sees whitepace
and comments, unless you want it to. You can still do 2 with the
implementation. I don't know whether the ANTLR generated lexer does
so (or not). Since ANTLR is Unicode based, 3 is not an issue.

All of this is why we merged the two files into one in Yacc++ and
ANLTR started doing the same thing. We also merged them because our
original target was going to be building a syntax directed editor
version of Emacs and these ideas made sense in that regard.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-040@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=454&group=comp.compilers#454

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Sat, 11 Jun 2022 18:15:06 -0400
Organization: A noiseless patient Spider
Lines: 24
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-040@comp.compilers>
References: <22-06-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="13702"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 11 Jun 2022 19:38:42 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

by: George Neuner - Sat, 11 Jun 2022 22:15 UTC

On Thu, 9 Jun 2022 14:52:59 +0000, Roger L Costello
<costello@mitre.org> wrote:

>[elided quote] seem like compelling reasons for separating the lexical analysis
>from parsing, so why does ANTLR not do so; i.e., why does ANTLR combine them?
>
>/Roger
>
>[1] Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.

Technically ANTLR does /not/ combine them ... it generates separate
lexer and parser.

What ANTLR /does/ do is allow specifying both the parser and lexer in
a single input file. It also can recognize textual constants within
the parser specification and generate lexer rules for them.

Note that Yacc and Bison also recognize textual constants in the parser
grammar and generate a token id for the (seperately specified) lexer
to return. ANTLR goes further only because it can: ANTLR can create a
lexer whereas Yacc and Bison cannot.

George

Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?

<22-06-042@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=455&group=comp.compilers#455

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: The dragon book says separating lexical analysis and parsing is beneficial, so why doesn't ANTLR separate them?
Date: Sun, 12 Jun 2022 14:10:43 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 39
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-042@comp.compilers>
References: <22-06-023@comp.compilers> <22-06-040@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="31532"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 12 Jun 2022 14:25:47 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

by: Anton Ertl - Sun, 12 Jun 2022 14:10 UTC

George Neuner <gneuner2@comcast.net> writes:
>Note that Yacc and Bison also recognize textual constants in the parser
>grammar and generate a token id for the (seperately specified) lexer
>to return.

If you have a rule in yacc/bison:

E: T '+' T

the "token id" for '+' is the ASCII-code of +. Bison generates token
ids only for tokens defined with %token. So if you instead write

E: T PLUS T

you have to define

%token PLUS

and the value of PLUS is communicated to the scanner through the
..tab.h file.

Also note that for the last version of yacc that I have seen
documentation for, if you have a rule

S: L ":=" E

there is no token for ":=", but instead what you get is equivalent to

S: L ':' '=' E

Bison is more capable, you can, e.g., define

%token BECOMES ":="

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Subject	Author
The dragon book says separating lexical analysis and parsing is beneficial, so w	Roger L Costello
Re: The dragon book says separating lexical analysis and parsing is beneficial,	Alexei A. Frounze
Re: The dragon book says separating lexical analysis and parsing is beneficial,	gah4
Re: The dragon book says separating lexical analysis and parsing is beneficial,	Hans-Peter Diettrich
The dragon book says separating lexical analysis and parsing is beneficial, so w	Christopher F Clark
Re: The dragon book says separating lexical analysis and parsing is beneficial,	George Neuner
Re: The dragon book says separating lexical analysis and parsing is beneficial,	Anton Ertl