Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

"No matter where you go, there you are..." -- Buckaroo Banzai


devel / comp.compilers / Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?

SubjectAuthor
* Why don't compiler writers adhere to the dragon book recommendation of one lexerRoger L Costello
+* Re: Why don't compiler writers adhere to the dragon book recommendation of one lgah4
|+- Re: Why don't compiler writers adhere to the dragon book recommendation of one lKaz Kylheku
|`- Re: Why don't compiler writers adhere to the dragon book recommendation of one lHans-Peter Diettrich
`- One lexer rule for keywords and identifiersChristopher F Clark

1
Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?

<22-06-075@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=479&group=comp.compilers#479

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?
Date: Sat, 25 Jun 2022 12:58:25 +0000
Organization: Compilers Central
Lines: 44
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-075@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="78285"; mail-complaints-to="abuse@iecc.com"
Keywords: history, practice, comment
Posted-Date: 25 Jun 2022 12:41:29 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Thread-Topic: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?
Thread-Index: AdiIkmeY/cQZAc/NR0C+NL7PLIp4Bw==
 by: Roger L Costello - Sat, 25 Jun 2022 12:58 UTC

Hi Folks,

Page 101-102 of the dragon book recommends having one lexer rule for both
keyword and identifiers (symtab = symbol table):

[a-zA-Z][a-zA-Z0-9]* { Action: check symtab for the lexeme.
If the symtab says the lexeme
is a keyword, return the keyword.
Else, if the symtab says the lexeme
is an existing identifier, return ID
and a pointer to the symtab entry.
Else, add the identifier to the symtab
and return ID and a pointer to the
new symtab entry. }

The book says the advantages of this approach are:

1. Easy to add new keywords. Simply initialize the symtab with the new
keywords. No changes to the lexer rules.

2. Smaller transition diagram. Instead of several hundred states, there are
less than a hundred states in the transition diagram.

Those seem like pretty compelling reasons for having one lexer rule for both
keywords and identifiers.

So why don't compiler (lexer) writers follow that recommendation?

Consider:

Page 86-90 of the book "flex & bison" shows the lexer for MySQL. The lexer has
a rule for every keyword.

Pag 228 of "Introduction to Compiling Techniques" show a lexer for VSL (Very
Simple Language). Again, the lexer has a rule for every keyword.

/Roger
[I think the answer to a lot of these questions contains the phrase "64K PDP-11." In
the 1970s a thousand state DFA took a lot of memory. Today it's barely a rounding error.
In flex&bison the main reason I did it the way I did is that it's a pedagogical example
but it also meant the symbol table didn't need special cases for keywords. There is also
the PL/I situation where keywords are only keywords in certain contexts so you can say
IF IF = GOTO; THEN ELSE = BEGIN; ELSE END = IF;
-John]

Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?

<22-06-079@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=482&group=comp.compilers#482

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?
Date: Sat, 25 Jun 2022 13:01:40 -0700 (PDT)
Organization: Compilers Central
Lines: 35
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-079@comp.compilers>
References: <22-06-075@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="49113"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history
Posted-Date: 25 Jun 2022 20:47:47 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: gah4 - Sat, 25 Jun 2022 20:01 UTC

On Saturday, June 25, 2022 at 9:41:32 AM UTC-7, Roger L Costello wrote:

(snip)

> Page 101-102 of the dragon book recommends having one lexer rule for both
> keyword and identifiers (symtab = symbol table):

(and our moderator says)
> [I think the answer to a lot of these questions contains the phrase "64K PDP-11." ...]

The lost art of small memory compilers.

Before the PDP-11 there were big computers like the IBM 704, where Fortran
originated, and when core was $1/bit or more. (The 704 with core was an
upgrade from the 701, using CRTs for main memory.) Big ones had 32K
words, but I think the Fortran compiler ran in 8K or 16K words.

After big computers got bigger, then we had minicomputers like the PDP-11
and Data General Nova and Eclipse, and some others, with 32K or so bytes.

And when minicomputers got bigger, everything happened again with
microcomputers, and again compilers had to fit. I do remember swapping
floppy disks for passes of a Fortran and Pascal compiler for MS-DOS 2.0.

Some years ago, the Hercules group was trying to get gcc running
on an emulated IBM S/370 running MVS, with an 8M region.
(Out of the 16M byte address space, MVS takes up about half.)
But you can't run gcc in 8M bytes.

When I remember S/370 and OS/VS2, the usual region was 300K,
which we thought was big.

And now, we can barely run a system with 4G main memory,
such as the Macbook Air that I am writing this on.

One lexer rule for keywords and identifiers

<22-06-080@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=483&group=comp.compilers#483

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: One lexer rule for keywords and identifiers
Date: Sun, 26 Jun 2022 00:09:46 +0300
Organization: Compilers Central
Lines: 63
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-080@comp.compilers>
References: <22-06-075@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="49447"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 25 Jun 2022 20:48:13 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Sat, 25 Jun 2022 21:09 UTC

Two parts to my answer.

First, if they are using the Yacc++ my partner and I wrote, they are
following that rule. Why? Because we built it into the product.

You simply declare keywords like this:

keyword if, then, else, begin, end, for, while, do, defun, let;
substring keyword subroutine, function, procedure; // these match any
unique substrings.
sensitive keyword EoF; // only accepts that exact spelling (not eof or EOF)
keyword amp "&amp;" // keywords that are not strictly alphanumeric
inline keyword ge ".ge.", le ".le.", eq ".eq."; // don't put these in the
symbol table but instead make rules for them...
....

It also incorporates the C-typedef hack if you want that. It also deals
with the PL/I issue of context-sensitive keywords, where the word sometimes
is and sometimes is not a keyword. . All of this is built-in.

It does that by using the idea from Frank Deremer of separating the scanner
from the screener. The keyword processing is done in the screener (using
the symbol table), and because it uses the symbol table, the parser can add
(or remove) entries from being keywords--although that's not the technique
used for PL/I-style context-sensitive ones but is the right approach for
C-style typedefs.

So, I would say the Dragon book is correct. Not because it makes the
tables smaller, but because it says cognitive space. You don't want to be
reinventing the wheel for all those variations. You just want to declare
what you mean and have the tool do the "right" thing.

-----

Now, the reverse side of the coin. ANTLR's lexer may be separate from the
parser, but doesn't have the concept of a screener. You cannot get between
the lexer and the parser to do "magic". Thus, you have to use explicit
keyword tokens and if you want case insensitive ones, use regular
expressions (character classes) to specify them, same thing for substring
keywords (and it doesn't warn you if your substrings are ambiguous).

So, while I normally laud Terence for his work on ANTLR. He gets far more
right than wrong. This is a case where he got it wrong. (BTW, JavaCC
which borrows heavily from ANTLRs implementation, suffers the same way).

Now, being fair, ANTLR4 does have a unique feature in that regard in that
if you write a keyword or any token (in quotes) in your grammar, it is
processed only when that rule is being recognized. That's part of how
ANTLR4 is more PEG-like. That is a different solution to the PL/I keywords
as identifiers issue than what we support. And as far as I can tell, it
even overrides the longest-match problem. Thus, if you have a place in
your grammar where you want only ">" to be recognized and not ">>", simply
put ">" in your parser at that location.

--
******************************************************************************

Chris Clark email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?

<22-06-082@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=484&group=comp.compilers#484

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?
Date: Sun, 26 Jun 2022 06:17:49 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-082@comp.compilers>
References: <22-06-075@comp.compilers> <22-06-079@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="46227"; mail-complaints-to="abuse@iecc.com"
Keywords: storage, history, comment
Posted-Date: 26 Jun 2022 13:44:13 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Sun, 26 Jun 2022 06:17 UTC

On 2022-06-25, gah4 <gah4@u.washington.edu> wrote:
> Some years ago, the Hercules group was trying to get gcc running
> on an emulated IBM S/370 running MVS, with an 8M region.
> (Out of the 16M byte address space, MVS takes up about half.)
> But you can't run gcc in 8M bytes.
>
> When I remember S/370 and OS/VS2, the usual region was 300K,
> which we thought was big.
>
> And now, we can barely run a system with 4G main memory,
> such as the Macbook Air that I am writing this on.

My TXR Lisp hits a peak memory footprint of around 17 megabytes during
the build of its compiler and standard library (on a 32 bit GNU/Linux
system). If the CONFIG_SMALL_MEM build time option is used, it can get
down to 11. That's a total footprint including all the code and data
areas, such as mappings for the C library (which is a pig nowadays), as
reported by common tools like top. (CONFIG_SMALL_MEM just adjusts the
sizes of some static arrays used by the garbage collector and sets
certain thresholds differently.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[I think we should stop here before I start telling you what we did
on a 4K PDP-8. -John]

Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?

<22-06-084@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=485&group=comp.compilers#485

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: Why don't compiler writers adhere to the dragon book recommendation of one lexer rule for keywords and identifiers?
Date: Sun, 26 Jun 2022 10:20:25 +0200
Organization: Compilers Central
Lines: 10
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-084@comp.compilers>
References: <22-06-075@comp.compilers> <22-06-079@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="84561"; mail-complaints-to="abuse@iecc.com"
Keywords: storage
Posted-Date: 26 Jun 2022 17:35:20 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-079@comp.compilers>
Content-Language: de-DE
X-ZEDAT-Hint: R
 by: Hans-Peter Diettrich - Sun, 26 Jun 2022 08:20 UTC

On 6/25/22 10:01 PM, gah4 wrote:

> And now, we can barely run a system with 4G main memory,
> such as the Macbook Air that I am writing this on.

Squeeze your system and run WinXP in a 1G VM ;-)

DoDi

P.S.: Of course Linux were a better choice ;-)

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor