Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

Heisenberg might have been here.


devel / comp.compilers / Re: Learning only one lexer made me blind to its hidden assumptions

SubjectAuthor
* Learning only one lexer made me blind to its hidden assumptionsRoger L Costello
+* Re: Learning only one lexer made me blind to its hidden assumptionsluser droog
|`- Re: Learning only one lexer made me blind to its hidden assumptionsJuan Miguel Vilar Torres
+- Re: Learning only one lexer made me blind to its hidden assumptionsEv. Drikos
+* Re: Learning only one lexer made me blind to its hidden assumptionsantispam
|`* Re: Learning only one lexer made me blind to its hidden assumptionsGeorge Neuner
| `- Re: Learning only one lexer made me blind to its hidden assumptionsantispam
`- Re: Learning only one lexer made me blind to its hidden assumptionsKaz Kylheku

1
Learning only one lexer made me blind to its hidden assumptions

<22-07-006@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=493&group=comp.compilers#493

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: Learning only one lexer made me blind to its hidden assumptions
Date: Thu, 7 Jul 2022 17:49:44 +0000
Organization: Compilers Central
Lines: 55
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="18181"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, question, comment
Posted-Date: 11 Jul 2022 20:26:04 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Roger L Costello - Thu, 7 Jul 2022 17:49 UTC

Hi Folks,

For months I have been immersed in learning and using Flex. Great fun indeed.

But recently I have been reading a book, Crafting a Compiler with C, and
reading its chapter on lexers. The chapter describes two lexer-generators:
ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
assumptions in Lex/Flex. Without learning ScanGen I would have continued to
think that the way things are done in Lex/Flex way is the only way.

Below I have documented some of the differences between Lex/Flex and ScanGen.

Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct'
regex. Flex has rules for picking the correct one: longest match wins, regex
listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex
and then, if needed, you create "Except" clauses. E.g., the token is an
Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls
it 'catenation' not 'concatenation'

Difference:
- Flex regexes use | for specifying alteration in regexes
- ScanGen uses ',' to specify alternation

Difference:
- With Flex, tossing out characters (e.g., toss out the quotes surrounding a
string) may involve writing C code to reprocess the token
- ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No
token reprocessing needed

Difference:
Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except a
and b
ScanGen regexes uses 'Not', e.g., Not(Quote)

Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use
of character classes decreases (quite significantly) the size of the
transition table

Difference:
- Flex regexes use the ? meta-symbol
- ScanGen doesn't have that. Instead, it has 'Epsilon'

Difference:
- ScanGen has something called a Major number and a Minor number for each
token
- Flex doesn't have that concept
[For the same reason, I don't think it's a good idea to learn only one programming langage. -John]

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-007@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=494&group=comp.compilers#494

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: luser.dr...@gmail.com (luser droog)
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Tue, 12 Jul 2022 19:49:31 -0700 (PDT)
Organization: Compilers Central
Lines: 33
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-007@comp.compilers>
References: <22-07-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="82964"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 12 Jul 2022 23:25:36 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-07-006@comp.compilers>
 by: luser droog - Wed, 13 Jul 2022 02:49 UTC

On Monday, July 11, 2022 at 7:26:08 PM UTC-5, Roger L Costello wrote:
> Hi Folks,
>
> For months I have been immersed in learning and using Flex. Great fun indeed.
>
> But recently I have been reading a book, Crafting a Compiler with C, and
> reading its chapter on lexers. The chapter describes two lexer-generators:
> ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
> assumptions in Lex/Flex. Without learning ScanGen I would have continued to
> think that the way things are done in Lex/Flex way is the only way.
>
> Below I have documented some of the differences between Lex/Flex and ScanGen.
[snip]
> Difference:
> - Flex regexes use juxtaposition for specifying concatenation.
> - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls
> it 'catenation' not 'concatenation'

I think this difference in word choice has possibly some etymological significance.
Both word come from "catenary" which is the shape a rope or cord makes when
you drape it over some spokes or frames or hooks or whatever. So, to *catenate*
is to hoist the string or rope up onto some hooks or poles so it makes that
dangling *garland* kind of curve. So, it's focused on the *rope* as an entity.

*Concatenate* adds the prefix "con" meaning "with". I interpret this as embellishing
the rope with beads or light bulbs or something. So now we're stringing up
a bunch of beads *together*, focusing on the hanging objects.

The original APL book uses "catenate" in a way that I think is consistent with
my interpretation here. But I could also be wrong. I have not actually researched
this beyond having run into it a few times and attempted to come up with a
plausible reason.
[Lots of people agree with that etymology. Where do you think the Unix "cat" command came from? -John]

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-008@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=495&group=comp.compilers#495

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: jvi...@uji.es (Juan Miguel Vilar Torres)
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Wed, 13 Jul 2022 01:46:17 -0700 (PDT)
Organization: Compilers Central
Lines: 29
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-008@comp.compilers>
References: <22-07-006@comp.compilers> <22-07-007@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="78102"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, history, comment
Posted-Date: 13 Jul 2022 11:22:43 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-07-007@comp.compilers>
 by: Juan Miguel Vilar To - Wed, 13 Jul 2022 08:46 UTC

El miércoles, 13 de julio de 2022 a las 5:25:39 UTC+2, luser droog
escribió:
> I think this difference in word choice has possibly some etymological
significance.
> Both word come from "catenary" which is the shape a rope or cord makes when
> you drape it over some spokes or frames or hooks or whatever. So, to
*catenate*
> is to hoist the string or rope up onto some hooks or poles so it makes that
> dangling *garland* kind of curve. So, it's focused on the *rope* as an
entity.
>
> *Concatenate* adds the prefix "con" meaning "with". I interpret this as
embellishing
> the rope with beads or light bulbs or something. So now we're stringing up
> a bunch of beads *together*, focusing on the hanging objects.
>
> [Lots of people agree with that etymology. Where do you think the Unix "cat"
command came from? -John]

Not wanting be pedantic, but according to Merrian Webster, catenate comes from
"Latin catenatus, past participle of catenare, from catena"; concatenare from
"Middle English, from Late Latin concatenatus, past participle of concatenare
to link together, from Latin com- + catena chain "; and catenary from "New
Latin catenaria, from Latin, feminine of catenarius of a chain, from catena".
So, the three words derive ultimately from catena (chain in Latin) and
catenary is not in the root of the others. Moreover, Merrian Webster dates in
circa 1623 the first known use of catenate, in 1598 the first known use of
concatenate, and in 1788 the first use of catenary.
[Thanks. I think we're done with this thread. -John]

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-009@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=496&group=comp.compilers#496

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: driko...@gmail.com (Ev. Drikos)
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Wed, 13 Jul 2022 14:58:50 +0300
Organization: Aioe.org NNTP Server
Lines: 41
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-009@comp.compilers>
References: <22-07-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="78718"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, history
Posted-Date: 13 Jul 2022 11:23:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Ev. Drikos - Wed, 13 Jul 2022 11:58 UTC

On 07/07/2022 20:49, Roger L Costello wrote:
> ...

> Difference:
> - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
> regex. Flex has rules for picking the correct one: longest match wins, regex
> listed first wins.
> - ScanGen does not allow overlapping regexes. Instead, you create one regex
> and then, if needed, you create "Except" clauses. E.g., the token is an
> Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
>
> ...

As you can imagine there are many such options. A DFA builder may have
options a) to behave as Flex b) to treat only some tokens as reserved,
others as non reserved and c) to allow you examine shorter matches.

Who knows what else there is out there! (I don't claim to be an expert)

> Difference:
> - Flex deals with individual characters
> - ScanGen lumps characters into character classes and deals with classes. Use
> of character classes decreases (quite significantly) the size of the
> transition table
>

FYI, there is also a related controversial issue that may fire flames!

Bison also doesn't support character classes and this could be a reason
that scannerless parsing sounds weird to several people. Of course one
may use Bison down to the character level, but with many more states.

Also, if the grammar allows two consecutive identifiers, a lookahead
operator is likely necessary. (admittedly, a better alternative to
scannerless parsing may be different start states as supported by Flex).

When I played in the past with a scannerless GRL parser for SQL I hadn't
seen dramatic runtime slow downs with a few single/multi line commands.
Yet, I wouldn't try (or suggest) such an approach for XML processing.

> ...

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-010@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=497&group=comp.compilers#497

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: antis...@math.uni.wroc.pl
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Wed, 13 Jul 2022 19:52:45 -0000 (UTC)
Organization: Aioe.org NNTP Server
Lines: 23
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-010@comp.compilers>
References: <22-07-006@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="30201"; mail-complaints-to="abuse@iecc.com"
Keywords: flex
Posted-Date: 13 Jul 2022 15:59:03 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: antis...@math.uni.wroc.pl - Wed, 13 Jul 2022 19:52 UTC

Roger L Costello <costello@mitre.org> wrote:
> Below I have documented some of the differences between Lex/Flex and ScanGen.
<snip>

> Difference:
> - Flex deals with individual characters
> - ScanGen lumps characters into character classes and deals with classes. Use
> of character classes decreases (quite significantly) the size of the
> transition table

Hmm, from flex manual:

: -Ce, --ecs
: construct equivalence classes
: : -Cm, --meta-ecs
: construct meta-equivalence classes

If you want smaller tables use options above and flex DFA will
work on character classes.

--
Waldek Hebisch

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-017@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=501&group=comp.compilers#501

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Thu, 14 Jul 2022 16:46:36 -0400
Organization: A noiseless patient Spider
Lines: 22
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-017@comp.compilers>
References: <22-07-006@comp.compilers> <22-07-010@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="49458"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 14 Jul 2022 22:18:15 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: George Neuner - Thu, 14 Jul 2022 20:46 UTC

On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
wrote:

>Hmm, from flex manual:
>
>: -Ce, --ecs
>: construct equivalence classes
>:
>: -Cm, --meta-ecs
>: construct meta-equivalence classes
>
>If you want smaller tables use options above and flex DFA will
>work on character classes.

But note that Flex /may/ run considerably slower if you make heavy use
of equivalence classes. IIRC, that results in (moral equivalent of)
NFA rather than DFA.

George
[On modern computers it's hard to imagine a scanner so big that the space
savings from those two options are worth it. 64K PDP-11 and all that. -John]

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-022@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=504&group=comp.compilers#504

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Fri, 15 Jul 2022 14:16:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 137
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-022@comp.compilers>
References: <22-07-006@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="72062"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 15 Jul 2022 12:26:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Fri, 15 Jul 2022 14:16 UTC

On 2022-07-07, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> For months I have been immersed in learning and using Flex. Great fun indeed.
>
> But recently I have been reading a book, Crafting a Compiler with C, and
> reading its chapter on lexers. The chapter describes two lexer-generators:
> ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
> assumptions in Lex/Flex. Without learning ScanGen I would have continued to
> think that the way things are done in Lex/Flex way is the only way.

It seems hard to find information about ScanGen. I found someone's
slides claim that it's something written by a Gary Sevitsky in 1979,
and then worked on over the next couple of years by Robert Gray
and Charles Fischer.

By "learning ScanGen" do you mean that you have an implementation
and have created some scanners?

> Below I have documented some of the differences between Lex/Flex and ScanGen.
>
> Difference:
> - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
> regex. Flex has rules for picking the correct one: longest match wins, regex
> listed first wins.
> - ScanGen does not allow overlapping regexes. Instead, you create one regex
> and then, if needed, you create "Except" clauses. E.g., the token is an
> Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
>
> Difference:
> - Flex regexes use juxtaposition for specifying concatenation.
> - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls

That's verbose. Juxtaposition is used for catenation so that you can
simply use "foo" if you want to match the text "foo".

It behooves pattern matching notations to look like the problem space;
i.e. that expression which just match single terms rather than sets
just look like the terms they are matching.

> it 'catenation' not 'concatenation'

Careful speaker and writers of the English language avoid the
"concatenate" pleonasm.

"catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by
a chain supported horizontally at its endpoints, under gravity"). To
catenate literally means to chain; the "con" semantics is implicit,
since chaining is always "together".

The Unix command is called "cat" not "concat"; the function is "strcat"
not "strconcat".

If you could think of a way to chain things apart, that could be
grounds for a functional prefix. For instance, chaining a pair of
agrressive dogs to opposite walls of a yard so they cannot get at each
other could plausibly be called "discatenation". :)

> Difference:
> - Flex regexes use | for specifying alteration in regexes
> - ScanGen uses ',' to specify alternation

That's just gratuitously weird. Commas are widely used for separating
groups in sequences: domain names, object member access,
date.month.year, integer.fraction, filename.suffix, ...

The | character is used in BNF.

I can see that in 1979, stringent character set limitations would still
have to be taken into account when developing something; there would
still be reasons to stick to a six bit character set.

> Difference:
> - With Flex, tossing out characters (e.g., toss out the quotes surrounding a
> string) may involve writing C code to reprocess the token
> - ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No
> token reprocessing needed

That C code is just:

char *interior = yytext + 1;
yytext[yyleng - 1] = 0;

You could write a library a function like this, which you can reuse in
future lex jobs:

/* chop "front" characters from front, "back" characters
from back, of yytext, and return pointer to chopped
lexeme. */

char *yytrim(int front, int back)
{
assert (front >= 0);
assert (back >= 0);
assert (front <= yyleng);
assert (back <= yyleng);

{
char *f = yytext + front;
char *b = yytext + yyleng - back;
if (b < f)
b = f;
*b = 0;
return f;
}
}

If the string literal syntax supports the escaping of quotes, and other
features, stripping the quotes becomes insufficient.

There are plenty of situations in which it is handy to strip something
from either end of a lexeme, though; the above function would be useful
in the "-ll" lex library.

> Difference:
> - Flex deals with individual characters
> - ScanGen lumps characters into character classes and deals with classes. Use
> of character classes decreases (quite significantly) the size of the
> transition table

Obviously, Lex has explicit character classes, like [cxb]: the class
containing 'c', 'x' and 'b'.

Even without explicit classes in the syntax, the DFA subset constrution
implicitly identifies character classes. A regex like (c|x|b) should not
generate more states than [cxb].

Subset construction will squeeze out things like, say (adcdx|abcdy),
which should produce the same DFA as a[bd]cd[xy].

A state in a DFA can have many transitions leading out of it, triggered
by different input symbols: that constitutes a character class.
This class is essentially discovered by the algorithm.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Re: Learning only one lexer made me blind to its hidden assumptions

<22-07-027@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=507&group=comp.compilers#507

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: antis...@math.uni.wroc.pl
Newsgroups: comp.compilers
Subject: Re: Learning only one lexer made me blind to its hidden assumptions
Date: Fri, 15 Jul 2022 20:14:10 -0000 (UTC)
Organization: Aioe.org NNTP Server
Lines: 46
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-07-027@comp.compilers>
References: <22-07-006@comp.compilers> <22-07-010@comp.compilers> <22-07-017@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="93272"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 15 Jul 2022 16:36:32 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: antis...@math.uni.wroc.pl - Fri, 15 Jul 2022 20:14 UTC

George Neuner <gneuner2@comcast.net> wrote:
> On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
> wrote:
>
> >Hmm, from flex manual:
> >
> >: -Ce, --ecs
> >: construct equivalence classes
> >:
> >: -Cm, --meta-ecs
> >: construct meta-equivalence classes
> >
> >If you want smaller tables use options above and flex DFA will
> >work on character classes.
>
> But note that Flex /may/ run considerably slower if you make heavy use
> of equivalence classes. IIRC, that results in (moral equivalent of)
> NFA rather than DFA.

IIUC equivalence classes are reasonably cheap: single extra
array access per input character. Meta-equivalence are more
expensive.

If I wanted ultimate speed I probably would use hand-written
scanner, they tend to be significantly faster than anything
flex can generate. But in most cases flex scanner speed is
adequate.

> [On modern computers it's hard to imagine a scanner so big that the space
> savings from those two options are worth it. 64K PDP-11 and all that. -John]

Well, L1 cache on many processors is just 16K. L2 caches are
bigger, but it is not hard to imagince scanner which wastes
a lot of time due to L2 misses. If smaller tables can fit
in L2, than there may be net speed gain. Even if there is
speed loss on some machines smaller tables are likely to
lead to more predictable/uniform performance.

Also, if scanner if fast enough with smaller tables, then
using bigger tables is just waste. Of course it does not
make sense to spent a lot of effort eliminating small waste,
but with flex effort is just an extra command line switch
to flex.

--
Waldek Hebisch

1
server_pubkey.txt

rocksolid light 0.9.7
clearnet tor