comp.compilers / State-of-the-art algorithms for lexical analysis?

Subject -- Author
* State-of-the-art algorithms for lexical analysis? -- Roger L Costello
`* Re: State-of-the-art algorithms for lexical analysis? -- gah4
 +* Re: State-of-the-art algorithms for lexical analysis? -- Hans-Peter Diettrich
 |`* State-of-the-art algorithms for lexical analysis? -- Christopher F Clark
 | `* Re: State-of-the-art algorithms for lexical analysis? -- Hans-Peter Diettrich
 |  `* Re: State-of-the-art algorithms for lexical analysis? -- Christopher F Clark
 |   `* Re: State-of-the-art algorithms for lexical analysis? -- Hans-Peter Diettrich
 |    `* Re: counted strings, was State-of-the-art algorithms for lexical analysis? -- gah4
 |     `* Re: counted characters in strings -- Robin Vowels
 |      +- Re: counted characters in strings -- Martin Ward
 |      `- Re: counted characters in strings -- Dennis Boone
 +- Re: State-of-the-art algorithms for lexical analysis? -- Kaz Kylheku
 `- References for PSL ? -- Christopher F Clark

State-of-the-art algorithms for lexical analysis?

<22-06-006@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=427&group=comp.compilers#427

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: State-of-the-art algorithms for lexical analysis?
Date: Sun, 5 Jun 2022 20:53:47 +0000
Organization: Compilers Central
Lines: 21
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="16989"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, question, comment
Posted-Date: 05 Jun 2022 17:08:08 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Accept-Language: en-US
Content-Language: en-US
 by: Roger L Costello - Sun, 5 Jun 2022 20:53 UTC

Hi Folks,

Is there a list of algorithms used in lexical analysis?

Are regular expressions still the best way to specify tokens?

Is creating a finite automaton for regular expressions the state of the art?

What is the state-of-the-art algorithm for generating a finite automaton?

What is the state-of-the-art algorithm for finding holes in the set of regex
patterns?

What are the state-of-the-art algorithms for lexical analysis?

If you were to build a lexer-generator tool today, in 2022, what
state-of-the-art algorithms would you use?

/Roger
[I doubt it. Yes. If you mean a DFA, yes. Same as it was 40 years ago. ...
-John]

Re: State-of-the-art algorithms for lexical analysis?

<22-06-007@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=428&group=comp.compilers#428

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Sun, 5 Jun 2022 16:05:38 -0700 (PDT)
Organization: Compilers Central
Lines: 35
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-007@comp.compilers>
References: <22-06-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="46525"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 05 Jun 2022 21:11:12 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-006@comp.compilers>
 by: gah4 - Sun, 5 Jun 2022 23:05 UTC

On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:

(snip)

> Are regular expressions still the best way to specify tokens?

Some years ago, I used to work with a company that sold hardware
search processors to a certain three letter agency that we are not
supposed to mention, but everyone knows.

It had a completely different PSL (Pattern Specification Language),
much more powerful than the usual regular expressions.

Both standard and extended regular expressions are nice, in that we
get used to using them, especially with grep, without thinking too
much about them.

I suspect, though, that if they hadn't previously been defined, we
might come up with something different today.

Among other features, PSL has the ability to define approximate matches,
such as a word with one or more misspellings, that is, insertions,
deletions, or substitutions. Usual REs don't have that ability.

There are also PSL expressions for ranges of numbers.
You can often do that with very complicated REs, considering
all of the possibilities. PSL automatically processes those
possibilities. (Some can expand to complicated code.)

I suspect that in many cases the usual REs are not optimal for
lexical analysis, other than being well known.

But as noted, DFAs are likely the best way to do them.

Though that could change with changes in computer hardware.

Re: State-of-the-art algorithms for lexical analysis?

<22-06-008@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=429&group=comp.compilers#429

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Mon, 6 Jun 2022 08:59:47 +0200
Organization: Compilers Central
Lines: 54
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-008@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="35996"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 06 Jun 2022 11:04:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-007@comp.compilers>
 by: Hans-Peter Diettrich - Mon, 6 Jun 2022 06:59 UTC

On 6/6/22 1:05 AM, gah4 wrote:

> It had a completely different PSL (Pattern Specification Language),
> much more powerful than the usual regular expressions.

I wonder about the need for powerful patterns in programming languages.
Most items (operators, punctuators, keywords) are fixed literals with a
fixed ID for use by the parser and code generator. If source code is
written by humans, then the remaining types (identifiers, literals,
comments) should not have too complicated a syntax. For machine-generated
source code a lexer is not required; the generator can immediately
produce the tokens for the parser. And if humans are to understand the
code produced by the generator, then again the syntax has to be as
simple and as easily understandable as possible.

> Among other features, PSL has the ability to define approximate matches,
> such as a word with one or more misspellings, that is, insertions,
> deletions, or substitutions. Usual REs don't have that ability.

That's fine for keywords but does not help with user-defined
identifiers. Still a nice-to-have feature :-)

> There are also PSL expressions for ranges of numbers.
> You can often do that with very complicated REs, considering
> all of the possibilities. PSL automatically processes those
> possibilities. (Some can expand to complicated code.)

But is this feature really helpful to the user?

> I suspect that in many cases the usual REs are not optimal for
> lexical analysis, other than being well known.
>
> But as noted, DFAs are likely the best way to do them.

ACK

> Though that could change with changes in computer hardware.

Or with the style of writing. APL already tried to simplify typing; in
the near future a Chinese programming language with a glyph for each
token (except literals) would eliminate the need for a lexer. Then a
demand may arise for speech-to-text tools, and their reverse, instead
of a lexer, for each natural language.

DoDi
[Regular expressions have the advantage that once you've paid the one-time cost
of making a DFA, the matching is extremely fast. The lexer is usually
one of the slowest parts of a compiler, since it is the only part that has to
look at each character of the source program, so this is a place where speed
matters. Anyone know how fast PSLs are? I've seen fuzzy matchers but they
haven't been very fast. -John]
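
To make the speed argument concrete, here is a minimal sketch of a
table-driven DFA matcher in Python. The states and transition table are
hypothetical, built for the toy token [0-9]+; the point is that the
per-character work is a single table lookup.

    START, IN_NUM, DEAD = 0, 1, 2

    # delta[state][byte] -> next state; anything not filled in is DEAD.
    delta = [[DEAD] * 256 for _ in range(3)]
    for c in b"0123456789":
        delta[START][c] = IN_NUM
        delta[IN_NUM][c] = IN_NUM

    ACCEPTING = {IN_NUM}

    def longest_match(data: bytes) -> int:
        """Length of the longest prefix of data that the DFA accepts."""
        state, last = START, 0
        for i, c in enumerate(data):
            state = delta[state][c]   # the entire per-character work
            if state == DEAD:
                break
            if state in ACCEPTING:
                last = i + 1
        return last

    assert longest_match(b"1234abc") == 4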

Re: State-of-the-art algorithms for lexical analysis?

<22-06-010@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=431&group=comp.compilers#431

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Mon, 6 Jun 2022 16:00:37 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 139
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-010@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="76736"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 06 Jun 2022 15:55:55 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Mon, 6 Jun 2022 16:00 UTC

On 2022-06-05, gah4 <gah4@u.washington.edu> wrote:
> On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:
>
> (snip)
>
>> Are regular expressions still the best way to specify tokens?
>
> Some years ago, I used to work with a company that sold hardware
> search processors to a certain three letter agency that we are not
> supposed to mention, but everyone knows.
>
> It had a completely different PSL (Pattern Specification Language),
> much more powerful than the usual regular expressions.
>
> Both standard and extended regular expressions are nice, in that we
> get used to using them, especially with grep, without thinking too
> much about them.
>
> I suspect, though, that if they hadn't previously been defined, we
> might come up with something different today.

Whether or not regexes are defined:

- we would still have the concept of a machine with a finite number
of states.

- the result would hold that a machine with a finite number of states
can only recognize certain sets of strings (what we call "regular
languages"), and that those sets can be infinite.

- the observation would still be made that those sets of strings have
certain features, like expressing certain kinds of repetitions,
but not other repetitive patterns such as:
- an arbitrary number N of open parentheses followed by N closing
ones, for any N.

- obvious compressed notations would suggest themselves for expressing
the features of those sets.

- someone would dedicate him or herself to finding the minimal set
of useful operations in the notation which can capture all such
sets (e.g. the same process by which we know that ? and + are not
necessary if we have the Kleene * and branching, because
A+ is just AA*, and A? is (A|)). The Kleene star and branching would
surely be rediscovered.

We would end up with regex under a different name, using different
notations: maybe some other symbol instead of star, perhaps in
a different position, like prefix instead of suffix, or whatever.

> Among other features, PSL has the ability to define approximate matches,
> such as a word with one or more misspellings, that is, insertions,
> deletions, or substitutions. Usual REs don't have that ability.

This may be great for some explorative programming, but doesn't do much
when you're writing a compiler for a very specific, defined language.

Programmers misspell not only the fixed tokens of a language, but also
program-defined identifiers like function names, variables, and types.

Today, when a C compiler says "undeclared identifier `pintf`, did you
mean `printf`?", this is not based on some misspelling support in the
lexical analyzer, and could not reasonably be. First the error is
identified in the ordinary way, and then some algorithm that is entirely
external to parsing is applied to the symbol tables to find identifiers
similar to the undeclared one.
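
That external step is typically an edit-distance scan over the symbol
table. A minimal sketch, with a hypothetical symbol table and distance
threshold:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: insertions, deletions, substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def suggest(name, symbols, max_dist=2):
        """Nearest declared symbol within max_dist edits, or None."""
        scored = sorted((edit_distance(name, s), s) for s in symbols)
        return scored[0][1] if scored and scored[0][0] <= max_dist else None

    print(suggest("pintf", {"printf", "sprintf", "putc"}))  # -> printf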

> There are also PSL expressions for ranges of numbers.
> You can often do that with very complicated REs, considering
> all of the possibilities. PSL automatically processes those
> possibilities. (Some can expand to complicated code.)

But ranges of numbers are regular sets. You can have a macro operator
embedded in a regex language which generates that same code.

For instance for matching the decimal strings 27 to 993, there is a
regex, and a way of calculating that regex.

We know there is a regex because the set of strings { "27", "28", ..., "993" }
is a regular set, being finite. We can form a regex just by combining
those elements with the | branch operator.

We can do something which condenses some of the redundancy like:

9(9(|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7
|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8
|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9
|8|7|6|5|4|3|2|1|0))|8(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)
|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|
0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|
1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|7(9(|9|8|7|6|5|4
|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5
|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6
|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7
|6|5|4|3|2|1|0))|6(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|
9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4
(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)
|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|5(9(|9|8|7|6|5|4|3|2
|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3
|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4
|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5
|4|3|2|1|0))|4(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|
7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|
8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|
9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|3(9(|9|8|7|6|5|4|3|2|1|0
)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1
|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2
|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3
|2|1|0))|2(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|
5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5|4|3|2|1|0)|4(9|8|7|6|5
|4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|4|3|2|1|0)|1(9|8|7|6|5|
4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))|1(9(9|8|7|6|5|4|3|2|1|0)|8(9|8|7|6|
5|4|3|2|1|0)|7(9|8|7|6|5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5
|4|3|2|1|0)|4(9|8|7|6|5|4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|
4|3|2|1|0)|1(9|8|7|6|5|4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))

where we can better notate sequences like (9|8|7|6|5|4|3|2|1|0) as
[0-9].

What I did there was turn those strings into a trie, and then just
transliterate that trie into regex syntax.

(The digits appear in reverse because the trie implementation I'm using
relies on hash tables, and hash tables don't have a specified order; the
actual order observed is an artifact of the hashing function. In modern
systems that function can be perturbed by a randomly generated key for
thwarting hash-table attacks.)

Anyway, that sort of thing being what it is, the mechanism for
generating it can readily be embedded as syntactic sugar in a
regex language, without making it non-regular in any way.
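
A sketch of that construction in Python (a reconstruction of the idea,
not the TXR code used above): enumerate the range, insert each string
into a trie, then transliterate the trie into nested alternations.

    import re

    def trie_insert(trie, word):
        for ch in word:
            trie = trie.setdefault(ch, {})
        trie[""] = {}                  # marker: a match may end here

    def trie_to_regex(trie):
        alts = []
        for ch in sorted(trie):
            if ch == "":
                alts.append("")        # empty alternative, as in (|3|2|1|0)
            else:
                alts.append(ch + trie_to_regex(trie[ch]))
        if alts == [""]:
            return ""                  # leaf: nothing more to match
        return alts[0] if len(alts) == 1 else "(" + "|".join(alts) + ")"

    def range_regex(lo, hi):
        trie = {}
        for n in range(lo, hi + 1):
            trie_insert(trie, str(n))
        return trie_to_regex(trie)

    pat = re.compile(r"\A(?:" + range_regex(27, 993) + r")\Z")
    assert pat.match("27") and pat.match("500") and not pat.match("26")

Run on 27..993 this produces exactly the sort of alternation shown
above, modulo the order of the digits.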

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[To put it another way, the set of strings you can recognize with an
NFA or DFA is the same as the set of strings you can describe with a regex.
A DFA is such an obvious thing that we would have reverse-engineered
regexes from them if Ken Thompson hadn't done it the other way. -John]

References for PSL ?

<22-06-012@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=433&group=comp.compilers#433

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: References for PSL ?
Date: Mon, 6 Jun 2022 20:11:47 +0300
Organization: Compilers Central
Lines: 16
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-012@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="77629"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, question
Posted-Date: 06 Jun 2022 15:58:29 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Mon, 6 Jun 2022 17:11 UTC

Is this the PSL to which you refer?
(Common Pattern Specification Language)

https://aclanthology.org/X98-1004.pdf

Or is it something else with a similar name? Is there a reference on
its specification?

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

State-of-the-art algorithms for lexical analysis?

<22-06-013@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=434&group=comp.compilers#434

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: State-of-the-art algorithms for lexical analysis?
Date: Mon, 6 Jun 2022 21:16:26 +0300
Organization: Compilers Central
Lines: 62
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-013@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="78500"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 06 Jun 2022 16:01:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Mon, 6 Jun 2022 18:16 UTC

As our moderator wisely states:

> Regular expressions have the advantage that once you've paid the one-time cost
> of making a DFA, the matching is extremely fast. The lexer is usually
> one of the slowest parts of a compiler since it is the only part that has to
> look at each character of the source program, so this is a place where speed
> matters.

And, for most cases they really are sufficient, and it really behooves
one to stay within those limits. Why? Because when you get a syntax
error at the lexical level, which is surprisingly frequent unless you
never mistype closing quotes, you get whole sections of your code
misparsed, and rarely does the compiler's error correction help much.
Other single-character errors -- a , for a . , or a missing or extra
( { [ or ] } ) or ; -- have similarly disastrous effects on program
meaning, often not detected until much later.

And, as I mentioned before, having the lexer be simply a scanner and
putting any extra semantics into a separate screener (per Frank
DeRemer's recommendation) makes it all much simpler. You end up with
small state machines with very few states that easily fit in even
small machine caches, or can be turned into circuitry, FPGAs or ASICs
that use minimal numbers of gates. Those things can often run as fast
as you can read the text in. And the screener, being invoked much less
frequently, can do more complex things without imposing a significant
penalty. The screener is essentially running at parser speed and only
looking at "long" tokens, not single- (or double-) character ones.

And sadly, you cannot go very much faster. Too often the transitions
occur at single character boundaries. One is lucky when it is a
two-character sequence and longer sequences terminating a token are
rare enough to be in the measurement noise. I know because I tried to
adapt the Boyer-Moore ideas once (skip and reverse) and found that
they were essentially ineffective for tokenization. They might apply
occasionally in parsing, but that's not as much of a performance hog.

Unless you are interested in dealing with nested comments or something similar,
you don't need a stack in your lexer, and so there is no reason to do LL or LR
parsing. (Yes, we extended our Yacc++ lexer to do LR parsing, but with special
casing so that the stack cost was only there if you had recursive productions,
and it only tracked the start of the recursive production, so that you were
staying in DFA mode essentially all the time. And, while that helped us in a
few cases, it isn't something I would say was important, nor would I recommend
it.) The only place I might have found it interesting is if we had made it
recognize tokens inside of strings or comments, for use in error correction to
help with the missing-close-character cases. That might have made it
worthwhile. But that would probably have needed to be done only in the
presence of syntax errors with a string or comment in the recent context.
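
To make the nested-comment point concrete: nesting is exactly what a
finite automaton cannot track, but a single counter (the degenerate
stack) suffices. A sketch, assuming Pascal-style (* *) delimiters:

    def skip_nested_comment(src: str, i: int) -> int:
        """src[i:] starts with '(*'; return the index just past the
        matching '*)', honoring nesting."""
        depth = 0
        while i < len(src):
            if src.startswith("(*", i):
                depth += 1
                i += 2
            elif src.startswith("*)", i):
                depth -= 1
                i += 2
                if depth == 0:
                    return i
            else:
                i += 1
        raise SyntaxError("unterminated comment")

    assert skip_nested_comment("(* a (* b *) c *)x", 0) == 17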

In fact, there is only one thing that I have not seen a DFA lexer do that I
think is worth doing at the lexical level (and not via a screener). That is
recognizing tokens that start with a length prefix, e.g. 10Habcdefghij. Such
tokens are common in things like network protocols, and they would be
relatively easy to implement, but I've not seen it done.

Beyond that it is my relatively firm belief that one should almost always
have only simple regular expressions, e.g. that the one for floating point
numbers should be one of the most complex ones. Otherwise you are trying
to do too much in the scanner. And you are asking for trouble when you do.

Kind regards,
Chris

Re: State-of-the-art algorithms for lexical analysis?

<22-06-015@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=436&group=comp.compilers#436

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Tue, 7 Jun 2022 06:52:45 +0200
Organization: Compilers Central
Lines: 27
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-015@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="28212"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 07 Jun 2022 10:51:05 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Hans-Peter Diettrich - Tue, 7 Jun 2022 04:52 UTC

On 6/6/22 8:16 PM, Christopher F Clark wrote:

> In fact, there is only one thing that I have not seen a DFA lexer do that I
> think is worth doing at the lexical level (and not via a screener). That is
> recognizing tokens that start with a length prefix, e.g. 10Habcdefghij. Such
> tokens are common in things like network protocols, and they would be
> relatively easy to implement, but I've not seen it done.

I'm not sure what you mean. The nnH syntax has to be included in the
general number syntax (like 0x... or nnE...).

Or do you mean a token built from the next nn input characters? In that
case both a lower and an upper bound would be interesting, e.g. for
(recognized) identifier length or for distinguishing Unicode codepoint
formats.

> Beyond that it is my relatively firm belief that one should almost always
> have only simple regular expressions, e.g. that the one for floating point
> numbers should be one of the most complex ones. Otherwise you are trying
> to do too much in the scanner. And you are asking for trouble when you do.

ACK

DoDi
[I believe he means Fortran style Hollerith strings, where the number says
how many characters are in the following string. The number is just a count,
not semantically a number in the language. DFAs can't do that other than by
enumerating every possible length. -John]

Re: State-of-the-art algorithms for lexical analysis?

<22-06-019@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=439&group=comp.compilers#439

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Tue, 7 Jun 2022 19:40:11 +0300
Organization: Compilers Central
Lines: 30
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-019@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="50232"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 07 Jun 2022 13:05:09 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Tue, 7 Jun 2022 16:40 UTC

Yes, as our moderator explained. I was talking about things like
FORTRAN Hollerith strings, but more importantly network packets, where
they give the size of the "field" within a packet and then you simply
take that many characters (or bytes or bits or some other quanta) as
the "token". This is quite important for parsing "binary" data. And,
sometimes the numbers are text like I showed but in many protocols the
numbers are "binary" e.g. something like

\xAHabcdefghij, where \xA is a single 8-bit character (octet) whose
bits are "0000 1010" (or maybe four 8-bit characters -- 4 octets --
that represent a 32-bit integer).

And, as our moderator pointed out, this makes a terrible regular
expression, NFA, or DFA, but it is actually quite easy in nearly any
programming language. You read the length in, convert it to an integer,
and then loop reading that many characters from the input and call
that a "token".
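
A sketch of that loop for the text form of the count (the helper name
and error handling are invented; a binary-count variant would read a
fixed number of octets instead):

    def read_hollerith(src: str, i: int):
        """Scan a token like '10Habcdefghij' starting at src[i];
        return (token_text, index_after_token)."""
        j = i
        while j < len(src) and src[j].isdigit():
            j += 1
        if j == i or j >= len(src) or src[j] != "H":
            raise SyntaxError("not a length-prefixed token")
        n = int(src[i:j])   # a count, not semantically a number
        j += 1              # skip the 'H'
        if j + n > len(src):
            raise SyntaxError("truncated token")
        return src[j:j + n], j + n

    text, nxt = read_hollerith("10Habcdefghij,X", 0)
    assert text == "abcdefghij" and nxt == 13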

Kind regards,
Chris

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
[Right. When I was writing Fortran lexers, Hollerith strings were among the
simplest of the kludges I had to use. -John]

Re: State-of-the-art algorithms for lexical analysis?

<22-06-021@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=441&group=comp.compilers#441

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Wed, 8 Jun 2022 05:32:40 +0200
Organization: Compilers Central
Lines: 15
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-021@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers> <22-06-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="64848"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, history
Posted-Date: 09 Jun 2022 12:33:49 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Hans-Peter Diettrich - Wed, 8 Jun 2022 03:32 UTC

On 6/7/22 6:40 PM, Christopher F Clark wrote:

> And, as our moderator pointed out, this makes a terrible regular
> expression, NFA, DFA, but it is actually quite easy in nearly any
> programming language.

Now I know what made me think of Hollerith constants with the "H" :-)

I doubt that Hollerith constants are "quite easy" for humans to use -
how often do you have to check whether you got the right number of
characters when reading or writing such a constant? So the delimited
form of strings is easier to handle for both humans and DFAs, a win-win
situation :-)

DoDi

Re: counted strings, was State-of-the-art algorithms for lexical analysis?

<22-06-025@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=444&group=comp.compilers#444

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: counted strings, was State-of-the-art algorithms for lexical analysis?
Date: Thu, 9 Jun 2022 11:54:14 -0700 (PDT)
Organization: Compilers Central
Lines: 40
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-025@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers> <22-06-019@comp.compilers> <22-06-021@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="4912"; mail-complaints-to="abuse@iecc.com"
Keywords: Fortran, history
Posted-Date: 09 Jun 2022 16:22:15 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-021@comp.compilers>
 by: gah4 - Thu, 9 Jun 2022 18:54 UTC

On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:

(snip)

> Now I know what made me think of Hollerith constants with the "H" :-)

> I doubt that Hollerith constants are "quite easy" for humans to use -
> how often do you have to check whether you got the right number of
> characters when reading or writing such a constant? So the delimited
> form of strings is easier to handle for both humans and DFAs, a win-win
> situation :-)

It definitely seems that way now.

There is a document that Knuth calls "Fortran 0", describing the
Fortran language from before they finished the first compiler,
maybe before they started it.

I never had many of them, but there are plenty of stories about
"Fortran coding forms", with 80 little boxes on each row, for
writing down what you want punched on cards. Then, as the
story goes, someone else would punch them for you. I never had
anyone to punch my cards, though I learned how to use a keypunch
when I was about nine.

In any case, if you write your program on a coding form, with
each character in a little box, it is easy to know how many are
in each H constant.

Even more, Fortran I/O depended on getting things in the right
column until list-directed I/O (the name, as far as I know, borrowed
from PL/I) was added in 1977.

IBM added apostrophe-delimited constants to Fortran IV early
on, but they didn't get into the Fortran 66 standard.

One reason for the early Fortran character set was the characters
available on the 026 keypunch. For B5500 ALGOL, you had
to use multi-punch to get many of the characters that didn't
have a key. But IBM didn't use that.

Re: counted characters in strings

<22-06-029@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=446&group=comp.compilers#446

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: robi...@dodo.com.au (Robin Vowels)
Newsgroups: comp.compilers
Subject: Re: counted characters in strings
Date: Fri, 10 Jun 2022 12:21:00 +1000
Organization: Compilers Central
Lines: 19
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-029@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers> <22-06-019@comp.compilers> <22-06-021@comp.compilers> <22-06-025@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset="UTF-8"; reply-type=original
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="88392"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 10 Jun 2022 11:41:04 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Robin Vowels - Fri, 10 Jun 2022 02:21 UTC

From: "gah4" <gah4@u.washington.edu>
Subject: Re: counted strings

> On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:
>
> In any case, if you write your program on a coding form, with
> each character in a little box, it is easy to know how many are
> in each H constant.

Nevertheless, counting the number of characters was a constant source of error.
It was easy enough to include the letter 'H' in the character count, so that
the following character was gobbled up into the Hollerith constant,
resulting in weird error messages.
When a Hollerith constant was long enough to require a continuation card,
it was even easier to lose count, the continuation character in column 6
sometimes being included.
And when the Hollerith constant required 133 characters, how many could
reliably count all of them?

Re: counted characters in strings

<22-06-035@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=450&group=comp.compilers#450

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: mar...@gkc.org.uk (Martin Ward)
Newsgroups: comp.compilers
Subject: Re: counted characters in strings
Date: Sat, 11 Jun 2022 10:52:08 +0100
Organization: Compilers Central
Lines: 27
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-035@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers> <22-06-019@comp.compilers> <22-06-021@comp.compilers> <22-06-025@comp.compilers> <22-06-029@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="43659"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, history
Posted-Date: 11 Jun 2022 09:58:50 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-029@comp.compilers>
Content-Language: en-GB
 by: Martin Ward - Sat, 11 Jun 2022 09:52 UTC

On 10/06/2022 03:21, Robin Vowels wrote:
> Nevertheless, counting the number of characters was a constant source
> of error. It was easy enough to include the letter 'H' in the
> character count, so that the following character was gobbled up into
> the Hollerith constant, resulting in weird error messages. When a
> Hollerith constant was long enough to require a continuation card, it
> was even easier to lose count, the continuation character in column
> 6 sometimes being included. And when the Hollerith constant required
> 133 characters, how many could reliably count all of them?

The point about coding forms was that each column of characters
was numbered, so you just had to take the first column and the last
and compute last - first + 1 to get the number of characters
in the string. You don't have to count each one individually.
If there is a continuation then you just compute last + 66 - first + 1.
For 133 characters, there would be two continuation cards
and the last column would be the same as the first:
so quite easy to count reliably in fact!

Back in the days before pocket calculators, many people could
do simple arithmetic sums in their heads! :-)

--
Martin

Dr Martin Ward | Email: martin@gkc.org.uk | http://www.gkc.org.uk
G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4

Re: counted characters in strings

<22-06-036@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=451&group=comp.compilers#451

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: drb...@msu.edu (Dennis Boone)
Newsgroups: comp.compilers
Subject: Re: counted characters in strings
Date: Sat, 11 Jun 2022 11:09:11 -0500
Organization: Compilers Central
Lines: 12
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-036@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> <22-06-008@comp.compilers> <22-06-013@comp.compilers> <22-06-015@comp.compilers> <22-06-019@comp.compilers> <22-06-021@comp.compilers> <22-06-025@comp.compilers> <22-06-029@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="69116"; mail-complaints-to="abuse@iecc.com"
Keywords: Fortran, history, comment
Posted-Date: 11 Jun 2022 12:53:27 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Dennis Boone - Sat, 11 Jun 2022 16:09 UTC

> And when the Hollerith constant required 133 characters, how many could
> reliably count all of them?

Such a long Hollerith string would be uncommon, I think. The main
purpose would seem to be headers on a printed report. It appears that
the 'T' specifier wasn't available in the early 60s versions of IBM
FORTRAN, but it certainly was there in FORTRAN 66.

De
[Early Fortran mostly read and wrote to tape files so who knows what long
strings people might have needed. Either way, I think we've beaten this
topic long enough. -John]
