Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

grep me no patterns and I'll tell you no lines.


devel / comp.compilers / Re: What stage should entities be resolved?

SubjectAuthor
* What stage should entities be resolved? Lexical analysis stage? Syntax analysis Roger L Costello
+* Re: What stage should entities be resolved? Lexical analysis stage? Syntax analyHans-Peter Diettrich
|+* Re: What stage should entities be resolved?Christopher F Clark
||`- Re: What stage should entities be resolved?Hans-Peter Diettrich
|`* Re: What stage should entities be resolved?Roger L Costello
| +- Re: What stage should entities be resolved?Hans-Peter Diettrich
| +- Re: What stage should entities be resolved?gah4
| +* Re: What stage should entities be resolved?Kaz Kylheku
| |+- Re: What stage should entities be resolved?gah4
| |`- Re: What stage should entities be resolved?Martin Ward
| `- Re: What stage should entities be resolved?matt.ti...@gmail.com
+- RE: What stage should entities be resolved?Christopher F Clark
`- Re: What stage should entities be resolved? Lexical analysis stage? Syntax analymatt.timmermans

1
What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?

<22-03-019@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=307&group=comp.compilers#307

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Wed, 9 Mar 2022 17:22:00 +0000
Organization: Compilers Central
Lines: 57
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="88673"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, question
Posted-Date: 09 Mar 2022 15:21:37 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Roger L Costello - Wed, 9 Mar 2022 17:22 UTC

Hello Compiler Experts!
For learning purposes (and for fun) I want to build an XML parser.
While an XML is not a programming language and an XML parser is not a
compiler, I think that an XML parser performs the same steps as the front end
of a compiler.
I am reading a compiler book [1] and it says this:

---------------------------------------------------
The front end can be divided into lexical analyzer, syntax analyzer, and
semantic analyzer. The lexical analyzer, sometimes also called the scanner,
carries out the simplest level of structural analysis. It will group the
individual symbols of the source program text into their logical entities.
Thus the sequence of characters 'W', 'H', 'I', 'L', and 'E' would be
identified as the word 'WHILE' and the sequence of characters '1', '.', and
'0' would be identified as the floating-point number 1.0.
The syntax analyzer, often also called the parser, analyzes the overall
structure of the whole program, grouping the simple entities identified by the
scanner into the larger constructs, such as statements, loops, and routines,
that make up the complete program.
Once the structure of the program has been determined we can then analyze its
meaning (or semantics). We can determine which variables are to hold integers,
and which to hold floating point numbers, we can check that the size of all
arrays is defined and so on.
---------------------------------------------------

Okay, back to XML. Consider this non-well-formed XML:
<Publisher>Harper&amp;Row</Publsher>
(The end-tag is misspelled)
The &amp; is called an "XML entity." An XML parser will convert it to &. The
other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
What stage should the entity &amp; be converted to &?

1. Lexical analysis stage
2. Syntax analysis stage
3. Semantic analysis stage
What stage should detect that the <Publisher> start-tag does not have a
matching end-tag?

1. Lexical analysis stage
2. Syntax analysis stage
3. Semantic analysis stage
Some background information: The Flex manual shows an example [2] of a lexer
that scans a string which is enclosed in quotes. For this input:
"Hello\040World"
the lexical analyzer generates this token:
Hello World
Notice that the octal entity ( \040 ) has been resolved to its character (the
space character). That example leads me to conclude that a lexical analyzer is
responsible for converting XML entities, e.g.,
The lexical analyzer converts &amp; to &
However, the Flex manual showed that a lexer "could" resolve an octal entity,
but the manual didn't say that the lexer "should" resolve entities, so I don't
know if it is appropriate for the lexer to convert XML entities. What are your
thoughts on this?
/Roger
[1] "Introduction to Compiling Techniques" by J.P. Bennett
[2] See page 24, https://epaperpress.com/lexandyacc/download/flex.pdf

Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?

<22-03-025@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=312&group=comp.compilers#312

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Thu, 10 Mar 2022 09:48:48 +0100
Organization: Compilers Central
Lines: 50
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-025@comp.compilers>
References: <22-03-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="37603"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 11 Mar 2022 14:48:00 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-019@comp.compilers>
 by: Hans-Peter Diettrich - Thu, 10 Mar 2022 08:48 UTC

On 3/9/22 6:22 PM, Roger L Costello wrote:

> Okay, back to XML. Consider this non-well-formed XML:
> <Publisher>Harper&amp;Row</Publsher>
> (The end-tag is misspelled)
> The &amp; is called an "XML entity." An XML parser will convert it to &. The
> other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
> What stage should the entity &amp; be converted to &?

In other languages digraphs and trigraphs are used as replacements for
special characters. All such character replacements are handled at the
begin of the character input stage (lexer). In XML it also could be
handled by a preprocessor, to extend your stages:

0. Preprocessor
> 1. Lexical analysis stage
> 2. Syntax analysis stage
> 3. Semantic analysis stage

I prefer to describe/clarify the stages by their inputs and outputs:

A preprocessor inputs and outputs a stream of characters.
A Lexer reads a character stream and outputs a stream of terminal tokens.
A Parser accepts a stream of terminals, adds non-terminals from the
grammar, and outputs e.g. a tree structure.
Semantic analysis can be done during syntax analysis or later.

> What stage should detect that the <Publisher> start-tag does not have a
> matching end-tag?

As appropriate <g>. What should be the consequence of that mismatch?
It may be a quite harmless typo than can be fixed by auto correction.
Or it may indicate a missing closing tag if it matches some previous
opening tag?
Where in your implementation can you know enough about possible reasons
for the mismatch? Error handling and helpful error messages are a wide
and stony field <sigh>.

IMO it's up to the compiler writer to match the expectations of his
audience with such problems - warning, error, re-sync or abort processing?
Or you leave the handling to some user controlled compiler flags.

Don't take too seriously what you read about the one and only way to
classify or handle something. For XML (HTML...) you have a choice of DOM
or SAX parsing. Feel free to do it your way, after you have studied the
various approaches and pitfalls, and as long as you can be sure that the
results are correct and acceptable by your boss or users.

DoDi

RE: What stage should entities be resolved?

<22-03-026@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=313&group=comp.compilers#313

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: RE: What stage should entities be resolved?
Date: Thu, 10 Mar 2022 12:54:01 +0200
Organization: Compilers Central
Lines: 84
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-026@comp.compilers>
References: <22-03-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="38032"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 11 Mar 2022 14:49:25 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Thu, 10 Mar 2022 10:54 UTC

There is a simple rule of thumb that answers your question.

The lexer deals with all issues related to characters.
The parser deals with all issues related to tokens. It should rarely
look at the spelling (characters) of a token.
The semantic analysis phase should deal with all issues related to the
AST. It should rarely look at individual tokens.

At each phase, the preceding phase should have resolved the issues at
the lower level and at worse converted them into a unique number
associated with the output units.

So, for example from lexer to parser:
That is a character string can be distinguished from an integer from
an identifier from a keyword after the lexer and each character string
and integer and identifier should have a unique id number, but the
value of the integer should rarely be looked at by the parser (except
in some very odd and rare cases, e.g. only an integer zero might be
allowed in certain contexts, so zero can be distinguished from 42 but
the parser doesn't need to know the value of 42, just that it is a
different value than 43. So, 42 and 43 should be the same kind of
token, just with different unique ids. And if 42.0 and 42.000 are the
same, then they should be the same kind of tokens (and may or may not
need to have the same unique id). You may or may not care about that
distinction until much later, e.g. in the optimizer or code generator.

Similarly, from parser to semantic analysis.
The trees built should be identified such that the semantic analysis
doesn't generally look at the specific tokens, just the trees
themselves. Is this an expression or a statement? Does the "type" of
the expression map the "type" of variable that is being assigned to,
or do we need to insert a conversion or signal a semantic error?

Each phase has its own complexities. You don't want to be carrying
the complexities of one phase over to the next. Sometimes you have no
choice but to do so, but you want to make that rare.

That said, you also want to keep each phase as simple as possible.
Notice how in my previous posting I mentioned splitting the scanner
and the screener into separate parts. So, it might be that resolving
special character constructs (e.g. &amp;) might be done outside the
scanner per se, but still part of lexing). Figure out what makes the
code easiest to read. And, with that implemented and having a working
system, then do the measurements that tell you whether you have a
bottleneck somewhere that you need to fix (possibly by putting code
together that was originally distinct). But having a working and
understandable first cut should be your first priority. Then, you
have something you can use to validate your refactorings against and
make certain you haven't taken an invalid shortcut that breaks
something.

Moreover, most of the time, the simplest code is also the fastest.
You will likely spend more time in the scanner than nearly any other
phase, because it has to touch each character. So, the less is has to
do with the majority of tokens the better. Simple code that gets the
text divided into tokens and roughly categorizes them will be fast at
doing so. Then if you have tokens like character strings that need
special processing, recognizing "escape sequences", you can fix them
up in the code that deals only with character strings, and before you
hand those strings off to the parser. Just as you can map identifiers
into keywords after figuring out what the identifier is. This is the
point of having a "screener" after the scanner, but part of the lexer.

-------

One final comment, related to a different post but relevant here.
There are times, you might push the result later. The "go" "to"
versus "goto" example might be such a case. You can easily make the
parser reconize both a two keyword sequence "go" "to" and a one
keyword sequence "goto" are synonymous and trying to patch the tokens
back in the lexer can be more complex, e.g. some bizarre dialect of
BASIC:

if i < 5 go to 100;
versus
for i = go to end { a[i] =2*i }

-
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: What stage should entities be resolved?

<22-03-028@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=314&group=comp.compilers#314

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sat, 12 Mar 2022 14:11:21 +0200
Organization: Compilers Central
Lines: 36
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-028@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="8890"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 14 Mar 2022 11:36:04 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Sat, 12 Mar 2022 12:11 UTC

Contrary to what might assume from my previous posting on this topic.
I agree with Dodi.

Sometimes, the right answer is another phase. To keep your lexer
simple, it can be useful to have a separate phase that deals with
"character" issues, whether that is transforming UTF-8 extensions into
unique code points (or actual characters representing glyphs possibly
accented, i.e. resolving the combining code points into canonical
versions) or taking sequences like &amp; or \n or whatever into single
tokens (or characters). That *can* make the whole process simpler and
faster.

For example, years ago when working on a C compiler for Honeywell when
the first ANSI standard was still new, the standard had 8 stages (if I
recall correctly) that described the lexing process. We decided that
the best way to assure faithfulness to the standard was to implement
the 8 stages exactly as specified, at least in the first version.
That way we had a reliable model of the desired behavior that we could
track back to the standard. Moreover, by having them as separate
pieces of code, it was easy to turn them off (e.g. trigraphs in C were
an ANSI invention and some C programs used ??? not as a trigraph but
as a way of emphasis). Similarly, some pre-ANSI C dialects supported
nested comments and you might want to change that phase.

While you do want each phase to generally build larger and larger
structures. I.e. you don't want your parser very often dealing with
strings as individual characters. The exact number of phases or
content of each phase can vary slightly. One size rarely fits all.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?

<22-03-029@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=315&group=comp.compilers#315

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: matt.tim...@gmail.com
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved? Lexical analysis stage? Syntax analysis stage? Semantic analysis stage?
Date: Sat, 12 Mar 2022 05:12:25 -0800 (PST)
Organization: Compilers Central
Lines: 44
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-029@comp.compilers>
References: <22-03-019@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="9665"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 14 Mar 2022 11:37:39 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: matt.tim...@gmail.com - Sat, 12 Mar 2022 13:12 UTC

On Wednesday, 9 March 2022 at 15:21:40 UTC-5, Roger L Costello wrote:
> [...]
> Okay, back to XML. Consider this non-well-formed XML:
> <Publisher>Harper&amp;Row</Publsher>
> (The end-tag is misspelled)
> The &amp; is called an "XML entity." An XML parser will convert it to &. The
> other XML entities are: &lt; ... &gt; ... &quot; ... &apos;
> What stage should the entity &amp; be converted to &?
>
> 1. Lexical analysis stage
> 2. Syntax analysis stage
> 3. Semantic analysis stage
> What stage should detect that the <Publisher> start-tag does not have a
> matching end-tag?

Other answers provide a discussion of how you make this decision in general.

Specifically for XML, though, these are practical questions.

Re Entities:
- you can't really recognize them in lexical analysis, because they aren't
valid everywhere. <T&#41;G> is not a valid tag, and <![CDATA[&amp;]]> has no
entities in it. It can be convenient, though, for the lexer to capture them
as ENTITY_REFERENCE tokens with their original text (like strings). Where
they occur in CDATA sections, phase 3 can convert them back into their
original text. Otherwise, the lexer should produce tokens like ENTITY_START,
WORD_CHARS, ENTITY_END.
- Regardless of what the lexer produces, the syntax analysis phase ensures
that entities only occur in valid locations, and produces a parse tree with
enough information to determine how they're handled. This is where <T&#41;G>
is rejected as invalid.
- During semantic processing, entities are converted to whatever their
appropriate final form is. They will be converted into the indicated
characters in strings or element content, or replaced with their original text
in CDATA sections.

Re: Tag Matching:
If you include tag matching in syntax, then the syntax is not context-free and
cannot be described with a context-free grammar... so don't do that.
Practically, only semantic analysis can match the </Publisher> end tag with
the preceding <Publisher> start tag.

Unfortunately, that means that your parse tree cannot have the element
hierarchy embedded in it. Your grammar cannot have an Element non-terminal.

Re: What stage should entities be resolved?

<22-03-031@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=316&group=comp.compilers#316

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Mon, 14 Mar 2022 19:43:22 +0100
Organization: Compilers Central
Lines: 28
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-031@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-028@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="56396"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 14 Mar 2022 14:50:21 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Hans-Peter Diettrich - Mon, 14 Mar 2022 18:43 UTC

On 3/12/22 1:11 PM, Christopher F Clark wrote:
> Contrary to what might assume from my previous posting on this topic.
> I agree with Dodi.
>
> Sometimes, the right answer is another phase. To keep your lexer
> simple, it can be useful to have a separate phase that deals with
> "character" issues, whether that is transforming UTF-8 extensions into
> unique code points (or actual characters representing glyphs possibly
> accented, i.e. resolving the combining code points into canonical
> versions) or taking sequences like &amp; or \n or whatever into single
> tokens (or characters). That *can* make the whole process simpler and
> faster.

I consider these "phases" as "filters". In my C parser I also had a
number of filter levels that handle the various aspects in detail of the
preprocessor macro substitution and conditional compilation. The parser
calls the top level filter to return the next C token, which in turn
calls lower level filters until all levels returned enough information
about the next token to parse.

A sloppy interpretation by Microsoft of the preprocessor as a
self-contained stage revealed that the newer C standards disallow a
stand-alone C preprocessor. Such a separate preprocessor could
synthesize tokens like "//" that never occured in a strict (embedded) C
standard implementation. Even if this was not stated explicitly in the
standard it turned out as a side effect of the lexer implementation.

DoDi

Re: What stage should entities be resolved?

<22-03-032@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=317&group=comp.compilers#317

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Tue, 15 Mar 2022 11:49:15 +0000
Organization: Compilers Central
Lines: 62
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-032@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="73851"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 17 Mar 2022 14:41:44 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Roger L Costello - Tue, 15 Mar 2022 11:49 UTC

Thank you DoDi, Chris, and Matt. You have provided truly exceptional information.

One thing that I am still unclear about is this:

How much knowledge of the language should each stage have?

For instance, as I understand it a C preprocessor goes through a C program and replaces macros. With this:

#define PI 3.14

the preprocessor will convert this:

area = PI * radius * radius;

to this:

area = 3.14 * radius * radius;

But if PI is inside a quoted string:

"Today is PI day"

then the preprocessor does not replace PI.

So the preprocessor has some knowledge about the language: If a macro is within a quoted string, then don’t replace it.

Similarly, in XML if &amp; is embedded inside a CDATA section:

<![CDATA[&amp;]]>

then a preprocessor must not replace &amp; with &. That is, the preprocessor must have knowledge about the language: If an XML entity is within a CDATA section, then don’t replace it.

So that brings me to my questions:

1. How much knowledge of the language should the preprocessor stage have?

2. How much knowledge of the language should the lexical analysis stage have?

3. How much knowledge of the language should the syntax analysis stage have?

4. How much knowledge of the language should the semantic analysis stage have?

To make the questions concrete, consider this XML:

<foo>…</foo>

Should the lexical analysis stage know that the foo in <foo> is a
start tag (STAG) and the foo in </foo> is an end tag (ETAG)? That
would mean the lexical analysis stage has considerable knowledge of
the XML language. Or should the lexical analysis stage simply identify
the foo in <foo> as a name (NAME) and the foo in </foo> as a name
(NAME)? That would mean the lexical analysis stage has lesser
knowledge of the XML language. How much knowledge of the XML language
should the lexical analysis stage have?

/Roger
[I'd say start and end tags are sufficiently basic that they are different
but your life is made more complicated with tags like <foo opt="42" />
which are both. Since you'll need to parse attributes, lexemes like
< </ > /> make sense.-John]

Re: What stage should entities be resolved?

<22-03-033@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=318&group=comp.compilers#318

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: DrDiettr...@netscape.net (Hans-Peter Diettrich)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Fri, 18 Mar 2022 00:31:39 +0100
Organization: Compilers Central
Lines: 17
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-033@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="49533"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, design
Posted-Date: 17 Mar 2022 19:34:55 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-032@comp.compilers>
 by: Hans-Peter Diettrich - Thu, 17 Mar 2022 23:31 UTC

On 3/15/22 12:49 PM, Roger L Costello wrote:

> So that brings me to my questions:
>
> 1. How much knowledge of the language should the preprocessor stage have?
[...]

Knowledge of how many languages should the preprocessor stage have?

I think that all parts of a compiler front end should follow the
specific language rules. Just for maintenance purposes.

If universal generator tools for lexer or parser are used then the
specific language grammar is the input to these tools.

DoDi

Re: What stage should entities be resolved?

<22-03-034@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=319&group=comp.compilers#319

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Thu, 17 Mar 2022 17:06:15 -0700 (PDT)
Organization: Compilers Central
Lines: 72
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-034@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="69868"; mail-complaints-to="abuse@iecc.com"
Keywords: C
Posted-Date: 17 Mar 2022 20:59:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-032@comp.compilers>
 by: gah4 - Fri, 18 Mar 2022 00:06 UTC

On Thursday, March 17, 2022 at 11:41:47 AM UTC-7, Roger L Costello wrote:

(snip)

> For instance, as I understand it a C preprocessor goes through a C program and replaces macros. With this:

> #define PI 3.14

> the preprocessor will convert this:

> area = PI * radius * radius;

> to this:

> area = 3.14 * radius * radius;

> But if PI is inside a quoted string:

> "Today is PI day"

(You missed by a few days.)

> then the preprocessor does not replace PI.

As well as I know it, the early preprocessor, used by K&R C, did
substitute in strings. The current standard version does not, but some
versions have the --traditional option which might allow it.

It is not unusual to use the C preprocessor with --traditional for
Fortran programs, and some compilers will automate this. Some of the
processing done by the current C standard version messes up Fortran
programs too much.

Note, though, that the C preprocessor is pretty good at ignoring
almost anything, including mismatched quotes, between

#if 0

and the matching

#endif

The text doesn't need to look much like C at all, and can even have
mismatched quotes. It does have to properly process nested #if though.

The PL/I preprocessor isn't quite as forgiving.

At least in the early compilers, like early C compilers, it was
implemented as a separate pass, with intermediate disk file. I suspect
you can't put quite as much garbage between

%IF 0 %THEN %DO

and

%END

and especially likely not unmatched quotes. (PL/I uses apostrophes
for character constants, in both the language itself and the preprocessor.)

Java, on the other hand, does not have a preprocessor. The compiler is
supposed to know enough not to compile statements between

if(0) {

and

}

but I suspect that they need to look a lot like Java statements.
(You can also use a static final variable in such an if statement,
which the compiler knows how to evaluate at compile time.)

Re: What stage should entities be resolved?

<22-03-037@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=322&group=comp.compilers#322

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Fri, 18 Mar 2022 17:50:22 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 194
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-037@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="29574"; mail-complaints-to="abuse@iecc.com"
Keywords: design
Posted-Date: 18 Mar 2022 14:00:37 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Fri, 18 Mar 2022 17:50 UTC

On 2022-03-15, Roger L Costello <costello@mitre.org> wrote:
> But if PI is inside a quoted string:
>
> "Today is PI day"
>
> then the preprocessor does not replace PI.

This was not historically true, though. I know for a fact that ancient
versions of the C preprocessor would have replaced PI in character
constants, as in 'PI'. Possibly, some versions of it also didn't
care about quotes.

> 1. How much knowledge of the language should the preprocessor stage have?

A preprocessor doesn't have to have any knowledge of the language
being preprocessed.

If the preprocessor is universal, like m4, that is practically a given.

If a preprocessor is dedicated to a language, like the C one,
it's a good idea for it to work at the token level: tokenize its
input the same way, or at least very similarly, to the target language.

In C, a preprocessor token is a kind of superset of a proper token.
There are pp-number tokens that are not number tokens, for instance,
but not vice versa.

> 2. How much knowledge of the language should the lexical analysis stage have?
>
> 3. How much knowledge of the language should the syntax analysis stage have?

The traditional lexical/syntax division comes from a blueprint for
programming language processing laid down in the work of computer
scientists done in the 1960's and 1970's.

It is not necessary; a context-free grammar can handle the entire syntax
down to the token level, like a decimal integer being a sequence of
digits and whatnot.

The division into lexical analysis and parsing comes from several
pragmatic observations:

1. There are dedicated algorithms that are good for tokenizing.
Tokens exhibit regular syntax: they can be recognized using
finite automata (regular expressions).

2. There are efficient parsing techniques like LALR(1) that assume
that there is one symbol of lookahead. If you use LALR(1)
with a scanner, you get one *token* of lookahead.
If you use LALR(1) without a scanner (token == character)
then you get one *character* of lookahead, which is going to
be crippling. Thus, separating lexical analysis effectively
amplifies LALR(1) into LALR(k), for some k related to maximum
token character length.

E.g. if you design a C parser without a scanner, over the character
level syntax, when that parser sees the character "do", it
has no idea whether it's looking at the "do" keyword, the
start of the "double" keyword, or something else like "dongle".
One character of lookahead will confirm or eliminate "do",
but not resolve "double" versus "dongle".

So to answer the questions, if you're assuming that you're going to be
using the traditional framework, with a regular-expression-driven
lexer and a LR parser with 1 token of lookahead, the way you divide
the work, roughly, is by identifying the token-like elements in the
language that have regular syntax. Anything that exhibits nesting,
requiring rule recursion, will be farmed off to the parser. Your
decision could sometimes also be informed by the lookahead concern;
you could choose to clump together several items that might plausibly
be individual tokens into a "supertoken" if there is some parsing
advantage in it, or other simplification.

Other kinds of decisions interact with the language definition.
For instance, what is -1234? In Common Lisp, and most other Lisp
dialect, I suspect, that is a token denoting a negative integer.
It may not be written - 1234. In C, and languages imitating its syntax,
-1234 is a unary expression: the unary - operator applied to the
integer 1234, and so there are two tokens. They may be separated
by whitespace, or a comment.

This strikes at the language definition, becuase what it implies is
that C does not have negative constants.

And that causes a subtle problems.

Say the int type is 32 bits wide.

The most negative 32 bit two's complement integer is -2147483648.

But in C this is actually the negation of 2147483648 which does
not fit into the type int. The resulting constant expression will
not have type int.

Defining the INT_MIN constant in <limits.h> such that
the expression has type int, and doesn't contain any casts
requires something like:

#define INT_MIN (-2147483647 - 1)

> 4. How much knowledge of the language should the semantic analysis stage have?

All knowledge that the previous stages do not have, minus
knowledge that only the application domain has. :)

>
> To make the questions concrete, consider this XML:
>
><foo>…</foo>
>
> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)?

XML has structure inside tags, namely attributes. Various implementation
choices present themselves.

One possibility is to recognize the special cases <abc>, </abc>
and <abc/> as a special token categories, e.g

ELEM_START_TRIVIAL
ELEM_END
ELEM_COMPLETE

All three token categories will carry a semantic attribute on the token
which gives the identity of the element, e.g "abc".

Then the case like <abc foo="bar">

The "<abc" part could produce a ELEM_START_OPEN token (distinct from
ELEM_START_TRIVIAL). After this token, zero or or more attributes
are expected, followed by ELEM_START_CLOSE produced by the angle
bracket.

The parser has to recognize that ELEM_START_OPEN has been seen,
and then parse the attribute tokens, looking for ELEM_START_CLOSE.

Note that ELEM_START_TRIVIAL is then not necessary, becuase
<a> could produce ELEM_START_OPEN followed immediately by
ELEM_START_CLOSE.

> Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)? That would mean the lexical analysis stage has lesser
> knowledge of the XML language. How much knowledge of the XML language
> should the lexical analysis stage have?

I would say that because the attributes have special syntax with
quoting like foo="bar", it would be better for the lexer to be aware
of the tag structure, so that it can switch to a mode for precisely
handling attributes.

Consider <abc foo="bar">foo="bar"</a>. Do you want to tokenize the
attribute foo="bar" in the same way as the interior of the element?

Chances are you want to turn the entire content of the <a> element,
into a single text datum, if it doesn't contain any sub-elements.
The quotes and equal sign are not special in any way.

Also, attributes can have angle brackets; so that is to say,
I think <abc foo="<abc>"> is valid.

If the tokenizer is oblivious to all this, it will lead to ugly
complexity in the parser.

As a final general remark, languages are often defined at multiple
description levels which can be identified as lexical and syntactic.
If you're implementing, it behooves you to follow those separations
in the spec, except when your own experience tells you to adjust the
interpretation here and there; like you recognize that some
syntactic aspect is nicely handled by your scanner or vice versa.

The W3C's definition of XML uses one big grammar, without regard
for separation into lexing and parsing. However, you can recognize
regular elements in it. Consider this grammar rule:

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

where:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character,
excluding the surrogate blocks,
FFFE, and FFFF. *

Note that this is entirely regular: we can convert the Comment match
entirely into the rules for a lexical analyzer, and therefore
Comment can be a token. So that would be the main principle here:
scan the tree structur of the XML grammar and look for the tallest
subtrees that are entirely regular, considering those to be token
candidates. Or something like that.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Re: What stage should entities be resolved?

<22-03-040@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=325&group=comp.compilers#325

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Fri, 18 Mar 2022 14:08:26 -0700 (PDT)
Organization: Compilers Central
Lines: 43
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-040@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers> <22-03-037@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="21815"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, Fortran
Posted-Date: 18 Mar 2022 18:43:30 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-037@comp.compilers>
 by: gah4 - Fri, 18 Mar 2022 21:08 UTC

On Friday, March 18, 2022 at 11:00:40 AM UTC-7, Kaz Kylheku wrote:

> So to answer the questions, if you're assuming that you're going to be
> using the traditional framework, with a regular-expression-driven
> lexer and a LR parser with 1 token of lookahead, the way you divide
> the work, roughly, is by identifying the token-like elements in the
> language that have regular syntax. Anything that exhibits nesting,
> requiring rule recursion, will be farmed off to the parser. Your
> decision could sometimes also be informed by the lookahead concern;
> you could choose to clump together several items that might plausibly
> be individual tokens into a "supertoken" if there is some parsing
> advantage in it, or other simplification.

Besides simplifying parsing, it also makes better error messages.

For one, if it has a whole token, the message can indicate that.

> Other kinds of decisions interact with the language definition.
> For instance, what is -1234? In Common Lisp, and most other Lisp
> dialect, I suspect, that is a token denoting a negative integer.
> It may not be written - 1234. In C, and languages imitating its syntax,
> -1234 is a unary expression: the unary - operator applied to the
> integer 1234, and so there are two tokens. They may be separated
> by whitespace, or a comment.

> This strikes at the language definition, becuase what it implies is
> that C does not have negative constants.

It gets more interesting in Fortran.

Fortran mostly doesn't have signed constants, so it would be a unary
expression, except that there are some places where constants are allowed
and not expressions, so in those cases it allows for signed constants.

The most important one, and maybe only one left, is the DATA statement.

Well, it used to be that constants and variables, but not expressions, were
allowed for DO statements, but then again it didn't even allow for 0,
and especially not negative constants.

I suspect, though, that even though DATA statements are defined to have
signed constants, that the parser can allow unary expressions and not
tell anyone. Well, it might affect error messages, too.

Re: What stage should entities be resolved?

<22-03-042@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=327&group=comp.compilers#327

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: mar...@gkc.org.uk (Martin Ward)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sat, 19 Mar 2022 18:17:14 +0000
Organization: Compilers Central
Lines: 21
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-042@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers> <22-03-037@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="63378"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, design
Posted-Date: 19 Mar 2022 17:50:09 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-037@comp.compilers>
Content-Language: en-GB
 by: Martin Ward - Sat, 19 Mar 2022 18:17 UTC

On 2022-03-15, Roger L Costello <costello@mitre.org> wrote:
> 1. How much knowledge of the language should the preprocessor stage have?

In general, the preprocessor processes one language and passes
the result to a compiler which processes a different (but often
closely related) language. So the preprocessor can be thought of
as a text to text language translator (the first C++ compiler
was implemented as macros in the C preprocessor which accepted
C++ source code and generated C source code). As a language
tranlslator it should ideally know all about the input language
that it is processing and only generate valid output. In practice,
of course, many preprocessors have a very limited understanding
of the source language they are processing and pass on syntax errors
to the compiler (perhaps with hints about which line of the original
source file this line of code came from).

--
Martin

Dr Martin Ward | Email: martin@gkc.org.uk | http://www.gkc.org.uk
G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4

Re: What stage should entities be resolved?

<22-03-044@comp.compilers>

 copy mid

https://www.novabbs.com/devel/article-flat.php?id=329&group=comp.compilers#329

 copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: matt.tim...@gmail.com (matt.ti...@gmail.com)
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sun, 20 Mar 2022 07:32:14 -0700 (PDT)
Organization: Compilers Central
Lines: 62
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-044@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers> <22-03-032@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="70947"; mail-complaints-to="abuse@iecc.com"
Keywords: C, syntax
Posted-Date: 20 Mar 2022 15:10:23 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-032@comp.compilers>
 by: matt.ti...@gmail.com - Sun, 20 Mar 2022 14:32 UTC

On Thursday, 17 March 2022 at 14:41:47 UTC-4, Roger L Costello wrote:
> For instance, as I understand it a C preprocessor goes through a C program
and replaces macros. With this:

The C preprocessor is a completely different language. It compiles (many
would say "transpiles" these days) a C-with-macros file into a
C-without-macros file.

C, therefore, is a composition of two languages.

I think this turned out to be a *very bad idea*, except that we didn't know
enough about what the preprocessor would be used for to put these features in
the real language. Either way, you should not do anything like this, and you
should not think of what the lexer does, or what the parser does, or what
semantic analysis does, as any sort of "text replacement".

The output of the lexer is a token stream, and the output of the parser is an
AST, or some other intermediate representation that is richer than a token
stream. Text needs to be parsed, so it any of these stages produce
intermediate text, you have to start again at lexing.

> Similarly, in XML if &amp; is embedded inside a CDATA section:
>
> <![CDATA[&amp;]]>
>
> then a preprocessor must not replace &amp; with &. That is, the preprocessor
must have knowledge about the language: If an XML entity is within a CDATA
section, then don’t replace it.

Yeah, a preprocessing stage (which would require its own lexer and parser) is
not useful. Don't do it. A lexer can produce a rich token like
ENTITY_REF("&amp;"), which has enough information in it to be useful in either
context.

> 2. How much knowledge of the language should the lexical analysis stage
have?
> 3. How much knowledge of the language should the syntax analysis stage have?

As much as you need. Don't worry about it. All of these things are made with
deep knowledge of the language being parsed. The lexer is simple and fast and
divides the text into labelled atomic units. Use whatever division is
convenient. The only real restriction is that the lexer should not produce
anything that isn't atomic (may need to be subdivided later), or that requires
a rich internal structure (because you don't want to run token text through
another parser), or that can't be recognized with a regular language (because
that's how lexers work).

> Should the lexical analysis stage know that the foo in <foo> is a
> start tag (STAG) and the foo in </foo> is an end tag (ETAG)? That
> would mean the lexical analysis stage has considerable knowledge of
> the XML language. Or should the lexical analysis stage simply identify
> the foo in <foo> as a name (NAME) and the foo in </foo> as a name
> (NAME)?

Neither. Start tags can have a rich internal structure like <foo att1="a"
att2="b">. To provide access to that structure, the lexer must produce
smaller tokens. Usually a lexer would translate "<Foo>" into something like
"STAGO, NAME(Foo), TAGC". When you have attributes, you get something like
"STAGO, NAME(Foo), NAME(att1), EQ, STRING(a), TAGC"

End tags *could* be recognized by the lexer, but they aren't. Programmers
would handle them in a way similar to start tags, just for the consistency.

1
server_pubkey.txt

rocksolid light 0.9.7
clearnet tor