From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch.embedded
Subject: Re: Text on FSM
Date: Tue, 28 Mar 2023 21:27:49 -0400
Message-ID: <mq072itelqtm4sc1f3odmb6q1c02noo51d@4ax.com>
References: <tuo8r6$3v26d$1@dont-email.me> <4lkv0it2p2kjt1suaivjtinebe4hd3scpj@4ax.com> <tup2g8$73r2$2@dont-email.me> <pi721i1k9146plsobdamel4f451k1tm2t1@4ax.com> <turb65$msvn$2@dont-email.me> <gujm1ihuvs8n1939vh42davuqq6meel3ht@4ax.com> <tvg986$rvkg$1@dont-email.me> <eefv1ilvqdj6f0nkk97iek6750ev8grca7@4ax.com> <17502eff0f7ca5e6$753$3048414$5aa10cad@news.thecubenet.com> <0qu52i97b4v6r4lsdu3gclh9emu429c582@4ax.com> <1750b43d382a2476$9$1290337$9aa1cca1@news.thecubenet.com>

On Wed, 29 Mar 2023 09:00:34 +1100, Clifford Heath
<no.spam@please.net> wrote:

>On 29/03/23 02:17, George Neuner wrote:
>> On Mon, 27 Mar 2023 16:18:51 +1100, Clifford Heath
>> <no.spam@please.net> wrote:
>>
>>> On 26/03/23 15:45, George Neuner wrote:
>>>> On Wed, 22 Mar 2023 18:15:43 -0700, Don Y
>>>> <blockedofcourse@foo.invalid> wrote:
>>>> The terms "FA" (finite automaton) and "FSM" (finite state machine)
>>>> are, in fact, synonymous.
>>>>
>>>> What is confusing is that we got to this point through discussion of
>>>> parsing and lexing tools - which ARE geared toward languages.
>>>> Moreover, yacc and bison do NOT implement a general FA, but rather a
>>>> particular variety of FA that is useful for language parsing and which
>>>> involves an auxiliary stack.
>>>
>>> The stack means it's not a FA.
>>
>> No, it still is an FA ... it just is a specialized form.
>
>Ok, it's an FA operating on a stack. The stack makes the whole thing
>non-regular, aka infinite, so it's only an FA if you exclude the stack
>from the machine.
>
>> Stackless FA, in fact, can process LR(1) grammars ... they just need
>> (typically many) more states in the machine to do so
>
>
>No. A stack is not finite.

Nor is the input stream. So what? The stack is NOT part of the
machine, it is a memory used BY the state machine.

>Every FA is finite, that's why they're called
>FA. If you want to process a regular language, you can use an FA. If you
>want to process an irregular language, you cannot - you need somewhere
>to store unbounded states and an FA *cannot* do that. It's in the
>definition of such things!

Nowhere in the definition of a finite automaton does it say the
automaton is limited to what can be encoded by its states. In
particular there is no prohibition against using an external memory.
Recall that Turing machines used tapes of infinite length.
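
As a concrete illustration (my own minimal sketch, not from any
particular text): a recognizer for balanced parentheses needs only a
two-state finite control plus an external stack. The control stays
finite; only the memory it drives is unbounded.

    #include <stack>
    #include <string>

    // Two-state finite control driving an external stack. The stack
    // is memory used BY the machine, not part of its state set.
    enum State { RUNNING, REJECTED };

    bool balanced(const std::string& input)
    {
        State state = RUNNING;
        std::stack<char> mem;                 // auxiliary, unbounded

        for (char c : input) {
            if (c == '(') {
                mem.push(c);
            } else if (c == ')') {
                if (mem.empty()) { state = REJECTED; break; }
                mem.pop();
            }
        }
        return state == RUNNING && mem.empty();
    }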

In any event, I'm still not following why you think this somehow is
important.

>>> ... and you'll
>>> get parse errors that probably don't really tell you what is wrong with
>>> the input :P.
>>
>> You can't rely on the tool for error handling (or even just messages)
>> ... you really need to add deliberate error handling.
>
>I wasn't talking about error recovery, just about reporting. Both are
>hugely easier in an LL grammar. In the PEG parsers that I favour, you
>can almost always just report the rules on the stack at the furthest
>point reached, and (in all the grammars I've implemented) that gives a
>better error report than anything you'd bother to create manually.

I wasn't talking about recovery either. When using an LR parser the
grammar designer/implementer has to augment BOTH error reporting and
error handling - which may or may not involve "recovery". See next.

>It amuses me that the folk who understand grammar well enough to be able
>to produce powerful parser generators seem to be universally incapable
>of generating code that can report parse failures in plain language.
>Something about their brain's language centres has become so esoteric
>that normal language escapes them.

LR works by incrementally assembling a sequence of tokens and looking
for a pattern that matches it.

LL works by selecting a pattern and incrementally looking to match
that pattern with the sequence of tokens beginning at the current
position in the input. Of course the pattern may be an alternation
having multiple possibilities, but the principle of operation remains.

Very, very different.

Neither method innately knows the context when a pattern match fails,
but in LL the context is readily apparent from the driver code which
directs the parse, so it is easy to provide a (somewhat) meaningful
error message just by maintaining a stack of the non-terminals already
matched and dumping the last N entries.

In contrast, in LR the context of the current match is given by the
machine state and the stack of unreduced (as-yet unmatched) tokens.
There is nothing readily available that could be used to provide a
user meaningful message ... you'd have to examine the machine state to
figure out even what you /might/ be looking for. Your position in the
input is about as close as you can get to a meaningful message without
the user code manually tracking context.
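
To make the LL side concrete, here is a hypothetical recursive-descent
skeleton (my own sketch; match() and IDENT are illustrative names, not
from any real tool). Each rule pushes its name on entry, so a failure
report can simply dump the trail of non-terminals:

    #include <cstdio>
    #include <vector>

    // Trail of non-terminals currently being matched; each rule
    // pushes its name on entry and pops on exit (RAII).
    static std::vector<const char*> trail;

    struct Rule {
        Rule(const char* name) { trail.push_back(name); }
        ~Rule()                { trail.pop_back(); }
    };

    static void report(const char* expected)
    {
        std::fprintf(stderr, "expected %s while parsing", expected);
        for (const char* r : trail)
            std::fprintf(stderr, " > %s", r);
        std::fputc('\n', stderr);
    }

    // A rule body then looks like:
    //
    //   void assignment() {
    //       Rule r("assignment");
    //       if (!match(IDENT)) { report("identifier"); return; }
    //       ...
    //   }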

>>> Lex/Flex on the other hand exists to process only finite
>>> states. The FSM algorithms they use are more efficient than any
>>> algorithm that can handle LALR2, which is why these tools still exist as
>>> independent tools.
>>
>> They exist separately because they were intended for different tasks
>
>The articles published at the time I first used them (in 1980) clearly
>stated that the two tools were needed "because we don't have a single
>algorithm that is equally efficient at both tokenisation and parsing".

That was true, but wasn't the reason AT THE TIME they were written.
They were separate first and foremost because they were written at
different times. They never were combined because most machines of
that time did not have enough memory to handle the analysis and
recognizer generation for even moderately complex grammars ... making
the tool larger by including lexing was out of the question.

After a while, it was simply inertia that kept them from being
combined. Everyone was used to the status quo and so even when memory
sizes grew to the point where having a combination tool could be
useful, very few people cared.

Inertia is the reason why a lot of potentially interesting things
never happened. Diversion is the other reason - the people who could
have done it were doing other things.

>Ken Thompson's implementation in the mid 1960s (documented in a 1968
>CACM paper) translated the regexp into machine code. The list of
>possible states was just a sequence of function call instructions.

Yes, but Thompson's method was not widely used - again because of
memory sizes. Most uses of regex used many patterns, and it was more
efficient (memory-wise) to simply interpret the pattern directly: one
driver function to handle N patterns.
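
Such a driver is tiny. Here is a sketch in the style of the well-known
Kernighan/Pike matcher, stripped to literals, '.', and '*' (no
anchors): one interpreter serves any number of pattern strings, with
no code generation at all.

    // One interpreter, N patterns: match pat against text, where pat
    // may contain literal characters, '.' (any char), and '*' (zero
    // or more of the preceding element).
    static bool matchhere(const char* pat, const char* text);

    static bool matchstar(char c, const char* pat, const char* text)
    {
        do {                      // try zero or more leading c's
            if (matchhere(pat, text))
                return true;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return false;
    }

    static bool matchhere(const char* pat, const char* text)
    {
        if (pat[0] == '\0') return true;
        if (pat[1] == '*')  return matchstar(pat[0], pat + 2, text);
        if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
            return matchhere(pat + 1, text + 1);
        return false;
    }

    bool match(const char* pat, const char* text)   // unanchored
    {
        do {
            if (matchhere(pat, text))
                return true;
        } while (*text++ != '\0');
        return false;
    }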

>The technique of converting multiple NFAs into a single DFA has also
>been in use since the early 70s.

Yes, and lex is from ~1973, IIRC. It was the first /publicly/
available tool able to combine multiple NFAs into a single DFA.

>> ANTLR implements LL(*) which is LL with unbounded lookahead.
>
>It's unbounded, but must be regular. Many languages (including my
>Constellation Query Language) require unbounded non-regular look-ahead,
>which PEG provides, at some extra cost in memory. But the pathological
>cases which *require* memoization only occur rarely, so a global packrat
>strategy is sub-optimal.
>
>> There
>> are other LL(k) tools which require the programmer to choose a fixed
>> amount of lookahead (and fail to process the grammar if the k value is
>> too small). ANTLR analyzes the grammar and computes what lookahead is
>> required pattern by pattern.
>
>That's a poor description of how it works. It looks ahead using an FA,
>so lookahead must be regular ("Finite State").

No. Lookahead (and backtracking both) simply requires maintaining a
queue of as-yet unmatched tokens. It certainly could be done by a
state machine, but it does NOT require a state machine.
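
For example, a hypothetical token-buffer wrapper (my names, not
ANTLR's) gives any parser unbounded lookahead and backtracking with
nothing more than a queue:

    #include <deque>
    #include <functional>

    struct Token { int kind; /* ... */ };

    class TokenBuffer {
        std::function<Token()> next_;   // underlying lexer
        std::deque<Token> buf;          // as-yet unmatched tokens
        size_t pos = 0;                 // cursor into buf
    public:
        explicit TokenBuffer(std::function<Token()> next)
            : next_(next) {}

        const Token& peek(size_t k = 0) {   // look ahead k tokens
            while (buf.size() <= pos + k) buf.push_back(next_());
            return buf[pos + k];
        }
        Token take() { Token t = peek(); ++pos; return t; }

        // Consumed tokens stay in buf so earlier marks remain valid.
        size_t mark() const    { return pos; }   // remember position
        void   reset(size_t m) { pos = m; }      // backtrack to it
    };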

>>> Anyone interested in the overlap between regular languages and finite
>>> state machines should refer to the excellent
>>> <https://github.com/katef/libfsm>.
>
>Did you look at the FSM on the main README page of that site?
>It shows two RE's being combined into one DFA. Very neat stuff.

I haven't examined their method. It may be that they have found some
particularly efficient way to do it. That would be great. But
algorithms for merging FAs in graph representation have been around at
least since the 60s.
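
For reference, here is a sketch of the textbook approach (nothing
libfsm-specific): merge NFAs by giving a fresh start state epsilon
edges into each, then determinize with the subset construction.
Epsilon is spelled '\0' here:

    #include <map>
    #include <set>
    #include <vector>

    // trans[state][symbol] = successor states; '\0' plays epsilon.
    struct NFA {
        std::vector<std::map<char, std::set<int>>> trans;
        int start;
        std::set<int> accept;  // a DFA subset accepts iff it meets this
    };

    // Epsilon-closure of a set of NFA states.
    static std::set<int> eclose(const NFA& n, std::set<int> s)
    {
        std::vector<int> work(s.begin(), s.end());
        while (!work.empty()) {
            int q = work.back(); work.pop_back();
            auto it = n.trans[q].find('\0');
            if (it == n.trans[q].end()) continue;
            for (int r : it->second)
                if (s.insert(r).second) work.push_back(r);
        }
        return s;
    }

    // Subset construction: each DFA state is a set of NFA states.
    // To merge NFAs A and B first, add a new start state with '\0'
    // edges to A.start and B.start (renumbering B), then call this.
    std::map<std::set<int>, std::map<char, std::set<int>>>
    determinize(const NFA& n, const std::set<char>& alphabet)
    {
        std::map<std::set<int>, std::map<char, std::set<int>>> dfa;
        std::vector<std::set<int>> work{ eclose(n, {n.start}) };
        while (!work.empty()) {
            std::set<int> d = work.back(); work.pop_back();
            if (dfa.count(d)) continue;          // already expanded
            for (char a : alphabet) {
                std::set<int> m;
                for (int q : d) {
                    auto it = n.trans[q].find(a);
                    if (it != n.trans[q].end())
                        m.insert(it->second.begin(), it->second.end());
                }
                dfa[d][a] = eclose(n, m);
                work.push_back(dfa[d][a]);
            }
        }
        return dfa;
    }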

>>> I'm currently building a generalised parsing engine that also has the
>>> capability of processing arbitrary binary file and network stream
>>> formats, using a VM approach that interprets something very like a BNF,
>>> but in prefix notation (+a means one-or-more "a"s, not a+). It's tiny,
>>> efficient, embeddable, but can take a protocol description in a very few
>>> bytes of VM code to handle almost any new protocol or format. I don't
>>> think that has been done before, and I've wanted to do it for 25 years.
>>
>> I would be interested to see that (when it's finished, of course).
>> Good luck!
>
>The putative grammar for Px is here (but this doesn't describe captures
>fully):
><https://github.com/cjheath/strpp/blob/main/grammars/px.px>
>
>and the Pegexp engine is here (a template that I'm specialising to add
>non-regular aka full LL grammar capability):
><https://github.com/cjheath/strpp/blob/main/include/pegexp.h>
>
>The Px grammar rewritten as a named-map of Pegexp expressions is here:
><https://github.com/cjheath/strpp/blob/main/test/peg_test.cpp#L55-L91>
>but I'll use a better structure for a compiled Px grammar, so that names
>don't need to be looked up at runtime.
>
>I've almost finished dicking with the structure of input streams that
>will make it feasible for this to process data directly arriving on a
>socket, and only caching as much as is needed for back-up and retry.
>It's also possible to compile with/without UTF-8 support, but I can make
>that more convenient. It's possible to specify binary matching even in a
>Unicode parser though.
>
>I want captures to do things like turn the ASCII digits on an HTTP
>Content-Length header into a binary integer, save that integer as a
>capture variable, and use that variable to count bytes in a later
>repetition. This will enable a simple grammar to describe all of HTTP/2.
>
>By nesting parsers (incrementally feeding capture sections to a nested
>parser) it should be possible, for example, to run a protocol engine that
>generates an HTTP/2 request (generating from an HTTP request grammar),
>parses the response chunks, feeds base64-encoded chunks into a
>conversion function (not specified in Px), and the output of that
>conversion into e.g. a JPEG parser that actually verifies the JPEG
>format, and can e.g. extract (as a parse capture) the GPS location from
>inside the Exif data attached... and all without having to extend or
>recompile the engine. Just load the target grammar, and if it succeeds,
>you get the GPS location... and all file formats have been validated.
>
>I envisage a world where the file-system is type-safe; almost no file is
>a pure byte-stream, and it's not possible to save a JPEG file that
>doesn't match the JPEG syntax. The file system must be pre-loaded with a
>grammar for every new file type before writing such a file.
>
>Clifford Heath.

George
