Updated String Parsing Words in kForth

<t0jcsp$9g4$1@dont-email.me>

From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Updated String Parsing Words in kForth
Date: Sat, 12 Mar 2022 18:12:07 -0600
 by: Krishna Myneni - Sun, 13 Mar 2022 00:12 UTC

I am overhauling the string parsing words in the kForth String Words
library (strings.4th). The existing words are not named as clearly as
they should be, and have inefficient implementations:

\ PARSE_TOKEN ( a u -- arem urem atok utok )
\ PARSE_LINE ( a u -- a1 u1 a2 u2 ... an un n )
\ PARSE_ARGS ( a u -- n ) ( F: r1 ... rn )
\ PARSE_CSV ( a u -- n ) ( F: r1 ... rn )

PARSE_TOKEN skips leading spaces and parses the next blank(s)-delimited
substring, atok utok, and also returns the remaining portion of the string.

PARSE_LINE applies PARSE_TOKEN repeatedly to place the token strings and
token count on the stack.

PARSE_ARGS parses a string with blank delimited numbers, converting each
substring to a floating point number, returning the n floating point
numbers and the count.

PARSE_CSV is a hack which replaces each comma in the string with a space
and performs the same function as PARSE_ARGS.

These words need clearer names and an upgrade (particularly PARSE_CSV).
My proposed replacements are

\ NEXT-BS-TOKEN ( a u -- atok utok arem urem )
\ NEXT-CS-TOKEN ( a u -- atok utok arem urem )
\ PARSED-BSV ( a u -- a1 u1 a2 u2 ... an un n )
\ PARSED-CSV ( a u -- a1 u1 a2 u2 ... an un n )
\ ITH-PARSED ( a1 u1 ... an un n i -- a1 u1 ... an un n ai ui )
\ DROP-PARSED ( a1 u1 ... an un n -- )
\ PARSED>FLOATS ( a1 u1 ... an un n -- n ) ( F: r1 ... rn )

NEXT-BS-TOKEN is a slightly more efficient version of PARSE_TOKEN, but
leaves the token and remaining strings in a swapped order. The name
indicates that it is parsing the next "blank(s)-separated" token in the
string.

NEXT-CS-TOKEN parses the next "comma-separated" token in the string and
returns the token and remaining substrings.
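As a rough illustration of the stack effect described for NEXT-CS-TOKEN, here is a minimal sketch in standard Forth. It is not the strings.4th implementation; SCAN-CHAR is a hypothetical helper introduced only for this example, and the behavior for quoted fields or surrounding blanks is not considered.

```forth
\ Hypothetical sketch, not the actual kForth strings.4th code.

\ Advance a u to the first occurrence of char c (u' = 0 if c is absent).
: SCAN-CHAR ( a u c -- a' u' )
    >R
    BEGIN  DUP WHILE              \ characters remain
           OVER C@ R@ <> WHILE    \ current character is not c
           1 /STRING
    REPEAT THEN
    R> DROP ;

: NEXT-CS-TOKEN ( a u -- atok utok arem urem )
    2DUP [CHAR] , SCAN-CHAR       \ a u a' u'    a' points at the comma, if any
    2SWAP 2 PICK -                \ a' u' a utok    utok = u - u'
    2SWAP                         \ a utok a' u'
    DUP IF 1 /STRING THEN ;       \ step over the comma itself
```

In this sketch an input with no comma yields the whole string as the token and an empty remainder.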

PARSED-BSV starts with an input string and repeatedly applies
NEXT-BS-TOKEN until the remaining string is null and returns all of the
substrings and the token count on the stack.

PARSED-CSV starts with an input string and repeatedly applies
NEXT-CS-TOKEN until the remaining string is null. It returns all of the
substrings and the token count on the stack. Unlike PARSED-BSV, each
comma delimiter will mark a separate token string, e.g. if the input
string is ",," there will be three token strings, each of length zero.

ITH-PARSED provides a way to PICK the ith substring returned by
PARSED-BSV and PARSED-CSV.

DROP-PARSED may be used to discard the n token substrings returned by
PARSED-BSV and PARSED-CSV.

PARSED>FLOATS converts each of the token substrings returned by
PARSED-BSV or PARSED-CSV to n floating point numbers. If the string to
float conversion fails, the floating point value NAN will be returned.
If a substring has zero length, the corresponding fp value will also be NAN.
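The per-token conversion described for PARSED>FLOATS might look roughly like the sketch below, built on the standard >FLOAT. FNAN and TOKEN>FLOAT are hypothetical names, and producing NaN with 0e 0e F/ assumes IEEE 754 arithmetic with no trap on the division; a given system may offer its own NaN word instead. The empty case is tested explicitly because the standard allows >FLOAT to treat a string of blanks as zero.

```forth
\ Hypothetical sketch; FNAN and TOKEN>FLOAT are illustrative names only.
\ Assumes 0e 0e F/ yields NaN rather than throwing (IEEE 754, no traps).
: FNAN ( F: -- nan )  0e 0e F/ ;

: TOKEN>FLOAT ( a u -- ) ( F: -- r )
    DUP 0= IF  2DROP FNAN EXIT  THEN   \ zero-length token gives NaN
    >FLOAT 0= IF  FNAN  THEN ;         \ failed conversion gives NaN
```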

Comments or suggestions?

--
Krishna Myneni

The previous set of words will be redefined using the new words (for
running existing source code); however, they will be considered deprecated.

Re: Updated String Parsing Words in kForth

<bf035654-a7af-4a30-8562-4de960cec122n@googlegroups.com>

 by: P Falth - Sun, 13 Mar 2022 08:26 UTC

On Sunday, 13 March 2022 at 01:12:12 UTC+1, Krishna Myneni wrote:
> I am overhauling the string parsing words in the kForth String Words
> library (strings.4th). The existing words are not named as clearly as
> they should be, and have inefficient implementations:
>
> \ PARSE_TOKEN ( a u -- arem urem atok utok )
> \ PARSE_LINE ( a u -- a1 u1 a2 u2 ... an un n )
> \ PARSE_ARGS ( a u -- n ) ( F: r1 ... rn )
> \ PARSE_CSV ( a u -- n ) ( F: r1 ... rn )
>
> PARSE_TOKEN skips leading spaces and parses the next blank(s)-delimited
> substring, atok utok, and also returns the remaining portion of the string.
>
> PARSE_LINE applies PARSE_TOKEN repeatedly to place the token strings and
> token count on the stack.
>
> PARSE_ARGS parses a string with blank delimited numbers, converting each
> substring to a floating point number, returning the n floating point
> numbers and the count.
>
> PARSE_CSV is a hack which replaces each comma in the string with a space
> and performs the same function as PARSE_ARGS.

In ntf/lxf I have

GET-WORD ( addr len -- addr" len" addr' len' )
GET-LINE ( addr len -- addr" len" addr' len' )
SPLIT-BUFFER ( addr len xchar -- addr" len" addr' len' )

GET-WORD is like your PARSE_TOKEN
GET-LINE will return a line but not split in tokens like your PARSE_LINE
SPLIT-BUFFER is a lower level word to construct other parsing words from.

Why do you need a word that potentially fills up the stack?

> These words need clearer names and an upgrade (particularly PARSE_CSV).
> My proposed replacements are
>
>
> \ NEXT-BS-TOKEN ( a u -- atok utok arem urem )
> \ NEXT-CS-TOKEN ( a u -- atok utok arem urem )
> \ PARSED-BSV ( a u -- a1 u1 a2 u2 ... an un n )
> \ PARSED-CSV ( a u -- a1 u1 a2 u2 ... an un n )
> \ ITH-PARSED ( a1 u1 ... an un n i -- a1 u1 ... an un n ai ui )
> \ DROP-PARSED ( a1 u1 ... an un n -- )
> \ PARSED>FLOATS ( a1 u1 ... an un n -- n ) ( F: r1 ... rn )
>
> NEXT-BS-TOKEN is a slightly more efficient version of PARSE_TOKEN, but
> leaves the token and remaining strings in a swapped order. The name
> indicates that it is parsing the next "blank(s)-separated" token in the
> string.

Why is it better to have them swapped?
In all my use cases I parse for a word, do something with it, check if there is remaining text
and repeat if there is.

BR
Peter Fälth
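The parse, process, repeat pattern described above can be sketched in standard Forth as follows. All word names here are hypothetical stand-ins (NEXT-BS-TOKEN is written from the stack comment in the proposal, not taken from strings.4th), and only the space character is treated as a blank.

```forth
\ Hypothetical sketch; none of these definitions come from kForth or ntf/lxf.
: SKIP-BLANKS ( a u -- a' u' )
    BEGIN  DUP WHILE  OVER C@ BL =  WHILE  1 /STRING  REPEAT THEN ;
: SCAN-BLANK ( a u -- a' u' )
    BEGIN  DUP WHILE  OVER C@ BL <> WHILE  1 /STRING  REPEAT THEN ;

\ Token first, remainder on top, as in the proposed NEXT-BS-TOKEN.
: NEXT-BS-TOKEN ( a u -- atok utok arem urem )
    SKIP-BLANKS 2DUP SCAN-BLANK     \ atok u arem urem
    2SWAP 2 PICK - 2SWAP ;          \ atok utok arem urem

\ Parse one token, use it, and repeat while text remains.
: .TOKENS ( a u -- )
    BEGIN  DUP WHILE
        NEXT-BS-TOKEN 2SWAP         \ arem urem atok utok
        DUP IF  CR TYPE  ELSE  2DROP  THEN
    REPEAT 2DROP ;
```

With the remainder left on top, this loop needs a 2SWAP before it can consume the token; with the older PARSE_TOKEN order the token is already on top.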

> NEXT-CS-TOKEN parses the next "comma-separated" token in the string and
> returns the token and remaining substrings.
>
> PARSED-BSV starts with an input string and repeatedly applies
> NEXT-BS-TOKEN until the remaining string is null and returns all of the
> substrings and the token count on the stack.
>
> PARSED-CSV starts with an input string and repeatedly applies
> NEXT-CS-TOKEN until the remaining string is null. It returns all of the
> substrings and the token count on the stack. Unlike PARSED-BSV, each
> comma delimiter will mark a separate token string, e.g. if the input
> string is ",," there will be three token strings, each of length zero.
>
> ITH-PARSED provides a way to PICK the ith substring returned by
> PARSED-BSV and PARSED-CSV.
>
> DROP-PARSED may be used to discard the n token substrings returned by
> PARSED-BSV and PARSED-CSV.
>
> PARSED>FLOATS converts each of the token substrings returned by
> PARSED-BSV or PARSED-CSV to n floating point numbers. If the string to
> float conversion fails, the floating point value NAN will be returned.
> If a substring has zero length, the corresponding fp value will also be NAN.
>
> Comments or suggestions?
>
> --
> Krishna Myneni
>
>
>
>
> The previous set of words will be redefined using the new words (for
> running existing source code); however, they will be considered deprecated.

Re: Updated String Parsing Words in kForth

<d78b8b14-1885-4235-b0da-364967354359n@googlegroups.com>

 by: Marcel Hendrix - Sun, 13 Mar 2022 08:59 UTC

On Sunday, March 13, 2022 at 1:12:12 AM UTC+1, Krishna Myneni wrote:
[..]
> Comments or suggestions?
[..]
It looks too specific to me, I mean, it probably serves your current
interests, but it wouldn't suffice for mine.

What sticks out is the lack of a separator character (space, comma, ..)
or string (cr+lf, eol, arbitrary word..).

It is interesting that you have words to break larger strings down in
sub-entities, however, returning all the addresses and lengths on the
stack is unacceptable. You may want to supply the name of a string
array as input to this parser, possibly with the count of tokens returned
on the stack. Such a word is the latest addition to my own string word
set. In my case I limit the number of tokens to a maximum, which
allows a fixed-size offset/length buffer that can remain
unnamed.
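One way to read the fixed-size buffer idea is sketched below in standard Forth. Every name here (MAX-TOKENS, TOKEN-BUF, #TOKENS, TOKEN[], SPLIT>BUF, SCAN-CHAR) and the limit of 16 are assumptions made for illustration; this is not the iForth word, and it stores address/length pairs rather than offsets.

```forth
\ Hypothetical sketch; names and the limit of 16 are illustrative only.
16 CONSTANT MAX-TOKENS
CREATE TOKEN-BUF  MAX-TOKENS 2* CELLS ALLOT   \ one addr/len pair per token
VARIABLE #TOKENS

: TOKEN[] ( i -- a u )  2* CELLS TOKEN-BUF + 2@ ;

: SCAN-CHAR ( a u c -- a' u' )      \ advance to the first c, u' = 0 if absent
    >R
    BEGIN  DUP WHILE  OVER C@ R@ <> WHILE  1 /STRING  REPEAT THEN
    R> DROP ;

\ Split a u at each occurrence of c, recording at most MAX-TOKENS pieces.
: SPLIT>BUF ( a u c -- n )
    >R  0 #TOKENS !
    BEGIN
        #TOKENS @ MAX-TOKENS <
    WHILE
        2DUP R@ SCAN-CHAR                      \ a u a' u'
        2SWAP 2 PICK -                         \ a' u' atok utok
        #TOKENS @ 2* CELLS TOKEN-BUF + 2!      \ record the pair
        1 #TOKENS +!
        DUP 0= IF  2DROP R> DROP #TOKENS @ EXIT  THEN   \ no delimiter left
        1 /STRING                              \ step over the delimiter
    REPEAT
    2DROP R> DROP #TOKENS @ ;
```

Only the count goes on the stack; the pieces are fetched afterwards with TOKEN[], e.g. `S" a,bc,d" CHAR , SPLIT>BUF` followed by `1 TOKEN[] TYPE`.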

Lately, I've encountered the need for a splitter where the start delimiter
is different from the end one (like "(" and ")"), but I haven't generalized
it yet.

Wil Baden's Toolbelt words are quite good (with some additions for upper
and lower case).

-marcel

Re: Updated String Parsing Words in kForth

<2022Mar13.103712@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Sun, 13 Mar 2022 09:37 UTC

Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>Comments or suggestions?

In one of my papers [ertl13-strings], I suggest doing splitting/parsing
one result at a time. We have SEARCH (and SCAN/SKIP) that can help in
working that way, but they require some additional work around them.
In the paper I suggest:

|search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
|
|Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
|c-a1 u3 is the substring before the first match, c-a4 u4 is the first
|match, and c-a5 u5 is the rest of the string, and the TOS is true;
|otherwise return false.

You would then call it again for c-a5 u5 to get the next match.

I also suggest using a PARSE-like interface (in combination with
EXECUTE-PARSING) to avoid the stack load of keeping c-a1 u1 and c-a5
u5 around. The parsing version of the word above is:

|parse-regexp ( c-a2 u2 -- c-a1 u3 c-a4 u4 true | false )
|
|Search for the regexp c-a2 u2 in the parse area. If a match is found,
|c-a4 u4 is the address of the match, and c-a1 u3 is the string that
|was skipped before the match was found. The next parse starts right
|behind the matching string.

SEARCH-REGEXP would be a factor of PARSE-REGEXP.

I have not implemented these words.
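For a feel of the result layout, the three-way split can be sketched for a literal search string using the standard SEARCH. This is only an approximation under stated assumptions: SPLIT-AT is a hypothetical name, and it matches plain text, not a regexp.

```forth
\ Hypothetical sketch; SPLIT-AT is an illustrative name. It matches a
\ literal string with the standard SEARCH, not a regular expression.
: SPLIT-AT ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
    DUP >R                         \ keep u2, the length of the match
    2OVER 2SWAP SEARCH             \ c-a1 u1 c-a4 urest flag
    0= IF  2DROP 2DROP R> DROP FALSE EXIT  THEN
    ROT OVER - ROT ROT             \ c-a1 u3 c-a4 urest    u3 = u1 - urest
    2DUP R@ /STRING                \ c-a1 u3 c-a4 urest c-a5 u5
    2SWAP DROP R> 2SWAP            \ c-a1 u3 c-a4 u4 c-a5 u5
    TRUE ;
```

Calling it again on c-a5 u5 would find the next match, as described above for SEARCH-REGEXP.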

@InProceedings{ertl13-strings,
author = {M. Anton Ertl},
title = {Standardize Strings Now!},
crossref = {euroforth13},
pages = {39--43},
url = {http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-strings.pdf},
OPTnote = {not refereed},
abstract = {This paper looks at the issues in string words: what
operations may be required, various design options,
and why this has lead to the current state of
standardization of string operations that is
insufficient in the eyes of many.}
}

@Proceedings{euroforth13,
title = {29th EuroForth Conference},
booktitle = {29th EuroForth Conference},
year = {2013},
key = {EuroForth'13},
url = {http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/proceedings.pdf}
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Re: Updated String Parsing Words in kForth

<622dc743$0$696$14726298@news.sunsite.dk>

 by: Doug Hoffman - Sun, 13 Mar 2022 10:28 UTC

On 3/13/22 4:59 AM, Marcel Hendrix wrote:

> What sticks out is the lack of a separator character (space, comma, ..)

> You may want to supply the name of a string
> array as input to this parser, ...

Agreed. I do both of the above in my string library code and it has
proven very useful. I have a word :split that is supplied the char and
the string to split. The output is a dynamically resizable array of
string objects. The output array size is only limited by heap memory.

\ find substring(s) delimited by:
\ 1) start of string and char
\ 2) char and char
\ 3) char and end of string
\ return all of them as an array of string objects allocated in the heap
: :split ( char str-obj -- 1-arry-obj ) ...

-Doug

Re: Updated String Parsing Words in kForth

<da284a67-9b52-4ecf-9013-ec56146c529fn@googlegroups.com>

 by: Marcel Hendrix - Sun, 13 Mar 2022 12:47 UTC

On Sunday, March 13, 2022 at 10:52:01 AM UTC+1, Anton Ertl wrote:
> Krishna Myneni <krishna...@ccreweb.org> writes:
> >Comments or suggestions?
>
> In one of my papers [ertl13-strings], I suggest for splitting/parsing
> to do it one result at a time. We have SEARCH (and SCAN/SKIP) that
> can help in working that way, but require some additional work around them. In the paper I suggest:
>
> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
> |
> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
> |otherwise return false.

One of the first BASIS documents had (LEX) :

FORTH> help (lex)
(lex) IFORTH
( c-addr1 u1 del -- c-addr3 u3 c-addr1 u4 del true ) or
( c-addr1 u1 del -- false )
Break a string in three pieces: before the delimiter del, the delimiter
itself, and after the delimiter. c-addr3 points to the remaining string,
c-addr1 points to the string in front of the delimiter. If false is
returned, the input string does not contain the delimiter character.

Coupling this to a full reg-exp package is tempting. However, what I've
seen with Python users is that they use regexp's all the time, without
any understanding of how it works. They rely on an online tool that shows
them how to hack it until the next gotcha. A middle ground would be nice.

-marcel

Re: Updated String Parsing Words in kForth

<e68990d8-5033-40ef-9aef-888d8fd03579n@googlegroups.com>

 by: S Jack - Sun, 13 Mar 2022 13:28 UTC

On Sunday, March 13, 2022 at 7:47:17 AM UTC-5, Marcel Hendrix wrote:
> Coupling this to a full reg-exp package is tempting. However, what I've
> seen with Python users is that they use regexp's all the time, without
> any understanding of how it works. They rely on an online tool that shows
> them how to hack it until the next gotcha. A middle ground would be nice.

Regex is standard fare in many programs. Used often enough,
competent people don't require aids. Provide for it and your
Forth won't stick out like a sore thumb and seem lacking to the
many that use and expect Regex.

--
me

Re: Updated String Parsing Words in kForth

<t0l099$jkn$1@dont-email.me>

 by: Krishna Myneni - Sun, 13 Mar 2022 14:49 UTC

On 3/13/22 07:47, Marcel Hendrix wrote:
> On Sunday, March 13, 2022 at 10:52:01 AM UTC+1, Anton Ertl wrote:
>> Krishna Myneni <krishna...@ccreweb.org> writes:
>>> Comments or suggestions?
>>
>> In one of my papers [ertl13-strings], I suggest for splitting/parsing
>> to do it one result at a time. We have SEARCH (and SCAN/SKIP) that
>> can help in working that way, but require some additional work around them. In the paper I suggest:
>>
>> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
>> |
>> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
>> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
>> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
>> |otherwise return false.
>
> One of the first BASIS documents had (LEX) :
>
> FORTH> help (lex)
> (lex) IFORTH
> ( c-addr1 u1 del -- c-addr3 u3 c-addr1 u4 del true ) or
> ( c-addr1 u1 del -- false )
> Break a string in three pieces: before the delimiter del, the delimiter
> itself, and after the delimiter. c-addr3 points to the remaining string,
> c-addr1 points to the string in front of the delimiter. If false is
> returned, the input string does not contain the delimiter character.
>
> Coupling this to a full reg-exp package is tempting. However, what I've
> seen with Python users is that they use regexp's all the time, without
> any understanding of how it works. They rely on an online tool that shows
> them how to hack it until the next gotcha. A middle ground would be nice.
>

A middle ground to using a regexp-based parser could be a word such as
PARSE-STRING, described by

PARSE-STRING ( caddr1 u1 xt-nexttoken xt-processtoken -- ntokens )

where xt-nexttoken provides the execution semantics for obtaining the
next token from the input string,

caddr1 u1 -- caddr-rem urem caddr-token utoken

and xt-processtoken provides the execution semantics for processing the
token,

caddr-token utoken --

The number of tokens parsed and processed is returned by PARSE-STRING.

This allows one to write and use simple token parsers, e.g. not just
simple character delimiter token parsers, but whitespace token parsers
which can skip leading whitespace. It also allows flexible processing of
the tokens as they are parsed, e.g. conversion to floating point numbers
on the fp stack or copying the token strings into a buffer.

Nevertheless, more than 99.0% of my use cases are for blank(s)-delimited
tokens, whitespace-delimited tokens, or comma-delimited tokens. Simple
and fast specialized string parsers for these cases are still useful to
have in a library.

--
Krishna

Re: Updated String Parsing Words in kForth

<t0l4t7$n24$1@dont-email.me>

 by: Krishna Myneni - Sun, 13 Mar 2022 16:08 UTC

On 3/13/22 09:49, Krishna Myneni wrote:
> On 3/13/22 07:47, Marcel Hendrix wrote:

>> ... A middle ground would be nice.

> A middle ground to using a regexp-based parser could be a word such as
> PARSE-STRING, described by
>
>   PARSE-STRING ( caddr1 u1 xt-nexttoken xt-processtoken -- ntokens )
>
> where xt-nexttoken provides the execution semantics for obtaining the
> next token from the input string,
>
>   caddr1 u1 -- caddr-rem urem caddr-token utoken
> ...

xt-nexttoken should also return a flag,

caddr u -- caddr-rem urem caddr-token utoken true | caddr-rem urem false

The order of args to PARSE-STRING should also be changed

PARSE-STRING ( xt-nexttoken xt-processtoken caddr u -- ntokens )

The implementation of PARSE-STRING should become trivial, except for how
to handle the final case of urem not being equal to zero.

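: PARSE-STRING ( xt-nexttoken xt-processtoken caddr u -- ntokens )
    0 >R                      \ token count lives on the return stack
    BEGIN
        2OVER DROP EXECUTE    \ run xt-nexttoken on caddr u
    WHILE
        4 PICK EXECUTE        \ run xt-processtoken on the token
        1 RP@ +!              \ increment the count (RP@ is not a standard word)
    REPEAT
    2DROP 2DROP               \ or handle urem <> 0
    R> ;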

Some clarity on what to do after REPEAT is needed.
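To make the flag contract concrete, here is a hedged usage sketch in standard Forth. COMMA-TOKEN?, ADD-LENGTH, TOTAL and SCAN-CHAR are hypothetical names invented for the example, and PARSE-STRING is repeated with the counter kept via R> 1+ >R so that the sketch does not depend on RP@.

```forth
\ Hypothetical sketch; only PARSE-STRING's stack effect comes from the post.
: PARSE-STRING ( xt-nexttoken xt-processtoken caddr u -- ntokens )
    0 >R
    BEGIN  2OVER DROP EXECUTE  WHILE
        4 PICK EXECUTE
        R> 1+ >R                  \ portable stand-in for 1 RP@ +!
    REPEAT
    2DROP 2DROP R> ;

: SCAN-CHAR ( a u c -- a' u' )    \ advance to the first c, u' = 0 if absent
    >R
    BEGIN  DUP WHILE  OVER C@ R@ <> WHILE  1 /STRING  REPEAT THEN
    R> DROP ;

\ Next-token word obeying the flag contract: false once the input is used up.
: COMMA-TOKEN? ( a u -- arem urem atok utok true | arem urem false )
    DUP 0= IF  FALSE EXIT  THEN
    2DUP [CHAR] , SCAN-CHAR       \ a u a' u'
    2SWAP 2 PICK -                \ a' u' atok utok
    2SWAP DUP IF 1 /STRING THEN   \ atok utok arem urem
    2SWAP TRUE ;

\ Token processor: accumulate the total length of all tokens.
VARIABLE TOTAL
: ADD-LENGTH ( atok utok -- )  NIP TOTAL +! ;
```

A call such as `0 TOTAL !  ' COMMA-TOKEN? ' ADD-LENGTH S" ab,cde,f" PARSE-STRING` leaves the token count on the stack and the summed lengths in TOTAL. Under this particular contract the remainder is always empty when the loop ends, and a trailing comma produces no final empty token; whether that is the desired behavior is exactly the open question raised above.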

--
Krishna

Re: Updated String Parsing Words in kForth

<3857b050-61d3-4f2b-aed7-fa6125a0b2e4n@googlegroups.com>

 by: minf...@arcor.de - Sun, 13 Mar 2022 20:24 UTC

S Jack wrote on Sunday, 13 March 2022 at 14:28:32 UTC+1:
> On Sunday, March 13, 2022 at 7:47:17 AM UTC-5, Marcel Hendrix wrote:
> > Coupling this to a full reg-exp package is tempting. However, what I've
> > seen with Python users is that they use regexp's all the time, without
> > any understanding of how it works. They rely on an online tool that shows
> > them how to hack it until the next gotcha. A middle ground would be nice.
> Regex is standard fare in many programs. Used often enough
> competent people don't require aids. Provide for it and your
> Forth won't stick out like a sore thumb and seem lacking to the
> many that use and expect Regex.

Nobody with a minimum of engineering background will bloat his
embedded application programs with regex when there are other
options that are smaller and faster.

Re: Updated String Parsing Words in kForth

<dc811433-2a0d-441d-968e-b9509d15afd3n@googlegroups.com>

 by: Hans Bezemer - Sun, 13 Mar 2022 21:36 UTC

On Sunday, March 13, 2022 at 2:28:32 PM UTC+1, S Jack wrote:
> Provide for it and your
> Forth won't stick out like a sore thumb and seem lacking to the
> many that use and expect Regex.
Do you REALLY think that people who NEED REGEX will actually consider
using a language like Forth?!

I do have a tiny one, but I very rarely use it. Much of the time you don't provide
an option string on a command line, but need an expression fixed in a program.
For that I use a lib like "chmatch.4th", which can be considered precompiled
regex code, but using Forth as the interpreter. It's much lighter, and it's Forth.

s" 0123456789" sconstant 'digits'
s" +-" sconstant 'sign'
char . constant 'point'

'sign' char-match drop
'point' char-equal >r
'digits' char-match r> swap >r >r
'digits' skip-while r> 0=
if
  'point' char-equal
  if
    'digits' char-match r> and >r
    'digits' skip-while
  then
then r>

The flag returned means "match or no match". Easy.

I guess those people also are incapable of doing a binary search, a decent sort and
desperately need hash tables and associative arrays to get any work done, huh?

Hans Bezemer

Re: Updated String Parsing Words in kForth

<t0ltmo$kuc$1@dont-email.me>

 by: Krishna Myneni - Sun, 13 Mar 2022 23:11 UTC

On 3/13/22 03:26, P Falth wrote:
> On Sunday, 13 March 2022 at 01:12:12 UTC+1, Krishna Myneni wrote:
>> I am overhauling the string parsing words in the kForth String Words
>> library (strings.4th). The existing words are not named as clearly as
>> they should be, and have inefficient implementations:
>>
>> \ PARSE_TOKEN ( a u -- arem urem atok utok )
>> \ PARSE_LINE ( a u -- a1 u1 a2 u2 ... an un n )
>> \ PARSE_ARGS ( a u -- n ) ( F: r1 ... rn )
>> \ PARSE_CSV ( a u -- n ) ( F: r1 ... rn )
>>
>> PARSE_TOKEN skips leading spaces and parses the next blank(s)-delimited
>> substring, atok utok, and also returns the remaining portion of the string.
>>
>> PARSE_LINE applies PARSE_TOKEN repeatedly to place the token strings and
>> token count on the stack.
>>
>> PARSE_ARGS parses a string with blank delimited numbers, converting each
>> substring to a floating point number, returning the n floating point
>> numbers and the count.
>>
>> PARSE_CSV is a hack which replaces each comma in the string with a space
>> and performs the same function as PARSE_ARGS.
>
> In ntf/lxf I have
>
> GET-WORD ( addr len -- addr" len" addr' len' )
> GET-LINE ( addr len -- addr" len" addr' len' )
> SPLIT-BUFFER ( addr len xchar -- addr" len" addr' len' )
>
> GET-WORD is like your PARSE_TOKEN
> GET-LINE will return a line but not split in tokens like your PARSE_LINE
> SPLIT-BUFFER is a lower level word to construct other parsing words from.
>
> Why do you need a word that potentially fills up the stack?
>

In practice, I mostly need to parse strings with fewer than 8 tokens
(columns/fields of data). Most Forth systems for desktop use provide a
deep enough data stack for handling parsing for much more than even 64
fields.

But your point is certainly valid. These words were not designed to
handle, say, a 256x256 matrix written to a file in text format, with
each row or column on a single line. Some scientific computing packages
such as R will do that. For reading such files, the stack approach could
be a problem.

In my revised approach with PARSE-STRING, mentioned elsewhere in this
thread, the token recognizer/parser will be accompanied by a token
processor for immediately handling the parsed string.

>> These words need clearer names and an upgrade (particularly PARSE_CSV).
>> My proposed replacements are
>>
>>
>> \ NEXT-BS-TOKEN ( a u -- atok utok arem urem )
>> \ NEXT-CS-TOKEN ( a u -- atok utok arem urem )
>> \ PARSED-BSV ( a u -- a1 u1 a2 u2 ... an un n )
>> \ PARSED-CSV ( a u -- a1 u1 a2 u2 ... an un n )
>> \ ITH-PARSED ( a1 u1 ... an un n i -- a1 u1 ... an un n ai ui )
>> \ DROP-PARSED ( a1 u1 ... an un n -- )
>> \ PARSED>FLOATS ( a1 u1 ... an un n -- n ) ( F: r1 ... rn )
>>
>> NEXT-BS-TOKEN is a slightly more efficient version of PARSE_TOKEN, but
>> leaves the token and remaining strings in a swapped order. The name
>> indicates that it is parsing the next "blank(s)-separated" token in the
>> string.
>
> Why is it better to have them swapped?
> In all my use cases I parse for a word, do something with it, check if there is remaining text
> and repeat if there is.
>

For the case where the parsed substrings are placed on the stack, it
allows the NEXT-xx-TOKEN to be used repeatedly. However, if you are
immediately processing the substring and removing it, then the exchanged
ordering ( arem urem atok utok ) is better.
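
The repeated-application case can be sketched as follows. This is a hypothetical implementation in standard Forth, not the kForth library code; the helper names SKIP-BLANKS and SCAN-BLANK, and the choice to stop at the first empty token, are assumptions of this sketch. Because NEXT-BS-TOKEN leaves the remainder on top, the loop can call it again directly while the earlier tokens pile up underneath.

```forth
: SKIP-BLANKS ( a u -- a' u' )   \ hypothetical helper: drop leading blanks
  BEGIN DUP IF OVER C@ BL = ELSE FALSE THEN WHILE 1 /STRING REPEAT ;
: SCAN-BLANK ( a u -- a' u' )    \ hypothetical helper: advance to the next blank or to the end
  BEGIN DUP IF OVER C@ BL <> ELSE FALSE THEN WHILE 1 /STRING REPEAT ;

: NEXT-BS-TOKEN ( a u -- atok utok arem urem )
  SKIP-BLANKS 2DUP SCAN-BLANK        \ atok u  arem urem
  DUP >R 2SWAP R> -                  \ arem urem  atok utok
  2SWAP ;                            \ atok utok  arem urem

: PARSED-BSV ( a u -- a1 u1 ... an un n )
  0 >R
  BEGIN NEXT-BS-TOKEN 2 PICK WHILE   \ non-empty token: keep it, remainder stays on top
    R> 1+ >R
  REPEAT
  2DROP 2DROP R> ;                   \ discard the empty token and the remainder

S" 3226 1589.199 X" PARSED-BSV .     \ token count; the ( a u ) pairs remain on the stack
```

With the other ordering ( arem urem atok utok ), the same loop would need a 2SWAP after every call, which is the cost being weighed in the reply above.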

--
Krishna

Re: Updated String Parsing Words in kForth

<t0lumk$8jf$1@dont-email.me>

 by: Krishna Myneni - Sun, 13 Mar 2022 23:28 UTC

On 3/13/22 04:37, Anton Ertl wrote:
> Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>> Comments or suggestions?
>
> In one of my papers [ertl13-strings], I suggest for splitting/parsing
> to do it one result at a time. We have SEARCH (and SCAN/SKIP) that
> can help in working that way, but require some additional work around them. In the paper I suggest:
>
> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
> |
> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
> |otherwise return false.
>
> You would then call it again for c-a5 u5 to get the next match.
>
....

In my generalized approach with PARSE-STRING, the xt-nexttoken
represents a generalized token recognizer/parser, which could use regexp
but can also be a simpler algorithm or a more complex algorithm. For
example, it allows for the possibility of including a state machine for
fields validation.

One example of a multicolumn file comes from the GitHub repo H2SPEC, linked
below. The Fortran program in this repo, h2spec.f, generates a
multicolumn file, h2lines.dat, with each line providing info about a
spectral line of the hydrogen molecule (in the "vacuum ultraviolet"
region). An example line in this file, which contains a mixture of
integer, floating point, and text data fields, is

3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O

The first column is an index number, col 2 is the wavelength of the
spectral line in Angstroms, col 3 is the frequency of the line in units
of cm^-1, col 4 is the transition rate in units of s^-1, cols 5--8 give
the quantum numbers/labels for the electronic, vibration, rotation, and
parity (Ortho/Para hydrogen) describing the lower level of transition,
and cols 9--12 are the corresponding information for the upper level.

The xt-nexttoken can validate the field type using a simple state
machine as each token is parsed from the string.
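
A rough sketch of such per-field validation, in standard Forth, might look like the following. Everything here is hypothetical: the word names, the type table (derived from the column description above), and the decision to keep the "state" in a simple column counter. For brevity the check is done in the loop that calls the tokenizer rather than inside the tokenizer itself. It assumes the Floating-Point word set, BASE set to decimal, and unsigned integer columns.

```forth
\ Column types for a 12-column record: 0 = integer, 1 = floating point, 2 = text label
CREATE FIELD-TYPES  0 C, 1 C, 1 C, 1 C,  2 C, 0 C, 0 C, 2 C,  2 C, 0 C, 0 C, 2 C,
VARIABLE FIELD#     \ the "state": index of the column just parsed (1-based)

: NEXT-BLANK-TOKEN ( a u -- arem urem atok utok true | arem urem false )
  BEGIN DUP IF OVER C@ BL = ELSE FALSE THEN WHILE 1 /STRING REPEAT   \ skip leading blanks
  DUP 0= IF FALSE EXIT THEN
  2DUP
  BEGIN DUP IF OVER C@ BL <> ELSE FALSE THEN WHILE 1 /STRING REPEAT  \ atok u  arem urem
  2SWAP 2 PICK -  TRUE ;                                             \ arem urem  atok utok  true

: INTEGER? ( a u -- flag )  0 0 2SWAP >NUMBER NIP NIP NIP 0= ;  \ true if every char converted
: FLOAT?   ( a u -- flag )  >FLOAT DUP IF FDROP THEN ;          \ discard the converted value

: FIELD-OK? ( a u col -- flag )
  DUP 1 13 WITHIN 0= IF DROP 2DROP FALSE EXIT THEN   \ more than 12 columns
  1- FIELD-TYPES + C@                 \ a u type
  DUP 0= IF DROP INTEGER? EXIT THEN
  1 = IF FLOAT? EXIT THEN
  NIP 0<> ;                           \ label: any non-empty token

: VALID-LINE? ( a u -- flag )
  0 FIELD# !  TRUE >R
  BEGIN NEXT-BLANK-TOKEN WHILE        \ arem urem  atok utok
    1 FIELD# +!  FIELD# @ FIELD-OK?  R> AND >R
  REPEAT
  2DROP  R> FIELD# @ 12 = AND ;       \ every field valid, and exactly 12 of them

S" 3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O" VALID-LINE? .   \ prints the validation flag
```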

--
Krishna

H2SPEC
https://github.com/mynenik/H2SPEC

>
> I have not implemented these words.
>
> @InProceedings{ertl13-strings,
> author = {M. Anton Ertl},
> title = {Standardize Strings Now!},
> crossref = {euroforth13},
> pages = {39--43},
> url = {http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-strings.pdf},
> OPTnote = {not refereed},
> abstract = {This paper looks at the issues in string words: what
> operations may be required, various design options,
> and why this has lead to the current state of
> standardization of string operations that is
> insufficient in the eyes of many.}
> }
>
> @Proceedings{euroforth13,
> title = {29th EuroForth Conference},
> booktitle = {29th EuroForth Conference},
> year = {2013},
> key = {EuroForth'13},
> url = {http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/proceedings.pdf}
> }
>
> - anton

Re: Updated String Parsing Words in kForth

<5cab428d-a7a5-49ca-98c1-154e9d960b8cn@googlegroups.com>

 by: NN - Mon, 14 Mar 2022 00:49 UTC

On Sunday, 13 March 2022 at 00:12:12 UTC, Krishna Myneni wrote:
> I am overhauling the string parsing words in the kForth String Words
> library (strings.4th). The existing words are not named as clearly as
> they should be, and have inefficient implementations:
>
> \ PARSE_TOKEN ( a u -- arem urem atok utok )
> \ PARSE_LINE ( a u -- a1 u1 a2 u2 ... an un n )
> \ PARSE_ARGS ( a u -- n ) ( F: r1 ... rn )
> \ PARSE_CSV ( a u -- n ) ( F: r1 ... rn )
>
> PARSE_TOKEN skips leading spaces and parses the next blank(s)-delimited
> substring, atok utok, and also returns the remaining portion of the string.
>
> PARSE_LINE applies PARSE_TOKEN repeatedly to place the token strings and
> token count on the stack.
>
> PARSE_ARGS parses a string with blank delimited numbers, converting each
> substring to a floating point number, returning the n floating point
> numbers and the count.
>
> PARSE_CSV is a hack which replaces each comma in the string with a space
> and performs the same function as PARSE_ARGS.
>
> These words need clearer names and an upgrade (particularly PARSE_CSV).
> My proposed replacements are
>
>
> \ NEXT-BS-TOKEN ( a u -- atok utok arem urem )
> \ NEXT-CS-TOKEN ( a u -- atok utok arem urem )
> \ PARSED-BSV ( a u -- a1 u1 a2 u2 ... an un n )
> \ PARSED-CSV ( a u -- a1 u1 a2 u2 ... an un n )
> \ ITH-PARSED ( a1 u1 ... an un n i -- a1 u1 ... an un n ai ui )
> \ DROP-PARSED ( a1 u1 ... an un n -- )
> \ PARSED>FLOATS ( a1 u1 ... an un n -- n ) ( F: r1 ... rn )
>
> NEXT-BS-TOKEN is a slightly more efficient version of PARSE_TOKEN, but
> leaves the token and remaining strings in a swapped order. The name
> indicates that it is parsing the next "blank(s)-separated" token in the
> string.
>
> NEXT-CS-TOKEN parses the next "comma-separated" token in the string and
> returns the token and remaining substrings.
>
> PARSED-BSV starts with an input string and repeatedly applies
> NEXT-BS-TOKEN until the remaining string is null and returns all of the
> substrings and the token count on the stack.
>
> PARSED-CSV starts with an input string and repeatedly applies
> NEXT-CS-TOKEN until the remaining string is null. It returns all of the
> substrings and the token count on the stack. Unlike PARSED-BSV, each
> comma delimiter will mark a separate token string, e.g. if the input
> string is ",," there will be three token strings, each of length zero.
>
> ITH-PARSED provides a way to PICK the ith substring returned by
> PARSED-BSV and PARSED-CSV.
>
> DROP-PARSED may be used to discard the n token substrings returned by
> PARSED-BSV and PARSED-CSV.
>
> PARSED>FLOATS converts each of the token substrings returned by
> PARSED-BSV or PARSED-CSV to n floating point numbers. If the string to
> float conversion fails, the floating point value NAN will be returned.
> If a substring has zero length, the corresponding fp value will also be NAN.
>
> Comments or suggestions?
>
> --
> Krishna Myneni
>
>
>
>
> The previous set of words will be redefined using the new words (for
> running existing source code); however, they will be considered deprecated.

Comment :

For my situation, I ended up writing two sets of words because I couldn't decide
which solution was better:
(1) one that took strings as arguments and returned string ( a u ) pairs,
(2) one that assumed the string was pointed to by mysrc and myin> and returned
string ( a u ) pairs.

I ended up using solution (2), where I set mysrc (the string's addr/len) and myin>
and then parsed the string using Forth-like words.

I also found it simpler to maintain a vector and accumulate the results
there rather than on the stack.

I would be interested to know whether this is something you considered at some point.
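
NN's words are not shown in the post, so the following is only a guess at what solution (2) with a result vector could look like in standard Forth. Only the names mysrc and myin> come from the post; MY-PARSE-NAME, ADD-TOKEN, TOKEN@, PARSE-ALL, the helpers and the fixed 64-entry vector are hypothetical.

```forth
2VARIABLE MYSRC                      \ addr/len of the string being parsed
VARIABLE  MYIN>                      \ offset of the next unparsed character

64 CONSTANT MAX-TOKENS
CREATE TOKENS  MAX-TOKENS 2* CELLS ALLOT   \ vector of ( a u ) pairs
VARIABLE #TOKENS

: SKIP-BLANKS ( a u -- a' u' )
  BEGIN DUP IF OVER C@ BL = ELSE FALSE THEN WHILE 1 /STRING REPEAT ;
: SCAN-BLANK ( a u -- a' u' )
  BEGIN DUP IF OVER C@ BL <> ELSE FALSE THEN WHILE 1 /STRING REPEAT ;

: MY-PARSE-NAME ( -- a u )           \ like PARSE-NAME, but reads from MYSRC / MYIN>
  MYSRC 2@ MYIN> @ /STRING           \ unparsed part of the string
  SKIP-BLANKS 2DUP SCAN-BLANK        \ atok u  arem urem
  DUP MYSRC 2@ NIP SWAP - MYIN> !    \ new offset = total length - urem
  NIP - ;                            \ atok utok  ( utok = 0 when nothing is left )

: ADD-TOKEN ( a u -- )  #TOKENS @ 2* CELLS TOKENS + 2!  1 #TOKENS +! ;
: TOKEN@ ( i -- a u )   2* CELLS TOKENS + 2@ ;   \ i is 0-based and not range-checked

: PARSE-ALL ( a u -- n )
  MYSRC 2!  0 MYIN> !  0 #TOKENS !
  BEGIN MY-PARSE-NAME DUP #TOKENS @ MAX-TOKENS < AND WHILE ADD-TOKEN REPEAT
  2DROP #TOKENS @ ;                  \ stops at end of string or when the vector is full

S" 1.5 2.5 3.5" PARSE-ALL .          \ token count
1 TOKEN@ TYPE                        \ the second token
```

The data stack stays flat however many fields a line has, at the price of a fixed upper bound and some global state.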

Re: Updated String Parsing Words in kForth

<t0me8f$112t$1@gioia.aioe.org>

 by: dxforth - Mon, 14 Mar 2022 03:53 UTC

On 14/03/2022 11:49, NN wrote:
> ...
> For my situation, I ended up writing two sets of words because I couldn't decide
> which solution was better.

I agree. It's near impossible to come up with a set of words that works for
every situation. Other than a few parsing primitives in the kernel which
I use to implement >FLOAT and the like, I leave such things to libraries or
the application itself. The parsers in the kernel are mostly dumb as the
words that use them can check for things such as empty strings.

Re: Updated String Parsing Words in kForth

<t0nd6r$lcn$1@dont-email.me>

 by: Krishna Myneni - Mon, 14 Mar 2022 12:42 UTC

On 3/13/22 19:49, NN wrote:
> Comment :
>
> For my situation, I ended up writing two sets of words because I couldn't decide
> which solution was better:
>
> (1) one that took strings as arguments and returned string ( a u ) pairs,
> (2) one that assumed the string was pointed to by mysrc and myin> and returned
> string ( a u ) pairs.
>
> I ended up using solution (2), where I set mysrc (the string's addr/len) and myin>
> and then parsed the string using Forth-like words.
>
> I also found it simpler to maintain a vector and accumulate the results
> there rather than on the stack.
>
> I would be interested to know whether this is something you considered at some point.

Do you mean that you used standard words like WORD PARSE PARSE-NAME to
parse strings from the input source? That's an alternative. But I think
it doesn't eliminate the need for string parsing words in general.

--
Krishna

Re: Updated String Parsing Words in kForth

<2fa880e3-aeb0-4335-b1ae-aa49ee8fb231n@googlegroups.com>

 by: S Jack - Tue, 15 Mar 2022 07:39 UTC

On Sunday, March 13, 2022 at 4:36:06 PM UTC-5, the.bee...@gmail.com wrote:
>
> s" 0123456789" sconstant 'digits'
> s" +-" sconstant 'sign'
> char . constant 'point'
>
> 'sign' char-match drop
> 'point' char-equal >r
> 'digits' char-match r> swap >r >r
> 'digits' skip-while r> 0=
> if
> 'point' char-equal
> if
> 'digits' char-match r> and >r
> 'digits' skip-while
> then
> then r>
>
> Flag returned means "match- or-no-match". Easy.
>

frogd
go

: sign "[+-]" ;
: digits "[[:digit:]]" ;
: point "\." ;
: any "*" ;
: some "\{1\}" ;

xx sign any digits any point digits some PATTERN
s -123.456 expr ==> -1
s +123.456 expr ==> -1
s 123.456 expr ==> -1
s .666 expr ==> -1
s -.42 expr ==> -1
s +1 expr ==> 0
s . expr ==> 0
s -. expr ==> 0

-fin-
ok
--
me

Re: Updated String Parsing Words in kForth

<4fbd18a2-e33e-4d06-aca9-6092b8b26ef2n@googlegroups.com>

 by: S Jack - Tue, 15 Mar 2022 16:28 UTC

On Tuesday, March 15, 2022 at 2:39:50 AM UTC-5, S Jack wrote:

:) frogd
ELF32X86_64 Frog Version 1.0d

"regex" demo

Demo Regular Expression (REGEX)

PATTERN ( sN..s1 -- )

PATTERN concatenates regex sub-strings into one regex pattern
assigned to value SPATTERN

: 'sign' "[+-]" ;
: 'digits' "[[:digit:]]" ;
: 'point' "\." ;
: 'any' "*" ;
: 'some' "\{1\}" ;

xx 'sign' 'any' 'digits' 'any' 'point' 'digits' 'some' PATTERN

EXPR ( s1 -- PASS | FAIL | s2 )

EXPR applies SPATTERN to input string s1 and returns "PASS"
if s1 matches SPATTERN and "FAIL" otherwise. If s1 matches and
SPATTERN contains a sub-pattern enclosed in parentheses, EXPR
instead extracts string s2, the content matched by the enclosed
sub-pattern.

bl !to s
s -123.456 EXPR ==> PASS
s +123.456 EXPR ==> PASS
s 123.456 EXPR ==> PASS
s .666 EXPR ==> PASS
s -.42 EXPR ==> PASS
s +1 EXPR ==> FAIL
s . EXPR ==> FAIL
s -. EXPR ==> FAIL

xx ".*\(X.*Y\).*" PATTERN
asc ; !to s
s Good magic XyzzY found here; EXPR ==> XyzzY
cr

-fin-
ok
( Not promoting; just showing off and having fun. )
( What's not seen in the pass/fail cases above is that the output is lined up, with PASS
in white on a green background and FAIL in white on a red background. )
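
For comparison with the hand-coded approach discussed earlier in the thread, the same test can be sketched in standard Forth without a regex engine. The word names are hypothetical, and the sketch assumes the pattern is matched from the start of the string, which is consistent with the PASS/FAIL results listed above.

```forth
: DIGIT? ( c -- flag )  [CHAR] 0 [CHAR] 9 1+ WITHIN ;
: SIGN?  ( c -- flag )  DUP [CHAR] + = SWAP [CHAR] - = OR ;

: SKIP-SIGNS  ( a u -- a' u' )
  BEGIN DUP IF OVER C@ SIGN?  ELSE FALSE THEN WHILE 1 /STRING REPEAT ;
: SKIP-DIGITS ( a u -- a' u' )
  BEGIN DUP IF OVER C@ DIGIT? ELSE FALSE THEN WHILE 1 /STRING REPEAT ;

\ True if the string starts with  [+-]*[[:digit:]]*\.[[:digit:]]
: FIXED-POINT? ( a u -- flag )
  SKIP-SIGNS SKIP-DIGITS
  DUP 0= IF 2DROP FALSE EXIT THEN               \ ran out before the point
  OVER C@ [CHAR] . <> IF 2DROP FALSE EXIT THEN  \ next char is not the point
  1 /STRING                                     \ step over the point
  DUP 0= IF 2DROP FALSE EXIT THEN               \ nothing after the point
  DROP C@ DIGIT? ;                              \ need one digit after it

S" -123.456" FIXED-POINT? .   \ true
S" +1"       FIXED-POINT? .   \ false
```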
--
me

Re: Updated String Parsing Words in kForth

<2022Mar17.231326@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Thu, 17 Mar 2022 22:13 UTC

Marcel Hendrix <mhx@iae.nl> writes:
>On Sunday, March 13, 2022 at 10:52:01 AM UTC+1, Anton Ertl wrote:
>> Krishna Myneni <krishna...@ccreweb.org> writes:
>> >Comments or suggestions?
>>
>> In one of my papers [ertl13-strings], I suggest for splitting/parsing
>> to do it one result at a time. We have SEARCH (and SCAN/SKIP) that
>> can help in working that way, but require some additional work around
>> them. In the paper I suggest:
>>
>> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
>> |
>> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
>> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
>> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
>> |otherwise return false.
>
>One of the first BASIS documents had (LEX) :
>
>FORTH> help (lex)
>(lex) IFORTH
> ( c-addr1 u1 del -- c-addr3 u3 c-addr1 u4 del true ) or
> ( c-addr1 u1 del -- false )
> Break a string in three pieces: before the delimiter del, the delimiter
> itself, and after the delimiter. c-addr3 points to the remaining string,
> c-addr1 points to the string in front of the delimiter. If false is
> returned, the input string does not contain the delimiter character.

Interesting difference of result order compared to what I suggested.
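
The SEARCH-REGEXP result order quoted above can be sketched with a fixed search string in place of a regexp, using only the standard words SEARCH and /STRING. The name SPLIT-AT is hypothetical (it comes neither from the paper nor from any system mentioned in this thread), and the sketch assumes one char occupies one address unit:

```forth
\ SPLIT-AT is a hypothetical name: a fixed-string stand-in for SEARCH-REGEXP.
\ On a match it returns the text before the match, the match, and the rest.
: split-at ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
   2over 2over search 0= if     \ no match: discard all three strings
      2drop 2drop 2drop false exit
   then                         \ c-a1 u1 c-a2 u2 c-a4 urem
   2>r nip nip swap             \ u2 c-a1              R: c-a4 urem
   2r@ drop over -              \ u2 c-a1 u3
   rot 2r>                      \ c-a1 u3 u2 c-a4 urem
   rot >r                       \ c-a1 u3 c-a4 urem    R: u2
   over r@ 2swap                \ c-a1 u3 c-a4 u2 c-a4 urem
   r> /string true ;            \ c-a1 u3 c-a4 u4 c-a5 u5 true

: demo-split ( -- )
   s" key=value" s" =" split-at if
      type cr  type cr  type cr \ types the rest, the match, then the prefix
   then ;
demo-split
```

Because the rest of the string ends up on top, a caller can process the match and prefix and then loop on the remainder without further stack shuffling.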

>Coupling this to a full reg-exp package is tempting. However, what I've
>seen with Python users is that they use regexp's all the time, without
>any understanding of how it works. They rely on an online tool that shows
>them how to hack it until the next gotcha.

For the splitting task the benefit of having a regexp is that you can
e.g., use a sequence of chars, or one character from a set as the
in-between things. I expect that it would be very rare to use complex
regexps for that task.

In general, trying to do some sophisticated things with regexp
(especially things they are not very good at, such as complements of
patterns) leads to longish regexps that are hard to debug. And yet,
all the other options tend to be even more cumbersome, so we stick
with regexps even in cases where they obviously reach the limits of
their usability.

The whole problem is not helped by the fact that especially the more
sophisticated features have different syntax in different regexp
implementations.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Re: Updated String Parsing Words in kForth

<2022Mar18.092919@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17330&group=comp.lang.forth#17330

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Fri, 18 Mar 2022 08:29:19 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 53
Message-ID: <2022Mar18.092919@mips.complang.tuwien.ac.at>
References: <t0jcsp$9g4$1@dont-email.me> <2022Mar13.103712@mips.complang.tuwien.ac.at> <da284a67-9b52-4ecf-9013-ec56146c529fn@googlegroups.com> <t0l099$jkn$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="dde9b1e8841056cc64c19726ef7a3663";
logging-data="5049"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18iuoRfAGRe9LNk3Fnu6VMs"
Cancel-Lock: sha1:a3gWOEXwbNSjghKoVcNWUDQDChA=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Mar 2022 08:29 UTC

Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>On 3/13/22 07:47, Marcel Hendrix wrote:
>> On Sunday, March 13, 2022 at 10:52:01 AM UTC+1, Anton Ertl wrote:
>>> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
>>> |
>>> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
>>> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
>>> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
>>> |otherwise return false.
....
>A middle ground to using a regexp-based parser could be a word such as
>PARSE-STRING, described by
>
> PARSE-STRING ( caddr1 u1 xt-nexttoken xt-processtoken -- ntokens )
>
>where xt-nexttoken provides the execution semantics for obtaining the
>next token from the input string,
>
> caddr1 u1 -- caddr-rem urem caddr-token utoken
>
>and xt-processtoken provides the execution semantics for processing the
>token,
>
> caddr-token utoken --
>
>The number of tokens parsed and processed is returned by PARSE-STRING.

While there are some who use "token" to mean "lexeme", I make a
distinction: a token represents the lexeme and is typically implemented
as an integer or address, i.e., in Forth something cell-sized. Therefore
I find the descriptions above confusing.

Anyway, yes, using an xt provides a way to match arbitrary strings,
although not as conveniently as regexps. However, your xt-nexttoken puts
the additional burden on the xt to walk through the string; this can
be helpful for performance, but makes the xt even more burdensome to
write; interestingly, with regexps one can optimize the
string-scanning (not just matching) case without having to write
additional code (except to fend off unwanted matches).

Xt-processtoken means that PARSE-STRING is a wrapper while
SEARCH-REGEXP is designed such that one would have to write a loop
around it for the same effect. However, if not all parts have to be
treated the same (e.g., the fields of a line in a csv file), the
uniform treatment of the lexemes in the wrapper approach is not
helpful.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Re: Updated String Parsing Words in kForth

<2022Mar18.095852@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17331&group=comp.lang.forth#17331

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Fri, 18 Mar 2022 08:58:52 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 50
Message-ID: <2022Mar18.095852@mips.complang.tuwien.ac.at>
References: <t0jcsp$9g4$1@dont-email.me> <2022Mar13.103712@mips.complang.tuwien.ac.at> <t0lumk$8jf$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="dde9b1e8841056cc64c19726ef7a3663";
logging-data="5049"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+RBy+9pNJBID+Emse/VQb9"
Cancel-Lock: sha1:KfRsZlXyuTTvbRbJusrCCb7MMQw=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Mar 2022 08:58 UTC

Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>One example of a multicolumn field comes from the GitHub repo, H2SPEC, linked
>below. The Fortran program in this repo, h2spec.f, generates a
>multicolumn file, h2lines.dat, with each line providing info about a
>spectral line of the hydrogen molecule (in the "vacuum ultraviolet"
>region). An example line in this file, which contains a mixture of
>integer, floating point, and text data fields, is
>
>3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O
>
>The first column is an index number, col 2 is the wavelength of the
>spectral line in Angstroms, col 3 is the frequency of the line in units
>of cm^-1, col 4 is the transition rate in units of s^-1, cols 5--8 give
>the quantum numbers/labels for the electronic, vibration, rotation, and
>parity (Ortho/Para hydrogen) describing the lower level of transition,
>and cols 9--12 are the corresponding information for the upper level.
>
>The xt-nexttoken can validate the field type using a simple state
>machine as each token is parsed from the string.

But how would xt-processtoken look for this example?

Anyway, it seems to me that execute-parsing together with parse-name
would work well in this case:

\ incomplete sketch
: parse-level ( -- level )
parse-name str>... \ not sure what to do with that
parse-name str>int
parse-name str>int
parse-name str>int
make-level ;

: process-line ( -- nindex rwavelength rfrequency rtransitionrate lower upper )
parse-name str>int
parse-name str>float
parse-name str>float
parse-name str>float
parse-level
parse-level ;

s" 3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O"
' process-line execute-parsing
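
The conversion words STR>INT and STR>FLOAT are left undefined in the sketch above. One possible way to fill them in with standard words is shown below; these definitions are assumptions for illustration, not taken from any library. STR>INT handles only unsigned fields, both words assume BASE is decimal, and neither validates its input beyond what >FLOAT reports:

```forth
\ Hypothetical fillers for the conversion words used in the sketch above.
\ Assumes BASE is decimal and well-formed fields; no real validation.
: str>int ( c-addr u -- n )
   0 0 2swap >number 2drop drop ;   \ accumulate as a double, keep the low cell

: str>float ( c-addr u -- ) ( F: -- r )
   >float 0= abort" not a floating-point number" ;
```

A signed integer field would need an extra check for a leading "-" before calling >NUMBER.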

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2021: https://euro.theforth.net/2021

Re: Updated String Parsing Words in kForth

<t1244o$8f2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17338&group=comp.lang.forth#17338

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Fri, 18 Mar 2022 09:14:46 -0500
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <t1244o$8f2$1@dont-email.me>
References: <t0jcsp$9g4$1@dont-email.me>
<2022Mar13.103712@mips.complang.tuwien.ac.at>
<da284a67-9b52-4ecf-9013-ec56146c529fn@googlegroups.com>
<t0l099$jkn$1@dont-email.me> <2022Mar18.092919@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Mar 2022 14:14:49 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ccda8a585b354e01da84a93132435a9b";
logging-data="8674"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+PDFI9J2ZH0LmsBj3i3nwn"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:kvoWN+OodaJc95Xf+c8GbXGxlP8=
In-Reply-To: <2022Mar18.092919@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Krishna Myneni - Fri, 18 Mar 2022 14:14 UTC

On 3/18/22 03:29, Anton Ertl wrote:
> Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>> On 3/13/22 07:47, Marcel Hendrix wrote:
>>> On Sunday, March 13, 2022 at 10:52:01 AM UTC+1, Anton Ertl wrote:
>>>> |search-regexp ( c-a1 u1 c-a2 u2 -- c-a1 u3 c-a4 u4 c-a5 u5 true | false )
>>>> |
>>>> |Search for regexp c-a2 u2 in string c-a1 u1; if the regexp is found,
>>>> |c-a1 u3 is the substring before the first match, c-a4 u4 is the first
>>>> |match, and c-a5 u5 is the rest of the string, and the TOS is true;
>>>> |otherwise return false.
> ...
>> A middle ground to using a regexp-based parser could be a word such as
>> PARSE-STRING, described by
>>
>> PARSE-STRING ( caddr1 u1 xt-nexttoken xt-processtoken -- ntokens )
>>
>> where xt-nexttoken provides the execution semantics for obtaining the
>> next token from the input string,
>>
>> caddr1 u1 -- caddr-rem urem caddr-token utoken
>>
>> and xt-processtoken provides the execution semantics for processing the
>> token,
>>
>> caddr-token utoken --
>>
>> The number of tokens parsed and processed is returned by PARSE-STRING.
>
> While there are some who use "token" to mean "lexeme", I make a
> distinction: a token represents the lexeme and is typically implemented
> as an integer or address, i.e., in Forth something cell-sized. Therefore
> I find the descriptions above confusing.
>

"Token" is used in C programming, e.g. the description of strtok() is,

"The strtok() function breaks a string into a sequence of zero or
more nonempty tokens... Each call to strtok() returns a pointer to a
null-terminated string containing the next token."

Perhaps the more precise term is "lexeme", but it's clear that
interpretation of "token" should be based on the context.

> Anyway, yes, using an xt provides a way to match arbitrary strings,
> although not as conveniently as regexps. However, your xt-nexttoken puts
> the additional burden on the xt to walk through the string;

Yes.

> this can
> be helpful for performance, but makes the xt even more burdensome to
> write;

This was not the intent, but as I have not yet gotten around to writing
some examples to illustrate the PARSE-STRING approach, using
xt-nexttoken and xt-processtoken, you may be right. The idea was to
allow different approaches to walking through the string, including regexps.

> interestingly, with regexps one can optimize the
> string-scanning (not just matching) case without having to write
> additional code (except to fend off unwanted matches).
>
> Xt-processtoken means that PARSE-STRING is a wrapper while
> SEARCH-REGEXP is designed such that one would have to write a loop
> around it for the same effect. However, if not all parts have to be
> treated the same (e.g., the fields of a line in a csv file), the
> uniform treatment of the lexemes in the wrapper approach is not
> helpful.
>

Yes, if the fields are not uniform, separated by the same delimiter(s),
the burden is shifted to xt-nexttoken and xt-processtoken.

--
Krishna

Re: Updated String Parsing Words in kForth

<t125t2$agt$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17339&group=comp.lang.forth#17339

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Fri, 18 Mar 2022 09:44:48 -0500
Organization: A noiseless patient Spider
Lines: 84
Message-ID: <t125t2$agt$1@dont-email.me>
References: <t0jcsp$9g4$1@dont-email.me>
<2022Mar13.103712@mips.complang.tuwien.ac.at> <t0lumk$8jf$1@dont-email.me>
<2022Mar18.095852@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Mar 2022 14:44:50 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ccda8a585b354e01da84a93132435a9b";
logging-data="10781"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+IEVLF2ezQVVhOdIYkqUN"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:441gOLfO0s0RIRzIJWqajcL+QYs=
In-Reply-To: <2022Mar18.095852@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Krishna Myneni - Fri, 18 Mar 2022 14:44 UTC

On 3/18/22 03:58, Anton Ertl wrote:
> Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>> One example of a multicolumn field comes from the GitHub repo, H2SPEC, linked
>> below. The Fortran program in this repo, h2spec.f, generates a
>> multicolumn file, h2lines.dat, with each line providing info about a
>> spectral line of the hydrogen molecule (in the "vacuum ultraviolet"
>> region). An example line in this file, which contains a mixture of
>> integer, floating point, and text data fields, is
>>
>> 3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O
>>
>> The first column is an index number, col 2 is the wavelength of the
>> spectral line in Angstroms, col 3 is the frequency of the line in units
>> of cm^-1, col 4 is the transition rate in units of s^-1, cols 5--8 give
>> the quantum numbers/labels for the electronic, vibration, rotation, and
>> parity (Ortho/Para hydrogen) describing the lower level of transition,
>> and cols 9--12 are the corresponding information for the upper level.
>>
>> The xt-nexttoken can validate the field type using a simple state
>> machine as each token is parsed from the string.
>
> But how would xt-processtoken look for this example?
>

I'll work up an example using xt-nexttoken and xt-processtoken for this
case. Since the fields are limited by blank characters, code for
xt-nexttoken (without fields validation) should be fairly simple. For
this case I would make a column counter, used by both xt-nexttoken and
xt-processtoken. The column counter indexes an array which specifies the
field type and size, allowing xt-nexttoken to perform a check on the
parsed string (if desired) and informs xt-processtoken about how to
convert the string to the proper data format and possibly provides a
structure field offset for storing the data.

A concrete implementation of the above is needed for comparison with
other approaches. I will be interested to see how regexps might be
useful for field validation.

> Anyway, it seems to me that execute-parsing together with parse-name
> would work well in this case:
>
> \ incomplete sketch
> : parse-level ( -- level )
> parse-name str>... \ not sure what to do with that

Return the first non-blank char.

> parse-name str>int
> parse-name str>int
> parse-name str>int
> make-level ;
>
> : process-line ( -- nindex rwavelength rfrequency rtransitionrate lower upper )
> parse-name str>int
> parse-name str>float
> parse-name str>float
> parse-name str>float
> parse-level
> parse-level ;
>
> s" 3226 1589.199 62924.797 0.1125E+08 X 14 3 O B 7 4 O"
> ' process-line execute-parsing
>

Yes, with EXECUTE-PARSING the input string can be processed as you
describe above by PROCESS-LINE. The converted fields stored on the data
and fp stacks will have to be stored into an array of data structures.
With the PARSE-STRING approach, the parsing, string to field conversion,
and storage of the data is handled on a token by token basis by
xt-nexttoken and xt-processtoken, with the latter two tasks being
handled by xt-processtoken.

I have been preoccupied with the complementary problem of creating
output strings, from numeric fields, for output to a file. This was
prompted by your discussion of optimizing "#". The string parsing
problem is useful for reading text data files. Both data file output and
input are, of course, practical necessities for storing and distributing
data. Optimization in both of these areas is worthwhile.

--
Krishna

Re: Updated String Parsing Words in kForth

<t14quq$2hu$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17359&group=comp.lang.forth#17359

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Sat, 19 Mar 2022 09:56:25 -0500
Organization: A noiseless patient Spider
Lines: 205
Message-ID: <t14quq$2hu$1@dont-email.me>
References: <t0jcsp$9g4$1@dont-email.me>
<2022Mar13.103712@mips.complang.tuwien.ac.at> <t0lumk$8jf$1@dont-email.me>
<2022Mar18.095852@mips.complang.tuwien.ac.at> <t125t2$agt$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 19 Mar 2022 14:56:27 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="13238209891ef33ee2e48408d79e70f5";
logging-data="2622"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18IrbnzMC6LfEmCvZlNni6c"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:oIpsCUcRdSdO/2CHUh/zV5ZQyAo=
In-Reply-To: <t125t2$agt$1@dont-email.me>
Content-Language: en-US
 by: Krishna Myneni - Sat, 19 Mar 2022 14:56 UTC

On 3/18/22 09:44, Krishna Myneni wrote:
> On 3/18/22 03:58, Anton Ertl wrote:
>> Krishna Myneni <krishna.myneni@ccreweb.org> writes:
>>> One example of a multicolumn field comes from the GitHub repo, H2SPEC, linked
>>> below. The Fortran program in this repo, h2spec.f, generates a
>>> multicolumn file, h2lines.dat, with each line providing info about a
>>> spectral line of the hydrogen molecule (in the "vacuum ultraviolet"
>>> region). An example line in this file, which contains a mixture of
>>> integer, floating point, and text data fields, is
>>>
>>> 3226  1589.199  62924.797 0.1125E+08   X 14  3 O   B  7  4 O
>>>
>>> The first column is an index number, col 2 is the wavelength of the
>>> spectral line in Angstroms, col 3 is the frequency of the line in units
>>> of cm^-1, col 4 is the transition rate in units of s^-1, cols 5--8 give
>>> the quantum numbers/labels for the electronic, vibration, rotation, and
>>> parity (Ortho/Para hydrogen) describing the lower level of transition,
>>> and cols 9--12 are the corresponding information for the upper level.
>>>
>>> The xt-nexttoken can validate the field type using a simple state
>>> machine as each token is parsed from the string.
>>
>> But how would xt-processtoken look for this example?
>>
>
> I'll work up an example using xt-nexttoken and xt-processtoken for this
> case. Since the fields are limited by blank characters, code for
> xt-nexttoken (without fields validation) should be fairly simple. For
> this case I would make a column counter, used by both xt-nexttoken and
> xt-processtoken. The column counter indexes an array which specifies the
> field type and size, allowing xt-nexttoken to perform a check on the
> parsed string (if desired) and informs xt-processtoken about how to
> convert the string to the proper data format and possibly provides a
> structure field offset for storing the data.
>
> A concrete implementation of the above is needed for comparison with
> other approaches. I will be interested to see how regexps might be
> useful for field validation.
>

I have now added an example of my proposed string parsing approach using
PARSE-STRING and xt-nexttoken and xt-processtoken to read, parse, and
store the entire file h2lines.dat. To keep it simple, the example does
not currently do field validation. The program, parse-h2lines.4th, may
be viewed or downloaded at the following link:

https://github.com/mynenik/kForth-64/blob/master/forth-src/parse-h2lines.4th

It runs under kForth and, with a few changes to the include files, shown
below, also runs under Gforth. Gforth processes the entire file,
consisting of 3261 spectral lines, very quickly (in about 4 ms under
gforth-fast). kforth64 is quite sluggish in comparison, about 150 ms,
due, I think, to inefficient implementations of SCAN and SKIP.

The following output is generated when the program is loaded:

Under kforth64
---
Elapsed time = 148 ms
Number of lines = 3261

Selected Lines
1 1108.127 90242.352 3.060e+06 X 0 0 P B 0 1 P
10 991.382 100869.258 5.200e+07 X 0 0 P B 9 1 P
140 1085.145 92153.586 2.575e+07 X 0 4 P B 2 5 P
3261 1449.615 68983.812 1.727e+07 X 14 4 P B 13 5 P
ok
---

Under gforth-fast
---
Elapsed time = 4 ms
Number of lines = 3261

Selected Lines
1 1108.127 90242.352 3.060E6 X 0 0 P B 0 1 P
10 991.382 100869.258 5.200E7 X 0 0 P B 9 1 P
140 1085.145 92153.586 2.575E7 X 0 4 P B 2 5 P
3261 1449.615 68983.812 1.727E7 X 14 4 P B 13 5 P
ok
---

The line list file, h2lines.dat, must be present in the directory. If
you wish to try the example program under your Forth (2012-compatible)
system, you will need a copy of the line list file. The program is meant
to illustrate a general purpose approach to parsing strings into data
fields, with a real, non-trivial example. It may also be used to check
the efficiency of your Forth system's string parsing words.

I am interested in comparing this method with other approaches which
provide the same functionality as this example, and which would be
applicable for general purpose string-to-data field parsing.

The word PARSE-STRING is different from what I proposed originally. In
particular xt-processtoken has the following stack diagram,

xt-processtoken ( caddr u ufield -- )

with ufield being the current field number ( 0, 1, 2, ... ).
PARSE-STRING is now defined as,

: parse-string ( xt-nexttoken xt-processtoken caddr u -- ntokens )
0 >r
BEGIN
2over drop execute
WHILE
r@ 5 pick execute
1 rp@ +!
REPEAT
2drop 2drop 2drop
r> ;

This example uses an xt-nexttoken for obtaining the next
blank(s)-separated substring:

: next-line-field ( caddr1 u1 -- caddr2 u2 caddr3 u3 flag )
next-bs-token dup 0= invert ;

with NEXT-BS-TOKEN defined by

: next-bs-token ( caddr u -- arem urem atok utok )
bl skip 2dup bl scan 2>r r@ - 2r> 2swap ;

Note the order of the remaining and token strings returned on the stack.
The xt-nexttoken only adds the execution semantics of returning a flag
as well, for use by PARSE-STRING.

The xt-processtoken makes use of a table of field processors, which both
convert the token substring and store the converted field into a data
structure array,

: process-line-field ( caddr u ufield -- )
CELLS FieldProcessors + a@ execute ;

where FieldProcessors is the address of the table of individual field
processing xt's, each of which is trivially defined. Writing a basic
field processor is simple but can become complex if one wants to add
field validation as well. There is no requirement to do so, however.
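
The table-of-processors idea can be illustrated in standard Forth roughly as follows. This is a reduced sketch, not the code from parse-h2lines.4th: all names are hypothetical, only two columns are shown, the converted values go into plain variables instead of a data structure array, and @ is used where the definition above uses a@:

```forth
\ Standard-Forth sketch of xt-table dispatch; all names here are hypothetical.
variable last-int     \ stand-ins for the fields of a real data structure
variable last-char

: field>int  ( c-addr u -- )  0 0 2swap >number 2drop drop last-int ! ;
: field>char ( c-addr u -- )  drop c@ last-char ! ;

create field-table   ' field>int ,  ' field>char ,   \ one xt per column

: dispatch-field ( c-addr u ufield -- )
   cells field-table + @ execute ;   \ fetch the xt for this column and run it
```

Adding a column then means writing one more small processor and appending its xt to the table; the dispatcher itself does not change.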

The main loop for reading each line of text from the file and passing it
to PARSE-STRING is pretty simple. It contains

inpLine swap \ caddr u
['] next-line-field ['] process-line-field
2swap parse-string
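
For systems without SKIP, SCAN, or RP@ (none of which are standard words), the same loop can be sketched with standard words only. The names SKIP-CHAR, SCAN-CHAR, NEXT-FIELD, PARSE-FIELDS, SHOW-FIELD, and DEMO are hypothetical stand-ins chosen for this illustration; the field processor here merely prints each field with its index instead of converting and storing it:

```forth
\ Portable sketch; every name defined here is a hypothetical stand-in.
: skip-char ( c-addr u char -- c-addr' u' )   \ drop leading chars equal to char
   >r begin dup while over c@ r@ = while 1 /string repeat then r> drop ;
: scan-char ( c-addr u char -- c-addr' u' )   \ advance to first char equal to char
   >r begin dup while over c@ r@ <> while 1 /string repeat then r> drop ;

: next-field ( c-addr u -- arem urem atok utok flag )
   bl skip-char 2dup bl scan-char 2>r r@ - 2r> 2swap
   dup 0<> ;                     \ flag is false once no field is left

: parse-fields ( xt-next xt-process c-addr u -- ntokens )
   0 >r
   begin
      2over drop execute         \ xt-next xt-process arem urem atok utok flag
   while
      r@ 5 pick execute          \ call xt-process with ( atok utok ufield )
      r> 1+ >r                   \ count the field without RP@
   repeat
   2drop 2drop 2drop r> ;

: show-field ( c-addr u ufield -- )  . type cr ;

: demo ( -- ntokens )
   ['] next-field ['] show-field
   s" 3226  1589.199  62924.797 0.1125E+08   X 14  3 O   B  7  4 O"
   parse-fields ;
```

Running DEMO prints one line per field and leaves the field count on the stack, which is 12 for the sample line.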

--
Krishna

To use parse-h2lines.4th under Gforth, comment out the INCLUDE
statements at the top of the file and add

INCLUDE kforth-compat.fs

so that the top of the file reads:

\ include ans-words
\ include struct-200x
\ include strings
\ include files
include kforth-compat.fs

>
>> Anyway, it seems to me that execute-parsing together with parse-name
>> would work well in this case:
>>
>> \ incomplete sketch
>> : parse-level ( -- level )
>>    parse-name str>... \ not sure what to do with that
>
> Return the first non-blank char.
>
>>    parse-name str>int
>>    parse-name str>int
>>    parse-name str>int
>>    make-level ;
>>
>> : process-line ( -- nindex rwavelength rfrequency rtransitionrate
>> lower upper )
>>    parse-name str>int
>>    parse-name str>float
>>    parse-name str>float
>>    parse-name str>float
>>    parse-level
>>    parse-level ;
>>
>> s" 3226  1589.199  62924.797 0.1125E+08   X 14  3 O   B  7  4 O"
>> ' process-line execute-parsing
>>

Re: Updated String Parsing Words in kForth

<t155th$9eu$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17361&group=comp.lang.forth#17361

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: krishna....@ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Subject: Re: Updated String Parsing Words in kForth
Date: Sat, 19 Mar 2022 13:03:27 -0500
Organization: A noiseless patient Spider
Lines: 53
Message-ID: <t155th$9eu$1@dont-email.me>
References: <t0jcsp$9g4$1@dont-email.me>
<2022Mar13.103712@mips.complang.tuwien.ac.at> <t0lumk$8jf$1@dont-email.me>
<2022Mar18.095852@mips.complang.tuwien.ac.at> <t125t2$agt$1@dont-email.me>
<t14quq$2hu$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Mar 2022 18:03:29 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="13238209891ef33ee2e48408d79e70f5";
logging-data="9694"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/I44mxr6qyt8fg9Q9L+HqN"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.0
Cancel-Lock: sha1:y6X0bLqiHkhWlbKgZ7Yy4a3M0XU=
In-Reply-To: <t14quq$2hu$1@dont-email.me>
Content-Language: en-US
 by: Krishna Myneni - Sat, 19 Mar 2022 18:03 UTC

On 3/19/22 09:56, Krishna Myneni wrote:
> ...
> I have now added an example of my proposed string parsing approach using
> PARSE-STRING and xt-nexttoken and xt-processtoken to read, parse, and
> store the entire file h2lines.dat. To keep it simple, the example does
> not currently do field validation. The program, parse-h2lines.4th, may
> be viewed or downloaded at the following link:
>
> https://github.com/mynenik/kForth-64/blob/master/forth-src/parse-h2lines.4th
>
....
> The line list file, h2lines.dat, must be present in the directory. If
> you wish to try the example program under your Forth (2012-compatible)
> system, you will need a copy of the line list file. ...

If you download the H2SPEC files, you may build h2spec with the
following command, under Linux:

$ gfortran -o h2spec h2spec.f

Then, run h2spec to generate the line list, using input parameters shown
below:
---
$ ./h2spec

Compute the H2 spectrum with these conditions:

Rotational temperature (K) >> 300
Lower wavenumber (cm-1) >> 60000
Upper wavenumber (cm-1) >> 90000
FWHM (cm-1) >> 10
$
---
This will generate two files: the line list, h2lines.dat, and a spectrum
file, h2vuv.dat. Although the parameters entered into the program are
only relevant for the spectrum file, the design of the program does not
allow generating only the line list.

---
$ ls -l h2lines.dat
-rw-rw-r--. 1 krishna krishna 202182 Mar 19 12:52 h2lines.dat
$ md5sum h2lines.dat
dbb81b16961d0a71e28fece40129def2 h2lines.dat
---

--
Krishna
