Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  newsreader  groups  login

Message-ID:  

It's easy to get on the internet and forget you have a life -- Topic on #LinuxGER


devel / comp.compilers / Re: What does it mean to "move characters" in the lexer?

SubjectAuthor
* What does it mean to "move characters" in the lexer?Roger L Costello
+* Re: What does it mean to "move characters" in the lexer?gah4
|`* Re: What does it mean to "move characters" in the lexer?Christopher F Clark
| `* Re: What does it mean to "move characters" in the lexer?Kaz Kylheku
|  `- Re: What does it mean to "move characters" in the lexer?Thomas Koenig
`- Re: What does it mean to "move characters" in the lexer?Kaz Kylheku

1
What does it mean to "move characters" in the lexer?

<22-06-057@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=464&group=comp.compilers#464

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: coste...@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: What does it mean to "move characters" in the lexer?
Date: Tue, 21 Jun 2022 10:27:15 +0000
Organization: Compilers Central
Lines: 29
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-057@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="85436"; mail-complaints-to="abuse@iecc.com"
Keywords: history, comment
Posted-Date: 21 Jun 2022 12:25:09 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Thread-Topic: What does it mean to "move characters" in the lexer?
Thread-Index: AdiFWBix4QF9p6qWTPmZjnkljZpiHA==
Accept-Language: en-US
Content-Language: en-US
 by: Roger L Costello - Tue, 21 Jun 2022 10:27 UTC

Hi Folks,

Page 89 of the dragon book says:

Because a large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the amount of overhead to
process an input character.

Page 102 of "A Retargetable C Compiler: Design and Implementation" says:

The lexical analyzer's main activity is moving characters, so minimizing the
amount of character movement helps increase speed.

And on page 103 it says:

An important consequence of this design is that most of the input characters
are accessed by *cp and many characters are never moved. Only identifiers
(excluding keywords) and string literals that appear in executable code are
copied out of the buffer into permanent storage.

I don't understand what they mean by "moving characters". Do they mean copying
characters? Do they mean reading characters from a file into memory? Would you
explain what this "character movement" thing is all about, please?

/Roger
[Keeping in mind that this was written in the 1970s, they mean copying strings
of characters from one place to another. On a PDP-11. With 64K bytes of memory.
It is still true that character processing in a lexer consumes a large fraction
of the time in compilers. -John]

Re: What does it mean to "move characters" in the lexer?

<22-06-058@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=465&group=comp.compilers#465

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: What does it mean to "move characters" in the lexer?
Date: Tue, 21 Jun 2022 10:30:58 -0700 (PDT)
Organization: Compilers Central
Lines: 67
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-058@comp.compilers>
References: <AdiFWBix4QF9p6qWTPmZjnkljZpiHA==> <22-06-057@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="37104"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance, comment
Posted-Date: 21 Jun 2022 15:49:51 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-06-057@comp.compilers>
 by: gah4 - Tue, 21 Jun 2022 17:30 UTC

On Tuesday, June 21, 2022 at 9:25:12 AM UTC-7, Roger L Costello wrote:

(snip)
> Because a large amount of time can be consumed moving characters, specialized
> buffering techniques have been developed to reduce the amount of overhead to
> process an input character.

(snip)
> I don't understand what they mean by "moving characters". Do they mean copying
> characters? Do they mean reading characters from a file into memory? Would you
> explain what this "character movement" thing is all about, please?

Yes it is copying, and yes it can take a lot of the time.

On many systems, the disk controller reads the data into its own buffer,
and then the OS copies the data from the controller buffer into its buffer.

Then when the user does an I/O (input) request, the data is copied
into the program's own buffer, and finally into the place where the data
actually goes. So maybe four copies.

Early in the days of TCP/IP there was trailer encapsulation.
(I never saw it used, but some have the ability to turn it on.)

If you follow the ISO seven letter model, or even if you don't.

The program gives data to TCP, which divides it up into
packets to send. Each of those gets a TCP header.
It is then passed to IP where IP puts its header on.
And then before sending, it gets an Ethernet header.

Since there is often something before the buffer, but
the buffer might not be full, so there might be space at
the end, there was trailer encapsulation. Instead of
putting the TCP and IP header on the beginning, you
put them on the end! Less copying!

I believe people found other ways to reduce copying, though.

The I/O hardware for IBM S/360 copies data directly
from the I/O device into memory. (Memory was expensive!)
Also, it is blocked on disk the same as it is for the user,
unlike most systems now. It would be usual, though,
for the last copy -- from the I/O buffer to/from the actual
data area -- to be an actual copy. IBM has locate mode
I/O to eliminate that one. For locate mode, instead of
copying, the program gets a pointer to the actual buffer.
(That works in assembly and PL/I, C hadn't been invented.)

For write, you request the address of the output buffer,
operate on the data there, and then request it be written.

There has been much work over the years on reducing
the amount of data copying, or operations needed to
copy it. For byte addressed machines, to copy data
a whole word at a time. (Depending on alignment.)

There are also search algorithms like Boyer-Moore,
to search strings without looking at every character.
[Now you can usually ask operating systems to map a file into your
process so there is no extra copying at all, the disk reads a block
into a page frame and you address the data directly in that page
frame. In a program called grepcidr that does grep-like searches for
IP address strings, for large files it got somewhat faster when I
switched from stdio to mapping the whole file in and treating it as
one big string. This is pretty remote from compilers, though. tl;dr
less copying is faster. -John]

Re: What does it mean to "move characters" in the lexer?

<22-06-064@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=469&group=comp.compilers#469

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: What does it mean to "move characters" in the lexer?
Date: Wed, 22 Jun 2022 00:44:05 +0300
Organization: Compilers Central
Lines: 20
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-064@comp.compilers>
References: <22-06-057@comp.compilers> <22-06-058@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="90727"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 21 Jun 2022 19:25:33 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Tue, 21 Jun 2022 21:44 UTC

While worrying about copying characters around in compilers isn't given
much thought these days, it is very relevant to people implementing
networking software and also those doing hardware accelerators and their
device drivers. The startup I'm working with these days, spends a lot of
time worrying about zero-copy abstractions, i.e. how to avoid moving data
around. Of course, that doesn't surprise me as we are building hardware
accelerators and lots of the staff has a networking background and our
accelerators communicate with each other over network connections or shared
memory, but the less the data moves, the faster throughput we get with less
energy usage and usually with less hardware too.

--
******************************************************************************

Chris Clark email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: What does it mean to "move characters" in the lexer?

<22-06-065@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=470&group=comp.compilers#470

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: What does it mean to "move characters" in the lexer?
Date: Wed, 22 Jun 2022 01:05:52 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 28
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-065@comp.compilers>
References: <22-06-057@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="24266"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 21 Jun 2022 21:47:45 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Wed, 22 Jun 2022 01:05 UTC

On 2022-06-21, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> Page 89 of the dragon book says:
>
> Because a large amount of time can be consumed moving characters, specialized
> buffering techniques have been developed to reduce the amount of overhead to
> process an input character.

It's not clear what exactly this is referring to, but probably just to
the practice of making multiple copies of the same data in the
processing stack. If we put an imaginary tracer on a single character,
we may see it hopping among multiple buffers. In the Chaper 2 lexical
analyzer, getchar is used to read a character; getchar fills a buffer in
the stdio stream, and the program is sucking it out from there one
character at a time. So to build a lexeme, it has to have its own buffer
for the lexeme, which is another copy of the data.

The technique described in chapter 3 allows bulk reads into a buffer,
eliminating the stream library. The tokens are delimited right inside
the buffer, reducing a copy.

[The technique seems closely related to the "flip buffer",
which can be found in places like TTY implementations of Unix
kernels. Linux had one; there was once a "struct tty_flip_buffer".
At a quick glance, it looks like that today there are nmore than two
buffers used which are allocated and freed on the fly. There is still a
<linux/tty_flip.h> which contains a modicum of "flip" terminology,]

Re: What does it mean to "move characters" in the lexer?

<22-06-066@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=471&group=comp.compilers#471

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: 480-992-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.compilers
Subject: Re: What does it mean to "move characters" in the lexer?
Date: Wed, 22 Jun 2022 01:13:50 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-066@comp.compilers>
References: <22-06-057@comp.compilers> <22-06-058@comp.compilers> <22-06-064@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="24599"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 21 Jun 2022 21:48:10 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Kaz Kylheku - Wed, 22 Jun 2022 01:13 UTC

On 2022-06-21, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> While worrying about copying characters around in compilers isn't given
> much thought these days, it is very relevant to people implementing
> networking software and also those doing hardware accelerators and their
> device drivers.

Don't write off buffering optimization'sn as yesterday's game just yet;
e.g. this could still be relevant to someone needing to parse gigabytes
of JSON or XML. Or maybe even just megabytes.

I remember reading some article some years ago whereby some Javascript
programmer discovered it was faster to read JSON from a file using
dedicated JSON routines available in Javascript, than to declare the
same syntax in the Javascript program as a literal and let it be
scanned along with the program and available to it that way.

(I realize there may be other reasons for the performance difference,
because JSON isn't Javascript, but likely part of it was that the JS
implementation didn't care about being efficient for large amounts of
data: the "hot spot" isn't going to be the scanning stage that takes
place oncem before the program has a chance to execute any sort of
loop.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Re: What does it mean to "move characters" in the lexer?

<22-06-071@comp.compilers>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=475&group=comp.compilers#475

  copy link   Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.compilers
Subject: Re: What does it mean to "move characters" in the lexer?
Date: Wed, 22 Jun 2022 11:45:22 -0000 (UTC)
Organization: news.netcologne.de
Lines: 24
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-071@comp.compilers>
References: <22-06-057@comp.compilers> <22-06-058@comp.compilers> <22-06-064@comp.compilers> <22-06-066@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="77766"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, performance, parallel
Posted-Date: 22 Jun 2022 10:01:48 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Thomas Koenig - Wed, 22 Jun 2022 11:45 UTC

Kaz Kylheku <480-992-1380@kylheku.com> schrieb:

> I remember reading some article some years ago whereby some Javascript
> programmer discovered it was faster to read JSON from a file using
> dedicated JSON routines available in Javascript, than to declare the
> same syntax in the Javascript program as a literal and let it be
> scanned along with the program and available to it that way.

This came up on comp.arch recently.

There is an insanely fast JSON parser ad UTF-8 validator based
on SIMD to be found at https://github.com/simdjson/simdjson .
They select a different length of vector according to
the CPU version they find. The algorithm is described at
https://arxiv.org/pdf/1902.08318.pdf. It
heavily relies on special-casing for JSON and for the SIMD
instructions that are available.

A general SIMD-based parser generator is likely to be even harder
to write and will probably not outperform the package above (nor,
for that case, a traditional character-at-a-time approach).

Is there research on this?

1
server_pubkey.txt

rocksolid light 0.9.8
clearnet tor