Message-ID:

The idea of male and female are universal constants. -- Kirk, "Metamorphosis", stardate 3219.8

devel / comp.compilers / RE: Integer sizes and DFAs

RE: Integer sizes and DFAs

<22-03-073@comp.compilers>

https://www.novabbs.com/devel/article-flat.php?id=350&group=comp.compilers#350

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: RE: Integer sizes and DFAs
Date: Sun, 27 Mar 2022 00:54:35 +0200
Organization: Compilers Central
Lines: 52
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-073@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="67075"; mail-complaints-to="abuse@iecc.com"
Keywords: design, performance
Posted-Date: 26 Mar 2022 19:42:51 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

by: Christopher F Clark - Sat, 26 Mar 2022 22:54 UTC

Kaz, I don't know whether you simply didn't read what I wrote or chose
to ignore it.

> In certain common situations we model elementary operations as being
> O(1), with certain justifications. That simplifies the analysis, while
> leaving its results still relevant and true for practical applications
> running on present hardware.

And, my point was 2**32 is large enough to be considered arbitrarily large with
respect to most DFAs. Not quite the human genome, see extended analysis
below. Here was my first analysis.

me > Note how that even works for the DNA DFAs, assuming 32 bit integers
me > and a byte addressed machine. With that model, you can fit 10 million
me > (1E7) states in the 2 billion byte (2GB) address space (each state has
me > 4, 4 byte entries or 16 bytes).

> Each 32 bit state has 4 byte entries only if it doesn't express that
> it branches to other states for certain input symbols.

A state isn't 32 bits, a transition is. A transition (in a DFA) is
just the address of another state. A state in a DFA for DNA has 4 such
transitions (one for A, T, C, and G--the 4 DNA bases). Those are the
only legal symbols in DNA. So a state is 4 * 4 bytes, each transition
is 4 bytes and there are 4 of them per state.

Thus, I explained what the justification is and how that allows more
than 100 million states in the combined DFA. That's without any fancy
tricks.

Now, I just looked up the size of the human genome. it is 3 billion,
so that's a little more than another order or magnitude bigger, so you
definitely need slightly bigger integers (or to do some compression,
although 64 bit integers are overkill, but convenient to work with and
with those you could easily fit the genome in an 128 GB SSD, which
would be addressible by 64 bit integer/pointers). And, for most other
applications even 1 million states are more than sufficient. Thus,
saying a transition can happen in O(1) time is a perfectly reasonable
assumption. You can layout nearly any DFA we expect to see in memory
and directly address it. Thus, the RAM model holds.

By the way with 40 bit integers, you could fit it in 60GB (20 bytes
per state and 3G states), We don't quite have laptops with that much
DRAM memory yet.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Re: Integer sizes and DFAs

<22-03-074@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=351&group=comp.compilers#351

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: Integer sizes and DFAs
Date: Sat, 26 Mar 2022 19:32:17 -0700 (PDT)
Organization: Compilers Central
Lines: 25
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-074@comp.compilers>
References: <22-03-073@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="97762"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 26 Mar 2022 22:39:33 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-073@comp.compilers>

by: gah4 - Sun, 27 Mar 2022 02:32 UTC

On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

(snip)

> And, my point was 2**32 is large enough to be considered arbitrarily large with
> respect to most DFAs. Not quite the human genome, see extended analysis
> below. Here was my first analysis.

About 24 years ago I was working with a DNA sequencing group, and was
interested in speeding up this problem. The one I was most interested in
was special purpose hardware with many of the largest DRAM I could find,
arranged just to do this operation.

(Note that you need one more bit, to indicate when a match is found.)

There would be logic to read data off disk, and pass it directly to the DFA
array. There is, then, logic to store the offset into the disk file, and the
state at which the hit occured, to be read out later.

But we went onto other projects, and I never got to build one.

Since then, DRAM has gotten much larger, but so has the DNA database.

Yes the human genome is 3 gigabase, but the whole of GenBank is
now about 16 terabase, including WGS (whole genome sequences).

Re: Integer sizes and DFAs

<22-03-075@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=352&group=comp.compilers#352

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah...@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: Integer sizes and DFAs
Date: Sat, 26 Mar 2022 19:45:48 -0700 (PDT)
Organization: Compilers Central
Lines: 17
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-075@comp.compilers>
References: <22-03-073@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="45749"; mail-complaints-to="abuse@iecc.com"
Keywords: performance, comment
Posted-Date: 27 Mar 2022 12:17:34 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-073@comp.compilers>

by: gah4 - Sun, 27 Mar 2022 02:45 UTC

On Saturday, March 26, 2022 at 4:42:55 PM UTC-7, Christopher F Clark wrote:

(snip)

> Now, I just looked up the size of the human genome. it is 3 billion,
> so that's a little more than another order or magnitude bigger, so you
> definitely need slightly bigger integers

Note also that there are some larger genomes, such as the Japanese
flower, Paris japonica at about 150 terabase.

https://en.wikipedia.org/wiki/Paris_japonica

We do like to think that we are the most special species, but it seems
that in genome size, we aren't the winner.
[Unless we have have some new way to apply DFAs to genomes, this seems to
be wandering away from our toptic. -John]

RE: Integer sizes and DFAs

<22-03-076@comp.compilers>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=353&group=comp.compilers#353

copy link Newsgroups: comp.compilers

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christop...@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: RE: Integer sizes and DFAs
Date: Sun, 27 Mar 2022 15:02:16 +0300
Organization: Compilers Central
Lines: 73
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-076@comp.compilers>
References: <22-03-073@comp.compilers> <22-03-074@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="46571"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 27 Mar 2022 12:20:00 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com

by: Christopher F Clark - Sun, 27 Mar 2022 12:02 UTC

@gah4: Ok, The state space for the whole GenBank is larger. And, if
we ever get to the point where we can deal with epigenomic factors, we
probably need extra coding space for that too. But, I still think it
is likely that we will be in the range where we can make "disks" (or
"disk arrays") large enough for them and use 64 bit integers to
address them.

And, wandering off into the space of hardware implementations. At
Intel we were considering doing that circa 2010, so a bit later than
you were. In particular, we were looking to see if we could adapt our
technology to do the "FASTA" algorithm. Like you, we never brought
that to market.

But, I want to make a couple of slight points on the encoding of transitions.

If one is using byte addressing and one always puts the states on
aligned boundaries, one has a few low order bits in the address that
will always be 0. You can tweak one of those to be your "match
complete/final" bit and then mask that off when using it to address
the next state. Alternately, one keeps a separate list of "final
states". That is more common in implementations. And, the third choice
is to have a state header with extra data, like whether the state is
final or not. The last two alternatives expand your memory footprint
somewhat, but by a relatively minimal amount.

Now, if you don't use byte addressing, but use "word" addressing (a la
most computers in the 1960s), you actually have shrunk your address
requirement. You've moved those 0 bits from the bottom of the address
to the top. Even more so, if you use "state" addressing. Of course, if
your hardware has byte addressable memory, you may have to convert
those addresses into byte address before going to actual memory. But
all of that doesn't involve extra memory lookups, just some small
amount of integer/bit arithmetic which on modern computers is
essentially irrelevant and gets swamped by pipeline stalls waiting for
the cache.

Per your idea of keeping them on disk, you can "split" a 64 bit
address into a "page address" and an "address within a page" and use
the paging mechanism to bring in the relevant pages. I believe some
"huge page" implementations actually do that to work well with TLBs
and caches. Anyway, what you proposed is not far fetched.

Beyond that there are lots of tricks one can play. We thought for a
while of Huffman encoding the addresses and using relative pointers.
If you are doing something like the Aho-Corasick algorithm, pointers
that "match" and lead further on in the machine, so they are always
positive numbers, while "failure" pointers are the ones that point
backward and those do so to a limited number of states, where absolute
addressing makes sense. There is more you can do along those lines,
but those are the basic steps.

Now, in actual genes there are "network expressions" and "spacers".
The network expressions can be segments that are repeated, and that
leads to some extra backward links, but the number of them is also
small (and those are relative and not failure links).

---------

Still, none of this impacts the fact, that when doing analysis, you
can still treat those memory references as O(1) and say your lexer is
linear in terms of bytes processed per second (and that having more
states doesn't impact that), because fundamentally reading from the
memory is the bottleneck, and we are always talking fixed memory
images (RAM or "disks", that is it is a FINITE state machine) and not
reading from a Turing Machine tape.

Subject	Author
RE: Integer sizes and DFAs	Christopher F Clark
Re: Integer sizes and DFAs	gah4
RE: Integer sizes and DFAs	Christopher F Clark
Re: Integer sizes and DFAs	gah4