devel / comp.lang.c / Re: How to optimize this search?

SubjectAuthor
* How to optimize this search?Thiago Adams
+* Re: How to optimize this search?Thiago Adams
|`* Re: How to optimize this search?Scott Lurndal
| +* Re: How to optimize this search?Thiago Adams
| |+- Re: How to optimize this search?Thiago Adams
| |+- Re: How to optimize this search?Scott Lurndal
| |`* Re: How to optimize this search?Anton Shepelev
| | `* Re: How to optimize this search?Thiago Adams
| |  `* Re: How to optimize this search?Anton Shepelev
| |   `* Re: How to optimize this search?Ben Bacarisse
| |    `* Re: How to optimize this search?Anton Shepelev
| |     +* Re: How to optimize this search?Thiago Adams
| |     |`- Re: How to optimize this search?Anton Shepelev
| |     `* Re: How to optimize this search?Ben Bacarisse
| |      `* Re: How to optimize this search?Anton Shepelev
| |       `* Re: How to optimize this search?Anton Shepelev
| |        `* Re: How to optimize this search?Ben Bacarisse
| |         `* Re: How to optimize this search?Anton Shepelev
| |          `* Re: How to optimize this search?Ben Bacarisse
| |           `* Re: How to optimize this search?Anton Shepelev
| |            `- Re: How to optimize this search?bart c
| +* Re: How to optimize this search?Thiago Adams
| |`* Re: How to optimize this search?Scott Lurndal
| | `* Re: How to optimize this search?bart c
| |  +* Re: How to optimize this search?Keith Thompson
| |  |`* Re: How to optimize this search?bart c
| |  | `* Re: How to optimize this search?Ben Bacarisse
| |  |  `- Re: How to optimize this search?bart c
| |  +* Re: How to optimize this search?antispam
| |  |`* Re: How to optimize this search?bart c
| |  | `- Re: How to optimize this search?antispam
| |  `* Re: How to optimize this search?David Brown
| |   `* Re: How to optimize this search?bart c
| |    `- Re: How to optimize this search?David Brown
| `- Re: How to optimize this search?antispam
+* Re: How to optimize this search?Ben Bacarisse
|`* Re: How to optimize this search?Thiago Adams
| `- Re: How to optimize this search?Ben Bacarisse
+* Re: How to optimize this search?Siri Cruise
|+* Re: How to optimize this search?Ben Bacarisse
||`* Re: How to optimize this search?Siri Cruise
|| `- Re: How to optimize this search?Ben Bacarisse
|+- Re: How to optimize this search?Ben Bacarisse
|`* Re: How to optimize this search?Thiago Adams
| `* Re: How to optimize this search?antispam
|  `* Re: How to optimize this search?Thiago Adams
|   +- Re: How to optimize this search?Öö Tiib
|   `* Re: How to optimize this search?antispam
|    `* Re: How to optimize this search?Thiago Adams
|     +- Re: How to optimize this search?Stefan Ram
|     `* Re: How to optimize this search?Thiago Adams
|      +* Re: How to optimize this search?bart c
|      |`* Re: How to optimize this search?Thiago Adams
|      | +- Re: How to optimize this search?Keith Thompson
|      | `* Re: How to optimize this search?bart c
|      |  `- Re: How to optimize this search?Thiago Adams
|      `* Re: How to optimize this search?David Brown
|       `- Re: How to optimize this search?William Ahern
+* Re: How to optimize this search?Bonita Montero
|+- Re: How to optimize this search?Ben Bacarisse
|+- Re: How to optimize this search?Bonita Montero
|`* Re: How to optimize this search?Thiago Adams
| `* Re: How to optimize this search?Bonita Montero
|  `* Re: How to optimize this search?Thiago Adams
|   `* Re: How to optimize this search?Bonita Montero
|    `* Re: How to optimize this search?Thiago Adams
|     +- Re: How to optimize this search?Bonita Montero
|     `- Re: How to optimize this search?Anton Shepelev
+* Re: How to optimize this search?bart c
|`* Re: How to optimize this search?Thiago Adams
| `* Re: How to optimize this search?antispam
|  `- Re: How to optimize this search?bart c
+* Re: How to optimize this search?Tim Rentsch
|`* Re: How to optimize this search?Thiago Adams
| `* Re: How to optimize this search?bart c
|  `* Re: How to optimize this search?Keith Thompson
|   +- Re: How to optimize this search?Thiago Adams
|   `* Re: How to optimize this search?bart c
|    +* Re: How to optimize this search?Thiago Adams
|    |`- Re: How to optimize this search?Stefan Ram
|    `* Re: How to optimize this search?Keith Thompson
|     +* Re: How to optimize this search?bart c
|     |`* Re: How to optimize this search?Keith Thompson
|     | `- Re: How to optimize this search?bart c
|     `* Re: How to optimize this search?antispam
|      +- Re: How to optimize this search?Keith Thompson
|      `* Re: How to optimize this search?David Brown
|       +- Re: How to optimize this search?bart c
|       `- Re: How to optimize this search?antispam
`* Re: How to optimize this search?Ben Bacarisse
 +* Re: How to optimize this search?Thiago Adams
 |+- Re: How to optimize this search?Thiago Adams
 |+* Re: How to optimize this search?Ben Bacarisse
 ||+* Re: How to optimize this search?Thiago Adams
 |||`- Re: How to optimize this search?Ben Bacarisse
 ||`* Re: How to optimize this search?Thiago Adams
 || `- Re: How to optimize this search?Ben Bacarisse
 |+- Re: How to optimize this search?bart c
 |`* Re: How to optimize this search?Anton Shepelev
 | `* Re: How to optimize this search?Thiago Adams
 |  `* Re: How to optimize this search?Thiago Adams
 +* Re: How to optimize this search?bart c
 `* Re: How to optimize this search?Anton Shepelev

Re: How to optimize this search?

<fe9c8d16-e900-4be5-a0f3-af33b4f13219n@googlegroups.com>

Newsgroups: comp.lang.c
 by: Thiago Adams - Mon, 22 Aug 2022 17:47 UTC

On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
> Thiago Adams <thiago...@gmail.com> writes:
> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> >
> Use a binary search. Max comparisons for your 70+ entries would
> be 6.

One alternative is to create a hash: given the first char, it returns the low/high
indexes into a sorted array. Then we do a normal binary search only within this reduced range.

enum token_type is_keyword(const char* text)
{
    static struct keyword_pair keywords[] =
    {
        /*0*/ {"NULL", TK_KEYWORD_NULL},
        /*code removed*/
        /*23*/{ "_asm", TK_KEYWORD__ASM},
        /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
        /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
        /*26*/{ "auto", TK_KEYWORD_AUTO},
        /*27*/{ "bool", TK_KEYWORD__BOOL},
        /*28*/{ "break", TK_KEYWORD_BREAK},
        /*code removed*/
    };

    /*hash*/
    static struct low_high {
        int low, high;
    } map[] = {
        ['a'] = {.low = 24, .high = 26},
        ['b'] = {.low = 27, .high = 28},
        /*etc*/
    };

    if (text[0] >= 'N' && text[0] <= 'w')
    {
        return binary_search_str(keywords,
                                 map[text[0]].low,
                                 map[text[0]].high,
                                 text);
    }
    return TK_NONE;
}
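
For reference, a minimal sketch of the binary_search_str helper referenced above (its body is not shown in the thread, so this signature is an assumption; it reuses struct keyword_pair, enum token_type and TK_NONE from the post):

#include <string.h>

/* Searches keywords[low..high] (sorted by lexeme) for text; returns the
   token on a hit, TK_NONE on a miss. */
static enum token_type binary_search_str(const struct keyword_pair* keywords,
                                         int low, int high, const char* text)
{
    while (low <= high) {
        int mid = low + (high - low) / 2;
        int cmp = strcmp(text, keywords[mid].lexeme);
        if (cmp == 0)
            return keywords[mid].token;
        if (cmp < 0)
            high = mid - 1;
        else
            low = mid + 1;
    }
    return TK_NONE;
}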

Re: How to optimize this search?

<55c55211-6e7e-4aa7-a9e6-3e7b11bab898n@googlegroups.com>

Newsgroups: comp.lang.c
 by: Thiago Adams - Mon, 22 Aug 2022 17:53 UTC

On Monday, August 22, 2022 at 2:20:09 PM UTC-3, bart c wrote:
> On Monday, 22 August 2022 at 15:17:29 UTC+1, Thiago Adams wrote:
> > Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> >
> >
> > struct keyword_pair{
> > const char* lexeme;
> > int token;
> > };
> >
> > struct keyword_pair[] = {
> > { "NULL", 0},
> > { "_Alignas", 1},
> > { "_Atomic", 2},
> ...
> > { "volatile", 73},
> > { "while", 74},
> > };
> I suspect other things are going on. Presumably you are looking up an identifier encountered in source code to see if it's a keyword, but if not, another lookup is needed to see if it matches a user-identifier in a symbol table.
>
> My approach usually involves a hash table (the symbol table), containing both keywords and user identifiers. Each entry contains information on whether it is a keyword name (and its token value), or user identifier.

Interesting, but I am not sure whether this makes the code more complicated, since you can
no longer be sure a token is a "keyword" without checking a symbol table.

Re: How to optimize this search?

<2EPMK.792333$zgr9.428150@fx13.iad>

Newsgroups: comp.lang.c
 by: Scott Lurndal - Mon, 22 Aug 2022 18:10 UTC

Thiago Adams <thiago.adams@gmail.com> writes:
>On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
>> Thiago Adams <thiago...@gmail.com> writes:
>> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
>> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
>> >
>> Use a binary search. Max comparisons for your 70+ entries would
>> be 6.
>
>One alternative is create a hash given the first char it return the low/high
>index on a sorted array. Then we do normal binary search only at this reduced range.
>
>enum token_type is_keyword(const char* text)
>{
> static struct keyword_pair keywords[] =
> {
> /*0*/ {"NULL", TK_KEYWORD_NULL},
> /*code removed*/
> /*23*/{ "_asm", TK_KEYWORD__ASM},
> /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
> /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
> /*26*/{ "auto", TK_KEYWORD_AUTO},
> /*27*/{ "bool", TK_KEYWORD__BOOL},
> /*28*/{ "break", TK_KEYWORD_BREAK},
> /*code removed*/
> };
>
> /*hash*/
> static struct low_high {
> int low, high;
> } map[] = {
> ['a'] = {.low = 24, .high = 26},
> ['b'] = {.low = 27, .high = 28},
> /*etc*/
> };
> if (text[0] >= 'N' && text[0] <= 'w')
> {
> return binary_search_str(keywords,
> map[text[0]].low,
> map[text[0]].high,
> text);
> }
> return TK_NONE;
>}

While these experiments are enjoyable, in the real world
keep it simple and use a linear search until the
number of elements exceeds some number of table
comparisons, where that number is derived from the
clock speed.

Given a 3GHz processor with an IPC
of 1.5, you are looking at 4.5 billion instructions
per second. If a comparison loop required 10 instructions
per entry (for round numbers, it's likely less) (assuming
an inline string compare function that aborts on the first
mismatched character), a 2000-entry table linear lookup miss would
require 20k instructions, and you could execute a quarter
million lookups that miss completely per second. That's
the worst case.

The median hit cost would be perhaps half that assuming a
uniform access distribution.

Sorting the initial list by expected frequency of use rather
than alphabetically will reduce the average hit cost substantially
for straight linear searches. Our mainframe compilers back at
Burroughs (where we had a 'search table' instruction in the hardware)
would order the keyword tables by expected frequency of use
(measured from many customer and internal applications).

Personally, unless the table has more than a few hundred entries, I just
use a simple linear table search, and for keyword tables, the
keywords are ordered by expected frequency of use.
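
A minimal sketch of that keep-it-simple approach, reusing the thread's struct keyword_pair and token enum; the frequency ordering and the TK_KEYWORD_INT/IF/RETURN names here are illustrative guesses, not measured data:

#include <string.h>

/* Keyword table ordered by (guessed) decreasing frequency of use. */
static const struct keyword_pair keywords_by_freq[] = {
    { "int",    TK_KEYWORD_INT },
    { "if",     TK_KEYWORD_IF },
    { "return", TK_KEYWORD_RETURN },
    /* ... remaining keywords, least frequent last ... */
};

enum token_type is_keyword_linear(const char* text)
{
    size_t n = sizeof keywords_by_freq / sizeof keywords_by_freq[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(text, keywords_by_freq[i].lexeme) == 0)
            return keywords_by_freq[i].token;
    return TK_NONE;
}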

Re: How to optimize this search?

<te0j6n$12g1$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Mon, 22 Aug 2022 18:51 UTC

Scott Lurndal <scott@slp53.sl.home> wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> >
>
> Use a binary search. Max comparisons for your 70+ entries would
> be 6.

Around 2004-2005 Frank Heckenbach did some optimization of
search for record fields in Gnu Pascal. His measurements
indicated that below 50 fields linear search was faster,
and above that binary search won. Basically, in a linear search
there is 1 mispredicted branch and a lot of predictable ones.
In a binary search about half of the branches are likely to
be mispredicted. Assuming that a mispredicted branch costs
about 10 predicted ones leads to a similar break-even
point. Of course machines have changed and branch
predictors are better now, but it is not clear to me
whether improvements to branch predictors help in the case
of binary search.

Of course there is a little catch: in Gnu Pascal search
was done after symbol table lookup so all comparisons
were pointer comparisons. Apparently Thiago works
with strings, which, if compared naively, need several character
comparisons (with their own chances for mispredicted
branches). But if the length is limited one can pack the
whole string, or its first part, into a machine word and do
the comparison as a single instruction. Only in the case of a
match with a long entry does one need an extra check to
verify the rest of the match.

--
Waldek Hebisch
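
A minimal sketch of the word-packing idea, in C; pack8 is a made-up helper name, and keywords longer than 8 characters would still need a tail comparison:

#include <stdint.h>

/* Pack up to the first 8 characters of a string into one integer so that
   two short keywords can be compared with a single machine-word compare. */
static uint64_t pack8(const char* s)
{
    uint64_t w = 0;
    unsigned char* p = (unsigned char*)&w;
    for (int i = 0; i < 8 && s[i]; i++)
        p[i] = (unsigned char)s[i];
    return w;
}

/* Example: "while" matches only if the packed prefixes are equal; both
   sides are packed the same way, so byte order does not matter. */
static int is_while(const char* text)
{
    return pack8(text) == pack8("while");
}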

Re: How to optimize this search?

<te0lgm$7fh$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Mon, 22 Aug 2022 19:30 UTC

Thiago Adams <thiago.adams@gmail.com> wrote:
> On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > In article
> > <d8294420-7b45-4344...@googlegroups.com>,
> > Thiago Adams <thiago...@gmail.com> wrote:
> >
> > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > the corresponding keyword_pair.
> > You can build a Finite State Automaton.
>
> Yes.. I am thinking exactly like that. But at some point
> we need to map like char -> state then it is the same
> switch problem. How to given char how to find the state in a fast way?

You need a map (state, char) -> state, that is, from the pair (state, char) to a new
state. Easily done with a two-dimensional array:

int s = 1; /* could use symbolic name here */
const unsigned char *cp = (const unsigned char *)text;
unsigned char c;
while (s && (c = *cp++)) {
    s = transition[s][c];
}

After the loop, test if c == 0: if yes, all of the input string was used; if no,
it was rejected before the end. If c == 0, then test if s is an accepting state (a single
table lookup): if yes, that gives the result; if no, it means the keyword was not found.

In your case you will have a few hundred cases, so the table will be
largish. Traditionally the table is compressed by the so-called comb
vector method. Namely, most entries are "empty", that is they
do not correspond to valid transitions (encoded by 0 above).
In the comb vector method we put transitions in a one-dimensional
array, with some offset for each state. Offsets are chosen
so that _valid_ transitions land in different places. This
can be done in a greedy way: use offset zero for the first state.
For state n start from offset zero and shift it until you find a
place with no overlap with the previous vectors. Repeat as long
as needed.

The comb vector method needs two additional arrays: one for
the offsets (indexed by state) and another one to check whether a
transition corresponds to the current state.

With compressed tables the automaton needs more instructions
per transition. OTOH smaller tables mean fewer cache
misses, so it may win despite executing more instructions.

--
Waldek Hebisch
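
A minimal sketch of the comb-vector lookup described above; base, next_tab and check are hypothetical arrays that a table generator would produce offline, 0 still encodes "no transition", and bounds handling is omitted:

extern const int base[];     /* offset chosen for each state */
extern const int next_tab[]; /* packed valid transitions */
extern const int check[];    /* which state owns each packed slot */

/* One compressed transition: follow it only if the slot really
   belongs to state s. */
static int step(int s, unsigned char c)
{
    int i = base[s] + c;
    return check[i] == s ? next_tab[i] : 0;
}

/* Same driving loop as the uncompressed version above. */
static int run(const char *text)
{
    int s = 1;   /* initial state */
    unsigned char c;
    while (s && (c = (unsigned char)*text++) != 0)
        s = step(s, c);
    return s;    /* caller then consults the accepting-state table */
}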

Re: How to optimize this search?

<1ad0af3a-453f-4802-a64e-7dc2480b3018n@googlegroups.com>

Newsgroups: comp.lang.c
 by: Thiago Adams - Mon, 22 Aug 2022 21:19 UTC

On Monday, August 22, 2022 at 4:30:45 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> Thiago Adams <thiago...@gmail.com> wrote:
> > On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > > In article
> > > <d8294420-7b45-4344...@googlegroups.com>,
> > > Thiago Adams <thiago...@gmail.com> wrote:
> > >
> > > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > > the corresponding keyword_pair.
> > > You can build a Finite State Automaton.
> >
> > Yes.. I am thinking exactly like that. But at some point
> > we need to map like char -> state then it is the same
> > switch problem. How to given char how to find the state in a fast way?
> You need map (state, char) -> state, that is pair state, char to new
> state. Easily done with two dimensional array:
>
> int s = 1; /* could use symbolic name here */
> unsigned char cp;
> while (s && (c = *cp++)) {
> s = transition[s][c];
> }

In 2013 I implemented chapter 3 of the dragon book (https://github.com/thradams/tklgen/tree/master/tklgen)
...but instead of doing it in C I did it in C++, and instead of creating tables I output switches and ifs..
it is not easy to hack the code now. :(
The output was switch + if because it was easy to read and debug.
But it is not as easy as strcmp, for instance. Maybe there is some site to generate these tables..
then I can decide if the performance improvements compensate..

Re: How to optimize this search?

<56df86c2-aa62-4524-a00b-b8cee6275432n@googlegroups.com>

Newsgroups: comp.lang.c
 by: Öö Tiib - Mon, 22 Aug 2022 21:36 UTC

On Tuesday, 23 August 2022 at 00:19:41 UTC+3, Thiago Adams wrote:
> On Monday, August 22, 2022 at 4:30:45 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> > Thiago Adams <thiago...@gmail.com> wrote:
> > > On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > > > In article
> > > > <d8294420-7b45-4344...@googlegroups.com>,
> > > > Thiago Adams <thiago...@gmail.com> wrote:
> > > >
> > > > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > > > the corresponding keyword_pair.
> > > > You can build a Finite State Automaton.
> > >
> > > Yes.. I am thinking exactly like that. But at some point
> > > we need to map like char -> state then it is the same
> > > switch problem. How to given char how to find the state in a fast way?
> > You need map (state, char) -> state, that is pair state, char to new
> > state. Easily done with two dimensional array:
> >
> > int s = 1; /* could use symbolic name here */
> > unsigned char cp;
> > while (s && (c = *cp++)) {
> > s = transition[s][c];
> > }
> In 2013 I implemented chapter 3 of dragon book (https://github.com/thradams/tklgen/tree/master/tklgen)
> ..but instead of doing in C I did in C++ and instead of creating tables I output switch and ifs..
> it is not easy to hack the code now. :(
> The ouput was switch + if because it was easy to read and debug.
> But it not ease as strcmp for instance. Maybe there is some site to generate these tables..
> then I can decide if the performance improvements compensate..

Why not use GNU gperf to generate code for that problem?
<https://www.gnu.org/software/gperf/>
Or at least to get code to benchmark whatever other solution you have against?
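
For illustration, a minimal sketch of what a gperf input file for the thread's keyword table could look like (tokens.h and the lookup-function name are assumptions; the %define lines adapt gperf to the struct field names used in the thread):

%{
/* Copied verbatim into the generated C file; tokens.h is a hypothetical
   header defining the TK_* values used in the thread. */
#include "tokens.h"
#include <string.h>
%}
%struct-type
%define slot-name lexeme
%define lookup-function-name keyword_lookup
struct keyword_pair { const char *lexeme; int token; };
%%
NULL, TK_KEYWORD_NULL
auto, TK_KEYWORD_AUTO
bool, TK_KEYWORD__BOOL
break, TK_KEYWORD_BREAK
while, TK_KEYWORD_WHILE
%%

Running gperf on this produces a perfect-hash lookup function returning a struct keyword_pair pointer, which can then be benchmarked against the hand-written alternatives.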

Re: How to optimize this search?

<2ea7d68c-d15f-4896-97f5-b852174ca081n@googlegroups.com>

Newsgroups: comp.lang.c
 by: bart c - Mon, 22 Aug 2022 22:22 UTC

On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
> Thiago Adams writes:
> >On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
> >> Thiago Adams writes:
> >> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> >> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> >> >
> >> Use a binary search. Max comparisons for your 70+ entries would
> >> be 6.
> >
> >One alternative is create a hash given the first char it return the low/high
> >index on a sorted array. Then we do normal binary search only at this reduced range.
> >
> >enum token_type is_keyword(const char* text)
> >{
> > static struct keyword_pair keywords[] =
> > {
> > /*0*/ {"NULL", TK_KEYWORD_NULL},
> > /*code removed*/
> > /*23*/{ "_asm", TK_KEYWORD__ASM},
> > /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
> > /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
> > /*26*/{ "auto", TK_KEYWORD_AUTO},
> > /*27*/{ "bool", TK_KEYWORD__BOOL},
> > /*28*/{ "break", TK_KEYWORD_BREAK},
> > /*code removed*/
> > };
> >
> > /*hash*/
> > static struct low_high {
> > int low, high;
> > } map[] = {
> > ['a'] = {.low = 24, .high = 26},
> > ['b'] = {.low = 27, .high = 28},
> > /*etc*/
> > };
> > if (text[0] >= 'N' && text[0] <= 'w')
> > {
> > return binary_search_str(keywords,
> > map[text[0]].low,
> > map[text[0]].high,
> > text);
> > }
> > return TK_NONE;
> >}
> While these experiments are enjoyable, in the real world
> keep it simple and use a linear search until the
> number of elements exceeds some number of table
> comparisons, where that number is derived from the
> clock speed.

Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?

The OP is posting /because/ they want something faster than a linear search.

But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.

Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughput reduced by 2-4 times. This is for some 250 keywords in arbitrary order.

Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.

A linear search might suffice for a toy or experimental product working on small inputs.

Re: How to optimize this search?

<875yijzyyh.fsf@nosuchdomain.example.com>

Newsgroups: comp.lang.c
 by: Keith Thompson - Mon, 22 Aug 2022 22:37 UTC

bart c <bart4858@gmail.com> writes:
> On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
>> Thiago Adams writes:
>> >On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
>> >> Thiago Adams writes:
>> >> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
>> >> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
>> >> >
>> >> Use a binary search. Max comparisons for your 70+ entries would
>> >> be 6.
>> >
>> >One alternative is create a hash given the first char it return the low/high
>> >index on a sorted array. Then we do normal binary search only at this reduced range.
>> >
>> >enum token_type is_keyword(const char* text)
>> >{
>> > static struct keyword_pair keywords[] =
>> > {
>> > /*0*/ {"NULL", TK_KEYWORD_NULL},
>> > /*code removed*/
>> > /*23*/{ "_asm", TK_KEYWORD__ASM},
>> > /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
>> > /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
>> > /*26*/{ "auto", TK_KEYWORD_AUTO},
>> > /*27*/{ "bool", TK_KEYWORD__BOOL},
>> > /*28*/{ "break", TK_KEYWORD_BREAK},
>> > /*code removed*/
>> > };
>> >
>> > /*hash*/
>> > static struct low_high {
>> > int low, high;
>> > } map[] = {
>> > ['a'] = {.low = 24, .high = 26},
>> > ['b'] = {.low = 27, .high = 28},
>> > /*etc*/
>> > };
>> > if (text[0] >= 'N' && text[0] <= 'w')
>> > {
>> > return binary_search_str(keywords,
>> > map[text[0]].low,
>> > map[text[0]].high,
>> > text);
>> > }
>> > return TK_NONE;
>> >}
>> While these experiments are enjoyable, in the real world
>> keep it simple and use a linear search until the
>> number of elements exceeds some number of table
>> comparisons, where that number is derived from the
>> clock speed.
>
> Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
>
> The OP is posting /because/ they want something faster than a linear search.
>
> But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
>
> Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
>
> Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
>
> A linear search might suffice for a toy or experimental product working on small inputs.

So with that change, your compiler spent 50-75% of its time running
linear searches on a list of 250 keywords?

That's surprising. I can't help wondering if there was some additional
inefficiency in the code. Would you care to share a snippet?

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: How to optimize this search?

<te116n$uop$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Mon, 22 Aug 2022 22:49 UTC

Thiago Adams <thiago.adams@gmail.com> wrote:
> On Monday, August 22, 2022 at 4:30:45 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> > Thiago Adams <thiago...@gmail.com> wrote:
> > > On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > > > In article
> > > > <d8294420-7b45-4344...@googlegroups.com>,
> > > > Thiago Adams <thiago...@gmail.com> wrote:
> > > >
> > > > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > > > the corresponding keyword_pair.
> > > > You can build a Finite State Automaton.
> > >
> > > Yes.. I am thinking exactly like that. But at some point
> > > we need to map like char -> state then it is the same
> > > switch problem. How to given char how to find the state in a fast way?
> > You need map (state, char) -> state, that is pair state, char to new
> > state. Easily done with two dimensional array:
> >
> > int s = 1; /* could use symbolic name here */
> > unsigned char cp;
> > while (s && (c = *cp++)) {
> > s = transition[s][c];
> > }
>
> In 2013 I implemented chapter 3 of dragon book (https://github.com/thradams/tklgen/tree/master/tklgen)
> ..but instead of doing in C I did in C++ and instead of creating tables I output switch and ifs..
> it is not easy to hack the code now. :(
> The ouput was switch + if because it was easy to read and debug.
> But it not ease as strcmp for instance. Maybe there is some site to generate these tables..

Well, flex will generate tables for you (actually, the whole scanner,
but if you understand the principle you can use flex tables with your
own scanner code). However, if your "language" is a set of constant
strings, then generating tables is very easy: each prefix of your
strings gives you a state (if two strings share the same prefix you
get only one state). To generate the transition tables, for each
string assign a final state. After that drop the final letters one
by one. Generate a state for the obtained prefix and a transition
from this state to the state corresponding to the longer string. For
example, if you have strings "abc" and "ac", you first create a
(final) state for "abc"; dropping the last letter we get "ab", so
we add a state for it and a transition "ab", 'c' -> "abc", then
we drop "b" getting "a" and add a transition to "ab", then we
drop "a" getting the empty string, so we add a state for the empty
string and a transition to "a". We handle "ac" in a similar way.
However, we need to notice that we already created the state
for "a", so we just add a new transition. After that we
are done with "ac" because the prefix "a" was handled earlier.
In practice it may be convenient to create the state for the empty
string at the start: it is the initial state, and by creating it first
we can ensure that it gets a fixed number (say 1). The "generate
state" part is really simple: first we check if the string
already appeared as a prefix (say using a linear search
over already generated prefixes); if yes we use its
state (number), if no we use the value from a counter and
increase it.

The comb vector technique requires a bit more work. However,
note that many states will have only one valid transition,
so the table will clearly compress quite well (we can put a single
transition in any free spot).

> then I can decide if the performance improvements compensate..

Well, I am saying that generating the finite automaton is not
hard. I am not saying that it will give the best performance.
Maybe, maybe not. If you have something compiler-like
in mind, then following bartc's advice may be the best way.

--
Waldek Hebisch
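
A minimal sketch of that construction, using a plain uncompressed transition table (one state per distinct prefix); the names and sizes are made up, there is no overflow check, and this uncompressed table is what the comb-vector method above would then compress:

#include <limits.h>

enum { MAX_STATES = 1024, ALPHABET = UCHAR_MAX + 1 };

static int transition[MAX_STATES][ALPHABET]; /* 0 = no transition */
static int accepting[MAX_STATES];            /* 0 = not accepting, else token + 1 */
static int nstates = 1;                      /* state 1 is the initial state */

/* Add one keyword, creating a new state for each prefix not seen before. */
static void add_keyword(const char *s, int token)
{
    int state = 1;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (transition[state][c] == 0)
            transition[state][c] = ++nstates;
        state = transition[state][c];
    }
    accepting[state] = token + 1;
}

/* Run the automaton; returns the token, or -1 if text is not a keyword. */
static int lookup_keyword(const char *text)
{
    int state = 1;
    for (; state && *text; text++)
        state = transition[state][(unsigned char)*text];
    return (state && accepting[state]) ? accepting[state] - 1 : -1;
}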

Re: How to optimize this search?

<893f061a-1684-4e2c-80ac-caeba29a4445n@googlegroups.com>

Newsgroups: comp.lang.c
 by: bart c - Mon, 22 Aug 2022 23:28 UTC

On Monday, 22 August 2022 at 23:38:13 UTC+1, Keith Thompson wrote:
> bart c writes:

> > Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
> >
> > The OP is posting /because/ they want something faster than a linear search.
> >
> > But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
> >
> > Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
> >
> > Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
> >
> > A linear search might suffice for a toy or experimental product working on small inputs.
> So with that change, your compiler spent 50-75% of its time running
> linear searches on a list of 250 keywords?
>
> That's surprising. I can't help wondering if there was some additional
> inefficiency in the code. Would you care to share a snippet?

Yes, the inefficiency is doing a linear search at a place which is a bottleneck! Processing the next alphanumeric token. Surely that's obvious?

It is otherwise a pretty fast compiler. Here however is the normal code (sorry can't guarantee the formatting under Googlegroups):

lookup(lxsvalue, lxsptr-lxsvalue, hashw(hsum))

This is done after a series of alphanumeric characters has been identified as a token; it does a lookup via a hashtable which contains all keywords, reserved words, and user-identifiers (only one generic version of each unique identifier; multiple instances of those are put into a linked list of duplicate names; that /is/ searched linearly when they need resolving).

This is the code that was added just before that to do this experiment (clearly, not C):

length:=lxsptr-lxsvalue
memcpy(&str, lxsvalue, length)
str[length+1]:=0
for i to nkeywords do
    if strcmp(str, keywords[i].name)=0 then
        d:=keywords[i]
        nextlx.symptr:=d
        nextlx.symbol:=d.symbol
        nextlx.subcode:=d.subcode
        return
    fi
od

lookup(lxsvalue, lxsptr-lxsvalue, hashw(hsum))
return

keywords[] is a compact linear table populated from the hashtable after the keywords have been added, but before any user identifiers.

Re: How to optimize this search?

<te13ks$1mtg$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Mon, 22 Aug 2022 23:31 UTC

Thiago Adams <thiago.adams@gmail.com> wrote:
> On Monday, August 22, 2022 at 2:20:09 PM UTC-3, bart c wrote:
> > On Monday, 22 August 2022 at 15:17:29 UTC+1, Thiago Adams wrote:
> > > Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> > >
> > >
> > > struct keyword_pair{
> > > const char* lexeme;
> > > int token;
> > > };
> > >
> > > struct keyword_pair[] = {
> > > { "NULL", 0},
> > > { "_Alignas", 1},
> > > { "_Atomic", 2},
> > ...
> > > { "volatile", 73},
> > > { "while", 74},
> > > };
> > I suspect other things are going on. Presumably you are looking up an identifier encountered in source code to see if it's a keyword, but if not, another lookup is needed to see if it matches a user-identifier in a symbol table.
> >
> > My approach usually involves a hash table (the symbol table), containing both keywords and user identifiers. Each entry contains information on whether it is a keyword name (and its token value), or user identifier.
>
> Interesting, but not sure if this make code more complicated not having
> a "keyword" token for sure without check a symbol table.

In a compiler you almost surely need a symbol table. And the most natural
place to do the symbol table lookup is in the lexical analyser. In fact,
if you want simple code it makes sense to put almost everything in the
symbol table. For example, when you see "+" the symbol table tells you
its special syntactic property, namely precedence (priority).

Logically you could even put numbers in the symbol table, but I see
no gain from doing this. But for other things your scanner will
be simpler: once you know which characters form your token it
is a single table lookup to get all the semantic info.

BTW: All compiler texts that I have seen introduce IMO
unnecessary distinctions from the very beginning. Namely,
for parsing what matters are syntactic properties;
keywords without a syntactic role are no different from
variables or functions. After parsing, the distinction between
variables, functions, operators and keywords depends
on the purpose. A naive compiler may simply "generate
semantics". Of course, declarations are somewhat
special because they mainly change compiler state. But
other things also may change compiler state, and
declarations may generate code. Particularly unhelpful
is the distinction between operators and functions.
In most respects they are the same, and the real difference
is in what is expanded inline and what is handled by
function calls. Anyway, sorry for the rant, but take
into account that handling things that look different
by a single mechanism usually leads to a simpler compiler.

--
Waldek Hebisch
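
As a small illustration of the "put almost everything in the symbol table" idea, one entry type can carry the syntactic properties the parser needs, so keywords, operators and identifiers are all looked up the same way (field names here are assumptions, not from the thread):

enum sym_kind { SYM_KEYWORD, SYM_OPERATOR, SYM_IDENTIFIER };

struct sym_entry {
    const char *name;
    enum sym_kind kind;
    int token;        /* e.g. TK_KEYWORD_WHILE for keywords */
    int precedence;   /* meaningful for operators such as "+" */
};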

Re: How to optimize this search?

<cf695e1c-5e69-498f-a047-7dfb64fae2f2n@googlegroups.com>

Newsgroups: comp.lang.c
 by: bart c - Tue, 23 Aug 2022 00:04 UTC

On Tuesday, 23 August 2022 at 00:31:54 UTC+1, anti... wrote:
> Thiago Adams wrote:
> > On Monday, August 22, 2022 at 2:20:09 PM UTC-3, bart c wrote:
> > > On Monday, 22 August 2022 at 15:17:29 UTC+1, Thiago Adams wrote:
> > > > Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> > > >
> > > >
> > > > struct keyword_pair{
> > > > const char* lexeme;
> > > > int token;
> > > > };
> > > >
> > > > struct keyword_pair[] = {
> > > > { "NULL", 0},
> > > > { "_Alignas", 1},
> > > > { "_Atomic", 2},
> > > ...
> > > > { "volatile", 73},
> > > > { "while", 74},
> > > > };
> > > I suspect other things are going on. Presumably you are looking up an identifier encountered in source code to see if it's a keyword, but if not, another lookup is needed to see if it matches a user-identifier in a symbol table.
> > >
> > > My approach usually involves a hash table (the symbol table), containing both keywords and user identifiers. Each entry contains information on whether it is a keyword name (and its token value), or user identifier.
> >
> > Interesting, but not sure if this make code more complicated not having
> > a "keyword" token for sure without check a symbol table.
> In compiler you almost surely need symbol table. And most natural
> place to do symbol table lookup is in lexical analyser. In fact,
> if you want simple code it makes sense to put almost everthing in
> symbol table. For example when you see "+" symbol table tells you
> that this special syntactic property, namely precedence (priority).

I wouldn't go as far as putting punctuation into a symbol table. Once you've identified the boundaries of the token, you're pretty much already there in identifying what it is. (Perhaps with user-defined symbols like '+++' it might start to be more worthwhile, as the compiler won't already know what they are.)

> Logically you could even put numbers in symbol table, but I see
> no gain from doing this.

That's what I thought too. But this may have merit, if you want to collate all instances of the same literal, especially with string literals and to a lesser extent with float literals (which usually can't be immediate values, but if 34.8 occurs 50 times, you don't want 50 copies of it).

Re: How to optimize this search?

<te15o8$aka$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Tue, 23 Aug 2022 00:07 UTC

bart c <bart4858@gmail.com> wrote:
> On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
> > Thiago Adams writes:
> > >On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
> > >> Thiago Adams writes:
> > >> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> > >> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> > >> >
> > >> Use a binary search. Max comparisons for your 70+ entries would
> > >> be 6.
> > >
> > >One alternative is create a hash given the first char it return the low/high
> > >index on a sorted array. Then we do normal binary search only at this reduced range.
> > >
> > >enum token_type is_keyword(const char* text)
> > >{
> > > static struct keyword_pair keywords[] =
> > > {
> > > /*0*/ {"NULL", TK_KEYWORD_NULL},
> > > /*code removed*/
> > > /*23*/{ "_asm", TK_KEYWORD__ASM},
> > > /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
> > > /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
> > > /*26*/{ "auto", TK_KEYWORD_AUTO},
> > > /*27*/{ "bool", TK_KEYWORD__BOOL},
> > > /*28*/{ "break", TK_KEYWORD_BREAK},
> > > /*code removed*/
> > > };
> > >
> > > /*hash*/
> > > static struct low_high {
> > > int low, high;
> > > } map[] = {
> > > ['a'] = {.low = 24, .high = 26},
> > > ['b'] = {.low = 27, .high = 28},
> > > /*etc*/
> > > };
> > > if (text[0] >= 'N' && text[0] <= 'w')
> > > {
> > > return binary_search_str(keywords,
> > > map[text[0]].low,
> > > map[text[0]].high,
> > > text);
> > > }
> > > return TK_NONE;
> > >}
> > While these experiments are enjoyable, in the real world
> > keep it simple and use a linear search until the
> > number of elements exceeds some number of table
> > comparisons, where that number is derived from the
> > clock speed.
>
> Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
>
> The OP is posting /because/ they want something faster than a linear search.
>
> But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
>
> Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
>

With the reasonable assumption of one operator per line, linear search is
doing _more_ searches than in Scott's worst-case estimate (but one
would expect better from such a short list).

Your compiler is highly tuned for compilation speed. If you did
the same to gcc you probably would be unable to measure the speed
difference. OTOH if Thiago wants high compilation speed, then
following your advice is reasonable.

> Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
>
> A linear search might suffice for a toy or experimental product working on small inputs.

Actually there are real compilers using linear search. Some are considered
to be very fast: they are much faster than gcc...

Concerning code size: in a non-toy compiler I want a dynamically growing
symbol table, so that the compiler uses a tiny amount of memory on small
programs but can handle very big ones. A fast resizable hash table is
likely to be much bigger than 100 lines (but less than 500 lines
is reasonable).

--
Waldek Hebisch

Re: How to optimize this search?

<87fshnvmlv.fsf@bsb.me.uk>

Newsgroups: comp.lang.c
 by: Ben Bacarisse - Tue, 23 Aug 2022 00:18 UTC

bart c <bart4858@gmail.com> writes:

> On Monday, 22 August 2022 at 23:38:13 UTC+1, Keith Thompson wrote:
>> bart c writes:
>
>> > Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
>> >
>> > The OP is posting /because/ they want something faster than a linear search.
>> >
>> > But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
>> >
>> > Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
>> >
>> > Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
>> >
>> > A linear search might suffice for a toy or experimental product working on small inputs.
>> So with that change, your compiler spent 50-75% of its time running
>> linear searches on a list of 250 keywords?
>>
>> That's surprising. I can't help wondering if there was some additional
>> inefficiency in the code. Would you care to share a snippet?
>
> Yes, the inefficiency is doing a linear search at a place which is a
> bottleneck! Processing the next alphanumeric token. Surely that's
> obvious?
>
> It is otherwise a pretty fast compiler. Here however is the normal
> code (sorry can't guarantee the formatting under Googlegroups):
>
> lookup(lxsvalue, lxsptr-lxsvalue, hashw(hsum))
>
> This is done after a series of alphanumeric characters has been
> identified as a token; it does a lookup via a hashtable which contains
> all keywords, reserved words, and user-identifiers (only one generic
> version of each unique identifier; multiple instances of those are put
> into a linked list of duplicate names; that /is/ searched linearly
> when they need resolving).
>
> This is the code that was added just before that to do this
> experiment (clearly, not C):
>
> length:=lxsptr-lxsvalue
> memcpy(&str, lxsvalue, length)
> str[length+1]:=0

So you need to copy the string?

> for i to nkeywords do
> if strcmp(str, keywords[i].name)=0 then

In C, one would try to avoid a copy and use strncmp here.

> d:=keywords[i]
> nextlx.symptr:=d
> nextlx.symbol:=d.symbol
> nextlx.subcode:=d.subcode
> return
> fi
> od
>
> lookup(lxsvalue, lxsptr-lxsvalue, hashw(hsum))
> return

--
Ben.

Re: How to optimize this search?

<293baa09-8f00-4089-a020-258441551c40n@googlegroups.com>

Newsgroups: comp.lang.c
 by: bart c - Tue, 23 Aug 2022 01:01 UTC

On Tuesday, 23 August 2022 at 01:18:34 UTC+1, Ben Bacarisse wrote:
> bart c writes:
>
> > On Monday, 22 August 2022 at 23:38:13 UTC+1, Keith Thompson wrote:
> >> bart c writes:
> >
> >> > Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
> >> >
> >> > The OP is posting /because/ they want something faster than a linear search.
> >> >
> >> > But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
> >> >
> >> > Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
> >> >
> >> > Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
> >> >
> >> > A linear search might suffice for a toy or experimental product working on small inputs.
> >> So with that change, your compiler spent 50-75% of its time running
> >> linear searches on a list of 250 keywords?
> >>
> >> That's surprising. I can't help wondering if there was some additional
> >> inefficiency in the code. Would you care to share a snippet?
> >
> > Yes, the inefficiency is doing a linear search at a place which is a
> > bottleneck! Processing the next alphanumeric token. Surely that's
> > obvious?
> >
> > It is otherwise a pretty fast compiler. Here however is the normal
> > code (sorry can't guarantee the formatting under Googlegroups):
> >
> > lookup(lxsvalue, lxsptr-lxsvalue, hashw(hsum))
> >
> > This is done after a series of alphanumeric characters has been
> > identified as a token; it does a lookup via a hashtable which contains
> > all keywords, reserved words, and user-identifiers (only one generic
> > version of each unique identifier; multiple instances of those are put
> > into a linked list of duplicate names; that /is/ searched linearly
> > when they need resolving).
> >
> > This is the code that was added just before that to do this
> > experiment (clearly, not C):
> >
> > length:=lxsptr-lxsvalue
> > memcpy(&str, lxsvalue, length)
> > str[length+1]:=0

> So you need to copy the string?

Well, it's not zero-terminated in memory (still pointing into the source at this point).

> > for i to nkeywords do
> > if strcmp(str, keywords[i].name)=0 then

> In C, one would try to avoid a copy and use strncmp here.

OK, if I do that (it doesn't work unless I check the lengths match first), then strncmp or memcmp both make it significantly faster, although still quite a bit slower than normal (surprising, as I've never previously noticed much difference).

(1.5 seconds to parse 740Kloc, compared with 2.25 using strcmp, but normal speed is 0.3 seconds.)
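
The length check bart mentions can be folded into the comparison itself; a minimal C sketch of comparing a length-delimited (not NUL-terminated) token against a keyword without copying:

#include <string.h>

/* True if tok[0..len) equals the whole keyword kw. The kw[len] == '\0'
   test is what makes the lengths match, so "do" does not match "double". */
static int token_equals(const char* tok, size_t len, const char* kw)
{
    return strncmp(tok, kw, len) == 0 && kw[len] == '\0';
}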

Re: How to optimize this search?

<9befe5fc-292d-4739-9b29-dccbfeb359b2n@googlegroups.com>

Newsgroups: comp.lang.c
 by: bart c - Tue, 23 Aug 2022 01:12 UTC

On Tuesday, 23 August 2022 at 01:08:00 UTC+1, anti... wrote:
> bart c wrote:
> > On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
> > > Thiago Adams writes:
> > > >On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
> > > >> Thiago Adams writes:
> > > >> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> > > >> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> > > >> >
> > > >> Use a binary search. Max comparisons for your 70+ entries would
> > > >> be 6.
> > > >
> > > >One alternative is create a hash given the first char it return the low/high
> > > >index on a sorted array. Then we do normal binary search only at this reduced range.
> > > >
> > > >enum token_type is_keyword(const char* text)
> > > >{
> > > > static struct keyword_pair keywords[] =
> > > > {
> > > > /*0*/ {"NULL", TK_KEYWORD_NULL},
> > > > /*code removed*/
> > > > /*23*/{ "_asm", TK_KEYWORD__ASM},
> > > > /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
> > > > /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
> > > > /*26*/{ "auto", TK_KEYWORD_AUTO},
> > > > /*27*/{ "bool", TK_KEYWORD__BOOL},
> > > > /*28*/{ "break", TK_KEYWORD_BREAK},
> > > > /*code removed*/
> > > > };
> > > >
> > > > /*hash*/
> > > > static struct low_high {
> > > > int low, high;
> > > > } map[] = {
> > > > ['a'] = {.low = 24, .high = 26},
> > > > ['b'] = {.low = 27, .high = 28},
> > > > /*etc*/
> > > > };
> > > > if (text[0] >= 'N' && text[0] <= 'w')
> > > > {
> > > > return binary_search_str(keywords,
> > > > map[text[0]].low,
> > > > map[text[0]].high,
> > > > text);
> > > > }
> > > > return TK_NONE;
> > > >}
> > > While these experiments are enjoyable, in the real world
> > > keep it simple and use a linear search until the
> > > number of elements exceeds some number of table
> > > comparisons, where that number is derived from the
> > > clock speed.
> >
> > Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
> >
> > The OP is posting /because/ they want something faster than a linear search.
> >
> > But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
> >
> > Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
> >
> With reasonale assumption of one operator per line, linear search is
> doing _more_ searches than in Scott worst case estimate (but one
> would expect better from such short list).
>
> Your compiler is highly tuned for compilation speed. If you did
> the same to gcc you probably would be unable to measure speed
> difference. OTOH if Thiago wants high compilation speed, then
> following your advice is reasonable.
> > Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
> >
> > A linear search might suffice for a toy or experimental product working on small inputs.
> Actually there are real compilers using linear search. Some are considerd
> to be very fast: they are much faster then gcc...

I use linear search in quite a few places as well. Mainly for linked lists and where typical lengths are not long. Not however for symbol table lookups, but the hashtable for that is the most sophisticated data structure I use.

As for gcc, I used to have a version of my compiler running as interpreted bytecode; it was still twice as fast as even gcc-O0.

> Concerning code size: in non-toy compiler I want dynamically growing
> symbol table, so that compiler uses tiny amount of memory on small
> programs but can handle very big ones. Fast resizable hash table is
> likely to be much bigger than 100 lines (but less than 500 lines
> is reasonable).

I can't find a compiler version with a resized hash table, but I'm sure I must have done it (or maybe it's part of the Dict implementation of an interpreter).

I don't think it will take an extra 400 lines; you just need to keep track of capacity, and move to a bigger table as needed. Then the contents need to be rehashed into the bigger table. Say an extra 100 lines. Or a cheap solution is to have the table size as a compiler option.
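
For scale, a minimal sketch of the grow-and-rehash part in C (open addressing with linear probing; the names are made up and allocation-failure handling is omitted):

#include <stdlib.h>
#include <string.h>

struct entry { const char *name; int token; };

struct table {
    struct entry *slots;
    size_t cap;    /* always a power of two */
    size_t used;
};

static size_t hash_str(const char *s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static void table_insert(struct table *t, struct entry e); /* defined below */

/* Double the capacity and rehash every live entry into the new array. */
static void table_grow(struct table *t)
{
    struct entry *old = t->slots;
    size_t oldcap = t->cap;
    t->cap = oldcap ? oldcap * 2 : 64;
    t->slots = calloc(t->cap, sizeof *t->slots);  /* failure handling omitted */
    t->used = 0;
    for (size_t i = 0; i < oldcap; i++)
        if (old[i].name)
            table_insert(t, old[i]);
    free(old);
}

static void table_insert(struct table *t, struct entry e)
{
    if (t->cap == 0 || t->used * 4 >= t->cap * 3)  /* keep load factor <= 3/4 */
        table_grow(t);
    size_t i = hash_str(e.name) & (t->cap - 1);
    while (t->slots[i].name)                       /* linear probing */
        i = (i + 1) & (t->cap - 1);
    t->slots[i] = e;
    t->used++;
}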

Re: How to optimize this search?

<te1c0a$2fa$1@gioia.aioe.org>

Newsgroups: comp.lang.c
 by: antis...@math.uni.wroc.pl - Tue, 23 Aug 2022 01:54 UTC

bart c <bart4858@gmail.com> wrote:
> On Tuesday, 23 August 2022 at 01:08:00 UTC+1, anti... wrote:
> > bart c wrote:
> > > On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
> > > > Thiago Adams writes:
> > > > >On Monday, August 22, 2022 at 1:17:18 PM UTC-3, Scott Lurndal wrote:
> > > > >> Thiago Adams writes:
> > > > >> >On Monday, August 22, 2022 at 11:17:29 AM UTC-3, Thiago Adams wrote:
> > > > >> >> Given a c string I need to return as fast as possible the pair (if exist) of the corresponding keyword_pair.
> > > > >> >
> > > > >> Use a binary search. Max comparisons for your 70+ entries would
> > > > >> be 6.
> > > > >
> > > > >One alternative is create a hash given the first char it return the low/high
> > > > >index on a sorted array. Then we do normal binary search only at this reduced range.
> > > > >
> > > > >enum token_type is_keyword(const char* text)
> > > > >{
> > > > > static struct keyword_pair keywords[] =
> > > > > {
> > > > > /*0*/ {"NULL", TK_KEYWORD_NULL},
> > > > > /*code removed*/
> > > > > /*23*/{ "_asm", TK_KEYWORD__ASM},
> > > > > /*24*/{ "alignas", TK_KEYWORD__ALIGNAS},
> > > > > /*25*/{ "alignof", TK_KEYWORD__ALIGNOF},
> > > > > /*26*/{ "auto", TK_KEYWORD_AUTO},
> > > > > /*27*/{ "bool", TK_KEYWORD__BOOL},
> > > > > /*28*/{ "break", TK_KEYWORD_BREAK},
> > > > > /*code removed*/
> > > > > };
> > > > >
> > > > > /*hash*/
> > > > > static struct low_high {
> > > > > int low, high;
> > > > > } map[] = {
> > > > > ['a'] = {.low = 24, .high = 26},
> > > > > ['b'] = {.low = 27, .high = 28},
> > > > > /*etc*/
> > > > > };
> > > > > if (text[0] >= 'N' && text[0] <= 'w')
> > > > > {
> > > > > return binary_search_str(keywords,
> > > > > map[text[0]].low,
> > > > > map[text[0]].high,
> > > > > text);
> > > > > }
> > > > > return TK_NONE;
> > > > >}
> > > > While these experiments are enjoyable, in the real world
> > > > keep it simple and use a linear search until the
> > > > number of elements exceeds some number of table
> > > > comparisons, where that number is derived from the
> > > > clock speed.
> > >
> > > Rather strange advice. Would you also advocate the use of bubble sort in the 'real world'?
> > >
> > > The OP is posting /because/ they want something faster than a linear search.
> > >
> > > But I tried your idea on one of my compilers: doing a linear search to first see if an identifier was a keyword.
> > >
> > > Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall compiler throughout reduced by 2-4 times. This is for some 250 keywords in arbitrary order.
> > >
> > With reasonale assumption of one operator per line, linear search is
> > doing _more_ searches than in Scott worst case estimate (but one
> > would expect better from such short list).
> >
> > Your compiler is highly tuned for compilation speed. If you did
> > the same to gcc you probably would be unable to measure speed
> > difference. OTOH if Thiago wants high compilation speed, then
> > following your advice is reasonable.
> > > Imagine if you could speed up any compiler by 2-4 times just by adding 100 lines of code or even less.
> > >
> > > A linear search might suffice for a toy or experimental product working on small inputs.
> > Actually there are real compilers using linear search. Some are considerd
> > to be very fast: they are much faster then gcc...
>
> I use linear search in quite a few places as well. Mainly for linked lists and where typical lengths are not long. Not however for symbol table lookups, but the hashtable for that is the most sophisticated data structure I use.
>
> As for gcc, I used to have a version of my compiler running as interpreted bytecode; it was still twice as fast as even gcc-O0.
>
> > Concerning code size: in non-toy compiler I want dynamically growing
> > symbol table, so that compiler uses tiny amount of memory on small
> > programs but can handle very big ones. Fast resizable hash table is
> > likely to be much bigger than 100 lines (but less than 500 lines
> > is reasonable).
>
> I can't find a compiler version with a resized hash table, but I'm sure I must have done it (or maybe it's part of the Dict implementation of an interpreter).
>
> I don't think it will take an extra 400 lines; you just need to keep track of capacity, and move to a bigger table as needed. Then the contents need to be rehashed into the bigger table. Say an extra 100 lines. Or a cheap solution is to have the table size as a compiler option.

Well, 500 was an upper estimate. My resizable hashtab is 305 wc lines;
164 is the hashtab proper, the rest is header files (with declarations of
data structures) and a few utilities. One could probably make it
smaller, but I doubt that one could shrink it to 100 lines without
making it unreadable. And it has no frills, so a compiler author
may wish to add to it...

BTW: the GNU folks in their utility library have a file "hashtab.c",
998 wc lines...

--
Waldek Hebisch

Re: How to optimize this search?

<te1tat$2u80f$1@dont-email.me>

Newsgroups: comp.lang.c
 by: David Brown - Tue, 23 Aug 2022 06:50 UTC

On 23/08/2022 00:22, bart c wrote:
> On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:

>> While these experiments are enjoyable, in the real world keep it
>> simple and use a linear search until the number of elements exceeds
>> some number of table comparisons, where that number is derived from
>> the clock speed.
>
> Rather strange advice. Would you also advocate the use of bubble sort
> in the 'real world'?

That's a non sequitur. There are quite clearly situations where a
linear search is better than a binary search or hash map; much less so
for a bubble sort compared to other sorts.

>
> The OP is posting /because/ they want something faster than a linear
> search.
>

He is trying to find the fastest way to do the mapping - if that's a
linear search, then he'll be happy with a linear search.

In the good old days, it was relatively easy to guess the speed of code.
And you could use simple algorithmic complexity arguments to see that
a binary search would be faster than a linear search for more than about
3 or 4 items. But now the key factors for speed here will be branch
prediction, jump target caches, and the like. Binary searches are
terrible for branch prediction, while linear searches are much faster.
The cross-over point when a binary search beats a linear search is far
higher than it used to be, and will vary wildly by processor - if the
processor can speculatively execute both sides of a branch far enough
for the cost of mispredicting to be low, binary search may be fine even
for small sizes. If not, then binary searches should be avoided
whenever speed is of the essence.

My guess for the best average result would be to put the keywords in a
sorted list. Have a table of 32 entries for the first letter as
pointers to a starting point in the keyword table - so a single-letter
dummy hash on the first letter, followed by a linear search.

But real-life measurements may show something else entirely.
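
A rough sketch of that first-letter scheme, assuming the keyword table
is sorted and using (c & 31) to fold 'a'..'z' and '_' into 32 buckets;
the table here shows only a handful of keywords:

#include <string.h>

/* Sorted keyword list (only a few shown); NULL terminates the table. */
static const char *const keywords[] = {
    "auto", "bool", "break", "case", "char", "while", NULL
};

/* start[h] = index of the first keyword whose first letter folds to h,
 * or -1 if no keyword starts with that letter.
 * Call init_start() once before doing lookups. */
static int start[32];

static void init_start(void)
{
    for (int h = 0; h < 32; h++)
        start[h] = -1;
    for (int i = 0; keywords[i]; i++) {
        int h = keywords[i][0] & 31;
        if (start[h] < 0)
            start[h] = i;
    }
}

static int is_keyword_bucketed(const char *s)
{
    int i = start[(unsigned char)s[0] & 31];
    if (i < 0)
        return 0;
    /* Linear search, but only over keywords sharing the first letter. */
    for (; keywords[i] && (keywords[i][0] & 31) == ((unsigned char)s[0] & 31); i++)
        if (strcmp(keywords[i], s) == 0)
            return 1;
    return 0;
}

With the full 70-odd keyword list, each bucket holds only a few entries,
so the linear part stays short.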

> But I tried your idea on one of my compilers: doing a linear search
> to first see if an identifier was a keyword.
>

If that is for your language, then I believe you have a lot more
keywords, which will make a difference.

> Parsing got about 7 times slower (from 2.1Mlps to 0.3Mlps). Overall
> compiler throughput reduced by 2-4 times. This is for some 250
> keywords in arbitrary order.
>
> Imagine if you could speed up any compiler by 2-4 times just by
> adding 100 lines of code or even less.
>
> A linear search might suffice for a toy or experimental product
> working on small inputs.
>

Re: How to optimize this search?

<74f2284e-0016-4a0d-8fea-40f05b0381a7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22624&group=comp.lang.c++#22624

 Newsgroups: comp.lang.c
 by: bart c - Tue, 23 Aug 2022 11:18 UTC

On Tuesday, 23 August 2022 at 07:50:20 UTC+1, David Brown wrote:
> On 23/08/2022 00:22, bart c wrote:
> > On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal wrote:
>
> >> While these experiments are enjoyable, in the real world keep it
> >> simple and use a linear search until the number of elements exceeds
> >> some number of table comparisons, where that number is derived from
> >> the clock speed.
> >
> > Rather strange advice. Would you also advocate the use of bubble sort
> > in the 'real world'?
> That's a non sequitur. There are quite clearly situations where a
> linear search is better than a binary search or hash-map, much less so
> for a bubble sort compared to other sorts.
> >
> > The OP is posting /because/ they want something faster than a linear
> > search.
> >
> He is trying to find the fastest way to do the mapping - if that's a
> linear search, then he'll be happy with a linear search.

I assume he's already tried a linear traversal.

> In the good old days, it was relatively easy to guess the speed of code.
> And you could use simple algorithmic complexity arguments to see that
> a binary search would be faster than a linear search for more than about
> 3 or 4 items. But now the key factors for speed here will be branch
> prediction, jump target caches, and the like. Binary searches are
> terrible for branch prediction, while linear searches are much faster.
> The cross-over point when a binary search beats a linear search is far
> higher than it used to be, and will vary wildly by processor - if the
> processor can speculatively execute both sides of a branch far enough
> for the cost of mispredicting to be low, binary search may be fine even
> for small sizes. If not, then binary searches should be avoided
> whenever speed is of the essence.
>
> My guess for the best average result would be to put the keywords in a
> sorted list. Have a table of 32 entries for the first letter as
> pointers to a starting point in the keyword table - so a single-letter
> dummy hash on the first letter, followed by a linear search.

(I think he said he did that and got a worthwhile speed-up, so it's already better than linear.)

> But real-life measurements may show something else entirely.
> > But I tried your idea on one of my compilers: doing a linear search
> > to first see if an identifier was a keyword.
> >
> If that is for your language, then I believe you have a lot more
> keywords, which will make a difference.

Scott mentioned 'a few hundred' entries before trying a different approach.

But I have tried this on my C compiler. There, there were 77 keywords. Overall throughput was reduced by about 20% (timings were erratic, but it was something like that). But that also had a more complicated, multi-layer lexer.

(This is using Ben's suggestion to compare as blocks of bytes not strings. A side-effect of that is that comparisons are only done when string lengths match, so the compare function is not called for every keyword. This requires that the linear table stores the length of each keyword.)
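
A small sketch of that length-filtered comparison; the table layout and
names are illustrative:

#include <string.h>

struct kw { unsigned char len; const char *name; };

/* Each entry stores its length so memcmp only runs on a length match. */
static const struct kw kwtab[] = {
    { 4, "auto" }, { 4, "bool" }, { 5, "break" }, { 4, "case" },
    /* ...rest of the keyword list... */
};

static int is_keyword_lenfirst(const char *s, size_t len)
{
    for (size_t i = 0; i < sizeof kwtab / sizeof kwtab[0]; i++)
        if (kwtab[i].len == len && memcmp(kwtab[i].name, s, len) == 0)
            return 1;
    return 0;
}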

I also extracted the OP's list of 75 C reserved words into my scripting language. I put them into a simple linear list of strings, and into a hash table (a built-in type).

I then did 5 million lookups of 3 different words (from the beginning, middle and end) using code like this (.... acts as tab):

fun findlinear(w).....return w in keywords
end

fun findhash(w).....return hashtable{w,-1}
end

The hash method consistently took about 0.22 seconds. The linear search was 0.2, 1.9 and 3.8 seconds. Both searches are implemented in native code (inside the interpreter).

(BTW the hash-table implementation in my C compiler comprises about 120 lines of code, of which 55 is to do with resizing (I found that compiler in the end). So only a few dozen lines is needed.)

Re: How to optimize this search?

<20220823145856.464a208687d544911b246825@g{oogle}mail.com>

https://www.novabbs.com/devel/article-flat.php?id=22625&group=comp.lang.c++#22625

 Newsgroups: comp.lang.c
 by: Anton Shepelev - Tue, 23 Aug 2022 11:58 UTC

Scott Lurndal to Thiago Adams:

> > Use a binary search. Max comparisons for your 70+
> > entries would be 6.
>
> I tried!

If speed is that important, which does not seem so to me in
your case, generate a perfect hash.

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Re: How to optimize this search?

<607d6a50-9932-4b63-9108-0894646ad04bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22626&group=comp.lang.c++#22626

 Newsgroups: comp.lang.c
 by: Thiago Adams - Tue, 23 Aug 2022 12:02 UTC

On Tuesday, August 23, 2022 at 8:59:11 AM UTC-3, Anton Shepelev wrote:
> Scott Lurndal to Thiago Adams:
> > > Use a binary search. Max comparisons for your 70+
> > > entries would be 6.
> >
> > I tried!
>
> If speed is that important, which does not seem so to me in
> your case, generate a perfect hash.

The perfect hash will ensure each keyword has a unique hash, right?

So we need to hash the input; then we have the index into a table.
Then, because the input can by coincidence have the same hash, we need strcmp.

The total time is the time to compute the hash + (indexing the table is fast) + strcmp.
I believe we can have better performance than this in some cases.
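
To make that flow concrete, a toy sketch in which the hash function and
the four keywords are only illustrative; a real generator (gperf, for
example) would produce the function for the full keyword set:

#include <string.h>

/* For these four keywords, (s[0] + 2*len) & 3 happens to be
 * collision-free, i.e. a minimal perfect hash for this toy set. */

struct kw { const char *name; int token; };

enum { TK_BREAK, TK_AUTO, TK_BOOL, TK_CASE };

static const struct kw kw_table[4] = {
    [0] = { "break", TK_BREAK },   /* ('b' + 2*5) & 3 == 0 */
    [1] = { "auto",  TK_AUTO  },   /* ('a' + 2*4) & 3 == 1 */
    [2] = { "bool",  TK_BOOL  },   /* ('b' + 2*4) & 3 == 2 */
    [3] = { "case",  TK_CASE  },   /* ('c' + 2*4) & 3 == 3 */
};

int lookup_keyword(const char *s)
{
    size_t len = strlen(s);
    unsigned i = ((unsigned char)s[0] + 2u * (unsigned)len) & 3u;
    /* The hash is perfect for the keywords, but arbitrary input can land
     * on any slot, so one strcmp is still needed to confirm. */
    return strcmp(kw_table[i].name, s) == 0 ? kw_table[i].token : -1;
}

The total cost is exactly as described: compute the hash, index the
table, and do one string comparison.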

Re: How to optimize this search?

<42a12eea-f20a-49eb-9b5e-7230b374195dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22627&group=comp.lang.c++#22627

 Newsgroups: comp.lang.c
 by: Thiago Adams - Tue, 23 Aug 2022 12:13 UTC

On Monday, August 22, 2022 at 7:50:16 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> Thiago Adams <thiago...@gmail.com> wrote:
> > On Monday, August 22, 2022 at 4:30:45 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> > > Thiago Adams <thiago...@gmail.com> wrote:
> > > > On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > > > > In article
> > > > > <d8294420-7b45-4344...@googlegroups.com>,
> > > > > Thiago Adams <thiago...@gmail.com> wrote:
> > > > >
> > > > > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > > > > the corresponding keyword_pair.
> > > > > You can build a Finite State Automaton.
> > > >
> > > > Yes.. I am thinking exactly like that. But at some point
> > > > we need to map like char -> state then it is the same
> > > > switch problem. How to given char how to find the state in a fast way?
> > > You need map (state, char) -> state, that is pair state, char to new
> > > state. Easily done with two dimensional array:
> > >
> > > int s = 1; /* could use symbolic name here */
> > > const unsigned char *cp = input; /* cursor into the string being tested */
> > > unsigned char c;
> > > while (s && (c = *cp++)) {
> > >     s = transition[s][c];
> > > }

I translate a state machine to code without tables / switch cases.
(Doing this, I am not depending on the compiler's magic optimisations for switch.)

The translation gives each state a label. The state is changed by moving the "cursor"
and using goto to jump to the next state.

For instance, for the keywords "auto break bool":

bool is_keyword(const char* text)
{
    int i = 0;
    /*L_0:*/
    if (text[i] == 'a') {
        i++;
        goto L_1;
    }
    else if (text[i] == 'b') {
        i++;
        goto L_2;
    }
    return false;

L_1:
    if (text[i] == 'u') {
        i++;
        goto L_3;
    }
    return false;

L_2:
    if (text[i] == 'o') {
        i++;
        goto L_4;
    }
    else if (text[i] == 'r') {
        i++;
        goto L_5;
    }
    return false;

L_3:
    if (text[i] == 't') {
        i++;
        goto L_6;
    }
    return false;

L_4:
    if (text[i] == 'o') {
        i++;
        goto L_7;
    }
    return false;

L_5:
    if (text[i] == 'e') {
        i++;
        goto L_8;
    }
    return false;

L_6:
    if (text[i] == 'o') {
        i++;
        if (text[i] == '\0')
            return true; /* end state for TKAUTO */
    }
    return false;

L_7:
    if (text[i] == 'l') {
        i++;
        if (text[i] == '\0')
            return true; /* end state for TKBOOL */
    }
    return false;

L_8:
    if (text[i] == 'a') {
        i++;
        goto L_11;
    }
    return false;

L_11:
    if (text[i] == 'k') {
        i++;
        if (text[i] == '\0')
            return true; /* end state for TKBREAK */
    }
    return false;
}

The code contains a "linear search": a sequence of ifs whose length (for the first
state) is the number of letters a keyword can start with. So the more frequent
letters should go first.

Starting from this "mechanical" code, I changed it into more "human" code.
The result is:

bool is_keyword(const char* s)
{
    if (*s == 'a')
    {
        s++;
        /* a uto */
        return *s++ == 'u' && *s++ == 't' && *s++ == 'o' && *s++ == '\0';
    }
    else if (*s == 'b')
    {
        s++;
        if (*s == 'o')
        {
            s++;
            /* b o ol */
            return *s++ == 'o' && *s++ == 'l' && *s++ == '\0';
        }
        else if (*s == 'r')
        {
            s++;
            /* b r eak */
            return *s++ == 'e' && *s++ == 'a' && *s++ == 'k' && *s++ == '\0';
        }
    }
    return false;
}

I think this code has the best performance in many cases!

I will try to create a generator that does this. (I did it in the past using
switches; now I want just ifs.)
The state-machine gotos are good for other kinds of tokens (regexes that have
loops), but for keywords it is basically going forward.

Re: How to optimize this search?

<te2lvr$30ht8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22629&group=comp.lang.c++#22629

 Newsgroups: comp.lang.c
 by: David Brown - Tue, 23 Aug 2022 13:50 UTC

On 23/08/2022 13:18, bart c wrote:
> On Tuesday, 23 August 2022 at 07:50:20 UTC+1, David Brown wrote:
>> On 23/08/2022 00:22, bart c wrote:
>>> On Monday, 22 August 2022 at 19:10:21 UTC+1, Scott Lurndal
>>> wrote:
>>
>>>> While these experiments are enjoyable, in the real world keep
>>>> it simple and use a linear search until the number of elements
>>>> exceeds some number of table comparisons, where that number is
>>>> derived from the clock speed.
>>>
>>> Rather strange advice. Would you also advocate the use of bubble
>>> sort in the 'real world'?
>> That's a non sequitur. There are quite clearly situations where a
>> linear search is better than a binary search or hash-map, much less
>> so for a bubble sort compared to other sorts.
>>>
>>> The OP is posting /because/ they want something faster than a
>>> linear search.
>>>
>> He is trying to find the fastest way to do the mapping - if that's
>> a linear search, then he'll be happy with a linear search.
>
> I assume he's already tried a linear traversal.
>

I'm trying not to assume things. I am also not assuming that there /is/
an approach that is faster than linear searching, or faster than
whatever he has tried so far.

>> In the good old days, it was relatively easy to guess the speed of
>> code. And you could use simple algorithmic complexity arguments to
>> see that a binary search would be faster than a linear search for
>> more than about 3 or 4 items. But now the key factors for speed
>> here will be branch prediction, jump target caches, and the like.
>> Binary searches are terrible for branch prediction, while linear
>> searches are much faster. The cross-over point when a binary search
>> beats a linear search is far higher than it used to be, and will
>> vary wildly by processor - if the processor can speculatively
>> execute both sides of a branch far enough for the cost of
>> mispredicting to be low, binary search may be fine even for small
>> sizes. If not, then binary searches should be avoided whenever
>> speed is of the essence.
>>
>> My guess for the best average result would be to put the keywords
>> in a sorted list. Have a table of 32 entries for the first letter
>> as pointers to a starting point in the keyword table - so a
>> single-letter dummy hash on the first letter, followed by a linear
>> search.
>
> (I think he said he did and that and got a worthwhile speed up. So
> already better than linear.)
>

OK. I have probably skipped over some of the posts in this thread, and
missed that.

>
>> But real-life measurements may show something else entirely.
>>> But I tried your idea on one of my compilers: doing a linear
>>> search to first see if an identifier was a keyword.
>>>
>> If that is for your language, then I believe you have a lot more
>> keywords, which will make a difference.
>
> Scott mentioned 'a few hundred' entries before trying a different
> approach.
>
> But I have tried this on my C compiler. There, there were 77
> keywords. Overall throughput was reduced by about 20% (timings were
> erratic but it was something like that). But that also had a more
> complicated, multi-layer lexer.
>
> (This is using Ben's suggestion to compare as blocks of bytes not
> strings. A side-effect of that is that comparisons are only done when
> string lengths match, so the compare function is not called for every
> keyword. This requires that the linear table stores the length of
> each keyword.)

That sounds like a bad idea to me. Comparing blocks of bytes with a
fixed size is the ideal - and that fixed size should ideally be 8. As
long as you are careful with your types, your alignments, and your use
of a good optimising C compiler, your comparisons are now just a single
64-bit integer comparison. (Clearly both the string for the lookup and
the strings in the table need consistent padding.) Some keywords are
longer than 8 characters and need extra checking once the first lookup
is done (all are, I think, distinguishable by their first 8 characters
except _Decimal128, _Decimal32 and _Decimal64).

A smart enough C compiler might automatically vectorise this for SIMD
instructions, if told it is targeting a suitable processor.
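
A rough sketch of that fixed-width comparison, with a tiny illustrative
keyword set; memcpy is used for the loads so alignment and aliasing stay
well-defined, and a decent optimising compiler reduces each comparison
to one 64-bit integer compare:

#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a string into a uint64_t, padding with '\0'. */
static uint64_t pack8(const char *s, size_t len)
{
    char buf[8] = { 0 };
    memcpy(buf, s, len < 8 ? len : 8);
    uint64_t v;
    memcpy(&v, buf, 8);                  /* alignment-safe 64-bit load */
    return v;
}

static const char *const kw_names[] = { "auto", "bool", "break", "case", "char" };
enum { NKW = sizeof kw_names / sizeof kw_names[0] };
static uint64_t kw_packed[NKW];

/* Call once before lookups. */
static void init_packed(void)
{
    for (int i = 0; i < NKW; i++)
        kw_packed[i] = pack8(kw_names[i], strlen(kw_names[i]));
}

static int is_keyword_packed(const char *s, size_t len)
{
    if (len > 8)
        return 0;                        /* the toy set has no long keywords */
    uint64_t key = pack8(s, len);
    for (int i = 0; i < NKW; i++)        /* each iteration is one integer compare */
        if (kw_packed[i] == key)
            return 1;
    return 0;
}

Keywords longer than 8 characters, such as the _Decimal* ones, would need
the extra check mentioned above.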

>
> I also extracted the OP's list of 75 C reserved words into my
> scripting language. I put them into a simple linear list of strings,
> and into a hash table (a built-in type).
>
> I then did 5 million lookups of 3 different words (from the
> beginning, middle and end) using code like this (.... acts as tab):
>
> fun findlinear(w)= ....return w in keywords end
>
> fun findhash(w)= ....return hashtable{w,-1} end
>
> The hash method consistently took about 0.22 seconds. The linear
> search was 0.2, 1.9 and 3.8 seconds. Both searches are implemented in
> native code (inside the interpreter).
>
> (BTW the hash-table implementation in my C compiler comprises about
> 120 lines of code, of which 55 is to do with resizing (I found that
> compiler in the end). So only a few dozen lines is needed.)

Be aware that it is difficult to test this kind of thing well. Testing
repeatedly with only a few words can give hugely different effects from
branch target buffers and branch prediction than you'd get from a random
mix of all the keywords.

Real-life source code is different again - the mix of keywords is far
from random. It might even be worth doing a linear search for the most
common keywords such as "if" and "int", then moving to a hash table for
rare ones such as "thread_local" and "goto". Or to keep it simpler,
have a linear search but with the table ordered by the frequency of the
keywords in real code - who cares if keywords like "_Imaginary" or
"_BitInt" are found more slowly?

Re: How to optimize this search?

<20220823175014.5b34c47ce60ab0dd18768269@g{oogle}mail.com>

https://www.novabbs.com/devel/article-flat.php?id=22630&group=comp.lang.c++#22630

 Newsgroups: comp.lang.c
 by: Anton Shepelev - Tue, 23 Aug 2022 14:50 UTC

Thiago Adams to Anton Shepelev:

> > If speed is that important, which does not seem so to me
> > in your case, generate a perfect hash.
>
> The perfect hash will ensure each keyword has a unique
> hash right?

Yes.

> So we need to hash the input; then we have the index into a
> table. Then, because the input can by coincidence have the
> same hash, we need strcmp.

I am confused. A perfect hash is designed to avoid
collisions. Why should we need strcmp?

> The total time is time to compute
> hash + (index table is fast) + strcmp.
> I believe we can have better performance than this for
> some cases.

But the hashing ought to be done at compile time or at
program startup.

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

