Message-ID:

[A computer is] like an Old Testament god, with a lot of rules and no mercy. -- Joseph Campbell

devel / comp.lang.c / Re: How to optimize this search?

Thiago Adams:

> here is the C code you need to be faster. :)

Not "faster than."? You don't like to end sentences with prepostions?

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> Thiago Adams to Anton Shepelev:
>
>> > If speed is that important, which does not seem so to me
>> > in your case, generate a perfect hash.
>>
>> The perfect hash will ensure each keyword has a unique
>> hash right?
>
> Yes.
>
>> So we need hash the input then we have the index in a
>> table. Then because the input can have by coincidence the
>> same hash we need strcmp.
>
> I am confused. A perfect hash is designed to avoid
> collisions. Why shluld we need strcmp?

There are no collision among the keywords, but some strings that are not
keywords might hash the same value as one.

I suppose there might be a way to generate a truly perfect hash that
generates an index outside the table for every possible string not in
it, but I've not seen such a thing.

>> The total time is time to compute
>> hash + (index table is fast) + strcmp.
>> I believe we can have better performance than this for
>> some cases.

You mean with an FSA? It would be interesting to compare times. Also
implementation times. I was cooking dinner when I suggested a perfect
hash, but I had a program running withing 45 minutes of posting. That
included finding gpref, installing it reading enough documentation to
get to going.

> But the hasing ought to be done at compile time or at
> program startup.

Not sure what you mean. You can't hash the input until there is input,
but gperf does, of course, make a static, compile time, hash table.

--
Ben.

Ben Bacarisse to Anton Shepelev:

> > I am confused. A perfect hash is designed to avoid
> > collisions. Why shluld we need strcmp?
>
> There are no collision among the keywords, but some
> strings that are not keywords might hash the same value as
> one.

Thiago wrote nothing about any other strings than the static
set of keywords known in advance.

> > But the hasing ought to be done at compile time or at
> > program startup.
>
> Not sure what you mean. You can't hash the input until
> there is input, but gperf does, of course, make a static,
> compile time, hash table.

Mutual. As understood Thiago's post, he wants to search a
static set of keywords, not arbitrary unpredictable input.
If need be, he chould use a separate and more traditional
hash table for non-keywords, but that seems beyond the
original question... Do you mean the hashing of string
literals?

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Thiago Adams <thiago.adams@gmail.com> writes:
>I think this code has the best performance for many cases!

First, one should be careful not to get bogged down in such
details. First make it work, then make it fast (if need be).

Next, when it comes to omptimizations: thinking is good,
measuring is better!

A simple approach often suffices:

if( !strcmp( s, "alpha" ))alpha();
else if( !strcmp( s, "beta" ))beta();
else otherwise();

, otherwise, look at how it's done in scanners.

Here's an example from "NMH's Simple C Compiler,
2011,2012" "Lexical analysis (scanner)" "scan.c"
"static int keyword(char *s)":

. It's a kind of combination between a simple kind
of hashing and "strcmp".

This is from a BSD C compiler ("cc"):

|struct kwtab {
| char *kwname;
| int kwval;
|} kwtab[] = {
| "int", INT,
| "char", CHAR,
....
|findkw()
|{
| register struct kwtab *kp;
| for (kp=kwtab; kp->kwname; kp++) {
| if (strcmp(symbuf, kp->kwname) == 0) {

This seems to use only a simple "strcmp". But then one sees
that before this "findkw" is being called, an additional
lookup table is used sometimes::

| if (kwhash[ihash/LNBPW] & (1 << (ihash%LNBPW)))
| if (findkw())

. I assume perfect hashing and gperf were already mentioned.

On Tuesday, August 23, 2022 at 1:21:30 PM UTC-3, Anton Shepelev wrote:
> Ben Bacarisse to Anton Shepelev:
> > > I am confused. A perfect hash is designed to avoid
> > > collisions. Why shluld we need strcmp?
> >
> > There are no collision among the keywords, but some
> > strings that are not keywords might hash the same value as
> > one.

Yes the input is any identifier.

Thiago Adams <thiago.adams@gmail.com> writes:

> Given a c string I need to return as fast as possible the pair (if
> exist) of the corresponding keyword_pair.
>
>
> struct keyword_pair{
> const char* lexeme;
> int token;
> };
>
> struct keyword_pair[] = {
> { "NULL", 0},
> { "_Alignas", 1},
> [...]
> { "volatile", 73},
> { "while", 74},
> };

There is no way to give a good answer to this question without more
information, as for example how often does a query result in a miss
rather than a hit, and what is the expected distribution of queries
that result in a hit. And there are other factors that could play a
major role in making good choices for this.

Also, it seems likely that a better way to approach this problem is
to change the callers (or at least some of them) so that they
provide more information than just a pointer to a string. For
example, if this is to be part of a scanner, then it should be
possible without too much difficult to compute a hash as the
identifier is scanned, and also a length. Both of those could
help speed up the keyword lookup code.

Thiago Adams:

> Yes the input is any identifier.

I see. What about an NFA-based tokeniser that would narrow
down the set of possible keywords while reading the source
character by character, as described here wrt RegExes:

https://swtch.com/~rsc/regexp/regexp1.html

Since each step handles a single character, the search can
be efficiently indexed using arrays. It seems to me the most
efficient (because earliest) use of incoming information.

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

On Tuesday, August 23, 2022 at 1:50:53 PM UTC-3, Tim Rentsch wrote:
> Thiago Adams <thiago...@gmail.com> writes:
>
> > Given a c string I need to return as fast as possible the pair (if
> > exist) of the corresponding keyword_pair.
> >
> >
> > struct keyword_pair{
> > const char* lexeme;
> > int token;
> > };
> >
> > struct keyword_pair[] = {
> > { "NULL", 0},
> > { "_Alignas", 1},
> > [...]
> > { "volatile", 73},
> > { "while", 74},
> > };
> There is no way to give a good answer to this question without more
> information, as for example how often does a query result in a miss
> rather than a hit, and what is the expected distribution of queries
> that result in a hit. And there are other factors that could play a
> major role in making good choices for this.
>
> Also, it seems likely that a better way to approach this problem is
> to change the callers (or at least some of them) so that they
> provide more information than just a pointer to a string. For
> example, if this is to be part of a scanner, then it should be
> possible without too much difficult to compute a hash as the
> identifier is scanned, and also a length. Both of those could
> help speed up the keyword lookup code.

Imagine the input is a C identifier.
Things like "void" "int" "struct" "enum" are common keywords , very frequent.
Things like "i" "j" "k" are common not keywords.

On Tuesday, 23 August 2022 at 18:08:07 UTC+1, Thiago Adams wrote:
> On Tuesday, August 23, 2022 at 1:50:53 PM UTC-3, Tim Rentsch wrote:
> > Thiago Adams writes:
> >
> > > Given a c string I need to return as fast as possible the pair (if
> > > exist) of the corresponding keyword_pair.
> > >
> > >
> > > struct keyword_pair{
> > > const char* lexeme;
> > > int token;
> > > };
> > >
> > > struct keyword_pair[] = {
> > > { "NULL", 0},
> > > { "_Alignas", 1},
> > > [...]
> > > { "volatile", 73},
> > > { "while", 74},
> > > };
> > There is no way to give a good answer to this question without more
> > information, as for example how often does a query result in a miss
> > rather than a hit, and what is the expected distribution of queries
> > that result in a hit. And there are other factors that could play a
> > major role in making good choices for this.
> >
> > Also, it seems likely that a better way to approach this problem is
> > to change the callers (or at least some of them) so that they
> > provide more information than just a pointer to a string. For
> > example, if this is to be part of a scanner, then it should be
> > possible without too much difficult to compute a hash as the
> > identifier is scanned, and also a length. Both of those could
> > help speed up the keyword lookup code.
> Imagine the input is a C identifier.
> Things like "void" "int" "struct" "enum" are common keywords , very frequent.
> Things like "i" "j" "k" are common not keywords.

If input is C source code, then some brief tests suggest that 30-40% of identifiers are keywords, the rest are used-defined.

For the test input sqlite3.c + shell.c + shell.h, 30% were keywords. And the most common were (counts shown):

46066 define
10005 if
09531 int
03522 endif
03468 return
03053 char
02924 void
02802 else
02366 const
01491 static
01461 struct
01369 typedef
.....

What the percentage means is, if you have an alphanumeric token but don't know if it's a keyword or not, then if you first check for a keyword, that effort will be wasted 2/3 of the time; you need to do a keyword check AND a normal lookup.

On the other hand, 1/3 of the time, you will save having to do a conventional lookup.

If a keyword check (that is, determine which keyword, if any, not just yes or no as one of your posts did) is much faster than a normal lookup, this might have a small net benefit.

Although I can't myself see the extra complication as being worthwhile, this might make your tokenising logic simpler. Since it's not going to make it much slower either, provided you don't just do a linear search.

bart c <bart4858@gmail.com> writes:
[...]
> If input is C source code, then some brief tests suggest that 30-40% of identifiers are keywords, the rest are used-defined.
>
> For the test input sqlite3.c + shell.c + shell.h, 30% were keywords. And the most common were (counts shown):
>
> 46066 define
> 10005 if
> 09531 int
> 03522 endif
> 03468 return
> 03053 char
> 02924 void
> 02802 else
> 02366 const
> 01491 static
> 01461 struct
> 01369 typedef
> ....

"define" and "endif" would be recognized by the preprocessor, which
doesn't know about "int", "return", et al. After preprocessing,
"define" and "endif" don't have to be recognized.

It would be difficult to have a correct C implementation in which
"define" and "int" both need to be recognized by the same code.

And that's a *lot* of "define"s. It sounds like the code you're
looking at is not typical. Your numbers don't match what I see
in the latest sqlite3 sources.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

On Tuesday, August 23, 2022 at 4:17:18 PM UTC-3, Keith Thompson wrote:
> bart c <bart...@gmail.com> writes:
> [...]
> > If input is C source code, then some brief tests suggest that 30-40% of identifiers are keywords, the rest are used-defined.
> >
> > For the test input sqlite3.c + shell.c + shell.h, 30% were keywords. And the most common were (counts shown):
> >
> > 46066 define
> > 10005 if
> > 09531 int
> > 03522 endif
> > 03468 return
> > 03053 char
> > 02924 void
> > 02802 else
> > 02366 const
> > 01491 static
> > 01461 struct
> > 01369 typedef
> > ....
> "define" and "endif" would be recognized by the preprocessor, which
> doesn't know about "int", "return", et al. After preprocessing,
> "define" and "endif" don't have to be recognized.
>
> It would be difficult to have a correct C implementation in which
> "define" and "int" both need to be recognized by the same code.
>
> And that's a *lot* of "define"s. It sounds like the code you're
> looking at is not typical. Your numbers don't match what I see
> in the latest sqlite3 sources.

Having the keywords out from preprocessor has the advantage of
not checking keywords on the inactive preprocessor blocks.

My transpiler promotes the "surviving" identifiers to keywords after preprocessing
on the first usage.

pp-numbers are similar.. that are checked after preprocessing except the ones that are
in #if expression , :) that need to be evaluated at preprocessor phase.
We also can see that bad ppnumber survives if not used in # if expressions.
the other tokens like punctuators strings..are the same (for me) for preprocessor or not.
define endif etc are used as identifers and checked with strcmp.

On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
> bart c <bart...@gmail.com> writes:
> [...]
> > If input is C source code, then some brief tests suggest that 30-40% of identifiers are keywords, the rest are used-defined.
> >
> > For the test input sqlite3.c + shell.c + shell.h, 30% were keywords. And the most common were (counts shown):
> >
> > 46066 define
> > 10005 if
> > 09531 int
> > 03522 endif
> > 03468 return
> > 03053 char
> > 02924 void
> > 02802 else
> > 02366 const
> > 01491 static
> > 01461 struct
> > 01369 typedef
> > ....
> "define" and "endif" would be recognized by the preprocessor, which
> doesn't know about "int", "return", et al. After preprocessing,
> "define" and "endif" don't have to be recognized.

I have a 3-level lex function (lex() calls mlex() which calls lexreadtoken()); the middle one takes of of macro expansion; cpp directives are recognised in there somewhere so don't make it through to the top function.

I don't have a completely separate preprocess stage; that's done as it goes.

For the purpose of my test, keywords are detected a bit earlier than normal (the compiler still seems able to do its job).

> And that's a *lot* of "define"s. It sounds like the code you're
> looking at is not typical. Your numbers don't match what I see
> in the latest sqlite3 sources.

Yeah, that looked dodgy. I don't know what's going on there. If I compile only sqlite3.c, then I get more reasonable looking figures:

14010 define
08728 if
08543 int
03115 endif
03111 return
02658 void
02291 else

Note that 10,000 of those defines are inside my windows.h. Also note that the same 'if' keyword is used for "'#if".

This is specific to normal C however. We don't know exactly what the OP is doing, or what his inputs will be, whether they are already preprocessed or not.

On Tuesday, August 23, 2022 at 5:18:26 PM UTC-3, bart c wrote:
> On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
> > bart c <bart...@gmail.com> writes:
> > [...]
> > > If input is C source code, then some brief tests suggest that 30-40% of identifiers are keywords, the rest are used-defined.
> > >
> > > For the test input sqlite3.c + shell.c + shell.h, 30% were keywords. And the most common were (counts shown):
> > >
> > > 46066 define
> > > 10005 if
> > > 09531 int
> > > 03522 endif
> > > 03468 return
> > > 03053 char
> > > 02924 void
> > > 02802 else
> > > 02366 const
> > > 01491 static
> > > 01461 struct
> > > 01369 typedef
> > > ....
> > "define" and "endif" would be recognized by the preprocessor, which
> > doesn't know about "int", "return", et al. After preprocessing,
> > "define" and "endif" don't have to be recognized.
> I have a 3-level lex function (lex() calls mlex() which calls lexreadtoken()); the middle one takes of of macro expansion; cpp directives are recognised in there somewhere so don't make it through to the top function.
>
> I don't have a completely separate preprocess stage; that's done as it goes.
>
> For the purpose of my test, keywords are detected a bit earlier than normal (the compiler still seems able to do its job).
> > And that's a *lot* of "define"s. It sounds like the code you're
> > looking at is not typical. Your numbers don't match what I see
> > in the latest sqlite3 sources.
> Yeah, that looked dodgy. I don't know what's going on there. If I compile only sqlite3.c, then I get more reasonable looking figures:
>
> 14010 define
> 08728 if
> 08543 int
> 03115 endif
> 03111 return
> 02658 void
> 02291 else
>
> Note that 10,000 of those defines are inside my windows.h. Also note that the same 'if' keyword is used for "'#if".
>
> This is specific to normal C however. We don't know exactly what the OP is doing, or what his inputs will be, whether they are already preprocessed or not.

After preprocessor , at first usage, the tokens of type "identifier" are checked and promoted to
keywords (if there are keywords ).

bart c <bart4858@gmail.com> writes:
> On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
[...]
>> And that's a *lot* of "define"s. It sounds like the code you're
>> looking at is not typical. Your numbers don't match what I see
>> in the latest sqlite3 sources.
>
> Yeah, that looked dodgy. I don't know what's going on there. If I
> compile only sqlite3.c, then I get more reasonable looking figures:
>
> 14010 define
> 08728 if
> 08543 int
> 03115 endif
> 03111 return
> 02658 void
> 02291 else
>
> Note that 10,000 of those defines are inside my windows.h. Also note
>that the same 'if' keyword is used for "'#if".
>
> This is specific to normal C however. We don't know exactly what the
> OP is doing, or what his inputs will be, whether they are already
> preprocessed or not.

Ah, so that's 14010 occurrences of "define" after preprocessing.

Except that after preprocessing there shouldn't be *any* occurrences of
"define" that result from macro definitions ("define" could be used as
an ordinary identifier). And if you're distinguishing between keywords
like "int" and identifiers like "foo" before translation phase 7, there
are likely to be some programs that you'll handle incorrectly. You
certainly don't have to implement the 8 translation phases as distinct
passes, but if you want a conforming implementation you must do
something that acts that way.

This:

#if int
#error "Error 1"
#endif
#if foo
#error "Error 2"
#endif

shouldn't produce any error messages.

The point is that if you're processing C source code and recognizing
whether tokens that look like identifiers are keywords or not, it
doesn't make sense to have "define" and "int" in the same table.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> Ben Bacarisse to Anton Shepelev:
>
>> > I am confused. A perfect hash is designed to avoid
>> > collisions. Why shluld we need strcmp?
>>
>> There are no collision among the keywords, but some
>> strings that are not keywords might hash the same value as
>> one.
>
> Thiago wrote nothing about any other strings than the static
> set of keywords known in advance.

It would be very odd to only lookup strings that you know are in a
table. And very, very odd given the context -- language keywords. TA
will want to know if a token is or is not a keyword.

--
Ben.

On Tuesday, August 23, 2022 at 9:13:59 AM UTC-3, Thiago Adams wrote:
> On Monday, August 22, 2022 at 7:50:16 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> > Thiago Adams <thiago...@gmail.com> wrote:
> > > On Monday, August 22, 2022 at 4:30:45 PM UTC-3, anti...@math.uni.wroc.pl wrote:
> > > > Thiago Adams <thiago...@gmail.com> wrote:
> > > > > On Monday, August 22, 2022 at 12:44:22 PM UTC-3, Siri Cruise wrote:
> > > > > > In article
> > > > > > <d8294420-7b45-4344...@googlegroups.com>,
> > > > > > Thiago Adams <thiago...@gmail.com> wrote:
> > > > > >
> > > > > > > Given a c string I need to return as fast as possible the pair (if exist) of
> > > > > > > the corresponding keyword_pair.
> > > > > > You can build a Finite State Automaton.
> > > > >
> > > > > Yes.. I am thinking exactly like that. But at some point
> > > > > we need to map like char -> state then it is the same
> > > > > switch problem. How to given char how to find the state in a fast way?
> > > > You need map (state, char) -> state, that is pair state, char to new
> > > > state. Easily done with two dimensional array:
> > > >
> > > > int s = 1; /* could use symbolic name here */
> > > > unsigned char cp;
> > > > while (s && (c = *cp++)) {
> > > > s = transition[s][c];
> > > > }
> I translate a state machine to code without tables / switch cases.
> (Doing this I am not depending on compiler magic optimisations for switch)
>
> The translation is done is each state has a label. State is changed moving the "cursor"
> and goto to the next state.
>
> For instance for the keywords "auto break bool"
>
> bool is_keyword(const char* text)
> {
> int i = 0;
> /*L_0:*/
> if (text[i] == 'a') {
> i++;
> goto L_1;
> }
> else if (text[i] == 'b') {
> i++;
> goto L_2;
> }
>
> return false;
>
> L_1:
> if (text[i] == 'u') {
> i++;
> goto L_3;
> }
> return false;
>
> L_2:
> if (text[i] == 'o') {
> i++;
> goto L_4;
> }
> else if (text[i] == 'r') {
> i++;
> goto L_5;
> }
>
> return false;
> L_3:
> if (text[i] == 't') {
> i++;
> goto L_6;
> }
> return false;
>
> L_4:
> if (text[i] == 'o') {
> i++;
> goto L_7;
> }
> return false;
>
> L_5:
> if (text[i] == 'e') {
> i++;
> goto L_8;
> }
> return false;
>
> L_6:
> if (text[i] == 'o') {
> i++;
> if (text[i] == '\0')
> return true;///* end state for TKAUTO*/
> }
> return false;
>
> L_7:
> if (text[i] == 'l') {
> i++;
> if (text[i] == '\0')
> return true;/* end state for TKBOOL*/
> }
> return false;
>
> L_8:
> if (text[i] == 'a') {
> i++;
> goto L_11;
> }
> return false;
>
> L_11:
> if (text[i] == 'k') {
> i++;
> if (text[i] == '\0')
> return true; /* end state for TKBREAK*/
> }
>
> return false;
> }
>
> The code has a "linear search" a sequence of ifs that is (for the first state) the number
> of possible letters to start. So more frequent ones must go first.
>
> Doing this "mechanical code" I changed it to be more "human" code.
> The result is:
>
> bool is_keyword(const char* s)
> {
> if (*s == 'a')
> {
> s++;
> /*a uto*/
> return *s++ == 'u' && *s++ == 't' && *s++ == 'o' && *s++ == '\0';
> }
> else if (*s == 'b')
> {
> s++;
> if (*s == 'o') {
>
> s++;
> /*b o ol*/
> return *s++ == 'o' && *s++ == 'l' && *s++ == '\0';
> }
> else if (*s == 'r')
> {
> s++;
> /*b r eak*/
> return *s++ == 'e' && *s++ == 'a' && *s++ == 'k' && *s++ == '\0';
> }
> }
> return false;
> }
>
> I think this code has the best performance for many cases!
>
> I will try to create a generator that does this. (I did in the past using switches.. now I want just ifs)
> The state machine gotos is good for other types of tokens (regex that have loops) but for
> keywords it is basically going forward.

Jus to document I also tried something like:

if (ch < 'c')
{ if (ch == 'a') ...
if (ch == 'b') ...
} else
{ if (ch == 'c') ...
if (ch == 'd') ...
...
}

That is similar of binary search .
We can add more and more ifs like this.

On Tuesday, 23 August 2022 at 21:40:25 UTC+1, Keith Thompson wrote:
> bart c <bart...@gmail.com> writes:
> > On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
> [...]
> >> And that's a *lot* of "define"s. It sounds like the code you're
> >> looking at is not typical. Your numbers don't match what I see
> >> in the latest sqlite3 sources.
> >
> > Yeah, that looked dodgy. I don't know what's going on there. If I
> > compile only sqlite3.c, then I get more reasonable looking figures:
> >
> > 14010 define
> > 08728 if
> > 08543 int
> > 03115 endif
> > 03111 return
> > 02658 void
> > 02291 else
> >
> > Note that 10,000 of those defines are inside my windows.h. Also note
> >that the same 'if' keyword is used for "'#if".
> >
> > This is specific to normal C however. We don't know exactly what the
> > OP is doing, or what his inputs will be, whether they are already
> > preprocessed or not.
> Ah, so that's 14010 occurrences of "define" after preprocessing.

This is getting dragged down by mention of preprocessing.

In order to test the effects of a linear search, I added that at the lowest level of the tokeniser, which normally only does an incomplete lookup - determine the location in a symbol table, and only commit to an actual reserved word later on.

This can mean that all instances of "define" are counted as reserved words, which would include 4 from this code:

#define A 1
#define B(x) define##x
int define;

when it should really be only two. And under normal circumstances, all 4 (2 reserved word 'define's, one PP token 'define', and one user identifier 'define') share the same symbol table entry, via some magic.

So the counts may be a little skewed, but remember the purpose was a quick and dirty test to see how much a linear search can affect performances.

(Funnily enough, the code still above compiles correctly. I tested it on two other non-C compilers before, and those had ordinary lexers; it was much easier to see where the put that extra code!)

> The point is that if you're processing C source code and recognizing
> whether tokens that look like identifiers are keywords or not, it
> doesn't make sense to have "define" and "int" in the same table.

Well, they *are* in the same table, and all three kinds of "define" share the same slot too. Or does C stipulate how people should organise their symbol tables?

On Tuesday, 23 August 2022 at 23:55:43 UTC+1, Thiago Adams wrote:

> Jus to document I also tried something like:
>
> if (ch < 'c')
> {
> if (ch == 'a') ...
> if (ch == 'b') ...
> }
> else
> {
> if (ch == 'c') ...
> if (ch == 'd') ...
> ...
> }
>
> That is similar of binary search .
> We can add more and more ifs like this.

This is starting to look a little desperate. Such code looks unmaintainable to me (but I assume you will use some tool to generate it).

What is your actual aim here? Granted a simple linear search is not going to be that performant. But pretty much anything else is going to be faster. So how much faster does it have to be; have you done any measurements? Is checking for keywords mixed in with any other input such as punctuation or numeric literals or strings or comments; what's the bigger picture?

(And what is the purpose in just getting a Yes or No answer to whether the current token is a reserved word - what is done with that information? It seems wasteful determining which token you have, then discarding that information.)

bart c <bart4858@gmail.com> writes:
> On Tuesday, 23 August 2022 at 21:40:25 UTC+1, Keith Thompson wrote:
>> bart c <bart...@gmail.com> writes:
>> > On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
>> [...]
>> >> And that's a *lot* of "define"s. It sounds like the code you're
>> >> looking at is not typical. Your numbers don't match what I see
>> >> in the latest sqlite3 sources.
>> >
>> > Yeah, that looked dodgy. I don't know what's going on there. If I
>> > compile only sqlite3.c, then I get more reasonable looking figures:
>> >
>> > 14010 define
>> > 08728 if
>> > 08543 int
>> > 03115 endif
>> > 03111 return
>> > 02658 void
>> > 02291 else
>> >
>> > Note that 10,000 of those defines are inside my windows.h. Also note
>> >that the same 'if' keyword is used for "'#if".
>> >
>> > This is specific to normal C however. We don't know exactly what the
>> > OP is doing, or what his inputs will be, whether they are already
>> > preprocessed or not.
>> Ah, so that's 14010 occurrences of "define" after preprocessing.
>
> This is getting dragged down by mention of preprocessing.

I suggest that 14,000 spurious occurrences of "define" may have skewed
your results.

[...]

>> The point is that if you're processing C source code and recognizing
>> whether tokens that look like identifiers are keywords or not, it
>> doesn't make sense to have "define" and "int" in the same table.
>
> Well, they *are* in the same table, and all three kinds of "define"
> share the same slot too. Or does C stipulate how people should
> organise their symbol tables?

Of course it doesn't. But I don't know why you'd *want* all three kinds
of "define" in the same table.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
> bart c <bart4858@gmail.com> writes:
> > On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
> [...]
> >> And that's a *lot* of "define"s. It sounds like the code you're
> >> looking at is not typical. Your numbers don't match what I see
> >> in the latest sqlite3 sources.
> >
> > Yeah, that looked dodgy. I don't know what's going on there. If I
> > compile only sqlite3.c, then I get more reasonable looking figures:
> >
> > 14010 define
> > 08728 if
> > 08543 int
> > 03115 endif
> > 03111 return
> > 02658 void
> > 02291 else
> >
> > Note that 10,000 of those defines are inside my windows.h. Also note
> >that the same 'if' keyword is used for "'#if".
> >
> > This is specific to normal C however. We don't know exactly what the
> > OP is doing, or what his inputs will be, whether they are already
> > preprocessed or not.
>
> Ah, so that's 14010 occurrences of "define" after preprocessing.

Bart wrote that "define" came from "windows.h", so his input is
before preprocessing.

Why not? Of course logically lookups for preprocessor keywords
are different from normal lookups, but there is no law that
prevent storing two logical hash tables in single physical
hash table. In particular, info about preprocessor keywords
is likely to be quite small, so there is good chance that
you can put it in place that would be otherwise wasted by
padding.

If you want fast compiler it makes sense to do symbol
table lookup once, that is combine prepocessor with compiler.
In preprocessing stage you need to look up each identifier to
check if it is a macro, when it is not a macro you may deliver
its compilation meaninig. Details needed for correct handling
of C are likely to be tricky and I do not know if Bart did
this correctly. But this is clearly doable and likely to give
large speed advantage.

Note that after macros and normal C symbols are in single
table what remains is handful of preprocessor keywords.
Since they are so frequent and exclusive of normal identifiers
special purpose table may give some speed boost. OTOH
in combined table preprocessor keywords will be entered
first, so with high probablity single lookup will work.
In other words, single combined table is likely to behave
as perfect hash for the purpose of lookup for proprocessor
keywords. It looks tricky to come with faster alternative...

--
Waldek Hebisch

antispam@math.uni.wroc.pl writes:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
>> bart c <bart4858@gmail.com> writes:
>> > On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
>> [...]
>> >> And that's a *lot* of "define"s. It sounds like the code you're
>> >> looking at is not typical. Your numbers don't match what I see
>> >> in the latest sqlite3 sources.
>> >
>> > Yeah, that looked dodgy. I don't know what's going on there. If I
>> > compile only sqlite3.c, then I get more reasonable looking figures:
>> >
>> > 14010 define
>> > 08728 if
>> > 08543 int
>> > 03115 endif
>> > 03111 return
>> > 02658 void
>> > 02291 else
>> >
>> > Note that 10,000 of those defines are inside my windows.h. Also note
>> >that the same 'if' keyword is used for "'#if".
>> >
>> > This is specific to normal C however. We don't know exactly what the
>> > OP is doing, or what his inputs will be, whether they are already
>> > preprocessed or not.
>>
>> Ah, so that's 14010 occurrences of "define" after preprocessing.
>
> Bart wrote that "define" came from "windows.h", so his input is
> before preprocessing.

Well, it seems to be in the middle of preprocessing. Before
preprocessing, you'd just have `#include "windows.h"` and none of its
contents. After preprocessing, all the `#define` directives should be
gone and all macros expanded. (#include, #define, and macro expansion
are all done on translation phase 4.)

Of course Bart can do whatever he likes as long as the final result is
as if all 8 phases are invoked *or* he doesn't care about full
conformance.

>> The point is that if you're processing C source code and recognizing
>> whether tokens that look like identifiers are keywords or not, it
>> doesn't make sense to have "define" and "int" in the same table.
>
> Why not? Of course logically lookups for preprocessor keywords
> are different from normal lookups, but there is no law that
> prevent storing two logical hash tables in single physical
> hash table. In particular, info about preprocessor keywords
> is likely to be quite small, so there is good chance that
> you can put it in place that would be otherwise wasted by
> padding.
>
> If you want fast compiler it makes sense to do symbol
> table lookup once, that is combine prepocessor with compiler.
> In preprocessing stage you need to look up each identifier to
> check if it is a macro, when it is not a macro you may deliver
> its compilation meaninig. Details needed for correct handling
> of C are likely to be tricky and I do not know if Bart did
> this correctly. But this is clearly doable and likely to give
> large speed advantage.
>
> Note that after macros and normal C symbols are in single
> table what remains is handful of preprocessor keywords.
> Since they are so frequent and exclusive of normal identifiers
> special purpose table may give some speed boost. OTOH
> in combined table preprocessor keywords will be entered
> first, so with high probablity single lookup will work.
> In other words, single combined table is likely to behave
> as perfect hash for the purpose of lookup for proprocessor
> keywords. It looks tricky to come with faster alternative...

I'm not convinced, but if it works that's fine.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

On Wednesday, 24 August 2022 at 00:50:33 UTC+1, Keith Thompson wrote:
> bart c <bart...@gmail.com> writes:
> > On Tuesday, 23 August 2022 at 21:40:25 UTC+1, Keith Thompson wrote:
> >> bart c <bart...@gmail.com> writes:
> >> > On Tuesday, 23 August 2022 at 20:17:18 UTC+1, Keith Thompson wrote:
> >> [...]
> >> >> And that's a *lot* of "define"s. It sounds like the code you're
> >> >> looking at is not typical. Your numbers don't match what I see
> >> >> in the latest sqlite3 sources.
> >> >
> >> > Yeah, that looked dodgy. I don't know what's going on there. If I
> >> > compile only sqlite3.c, then I get more reasonable looking figures:
> >> >
> >> > 14010 define
> >> > 08728 if
> >> > 08543 int
> >> > 03115 endif
> >> > 03111 return
> >> > 02658 void
> >> > 02291 else
> >> >
> >> > Note that 10,000 of those defines are inside my windows.h. Also note
> >> >that the same 'if' keyword is used for "'#if".
> >> >
> >> > This is specific to normal C however. We don't know exactly what the
> >> > OP is doing, or what his inputs will be, whether they are already
> >> > preprocessed or not.
> >> Ah, so that's 14010 occurrences of "define" after preprocessing.
> >
> > This is getting dragged down by mention of preprocessing.
> I suggest that 14,000 spurious occurrences of "define" may have skewed
> your results.

Here are the two main files:

https://raw.githubusercontent.com/sal55/langs/master/sqlite3.c

https://github.com/sal55/langs/blob/master/windows.h

I believe the first has some 4000 defines, and (my) windows.h has about 10000. (There will be a few in my standard headers.)

I haven't counted them myself, but the 14,000 figure seems plausible.

> [...]
> >> The point is that if you're processing C source code and recognizing
> >> whether tokens that look like identifiers are keywords or not, it
> >> doesn't make sense to have "define" and "int" in the same table.
> >
> > Well, they *are* in the same table, and all three kinds of "define"
> > share the same slot too. Or does C stipulate how people should
> > organise their symbol tables?
> Of course it doesn't. But I don't know why you'd *want* all three kinds
> of "define" in the same table.

I'm not the kind to go for umpteen different symbol tables in a compiler, perhaps one for every scope level, and having to do more than one looking for any one instance of a name.

There is one global hash table which contains generic versions of all alphanumeric names (keyword or user identifiers or, for C, PP tokens) that occur throughout a program.

The ST lookup that occurs early on is dumb; it knows little about the language, and either returns a generic ST entry for each unique name, or it creates one.

Levels of scope, namespaces, multiple instances of the same name occurring in different contexts, etc are superimposed later.

On Tuesday, August 23, 2022 at 8:36:26 PM UTC-3, bart c wrote:
> On Tuesday, 23 August 2022 at 23:55:43 UTC+1, Thiago Adams wrote:
>
> > Jus to document I also tried something like:
> >
> > if (ch < 'c')
> > {
> > if (ch == 'a') ...
> > if (ch == 'b') ...
> > }
> > else
> > {
> > if (ch == 'c') ...
> > if (ch == 'd') ...
> > ...
> > }
> >
> > That is similar of binary search .
> > We can add more and more ifs like this.
> This is starting to look a little desperate. Such code looks unmaintainable to me (but I assume you will use some tool to generate it).
>
> What is your actual aim here? Granted a simple linear search is not going to be that performant. But pretty much anything else is going to be faster.. So how much faster does it have to be; have you done any measurements? Is checking for keywords mixed in with any other input such as punctuation or numeric literals or strings or comments; what's the bigger picture?

One of my objectives is to understand what compilers can do on switch cases
and to write equivalent code.

This binary search didn't change the performance. My tests are a bunch of files
that takes 9.4s-10s. And I guess all changes I did are changing less than 0..5s so
it hard to see.

This topic is more about exploration and to understand the options.
Also from a compiler point of view. How switch works. Is it a lot of linear ifs?

> (And what is the purpose in just getting a Yes or No answer to whether the current token is a reserved word - what is done with that information? It seems wasteful determining which token you have, then discarding that information.)

The answer I use is returning the token.. but the algorithm didn't change i guess.

Thiago Adams <thiago.adams@gmail.com> writes:
[...]
> This topic is more about exploration and to understand the options.
> Also from a compiler point of view. How switch works. Is it a lot of linear ifs?

The C standard specifies the behavior of a switch statement, not the
code that implements it.

Typically a compiler will generate a jump table if the case values
aren't too spread out, or perhaps a combination of jump tables and
conditional jumps if they are. If you're curious, write a few switch
statements and see what the generated code looks like. Try various
optimization levels, and try various compilers if you have access to
them.

(Many years ago, I crashed a UCSD Pascal compiler by writing a case
statement with cases 1, 10, 100, 1000, and 10000 -- back when 10000 was
a big number. I wouldn't expect a modern compiler to have this problem
even for arbitrarily large case values.)

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Subject	Author
How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Scott Lurndal
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Scott Lurndal
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Scott Lurndal
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	bart c
Re: How to optimize this search?	antispam
Re: How to optimize this search?	bart c
Re: How to optimize this search?	antispam
Re: How to optimize this search?	David Brown
Re: How to optimize this search?	bart c
Re: How to optimize this search?	David Brown
Re: How to optimize this search?	antispam
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Siri Cruise
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Siri Cruise
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	antispam
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Öö Tiib
Re: How to optimize this search?	antispam
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Stefan Ram
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	David Brown
Re: How to optimize this search?	William Ahern
Re: How to optimize this search?	Bonita Montero
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Bonita Montero
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Bonita Montero
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Bonita Montero
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Bonita Montero
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	antispam
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Tim Rentsch
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Stefan Ram
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	bart c
Re: How to optimize this search?	antispam
Re: How to optimize this search?	Keith Thompson
Re: How to optimize this search?	David Brown
Re: How to optimize this search?	bart c
Re: How to optimize this search?	antispam
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Ben Bacarisse
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Anton Shepelev
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	Thiago Adams
Re: How to optimize this search?	bart c
Re: How to optimize this search?	Anton Shepelev