novaBBS - comp.lang.tcl - Re: Help with sqlite statement

Hey, since you people seem to be in the mood for sharing wisdom,
I would like to ask you about something else.

I was doing some kind of test that was very repetitive, testing
multiple possible conditions and assessing the results. This test
had to read from a few hundred thousand database lines.

Yes, that many. Please don't ask. Let's just say it's a good number
crunching workout. I think I'm going to call this one TkCruncher.

Anyway, I wasn't very happy about the performance. I thought it
was too slow. Like, almost 10 minutes for all passes and iterations.
And that was using only one table and only one whole table lookup
at each pass. I was disappointed.

So I had this bold idea. The script now reads THE ENTIRE DATABASE,
five tables in total, at startup. And everything is stored in
arrays. The time it takes to run all the tests now dropped from
almost 10 minutes to less than one.

Another benefit is that selecting from multiple tables has become
a lot simpler compared to how complicated SQL joins can be. A
little proc with sugar to parse my arrays and the code certainly
looks better now.

I tested the lookups on Tkcon and well, every look up returns the
result instantly. I was doing 'for' loops over multiple hundred
thousand lines on each look up and never felt like there was any
delay at all so bravo, Tcl!

The entire .db file is only 80MB so I figure that's how much RAM
I'm using to slurp the whole database into arrays?

Anyway, do you think there could be some hazard or pitfall in my
way for doing this? What do the elder of the village say?

--
Luc
>>

Luc <no@no.no> wrote:
> Hey, since you people seem to be in the mood for sharing wisdom,
> I would like to ask you about something else.
>
> I was doing some kind of test that was very repetitive, testing
> multiple possible conditions and assessing the results. This test
> had to read from a few hundred thousand database lines.
>
> Yes, that many. Please don't ask. Let's just say it's a good number
> crunching workout. I think I'm going to call this one TkCruncher.
>
> Anyway, I wasn't very happy about the performance. I thought it
> was too slow. Like, almost 10 minutes for all passes and iterations.
> And that was using only one table and only one whole table lookup
> at each pass. I was disappointed.

Without you telling us more details, at best we can only offer common
pitfalls.

First question: Are you talking about an Sqlite db, or another
(Oracle/PostgreSQL/Mysql/etc.) db?

One very common pitfall: Did you create indexes on the columns that
were part of the 'where' clauses used for selecting rows from the
database?

Without indexes, the database engine will be doing what is known as
'full table scans' - looking through every row in the table being
selected from for any rows that match the 'where' clause limits. Full
table scans, esp. if you repeat them for every select, produce in the
end an O(N^2) complexity factor.

> So I had this bold idea. The script now reads THE ENTIRE DATABASE,
> five tables in total, at startup. And everything is stored in
> arrays. The time it takes to run all the tests now dropped from
> almost 10 minutes to less than one.

This implies you had no indexes -- but you've not told us enough to
know any more than that.

> Another benefit is that selecting from multiple tables has become
> a lot simpler compared to how complicated SQL joins can be. A
> little proc with sugar to parse my arrays and the code certainly
> looks better now.

Don't feel so happy on that one. For simple cartesian joins, yes, your
code might look simpler than the corresponding SQL statement. But add
in grouping and 'with' clauses to limit by the grouped elements, and
suddenly the 'code' you'll need to write to produce the equivalent
results to a straightforward SQL statement will dwarf the SQL
statement, and look nowhere nearly as simple as the SQL.

> The entire .db file is only 80MB so I figure that's how much RAM
> I'm using to slurp the whole database into arrays?

Not necessarily. It might be more, it might be less. More if the
Tcl_Obj overhead is a significant portion of the total data size.
Could be less if the db has a lot of indexes that you've not replicated
in your in memory version.

> Anyway, do you think there could be some hazard or pitfall in my
> way for doing this? What do the elder of the village say?

Yes, you could eventually encounter a dataset too large to "slurp" into
RAM all up front.

You also may end up writing half (or better) of a poorly optimized SQL
query engine to satisfy lookups or data output patterns that in SQL
are simply the addition of another statement to the query.

Long term, if your project will grow to do more than just exactly what
it does now, you'll be 'foot-gunning' yourself more often than not this
way.

On Sat, 19 Nov 2022 23:07:09 -0000 (UTC), Rich wrote:

> First question: Are you talking about an Sqlite db, or another
> (Oracle/PostgreSQL/Mysql/etc.) db?

Strictly Sqlite. Nothing corporate. Very humble home desktop application.

> One very common pitfall: Did you create indexes on the columns that
> were part of the 'where' clauses used for selecting rows from the
> database?
>
> Without indexes, the database engine will be doing what is known as
> 'full table scans' - looking through every row in the table being
> selected from for any rows that match the 'where' clause limits. Full
> table scans, esp. if you repeat them for every select, produce in the
> end an O(N^2) complexity factor.

Excuse me, did you say 'indexes' in the plural? Yes, you did. How many
am I supposed to create then? It's an honest question. I suspect I have
been overestimating the efficiency of databases. Do I have to create one
index per column?

My database has an index, but it's a 'unique' column which I thought
would be good enough because I admit I don't fully understand indices.
That index is not used in my WHERE queries.

So, 1) How many indices do I have to create? 2) Should I use indices
or views? I did some reading at bed time yesterday and found this page
(among many others):

https://stackoverflow.com/questions/439056/is-a-view-faster-than-a-simple-query

You may skip to "Update 2" where he mentions Lithuania. He argues that
indices may not be the cat's pajamas after all.

> > Anyway, do you think there could be some hazard or pitfall in my
> > way for doing this? What do the elder of the village say?
>
> Yes, you could eventually encounter a dataset too large to "slurp" into
> RAM all up front.

If hundreds of thousands of lines take up 80MB, how many lines does it
take to cause someone to run out of RAM with modern computers?
How does a database like Sqlite manage its own RAM use footprint?

--
Luc
>>

Luc <no@no.no> wrote:
> On Sat, 19 Nov 2022 23:07:09 -0000 (UTC), Rich wrote:
>
>> First question: Are you talking about an Sqlite db, or another
>> (Oracle/PostgreSQL/Mysql/etc.) db?
>
> Strictly Sqlite. Nothing corporate. Very humble home desktop application.

Neither PostgreSQL nor Mysql imply 'corporate' -- both being 'free to
use'. Oracle, well, at the price for an oracle license, yes, it would
imply corporate.

>> One very common pitfall: Did you create indexes on the columns that
>> were part of the 'where' clauses used for selecting rows from the
>> database?
>>
>> Without indexes, the database engine will be doing what is known as
>> 'full table scans' - looking through every row in the table being
>> selected from for any rows that match the 'where' clause limits.
>> Full table scans, esp. if you repeat them for every select, produce
>> in the end an O(N^2) complexity factor.
>
> Excuse me, did you say 'indexes' in the plural?

Yes, intentionally so.

> Yes, you did. How many am I supposed to create then?

As many as are necessary for your needs (yes, I know, a vague answer,
but it /is/ the answer I can give, since you've revealed insufficient
information about your actual design for me to provide more than that
level of 'general answer'.

> It's an honest question. I suspect I have been overestimating the
> efficiency of databases. Do I have to create one index per column?

The answer there depends upon what kinds of queries you plan to run.

For example, lets take a 'music database' (i.e., catalog of CD's, vinyl
records, compressed audio files). You might have a table like (note,
all made up here on the fly):

create table music (artist text, album text, disk_num text,
song text, format text, shelf text, row integer,
column integer);

Where 'format' might be "CD" or "Vinyl" or "mp3" or "flac".

If all you ever did was retrieve all records from this table, you would
'need' zero indexes (note, I'm skipping the 'need' for a unique index
if you want to avoid duplicate rows).

But, lets say you often want to retrieve just your collection of Vinyl
entries. You'd run this query:

db eval {select * from music where format = 'Vinyl';}

And with on indexes, that would have to look at every row to decide
which rows had 'Vinyl' in the format column.

But, if you indexed the format column:

db eval {create index format_idx on music(format);}

Then the same query can utilize the index to (essentially) directly
retrieve only the rows with "Vinyl" as format, without ever looking at
any other rows.

If you commonly want to look for "music by artist A, and in flac
format" (note, made up query again), you would do:

db eval {select * from music where artist = 'Frank Sinatra' and
format = 'flac';}

Again, with no index this looks through the entire set of rows. But
with an index:

db eval {create index my_new_idx on music(artist, format);}

Then the same query can retrieve all of Frank's music, that is in flac
format, without looking at all the other rows (and it is the 'without
looking at the other rows' that provides the speedup, it literally does
not look at the other non-Sinatra rows).

So the answer to your question is: you need indexes that cover the
queries you are performing and that you want to speed up. But we don't
know enough details to recommend more than that.

> My database has an index, but it's a 'unique' column which I thought
> would be good enough because I admit I don't fully understand indices.
> That index is not used in my WHERE queries.

If the unique column is never part of a where clause, it is never used
for a retrieval query.

> So, 1) How many indices do I have to create?

The number necessary to cover your slow queries.

> 2) Should I use indices or views?

Given that your questions imply this is your initial foray into the
world of SQL queries, just stick to indexes for the moment. You can
level-up to views once you've become more proficient at the next level
down basics.

> I did some reading at bed time
> yesterday and found this page (among many others):
>
> https://stackoverflow.com/questions/439056/is-a-view-faster-than-a-simple-query
>
> You may skip to "Update 2" where he mentions Lithuania. He argues that
> indices may not be the cat's pajamas after all.

He's discussing a rather advanced 'view' there where the view looks to
be an actual materialized view containing only the subset of the
records to which the create view statement pertains. This is several
levels up in SQL knowledge from where your postings imply you are now,
and likely unnecessary for what you are trying to do. Second, I'm not
sure Sqlite has added support for materialized views, and if not, you
can't (yet) do with sqlite what this poster is discussing.

>> > Anyway, do you think there could be some hazard or pitfall in my
>> > way for doing this? What do the elder of the village say?
>>
>> Yes, you could eventually encounter a dataset too large to "slurp"
>> into RAM all up front.
>
> If hundreds of thousands of lines take up 80MB, how many lines does it
> take to cause someone to run out of RAM with modern computers?

Note that 80MB on disk does not equate to 80MB of RAM if you hoover
everything into TCL lists or arrays. This is due to Tcl_Obj overhead
(see https://www.tcl.tk/man/tcl/TclLib/Object.html for the C definition
of a Tcl_Obj). On a 64-bit machine, this is somewhere around at least
5*8bytes (40 bytes) of overhead for each Tcl_Obj. If you retrieve each
row as a single string, and keep it a single string, then your in Tcl
memory footprint will be the byte length of each string, plus 40 bytes
per string, so for 100,000 rows you'd have 4,000,000 bytes just in
Tcl_Obj overhead.

If, instead, you convert each row into a nested list, and you have
100,000 rows, each row having 8 columns, then you get somewhere around
100000*40 + 100000*8*40 or 36,000,000 bytes for the Tcl_Obj overhead (each
row is a list, 40 byte Tcl_Obj per row, each element of each row is
itself another Tcl_Obj, 40 bytes each). And that does not count the
space taken up by the actual data in each list element.

As for "how much does your dataset take" -- well on that point you'll
have to do some of the math yourself.

> How does a database like Sqlite manage its own RAM use footprint?

You'd have to peruse the Sqlite source to know that with certainty.
But one way it does is by not (intentionally) hoovering everything into
ram up-front. Now, with sufficient spare cache, eventually much of the
file will be cached by the OS, giving the effect of "it is all in RAM
now" as far as disk reads go, but OS caching of the file is not Sqlite
intentionally "hoovering it all in up front".

Air is water with holes in it.

devel / comp.lang.tcl / Re: Help with sqlite statement

Subject	Author
Help with sqlite statement	clt.to.davebr
Help with sqlite statement	Luc
Help with sqlite statement	Rich
Help with sqlite statement	Luc
Help with sqlite statement	Rich