novaBBS - comp.lang.tcl - Re: Retrieving data from a thread

Re: Retrieving data from a thread

<ulceul$sqfl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=12571&group=comp.lang.tcl#12571

Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ric...@example.invalid (Rich)
Newsgroups: comp.lang.tcl
Subject: Re: Retrieving data from a thread
Date: Wed, 13 Dec 2023 14:26:29 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <ulceul$sqfl$1@dont-email.me>
References: <20231211183416.1833a560@lud1.home> <ul8984$3c5v9$1@dont-email.me> <20231211225610.25a5d497@lud1.home> <CzZdN.9642$xHn7.5747@fx14.iad> <20231212131917.0f86250b@lud1.home> <ula2ge$3nmp9$2@dont-email.me> <20231212185913.56b31be2@lud1.home> <ulasgl$3rq5s$1@dont-email.me> <20231212213644.66ceb2c4@lud1.home> <20231212215114.4c6d08f3@lud1.home> <ulb6oc$3svjp$2@dont-email.me> <20231213085435.67ed5c24@lud1.home>
Injection-Date: Wed, 13 Dec 2023 14:26:29 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b8c13c7efcbba3215cafec4ee4078253";
logging-data="944629"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18wnu8EtADGRr2yFmDaz42r"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.117 (x86_64))
Cancel-Lock: sha1:m+bjH8Lhh39gRvycjCTeJnD6hNM=

by: Rich - Wed, 13 Dec 2023 14:26 UTC

Luc <luc@sep.invalid> wrote:
> On Wed, 13 Dec 2023 03:00:28 -0000 (UTC), Rich wrote:
>
>>When you use "-sorted", ::BIGLIST is, in fact, sorted, right?
>>
>>> Visibly faster, but only 3 out of 10 words found.
>>>
>>> Not good.
>>
>>Given the reduction in hits, this implies you do not have ::BIGLIST
>>sorted.
> **************************
>
> ::BIGLIST is slurped straight from the file which was a merge of multiple
> word lists and dictionaries I found here and there, then sorted with
> sort -u to remove the duplicates.
>
> So it is sorted, but I guess it's not sorted in the way that lsearch
> expects.

That would be my guess.

Slurp it in, sort it with lsort, then save the result to a second file
(note you'll want to use [join] and join with \n when saving to the
second file).

Then run "diff -u" on both files (and being 1m lines, pipe the result
to less) and see if there is a difference, and what is different.
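
A minimal sketch of that check, assuming one word per line and the
hypothetical file names words.txt and words.lsort.txt (neither name
appears in the thread). The likely cause of a difference is that sort(1)
collates by the current locale, while lsort compares by character code by
default, so capitalized and accented words can land in different places.

```tcl
# Read a word list (one word per line) into a Tcl list.
proc slurpWords {path} {
    set fh [open $path r]
    set data [read -nonewline $fh]
    close $fh
    return [split $data \n]
}

# Write a list back out, one word per line.
proc saveWords {path words} {
    set fh [open $path w]
    puts $fh [join $words \n]
    close $fh
}

# Hypothetical file names, used only for illustration.
set in  words.txt
set out words.lsort.txt
if {[file exists $in]} {
    saveWords $out [lsort [slurpWords $in]]
    # Then, from a shell:  diff -u words.txt words.lsort.txt | less
}
```

If the diff is empty, the ordering is not the problem; if it is not, the
file was sorted under different rules than the ones lsearch -sorted
assumes.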

Re: Retrieving data from a thread

<ulcghh$t32g$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=12572&group=comp.lang.tcl#12572

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ric...@example.invalid (Rich)
Newsgroups: comp.lang.tcl
Subject: Re: Retrieving data from a thread
Date: Wed, 13 Dec 2023 14:53:37 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <ulcghh$t32g$1@dont-email.me>
References: <20231211183416.1833a560@lud1.home> <ul8984$3c5v9$1@dont-email.me> <20231211225610.25a5d497@lud1.home> <CzZdN.9642$xHn7.5747@fx14.iad> <20231212131917.0f86250b@lud1.home> <ula2ge$3nmp9$2@dont-email.me> <20231212185913.56b31be2@lud1.home> <ulasgl$3rq5s$1@dont-email.me> <20231212213644.66ceb2c4@lud1.home> <20231212215114.4c6d08f3@lud1.home> <ulb6oc$3svjp$2@dont-email.me> <20231213085435.67ed5c24@lud1.home> <20231213091543.291e75d8@lud1.home>
Injection-Date: Wed, 13 Dec 2023 14:53:37 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b8c13c7efcbba3215cafec4ee4078253";
logging-data="953424"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+9EOl7dE3Vi8B47ORyp5gZ"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.117 (x86_64))
Cancel-Lock: sha1:920/jIS7zzn6KzkQXeQkpT5DkXE=

by: Rich - Wed, 13 Dec 2023 14:53 UTC

Luc <luc@sep.invalid> wrote:
> On Wed, 13 Dec 2023 08:54:35 -0300, Luc wrote:
>
>>::BIGLIST is slurped straight from the file which was a merge of multiple
>>word lists and dictionaries I found here and there, then sorted with
>>sort -u to remove the duplicates.
>>
>>So it is sorted, but I guess it's not sorted in the way that lsearch
>>expects.
>>
> **************************
>
> Well, I added an lsort step to the file slurp procedure and now using
> lsearch -nocase -sorted yields all the expected search hits.

Do note that if you lsort before saving, then you don't have to lsort
upon loading until such time as you add more words.
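
A sketch of that split between a one-time preparation step and the
everyday load, under the assumption that lookups use lsearch -sorted
-nocase, in which case the list should be sorted with the matching
lsort -nocase. The proc names are hypothetical.

```tcl
# One-time preparation: sort case-insensitively and save, one word per line.
proc prepareWordFile {rawWords path} {
    set fh [open $path w]
    puts $fh [join [lsort -nocase $rawWords] \n]
    close $fh
}

# Every later start-up: read and split only. The file is already in the
# order that [lsearch -sorted -nocase] expects, so no lsort is needed.
proc loadSortedWords {path} {
    set fh [open $path r]
    set words [split [read -nonewline $fh] \n]
    close $fh
    return $words
}

# Binary search; returns 1 if the word is in the list, else 0.
# -exact is deliberately left out so that -sorted stays the search mode.
proc knownWord {words word} {
    expr {[lsearch -sorted -nocase $words $word] >= 0}
}
```

Once words are appended to the in-memory list, it has to be sorted again
(and ideally saved again) before the next -sorted search.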

Re: Retrieving data from a thread

<uld7vr$10ts6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=12575&group=comp.lang.tcl#12575

Path: i2pn2.org!i2pn.org!nntp.comgw.net!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: et9...@rocketship1.me (et99)
Newsgroups: comp.lang.tcl
Subject: Re: Retrieving data from a thread
Date: Wed, 13 Dec 2023 13:33:45 -0800
Organization: A noiseless patient Spider
Lines: 125
Message-ID: <uld7vr$10ts6$1@dont-email.me>
References: <20231211183416.1833a560@lud1.home> <ul8984$3c5v9$1@dont-email.me>
<20231211225610.25a5d497@lud1.home> <CzZdN.9642$xHn7.5747@fx14.iad>
<20231212131917.0f86250b@lud1.home> <ula2ge$3nmp9$2@dont-email.me>
<20231212185913.56b31be2@lud1.home> <ulasgl$3rq5s$1@dont-email.me>
<20231212213644.66ceb2c4@lud1.home> <20231212215114.4c6d08f3@lud1.home>
<ulb4s7$3sqqt$1@dont-email.me> <20231213091038.100989ea@lud1.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 13 Dec 2023 21:33:47 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7756ec1fcc0f7fe281f4f4149240e3b4";
logging-data="1079174"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Pz0sIP956HjWpEDmYB+PO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Cancel-Lock: sha1:aNlEzMb6eJ6GgBUERWjYsmBn7IQ=
In-Reply-To: <20231213091038.100989ea@lud1.home>
Content-Language: en-US

by: et99 - Wed, 13 Dec 2023 21:33 UTC

On 12/13/2023 4:10 AM, Luc wrote:
> On Tue, 12 Dec 2023 18:28:23 -0800, et99 wrote:
>
>> What I meant by pre-processing was to take your list as cleaned up,
>> sorted, etc. and write it out, once. Thereafter, you could use the
>> read/split to restore it to memory quickly.
>>
>> If, however, you are going to be adding words during a run, you could just
>> keep 2 lists. The second list would likely be very short if added by the
>> user during a session. Merging new words in might be a pain, and
>> re-sorting the entire list likewise.
>>
>> On the other hand, this is a plus for using the array, since order isn't
>> important there, as it's just hashing them.
>>
>> But are you also going to let the user do a "save dictionary" after adding
>> in new words? Programs never do stay simple :)
>>
> **************************
>
> Well, yes. It's done and in production already.
>
> You see, names are simple. They have to begin with a capital letter.
>
> But "begin" means it can be either Mary or MARY. For that I need some
> kind of -nocase parameter or one normalization step plus a second lookup.
> That may or may not defeat the superior speed of array lookups or more
> likely just make the difference less meaningful.
>
> Common words are less simple. In the beginning of a sentence, they must
> begin with a capital letter. In the middle of a sentence, they must
> begin with a small letter. But in either case it may be all upper case
> too.
>
> The shortest route I could think of was two lists: things and names.
> 1. Search in the first list with no case and that's it.
> 2. Not found? Search in the second list as is and that's it.
> 3. Still not found? Capitalize it and look for it again in the list
> of names.
>
> In case you're wondering, the problem of capitalizing words (or not)
> according to punctuation is taken care of by a completely different proc
> that does auto correct according to another list. I actually use the
> concept of auto correct to auto expand abbreviations and type faster.
> That proc takes care of capitalization according to punctuation.
> In a public application that would not be good enough, but since this
> is for private use and is working as intended, I won't bother fixing
> what ain't broken.
>
> But another problem comes up.
>
> In my current design, boxes with any problem cannot be approved and I am
> not allowed to jump to the next one until the problem is properly fixed.
> A "problem" currently means too many characters or an empty box. Empty
> boxes may be desirable in certain circumstances so there is a "force"
> command (and key shortcut) in case I want to override it. Misspellings
> will just be a third kind of problem.
>
> Workflow speed is always a priority with this thing so I implemented the
> possibility of a double override action. The first override key press
> will add all unknown words to the word list and the second override will
> "approve and move forward."
>
> But then I can't distinguish things from names. I can, but I guess I
> would have to introduce a pop-up to decide which one every time. That
> would slow things down. I thought that maybe it would be better to just
> use one global word list and take care of casing with my own human
> proofreading.
>
> Then again, unknown words are highly likely to be proper names so I
> decided to detect their case and send them straight to the names list
> if they are written with a capital letter whether it's a name or not.
> If they are not a name and happen to show up again in small letters,
> then I will add them again, in which case they will go to the word list.
>
> Now words or names are always added twice: to the list in memory and
> appended to the file on disk.
>

On your 3 step approach:

What about words that can be both, like Cat Stevens and I have a cat; Drew and drew a picture. Will you accept the user's case choice in that situation?

I see this as two different problems: checking spelling and checking capitalization. You mention you already have separate procs for this.
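
For reference, the three-step lookup quoted above might look roughly like
this. It is only a sketch: the proc name and the sample lists are
hypothetical, and it assumes step 3 applies only to words that already
start with a capital letter, so that MARY matches Mary while mary is still
flagged.

```tcl
# things: list of common words; names: list of proper names.
# Returns 1 if the word is accepted, 0 if it should be flagged.
proc spelledOk {things names word} {
    # 1. Common words: case-insensitive match.
    if {[lsearch -exact -nocase $things $word] >= 0} { return 1 }
    # 2. Names: match exactly as typed.
    if {[lsearch -exact $names $word] >= 0} { return 1 }
    # 3. Assumption: retry in title case only if the word already begins
    #    with a capital letter (e.g. MARY is retried as Mary).
    if {[string is upper -strict [string index $word 0]]} {
        return [expr {[lsearch -exact $names [string totitle $word]] >= 0}]
    }
    return 0
}

# Hypothetical sample data
set things {cat drew}
set names  {Mary Drew}
puts [spelledOk $things $names MARY]
puts [spelledOk $things $names mary]
```

With this sample data, both "Drew" and "drew" are accepted, because the
word is in both lists.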

As to sorting your list(s). What is the purpose other than to make lsearch lookup faster? As to doing a -nocase search, how is that any different than maintaining your list(s) in a single case and converting to that case before lookup?

So, two choices, use an array or use a list.

Here's my check list:

List:

Must be kept sorted for speed.
Adding words needs to re-sort or merge.
Duplicates could be a problem (but I'm not sure)
Can have more than 1 list

Array:

No need to sort for speed
Need to keep in one case (say lower) and convert to (lower) before lookup
Adding words is no problem, just make it lowercase first
No problems with duplicates, adding the same word is a no-op
Can have multiple arrays

As to the need to keep an in-memory copy plus a disk file, that would mean:

List:

Have to re-sort when adding new words

Array:

Just add any new word to the end of the file. No need to sort.
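
The array side of this checklist can be sketched as follows. The array
name, the proc names and the optional file path are hypothetical; the
array is used purely as a set of lowercase keys.

```tcl
# Lowercase-keyed array used as a set of known words.
array set dictionary {}

# Add a word; returns 1 if it was new, 0 if it was already present.
# If a path is given, the new word is also appended to that file.
proc addWord {word {path ""}} {
    global dictionary
    set key [string tolower $word]
    if {[info exists dictionary($key)]} { return 0 }   ;# duplicate: no-op
    set dictionary($key) 1
    if {$path ne ""} {
        set fh [open $path a]
        puts $fh $key
        close $fh
    }
    return 1
}

# Case-insensitive membership test: normalize, then one hash lookup.
proc hasWord {word} {
    global dictionary
    info exists dictionary([string tolower $word])
}
```

Because every key is lowercased, this variant cannot tell "Mary" from
"mary"; keeping names in a second, case-preserving array would be one way
around that.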

As to loading up with a separate thread: the tsv variables are arrays. A 2nd thread can load up a tsv array from a file, and the main thread can see that shared array without needing to do lookups in the second thread (as I mistakenly wrote in a prior post).

My timings for tsv arrays with [tsv::exists dictionary $word] show it's just as fast as a non-shared array.
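
A minimal sketch of that arrangement, assuming the Thread package is
available, a hypothetical file words.txt with one word per line, and
lowercase keys. The file name is handed to the loader through a second
tsv array (here called config, also hypothetical).

```tcl
package require Thread

# Hypothetical word file; passed to the loader thread via a tsv array.
tsv::set config wordfile words.txt

# The loader fills the shared array "dictionary" and then exits.
set loader [thread::create -joinable {
    set path [tsv::get config wordfile]
    if {[file exists $path]} {
        set fh [open $path r]
        foreach word [split [read -nonewline $fh] \n] {
            tsv::set dictionary [string tolower $word] 1
        }
        close $fh
    }
}]

# The main thread can do other start-up work here, then wait for the
# loader before the first lookup.
thread::join $loader

# Lookup runs in the calling thread, directly against the shared array.
proc known {word} {
    tsv::exists dictionary [string tolower $word]
}
```

In a Tk application, blocking in thread::join would freeze the GUI, so a
real program would more likely have the loader notify the main thread when
it is done.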

Re: Retrieving data from a thread

<20231213193327.0e499922@lud1.home>

https://www.novabbs.com/devel/article-flat.php?id=12577&group=comp.lang.tcl#12577

Path: i2pn2.org!i2pn.org!news.hispagatos.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: luc...@sep.invalid (Luc)
Newsgroups: comp.lang.tcl
Subject: Re: Retrieving data from a thread
Date: Wed, 13 Dec 2023 19:33:27 -0300
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <20231213193327.0e499922@lud1.home>
References: <20231211183416.1833a560@lud1.home>
<ul8984$3c5v9$1@dont-email.me>
<20231211225610.25a5d497@lud1.home>
<CzZdN.9642$xHn7.5747@fx14.iad>
<20231212131917.0f86250b@lud1.home>
<ula2ge$3nmp9$2@dont-email.me>
<20231212185913.56b31be2@lud1.home>
<ulasgl$3rq5s$1@dont-email.me>
<20231212213644.66ceb2c4@lud1.home>
<20231212215114.4c6d08f3@lud1.home>
<ulb4s7$3sqqt$1@dont-email.me>
<20231213091038.100989ea@lud1.home>
<uld7vr$10ts6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: dont-email.me; posting-host="e2ad7124d718a2de3cbc5b1a6927bfb6";
logging-data="1089256"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+ssnS01fiT+gtWWKxpKP3S9TBiC+hKz5k="
Cancel-Lock: sha1:GW0mCOmHpvw3xksYHtw8A2Mo/Gg=

by: Luc - Wed, 13 Dec 2023 22:33 UTC

On Wed, 13 Dec 2023 13:33:45 -0800, et99 wrote:

>On your 3 step approach:
>
>What about words that can be both, like Cat Stevens and I have a cat; Drew
>and drew a picture. Will you accept the user's case choice in that
>situation?

First, the user (aka "yours truly") gets a warning that a certain word is
unknown. The user decides whether that word really is a typo or a new word
that should be added to the spelling dictionary. It's a human in control.
Which is good.

My word list is very thorough. I had a little fun looking for weird words
and only two or three were missing and all of them were slang. I had to
cheat a little to find something missing. So what kind of words will be
flagged as unknown although being correct? Names. I reckon about 99% of
the time.

Remember, I want to do things fast for optimum productivity. Being able
to press one key and add the word to its proper list automatically is
considered valuable. If the word begins with a capital letter, it's even
more likely to be a name. The odds of an unknown word that is not a name
being the first one in a sentence are very slim. But if it doesn't have a
capital letter then it has to be a word.

So maybe it is not a name. So maybe it was the first word in a sentence.
Well, it's a very rare exception and I don't mind adding it to the list
of names. The list won't be "tarnished." I don't care. The worst that
can happen is it shows up again (hopefully in small letters this time) and
I have to add it again. No big deal. It's a very rare occurrence.

The separation of things and names is good because it won't flag "Mary,"
but it will flag "mary." I do want that feature. The only problem is it
will not flag "john" (in case it refers to the restroom), but oh well,
this is a typical corner case and no spelling checking system is perfect.
I know; I have worked with a lot of them. They always need human
supervision.

>So, two choices, use an array or use a list.

I will investigate the choice of array, but later. I have other
priorities right now. Like I said, I don't have a problem with lookup
speed. I had a problem with loading the list off a file at program
launch, but your code really solved that one. :-)

Threads are going to the back burner too. I don't need them now. I
copied and saved all the advice that was shared here. I will have another
look at it eventually. I want to learn it.

--
Luc