Subject  Author
* Allocate length of word vs fixed length  dfs
+* Re: Allocate length of word vs fixed length  Bart
|`* Re: Allocate length of word vs fixed length  dfs
| +* Re: Allocate length of word vs fixed length  Bart
| |`* Re: Allocate length of word vs fixed length  dfs
| | `- Re: Allocate length of word vs fixed length  Bart
| +* Re: Allocate length of word vs fixed length  Scott Lurndal
| |`- Re: Allocate length of word vs fixed length  Branimir Maksimovic
| +* Re: Allocate length of word vs fixed length  Real Troll
| |`* Re: Allocate length of word vs fixed length  DFS
| | `- Re: Allocate length of word vs fixed length  Real Troll
| `- Re: Allocate length of word vs fixed length  Kaz Kylheku
+* Re: Allocate length of word vs fixed length  David Brown
|`- Re: Allocate length of word vs fixed length  dfs
+* Re: Allocate length of word vs fixed length  Stefan Ram
|+* Re: Allocate length of word vs fixed length  dfs
||`* Re: Allocate length of word vs fixed length  Bart
|| `- Re: Allocate length of word vs fixed length  dfs
|`* Re: Allocate length of word vs fixed length  Real Troll
| `* Re: Allocate length of word vs fixed length  dfs
|  `- Re: Allocate length of word vs fixed length  dfs
+* Re: Allocate length of word vs fixed length  Ike Naar
|`* Re: Allocate length of word vs fixed length  dfs
| `- Re: Allocate length of word vs fixed length  Ike Naar
`- Re: Allocate length of word vs fixed length  Peter 'Shaggy' Haywood

Allocate length of word vs fixed length

<U3iJI.61694$VU3.5811@fx46.iad>

https://www.novabbs.com/devel/article-flat.php?id=17552&group=comp.lang.c#17552
 by: dfs - Mon, 19 Jul 2021 16:50 UTC

Loaded a list of words into an array.

The 370103 words came from https://github.com/dwyl/english-words
file = words_alpha.txt

Tested a couple memory allocation 'strategies' during loading:

1. allocate the strlen() of each word
2. allocate a fixed length (len of longest word = 32)

I figured getting strlen(word) each time would be slower than allocating
a fixed amt of memory, but that wasn't the case.

Strategy 2 is significantly slower, and uses nearly 3x the memory.

Also, does anyone have any 'tricks' to make such a file load routine
faster/more efficient? Thanks

===========================================================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define maxlen 32

//removes trailing isspaces and quote marks
char *rtrim(char *str)
{
    int len = strlen(str);
    while(len>0 && (isspace(str[len-1]) || str[len-1] == '\"')) {len--;}
    str[len] = '\0';
    return str;
}

int main(int argc, char *argv[])
{

    //vars
    int i = 0, lines = 0;
    char word[maxlen] = "";
    char wordin[maxlen] = "";

    //open file, count lines
    FILE *fwords = fopen(argv[1],"r");
    while(fgets(wordin,sizeof wordin,fwords)!=NULL) {lines++;}
    //Geany adds a line feed to the very last line, so the true
    //line count is overstated
    lines -= 1;
    //mem
    char **words = malloc(sizeof(char*) * lines);

    //add words to array
    int memoption = atoi(argv[2]);
    int memlen = 0, memtotal = 0;
    rewind(fwords);
    while(fgets(wordin,sizeof wordin,fwords)!=NULL)
    {

        strcpy(word,rtrim(wordin));

        //different memory allocations
        if(memoption==1)
        {memlen=strlen(word);}
        else
        {memlen=maxlen;}

        words[i] = malloc(sizeof(char*) * memlen);
        memtotal += memlen;

        strcpy(words[i],word);

        //printf("%d. '%s' ",i,words[i]);
        i++;
    }

    fclose(fwords);
    printf("\nloaded %d words\n",i-1);
    printf("\nFirst and Last words = '%s', '%s'\n",words[0],words[i-1]);
    printf("malloc option %d, total memory %d\n",memoption,memtotal);
    free(words);
    return(0);
}
===========================================================

Re: Allocate length of word vs fixed length

<sd4em5$694$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17553&group=comp.lang.c#17553
 by: Bart - Mon, 19 Jul 2021 18:01 UTC

On 19/07/2021 17:50, dfs wrote:
> Loaded a list of words into an array.
>
> The 370103 words came from https://github.com/dwyl/english-words
> file = words_alpha.txt
>
> Tested a couple memory allocation 'strategies' during loading:
>
> 1. allocate the strlen() of each word
> 2. allocate a fixed length (len of longest word = 32)
>
> I figured getting strlen(word) each time would be slower than allocating
> a fixed amt of memory, but that wasn't the case.
>
> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
> Also, does anyone have any 'tricks' to make such a file load routine
> faster/more efficient?  Thanks

How fast do you want it?

Your program loaded 2.3 million words (5 combined copies of your test
file) in about 1.25 seconds (on my slow PC with hard drive, but using
file caching).

A faster version I created did it in about 0.2 seconds (so perhaps
50msec for one copy).

That uses this method:

(1) Get the file size (using seek etc)

(2) Allocate a single memory block and load entire file in one go

(3) Do a first pass counting words (assumes one per line), by looking
for \n characters and setting each to nul

(4) Use that figure to allocate a linear array of char* objects in one
memory block

(5) Do a second pass which sets each char* to the start of the word,
then steps a pointer to just past the next nul for the next word.

You shouldn't be bothering with trailing spaces and such here; do that
once, and write out a new cleaned-up list. Then apps will read that list
with no further processing.

(I didn't have time to do a C version; the following outlines my method
using another systems language:)

------------------------------------------
proc start=
    ichar s,t
    int nwords:=0, c
    ref[]ichar words

    s:=readfile("/texts/words5.")
    t:=s

    while c:=t^ do
        if c=10 then
            ++nwords
            t^:=0
        fi
        ++t
    od

    words:=malloc((nwords+2)*ichar.bytes)

    t:=s
    for i in 1..nwords do
        words[i]:=t
        repeat
        until (++t)^=0
        ++t
    od

    for i in 1..nwords do
        if i in 1..10 or i in nwords-9..nwords then
            println i,words[i]
        fi
    od
end
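
For readers who want the five steps above in C, here is a rough sketch. It is not Bart's code: the function name load_words and its signature are invented for illustration, it assumes LF line endings (a CRLF file would leave a trailing '\r' on each word), and it relies on fseek/ftell reporting the size of a binary stream, which works on common platforms but is not guaranteed by the C standard.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file into one block, splits it in
   place on '\n', and returns an array of pointers into that block.
   On success *buf_out owns the text and the return value owns the pointer
   array; the caller frees both. Returns NULL on any failure. */
char **load_words(const char *path, char **buf_out, size_t *count_out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    /* (1) get the file size by seeking to the end */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    /* (2) one block, one read; +1 leaves room for a final NUL */
    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    buf[got] = '\0';

    /* (3) first pass: count lines, turning each '\n' into NUL */
    size_t n = 0;
    for (size_t k = 0; k < got; k++) {
        if (buf[k] == '\n') { buf[k] = '\0'; n++; }
    }
    if (got > 0 && buf[got - 1] != '\0') n++;  /* last line had no '\n' */

    /* (4) one allocation for all the pointers */
    char **words = malloc((n ? n : 1) * sizeof *words);
    if (words == NULL) { free(buf); return NULL; }

    /* (5) second pass: point at each word, then step past its NUL */
    char *t = buf;
    for (size_t k = 0; k < n; k++) {
        words[k] = t;
        while (*t != '\0') t++;
        t++;
    }

    *buf_out = buf;
    *count_out = n;
    return words;
}
```

The whole load costs two allocations regardless of word count, and freeing the list is two calls to free().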

Re: Allocate length of word vs fixed length

<OvjJI.64366$Vv6.35081@fx45.iad>

https://www.novabbs.com/devel/article-flat.php?id=17554&group=comp.lang.c#17554
 by: dfs - Mon, 19 Jul 2021 18:29 UTC

On 7/19/21 2:01 PM, Bart wrote:
> On 19/07/2021 17:50, dfs wrote:
>> Loaded a list of words into an array.
>>
>> The 370103 words came from https://github.com/dwyl/english-words
>> file = words_alpha.txt
>>
>> Tested a couple memory allocation 'strategies' during loading:
>>
>> 1. allocate the strlen() of each word
>> 2. allocate a fixed length (len of longest word = 32)
>>
>> I figured getting strlen(word) each time would be slower than
>> allocating a fixed amt of memory, but that wasn't the case.
>>
>> Strategy 2 is significantly slower, and uses nearly 3x the memory.

Any thoughts on this?

>> Also, does anyone have any 'tricks' to make such a file load routine
>> faster/more efficient?  Thanks
>
> How fast do you want it?

As fast as possible!

> Your program loaded 2.3 million words (5 combined copies of your test
> file) in about 1.25 seconds (on my slow PC with hard drive, but using
> file caching).

0.1 seconds here for (1 * 370103 = 370103) on my old system
0.5 seconds for (6 * 370103 = 2220618)

> A faster version I created did it in about 0.2 seconds (so perhaps
> 50msec for one copy).
>
> That uses this method:
>
> (1) Get the file size (using seek etc)
>
> (2) Allocate a single memory block and load entire file in one go
>
> (3) Do a first pass counting words (assumes one per line), by looking
> for \n characters and setting each to nul
>
> (4) Use that figure to allocate a linear array of char* objects in one
> memory block
>
> (5) Do a second pass which sets each char* to the start of the word,
> then steps a pointer to just past the next nul for the next word.
>
> You shouldn't be bothering with trailing spaces and such here; do that
> once, and write out a new cleaned-up list. Then apps will read that list
> with no further processing.

Thanks for that breakdown. I'll try to replicate it.

> (I didn't have time to do a C version; the following outlines my method
> using another systems language:)
>
> ------------------------------------------
> proc start=
>     ichar s,t
>     int nwords:=0, c
>     ref[]ichar words
>
>     s:=readfile("/texts/words5.")
>     t:=s
>
>     while c:=t^ do
>         if c=10 then
>             ++nwords
>             t^:=0
>         fi
>         ++t
>     od
>
>     words:=malloc((nwords+2)*ichar.bytes)
>
>     t:=s
>     for i in 1..nwords do
>         words[i]:=t
>         repeat
>         until (++t)^=0
>         ++t
>     od
>
>     for i in 1..nwords do
>         if i in 1..10 or i in nwords-9..nwords then
>             println i,words[i]
>         fi
>     od
> end

Amazing you wrote your own language. Did you name it?

Re: Allocate length of word vs fixed length

<sd4hfb$r71$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17555&group=comp.lang.c#17555
 by: Bart - Mon, 19 Jul 2021 18:49 UTC

On 19/07/2021 19:29, dfs wrote:
> On 7/19/21 2:01 PM, Bart wrote:
>> On 19/07/2021 17:50, dfs wrote:
>>> Loaded a list of words into an array.
>>>
>>> The 370103 words came from https://github.com/dwyl/english-words
>>> file = words_alpha.txt
>>>
>>> Tested a couple memory allocation 'strategies' during loading:
>>>
>>> 1. allocate the strlen() of each word
>>> 2. allocate a fixed length (len of longest word = 32)
>>>
>>> I figured getting strlen(word) each time would be slower than
>>> allocating a fixed amt of memory, but that wasn't the case.
>>>
>>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
> Any thoughts on this?

Well, I must have loaded a different file from yours (words.zip), with
466,000 words.

The size of that file on disk (and in memory using my method) is 4.8MB,
so approx 10 bytes per word on average.

Allocating 32 bytes per word will use 3 times as much memory, as you say
(some 14MB).

The speed of it depends on the access patterns, but clearly spreading it
over an extra 9MB means 2/3 of the data loaded into cache memory is useless.

Note that my method will use a total of 18 bytes per word on a 64-bit
machine, with a 64-bit pointer per word. I haven't done any random
access tests; I've concentrated on loading only.
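
The figures in this post can be checked with a little arithmetic. The sketch below uses hypothetical helper names and treats "4.8MB" as 4,800,000 bytes, which is an assumption about rounding. With 466,000 words, fixed 32-byte slots need 14,912,000 bytes, while one packed text block plus an 8-byte pointer per word needs 8,528,000 bytes, or about 18 bytes per word.

```c
#include <stddef.h>

/* Bytes needed when every word gets a fixed-size slot. */
size_t fixed_slot_bytes(size_t nwords, size_t slot)
{
    return nwords * slot;
}

/* Bytes needed for one packed text block plus one pointer per word. */
size_t packed_bytes(size_t text_bytes, size_t nwords, size_t ptr_size)
{
    return text_bytes + nwords * ptr_size;
}
```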

Given that, there might be a way of using a fixed 16 bytes per word, not
32, but you'd need some way of dealing with words longer than 15
characters (nul is still needed). Whether that is going to be any
faster, is hard to predict.

Maybe changing the order (from alphabetical) might help. It really
depends on what you intend doing. What is needed are some benchmarks for
an actual application that shows a problem.

There are all sorts of clever ways to arrange strings in memory, but
some get very complicated.

>
>
>>> Also, does anyone have any 'tricks' to make such a file load routine
>>> faster/more efficient?  Thanks
>>
>> How fast do you want it?
>
> As fast as possible!

But is this loading the file (which doesn't appear to be a problem on
your machine), or doing something else with it?

> Amazing you wrote your own language.  Did you name it?

I usually call it 'M'.

Re: Allocate length of word vs fixed length

<sd4k6f$e8k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17556&group=comp.lang.c#17556
 by: David Brown - Mon, 19 Jul 2021 19:35 UTC

On 19/07/2021 18:50, dfs wrote:
> Loaded a list of words into an array.
>
> The 370103 words came from https://github.com/dwyl/english-words
> file = words_alpha.txt
>
> Tested a couple memory allocation 'strategies' during loading:
>
> 1. allocate the strlen() of each word
> 2. allocate a fixed length (len of longest word = 32)
>
> I figured getting strlen(word) each time would be slower than allocating
> a fixed amt of memory, but that wasn't the case.
>
> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
> Also, does anyone have any 'tricks' to make such a file load routine
> faster/more efficient?  Thanks
>

#define max_word_count 500000
#define max_word_len 32

// Load your word file with:
char * word_file = malloc(max_word_count * max_word_len);
fgets(word_file, max_word_count * max_word_len, fwords);

// Put your words here:
static char words[max_word_count][max_word_len];

Adjust all that to suit - but the point is, don't mess around with
thousands of mallocs and char pointers. Your PC has 16 MB to spare. If
you want speed, use static arrays (or a single big malloc), not vast
numbers of small mallocs and extra pointers. Similarly, use a single
large read, not lots of tiny reads.
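
One caveat about the snippet above: fgets() stops after the first newline, so as written it would read only the first line of the file. A single large read is usually done with fread() instead. The following is a minimal sketch under that assumption; the helper name read_block is invented here, and the stream should be opened in binary mode if byte counts matter.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read up to cap bytes of an open stream into one
   malloc'd block with a single fread call. Stores the number of bytes
   read in *len_out; returns NULL if the allocation fails. */
char *read_block(FILE *fp, size_t cap, size_t *len_out)
{
    char *block = malloc(cap + 1);          /* +1 for a terminating NUL */
    if (block == NULL) return NULL;
    size_t got = fread(block, 1, cap, fp);  /* reads across newlines */
    block[got] = '\0';
    *len_out = got;
    return block;
}
```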

Re: Allocate length of word vs fixed length

<allocation-20210719204511@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=17557&group=comp.lang.c#17557
X-Copyright: (C) Copyright 2021 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
 by: Stefan Ram - Mon, 19 Jul 2021 19:46 UTC

dfs <nospam@dfs.com> writes:
>Also, does anyone have any 'tricks' to make such a file load routine
>faster/more efficient? Thanks

A lot depends on what you then want to do with those words.

If they are not to be modified, but just to be read later,
you can read the whole file into one region of memory
and then write a NUL character after each word.

If you need to do some per-word processing while reading,
you can still allocate one large region of memory and
append each word to it.

> char **words = malloc(sizeof(char*) * lines);

I assume that you have your reasons for not checking
the result of malloc here.
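
A minimal sketch of the "one large region, append each word" idea, with the malloc result checked. The struct and function names are hypothetical, and the region has a fixed capacity chosen by the caller; a real loader might size it from the file length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical arena: one large region that words are appended to. */
struct arena {
    char  *base;   /* start of the region */
    size_t used;   /* bytes handed out so far */
    size_t cap;    /* total size of the region */
};

/* Returns 0 on success, -1 if malloc fails. */
int arena_init(struct arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = (a->base != NULL) ? cap : 0;
    return (a->base != NULL) ? 0 : -1;
}

/* Copies word (including its NUL) to the end of the region and returns
   the copy, or NULL when the region is full. */
char *arena_add(struct arena *a, const char *word)
{
    size_t need = strlen(word) + 1;
    if (need > a->cap - a->used) return NULL;
    char *dst = a->base + a->used;
    memcpy(dst, word, need);
    a->used += need;
    return dst;
}
```

Everything is released with a single free(a.base), and consecutive words sit next to each other in memory.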

Re: Allocate length of word vs fixed length

<qRkJI.14177$6U5.9032@fx02.iad>

https://www.novabbs.com/devel/article-flat.php?id=17558&group=comp.lang.c#17558
 by: Scott Lurndal - Mon, 19 Jul 2021 20:00 UTC

dfs <nospam@dfs.com> writes:
>On 7/19/21 2:01 PM, Bart wrote:
>> On 19/07/2021 17:50, dfs wrote:
>>> Loaded a list of words into an array.
>>>
>>> The 370103 words came from https://github.com/dwyl/english-words
>>> file = words_alpha.txt
>>>
>>> Tested a couple memory allocation 'strategies' during loading:
>>>
>>> 1. allocate the strlen() of each word
>>> 2. allocate a fixed length (len of longest word = 32)
>>>
>>> I figured getting strlen(word) each time would be slower than
>>> allocating a fixed amt of memory, but that wasn't the case.
>>>
>>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
>Any thoughts on this?
>
>
>>> Also, does anyone have any 'tricks' to make such a file load routine
>>> faster/more efficient?  Thanks

Personally, I'd mmap it and make a single scanning pass over the entire
set of words, building an array of offsets from the start of the map,
one per word; the only allocations required would be those that extend
the array (e.g. a vector of 32-bit offset values or const char *
pointers). If the words are delimited by LF, CR or CRLF, just write a
nul byte over the delimiter.

Voila, an array of words.
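
A rough POSIX sketch of that approach follows. It is an illustration, not the poster's code: the struct and function names are made up, it handles LF and CRLF but not bare CR, it ignores a final line that lacks a terminator (there may be no spare byte after the mapping for a nul), and it rejects files too large for 32-bit offsets.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
struct word_map {
    char     *text;    /* private, writable mapping of the file */
    size_t    len;     /* file length in bytes */
    uint32_t *offset;  /* offset[i] = start of word i within text */
    size_t    count;   /* number of words found */
};

/* Returns 0 on success, -1 on failure. */
int map_words(const char *path, struct word_map *m)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0 ||
        (uintmax_t)st.st_size > UINT32_MAX) { close(fd); return -1; }
    m->len = (size_t)st.st_size;
    /* MAP_PRIVATE: the nul bytes written below never reach the file */
    m->text = mmap(NULL, m->len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (m->text == MAP_FAILED) return -1;

    size_t cap = 1024, start = 0;
    m->count = 0;
    m->offset = malloc(cap * sizeof *m->offset);
    if (m->offset == NULL) { munmap(m->text, m->len); return -1; }

    for (size_t k = 0; k < m->len; k++) {
        if (m->text[k] != '\n') continue;
        m->text[k] = '\0';                       /* overwrite LF */
        if (k > start && m->text[k - 1] == '\r')
            m->text[k - 1] = '\0';               /* overwrite CR of CRLF */
        if (m->count == cap) {                   /* extend the offset array */
            uint32_t *grown = realloc(m->offset, 2 * cap * sizeof *grown);
            if (grown == NULL) {
                free(m->offset);
                munmap(m->text, m->len);
                return -1;
            }
            m->offset = grown;
            cap *= 2;
        }
        m->offset[m->count++] = (uint32_t)start;
        start = k + 1;
    }
    return 0;
}
```

Word i is then m.text + m.offset[i]; cleanup is free(m.offset) followed by munmap(m.text, m.len).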

Re: Allocate length of word vs fixed length

<slrnsfbqkl.cnu.ike@rie.sdf.org>

https://www.novabbs.com/devel/article-flat.php?id=17559&group=comp.lang.c#17559
 by: Ike Naar - Mon, 19 Jul 2021 21:11 UTC

On 2021-07-19, dfs <nospam@dfs.com> wrote:
> //different memory allocations
> if(memoption==1)
> {memlen=strlen(word);}

/* allocate one extra for the terminating null character */
memlen = strlen(word) + 1;

> else
> {memlen=maxlen;}
>
> words[i] = malloc(sizeof(char*) * memlen);

/* words[i] contains memlen chars, not memlen pointers to char */
words[i] = malloc(sizeof(char) * memlen);
/* or */ words[i] = malloc(memlen); /* sizeof(char) == 1 by definition */
/* or */ words[i] = malloc(memlen * sizeof *words[i]); /* clc idiom */

> memtotal += memlen;
>
> strcpy(words[i],word);
>
> //printf("%d. '%s' ",i,words[i]);
> i++;
> }
>
> fclose(fwords);
> printf("\nloaded %d words\n",i-1);

/* the number of words is i (numbered from 0 to i-1) */
printf("\nloaded %d words\n",i);

> printf("\nFirst and Last words = '%s', '%s'\n",words[0],words[i-1]);
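
Taken together, the allocation corrections above amount to a small copy routine. The sketch below uses a hypothetical helper name, copy_word. The key point is that strlen() returns the number of characters before the terminating null, so the buffer needs one byte more than that for strcpy() or memcpy() to stay in bounds.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of word sized to fit exactly,
   or NULL if malloc fails. The caller frees the result. */
char *copy_word(const char *word)
{
    size_t size = strlen(word) + 1;   /* strlen() does not count the NUL */
    char *copy = malloc(size);        /* sizeof(char) == 1 by definition */
    if (copy != NULL) {
        memcpy(copy, word, size);     /* copies the NUL as well */
    }
    return copy;
}
```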

Re: Allocate length of word vs fixed length

<9DmJI.26905$Yv3.24844@fx41.iad>

https://www.novabbs.com/devel/article-flat.php?id=17561&group=comp.lang.c#17561
 by: dfs - Mon, 19 Jul 2021 22:01 UTC

On 7/19/21 2:49 PM, Bart wrote:
> On 19/07/2021 19:29, dfs wrote:
>> On 7/19/21 2:01 PM, Bart wrote:
>>> On 19/07/2021 17:50, dfs wrote:
>>>> Loaded a list of words into an array.
>>>>
>>>> The 370103 words came from https://github.com/dwyl/english-words
>>>> file = words_alpha.txt
>>>>
>>>> Tested a couple memory allocation 'strategies' during loading:
>>>>
>>>> 1. allocate the strlen() of each word
>>>> 2. allocate a fixed length (len of longest word = 32)
>>>>
>>>> I figured getting strlen(word) each time would be slower than
>>>> allocating a fixed amt of memory, but that wasn't the case.
>>>>
>>>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>>
>> Any thoughts on this?
>
> Well, I must have loaded a different file from yours (words.zip), with
> 466,000 words.

https://github.com/dwyl/english-words
file = words_alpha.txt
It's sorted

> The size of that file on disk (and in memory using my method) is 4.8MB,
> so approx 10 bytes per word on average.
>
> Allocating 32 bytes per word will use 3 times as much memory, as you say
> (some 14MB).
>
> The speed of it depends on the access patterns, but clearly spreading it
> over an extra 9MB means 2/3 of the data loaded into cache memory is
> useless.
>
> Note that my method will use a total of 18 bytes per word on a 64-bit
> machine, with a 64-bit pointer per word. I haven't done any random
> access tests; I've concentrated on loading only.
>
> Given that, there might be a way of using a fixed 16 bytes per word, not
> 32, but you'd need some way of dealing with words longer than 15
> characters (nul is still needed). Whether that is going to be any
> faster, is hard to predict.
>
> Maybe changing the order (from alphabetical) might help. It really
> depends on what you intend doing. What is needed are some benchmarks for
> an actual application that shows a problem.

There's no problem, per se. Just trying to learn good techniques.

> There are all sorts of clever ways to arrange strings in memory, but
> some get very complicated.

Thanks. 'Very complicated' sounds like trouble.

The speed I get is actually fine - who can argue with 0.003s to find 25K
matching words in a list of 370K words?

I'll try your suggestions.

>>>> Also, does anyone have any 'tricks' to make such a file load routine
>>>> faster/more efficient?  Thanks
>>>
>>> How fast do you want it?
>>
>> As fast as possible!
>
> But is this loading the file (which doesn't appear to be a problem on
> your machine), or doing something else with it?
>
>> Amazing you wrote your own language.  Did you name it?
>
> I usually call it 'M'.

For Mistress?

Re: Allocate length of word vs fixed length

<cEmJI.13605$Ei1.622@fx07.iad>

https://www.novabbs.com/devel/article-flat.php?id=17562&group=comp.lang.c#17562
 by: dfs - Mon, 19 Jul 2021 22:02 UTC

On 7/19/21 3:35 PM, David Brown wrote:
> On 19/07/2021 18:50, dfs wrote:
>> Loaded a list of words into an array.
>>
>> The 370103 words came from https://github.com/dwyl/english-words
>> file = words_alpha.txt
>>
>> Tested a couple memory allocation 'strategies' during loading:
>>
>> 1. allocate the strlen() of each word
>> 2. allocate a fixed length (len of longest word = 32)
>>
>> I figured getting strlen(word) each time would be slower than allocating
>> a fixed amt of memory, but that wasn't the case.
>>
>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>>
>> Also, does anyone have any 'tricks' to make such a file load routine
>> faster/more efficient?  Thanks
>>
>
> #define max_word_count 500000
> #define max_word_len 32
>
>
> // Load your word file with:
> char * word_file = malloc(max_word_count * max_word_len);
> fgets(word_file, max_word_count * max_word_len, fwords);
>
> // Put your words here:
> static char words[max_word_count][max_word_len];
>
>
> Adjust all that to suit - but the point is, don't mess around with
> thousands of mallocs and char pointers.

I did that so I could load smaller and larger (and really large) files.

> Your PC has 16 MB to spare. If
> you want speed, use static arrays (or a single big malloc), not vast
> numbers of small mallocs and extra pointers. Similarly, use a single
> large read, not lots of tiny reads.

Good points. Between Bart's and your techniques I have some good things
to try.

It's just the beginning of a simple linear pattern-match routine: how
many words in the dictionary begin with 'cat'.. that kind of thing.

Thanks for your help.
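
The linear prefix match mentioned above ("how many words begin with 'cat'") can be sketched in a few lines. The function name count_prefix is hypothetical, and the word list is assumed to be an array of nul-terminated strings already in memory.

```c
#include <stddef.h>
#include <string.h>

/* Counts how many of the n words start with prefix (linear scan). */
size_t count_prefix(char *const *words, size_t n, const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t hits = 0;
    for (size_t k = 0; k < n; k++) {
        /* compare only the first plen characters of each word */
        if (strncmp(words[k], prefix, plen) == 0) hits++;
    }
    return hits;
}
```

Because the list is sorted, a binary search for the first match would avoid scanning every word, but the linear version is the simplest starting point.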

Re: Allocate length of word vs fixed length

<_EmJI.26906$Yv3.21593@fx41.iad>

https://www.novabbs.com/devel/article-flat.php?id=17563&group=comp.lang.c#17563
 by: dfs - Mon, 19 Jul 2021 22:03 UTC

On 7/19/21 5:11 PM, Ike Naar wrote:
> On 2021-07-19, dfs <nospam@dfs.com> wrote:
>> //different memory allocations
>> if(memoption==1)
>> {memlen=strlen(word);}
>
> /* allocate one extra for the terminating null character */
> memlen = strlen(word) + 1;

I did it right. Look a few lines up (you snipped it)

strcpy(word,rtrim(wordin));

So 'word' already has the nul.

>> else
>> {memlen=maxlen;}
>>
>> words[i] = malloc(sizeof(char*) * memlen);

This works.

> /* words[i] contains memlen chars, not memlen pointers to char */
> words[i] = malloc(sizeof(char) * memlen);
> /* or */ words[i] = malloc(memlen); /* sizeof(char) == 1 by definition */
> /* or */ words[i] = malloc(memlen * sizeof *words[i]); /* clc idiom */

All 3 of those options compile but when run:

lettermatch: malloc.c:2539: sysmalloc: Assertion `(old_top ==
initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >=
MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize
- 1)) == 0)' failed.
Aborted (core dumped)

>> memtotal += memlen;
>>
>> strcpy(words[i],word);
>>
>> //printf("%d. '%s' ",i,words[i]);
>> i++;
>> }
>>
>> fclose(fwords);
>> printf("\nloaded %d words\n",i-1);
>
> /* the number of words is i (numbered from 0 to i-1) */
> printf("\nloaded %d words\n",i);
>
>> printf("\nFirst and Last words = '%s', '%s'\n",words[0],words[i-1]);

Of course (i starts at 0), but gremlins overcount by 1 so I compensated.

The file has 370103 words in it, but the fgets() block results in 370104.

You can run my code against words_alpha.txt from
https://github.com/dwyl/english-words

And see if it counts correctly.

Thanks for looking at it.

Re: Allocate length of word vs fixed length

<7WmJI.13186$6j.6538@fx04.iad>

https://www.novabbs.com/devel/article-flat.php?id=17564&group=comp.lang.c#17564
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!feeder5.feed.usenet.farm!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
From: nos...@dfs.com (dfs)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <allocation-20210719204511@ram.dialup.fu-berlin.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Lines: 42
Message-ID: <7WmJI.13186$6j.6538@fx04.iad>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Mon, 19 Jul 2021 22:21:55 UTC
Organization: blocknews - www.blocknews.net
Date: Mon, 19 Jul 2021 18:21:54 -0400
X-Received-Bytes: 1893
 by: dfs - Mon, 19 Jul 2021 22:21 UTC

On 7/19/21 3:46 PM, Stefan Ram wrote:
> dfs <nospam@dfs.com> writes:
>> Also, does anyone have any 'tricks' to make such a file load routine
>> faster/more efficient? Thanks
>
> A lot depends on what you then want to do with those words.
>
> If they are not to be modified, but just to be read later,
> you can read the whole file into one region of memory
> and then write a NUL character after each word.

They're just to be read and pattern-matched against.

> If you need to do some per-word processing while reading,
> you can still allocate one large region of memory and
> append each word to it.
>
>> char **words = malloc(sizeof(char*) * lines);
>
> I assume that you have your reasons for not checking
> the result of malloc here.

Program is for me only.

I just tested:
lines = 2000000000;
char **words = malloc(sizeof(char*) * lines);
if(words == NULL)
{ printf("malloc failed\n");
exit(0);
}

and it failed and exited.

Thanks
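
For a request of this size it can also help to guard the multiplication itself, so that a byte count too large for size_t is rejected instead of wrapping around to a small value. A minimal sketch, assuming a hypothetical helper named alloc_array (not part of the original program):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for count elements of elem_size
   bytes each. Returns NULL if the byte count would overflow size_t or
   if malloc() itself fails; the caller must check the result. */
void *alloc_array(size_t count, size_t elem_size)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return NULL;                 /* count * elem_size would wrap */
    return malloc(count * elem_size);
}
```

A caller would then write words = alloc_array(lines, sizeof *words) and test for NULL as in the snippet above. On failure, exit(EXIT_FAILURE) is the conventional status; exit(0) reports success to the shell.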

Re: Allocate length of word vs fixed length

<sd4u58$i2i$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17565&group=comp.lang.c#17565
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Mon, 19 Jul 2021 23:25:40 +0100
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <sd4u58$i2i$1@dont-email.me>
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad> <sd4hfb$r71$1@dont-email.me>
<9DmJI.26905$Yv3.24844@fx41.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 19 Jul 2021 22:25:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2cf338954d2a92f6357e50d991d768bb";
logging-data="18514"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+OZsfH1RsPC5F511kCIE1Wtl/zz1frHlM="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:ZrVQ/6ZJVZDbBubLP42V9prZEhw=
In-Reply-To: <9DmJI.26905$Yv3.24844@fx41.iad>
X-Antivirus-Status: Clean
Content-Language: en-GB
X-Antivirus: AVG (VPS 210719-6, 19/7/2021), Outbound message
 by: Bart - Mon, 19 Jul 2021 22:25 UTC

On 19/07/2021 23:01, dfs wrote:
> On 7/19/21 2:49 PM, Bart wrote:

>
>> There are all sorts of clever ways to arrange strings in memory, but
>> some get very complicated.
>
> Thanks.  'Very complicated' sounds like trouble.

I choose the simplest methods too.

In my script language, I read such a file like this:

words := readtextfile(file)

This function reads it line by line into a list of string objects
(somewhere in there is a loop with fgets in it). It's a bit slower, but
it reads your test file in under 0.3 seconds.

(I use it in a program that helps me cheat at crosswords.)

>> I usually call it 'M'.
>
> For Mistress?
>

I wish...

Re: Allocate length of word vs fixed length

<sd4ud6$j8g$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17566&group=comp.lang.c#17566
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Mon, 19 Jul 2021 23:29:55 +0100
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <sd4ud6$j8g$1@dont-email.me>
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
<7WmJI.13186$6j.6538@fx04.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 19 Jul 2021 22:29:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2cf338954d2a92f6357e50d991d768bb";
logging-data="19728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+mi563LE3CaMlra5IhJbR4jxc4420CEkw="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:V5v8KdHepWhJ9ErrcyVk1QcoH0Q=
In-Reply-To: <7WmJI.13186$6j.6538@fx04.iad>
X-Antivirus-Status: Clean
Content-Language: en-GB
X-Antivirus: AVG (VPS 210719-6, 19/7/2021), Outbound message
 by: Bart - Mon, 19 Jul 2021 22:29 UTC

On 19/07/2021 23:21, dfs wrote:
> On 7/19/21 3:46 PM, Stefan Ram wrote:
>> dfs <nospam@dfs.com> writes:
>>> Also, does anyone have any 'tricks' to make such a file load routine
>>> faster/more efficient?  Thanks
>>
>>    A lot depends on what you then want to do with those words.
>>
>>    If they are not to be modified, but just to be read later,
>>    you can read the whole file into one region of memory
>>    and then write a NUL character after each word.
>
> They're just to be read and pattern-matched against.
>
>
>>    If you need to do some per-word processing while reading,
>>    you can still allocate one large region of memory and
>>    append each word to it.
>>
>>>     char **words = malloc(sizeof(char*) * lines);
>>
>>    I assume that you have your reasons for not checking
>>    the result of malloc here.
>
>
> Program is for me only.
>
> I just tested:
> lines = 2000000000;
> char **words = malloc(sizeof(char*) * lines);
> if(words == NULL)
> {
>   printf("malloc failed\n");
>   exit(0);
> }
>
> and it failed and exited.

Is this a 32-bit system? Or a 32-bit compiler on a 64-bit system?

If so, the largest allocation size might be 2GB, and your request was
for 8GB.

On a 64-bit system, you might be less lucky, and malloc won't fail even
if the allocation exceeds physical memory. It'll just get very slow.

Then there's little point in checking the malloc result.

Re: Allocate length of word vs fixed length

<UbnJI.13187$6j.4376@fx04.iad>

https://www.novabbs.com/devel/article-flat.php?id=17567&group=comp.lang.c#17567
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
<7WmJI.13186$6j.6538@fx04.iad> <sd4ud6$j8g$1@dont-email.me>
From: nos...@dfs.com (dfs)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <sd4ud6$j8g$1@dont-email.me>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Lines: 87
Message-ID: <UbnJI.13187$6j.4376@fx04.iad>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Mon, 19 Jul 2021 22:40:52 UTC
Organization: blocknews - www.blocknews.net
Date: Mon, 19 Jul 2021 18:40:51 -0400
X-Received-Bytes: 4167
 by: dfs - Mon, 19 Jul 2021 22:40 UTC

On 7/19/21 6:29 PM, Bart wrote:
> On 19/07/2021 23:21, dfs wrote:
>> On 7/19/21 3:46 PM, Stefan Ram wrote:
>>> dfs <nospam@dfs.com> writes:
>>>> Also, does anyone have any 'tricks' to make such a file load routine
>>>> faster/more efficient?  Thanks
>>>
>>>    A lot depends on what you then want to do with those words.
>>>
>>>    If they are not to be modified, but just to be read later,
>>>    you can read the whole file into one region of memory
>>>    and then write a NUL character after each word.
>>
>> They're just to be read and pattern-matched against.
>>
>>
>>>    If you need to do some per-word processing while reading,
>>>    you can still allocate one large region of memory and
>>>    append each word to it.
>>>
>>>>     char **words = malloc(sizeof(char*) * lines);
>>>
>>>    I assume that you have your reasons for not checking
>>>    the result of malloc here.
>>
>>
>> Program is for me only.
>>
>> I just tested:
>> lines = 2000000000;
>> char **words = malloc(sizeof(char*) * lines);
>> if(words == NULL)
>> {
>>    printf("malloc failed\n");
>>    exit(0);
>> }
>>
>> and it failed and exited.
>
> Is this a 32-bit system? Or a 32-bit compiler on a 64-bit system?

64 on 64

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
--mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
--enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d,jit
--enable-offload-targets=nvptx-none,amdgcn-amdhsa, --without-cuda-driver
--enable-host-shared --enable-checking=release --disable-werror
--with-gxx-include-dir=/usr/include/c++/11 --enable-ssp --disable-libssp
--disable-libvtv --enable-cet=auto --disable-libcc1 --enable-plugin
--with-bugurl=https://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
--with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch
--enable-libphobos --enable-version-specific-runtime-libs
--with-gcc-major-version-only --enable-linker-build-id
--enable-linux-futex --enable-gnu-indirect-function --program-suffix=-11
--without-system-libunwind --enable-multilib --with-arch-32=x86-64
--with-tune=generic --with-build-config=bootstrap-lto-lean
--enable-link-mutex --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.1.1 20210510 [revision
23855a176609fe8dda6abaf2b21846b4517966eb] (SUSE Linux)

> If so, the largest allocation size might be 2GB, and your request was
> for 8GB.
>
> On a 64-bit system, you might be less lucky, and malloc won't fail even
> if the allocation exceeds physical memory. It'll just get very slow.

I just listen to my fan to know the workload (it goes up and down a lot
in Linux). In Windows 8.1 it stayed quiet, but with Win10 it also
speeds up and down a lot.

> Then there's little point in checking the malloc result.

I've never done it. My code is small-time for personal use.

Re: Allocate length of word vs fixed length

<MVnJI.30664$r21.4511@fx38.iad>

https://www.novabbs.com/devel/article-flat.php?id=17568&group=comp.lang.c#17568
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx38.iad.POSTED!not-for-mail
Newsgroups: comp.lang.c
From: branimir...@gmail.com (Branimir Maksimovic)
Subject: Re: Allocate length of word vs fixed length
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad> <qRkJI.14177$6U5.9032@fx02.iad>
User-Agent: slrn/1.0.3 (Darwin)
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 42
Message-ID: <MVnJI.30664$r21.4511@fx38.iad>
X-Complaints-To: abuse@usenet-news.net
NNTP-Posting-Date: Mon, 19 Jul 2021 23:29:48 UTC
Organization: usenet-news.net
Date: Mon, 19 Jul 2021 23:29:48 GMT
X-Received-Bytes: 2159
 by: Branimir Maksimovic - Mon, 19 Jul 2021 23:29 UTC

On 2021-07-19, Scott Lurndal <scott@slp53.sl.home> wrote:
> dfs <nospam@dfs.com> writes:
>>On 7/19/21 2:01 PM, Bart wrote:
>>> On 19/07/2021 17:50, dfs wrote:
>>>> Loaded a list of words into an array.
>>>>
>>>> The 370103 words came from https://github.com/dwyl/english-words
>>>> file = words_alpha.txt
>>>>
>>>> Tested a couple memory allocation 'strategies' during loading:
>>>>
>>>> 1. allocate the strlen() of each word
>>>> 2. allocate a fixed length (len of longest word = 32)
>>>>
>>>> I figured getting strlen(word) each time would be slower than
>>>> allocating a fixed amt of memory, but that wasn't the case.
>>>>
>>>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>>
>>Any thoughts on this?
>>
>>
>>>> Also, does anyone have any 'tricks' to make such a file load routine
>>>> faster/more efficient?  Thanks
>
>
> Personally, I'd mmap it and make a single scanning pass over the entire
> set of words and build an array of offsets from the start of the
> map for each word; the only allocations would be required to
> extend the array (e.g. a vector of 32-bit offset values
> or const char * pointers). If the
> words are delimited by LF, CR or CRLF, just write a nul byte
> over the delimiter.
>
> Voila, an array of words.
>
+1
I would do that too.
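
The single-pass approach Scott describes can be sketched without mmap(), assuming the whole file has already been read into one writable buffer with a spare byte at the end (for example with a single fread()). The name split_words and its signature are hypothetical, and the sketch stores char pointers where Scott also mentions 32-bit offsets as an option.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: overwrite every LF or CR in buf[0..len-1] with
   '\0' and return a malloc'd array of pointers to the start of each
   non-empty word. buf must have room for one extra byte at buf[len].
   The word count is stored in *count; NULL is returned on allocation
   failure. The pointers refer into buf, so buf must outlive them. */
char **split_words(char *buf, size_t len, size_t *count)
{
    size_t cap = 1024, n = 0;
    char **words = malloc(cap * sizeof *words);
    if (words == NULL)
        return NULL;

    buf[len] = '\0';                     /* terminate the final word */
    int at_start = 1;                    /* next text byte begins a word */

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n' || buf[i] == '\r') {
            buf[i] = '\0';
            at_start = 1;
        } else if (at_start) {
            if (n == cap) {              /* grow the pointer array */
                char **tmp = realloc(words, 2 * cap * sizeof *tmp);
                if (tmp == NULL) {
                    free(words);
                    return NULL;
                }
                words = tmp;
                cap *= 2;
            }
            words[n++] = &buf[i];
            at_start = 0;
        }
    }
    *count = n;
    return words;
}
```

Only the pointer array is allocated separately; the words themselves stay where the read put them, so there is one large allocation for the text instead of one per word.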

--
bmaxa now listens Vanguard by yelworC from Collection 88-94

Re: Allocate length of word vs fixed length

<sd52c5$1qgp$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17569&group=comp.lang.c#17569
Path: i2pn2.org!i2pn.org!aioe.org!bJ9HIA9AN/keIQ+0iGyr/g.user.46.165.242.91.POSTED!not-for-mail
From: real.tr...@trolls.com (Real Troll)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Tue, 20 Jul 2021 00:30:00 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sd52c5$1qgp$1@gioia.aioe.org>
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad>
Mime-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="59929"; posting-host="bJ9HIA9AN/keIQ+0iGyr/g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US
 by: Real Troll - Mon, 19 Jul 2021 23:30 UTC

On 19/07/2021 19:29, dfs wrote:
>
> 0.1 seconds here for (1 * 370103 =  370103) on my old system
> 0.5 seconds for (6 * 370103 = 2220618)
>

How are you measuring the timing? Can you check this program by running:

"prog words_alpha.txt"  This is providing the text file name at the
command prompt to load it.

I have commented out the printf() function in the main() because the
file is very big to print on the screen.

<==================================================================================>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int STEPSIZE = 100;

char **loadfile(char *fileName, int *len);

int main(int argc, char *argv[])
{
    if (argc == 1)
    {
        perror("Error: ");
        exit(1);
    }

    int length = 0;
    char **words = loadfile(argv[1], &length);

    if (!words)
    {
        perror("Error: ");
        exit(1);
    }

    for (int i = 0; words[i] != NULL; i++)
    {
        // printf("%s\n", words[i]);
    }

    printf("Total Lines are: %d", length);
    return EXIT_SUCCESS;
}

char **loadfile(char *fileName, int *len)
{
    FILE *f = fopen(fileName, "r");

    if (!f)
    {
        perror("Error: ");
        return NULL;
    }

    int arrlen = STEPSIZE;
    // char **lines = NULL;
    char **lines = (char **)malloc(arrlen * sizeof(char *));

    char buf[1000];
    int i = 0;
    while (fgets(buf, 1000, f))
    {
        if (i == arrlen)
        {
            arrlen += STEPSIZE;
            char **newlines = realloc(lines, arrlen * sizeof(char *));

            if (!newlines)
            {
                perror("Error: ");
                exit(1);
            }

            lines = newlines;
        }

        buf[strlen(buf) - 1] = '\0';

        int slen = strlen(buf);
        char *str = (char *)malloc((slen + 1) * sizeof(char));
        strcpy(str, buf);
        lines[i] = str;
        i++;
    }
    *len = i;
    fclose(f);
    return lines;
}

<==================================================================================>
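
Two details of the listing above are worth a note. main() loops until words[i] is NULL, but loadfile() never stores a NULL after the last line, so that loop can run past the initialised entries. Growing the array by a fixed STEPSIZE also costs one realloc() per 100 lines. A minimal sketch of an append routine that keeps a NULL sentinel and doubles the capacity instead; struct line_list and push_line are assumed names, not part of the posted program.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable list of strings. items[count] is always NULL
   once at least one push has succeeded. */
struct line_list {
    char **items;
    size_t count;
    size_t cap;
};

/* Append a private copy of s. Returns 0 on success, -1 on allocation
   failure (no line is added in that case). */
int push_line(struct line_list *l, const char *s)
{
    if (l->count + 2 > l->cap) {         /* room for s plus the sentinel */
        size_t ncap = l->cap ? 2 * l->cap : 128;
        char **tmp = realloc(l->items, ncap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        l->items = tmp;
        l->cap = ncap;
    }

    size_t len = strlen(s);
    char *copy = malloc(len + 1);        /* +1 for the terminating null */
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len + 1);

    l->items[l->count++] = copy;
    l->items[l->count] = NULL;           /* sentinel for NULL-terminated loops */
    return 0;
}
```

For the line ending, buf[strcspn(buf, "\r\n")] = '\0' removes a trailing newline without cutting a character from a last line that has none, which buf[strlen(buf) - 1] = '\0' would do.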

Re: Allocate length of word vs fixed length

<XXoJI.22218$ilwe.8362@fx35.iad>

https://www.novabbs.com/devel/article-flat.php?id=17570&group=comp.lang.c#17570
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx35.iad.POSTED!not-for-mail
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad> <sd52c5$1qgp$1@gioia.aioe.org>
From: nos...@dfs.com (DFS)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
MIME-Version: 1.0
In-Reply-To: <sd52c5$1qgp$1@gioia.aioe.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Lines: 120
Message-ID: <XXoJI.22218$ilwe.8362@fx35.iad>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Tue, 20 Jul 2021 00:40:23 UTC
Organization: blocknews - www.blocknews.net
Date: Mon, 19 Jul 2021 20:40:23 -0400
X-Received-Bytes: 4193
 by: DFS - Tue, 20 Jul 2021 00:40 UTC

On 7/19/2021 7:30 PM, Real Troll wrote:
> On 19/07/2021 19:29, dfs wrote:
>>
>> 0.1 seconds here for (1 * 370103 =  370103) on my old system
>> 0.5 seconds for (6 * 370103 = 2220618)
>>
>
> How are you measuring the timing?

When I first posted the code I was running Linux (gcc) and it had no
timing code so I used the standard:
$ time ./loadwords wordfile

I later added CLOCK_MONOTONIC_RAW timing to the source code. You
probably know already that on Windows you can use the
QueryPerformanceCounter() in the code.

I'm back on Windows now (tcc compiler), and can use an outside timer
called ptime.

$ tcc -Wall realtroll.c -o realtroll.exe
$ ptime realtroll words_alpha.txt
Total Lines are: 370103
Execution time: 0.529 s

> Can you check this program by running:
>
> "prog words_alpha.txt"  This is providing the text file name at the
> command prompt to load it.
>
> I have commented out the printf() function in the main() because the
> file is very big to print on the screen.
>
>
> <==================================================================================>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> const int STEPSIZE = 100;
>
> char **loadfile(char *fileName, int *len);
>
> int main(int argc, char *argv[])
> {
>     if (argc == 1)
>     {
>         perror("Error: ");
>         exit(1);
>     }
>
>     int length = 0;
>     char **words = loadfile(argv[1], &length);
>
>     if (!words)
>     {
>         perror("Error: ");
>         exit(1);
>     }
>
>     for (int i = 0; words[i] != NULL; i++)
>     {
>         // printf("%s\n", words[i]);
>     }
>
>     printf("Total Lines are: %d", length);
>     return EXIT_SUCCESS;
> }
>
> char **loadfile(char *fileName, int *len)
> {
>     FILE *f = fopen(fileName, "r");
>
>     if (!f)
>     {
>         perror("Error: ");
>         return NULL;
>     }
>
>     int arrlen = STEPSIZE;
>     // char **lines = NULL;
>     char **lines = (char **)malloc(arrlen * sizeof(char *));
>
>     char buf[1000];
>     int i = 0;
>     while (fgets(buf, 1000, f))
>     {
>         if (i == arrlen)
>         {
>             arrlen += STEPSIZE;
>             char **newlines = realloc(lines, arrlen * sizeof(char *));
>
>             if (!newlines)
>             {
>                 perror("Error: ");
>                 exit(1);
>             }
>
>             lines = newlines;
>         }
>
>         buf[strlen(buf) - 1] = '\0';
>
>         int slen = strlen(buf);
>         char *str = (char *)malloc((slen + 1) * sizeof(char));
>         strcpy(str, buf);
>         lines[i] = str;
>         i++;
>     }
>     *len = i;
>     fclose(f);
>     return lines;
> }
>
> <==================================================================================>

Re: Allocate length of word vs fixed length

<slrnsfcno5.a3m.ike@sdf.org>

https://www.novabbs.com/devel/article-flat.php?id=17571&group=comp.lang.c#17571
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ike...@sdf.org (Ike Naar)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Tue, 20 Jul 2021 05:28:37 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <slrnsfcno5.a3m.ike@sdf.org>
References: <U3iJI.61694$VU3.5811@fx46.iad> <slrnsfbqkl.cnu.ike@rie.sdf.org>
<_EmJI.26906$Yv3.21593@fx41.iad>
Injection-Date: Tue, 20 Jul 2021 05:28:37 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="44a9970441a67a95ee7b179e8cabebb7";
logging-data="15707"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+NhMSWxdca2z7oQm1wUfNw"
User-Agent: slrn/1.0.3 (Patched for libcanlock3) (NetBSD)
Cancel-Lock: sha1:kfTWEinJ9huik3SrYJqzgdWY7jw=
 by: Ike Naar - Tue, 20 Jul 2021 05:28 UTC

On 2021-07-19, dfs <nospam@dfs.com> wrote:
> On 7/19/21 5:11 PM, Ike Naar wrote:
>> On 2021-07-19, dfs <nospam@dfs.com> wrote:
>>> //different memory allocations
>>> if(memoption==1)
>>> {memlen=strlen(word);}
>>
>> /* allocate one extra for the terminating null character */
>> memlen = strlen(word) + 1;
>
>
> I did it right. Look a few lines up (you snipped it)
>
> strcpy(word,rtrim(wordin));
>
> So 'word' already has the nul.

Yes, 'word' has the null. And later on, 'word' is strcpy-ed into
words[i], so you want words[i] to have room for the copy
of 'word', including the terminating null character.

>
>
>
>>> else
>>> {memlen=maxlen;}
>>>
>>> words[i] = malloc(sizeof(char*) * memlen);
>
> This works.

It works by luck, not by design.

Suppose 'word' contains the 6-character text "potato".
To store this in words[i], 7 bytes are needed (6 for the
text plus 1 for the terminating null character).

Now suppose sizeof (char*) equals 8 (a common value for a 64-bit system).
Using malloc(sizeof(char*) * memlen) with memlen=6, 8*6 = 48 bytes are allocated
which is more than enough to store the 7 bytes.

But it would be more memory-efficient to allocate 1*7 = 7 bytes.
Hence the malloc(sizeof (char) * memlen) with memlen=7.
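
The same arithmetic in code form: a minimal sketch of a copy routine that sizes the block from the string itself. The name dup_word is an assumption. It also suggests why the corrected malloc() calls aborted while memlen was still strlen(word): strcpy() then writes one byte past the end of the block, and the allocator can notice the damaged heap on a later call. That diagnosis is a likely explanation, not something confirmed in the thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of word, or NULL if the
   allocation fails. The block is exactly strlen(word) + 1 bytes:
   the characters plus the terminating null. */
char *dup_word(const char *word)
{
    size_t size = strlen(word) + 1;      /* "potato" -> 6 + 1 = 7 bytes */
    char *copy = malloc(size);           /* sizeof(char) is 1 by definition */
    if (copy != NULL)
        memcpy(copy, word, size);        /* copies the null as well */
    return copy;
}
```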

Re: Allocate length of word vs fixed length

<sd6vvt$1ogd$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17582&group=comp.lang.c#17582
Path: i2pn2.org!i2pn.org!aioe.org!bJ9HIA9AN/keIQ+0iGyr/g.user.46.165.242.91.POSTED!not-for-mail
From: real.tr...@trolls.com (Real Troll)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Tue, 20 Jul 2021 17:00:00 +0000
Organization: Aioe.org NNTP Server
Message-ID: <sd6vvt$1ogd$1@gioia.aioe.org>
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad> <sd52c5$1qgp$1@gioia.aioe.org>
<XXoJI.22218$ilwe.8362@fx35.iad>
Mime-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="57869"; posting-host="bJ9HIA9AN/keIQ+0iGyr/g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US
 by: Real Troll - Tue, 20 Jul 2021 17:00 UTC

On 20/07/2021 01:40, DFS wrote:
> On 7/19/2021 7:30 PM, Real Troll wrote:
>> On 19/07/2021 19:29, dfs wrote:
>>>
>>> 0.1 seconds here for (1 * 370103 =  370103) on my old system
>>> 0.5 seconds for (6 * 370103 = 2220618)
>>>
>>
>> How are you measuring the timing?
>
>
> When I first posted the code I was running Linux (gcc) and it had no
> timing code so I used the standard:
> $ time ./loadwords wordfile
>
> I later added CLOCK_MONOTONIC_RAW timing to the source code.  You
> probably know already that on Windows you can use the
> QueryPerformanceCounter() in the code.
>
> I'm back on Windows now (tcc compiler), and can use an outside timer
> called ptime.
>
> $ tcc -Wall realtroll.c -o realtroll.exe
> $ ptime realtroll words_alpha.txt
> Total Lines are: 370103
> Execution time: 0.529 s
>
>
>
>
I have always used something like this:

clock_t t;
t = clock();
char **words = loadfile(argv[1], &length);
t = clock() - t;

Then use a printf like so:

double time_taken = ((double)t) / CLOCKS_PER_SEC;
printf("loadfile took %f seconds to execute \n", time_taken);

It is quite rudimentary but works most of the time.

In my code example I tweaked some numbers and the time taken was reduced
dramatically but not as low as your figures.

The numbers I changed were:

const int STEPSIZE = 100000;
char buf[1000000];

Your timings are still very fast.
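
One caveat when comparing these figures: in standard C, clock() reports processor time used by the program, not elapsed wall-clock time, and implementations differ on this, so it need not agree with an external timer such as ptime. A minimal sketch that wraps the measurement shown above; time_call and the stand-in workload sum_to are assumed names for illustration.

```c
#include <time.h>

/* Hypothetical helper: run fn(arg) once and return the time it took,
   in seconds, as measured by clock(). */
double time_call(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload for the sketch; in the program above this would
   be a wrapper around the call to loadfile(). */
void sum_to(void *arg)
{
    unsigned long *n = arg;
    volatile unsigned long total = 0;
    for (unsigned long i = 0; i < *n; i++)
        total += i;
}
```

Timing the same call several times and reporting the smallest value reduces the influence of disk caching and other processes on a single run.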

Re: Allocate length of word vs fixed length

<pcmhsh-pp1.ln1@aretha.foo>

https://www.novabbs.com/devel/article-flat.php?id=17586&group=comp.lang.c#17586
Path: i2pn2.org!rocksolid2!i2pn.org!aioe.org!news.uzoreto.com!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!buffer1.nntp.dca1.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Tue, 20 Jul 2021 16:36:07 -0500
Message-Id: <pcmhsh-pp1.ln1@aretha.foo>
From: phayw...@alphalink.com.au (Peter 'Shaggy' Haywood)
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
Date: Tue, 20 Jul 2021 11:44:24 +1000
References: <U3iJI.61694$VU3.5811@fx46.iad>
User-Agent: KNode/0.10.9
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Lines: 183
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-oiTeyJjyDkyMfXRDLq3ZfU0Tt0ci+Ucm/VQm9LNJr7o7ZA0Kb1aOMWCLW73O/oi1747ZDe9BcPqIT7Y!z9aYQ0WBzdwLf0szrfrFPs/ghmXsst527EZa+ewgqy7KS47T6Q==
X-Complaints-To: abuse@giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 7021
 by: Peter 'Shaggy' Haywood - Tue, 20 Jul 2021 01:44 UTC

Groovy hepcat dfs was jivin' in comp.lang.c on Tue, 20 Jul 2021 02:50
am. It's a cool scene! Dig it.

> Loaded a list of words into an array.
>
> The 370103 words came from https://github.com/dwyl/english-words
> file = words_alpha.txt
>
> Tested a couple memory allocation 'strategies' during loading:
>
> 1. allocate the strlen() of each word
> 2. allocate a fixed length (len of longest word = 32)
>
> I figured getting strlen(word) each time would be slower than
> allocating a fixed amt of memory, but that wasn't the case.
>
> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
> Also, does anyone have any 'tricks' to make such a file load routine
> faster/more efficient? Thanks

That all depends on what you're actually trying to do. Reading in an
array of words is a means to some end, not an end in itself. What do
you want to do with these words?

> ===========================================================
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <ctype.h>
>
> #define maxlen 32

If the longest word is 32 letters, then maxlen should be 33 to allow
for the string's terminating null character. (Ike advised you about
allowing for this null in his response to you, but you didn't seem to
understand. See below.) The way you're reading in words, your 32 letter
word will be truncated to 31 letters, and the 32nd letter will wind up
being the next word.
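
A small sketch of the effect described above, using a deliberately tiny buffer so the split is easy to see. The function name and the use of an already-open stream are assumptions for illustration. Note that the read buffer also has to hold the newline, so a line of N letters needs N + 2 bytes to arrive in one piece. Every fragment that fgets() returns is counted as a separate line by a loop like the one in the original program.

```c
#include <stdio.h>

/* Hypothetical helper: count how many times fgets() returns data when
   reading the whole stream with a buffer of bufsize bytes (at most 64
   here). A line that does not fit, newline included, in bufsize - 1
   characters is delivered in more than one piece. */
size_t count_fgets_chunks(FILE *f, int bufsize)
{
    char buf[64];
    size_t chunks = 0;

    if (bufsize < 2 || bufsize > (int)sizeof buf)
        return 0;
    while (fgets(buf, bufsize, f) != NULL)
        chunks++;
    return chunks;
}
```

With a three-line file whose middle line has 8 letters, a buffer of 64 bytes yields 3 chunks, while a buffer of 8 bytes yields 4: the long line comes back as 7 letters, then the last letter and the newline.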

> //removes trailing isspaces and quote marks
> char *rtrim(char *str)
> {
> int len = strlen(str);
> while(len>0 && (isspace(str[len-1]) || str[len-1] == '\"')) {len--;}
> str[len] = '\0';
> return str;
> }

Why is this (above) function needed? Does your file contain white
space (other than newlines) and quotes? What about such characters at
the start of a word?
This could probably be done by some kind of script before feeding the
file to your program.

> int main(int argc, char *argv[])
> {
> //vars

A point on style: it's a bad idea to use pointless comments like this.
Comments should explain the working and reasoning for code that's not
immediately obvious. Comments that don't do that just clutter the code
and can, in many cases, make it harder to comprehend.

> int i = 0, lines = 0;

Another word or two on style here: I'd make i and lines type size_t.
The point of size_t is to represent sizes (lines, in this case), so it
is good to use it for that. And the counter (i) is used to access
elements over the length of the array, so to my mind it makes sense to
use the same type for that (though no doubt others may disagree).
Also, you aren't using i until way below here. It might be better to
declare (or at least "initialise") it just before using it.

> char word[maxlen] = "";
> char wordin[maxlen] = "";

Initialising these is pointless, since you're only overwriting them
anyhow.

> //open file, count lines
> FILE *fwords = fopen(argv[1],"r");
> while(fgets(wordin,sizeof wordin,fwords)!=NULL) {lines++;}
>
> //Geany adds a line feed to the very last line, so the true
> //line count is overstated

Now that's a better comment. :)

> lines -= 1;
>
> //mem
> char **words = malloc(sizeof(char*) * lines);

Another style point: it is bad style to hard wire the type in a
malloc() call. This can lead to problems in case the type of the thing
being allocated changes. It is better to use something like the
following:

type *ptr = malloc(sizeof *ptr * num);

That way the size is always right for the thing being allocated. I know
Ike touched on this too, but it bears repeating.
Also, always check the return from functions that can fail, like
malloc(). This is vital. Never leave out this important step.
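
As a rough sketch of both points together (the function name
alloc_word_table() is just an assumption for the example), the size is
taken from the pointer being assigned and the result is checked before
use:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for `count` string pointers. The element size comes
   from the pointer itself (sizeof *table), so it stays right if the
   element type ever changes. Returns NULL on failure or if count is 0.
   (Hypothetical helper, for illustration only.) */
char **alloc_word_table(size_t count)
{
    if (count == 0 || count > SIZE_MAX / sizeof(char *))
        return NULL;                 /* nothing to do, or size overflow */

    char **table = malloc(count * sizeof *table);
    if (table == NULL)
        return NULL;                 /* caller must handle the failure */

    for (size_t i = 0; i < count; i++)
        table[i] = NULL;             /* a partial load can be freed safely */
    return table;
}
```

The caller would then write something like
words = alloc_word_table(lines); and bail out with a message if words
is NULL.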

> //add words to array
> int memoption = atoi(argv[2]);

Always check that you have enough command line args when attempting to
use them. Again, this is vital. It's arguably more important than
checking that malloc() succeeded, because command line args usually
come from a human user; and you know how unpredictable those creatures
can be!
Also, what about argv[1]? You're not going to use that too?
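
One way to do the checking, as a sketch only: parse_args() is a made-up
name, and I'm assuming the program wants exactly a file name plus a
memory option of 1 or 2. It uses strtol() rather than atoi() so that
garbage input can be detected.

```c
#include <errno.h>
#include <stdlib.h>

/* Validate the two expected arguments: a file path and a memory option
   of 1 or 2 (the 1-or-2 rule is an assumption). Returns 0 on success,
   -1 on bad usage. (Hypothetical helper, for illustration only.) */
int parse_args(int argc, char *argv[], const char **path, int *memoption)
{
    if (argc != 3)
        return -1;                   /* wrong number of arguments */

    char *end;
    errno = 0;
    long value = strtol(argv[2], &end, 10);
    if (errno != 0 || end == argv[2] || *end != '\0')
        return -1;                   /* not a clean decimal number */
    if (value != 1 && value != 2)
        return -1;                   /* outside the assumed range */

    *path = argv[1];
    *memoption = (int)value;
    return 0;
}
```

In main() you would call this first and print a usage message when it
returns -1, before touching argv[1] or argv[2].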

> int memlen = 0, memtotal = 0;
> rewind(fwords);

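Again, check that this succeeded. rewind() returns nothing and clears
the stream's error indicator, so ferror() can't tell you; call
fseek(fwords, 0L, SEEK_SET) instead and test its return value.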

> while(fgets(wordin,sizeof wordin,fwords)!=NULL)
> {
> strcpy(word,rtrim(wordin));
>
> //different memory allocations
> if(memoption==1)
> {memlen=strlen(word);}

Now, about what I alluded to above (and Ike told you about), this is a
classic "off by one" error. To store a 5 letter word (for example) as a
string, you need 6 bytes. strlen() will return 5, meaning you need to
add 1 here. Remember, strlen() returns the length of the string *not*
including the terminating null character. You need to add 1.

> else
> {memlen=maxlen;}
>
> words[i] = malloc(sizeof(char*) * memlen);

I think Ike also mentioned the problem here. Remember as a general
rule, allocate the size of the thing being allocated, not a hard wired
type. But when you're just getting space for a string, it's okay to
just leave out the type altogether. You just want to allocate enough
bytes for the string (memlen bytes, in this case).

words[i] = malloc(memlen);

And, again, check that it succeeded.
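
Putting the last few points together, a minimal sketch might look like
this. The name dup_word() is mine, not something from your program:

```c
#include <stdlib.h>
#include <string.h>

/* Return a freshly allocated copy of `word`, or NULL if malloc() fails.
   (Hypothetical helper, for illustration only.) */
char *dup_word(const char *word)
{
    size_t size = strlen(word) + 1;  /* the letters plus the null */
    char *copy = malloc(size);       /* plain byte count, no type needed */
    if (copy == NULL)
        return NULL;
    memcpy(copy, word, size);        /* copies the null as well */
    return copy;
}
```

The loop body then shrinks to words[i] = dup_word(word); followed by a
check for NULL.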

> memtotal += memlen;
>
> strcpy(words[i],word);
>
> //printf("%d. '%s' ",i,words[i]);
> i++;
> }
>
> fclose(fwords);
> printf("\nloaded %d words\n",i-1);
> printf("\nFirst and Last words = '%s', '%s'\n",words[0],words[i-1]);
> printf("malloc option %d, total memory %d\n",memoption,memtotal);
> free(words);

Here you're freeing the overall array, but you're not freeing all the
smaller arrays (strings).
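
Something along these lines would release everything. It's a sketch,
and the name free_words() and its return value are my own invention:

```c
#include <stdlib.h>

/* Free each string, then the table of pointers itself. Returns how many
   non-NULL strings were released. (Hypothetical helper, for
   illustration only.) */
size_t free_words(char **words, size_t count)
{
    size_t released = 0;
    if (words == NULL)
        return 0;

    for (size_t i = 0; i < count; i++) {
        if (words[i] != NULL)
            released++;
        free(words[i]);              /* free(NULL) is harmless */
    }
    free(words);                     /* the table goes last */
    return released;
}
```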

> return(0);

You don't need the parentheses around the return value here; return is
not a function.

> }

--

----- Dig the NEW and IMPROVED news sig!! -----

-------------- Shaggy was here! ---------------
Ain't I'm a dawg!!

Re: Allocate length of word vs fixed length

<sd7ie7$3mj$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17587&group=comp.lang.c#17587

Path: i2pn2.org!i2pn.org!aioe.org!bJ9HIA9AN/keIQ+0iGyr/g.user.46.165.242.91.POSTED!not-for-mail
From: real.tr...@trolls.com (Real Troll)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Tue, 20 Jul 2021 22:15:00 +0000
Organization: Aioe.org NNTP Server
Message-ID: <sd7ie7$3mj$1@gioia.aioe.org>
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="3795"; posting-host="bJ9HIA9AN/keIQ+0iGyr/g.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US
 by: Real Troll - Tue, 20 Jul 2021 22:15 UTC

> A lot depends on what you then want to do with those words.
>
As pointed out by Stefan and others, if all you need to do is find the
largest length of the words, then a simple program such as this one will
also work.

<========================================================================================================>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    FILE *gitFile;
    char line[32];
    int max_len = 0;
    int counter = 0;

    if ((gitFile = fopen("words_alpha.txt", "r")) == NULL)
    {
        perror("Error: No such file or directory\n");
        EXIT_FAILURE;
    }
    else
    {
        while (!feof(gitFile))
        {
            fscanf(gitFile, "%s", &line);
            if (strlen(line) > max_len)
            {
                max_len = strlen(line);
            }
            counter += 1;
        }
    }
    printf("Largest word size is: %zu\n", max_len);
    printf("There are %d lines in the file \n", counter);

    return EXIT_SUCCESS;
}

<========================================================================================================>

In such a situation the speed of loading the file is not important,
because you are only reading one line at a time.

I agree this is too simplistic, but we don't know exactly what the
purpose of the program is.
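
A somewhat more defensive variant of the same idea, offered only as a
sketch (longest_word() is a made-up name). In the program above,
EXIT_FAILURE on its own line does nothing, a while (!feof()) loop
typically goes round once too often, "%s" has no width limit, and %zu
does not match an int. The version below loops on the return value of
fscanf() and keeps the lengths in size_t:

```c
#include <stdio.h>
#include <string.h>

/* Scan whitespace-separated words from fp. Stores the longest length
   seen and the number of words read. The %31s width keeps fscanf()
   inside the 32-byte buffer, so longer words get split. Returns 0 on
   success, -1 on a read error. (Hypothetical helper, for illustration
   only.) */
int longest_word(FILE *fp, size_t *max_len, size_t *count)
{
    char word[32];
    *max_len = 0;
    *count = 0;

    /* fscanf() returns 1 only when it actually stored a word */
    while (fscanf(fp, "%31s", word) == 1) {
        size_t len = strlen(word);
        if (len > *max_len)
            *max_len = len;
        (*count)++;
    }
    return ferror(fp) ? -1 : 0;
}
```

The caller opens the file, returns EXIT_FAILURE if fopen() gives NULL,
and prints both results with %zu.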

Re: Allocate length of word vs fixed length

<qNJJI.18502$0N5.3707@fx06.iad>

https://www.novabbs.com/devel/article-flat.php?id=17588&group=comp.lang.c#17588

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!peer01.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx06.iad.POSTED!not-for-mail
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
<sd7ie7$3mj$1@gioia.aioe.org>
From: nos...@dfs.com (dfs)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <sd7ie7$3mj$1@gioia.aioe.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Lines: 127
Message-ID: <qNJJI.18502$0N5.3707@fx06.iad>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Wed, 21 Jul 2021 00:22:46 UTC
Organization: blocknews - www.blocknews.net
Date: Tue, 20 Jul 2021 20:22:46 -0400
X-Received-Bytes: 4639
 by: dfs - Wed, 21 Jul 2021 00:22 UTC

On 7/20/21 6:15 PM, Real Troll wrote:
>
>> A lot depends on what you then want to do with those words.
>>
> As pointed by Stefan and others, If all you need to do is to find the
> largest length of the words then a simple program such as this one will
> also work.

I found the max word length of 31 before I posted, but I took the code out
----------------------------------------------------------
int wordlen = 0, maxwlen = 0;
while(fgets(wordin,sizeof wordin,fwords)!=NULL)
{ wordlen = strlen(rtrim(wordin));
if(wordlen > maxwlen) {maxwlen = wordlen;printf("%d ",maxwlen);}
lines++;
} printf("Max word len = %d\n",maxwlen);
----------------------------------------------------------

> <========================================================================================================>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main()
> {
>     FILE *gitFile;
>     char line[32];
>     int max_len = 0;
>     int counter = 0;
>
>     if ((gitFile = fopen("words_alpha.txt", "r")) == NULL)
>     {
>         perror("Error: No such file or directory\n");
>         EXIT_FAILURE;
>     }
>     else
>     {
>         while (!feof(gitFile))
>         {
>             fscanf(gitFile, "%s", &line);
>             if (strlen(line) > max_len)
>             {
>                 max_len = strlen(line);
>             }
>             counter += 1;
>         }
>     }
>     printf("Largest word size is: %zu\n", max_len);
>     printf("There are %d lines in the file \n", counter);
>
>     return EXIT_SUCCESS;
> }
>
> <========================================================================================================>
>
> In such a situation speed of loading a file is not important because you
> are trying to read each line one at a time.
>
> I agree this is too simplistic but we don't know what exactly is the
> purpose of the program.

A simple linear search of the array to do word search/count/filter when
you enter letter(s).

It's partly working.

Enter a value: z
Found 1386 matches in 0.004709 seconds

Enter a value : x
Found 507 matches in 0.009121 seconds

I say partly because for now it just matches on the first letter you
enter.

Enter a value: cat
That will return all words beginning with c, not just words beginning
with cat.
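
Matching on everything typed, rather than just the first letter, could
be done with strncmp(). This is only a sketch of that idea, and
count_prefix_matches() is a hypothetical name, not the code in my
program:

```c
#include <stddef.h>
#include <string.h>

/* Count the words that begin with `prefix`, comparing only the first
   strlen(prefix) characters of each word. (Hypothetical helper, for
   illustration only.) */
size_t count_prefix_matches(char *const words[], size_t count,
                            const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t matches = 0;

    for (size_t i = 0; i < count; i++) {
        if (strncmp(words[i], prefix, plen) == 0)
            matches++;
    }
    return matches;
}
```

With the words car, cat, catalog, cot and dog, the prefix "cat" matches
2 and the prefix "c" matches 4.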

There are a boatload of string search algorithms, but 'brute force' is
good for now
//http://www-igm.univ-mlv.fr/~lecroq/string/index.html

I saw your other post on timing. I run an 11-year-old CPU
(i5-750@2.67GHz) that's nothing special, but all my C code just screams
here on Linux. Usually compiled with the standard:
gcc -Wall source -o executable

My timing is done like this:
----------------------------------------------------------------------
struct timespec start,stop;

double elapsedtime(struct timespec started)
{ const double B = 1e9;
clock_gettime(CLOCK_MONOTONIC_RAW,&stop);
return (stop.tv_sec-started.tv_sec)+
(stop.tv_nsec-started.tv_nsec)/B;
}

inside main()

clock_gettime(CLOCK_MONOTONIC_RAW, &start);
.... code to search word array
printf ("\nFound %d matches in %f secs\n",matches,elapsedtime(start));
----------------------------------------------------------------------
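
The arithmetic on its own, without the global stop variable, could be
sketched like this. seconds_between() is a made-up name, and it takes
both timestamps as arguments:

```c
#include <time.h>

/* Seconds from `begin` to `end` as a double. The nanosecond difference
   may be negative; adding it to the whole-second difference still gives
   the right total. (Hypothetical helper, for illustration only.) */
double seconds_between(struct timespec begin, struct timespec end)
{
    return (double)(end.tv_sec - begin.tv_sec)
         + (double)(end.tv_nsec - begin.tv_nsec) / 1e9;
}
```

For example, from 1.9 s to 3.1 s the parts are 2 s and -0.8 s, which
add up to 1.2 s.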

I put a sleep(3) in to test it:

Enter a value : z
Found 1386 matches in 3.006703 seconds

Put your CLOCKS_PER_SEC timing in my code and got:

Enter a value : a
Found 25416 matches in 0.005890 seconds

Re: Allocate length of word vs fixed length

<Iw5KI.24867$bR5.9748@fx44.iad>

https://www.novabbs.com/devel/article-flat.php?id=17616&group=comp.lang.c#17616

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx44.iad.POSTED!not-for-mail
Subject: Re: Allocate length of word vs fixed length
Newsgroups: comp.lang.c
References: <U3iJI.61694$VU3.5811@fx46.iad>
<allocation-20210719204511@ram.dialup.fu-berlin.de>
<sd7ie7$3mj$1@gioia.aioe.org> <qNJJI.18502$0N5.3707@fx06.iad>
From: nos...@dfs.com (dfs)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <qNJJI.18502$0N5.3707@fx06.iad>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Lines: 449
Message-ID: <Iw5KI.24867$bR5.9748@fx44.iad>
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Thu, 22 Jul 2021 03:23:20 UTC
Organization: blocknews - www.blocknews.net
Date: Wed, 21 Jul 2021 23:23:20 -0400
X-Received-Bytes: 11163
 by: dfs - Thu, 22 Jul 2021 03:23 UTC

On 7/20/21 8:22 PM, dfs wrote:
> On 7/20/21 6:15 PM, Real Troll wrote:

>> I agree this is too simplistic but we don't know what exactly is the
>> purpose of the program.
>
> A simple linear search of the array to do word search/count/filter when
> you enter letter(s).

That's what it started out as, but HGH took over and it bloated to the
below 450 lines.

Give it a try, please, and see if you can break it. There are a few
TODOs in there, so it's not quite done.

Use whatever word list text file you want. Most of my testing was done
with words_alpha.txt from https://github.com/dwyl/english-words

----------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <ctype.h> //used with tolower(val)
#include <sys/ioctl.h> //used with finding terminal width
#include <sys/resource.h> //used with finding terminal width
#include <unistd.h> //used with finding terminal width

#define maxlen 32

//removes trailing isspaces and quote marks
char *rtrim(char *str)
{ int len = strlen(str);
while(len>0 && (isspace(str[len-1]) || str[len-1] == '\"')) {len--;}
str[len] = '\0';
return str;
}

//timing
struct timespec start,stop;
double elapsedtime(struct timespec started)
{ const double B = 1e9;
clock_gettime(CLOCK_MONOTONIC_RAW,&stop);
return (stop.tv_sec-started.tv_sec)+
(stop.tv_nsec-started.tv_nsec)/B;
}

//width of screen - gets checked/set each time summary is run
int tcols = 50;

//set value of screen width in characters
void settcols()
{ struct winsize w = {0};
ioctl(STDOUT_FILENO, TIOCGWINSZ, &w);
tcols = w.ws_col;
}

//string compare function for qsort
int comparechar(const void *a, const void *b) {
return strcmp(*(char **) a, *(char **) b);
}

//print a separator line on screen
void printline(int linewidth, char *linechar)
{ for(int i=0;i<linewidth;i++) { printf("%s",linechar); } //never pass data as the format string
printf("\n");
}

void countdupes(char *words[],int linecnt)
{ //each time word matches prev word increase dupe cnt
int dupecnt = 0;
for(int i=0;i<linecnt-1;i++)
{
for (int j=i+1;j<linecnt;j++)
{
if (strcmp(words[j],words[i])==0)
{ dupecnt++;
//printf("%d. %s-%s (%d)\n",i, words[i],words[j], dupecnt);
}
else
{i=j-1;break;}
}
}
printf("%d dupes\n",dupecnt);
}

char *getmode(char *words[])
{ return "three words appear 4 times";
}

void printsummary(int linecnt, char *words[], int maxwlen, int
countarr[], char *infile)
{ settcols(); //set # of columns visible onscreen
printf("\n");
printline(tcols*.8,"=");
printf("Summary of %s\n",infile);
printline(tcols*.8,"=");
printf("%d words\n",linecnt);
countdupes(words,linecnt);
printf("\nFirst word is '%s'\n",words[0]);
printf("Last word is '%s'\n",words[linecnt-1]);
printf("Longest word is %d letters\n\n",maxwlen);
printf("Mean = \n");
printf("Median = %s\n",words[(int)linecnt/2]);
printf("Mode = %s\n\n",getmode(words));
//output data by rows and columns
int a=0,rows=5,cols=6;
//count words by length
int matches = 0;
int cnts[31]={0};
for(int i=1;i<=maxwlen;i++)
{
for(int j=0;j<linecnt;j++)
{
if(strlen(words[j]) == i) {matches++;}
}
//printf("%d. %d\n",i,matches);
cnts[i-1] = matches;
matches = 0;
}
printf("Word counts by length\n");
a=0;
for(int r=0;r<=rows;r++)
{
for (int c=0;c<=cols;c++)
{
if(a<maxwlen) {printf("%2d. %5d ",a+1,cnts[a]);}
a++;
}
printf("\n");
}

//count words by first letter
printf("\nWord counts by first letter\n");
a=0;
for(int r=0;r<=rows;r++)
{
for (int c=0;c<=cols;c++)
{
if(a<26) {printf("%c. %5d ",a+97,countarr[a]);}
a++;
}
printf("\n");
}

//count frequency of letters across all words
int freq[26]={0};
for(int i=0;i<linecnt;i++)
{
for(int j=0;j<strlen(words[i]);j++)
{
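//only count a-z so other characters can't index outside freq[]
if(words[i][j]>='a' && words[i][j]<='z') {freq[words[i][j]-'a']++;}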
}
}

//copy array for use with descending frequency output
int freq2[26]={0};
memcpy(freq2,freq,sizeof(freq2));
//sort counts descending
int n=26;
for (int i = 0; i < n; ++i)
{ for (int j = i + 1; j < n; ++j)
{ if (freq[i] < freq[j])
{ a = freq[i];
freq[i] = freq[j];
freq[j] = a;
}
}
}

//output letter frequency in descending order
//TODO: if multiple letters have the same frequency the code prints
// the 1st letter over and over. Need to resolve ties and print
// letters in order
printf("Descending frequency counts\n");
a=0;
for(int r=0;r<=rows;r++)
{ for (int c=0;c<=cols;c++)
{ if(a<26)
{ for(int s=0;s<26;s++) //26 elements; sizeof(freq2) is the size in bytes
{ if(freq2[s]==freq[a])
{ {printf("%c. %6d ",s+97,freq[a]);}
break;
}
}
}
a++;
}
printf("\n");
}
printline(tcols*.8,"=");
}

int main(int argc, char *argv[])
{
//vars
char word[maxlen] = "";
char wordin[maxlen] = "";
int countarr[26]={0};
static char *offon[]= {"off","on"};
char fcmd[37];
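//open file, count lines, get max word length
if(argc < 2) {printf("usage: %s wordfile\n",argv[0]);exit(EXIT_FAILURE);}
char *filein = argv[1];
FILE *fwords = fopen(filein,"r");
if(fwords == NULL) {perror(filein);exit(EXIT_FAILURE);}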
int lines=0, blanks=0, wordlen=0, maxwlen=0;
while(fgets(wordin,sizeof wordin,fwords)!=NULL)
{
wordlen = strlen(rtrim(wordin));
if (wordlen>0) {
if(wordlen > maxwlen) {maxwlen = wordlen;}
lines++;
}
else
{blanks++;}
}
//printf("%d lines, including %d blanks\n",lines+blanks,blanks);

//mem
char **words = malloc(sizeof(char*) * lines);
if(words == NULL) {printf("malloc failed\n");exit(EXIT_FAILURE);}


// load word list into array
rewind(fwords);
int i=0;
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
while(fgets(wordin,sizeof wordin,fwords)!=NULL)
{
strcpy(word,rtrim(wordin));
if(strlen(word)>0) {
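words[i] = malloc(strlen(word) + 1); //bytes for the letters plus the null
if(words[i] == NULL) {printf("malloc failed\n");exit(EXIT_FAILURE);}
strcpy(words[i],word);
//only count a-z so other first characters can't index outside countarr[]
if(word[0]>='a' && word[0]<='z') {countarr[word[0]-'a']++;}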
i++;
}
}
printf ("\nLoaded %d words in %.3f seconds\n\n",i,elapsedtime(start));
//close file
fclose(fwords);
//sort the array
qsort(words, lines, sizeof(char*), comparechar);

//program options:
int prt=0,lsch=0,sub=0,wsz=0,dic=0,timing=0;
char opt;
menu:
printf("\nMenu \n\n");
printf(" -l search by start of word\n");
printf(" -b search by substring\n");
printf(" -n find words of length L\n");
printf(" -s summary of word list\n");
printf(" -d definitions\n");
printf(" -p print results to screen (%s)\n",offon[prt]);
printf(" -m print this menu\n");
printf(" -t show search times (%s)\n",offon[timing]);
printf(" -x exit program\n");

//capture keyboard input
char str[32] = ""; //start empty so the first strcmp() reads a valid string
while (strcmp(str,"-x")!=0)
{
//startup
search:
if(lsch==0&&sub==0&&wsz==0&&dic==0)
{printf("\nEnter -option to start: ");}

if(lsch) {printf("\nEnter letters to search for: ");}
if(sub) {printf("\nEnter substring to search for: ");}
if(wsz) {printf("\nEnter size of word to search for: ");}
if(dic) {printf("\nEnter word to find definition: ");}
if(scanf("%31s", str)!=1) {exit(0);} //width fits str[32]; quit on end of input
//printf("'%s'",str);

if(str[0]=='-')
{
opt = tolower(str[1]);

if(opt=='x') {exit(0);}

//search for letters at beginning
if(opt=='l')
{if(lsch==0) {lsch=1;sub=0;wsz=0;dic=0;}}

//search for substring anywhere in word
if(opt=='b')
{if(sub==0) {sub=1;lsch=0;wsz=0;dic=0;}}

//search for words of a size
if(opt=='n')
{
if(wsz==0) {wsz=1;lsch=0;sub=0;dic=0;}
printf("\nlook for words of size 1 to %d",maxwlen);
}

//print summary of imported words
if(opt=='s')
{
printsummary(lines, words, maxwlen, countarr, filein);
}

//use dict search
if(opt=='d')
{if(dic==0) {dic=1;lsch=0;sub=0;wsz=0;}}

//show menu
if(opt=='m') {goto menu;break;}

//print word search results to screen
if(opt=='p')
{
prt = (prt==0) ? 1 : 0;
printf("\nprint is %s",offon[prt]);
}


Re: Allocate length of word vs fixed length

<20210723180723.700@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=17630&group=comp.lang.c#17630

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: 563-365-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Allocate length of word vs fixed length
Date: Sat, 24 Jul 2021 01:07:53 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <20210723180723.700@kylheku.com>
References: <U3iJI.61694$VU3.5811@fx46.iad> <sd4em5$694$1@dont-email.me>
<OvjJI.64366$Vv6.35081@fx45.iad>
Injection-Date: Sat, 24 Jul 2021 01:07:53 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="64176d21b1bdcb5146d7e81335e1f4e4";
logging-data="3076"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+IIL/9Hs+mz8wacXkQoO242rHuEauDAQ4="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:6iEZhIh2b0dW6iUmOglSavcMq54=
 by: Kaz Kylheku - Sat, 24 Jul 2021 01:07 UTC

On 2021-07-19, dfs <nospam@dfs.com> wrote:
> On 7/19/21 2:01 PM, Bart wrote:
>> On 19/07/2021 17:50, dfs wrote:
>>> Loaded a list of words into an array.
>>>
>>> The 370103 words came from https://github.com/dwyl/english-words
>>> file = words_alpha.txt
>>>
>>> Tested a couple memory allocation 'strategies' during loading:
>>>
>>> 1. allocate the strlen() of each word
>>> 2. allocate a fixed length (len of longest word = 32)
>>>
>>> I figured getting strlen(word) each time would be slower than
>>> allocating a fixed amt of memory, but that wasn't the case.
>>>
>>> Strategy 2 is significantly slower, and uses nearly 3x the memory.
>
> Any thoughts on this?

It must be that malloc is decently optimized for very small object sizes?
