Message-ID:

There can be no twisted thought without a twisted molecule. -- R. W. Gerard

devel / comp.lang.tcl / I need to organize some data

Solving problems and even reinventing wheels can be fun, I've done that a
lot, but sometimes it becomes tiresome and may even become stupid
pig-headed. So I decided to ask for help again.

I need to collect the names of all files and directories in a path. We have
glob for that.

OK. Now I need to organize that data. The problem is, there may be many
ways of organizing data. Talk about skinning cats.

Maybe the user wants it all sorted by file name, alphabetically. OK. The
problem is, maybe the user wants directories listed first. Maybe the user
wants files listed first. That alone means three possible lists. Not to
mention reverse order. That's another three lists.

Then maybe the user wants it all sorted by date. OK. The problem is, maybe
the user wants newest first, maybe the user wants oldest first. That alone
means two possible lists. And again, maybe the user wants directories
listed first. Maybe the user wants files listed first. Maybe the user wants
files and directories mixed. Those combined make six possible lists. And an
additional problem. I will certainly use the Unix epoch to sort dates, but
will display a nicely formatted date string to the user later, so whatever
is used in the sorting process cannot be used in the display step.

More or less the same happens with size. I have to run calculations on raw
byte sizes before I format them as KB, MB, GB etc. After all those
permutations I mentioned before.

I've done all that. It works. The problem is, 1. It's 150 lines of code.
More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
sorts lists with strides and assumes that filename is [lindex list 0], size
is [lindex list 1], and so on. That is going to break because I want to let
the user change the position of the columns. Maybe the user wants
filename-size-date-permission-owner or maybe just filename-date-size or
maybe filename-owner-date. Who knows? So I need to be able to query the
data in random orders, according to the user's choice of column
presence and order.

So I thought maybe I can use a dict:
dict create biglist $filelist
and $filelist would be yet another dict arrangement or maybe an array with
each file and its corresponding attributes.

Then I would just have to parse that dict, formatting size and date on the
fly before displaying them.

The problem is, it's still a lot of IFs.

Another problem is, there is a very, very unexpected little obstacle in the
way. You would never guess what it is.

I use an old kernel. And there is a bug in BTRFS. One of my files is listed
twice. Every time. Same name, same size, same date, same inode. I talked
about it in the #btrfs IRC channel and they confirmed the bug. It's old,
but it exists in old kernels. If it's there, it has to be seen. I can't let
my application hide it. It's not my job to sweep file system bugs under the
carpet. Which means that, counterintuitively, I must not treat filenames as
unique. I must assume there may be duplicates, as weird as that sounds. And
they must be shown as such. You know what, I think I am going to write code
to flag that kind of situation very visibly to the user because that's the
kind of meticulous sad soul that I am. So anyway, I can't index my dict or
array structure around file names.

That is going to make the problem more difficult. In fact, I'm not sure
where to start.

Would you like to contribute any ideas? No need for code. Just broad ideas
on how to organize the data efficiently. With duplicate file names.

TIA

--
Luc
>>

Re: I need to organize some data

<c0d87b7b-d0d8-4e0b-8067-15bb4a2ca96en@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=12610&group=comp.lang.tcl#12610

copy link Newsgroups: comp.lang.tcl

X-Received: by 2002:a05:622a:1883:b0:425:7083:1460 with SMTP id v3-20020a05622a188300b0042570831460mr659664qtc.13.1703025520007;
Tue, 19 Dec 2023 14:38:40 -0800 (PST)
X-Received: by 2002:ad4:5aab:0:b0:67f:664d:c9c4 with SMTP id
u11-20020ad45aab000000b0067f664dc9c4mr175046qvg.7.1703025519745; Tue, 19 Dec
2023 14:38:39 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.tcl
Date: Tue, 19 Dec 2023 14:38:39 -0800 (PST)
In-Reply-To: <20231219180949.48884896@lud1.home>
Injection-Info: google-groups.googlegroups.com; posting-host=185.250.212.17; posting-account=99ziWAoAAADLScDbI_CyfIcpa_gFZ0zh
NNTP-Posting-Host: 185.250.212.17
References: <20231219180949.48884896@lud1.home>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c0d87b7b-d0d8-4e0b-8067-15bb4a2ca96en@googlegroups.com>
Subject: Re: I need to organize some data
From: gregor.e...@googlemail.com (greg)
Injection-Date: Tue, 19 Dec 2023 22:38:40 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4993

by: greg - Tue, 19 Dec 2023 22:38 UTC

Luc schrieb am Dienstag, 19. Dezember 2023 um 22:09:54 UTC+1:
> Solving problems and even reinventing wheels can be fun, I've done that a
> lot, but sometimes it becomes tiresome and may even become stupid
> pig-headed. So I decided to ask for help again.
>
> I need to collect the names of all files and directories in a path. We have
> glob for that.
>
> OK. Now I need to organize that data. The problem is, there may be many
> ways of organizing data. Talk about skinning cats.
>
> Maybe the user wants it all sorted by file name, alphabetically. OK. The
> problem is, maybe the user wants directories listed first. Maybe the user
> wants files listed first. That alone means three possible lists. Not to
> mention reverse order. That's another three lists.
>
> Then maybe the user wants it all sorted by date. OK. The problem is, maybe
> the user wants newest first, maybe the user wants oldest first. That alone
> means two possible lists. And again, maybe the user wants directories
> listed first. Maybe the user wants files listed first. Maybe the user wants
> files and directories mixed. Those combined make six possible lists. And an
> additional problem. I will certainly use the Unix epoch to sort dates, but
> will display a nicely formatted date string to the user later, so whatever
> is used in the sorting process cannot be used in the display step.
>
> More or less the same happens with size. I have to run calculations on raw
> byte sizes before I format them as KB, MB, GB etc. After all those
> permutations I mentioned before.
>
> I've done all that. It works. The problem is, 1. It's 150 lines of code.
> More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
> sorts lists with strides and assumes that filename is [lindex list 0], size
> is [lindex list 1], and so on. That is going to break because I want to let
> the user change the position of the columns. Maybe the user wants
> filename-size-date-permission-owner or maybe just filename-date-size or
> maybe filename-owner-date. Who knows? So I need to be able to query the
> data in random orders, according to the user's choice of column
> presence and order.
>
> So I thought maybe I can use a dict:
> dict create biglist $filelist
> and $filelist would be yet another dict arrangement or maybe an array with
> each file and its corresponding attributes.
>
> Then I would just have to parse that dict, formatting size and date on the
> fly before displaying them.
>
> The problem is, it's still a lot of IFs.
>
> Another problem is, there is a very, very unexpected little obstacle in the
> way. You would never guess what it is.
>
> I use an old kernel. And there is a bug in BTRFS. One of my files is listed
> twice. Every time. Same name, same size, same date, same inode. I talked
> about it in the #btrfs IRC channel and they confirmed the bug. It's old,
> but it exists in old kernels. If it's there, it has to be seen. I can't let
> my application hide it. It's not my job to sweep file system bugs under the
> carpet. Which means that, counterintuitively, I must not treat filenames as
> unique. I must assume there may be duplicates, as weird as that sounds. And
> they must be shown as such. You know what, I think I am going to write code
> to flag that kind of situation very visibly to the user because that's the
> kind of meticulous sad soul that I am. So anyway, I can't index my dict or
> array structure around file names.
>
> That is going to make the problem more difficult. In fact, I'm not sure
> where to start.
>
> Would you like to contribute any ideas? No need for code. Just broad ideas
> on how to organize the data efficiently. With duplicate file names.
>
> TIA
>
> --
> Luc
> >>
a idea:
https://wiki.tcl-lang.org/page/ls+-l+in+Tcl

Luc <luc@sep.invalid> wrote:
> Solving problems and even reinventing wheels can be fun, I've done that a
> lot, but sometimes it becomes tiresome and may even become stupid
> pig-headed. So I decided to ask for help again.
>
> I need to collect the names of all files and directories in a path. We have
> glob for that.
>
> OK. Now I need to organize that data. The problem is, there may be many
> ways of organizing data. Talk about skinning cats.
>
> Maybe the user wants it all sorted by file name, alphabetically. OK. The
> problem is, maybe the user wants directories listed first. Maybe the user
> wants files listed first. That alone means three possible lists. Not to
> mention reverse order. That's another three lists.

Why would that mean six lists?

Create one list of lists, with each sublist being the columns of raw
data for the files that you want to use.

Then, create a set of procs that utilize the "-command" interface to
lsort. One proc sorts by name, one proc by date, one proc by whatever
else.

Then when the user changes the sort order, just resort the single
nested list using the appropriate "sort type" proc.

> Then maybe the user wants it all sorted by date. OK. The problem is, maybe
> the user wants newest first, maybe the user wants oldest first. That alone
> means two possible lists. And again, maybe the user wants directories
> listed first. Maybe the user wants files listed first. Maybe the user wants
> files and directories mixed. Those combined make six possible lists. And an
> additional problem. I will certainly use the Unix epoch to sort dates, but
> will display a nicely formatted date string to the user later, so whatever
> is used in the sorting process cannot be used in the display step.

Internal storage format and external viewing format are two different
things, and do not (and often are not) the same format. The view is
derived from the internal data by formatting it as it needs to be
formatted for the view.

> I've done all that. It works. The problem is, 1. It's 150 lines of code.
> More IFs than ChatGPT. It's big and awkward. 2. I did it in a way that
> sorts lists with strides and assumes that filename is [lindex list 0], size
> is [lindex list 1], and so on. That is going to break because I want to let
> the user change the position of the columns.

lsort has the -index option just for sorting a list of lists.

And again, the internal storage format is not the same as display format.

Internal storage format is the same, always. That keeps the code
simpler.

If the user swaps around columns then you display the columns in a
different order to create the view, but the column order in the internal
format does not change.

> Another problem is, there is a very, very unexpected little obstacle in the
> way. You would never guess what it is.
>
> I use an old kernel. And there is a bug in BTRFS. One of my files is listed
> twice. Every time. Same name, same size, same date, same inode. I talked
> about it in the #btrfs IRC channel and they confirmed the bug. It's old,
> but it exists in old kernels. If it's there, it has to be seen. I can't let
> my application hide it. It's not my job to sweep file system bugs under the
> carpet. Which means that, counterintuitively, I must not treat filenames as
> unique. I must assume there may be duplicates, as weird as that sounds. And
> they must be shown as such. You know what, I think I am going to write code
> to flag that kind of situation very visibly to the user because that's the
> kind of meticulous sad soul that I am. So anyway, I can't index my dict or
> array structure around file names.

Nope, so just don't. Use a nested list of lists, which will store
duplicates with no problem.

Note that if you store the "filename" separated from the "path to the
file" then you have to be able to support duplicate filenames anyway,
even without a "bug", because two sub directories could have the
identical filename inside.

On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:

>Note that if you store the "filename" separated from the "path to the
>file" then you have to be able to support duplicate filenames anyway,
>even without a "bug", because two sub directories could have the
>identical filename inside.

I don't really understand what you mean here because in my case there
won't be any subdirectories. Each "view" contemplates the flat listing
inside a directory. There is no recursive action here whatsoever.

There will be recursive actions when I get to the stage of copying,
moving and deleting, but I'm not dealing with any sharp edges yet.

Anyway, let's see the new problem I have now:

>Create one list of lists, with each sublist being the columns of raw
>data for the files that you want to use.
>
>Then, create a set of procs that utilize the "-command" interface to
>lsort. One proc sorts by name, one proc by date, one proc by whatever
>else.
>
>Then when the user changes the sort order, just resort the single
>nested list using the appropriate "sort type" proc.

I never used the -command option before so I had to look it up.

Ugh! Why are so many examples in the manual so opaque?

% proc compare {a b} {
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
return -1
} elseif {$a0 > $b0} {
return 1
}
return [string compare [lindex $a 1] [lindex $b 1]]
}

% lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

I copied and changed and ran that many times trying to understand what in
the world is going on here, but I can't. How does the comparison work?

Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
{2 banana}?

Does it stride in leaps of 2 because two arguments were given (a and b)?

But that doesn't make sense. I can't find a hypothesis that makes sense.
I can't see the logic.

I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?

Holy mackerew. I need to dissect this thing:

proc compare {a b} {
puts "begin: a is $a b is $b"
set a0 [lindex $a 0]
set b0 [lindex $b 0]
if {$a0 < $b0} {
puts "R1 $a0 < $b0, returning -1"
return -1
} elseif {$a0 > $b0} {
puts "R2 $a0 > $b0, returning 1"
return 1
}
puts "two tests failed with $a0 vs. $b0, going to the third"
puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
return [string compare [lindex $a 1] [lindex $b 1]]
} puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]

Output:

begin: a is 3 apple b is 0x2 carrot
R2 3 > 0x2, returning 1
begin: a is 1 dingo b is 2 banana
R1 1 < 2, returning -1
begin: a is 0x2 carrot b is 1 dingo
R2 0x2 > 1, returning 1
begin: a is 0x2 carrot b is 2 banana
two tests failed with 0x2 vs. 2, going to the third
R3 carrot banana, returning 1 for carrot and banana
{1 dingo} {2 banana} {0x2 carrot} {3 apple}

So I see the first pair was compared to the second pair, then the third
pair was compared to the fourth pair... OK...

Then the second pair is compared to the fourth pair... Why?

And the results of all the comparisons were 1, -1, 1 and 1. OK.
And what does that mean?

Sorry, but I really can't understand the shuffle.

--
Luc
>>

Luc <luc@sep.invalid> wrote:
> On Tue, 19 Dec 2023 22:49:37 -0000 (UTC), Rich wrote:
>
>>Note that if you store the "filename" separated from the "path to the
>>file" then you have to be able to support duplicate filenames anyway,
>>even without a "bug", because two sub directories could have the
>>identical filename inside.
>
>
> I don't really understand what you mean here because in my case there
> won't be any subdirectories. Each "view" contemplates the flat listing
> inside a directory. There is no recursive action here whatsoever.

I had no idea whether you were building a recursive view or not. But
if you /were/ building one, and splitting the name away from the rest
of the path, then with recursion you can (more like 'will') have
identical "names" (just on different paths).

>>Create one list of lists, with each sublist being the columns of raw
>>data for the files that you want to use.
>>
>>Then, create a set of procs that utilize the "-command" interface to
>>lsort. One proc sorts by name, one proc by date, one proc by whatever
>>else.
>>
>>Then when the user changes the sort order, just resort the single
>>nested list using the appropriate "sort type" proc.
>
>
> I never used the -command option before so I had to look it up.
>
> Ugh! Why are so many examples in the manual so opaque?
>
>
> % proc compare {a b} {
> set a0 [lindex $a 0]
> set b0 [lindex $b 0]
> if {$a0 < $b0} {
> return -1
> } elseif {$a0 > $b0} {
> return 1
> }
> return [string compare [lindex $a 1] [lindex $b 1]]
> }
>
> % lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}
> {1 dingo} {2 banana} {0x2 carrot} {3 apple}

There's nothing opaque above.

> I copied and changed and ran that many times trying to understand what in
> the world is going on here, but I can't. How does the comparison work?

Note, I don't know what exact algorithm lsort uses, but in a general
sense, what lsort does is it traverses the list, comparing pairs of
list elements as it goes, and decides whether to swap them or not,
based upon the return value of the '-command' proc. This may not match
reality, but you should get the idea.

In the example, the 'list' is more easily viewed if formatted this way
(with list index values on the left):

{
0: {3 apple}
1: {0x2 carrot}
2: {1 dingo}
3: {2 banana}
}

> Is {3 apple} compared to {0x2 carrot} then {1 dingo} compared to
> {2 banana}?

No, 2 is compared to 0x2. The list is four elements, each a string
that is also a valid list. As lsort's algorithm chooses pairs to
compare, it calls -command, and appending two list elements as
parameters to the command.

Inside the compare proc:

> % proc compare {a b} {
> set a0 [lindex $a 0]
> set b0 [lindex $b 0]
> if {$a0 < $b0} {
> return -1
> } elseif {$a0 > $b0} {
> return 1
> }
> return [string compare [lindex $a 1] [lindex $b 1]]
> }

We have a parameter variable a and parameter variable b. The first
thing compare does is extract the zero'th element of a and of b (so the
3 in "3 apple" or the 1 in {1 dingo}. Then it checks if a0 is smaller
than b0, if yes, it returns -1 (if you read the lsort man page you'll
see that the expected return values from the -command is negative if a
is smaller than b, zero of a equals b, and positive if a is larger than
b. So the first if clause checks for a smaller than b and returns -1
if so. Then it checks for a larger than b, returning 1 if that is the
case. All that is left is a equal to b, and for that case it checks
the second element of a and b (the word) as a tie breaker and returns
-1, 0, or 1 again, this time using string compare.

> Does it stride in leaps of 2 because two arguments were given (a and b)?

lsort calls compare with two elements, which two depend upon what exact
sort algorithm it is that lsort uses.

> I guess {3 apple} is compared with {0x2 carrot}. Actually, just their
> 0 indices. So 3 > 1 so return 1. OK, 1. But what is "1" here?

The documented value that -command should return when the a parameter
is larger than the b parameter. It is right there in the manpage.

> Holy mackerew. I need to dissect this thing:
>
> proc compare {a b} {
> puts "begin: a is $a b is $b"
> set a0 [lindex $a 0]
> set b0 [lindex $b 0]
> if {$a0 < $b0} {
> puts "R1 $a0 < $b0, returning -1"
> return -1
> } elseif {$a0 > $b0} {
> puts "R2 $a0 > $b0, returning 1"
> return 1
> }
> puts "two tests failed with $a0 vs. $b0, going to the third"
> puts "R3 [lindex $a 1] [lindex $b 1], returning [string compare [lindex $a 1] [lindex $b 1]] for [lindex $a 1] and [lindex $b 1]"
> return [string compare [lindex $a 1] [lindex $b 1]]
> }
> puts [lsort -command compare {{3 apple} {0x2 carrot} {1 dingo} {2 banana}}]
>
>
> Output:
>
> begin: a is 3 apple b is 0x2 carrot
> R2 3 > 0x2, returning 1
> begin: a is 1 dingo b is 2 banana
> R1 1 < 2, returning -1
> begin: a is 0x2 carrot b is 1 dingo
> R2 0x2 > 1, returning 1
> begin: a is 0x2 carrot b is 2 banana
> two tests failed with 0x2 vs. 2, going to the third
> R3 carrot banana, returning 1 for carrot and banana
> {1 dingo} {2 banana} {0x2 carrot} {3 apple}
>
> So I see the first pair was compared to the second pair, then the third
> pair was compared to the fourth pair... OK...
>
> Then the second pair is compared to the fourth pair... Why?

Because that's the order that lsort's algorithm requests that the
comparisons be performed.

> And the results of all the comparisons were 1, -1, 1 and 1. OK.
> And what does that mean?

1 signals to lsort that the first command parameter was larger than the
second command parameter. -1 signals the opposite. 0 signals equal,
although no equal values appeared here. The -1, 0, 1 is how the
compare proc tells lsort which item it selected to be compared is
larger (or if they are equal).

> Sorry, but I really can't understand the shuffle.

It's not a shuffle, it's a sort (although sorts and shuffles are
somewhat two sides of the same coin).

On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:

>It's not a shuffle, it's a sort (although sorts and shuffles are
>somewhat two sides of the same coin).
**************************

I guess I struggled to understand the concept because I had planted some
idea in my head that was a lot more complicated. I continually failed
to realize that it amounts to mere one on one comparisons and a very
simple 1/-1 output. Now I get it and my code has shrunk significantly
and it's working very well.

I also learned that one pass of lsort "survives" a previous pass so I
can sort the listing by size then again by directories versus files and
have everything sorted by size but with directories first. Super neat.
Once again, the way Tcl handles lists surprises me in a very good way.

Thank you.

--
Luc
>>

Luc <luc@sep.invalid> wrote:
> On Wed, 20 Dec 2023 02:41:42 -0000 (UTC), Rich wrote:
>
>>It's not a shuffle, it's a sort (although sorts and shuffles are
>>somewhat two sides of the same coin).
> **************************
>
> I guess I struggled to understand the concept because I had planted
> some idea in my head that was a lot more complicated.

Basic sorts are not terribly complicated to understand. Where the
complication arises is the algorithms that try to reduce the number of
comparisons and/or amount of data movement. One of the simpler ones to
understand (but not an efficient algorithm) is the bubble sort
<https://en.wikipedia.org/wiki/Bubble_sort>.

> I continually failed to realize that it amounts to mere one on one
> comparisons and a very simple 1/-1 output.

Yes, for sorts, the unit of comparison is pairs. The advanced
algorithms just produce sorted output while minimizing the number of
comparisons that need to be made.

> I also learned that one pass of lsort "survives" a previous pass so I
> can sort the listing by size then again by directories versus files
> and have everything sorted by size but with directories first. Super
> neat.

Yes, that is what is meant by the word "stable" in the second sentence
of the manpage on lsort. Although one needs to have been exposed to
some CS sorting theory before one realizes what meaning that word is
conveying to a reader about the algorithm.

Subject	Author
I need to organize some data	Luc
I need to organize some data	greg
I need to organize some data	Rich
I need to organize some data	Luc
I need to organize some data	Rich
I need to organize some data	Luc
I need to organize some data	Rich