aus+uk / uk.comp.os.linux / Re: Pipe cleanup of text - help needed

https://www.novabbs.com/aus+uk/article-flat.php?id=298&group=uk.comp.os.linux#298

Path: i2pn2.org!i2pn.org!aioe.org!F7FIqN6dkowTZ1CLxZIWTQ.user.46.165.242.75.POSTED!not-for-mail
From: Pau...@f1.n221.z2.fidonet.fi (Paul)
Newsgroups: uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Sun, 01 Aug 2021 21:16:23 +0200
Organization: rbb soupgate
Message-ID: <3037499139@f1.n221.z2.fidonet.fi>
References: <772741015@f0.n0.z0.fidonet.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="19565"; posting-host="F7FIqN6dkowTZ1CLxZIWTQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Comment-To: All
X-Notice: Filtered by postfilter v. 0.9.2
X-MailConverter: SoupGate-OS/2 v1.20

Java Jive wrote:
> I have an archive of scanned documents which I need to index. A typical
> sample output of ls is appended. I want to clean this up so that only
> the first and last of each section are output, separated by a single
> line containing just '...'. Can anyone suggest a way of doing this by
> piping the output through awk or sed on the fly, rather than having to
> write a program to post-process the index?
>
> Desired:
>
> Family History/Unknown/Unknown Person's Notebook:
> Unknown Person's Notebook - 01.png
> ...
> Unknown Person's Notebook - 33.png
> Unknown Person's Notebook - End 0.png
> ...
> Unknown Person's Notebook - End 5.png

Awk (Gawk) can store things in arrays.

For example, in awk I can reverse the order of lines in
a text file: a file with lines 1..10 can be emitted in
order 10..1. This requires an in-memory array, which
grows as the file (or piped input) is read; the array
is then dumped in the END block of the program.
In such a situation, a 10GB text file cannot be processed
on a 2GB RAM machine. "A person has to know their limits."
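
As a quick sketch, the buffer-then-dump-in-END trick
looks like this (the reverse function name is just for
illustration):

```shell
# Reverse the lines of input: store every line in an array as it
# arrives, then print the array back out in reverse order in the
# END block, after the whole input has been read into memory.
reverse() {
    awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }'
}

printf 'one\ntwo\nthree\n' | reverse
# prints: three, two, one (each on its own line)
```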

We might also have to decide what to do about

Unknown Person's Notebook - 1.png
...
Unknown Person's Notebook - 33.png

or the multiple-iterator case (which is "easy" from
a sorting perspective, but how do we know which
iterator is the least significant one?). Maybe the
controlling iterator is the one on the right.

Unknown 01 Person's Notebook - 01.png
Unknown 01 Person's Notebook - 02.png
Unknown 02 Person's Notebook - 01.png
Unknown 02 Person's Notebook - 02.png
Unknown 03 Person's Notebook - 01.png
Unknown 03 Person's Notebook - 02.png

output:

Unknown 01 Person's Notebook - 01.png Group 01
...
Unknown 03 Person's Notebook - 01.png
Unknown 01 Person's Notebook - 02.png Group 02
...
Unknown 03 Person's Notebook - 02.png

You could scan for digits from the right, and
assume the operator is logically minded. Or something.
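
"Scanning for digits from the right" might look like
this — strip any trailing non-digits (the extension),
then strip everything before the last run of digits,
leaving the rightmost number as the grouping key (the
function name is just for illustration):

```shell
# Extract the rightmost run of digits from each line, on the
# assumption that the least-significant iterator is the number
# furthest to the right (e.g. the page number before ".png").
rightmost_number() {
    awk '{ s = $0
           sub(/[^0-9]*$/, "", s)    # drop trailing non-digits (".png")
           sub(/.*[^0-9]/, "", s)    # drop everything before the last digits
           print s }'
}

echo "Unknown 02 Person's Notebook - 07.png" | rightmost_number
# prints: 07
```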

The version of Gawk I traditionally use only knows
ASCII. I don't know what the latest evolution is in
terms of, say, UTF-8. Part of the problem is the
assumption that a character is one byte wide: what does
a Gawk program do when characters are variable width?
One side effect is that the runtime could be considerably
slower. Or the in-memory array representation could be
"very inefficient" and four times larger than normal,
rather like how some image-editing programs now use
absurdly wide internal representations.
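
You can see the byte-versus-character question directly
with length(). In the C locale awk counts bytes, so a
two-byte UTF-8 character inflates the count (a UTF-8
locale with a multibyte-aware gawk would say 5 here):

```shell
# "héllo" is five characters but six bytes in UTF-8 ("é" is two
# bytes). Forcing the C locale makes awk count bytes.
printf 'héllo' | LC_ALL=C awk '{ print length($0) }'
# prints: 6
```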

The first part of any program is "a complete specification".
The effort to write a program goes up exponentially
if the specification is "dribbling in". For example,
one of my attempts to tame some ls -R output ran into
character-set problems, and my solution at the time was
to delete the offending files ("save as web page complete"
was the source of the bad file names).

Awk can store the entire input in memory, if you want it to.
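
To show what I mean, here is a sketch — not a finished
program, and it assumes exactly one trailing number per
name — that collapses each run of consecutive lines
differing only in their final number, keeping the first
and last with "..." between them (the collapse function
name is just for illustration):

```shell
# Group consecutive lines by their "stem" (the line with its last
# number and anything after it removed). Print the first and last
# line of each run, with "..." between them when the run has more
# than two members. Headers and lines with no number pass through.
collapse() {
    awk '
    function stem(s) { sub(/[0-9]+[^0-9]*$/, "", s); return s }
    function flush() {
        if (n == 1) print first
        else if (n >= 2) { print first; if (n > 2) print "..."; print last }
        n = 0
    }
    {
        k = stem($0)
        if (n && k == prev) { last = $0; n++ }
        else { flush(); first = $0; n = 1 }
        prev = k
    }
    END { flush() }'
}

printf '%s\n' \
    "Family History/Unknown/Unknown Person's Notebook:" \
    "Unknown Person's Notebook - 01.png" \
    "Unknown Person's Notebook - 02.png" \
    "Unknown Person's Notebook - 33.png" \
    "Unknown Person's Notebook - End 0.png" \
    "Unknown Person's Notebook - End 1.png" \
    "Unknown Person's Notebook - End 5.png" | collapse
```

On the sample above that prints the header line, then
01 / "..." / 33, then End 0 / "..." / End 5 — close to
the desired output in the original post. It only keeps
one line of state at a time, so unlike the reversal
trick it doesn't need the whole input in memory.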

*******

I'll offer these two.

find /media/FOREIGN -type d -exec ls -al -1 -d {} + > dirlist.txt
find /media/FOREIGN -type f -exec ls -al -1 {} + > filelist.txt

The "dirlist" is a succinct summary, with less detail
than you would like.

But it also didn't require writing a program.

Paul


aus+uk / uk.comp.os.linux / Re: Pipe cleanup of text - help needed

1
server_pubkey.txt

rocksolid light 0.9.81
clearnet tor