Re: Pipe cleanup of text - help needed

Subject	Author
Pipe cleanup of text - help needed	Java Jive
Re: Pipe cleanup of text - help needed	Grant Taylor
Re: Pipe cleanup of text - help needed	Paul
Re: Pipe cleanup of text - help needed	William Unruh
Re: Pipe cleanup of text - help needed	Java Jive
Re: Pipe cleanup of text - help needed	Java Jive
Re: Pipe cleanup of text - help needed	Paul
Re: Pipe cleanup of text - help needed	Java Jive
Re: Pipe cleanup of text - help needed	Martin Gregorie
Re: Pipe cleanup of text - help needed	Paul
Re: Pipe cleanup of text - help needed	Java Jive
Sorting in GAWK (Was: Pipe cleanup of text - help needed)	Kenny McCormack

Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!aioe.org!cZqwowLvj+kmPGD0sEQEAQ.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Pipe cleanup of text - help needed
Date: Sun, 1 Aug 2021 18:02:49 +0100
Organization: Aioe.org NNTP Server
Message-ID: <se6k3q$1u0d$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="63501"; posting-host="cZqwowLvj+kmPGD0sEQEAQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
X-Mozilla-News-Host: news://nntp.aioe.org:119
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB

by: Java Jive - Sun, 1 Aug 2021 17:02 UTC

I have an archive of scanned documents which I need to index. A typical
sample output of ls is appended. I want to clean this up so that only
the first and last of each section are output, separated by a single
line containing just '...'. Can anyone suggest a way of doing this by
piping the output through awk or sed on the fly, rather than having to
write a program to post-process the index?

Desired:

Family History/Unknown/Unknown Person's Notebook:
Unknown Person's Notebook - 01.png
...
Unknown Person's Notebook - 33.png
Unknown Person's Notebook - End 0.png
...
Unknown Person's Notebook - End 5.png
Unknown Person's Notebook - Insert 00a.png
Unknown Person's Notebook - Insert 00b.png
Unknown Person's Notebook - Insert 01.png
Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
Unknown Person's Notebook - Insert 03 - 17800208.png
Unknown Person's Notebook - Insert 04.png
Unknown Person's Notebook - Insert 05.png
Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
Unknown Person's Notebook - Insert 07a.png
Unknown Person's Notebook - Insert 07b.png
Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
Unknown Person's Notebook - Insert 10.png
Unknown Person's Notebook - Insert 11a.png
Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
Unknown Person's Notebook - Insert 12.png
Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
...
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
Unknown Person's Notebook.txt

Original output for ls -1pr <etc>

Family History/Unknown/Unknown Person's Notebook:
Unknown Person's Notebook - 01.png
Unknown Person's Notebook - 02.png
Unknown Person's Notebook - 03.png
Unknown Person's Notebook - 04.png
Unknown Person's Notebook - 05.png
Unknown Person's Notebook - 06.png
Unknown Person's Notebook - 07.png
Unknown Person's Notebook - 08.png
Unknown Person's Notebook - 09.png
Unknown Person's Notebook - 10.png
Unknown Person's Notebook - 11.png
Unknown Person's Notebook - 12.png
Unknown Person's Notebook - 13.png
Unknown Person's Notebook - 14.png
Unknown Person's Notebook - 15.png
Unknown Person's Notebook - 16.png
Unknown Person's Notebook - 17.png
Unknown Person's Notebook - 18.png
Unknown Person's Notebook - 19.png
Unknown Person's Notebook - 20.png
Unknown Person's Notebook - 21.png
Unknown Person's Notebook - 22.png
Unknown Person's Notebook - 23.png
Unknown Person's Notebook - 24.png
Unknown Person's Notebook - 25.png
Unknown Person's Notebook - 26.png
Unknown Person's Notebook - 27.png
Unknown Person's Notebook - 28.png
Unknown Person's Notebook - 29.png
Unknown Person's Notebook - 30.png
Unknown Person's Notebook - 31.png
Unknown Person's Notebook - 32.png
Unknown Person's Notebook - 33.png
Unknown Person's Notebook - End 0.png
Unknown Person's Notebook - End 1.png
Unknown Person's Notebook - End 2.png
Unknown Person's Notebook - End 3.png
Unknown Person's Notebook - End 4.png
Unknown Person's Notebook - End 5.png
Unknown Person's Notebook - Insert 00a.png
Unknown Person's Notebook - Insert 00b.png
Unknown Person's Notebook - Insert 01.png
Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
Unknown Person's Notebook - Insert 03 - 17800208.png
Unknown Person's Notebook - Insert 04.png
Unknown Person's Notebook - Insert 05.png
Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
Unknown Person's Notebook - Insert 07a.png
Unknown Person's Notebook - Insert 07b.png
Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
Unknown Person's Notebook - Insert 10.png
Unknown Person's Notebook - Insert 11a.png
Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
Unknown Person's Notebook - Insert 12.png
Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 2.png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 3.png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 4.png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 5.png
Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
Unknown Person's Notebook.txt

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!tncsrv06.tnetconsulting.net!tncsrv09.home.tnetconsulting.net!.POSTED.alpha.home.tnetconsulting.net!not-for-mail
From: gtay...@tnetconsulting.net (Grant Taylor)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Sun, 1 Aug 2021 11:57:40 -0600
Organization: TNet Consulting
Message-ID: <se6ne5$n57$1@tncsrv09.home.tnetconsulting.net>
References: <se6k3q$1u0d$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Aug 2021 17:59:33 -0000 (UTC)
Injection-Info: tncsrv09.home.tnetconsulting.net; posting-host="alpha.home.tnetconsulting.net:198.18.18.251";
logging-data="23719"; mail-complaints-to="newsmaster@tnetconsulting.net"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.9.0
In-Reply-To: <se6k3q$1u0d$1@gioia.aioe.org>
Content-Language: en-US

by: Grant Taylor - Sun, 1 Aug 2021 17:57 UTC

On 8/1/21 11:02 AM, Java Jive wrote:
> I want to clean this up so that only the first and last of each
> section are output, separated by a single line containing just '...'.

It's not just /first/ and /last/ line of a group. There also seems to
be a component on a minimum number of lines. E.g. "Genealogy Of Job" is
only two lines, but you aren't inserting "..." between the first and
last member of the group.

> Can anyone suggest a way of doing this by piping the output through
> awk or sed on the fly, rather than having to write a program to
> post-process the index?

I don't see a way to do this in the 90 seconds that I've looked at it.
However I do see a thread that might be worth pulling at. Maybe someone
else, perhaps the OP, will see the next step.

I would be inclined to drop the last item (term?) from the base file name,
with the intention of turning this:

Unknown Person's Notebook - End 0
Unknown Person's Notebook - End 1
Unknown Person's Notebook - End 2
Unknown Person's Notebook - End 3
Unknown Person's Notebook - End 4
Unknown Person's Notebook - End 5

Into this:

Unknown Person's Notebook - End
Unknown Person's Notebook - End
Unknown Person's Notebook - End
Unknown Person's Notebook - End
Unknown Person's Notebook - End
Unknown Person's Notebook - End

This seems like something you could run through uniq (-c) to make a
start at finding the "duplicate" / incremental parts, i.e. the bases of
the file names.

You could probably use that as information to drive a decision to
truncate the output or not.
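
A rough sketch of that first pass, not a finished filter. It assumes the list is already sorted, that the counter is the run of digits immediately before the final extension, and that the extension should be kept so names differing only in extension are not merged. The function name series_bases is made up for illustration.

```shell
# Hypothetical helper: blank out the trailing counter (digits right before
# the final extension) and count adjacent identical results.
# Assumes sorted input with one file name per line.
series_bases() {
    sed 's/[0-9][0-9]*\(\.[^.]*\)$/\1/' | uniq -c
}

# Example input: three "End" pages and one unrelated file.
printf '%s\n' \
    'Notebook - End 0.png' \
    'Notebook - End 1.png' \
    'Notebook - End 2.png' \
    'Notebook.txt' | series_bases
```

A count of 3 or more on a line would mark a series worth abbreviating. A second pass would still be needed to print the real first and last names.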

I feel like this may need multiple passes through the input; one to
identify when things need to be abbreviated / truncated and another as
the source of the data to be abbreviated / truncated or not. This means
that it's not exactly conducive to a typical STDIN -> STDOUT like filter.

The next thing to think about is trying to leverage sed's hold space and
doing a comparison of the current line to the hold space. -- I don't
do this often enough to know how to do this. But, this probably does
have the advantage of being able to do this in a single pass.

Seeing as how this plays on comparing adjacent lines of text, it will
almost certainly be predicated on the list being sorted.

However, you can't blindly strip off the file extension (and last part
of the name). Lest you combine file-1.png, file-2.jpg, and file-3.gif.

You really seem to be talking about something that can dynamically allow
one element in a series of file (base) names to differ and
conditionally truncate them. But you don't want to truncate
file-1.{png,jpg,gif} where the base name is the same and the extension
is the only part that differs.

This seems like a non-trivial problem for simply parsing text.

--
Grant. . . .
unix || die

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Sun, 01 Aug 2021 16:16:23 -0400
Organization: A noiseless patient Spider
Lines: 92
Message-ID: <se6veo$9am$1@dont-email.me>
References: <se6k3q$1u0d$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 1 Aug 2021 20:16:24 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="76289adfd5ab025b092a33a879b5940e";
logging-data="9558"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+1GLnbR/2+gCZrXSQohJK2GtBHb5jRKmc="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:/fL0RJGQDJlF1XEpv6Qtl7o/YD8=
In-Reply-To: <se6k3q$1u0d$1@gioia.aioe.org>

by: Paul - Sun, 1 Aug 2021 20:16 UTC

Java Jive wrote:
> I have an archive of scanned documents which I need to index. A typical
> sample output of ls is appended. I want to clean this up so that only
> the first and last of each section are output, separated by a single
> line containing just '...'. Can anyone suggest a way of doing this by
> piping the output through awk or sed on the fly, rather than having to
> write a program to post-process the index?
>
> Desired:
>
> Family History/Unknown/Unknown Person's Notebook:
> Unknown Person's Notebook - 01.png
> ...
> Unknown Person's Notebook - 33.png
> Unknown Person's Notebook - End 0.png
> ...
> Unknown Person's Notebook - End 5.png

Awk (Gawk) has the ability to store things in arrays.

For example, in awk, I can reverse the order of lines in
a text file. A file with lines 1..10 can be emitted in
order 10..1. This requires the usage of an array in memory,
which grows as the file (or piped input) is acquired, then
the memory array is dumped in the END clause of the program.
In such a situation, a 10GB text file cannot be processed
by a 2GB RAM machine. "A person has to know their limits."
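
For illustration, a minimal sketch of that line-reversal idea. The function name reverse_lines and the array name line are made up here. The whole input is held in memory until END, which is the limitation described above.

```shell
# Hypothetical helper: store every input line in an awk array, then
# print the array back to front once all input has been read.
reverse_lines() {
    awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }'
}

printf '%s\n' one two three | reverse_lines
```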

We might also have to decide what to do about

Unknown Person's Notebook - 1.png
...
Unknown Person's Notebook - 33.png

or the multiple iterator case (which is "easy" from
a sorting perspective, but how do we know which
iterator is the least significant one). Maybe the
controlling iterator is the one on the right.

Unknown 01 Person's Notebook - 01.png
Unknown 01 Person's Notebook - 02.png
Unknown 02 Person's Notebook - 01.png
Unknown 02 Person's Notebook - 02.png
Unknown 03 Person's Notebook - 01.png
Unknown 03 Person's Notebook - 02.png

output:

Unknown 01 Person's Notebook - 01.png Group 01
...
Unknown 03 Person's Notebook - 01.png
Unknown 01 Person's Notebook - 02.png Group 02
...
Unknown 03 Person's Notebook - 02.png

You could scan for digits from the right, and
assume the operator is logically minded. Or something.

The version of Gawk I traditionally use, only knows
ASCII. I don't know what the latest evolution is, in terms
of, say, UTF-8. Part of the problem, is the notion of
a character being one byte wide, and what does the
Gawk program do when the characters are variable width.
One side effect, is the runtime could be considerably
slower. Or, the memory array representation could be
"very inefficient" and four times larger than normal.
Sorta like how some image editing programs now use
absurdly wide internal representations.

The first part of any program, is "a complete specification".
The effort to write the program goes up exponentially,
if the program specification is "dribbling in". For example,
one of my attempts to tame some ls -R output, ran into
character set problems. And my solution at the time, was
to delete the offending files ("save as web page complete"
was the source of the bad file names).

Awk can store the entire input in memory, if you want it to.

*******

I'll offer these two.

find /media/FOREIGN -type d -exec ls -al -1 -d {} + > dirlist.txt
find /media/FOREIGN -type f -exec ls -al -1 {} + > filelist.txt

The "dirlist" is a succinct summary, with less detail
than you would like.

But it also didn't require writing a program.

Paul

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!aioe.org!cZqwowLvj+kmPGD0sEQEAQ.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Sun, 1 Aug 2021 23:16:21 +0100
Organization: Aioe.org NNTP Server
Message-ID: <se76fm$1f7k$1@gioia.aioe.org>
References: <se6k3q$1u0d$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="48372"; posting-host="cZqwowLvj+kmPGD0sEQEAQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB

by: Java Jive - Sun, 1 Aug 2021 22:16 UTC

On 01/08/2021 18:02, Java Jive wrote:
>
> I have an archive of scanned documents which I need to index. A typical
> sample output of ls is appended. I want to clean this up so that only
> the first and last of each section are output, separated by a single
> line containing just '...'. Can anyone suggest a way of doing this by
> piping the output through awk or sed on the fly, rather than having to
> write a program to post-process the index?

Thanks Grant & Paul. To clarify:

There's no point in putting '...' in between Genealogy Of Job - 1 & 2
because there's nothing missing and it would make the index longer, not
shorter. The minimum series length that it's worthwhile for is 3.

There's only ever one iterator in operation at a time, and it's always
the last number in the filename.

So how would I truncate the current line in awk or sed, $0 in the
former, and hold it for comparison to the following lines until there's
a mismatch? I've used sed for very simple s/pattern/replace/ type
operations, but its inner workings are something of a mystery. I've
only ever done the simplest things in awk.

I can see exactly how I would write a shell program to do this with the
input read from a file dump of ls output, but I can't help feeling there
must be a better way of doing it on the fly.

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: unr...@invalid.ca (William Unruh)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Sun, 1 Aug 2021 22:28:25 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 102
Message-ID: <se7768$kj8$1@dont-email.me>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se6veo$9am$1@dont-email.me>
Injection-Date: Sun, 1 Aug 2021 22:28:25 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6669544dde06baeda06ad6948fda2f08";
logging-data="21096"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/jIxNBDNJBm08+mn9L4vFX"
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:OO+g6weclLo6TATvx/DISi74mcc=

by: William Unruh - Sun, 1 Aug 2021 22:28 UTC

On 2021-08-01, Paul <nospam@needed.invalid> wrote:
> Java Jive wrote:
>> I have an archive of scanned documents which I need to index. A typical
>> sample output of ls is appended. I want to clean this up so that only
>> the first and last of each section are output, separated by a single
>> line containing just '...'. Can anyone suggest a way of doing this by
>> piping the output through awk or sed on the fly, rather than having to
>> write a program to post-process the index?

Each section means what?

If you just want the first and last:

awk ' {T=$0; if ( NR==1) print T; }
END {if ( NR>2 ) print "..."; print T} '

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!aioe.org!cZqwowLvj+kmPGD0sEQEAQ.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Mon, 2 Aug 2021 00:14:24 +0100
Organization: Aioe.org NNTP Server
Message-ID: <se79sh$hpb$1@gioia.aioe.org>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se6veo$9am$1@dont-email.me>
<se7768$kj8$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="18219"; posting-host="cZqwowLvj+kmPGD0sEQEAQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2

by: Java Jive - Sun, 1 Aug 2021 23:14 UTC

On 01/08/2021 23:28, William Unruh wrote:
> On 2021-08-01, Paul <nospam@needed.invalid> wrote:
>> Java Jive wrote:
>>> I have an archive of scanned documents which I need to index. A typical
>>> sample output of ls is appended. I want to clean this up so that only
>>> the first and last of each section are output, separated by a single
>>> line containing just '...'. Can anyone suggest a way of doing this by
>>> piping the output through awk or sed on the fly, rather than having to
>>> write a program to post-process the index?
>
> Each section means what?
>
> If you just want the first and last:
>
> awk ' {T=$0; if ( NR==1) print T; }
> END {if ( NR>2 ) print "..."; print T} '
>
Yes, that gives me the first and last line in each directory listing, as
follows ...

~ # ls -1pR Family\ History/Unknown | awk '{T=$0; if (NR==1) print T};
END {if (NR > 2) print "..."; print T}'
Family History/Unknown:
...
Unknown Person's Notebook.txt

... which is a start and may help me by example work out the sort of
thing I want, for which many thanks. However, what I was really after
was the truncation of long lists of essentially the same filename where
only the page number at the end varies, to give something like this ...

Family History/Unknown:
Blah-blah 01
...
Blah-blah 55
Widgetry 1
...
Widgetry 6

... etc, so obviously some sort of comparison between the current line
and the previous line, or the first of the current series of similar
lines, is required.
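
A minimal single-pass sketch of that comparison, under the rules stated earlier in the thread: one counter per name, taken as the digits immediately before the final extension, and a minimum series length of 3. The function name collapse_series and the awk variables key, buf, first and last are made up for illustration, and the sketch has not been tried against the full archive.

```shell
# Hypothetical filter: collapse runs of adjacent names that differ only in
# the digits right before the final extension. Runs of "min" or more lines
# print as first / "..." / last; shorter runs print unchanged.
collapse_series() {
    awk -v min="${1:-3}" '
    function flush(   i) {
        if (n >= min) {
            print first; print "..."; print last
        } else {
            for (i = 1; i <= n; i++) print buf[i]
        }
        n = 0
    }
    BEGIN { n = 0; min = min + 0 }
    {
        key = ""
        if (match($0, /[0-9]+\.[^.]*$/)) {
            # Key = text before the counter plus the extension after it.
            rest = substr($0, RSTART)
            sub(/^[0-9]+/, "", rest)
            key = substr($0, 1, RSTART - 1) SUBSEP rest
        }
        # Lines without a counter never join a run.
        if (key == "" || key != prev) flush()
        n++
        if (n == 1) first = $0
        if (n < min) buf[n] = $0   # only short runs need every line kept
        last = $0
        prev = key
    }
    END { flush() }'
}
```

It would sit at the end of the pipe, e.g. ls -1pR "Family History/Unknown" | collapse_series. Directory headers and names such as "Insert 00a.png" have no counter directly before the extension, so they pass through untouched.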

It's late here in the UK, so I'm off to bed now, but I'll have a closer
look at your example to-morrow and see if I can extend it to do what I want.

Thanks again.

Re: Pipe cleanup of text - help needed

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Mon, 02 Aug 2021 05:00:58 -0400
Organization: A noiseless patient Spider
Lines: 166
Message-ID: <se8c8b$qqu$1@dont-email.me>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se76fm$1f7k$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 2 Aug 2021 09:00:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c9f7e7bae8da2548c63c1404c9a16433";
logging-data="27486"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1//CUHD7W5g8aRw35bkMjtmIN8xbwqgEG8="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:UJ2lifSQhRiG8MCUqa22DYxVLfg=
In-Reply-To: <se76fm$1f7k$1@gioia.aioe.org>

by: Paul - Mon, 2 Aug 2021 09:00 UTC

Java Jive wrote:
> On 01/08/2021 18:02, Java Jive wrote:
>>
>> I have an archive of scanned documents which I need to index. A
>> typical sample output of ls is appended. I want to clean this up so
>> that only the first and last of each section are output, separated by
>> a single line containing just '...'. Can anyone suggest a way of
>> doing this by piping the output through awk or sed on the fly, rather
>> than having to write a program to post-process the index?
>
> Thanks Grant & Paul. To clarify:
>
> There's no point in putting '...' in between Genealogy Of Job - 1 & 2
> because there's nothing missing and it would make the index longer, not
> shorter. The minimum series length that it's worthwhile for is 3.
>
> There's only ever one iterator in operation at a time, and it's always
> the last number in the filename.
>
> So how would I truncate the current line in awk or sed, $0 in the
> former, and hold it for comparison to the following lines until there's
> a mismatch? I've used sed for very simple s/pattern/replace/ type
> operations, but its inner workings are something of a mystery. I've
> only ever done the simplest things in awk.
>
> I can see exactly how I would write a shell program to do this with the
> input read from a file dump of ls output, but I can't help feeling there
> must be a better way of doing it on the fly.
>

Not thoroughly tested. Will show some awk syntax, no guarantees
it meets the specs :-)

********************************* redund.awk ******************************
# howtorun
# gawk -f redund.awk inputfile.txt > outputfile.txt
# ls-like-program-piped-to | gawk -f redund.awk

# I usually put data samples inline like this, so I can stare at
# them while writing snippets for stuff.

# Unknown Person's Notebook - 01.png
# Unknown Person's Notebook - 02.png
# Unknown Person's Notebook - 03.png
# Unknown Person's Notebook - End 0.png
# Unknown Person's Notebook - End 1.png
# Unknown Person's Notebook - End 2.png
# Unknown Person's Notebook - Insert 00a.png
# Unknown Person's Notebook - Insert 00b.png
# Unknown Person's Notebook - Insert 01.png
# Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India, 30451.png
# 000001
# 000002
# 000003
# 000004.png.jpg # not compressible with the others

# Test some commands first, copy stuff from Internet, etc
# gawk '{match($0,/[0-9]{6}/);print substr($0,RSTART,RLENGTH)}' Input_file

# gawk "{match($0,/[0-9]/);print substr($0,RSTART,RLENGTH)}" Works for the first digit only

# gawk "{match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)}" Works to detect first instance

# gawk "{match($0,/[[:digit:]]+/,arr);print arr[1] " " arr[2]}" Probably gawk5 only, cannot use

# for(i=length($0);i>0;i--) x=x substr($0,i,1); A way to reverse a string, not needed

BEGIN {
FS = "." # peel off extension, if present, using $0 processing
oldok = 0
oldroot = "" # not a problem with oldok false
}

{ # check the end of $1 for digits
# match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)
# By using a field separator of ".", we must disqualify NF>2 cases like 000004.png.jpg
# split() can be used instead of FS and $0 processing, for more general programming solutions
# Since the field separator is used very little in this program, it can be "wasted" like this.

ok = (NF<=2) # boolean for string with compressible digits on end, initial determination

match($1, /[[:digit:]]+/ ) # side effect... sets RSTART RLENGTH
ok = (RSTART+RLENGTH-1 == length($1)) && ok # ok true is equal to 1, false is 0
root = substr($1,1,RSTART-1) # Empty string for filename "000001"
# print ok " " $1 " \"" root "\"" # the usual debug statement

if ( root == oldroot && ok == 1 && oldok == 1) {
cntr++
}

if ( root != oldroot || ok == 0) { # new assignment
# Check processing of stuff in buffer
if (oldok == 1) {
if (cntr > 2) {
print "..."
}
if (cntr > 1) {
print oldstr
}
# cntr = 1 has already been printed
}
cntr = 1
print $0 # opening stanza of a potential compression
}

# bookkeeping
oldroot = root
oldok = ok
oldstr = $0

# When I make doodles like the following in the source, it means I'm
# struggling with the if-then-else order and making the code
# as succinct as possible. This table started me out on the
# wrong leg, and it took a second try to make a better if-then-else

# root ok oldroot oldok cntr oldstr
# xxx yyy dump previous if oldok,cntr,oldstr, define cntr = 1, print opening line
# xxx 1 xxx 1 increment cntr

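}

END { # flush a run that is still buffered when the input ends
if (oldok == 1) {
if (cntr > 2) {
print "..."
}
if (cntr > 1) {
print oldstr
}
}
}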
********************************* end redund.awk ******************************

Paul

Re: Pipe cleanup of text - help needed

<sea1op$2nl$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=306&group=uk.comp.os.linux#306

Path: i2pn2.org!i2pn.org!aioe.org!cZqwowLvj+kmPGD0sEQEAQ.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Tue, 3 Aug 2021 01:14:16 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sea1op$2nl$1@gioia.aioe.org>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se76fm$1f7k$1@gioia.aioe.org>
<se8c8b$qqu$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="2805"; posting-host="cZqwowLvj+kmPGD0sEQEAQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2

by: Java Jive - Tue, 3 Aug 2021 00:14 UTC

On 02/08/2021 10:00, Paul wrote:
>
> Not thoroughly tested. Will show some awk syntax, no guarantees
> it meets the specs :-)
>
> ********************************* redund.awk ******************************
> # howtorun
> # gawk -f redund.awk inputfile.txt > outputfile.txt
> # ls-like-program-piped-to | gawk -f redund.awk
>
> # I usually put data samples inline like this, so I can stare at
> # them while writing snippets for stuff.
>
> # Unknown Person's Notebook - 01.png
> # Unknown Person's Notebook - 02.png
> # Unknown Person's Notebook - 03.png
> # Unknown Person's Notebook - End 0.png
> # Unknown Person's Notebook - End 1.png
> # Unknown Person's Notebook - End 2.png
> # Unknown Person's Notebook - Insert 00a.png
> # Unknown Person's Notebook - Insert 00b.png
> # Unknown Person's Notebook - Insert 01.png
> # Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India, 30451.png
> # 000001
> # 000002
> # 000003
> # 000004.png.jpg      # not compressible with the others
>
> # Test some commands first, copy stuff from Internet, etc
> #
> # gawk '{match($0,/[0-9]{6}/);print substr($0,RSTART,RLENGTH)}' Input_file
>
> # gawk "{match($0,/[0-9]/);print substr($0,RSTART,RLENGTH)}" Works for the first digit only
>
> # gawk "{match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)}" Works to detect first instance
>
> # gawk "{match($0,/[[:digit:]]+/,arr);print arr[1] " " arr[2]}" Probably gawk5 only, cannot use
>
> # for(i=length($0);i>0;i--) x=x substr($0,i,1); A way to reverse a string, not needed
>
> BEGIN {
> FS = "."     # peel off extension, if present, using $0 processing
> oldok = 0
> oldroot = "" # not a problem with oldok false
> }
>
> { # check the end of $1 for digits
> # match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)
> # By using a field separator of ".", we must disqualify NF>2 cases like 000004.png.jpg
> # split() can be used instead of FS and $0 processing, for more general programming solutions
> # Since the field separator is used very little in this program, it can be "wasted" like this.
>
> ok = (NF<=2)   # boolean for string with compressible digits on end, initial determination
>
> match($1, /[[:digit:]]+/ ) # side effect... sets RSTART RLENGTH
> ok = (RSTART+RLENGTH-1 == length($1)) && ok       # ok true is equal to 1, false is 0
> root = substr($1,1,RSTART-1)                       # Empty string for filename "000001"
> # print ok " " $1 " \"" root "\""                  # the usual debug statement
>
> if ( root == oldroot && ok == 1 && oldok == 1) {
>      cntr++
> }
>
> if ( root != oldroot || ok == 0) { # new assignment
>      # Check processing of stuff in buffer
>      if (oldok == 1) {
>         if (cntr > 2) {
>            print "..."
>         }
>         if (cntr > 1) {
>            print oldstr
>         }
>         # cntr = 1 has already been printed
>      }
>      cntr = 1
>      print $0 # opening stanza of a potential compression
> }
>
> # bookkeeping
> oldroot = root
> oldok = ok
> oldstr = $0
>
> # When I make doodles like the following in the source, it means I'm
> # struggling with the if-then-else order and making the code
> # as succinct as possible. This table started me out on the
> # wrong leg, and it took a second try to make a better if-then-else
>
> # root ok   oldroot oldok cntr oldstr
> # xxx        yyy                          dump previous if oldok,cntr,oldstr, define cntr = 1, print opening line
> # xxx   1    xxx      1                   increment cntr
>
> }
> ********************************* end redund.awk ******************************

Thanks very much, you and William Unruh have been a great help, and your
example above was very nearly perfect. However, I realised that it
relied on the files not having a 'dot' earlier in the filenames, and
when I searched through all the files to check whether that was true, I
found some files created by others that did have more than one 'dot', so
I resolved to rewrite it, using the same general approach as yourself.

Below is what I came up with, and it is working pretty well. It's been
amended to deal with some situations I hadn't originally anticipated:

Notebooks with numbered pages sometimes had single pages in a run which
were blank, and so not worth scanning, and for these I created dummy
text files stating that the pages were blank, to make it clear that no
pages containing data were omitted from the scanning process. This
meant that other filename extensions had to be allowed to match.

Because some filenames included brackets, it was necessary to escape
these before creating the RE to do the matching.
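
A minimal sketch of why that escaping matters, kept separate from the script itself. It uses only POSIX awk features (a character loop rather than gensub()), and the names esc and check_escape are made up for this illustration. Each input line is turned into an anchored pattern twice, once raw and once with the brackets escaped, and then tested against itself:

```shell
# Hypothetical demo, not part of the index script.
check_escape() {
    awk '
    # Put a backslash in front of every bracket character in s
    function esc(s,    i, c, out) {
        out = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (index("[](){}", c) > 0)
                out = out "\\" c
            else
                out = out c
        }
        return out
    }
    {
        raw  = "^" $0 "$"          # brackets keep their regex meaning
        safe = "^" esc($0) "$"     # brackets are matched literally
        r = ($0 ~ raw)  ? "raw:match"     : "raw:miss"
        s = ($0 ~ safe) ? "escaped:match" : "escaped:miss"
        print r, s
    }'
}

printf '%s\n' "Sketch Of Ancient Building (partly completed).png" | check_escape
```

For the sample line this should report a miss for the raw pattern and a match for the escaped one, because unescaped parentheses form a group instead of matching themselves.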

Some documents with numbered pages included extra notes that needed
separate scans because they were on the back of the previous page,
resulting in this sort of thing ...
Blah-blah - 5.png
Blah-blah - 5 Note.png
... and I've adapted it to deal with that as well. This has had the
unfortunate side-effect of overcompressing some of the test data that I
gave before, but elsewhere saves so much work that I've decided to keep
it in.

Apart from that, it all looks good:

#!/bin/awk -f
#############
# An AWK program to make an index for our Family History
# (needs gawk, because gensub() is a gawk extension)
########################################################

BEGIN {
# Init variables
################
# Ensure whole line is one field
FS="\n";
# Current line pattern RE
cPattern="";
# Previous Line
pLine="";
# Count of similar lines
count=0;
}

{
# Test for being within an existing numbered run
if( (cPattern != "") && match($0, cPattern) )
# Yes, so increase counter and remember this line
# in case it turns out to be the last in this run
{
count++;
pLine = $0;
}
else
# Not part of existing numbered run
{
# So first exit what is now the previous numbered run
if( pLine != "" )
{
if( count > 2 )
print "...";
print pLine;
pLine = "";
}
# And we need to print this line anyway
print $0;
# Now test for a new numbered run. Some files created
# by others have more than one dot '.' in the filename,
# so we must be certain to match only to the last.
if ( match( $0, /^.*[ #px][[:digit:]]{1,3}\.[^.]+$/) )
# Potentially the start of a new numbered run
# so re-initialise for this new one
{
# Escape brackets for new RE
cPattern = gensub( /([][(){}])/, "\\\\\\1", "g", $0 );
# Build new RE for matching following input lines
cPattern = gensub( /(^.*[ #px])[[:digit:]]{1,3}(\.[^.]+)$/,
"\\1[0-9]{1,3}[^.]*\\.[^.]+", "g", cPattern );
count = 1;
}
else
{
# Not part of a numbered run, reinitialise ready for next
cPattern="";
count = 0;
}
}
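}

END {
# Flush a numbered run that is still pending at end of input
if( pLine != "" )
{
if( count > 2 )
print "...";
print pLine;
}
}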
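
The "match only to the last dot" point from the comments above can be tried on its own. This is a minimal sketch using POSIX awk only; the helper name split_ext and the sample filenames are made up for illustration and are not part of the script above:

```shell
# Hypothetical helper: print "root|extension" for each filename on stdin,
# splitting at the LAST dot only.
split_ext() {
    awk '{
        if (match($0, /\.[^.]*$/))      # final dot plus whatever follows it
            print substr($0, 1, RSTART - 1) "|" substr($0, RSTART + 1)
        else
            print $0 "|"                # no dot at all: empty extension
    }'
}

printf '%s\n' "Vol. 2 - 01.png" "000004.png.jpg" "README" | split_ext
```

Because [^.]* cannot cross another dot and the pattern is anchored with $, an earlier dot such as the one in "Vol." can never be taken for the start of the extension.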

Results:

Family History/Unknown/Unknown Person's Notebook:

Unknown Person's Notebook/Unknown Person's Notebook - 01.png
...
Unknown Person's Notebook/Unknown Person's Notebook - 33.png
Unknown Person's Notebook/Unknown Person's Notebook - End 0.png
...
Unknown Person's Notebook/Unknown Person's Notebook - End 5.png
Unknown Person's Notebook/Unknown Person's Notebook - Insert 00a.png
Unknown Person's Notebook/Unknown Person's Notebook - Insert 00b.png
Unknown Person's Notebook/Unknown Person's Notebook - Insert 01.png
...
Unknown Person's Notebook/Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
Unknown Person's Notebook/Unknown Person's Notebook.txt

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Re: Pipe cleanup of text - help needed

<sea4kq$i2m$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=307&group=uk.comp.os.linux#307

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mar...@mydomain.invalid (Martin Gregorie)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Tue, 3 Aug 2021 01:03:22 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <sea4kq$i2m$1@dont-email.me>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se76fm$1f7k$1@gioia.aioe.org>
<se8c8b$qqu$1@dont-email.me> <sea1op$2nl$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 3 Aug 2021 01:03:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="23078d2f15261314f37673fd21d6a365";
logging-data="18518"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+hhA4mDkV4MsaDNzSq9yUakZOWdFsu4d8="
User-Agent: Pan/0.146 (Hic habitat felicitas; 8107378
git@gitlab.gnome.org:GNOME/pan.git)
Cancel-Lock: sha1:Z6OWL+cQHqa92AyMWTpUk6Z+mIs=

by: Martin Gregorie - Tue, 3 Aug 2021 01:03 UTC

On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:

Years ago, when I was getting into awk, I found the O'Reilly book,
"sed & awk", subtitled "UNIX Power Tools" to be really helpful.

I think it explains how awk works, how to use it and shows what it can do
better than anything else I've found. It contains a lot of non-trivial
example code too.

That's not to knock the awk manpage, which is a good reference guide,
specially for the various built-in functions, just that I think the book
explains the way to structure awk scripts rather better than the manpage.

--
Martin | martin at
Gregorie | gregorie dot org

Re: Pipe cleanup of text - help needed

<seafo1$9me$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=308&group=uk.comp.os.linux#308

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Tue, 03 Aug 2021 00:12:48 -0400
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <seafo1$9me$1@dont-email.me>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se76fm$1f7k$1@gioia.aioe.org> <se8c8b$qqu$1@dont-email.me> <sea1op$2nl$1@gioia.aioe.org> <sea4kq$i2m$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 3 Aug 2021 04:12:49 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="07d0af3e3cb0c2856c71c657d7a1308f";
logging-data="9934"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18TW8qwkpUY+uKbXI03OZl2SPnUO8QCwrU="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:Jw8jRivB301LIBXZHpTcsNnXy2s=
In-Reply-To: <sea4kq$i2m$1@dont-email.me>

by: Paul - Tue, 3 Aug 2021 04:12 UTC

Martin Gregorie wrote:
> On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:
>
> Years ago, when I was getting into awk, I found the O'Reilly book,
> "sed & awk", subtitled "UNIX Power Tools" to be really helpful.
>
> I think it explains how awk works, how to use it and shows what it can do
> better than anything else I've found. It contains a lot of non-trivial
> example code too.
>
> That's not to knock the awk manpage, which is a good reference guide,
> specially for the various built-in functions, just that I think the book
> explains the way to structure awk scripts rather better than the manpage.
>
>
> --
> Martin | martin at
> Gregorie | gregorie dot org
>

For zero dollars, you can get Arnold Robbins' "Gawk.pdf",
which is all the instruction manual you need. Many a happy hour
spent flipping through that. There are at least three versions
of the manual, for Gawk3, Gawk4, and Gawk5 (just got a copy a
couple days ago, so don't know if Gawk5 is distributed as a package
yet).

I also have the gray book, which I bought in 1991. Written
by the three guys A, W, and K. ISBN 0-201-07981-X.

It's far from a perfect language. Sometimes it crushes the problem
you're working on. And other times, it's the problem (take sorting
as an example of migraine-induction).

Paul

Re: Pipe cleanup of text - help needed

<seb7ef$1usu$2@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=319&group=uk.comp.os.linux#319

Path: i2pn2.org!i2pn.org!aioe.org!cZqwowLvj+kmPGD0sEQEAQ.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Pipe cleanup of text - help needed
Date: Tue, 3 Aug 2021 11:57:18 +0100
Organization: Aioe.org NNTP Server
Message-ID: <seb7ef$1usu$2@gioia.aioe.org>
References: <se6k3q$1u0d$1@gioia.aioe.org> <se76fm$1f7k$1@gioia.aioe.org>
<se8c8b$qqu$1@dont-email.me> <sea1op$2nl$1@gioia.aioe.org>
<sea4kq$i2m$1@dont-email.me> <seafo1$9me$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="64414"; posting-host="cZqwowLvj+kmPGD0sEQEAQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2

by: Java Jive - Tue, 3 Aug 2021 10:57 UTC

On 03/08/2021 05:12, Paul wrote:
>
> Martin Gregorie wrote:
>>
>> On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:
>>
>> Years ago, when I was getting into awk, I found the O'Reilly book,
>> "sed & awk", subtitled "UNIX Power Tools" to be really helpful.
>> I think it explains how awk works, how to use it and shows what it can
>> do better than anything else I've found. It contains a lot of
>> non-trivial example code too.
>>
>> That's not to knock the awk manpage, which is a good reference guide,
>> specially for the various built-in functions, just that I think the
>> book explains the way to structure awk scripts rather better than the
>> manpage.
>
> For zero dollars, you can get Arnold Robbins "Gawk.pdf",
> which is all the instruction manual you need.

Yes, I found that online, and it's been most useful over the last couple
of days.

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Sorting in GAWK (Was: Pipe cleanup of text - help needed)

<sehj6d$epdc$1@news.xmission.com>

https://www.novabbs.com/aus+uk/article-flat.php?id=344&group=uk.comp.os.linux#344

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Sorting in GAWK (Was: Pipe cleanup of text - help needed)
Date: Thu, 5 Aug 2021 20:54:37 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <sehj6d$epdc$1@news.xmission.com>
References: <se6k3q$1u0d$1@gioia.aioe.org> <sea1op$2nl$1@gioia.aioe.org> <sea4kq$i2m$1@dont-email.me> <seafo1$9me$1@dont-email.me>
Injection-Date: Thu, 5 Aug 2021 20:54:37 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="484780"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)

by: Kenny McCormack - Thu, 5 Aug 2021 20:54 UTC

In article <seafo1$9me$1@dont-email.me>, Paul <nospam@needed.invalid> wrote:
...
>It's far from a perfect language.

Well, nothing is. But I have very (very!) rarely found any job in AWK's
general problem domain (i.e., I wouldn't use it to write an OS, for example)
that AWK wasn't the best tool for.

>Sometimes it crushed the problem you're working on. And other times, it's
>the problem (take sorting as an example of migraine-induction).

Surprised to hear you say this. GAWK has several built-in sorting
capabilities. Maybe a review of the fine manual is in order?
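
For anyone following along, a rough sketch of two of those capabilities: asort(), which sorts the values of an array, and PROCINFO["sorted_in"], which controls the order of a for-in loop. Both are gawk extensions (the second needs gawk 4.0 or later, as far as I know), so this will not run under mawk or other awks. The function names sort_lines and count_sorted and the sample words are made up for the example:

```shell
# Sort whole lines with asort(), which orders the values of an array
sort_lines() {
    gawk '
    { line[NR] = $0 }                   # collect every input line
    END {
        n = asort(line)                 # values sorted, reindexed 1..n
        for (i = 1; i <= n; i++)
            print line[i]
    }'
}

# Count first fields, then walk the array in ascending index order
count_sorted() {
    gawk '
    { seen[$1]++ }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (k in seen)
            print k, seen[k]
    }'
}

printf '%s\n' pear apple fig | sort_lines
printf '%s\n' pear apple pear fig | count_sorted
```

There is also asorti(), which sorts by index rather than by value, and piping to an external sort is always an option when the data does not need to come back into the script.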

BTW, once, long ago, before GAWK had these capabilities, I did code up a
qsort in AWK code. Worked pretty well, but I wouldn't recommend it to
anyone...

--
The difference between communism and capitalism?
In capitalism, man exploits man. In communism, it's the other way around.

- Daniel Bell, The End of Ideology (1960) -
