40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.

<1p6lxrrbb67u6$.dlg@sqwertz.com>

https://www.novabbs.com/computers/article-flat.php?id=645&group=news.software.readers#645

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
From: sqwert...@gmail.invalid (Sqwertz)
Subject: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.
Newsgroups: news.software.readers
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Organization: Me, Myself, and Inc.
Message-ID: <1p6lxrrbb67u6$.dlg@sqwertz.com>
Lines: 79
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Wed, 06 Oct 2021 04:45:18 UTC
Date: Tue, 5 Oct 2021 23:45:17 -0500
X-Received-Bytes: 3628
 by: Sqwertz - Wed, 6 Oct 2021 04:45 UTC

I'm trying to filter out extraneous headers from this text file
which I've exported using File->Save Selected Messages->MBOX|TXT.
There are a couple thousand of messages in here and I'm trying to
make it more legible without all the visual noise of the headers.

No other long headers seem to do the CRLF thing except for
X-Received: Is this my obnoxious newserver (highwinds) doing this
and Dialog doesn't care?

I've been working on this for days off and on. Can anybody help me
delete all headers except for:

Newsgroups:
Date:
From:
Subject:
Message-ID:
(in their natural order, not how I've listed)

From the text file at:

https://drive.google.com/file/d/1ElDcN7rUvmy7kn6f3Sn78jz6YXwz-WhJ/view?usp=sharing

It's for a very good cause (the Missouri Board of Nursing in regards
to a paedo pediatric HOME CARE nurse)

Using notepad++ I've gotten rid of everything EXCEPT for those nasty
X-Received: second lines and there's no pattern that won't remove
other context that I can figure - but my grepping and regex's are
really rusty.

Here's a sample of the text file to show my/our problem (more at the
link).

Thanks IA.

-sw

> From jwk6680@bjc.org Mon Oct 04 05:37:30 2021
> X-Folder: Kuthe
> X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;
>         Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
> X-Received: by 2002:a25:b84e:: with SMTP id b14mr15395553ybm.348.1633351051055;
>         Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
> Path: not-for-mail
> Newsgroups: rec.food.cooking
> Date: Mon, 4 Oct 2021 05:37:30 -0700 (PDT)
> Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
> NNTP-Posting-Host: 35.129.9.50
> User-Agent: G2/1.0
> MIME-Version: 1.0
> Message-ID: <4c1d3472-5f99-4672-be69-3020138c3fefn@googlegroups.com>
> Subject: And I had a GREAT ORGASM yesterday!
> From: John Kuthe <jwk6680@bjc.org>
> Injection-Date: Mon, 04 Oct 2021 12:37:31 +0000
> Content-Type: text/plain; charset="UTF-8"
> X-Received-Bytes: 1007
>
> On Sunday, my DAY OFF! :-) Complete with ejaculation! Wow!
>
> At 61 Years old! And it felt SO GOOD! :-)
>
>
> John Kuthe, RN, BSN...
>
> From jwk6680@bjc.org Sun Oct 03 18:34:10 2021
> X-Folder: Kuthe
> X-Received: by 2002:a0c:e381:: with SMTP id a1mr20159752qvl.42.1633311251669;
>         Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
> X-Received: by 2002:a25:3620:: with SMTP id d32mr12272072yba.46.1633311251515;
>         Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
> Path: not-for-mail
> Newsgroups: rec.food.cooking
> Date: Sun, 3 Oct 2021 18:34:11 -0700 (PDT)
> In-Reply-To: <sjdl8i$9tg$1@dont-email.me>
> Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
> NNTP-Posting-Host: 35.129.9.50
> References: <c5d1ff82-d941-44de-b125-8e22ce08f555n@googlegroups.com> <sjdl8i$9tg$1@dont-email.me>
> User-Agent: G2/1.0

Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.

<1f5xuqljsv11r$.dlg@b.rose.tmpbox.news.arcor.de>

https://www.novabbs.com/computers/article-flat.php?id=646&group=news.software.readers#646

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!reader5.news.weretis.net!news.solani.org!.POSTED!not-for-mail
From: b.rose.t...@arcor.de (Bernd Rose)
Newsgroups: news.software.readers
Subject: Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.
Date: Wed, 6 Oct 2021 07:50:47 +0200
Message-ID: <1f5xuqljsv11r$.dlg@b.rose.tmpbox.news.arcor.de>
References: <1p6lxrrbb67u6$.dlg@sqwertz.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Injection-Info: solani.org;
 logging-data="12514"; mail-complaints-to="abuse@news.solani.org"
User-Agent: 40tude_Dialog/2.0.15.41 (ccffc398.17.367)
X-User-ID: eJwFwQcBACAIBMBKzAfjKKN/BO9cwagwOMzXl0jcu1kv5wnih6nXErZXKhXFdzTiZIno6nayl+EMhN5oDlEX3bVwbcz5tiAZjA==
Cancel-Lock: sha1:L1UfreV7pQJogqde3Vo+Y5zWXbU=
 by: Bernd Rose - Wed, 6 Oct 2021 05:50 UTC

On Tue, 5th Oct 2021 23:45:17 -0500, Sqwertz wrote:

> Using notepad++ I've gotten rid of everything EXCEPT for those nasty
> X-Received: second lines and there's no pattern that won't remove
> other context that I can figure - but my grepping and regex's are
> really rusty.

In Notepad++ replace (RegEx) the following with an empty string:
^X-Received: [^\r\n]+\r\n(\h[^\r\n]+\r\n)*

HTH.
Bernd
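
For anyone who would rather script this than use Notepad++, here is a rough Python version of the pattern above. It is a sketch, not an exact copy: Python's `re` module has no `\h`, so `[ \t]` stands in for it, and `\r?\n` is an added assumption so that LF-only files also match. The function name `strip_x_received` is made up for illustration.

```python
import re

# Assumed Python counterpart of the Notepad++ pattern
#   ^X-Received: [^\r\n]+\r\n(\h[^\r\n]+\r\n)*
# [ \t] replaces \h (not available in Python's re), and \r?\n accepts
# both CRLF and LF line endings.
X_RECEIVED = re.compile(
    r"^X-Received: [^\r\n]+\r?\n(?:[ \t][^\r\n]+\r?\n)*",
    re.MULTILINE,
)


def strip_x_received(text: str) -> str:
    """Remove each X-Received: header together with its continuation lines."""
    return X_RECEIVED.sub("", text)


sample = (
    "X-Folder: Kuthe\r\n"
    "X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;\r\n"
    "        Mon, 04 Oct 2021 05:37:31 -0700 (PDT)\r\n"
    "Path: not-for-mail\r\n"
    "X-Received-Bytes: 1007\r\n"
)
print(strip_x_received(sample))
```

Because the pattern requires a colon and a space right after `X-Received`, the unrelated `X-Received-Bytes:` header is left alone. When reading the export from disk, opening it with `open(path, newline="")` keeps the CRLF line endings intact.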

Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.

<8c5j6n5rgzm6.dlg@v.nguard.lh>

https://www.novabbs.com/computers/article-flat.php?id=647&group=news.software.readers#647

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.fcku.it!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: V...@nguard.LH (VanguardLH)
Newsgroups: news.software.readers
Subject: Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.
Date: Wed, 6 Oct 2021 05:42:05 -0500
Organization: Usenet Elder
Lines: 122
Message-ID: <8c5j6n5rgzm6.dlg@v.nguard.lh>
References: <1p6lxrrbb67u6$.dlg@sqwertz.com>
Reply-To: invalid@invalid.invalid
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Trace: individual.net urzC8BG5su2kwXXWs/kd3w9w25fmf+BKJzxtB0Qj3ataxHrKgl
Keywords: VanguardLH VLH811
Cancel-Lock: sha1:p47u0KcVX0pvMqVKEZyIchf8Naw=
User-Agent: 40tude_Dialog/2.0.15.41
 by: VanguardLH - Wed, 6 Oct 2021 10:42 UTC

Sqwertz <sqwertzme@gmail.invalid> wrote:

> I'm trying to filter out extraneous headers from this text file
> which I've exported using File->Save Selected Messages->MBOX|TXT.
> There are a couple thousand of messages in here and I'm trying to
> make it more legible without all the visual noise of the headers.
>
> No other long headers seem to do the CRLF thing except for
> X-Received: Is this my obnoxious newserver (highwinds) doing this
> and Dialog doesn't care?
>
> I've been working on this for days off and on. Can anybody help me
> delete all headers except for:
>
> Newsgroups:
> Date:
> From:
> Subject:
> Message-ID:
> (in their natural order, not how I've listed)
>
> From the text file at:
>
> https://drive.google.com/file/d/1ElDcN7rUvmy7kn6f3Sn78jz6YXwz-WhJ/view?usp=sharing
>
> It's for a very good cause (the Missouri Board of Nursing in regards
> to a paedo pediatric HOME CARE nurse)
>
> Using notepad++ I've gotten rid of everything EXCEPT for those nasty
> X-Received: second lines and there's no pattern that won't remove
> other context that I can figure - but my grepping and regex's are
> really rusty.
>
> Here's a sample of the text file to show my/our problem (more at the
> link).
>
> Thanks IA.
>
> -sw
>
>> From jwk6680@bjc.org Mon Oct 04 05:37:30 2021
>> X-Folder: Kuthe
>> X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;
>>         Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
>> X-Received: by 2002:a25:b84e:: with SMTP id b14mr15395553ybm.348.1633351051055;
>>         Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
>> Path: not-for-mail
>> Newsgroups: rec.food.cooking
>> Date: Mon, 4 Oct 2021 05:37:30 -0700 (PDT)
>> Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
>> NNTP-Posting-Host: 35.129.9.50
>> User-Agent: G2/1.0
>> MIME-Version: 1.0
>> Message-ID: <4c1d3472-5f99-4672-be69-3020138c3fefn@googlegroups.com>
>> Subject: And I had a GREAT ORGASM yesterday!
>> From: John Kuthe <jwk6680@bjc.org>
>> Injection-Date: Mon, 04 Oct 2021 12:37:31 +0000
>> Content-Type: text/plain; charset="UTF-8"
>> X-Received-Bytes: 1007
>>
>> On Sunday, my DAY OFF! :-) Complete with ejaculation! Wow!
>>
>> At 61 Years old! And it felt SO GOOD! :-)
>>
>> John Kuthe, RN, BSN...
>>
>> From jwk6680@bjc.org Sun Oct 03 18:34:10 2021
>> X-Folder: Kuthe
>> X-Received: by 2002:a0c:e381:: with SMTP id a1mr20159752qvl.42.1633311251669;
>>         Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
>> X-Received: by 2002:a25:3620:: with SMTP id d32mr12272072yba.46.1633311251515;
>>         Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
>> Path: not-for-mail
>> Newsgroups: rec.food.cooking
>> Date: Sun, 3 Oct 2021 18:34:11 -0700 (PDT)
>> In-Reply-To: <sjdl8i$9tg$1@dont-email.me>
>> Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
>> NNTP-Posting-Host: 35.129.9.50
>> References: <c5d1ff82-d941-44de-b125-8e22ce08f555n@googlegroups.com> <sjdl8i$9tg$1@dont-email.me>
>> User-Agent: G2/1.0

Continuation lines are allowed for headers to accommodate those that are
long and would otherwise exceed the 998-character maximum per physical
line.

headerName: string1
 string2
 string3

string2 and string3 are continuation lines. A continuation line is
denoted by leading whitespace: at a minimum, there must be a space or
tab character in column 1 of a header line for it to be a continuation
line.
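
The folding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `unfold_headers` is hypothetical, and collapsing the leading whitespace of each continuation line to a single space is a simplification chosen here for readability.

```python
def unfold_headers(lines):
    """Join continuation lines (leading space or tab) onto the previous line."""
    unfolded = []
    for line in lines:
        if line[:1] in (" ", "\t") and unfolded:
            # Continuation line: append it to the header line before it,
            # reducing its leading whitespace to one space (a simplification).
            unfolded[-1] += " " + line.strip()
        else:
            # Ordinary header line (or blank line): keep it, minus the line ending.
            unfolded.append(line.rstrip("\r\n"))
    return unfolded


folded = [
    "X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;\r\n",
    "        Mon, 04 Oct 2021 05:37:31 -0700 (PDT)\r\n",
    "Path: not-for-mail\r\n",
]
for header in unfold_headers(folded):
    print(header)
```

After unfolding, each header is one logical line, so a simple test on the header name is enough to decide whether to keep or drop it.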

Nothing wrong with the Received header. It obeys the RFC standard for
Internet messages. The header section ends with the first blank line;
i.e., a line break (\n) in column 1. To treat a folded header as a single
line, your script would need to append every continuation line to the
preceding line to compose 1 long header line as 1 physical line. But if
you're throwing away the headers, why keep anything before the blank
line delimiting the header section? Scan (parse through) the message,
ignore everything up to and including the first blank line your parser
encounters, and keep what comes after it.

If you want to keep some headers, you'll have to test each line on a
read to see if the header's name matches one of those you want to keep.
If so, you have to keep that line, and every continuation line
thereafter (every following line with a space in column 1), until the
next line in the format:

headerName: string
^          ^
|          |__ one whitespace minimum for parsing name from value
|__ must be in column 1 of a physical line

You'll need to write a parser script checking if each line is a header
line (headername:<space>), if that's one you want to keep, and if
following lines are continuation lines, or another header line, and
terminating the parsing upon reaching the first blank line.
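
A minimal sketch of such a parser in Python follows. It rests on several assumptions that are not confirmed in this thread: that each message in the export begins with an mbox-style separator line starting with `From ` (as in the quoted sample), that body lines never start with `From `, and that the separator line itself should be kept. The names `KEEP` and `filter_headers` are hypothetical, and the body line in the demo is a placeholder.

```python
KEEP = {"newsgroups", "date", "from", "subject", "message-id"}


def filter_headers(lines, keep=KEEP):
    """Yield lines with only the wanted headers (and all body text) retained."""
    in_headers = False   # True while inside a message's header section
    keeping = False      # True if the current header is one we keep
    for line in lines:
        if line.startswith("From "):
            # Assumed mbox-style separator: a new message starts here.
            in_headers = True
            keeping = False
            yield line
        elif not in_headers:
            yield line                      # body text passes through unchanged
        elif line.strip() == "":
            in_headers = False              # first blank line ends the headers
            yield line
        elif line[0] in " \t":
            if keeping:                     # continuation of a kept header
                yield line
        else:
            name = line.split(":", 1)[0]    # header name before the first colon
            keeping = name.lower() in keep
            if keeping:
                yield line


message = [
    "From jwk6680@bjc.org Mon Oct 04 05:37:30 2021\n",
    "X-Folder: Kuthe\n",
    "X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;\n",
    "        Mon, 04 Oct 2021 05:37:31 -0700 (PDT)\n",
    "Newsgroups: rec.food.cooking\n",
    "Date: Mon, 4 Oct 2021 05:37:30 -0700 (PDT)\n",
    "User-Agent: G2/1.0\n",
    "\n",
    "Placeholder body line.\n",
]
print("".join(filter_headers(message)), end="")
```

Since the function is a generator over lines, it can be fed an open file object directly and its output written to a new file, so a couple thousand messages do not need to be held in memory at once.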

Regex is handy, but I don't think you can get it to handle continuation
lines as part of the preceding header line.

Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.

<1g7yitvrjyjgt.dlg@sqwertz.com>

https://www.novabbs.com/computers/article-flat.php?id=650&group=news.software.readers#650

Path: i2pn2.org!rocksolid2!news.neodome.net!weretis.net!feeder8.news.weretis.net!npeer.as286.net!npeer-ng0.as286.net!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx45.iad.POSTED!not-for-mail
From: sqwert...@gmail.invalid (Sqwertz)
Subject: Re: 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRLF in X-Received: Lines.
Newsgroups: news.software.readers
User-Agent: ForteAgent/7.10.32.1212
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Organization: Me, Myself, and Inc.
References: <1p6lxrrbb67u6$.dlg@sqwertz.com> <1f5xuqljsv11r$.dlg@b.rose.tmpbox.news.arcor.de>
Message-ID: <1g7yitvrjyjgt.dlg@sqwertz.com>
Lines: 25
X-Complaints-To: abuse@blocknews.net
NNTP-Posting-Date: Thu, 07 Oct 2021 04:30:38 UTC
Date: Wed, 6 Oct 2021 23:30:36 -0500
X-Received-Bytes: 1702
 by: Sqwertz - Thu, 7 Oct 2021 04:30 UTC

On Wed, 6 Oct 2021 07:50:47 +0200, Bernd Rose wrote:

> On Tue, 5th Oct 2021 23:45:17 -0500, Sqwertz wrote:
>
>> Using notepad++ I've gotten rid of everything EXCEPT for those nasty
>> X-Received: second lines and there's no pattern that won't remove
>> other context that I can figure - but my grepping and regex's are
>> really rusty.
>
> In Notepad++ replace (RegEx) the following with an empty string:
> ^X-Received: [^\r\n]+\r\n(\h[^\r\n]+\r\n)*
>
> HTH.
> Bernd

Thanks Bernd! That worked perfectly. I also used it on the
multiple Received: lines, as they got pretty extensive, too.

Thank you too, VanguardLH. I had definitely seen the multi-line
headers in SMTP email (especially google and MS), but I guess I
never really noticed them in NNTP. I swear I never saw a line break
in References: and Path: especially (and now we have no Path: header
with Highwinds <grrrr>).

-sw
