
Subject                         Author
* New Usenet Archive            Jason Evans
+* Re: New Usenet Archive       Adam H. Kerman
|+* Re: New Usenet Archive      Thomas Hochstein
||+* Re: New Usenet Archive     Jason Evans
|||`- Re: New Usenet Archive    Adam H. Kerman
||`- Re: New Usenet Archive     Adam H. Kerman
|`- Re: New Usenet Archive      Jason Evans
`* Re: New Usenet Archive       Julien_ÉLIE
 `* Re: New Usenet Archive      Jason Evans
  `- Re: New Usenet Archive     Julien_ÉLIE

Subject: New Usenet Archive
From: Jason Evans
Newsgroups: news.admin.hierarchies
Organization: theuse.net
Date: Mon, 7 Feb 2022 14:05 UTC
Path: i2pn2.org!i2pn.org!aioe.org!news.theuse.net!.POSTED.ip-86-49-255-200.net.upcbroadband.cz!not-for-mail
From: jsev...@mailfence.com (Jason Evans)
Newsgroups: news.admin.hierarchies
Subject: New Usenet Archive
Date: Mon, 07 Feb 2022 14:05:36 +0000
Organization: theuse.net
Lines: 60
Message-ID: <str5f1$bt8$1@theuse.news.theuse.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
Injection-Date: Mon, 7 Feb 2022 13:05:37 -0000 (UTC)
Injection-Info: theuse.news.theuse.net; posting-host="ip-86-49-255-200.net.upcbroadband.cz:86.49.255.200";
logging-data="12200"; mail-complaints-to="news@theuse.news.theuse.net"
User-Agent: KNode/4.14.1
Hi all,

For the past month, I have been downloading and sorting Usenet archives from
a news server (with their permission) of everything from 2003 until today.
My next step is to decide how to upload them to archive.org.

Here is the current archive that runs from the 80's and 90's until around
2003: https://archive.org/details/usenethistorical

Each newsgroup hierarchy has its own entry. I'm thinking about something
different, and I want your input on how to do that.

Here is my plan. The following newsgroup hierarchies will have their own
entries:

Big-8:
comp
sci
news
misc
talk
humanities
soc

uk

de

alt will be broken down into subgroups because it's so huge.

alt-a-e
alt-f-j
alt-k-o
alt-p-t
alt-u-z

For example, alt.folklore.computers would be found in alt-f-j.

The rest of the hierarchies will be grouped together since they are
generally smaller and more likely to be nothing but spam.

Misc Newsgroup hierarchies-a-e
Misc Newsgroup hierarchies-f-j
Misc Newsgroup hierarchies-k-o
Misc Newsgroup hierarchies-p-t
Misc Newsgroup hierarchies-u-z
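
The grouping above can be sketched as a small lookup. This is only an illustration of the plan as described, not the script actually used: the function names and the catch-all "other" bucket for names that do not start with a letter are assumptions, and the stand-alone list is copied as given in the post.

```python
# Hierarchies that get their own entry, exactly as listed in the post.
STANDALONE = {"comp", "sci", "news", "misc", "talk", "humanities", "soc", "uk", "de"}

# Letter ranges used for both the alt split and the misc buckets.
BUCKETS = ["a-e", "f-j", "k-o", "p-t", "u-z"]


def letter_bucket(name: str) -> str:
    """Return the bucket label (e.g. 'f-j') for the first character of name."""
    first = name[:1].lower()
    for label in BUCKETS:
        lo, hi = label.split("-")
        if lo <= first <= hi:
            return label
    # Assumption: names starting with a digit or symbol go to a catch-all bucket.
    return "other"


def archive_entry(newsgroup: str) -> str:
    """Map a newsgroup name to the archive.org entry that would hold it."""
    parts = newsgroup.lower().split(".")
    top = parts[0]
    if top in STANDALONE:
        return top
    if top == "alt":
        # alt is split on the component that follows "alt.".
        second = parts[1] if len(parts) > 1 else ""
        return "alt-" + letter_bucket(second)
    return "Misc Newsgroup hierarchies-" + letter_bucket(top)


print(archive_entry("alt.folklore.computers"))  # alt-f-j
```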

These are my questions for you folks:

1. Does this make sense, or would breaking everything down by individual
hierarchy be better?

2. If I do it this way, are there any other hierarchies that should stand
alone rather than be grouped with the misc. groups?

One final note. In case you're wondering, I am not archiving any binary
groups or any group that I think could get deleted because of the extremely
distasteful subject matter. I think you can get my gist about what I mean.
Everything else is here. Even the stupid spammy revenge froups.

Jason


Subject: Re: New Usenet Archive
From: Adam H. Kerman
Newsgroups: news.admin.hierarchies
Organization: A noiseless patient Spider
Date: Mon, 7 Feb 2022 16:03 UTC
References: 1
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ahk...@chinet.com (Adam H. Kerman)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 7 Feb 2022 16:03:05 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <strfrp$2at$2@dont-email.me>
References: <str5f1$bt8$1@theuse.news.theuse.net>
Injection-Date: Mon, 7 Feb 2022 16:03:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fdaa6cfbd6ac0da18f5cc64204e5eeb7";
logging-data="2397"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18q/lGIPRYYWkLSqz4ZX6tnxQyN+Xxv32s="
Cancel-Lock: sha1:ffu5A4j0VqTlF7xQ4zmMnIcmNmk=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Jason Evans <jsevans@mailfence.com> wrote:

For the past month, I have been downloading and sorting Usenet archives from
a news server (with their permission) of everything from 2003 until today.
My next step is to decide how to upload them to archive.org.

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

We've already got that. Google indexed Usenet articles as if they were
posted on the Web in the first place as the lousy Google Groups Web
interface was treated like a real Web page. Within Google Groups itself,
searching became seriously hideous because Google stopped devoting staff
resources to making sure the indexes were being maintained. The indexing
services weren't great but they were better than what they became.

An extremely serious problem with Google Groups indexing of the article
body, when it was working, was it didn't do a great job distinguishing
between the author's own text and the quoted text if it was a followup.
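
To make the distinction concrete, here is a minimal sketch of separating a poster's own text from quoted text and the signature. It assumes the common conventions that quoted lines start with ">" and that a line consisting of "-- " (dash, dash, space) starts the signature; real articles deviate from both, so treat this as an illustration rather than a robust parser. The function name is hypothetical.

```python
def split_body(body: str) -> dict:
    """Separate an article body into own text, quoted text, and signature."""
    lines = body.splitlines()
    sig = []
    # The last "-- " line conventionally marks the start of the signature.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] == "-- ":
            sig = lines[i + 1:]
            lines = lines[:i]
            break
    # Assumption: any line beginning with ">" is quoted from an earlier article.
    quoted = [ln for ln in lines if ln.startswith(">")]
    own = [ln for ln in lines if not ln.startswith(">")]
    return {"own": own, "quoted": quoted, "sig": sig}
```

An indexer built on something like this could index only the "own" lines of each followup, so a search hit points at the article where the text was first written.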

Usenet archives lack decent indexes. Is there a way for you to upload a
very small archive, then work on the indexing and presentation of the
articles so it in some way resembles walking the thread tree? Can the
index be developed along with the archive, and then tested tested tested
to avoid another Google Groups?
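
As a rough sketch of what walking the thread tree could mean in practice: each article's References header lists its ancestors, with the direct parent last, so a parent-to-children map is enough to reproduce the thread order. The function names and the input shape (pairs of Message-ID and raw References value) are assumptions made for this example.

```python
from collections import defaultdict


def build_thread_tree(articles):
    """Map each parent Message-ID to the Message-IDs that reply to it.

    `articles` is an iterable of (message_id, references) pairs, where
    `references` is the raw References header value ("" for a thread root).
    """
    children = defaultdict(list)
    for message_id, references in articles:
        refs = references.split()
        # The last entry in References is the direct parent; None marks a root.
        parent = refs[-1] if refs else None
        children[parent].append(message_id)
    return children


def walk(children, parent=None, depth=0):
    """Yield (depth, message_id) pairs in depth-first thread order."""
    for message_id in children.get(parent, []):
        yield depth, message_id
        yield from walk(children, message_id, depth + 1)
```

One limitation of this sketch: a followup whose parent is missing from the archive is never reached from the root, so a real index would need to handle such orphans separately.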

. . .

One final note. In case you're wondering, I am not archiving any binary
groups or any group that I think could get deleted because of the extremely
distasteful subject matter. I think you can get my gist about what I mean.
Everything else is here. Even the stupid spammy revenge froups.

Are you literally saying that you're archiving cancellable spam and
those various smaller-scale attacks on Usenet with articles uploaded by
the thousands from anonymizing servers that aren't preventing abuse?

Revenge froups weren't any more spammy than any other part of Usenet.
Spam is spam regardless of the newsgroup.


Subject: Re: New Usenet Archive
From: Thomas Hochstein
Newsgroups: news.admin.hierarchies
Date: Mon, 7 Feb 2022 17:28 UTC
References: 1 2
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!2.eu.feeder.erje.net!feeder.erje.net!news.szaf.org!thangorodrim.ancalagon.de!.POSTED.scatha.ancalagon.de!not-for-mail
From: thh...@thh.name (Thomas Hochstein)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 07 Feb 2022 18:28:54 +0100
Message-ID: <nah.20220207182852.361@scatha.ancalagon.de>
References: <str5f1$bt8$1@theuse.news.theuse.net> <strfrp$2at$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: thangorodrim.ancalagon.de; posting-host="scatha.ancalagon.de:10.0.1.1";
logging-data="28281"; mail-complaints-to="abuse@th-h.de"
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:KYR5fd0wPVol98+HahD9R1S8pcU=
X-Face: *OX>R5kq$7DjZ`^-[<HL?'n9%\ZDfCz/_FfV0_tpx7w{Vv1*byr`TC\[hV:!SJosK'1gA>1t8&@'PZ-tSFT*=<}JJ0nXs{WP<@(=U!'bOMMOH&Q0}/(W_d(FTA62<r"l)J\)9ERQ9?6|_7T~ZV2Op*UH"2+1f9[va
X-Clacks-Overhead: GNU Terry Pratchett
X-NNTP-Posting-Date: Mon, 07 Feb 2022 18:28:52 +0100
Adam H. Kerman wrote:

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

AFAIS, https://archive.org/details/usenethistorical has just zip'ed mbox
archives, one per group, with no way to browse, search or index anything.


Subject: Re: New Usenet Archive
From: Jason Evans
Newsgroups: news.admin.hierarchies
Organization: theuse.net
Date: Mon, 7 Feb 2022 19:14 UTC
References: 1 2
Path: i2pn2.org!i2pn.org!aioe.org!news.theuse.net!.POSTED.ip-86-49-255-200.net.upcbroadband.cz!not-for-mail
From: jsev...@mailfence.com (Jason Evans)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 07 Feb 2022 19:14:27 +0000
Organization: theuse.net
Lines: 37
Message-ID: <strni4$43n$1@theuse.news.theuse.net>
References: <str5f1$bt8$1@theuse.news.theuse.net> <strfrp$2at$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
Injection-Date: Mon, 7 Feb 2022 18:14:28 -0000 (UTC)
Injection-Info: theuse.news.theuse.net; posting-host="ip-86-49-255-200.net.upcbroadband.cz:86.49.255.200";
logging-data="4215"; mail-complaints-to="news@theuse.news.theuse.net"
User-Agent: KNode/4.14.1
Adam H. Kerman wrote:

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

We've already got that. Google indexed Usenet articles as if they were
posted on the Web in the first place as the lousy Google Groups Web
interface was treated like a real Web page. Within Google Groups itself,
searching became seriously hideous because Google stopped devoting staff
resources to making sure the indexes were being maintained. The indexing
services weren't great but they were better than what they became.


There are two differences between what I'm doing and what Google is doing.

First, I am archiving the raw source articles in the same format as those
already on archive.org: plain-text MBOX files. If you're doing
research, download the newsgroup that you want and let your mail client or
whatever you use for MBOX files do the heavy lifting for you when it
comes to sorting and searching.

Second, Google no longer provides headers, which are important for research.
I am providing everything.
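
For anyone who would rather script a search than use a mail client, here is a minimal sketch using Python's standard `mailbox` module. It assumes the downloaded archive has already been unzipped to a plain MBOX file; the function name and the file name in the usage note are hypothetical.

```python
import mailbox


def find_by_header(mbox_path, header, needle):
    """Return Message-IDs of articles whose `header` contains `needle`."""
    hits = []
    # create=False avoids silently creating an empty file on a wrong path.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            value = msg.get(header, "")
            # Case-insensitive substring match on the header value.
            if needle.lower() in str(value).lower():
                hits.append(msg.get("Message-ID"))
    finally:
        box.close()
    return hits
```

For example, `find_by_header("news.admin.hierarchies.mbox", "Subject", "usenet archive")` would list the Message-IDs of matching articles. Because the full headers are preserved, the same call works for Path, References, or any other header.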

An extremely serious problem with Google Groups indexing of the article
body, when it was working, was it didn't do a great job distinguishing
between the author's own text and the quoted text if it was a followup.

Usenet archives lack decent indexes. Is there a way for you to upload a
very small archive, then work on the indexing and presentation of the
articles so it in some way resembles walking the thread tree? Can the
index be developed along with the archive, and then tested tested tested
to avoid another Google Groups?

I don't have the time or energy to create a website to host this stuff that
would also do a good job of indexing everything. What I'm doing is providing
the files free of charge to archive.org so if someone else wants to do that,
they can.


Subject: Re: New Usenet Archive
From: Jason Evans
Newsgroups: news.admin.hierarchies
Organization: theuse.net
Date: Mon, 7 Feb 2022 19:16 UTC
References: 1 2 3
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.theuse.net!.POSTED.ip-86-49-255-200.net.upcbroadband.cz!not-for-mail
From: jsev...@mailfence.com (Jason Evans)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 07 Feb 2022 19:16:21 +0000
Organization: theuse.net
Lines: 13
Message-ID: <strnlm$43n$2@theuse.news.theuse.net>
References: <str5f1$bt8$1@theuse.news.theuse.net> <strfrp$2at$2@dont-email.me> <nah.20220207182852.361@scatha.ancalagon.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
Injection-Date: Mon, 7 Feb 2022 18:16:22 -0000 (UTC)
Injection-Info: theuse.news.theuse.net; posting-host="ip-86-49-255-200.net.upcbroadband.cz:86.49.255.200";
logging-data="4215"; mail-complaints-to="news@theuse.news.theuse.net"
User-Agent: KNode/4.14.1
Thomas Hochstein wrote:

Adam H. Kerman wrote:

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

AFAIS, https://archive.org/details/usenethistorical has just zip'ed mbox
archives, one per group, with no way to browse, search or index anything.

That is exactly what I have. My question is, is it better to have them on
archive.org with one entry per hierarchy or to group them like I suggested?




Subject: Re: New Usenet Archive
From: Adam H. Kerman
Newsgroups: news.admin.hierarchies
Organization: A noiseless patient Spider
Date: Mon, 7 Feb 2022 18:43 UTC
References: 1 2 3
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ahk...@chinet.com (Adam H. Kerman)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 7 Feb 2022 18:43:45 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 14
Message-ID: <strp90$t16$2@dont-email.me>
References: <str5f1$bt8$1@theuse.news.theuse.net> <strfrp$2at$2@dont-email.me> <nah.20220207182852.361@scatha.ancalagon.de>
Injection-Date: Mon, 7 Feb 2022 18:43:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fdaa6cfbd6ac0da18f5cc64204e5eeb7";
logging-data="29734"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX196roo5YrBiQ7g/R9+g9b/3DdKHwxL45+A="
Cancel-Lock: sha1:dmZ20LmxA3pybfHdxIIyFthc6gs=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Thomas Hochstein <thh@thh.name> wrote:
Adam H. Kerman wrote:

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

AFAIS, https://archive.org/details/usenethistorical has just zip'ed mbox
archives, one per group, with no way to browse, search or index anything.

I saw that they were zipped. Jason stated he's doing something different.

So if he's merely presenting Usenet articles as text files, or
digestified somehow but still as text files, I was questioning how he was
going to rely upon archive.org's own indexing processes.


Subject: Re: New Usenet Archive
From: Adam H. Kerman
Newsgroups: news.admin.hierarchies
Organization: A noiseless patient Spider
Date: Mon, 7 Feb 2022 18:49 UTC
References: 1 2 3 4
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ahk...@chinet.com (Adam H. Kerman)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Mon, 7 Feb 2022 18:49:29 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <strpjp$t16$3@dont-email.me>
References: <str5f1$bt8$1@theuse.news.theuse.net> <strfrp$2at$2@dont-email.me> <nah.20220207182852.361@scatha.ancalagon.de> <strnlm$43n$2@theuse.news.theuse.net>
Injection-Date: Mon, 7 Feb 2022 18:49:29 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fdaa6cfbd6ac0da18f5cc64204e5eeb7";
logging-data="29734"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19x2zNhXvHIDOcHxKx18rnTrRJNrWBTdvg="
Cancel-Lock: sha1:2AVr4M73FnjdB3UXfFg6hF0ybOo=
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Jason Evans <jsevans@mailfence.com> wrote:
Thomas Hochstein wrote:
Adam H. Kerman wrote:

So you'd be relying upon their indexing and its likely inability to tell
the difference between the article body, the .sig, and headers?

AFAIS, https://archive.org/details/usenethistorical has just zip'ed mbox
archives, one per group, with no way to browse, search or index anything.

That is exactly what I have. My question is, is it better to have them on
archive.org with one entry per hierarchy or to group them like I suggested?

I didn't mean to volunteer you to perform work you weren't willing to
do. I apologize for that. My comment, stating the obvious, was pointing
out what we don't have.

I don't have an opinion on whether your proposed grouping is better or
worse.


Subject: Re: New Usenet Archive
From: Julien_ÉLIE
Newsgroups: news.admin.hierarchies
Organization: Groupes francophones par TrigoFACILE
Date: Tue, 8 Feb 2022 19:46 UTC
References: 1
Path: i2pn2.org!i2pn.org!news.nntp4.net!news.gegeweb.eu!gegeweb.org!news.trigofacile.com!.POSTED.176.143-2-105.abo.bbox.fr!not-for-mail
From: iul...@nom-de-mon-site.com.invalid (Julien_ÉLIE)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Tue, 8 Feb 2022 20:46:39 +0100
Organization: Groupes francophones par TrigoFACILE
Message-ID: <stuhb0$1ann3$1@news.trigofacile.com>
References: <str5f1$bt8$1@theuse.news.theuse.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 8 Feb 2022 19:46:40 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176.143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1400547"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.5.1
Cancel-Lock: sha1:11NY33dt29HEHDAczTnKOFYZh8c= sha256:O1y61q0nwmu7En6PcVJDQNfmgu7H/cg8MdcEmMXPbIA=
sha1:beYVHzHVFJiThGoVQQV/z5Vxh+8= sha256:9wXEZZ/ikRmGN/OuI4nnqkCCmY1reDpOaiZnLCkVUO8=
Content-Language: fr
In-Reply-To: <str5f1$bt8$1@theuse.news.theuse.net>

Hi Jason,

Here is the current archive that runs from the 80's and 90's until around
2003: https://archive.org/details/usenethistorical

As noted by another person (who spoke about that archive in a French newsgroup), the encoding of bodies is wrong.  All non-ASCII characters are mangled :-/
Seen in fr.* and de.*, and I bet it is the same for all hierarchies.

--
Julien ÉLIE

« J'oubliais qu'Assurancetourix a une nouvelle corde à sa harpe ! »
   (Astérix)


Subject: Re: New Usenet Archive
From: Jason Evans
Newsgroups: news.admin.hierarchies
Organization: theuse.net
Date: Wed, 9 Feb 2022 08:07 UTC
References: 1 2
Path: i2pn2.org!i2pn.org!aioe.org!news.theuse.net!.POSTED.ip-86-49-255-200.net.upcbroadband.cz!not-for-mail
From: jsev...@mailfence.com (Jason Evans)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Wed, 09 Feb 2022 08:07:19 +0000
Organization: theuse.net
Lines: 32
Message-ID: <stvp77$m7$1@theuse.news.theuse.net>
References: <str5f1$bt8$1@theuse.news.theuse.net> <stuhb0$1ann3$1@news.trigofacile.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8Bit
Injection-Date: Wed, 9 Feb 2022 07:07:20 -0000 (UTC)
Injection-Info: theuse.news.theuse.net; posting-host="ip-86-49-255-200.net.upcbroadband.cz:86.49.255.200";
logging-data="711"; mail-complaints-to="news@theuse.news.theuse.net"
User-Agent: KNode/4.14.1
Julien ÉLIE wrote:


Hi Jason,

Here is the current archive that runs from the 80's and 90's until around
2003: https://archive.org/details/usenethistorical

As noted by another person (who spoke about that archive in a French
newsgroup), the encoding of bodies is wrong.  All non-ASCII characters
are mangled :-/
Seen in fr.* and de.*, and I bet it is the same for all hierarchies.


Hi Julien,

This doesn't really answer the question that I asked in my original article
about organizing Usenet hierarchies for archive.org.

However, to respond to your comment, I picked this article at random from
fr.usenet.distribution. This is a screenshot
(https://pasteboard.co/YA9d6r01LUnP.png) using Thunderbird from one of the
archives that I created. You can see that the French letters can be read
correctly because this article is from last year and encoded in UTF-8. Even
some of the old articles in this particular archive that are encoded in
iso-8859-15 appear correctly.

The problem is that when you go back far enough, either plain ASCII or
some non-standard encoding is used, and then the non-English characters are
munged. My colleague, Tristan, has been doing some work on this issue as it
affects Esperanto on the early Usenet.

Jason


Subject: Re: New Usenet Archive
From: Julien_ÉLIE
Newsgroups: news.admin.hierarchies
Organization: Groupes francophones par TrigoFACILE
Date: Wed, 9 Feb 2022 17:34 UTC
References: 1 2 3
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.trigofacile.com!.POSTED.176-143-2-105.abo.bbox.fr!not-for-mail
From: iul...@nom-de-mon-site.com.invalid (Julien_ÉLIE)
Newsgroups: news.admin.hierarchies
Subject: Re: New Usenet Archive
Date: Wed, 9 Feb 2022 18:34:46 +0100
Organization: Groupes francophones par TrigoFACILE
Message-ID: <su0tvm$1cqko$1@news.trigofacile.com>
References: <str5f1$bt8$1@theuse.news.theuse.net>
<stuhb0$1ann3$1@news.trigofacile.com> <stvp77$m7$1@theuse.news.theuse.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 9 Feb 2022 17:34:46 -0000 (UTC)
Injection-Info: news.trigofacile.com; posting-account="julien"; posting-host="176-143-2-105.abo.bbox.fr:176.143.2.105";
logging-data="1469080"; mail-complaints-to="abuse@trigofacile.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.5.1
Cancel-Lock: sha1:Wi8FuX7nV7m3LAuZAGb40YLVzOc= sha256:o3FSAdWV8NaEpxyHHBaT5dn+0uKkYyvMujVOgIcMGy0=
sha1:AsgbZ+X6/O3BSx1zX2Fr1iEDMvs= sha256:vtppIzv6o+0+OJaNqgblHTQHphtJMK5s70zsPzoxmDw=
Content-Language: en-GB
In-Reply-To: <stvp77$m7$1@theuse.news.theuse.net>
Hi Jason,

The problem is that when you go back far enough, either plain ASCII or
some non-standard encoding is used, and then the non-English characters are
munged. My colleague, Tristan, has been doing some work on this issue as it
affects Esperanto on the early Usenet.

Yes, apparently, the problem only affects old archives (from the last century or so).  When no encoding is specified, non-ASCII chars get mangled.
Thanks for the screenshot and the information that recent articles are correctly archived.
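
As an illustration of the failure mode, here is a minimal sketch of decoding an article body when the charset may be missing or wrong. The fallback order (declared charset, then UTF-8, then ISO-8859-1) is an assumption made for this example, not what archive.org or any newsreader actually does, and for old articles the final guess can still be wrong.

```python
def decode_body(raw: bytes, declared_charset=None):
    """Decode article bytes, returning (text, charset_used).

    Tries the declared charset first, then UTF-8, then ISO-8859-1.
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    # Assumption: UTF-8 is tried before ISO-8859-1 because it rejects
    # most non-UTF-8 input, while ISO-8859-1 accepts any byte sequence.
    candidates += ["utf-8", "iso-8859-1"]
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue
```

Returning the charset that was used lets an archiver record which articles were decoded by guesswork, so they can be revisited later.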


This doesn't really answer the question that I asked in my original article about organizing Usenet hierarchies for archive.org.

I don't have a strong opinion about that.  I would tend to prefer a breakdown by individual hierarchy, as any kind of mixing of hierarchies may not be what users want.

--
Julien ÉLIE

« You know what I did before I married?  Anything I wanted to. »

