
Subject (Author)
* Cleaning up the junk you get from the web these days. Is there a normalized way (Kenny McCormack)
+* Re: Cleaning up the junk you get from the web these days. Is there (Bit Twister)
|`- Missing the point... (Was: Cleaning up the junk you get from the web these days. (Kenny McCormack)
+* Re: Cleaning up the junk you get from the web these days. Is there a normalized (Dan Espen)
|`* Re: Cleaning up the junk you get from the web these days. Is there a normalized (Kenny McCormack)
| +- Re: Cleaning up the junk you get from the web these days. Is there a normalized (Ben Bacarisse)
| `* Re: Cleaning up the junk you get from the web these days. Is there a normalized (Computer Nerd Kev)
|  +- Re: Cleaning up the junk you get from the web these days. Is there a normalized (Kenny McCormack)
|  `* Re: Cleaning up the junk you get from the web these days. Is there a normalized (Kenny McCormack)
|   +* Re: Cleaning up the junk you get from the web these days. Is there a normalized (Dan Espen)
|   |`- Totally missing the point again (Was: Cleaning up the junk you get from the web (Kenny McCormack)
|   `- Re: Cleaning up the junk you get from the web these days. Is there a (Janis Papanagnou)
`- Re: Cleaning up the junk you get from the web these days. Is there a normalized (Ben Bacarisse)

Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0fqcg$1gi29$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5006&group=comp.unix.shell#5006

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 15:37:52 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0fqcg$1gi29$1@news.xmission.com>
Injection-Date: Fri, 11 Mar 2022 15:37:52 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="1591369"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Fri, 11 Mar 2022 15:37 UTC

Cleaning up the junk you get from the web these days.
Is there a normalized way to do it?

Back in the good old days, the Internet was simple, 7 bit ASCII, and
everything was good and proper. But those days are gone. Nowadays, there
is all this i18n glop in the strings we get from The Internet/The Web. In
particular, there seems to be about 9 different ways to represent the
simple "single quote" character (normally represented as "\047").

So, it becomes a normal part of my processing to get rid of this glop. The
tool that I've ended up using is "iconv", and I usually put somewhere in my
pipelines the command: iconv -c

This works reasonably well, but just doesn't quite feel entirely correct;
hence my reason for posting this thread. Note that I don't really
understand the full logic or point of iconv, and I think there are lots of
other command line options and/or environment variables that you can set to
control it - but it seems to work well enough for me just using the "-c"
option.

--
Donald Drumpf claims to be "the least racist person you'll ever meet".

This would be true if the only other person you've ever met was David Duke.

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<slrnt2n0dr.u52e.BitTwister@wb.home.test>

https://www.novabbs.com/devel/article-flat.php?id=5007&group=comp.unix.shell#5007

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: BitTwis...@mouse-potato.com (Bit Twister)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 11:07:07 -0600
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <slrnt2n0dr.u52e.BitTwister@wb.home.test>
References: <t0fqcg$1gi29$1@news.xmission.com>
Injection-Info: reader02.eternal-september.org; posting-host="1b8be0eda43e8670b1fdd8cb0a5aa816";
logging-data="5542"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FK6eJ9QTReBsehLlFrrzJsnIe1MhxgNY="
User-Agent: slrn/pre1.0.4-6 (Linux)
Cancel-Lock: sha1:URTdgRxXYXPkbCyR+XXKc7JgJXo=
 by: Bit Twister - Fri, 11 Mar 2022 17:07 UTC

On Fri, 11 Mar 2022 15:37:52 -0000 (UTC), Kenny McCormack wrote:
> Cleaning up the junk you get from the web these days.
> Is there a normalized way to do it?
>
> Back in the good old days, the Internet was simple, 7 bit ASCII, and
> everything was good and proper. But those days are gone. Nowadays, there
> is all this i18n glop in the strings we get from The Internet/The Web. In
> particular, there seems to be about 9 different ways to represent the
> simple "single quote" character (normally represented as "\047").
>
> So, it becomes a normal part of my processing to get rid of this glop. The
> tool that I've ended up using is "iconv", and I usually put somewhere in my
> pipelines the command: iconv -c
>
> This works reasonably well, but just doesn't quite feel entirely correct;
> hence my reason for posting this thread. Note that I don't really
> understand the full logic or point of iconv, and I think there are lots of
> other command line options and/or environment variables that you can set to
> control it - but it seems to work well enough for me just using the "-c"
> option.

To get the point of a linux app and/or a slightly better understanding
of a linux app I will try the man page for the app. Example man iconv

After that there is a search feature in google.

Missing the point... (Was: Cleaning up the junk you get from the web these days. Is there) a normalized way to do it?

<t0g3v5$1gm4p$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5008&group=comp.unix.shell#5008

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Missing the point... (Was: Cleaning up the junk you get from the web these days. Is there) a normalized way to do it?
Date: Fri, 11 Mar 2022 18:21:25 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0g3v5$1gm4p$1@news.xmission.com>
References: <t0fqcg$1gi29$1@news.xmission.com> <slrnt2n0dr.u52e.BitTwister@wb.home.test>
Injection-Date: Fri, 11 Mar 2022 18:21:25 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="1595545"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Fri, 11 Mar 2022 18:21 UTC

In article <slrnt2n0dr.u52e.BitTwister@wb.home.test>,
Bit Twister <BitTwister@mouse-potato.com> wrote:
....
>To get the point of a linux app and/or a slightly better understanding
>of a linux app I will try the man page for the app. Example man iconv

The point isn't to learn how iconv works. I know how iconv works.
I just don't think it works very well when you just want to get rid of all
the junk. I.e., iconv seems to be solving a different problem than the one
I am talking about.

Like I said, I know how to read man pages (like, duh...) and I know how to
use iconv; I just wish there was a better and more "normal" solution to the
problem. I hardly think I am alone in wanting this.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Aspergers

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0g8o2$k88$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=5009&group=comp.unix.shell#5009

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: dan1es...@gmail.com (Dan Espen)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 14:42:58 -0500
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <t0g8o2$k88$1@dont-email.me>
References: <t0fqcg$1gi29$1@news.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="7f6ba3e826ed0f3a49345b7a64b8a1fa";
logging-data="20744"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Xiyyhb+J9cXYyXHzlB3Kzrc33Zi64xdg="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Cancel-Lock: sha1:G2oOk51WSLbWWdwRdl8clGaZ1FI=
 by: Dan Espen - Fri, 11 Mar 2022 19:42 UTC

gazelle@shell.xmission.com (Kenny McCormack) writes:

> Cleaning up the junk you get from the web these days.
> Is there a normalized way to do it?
>
> Back in the good old days, the Internet was simple, 7 bit ASCII, and
> everything was good and proper. But those days are gone. Nowadays, there
> is all this i18n glop in the strings we get from The Internet/The Web. In
> particular, there seems to be about 9 different ways to represent the
> simple "single quote" character (normally represented as "\047").
>
> So, it becomes a normal part of my processing to get rid of this glop. The
> tool that I've ended up using is "iconv", and I usually put somewhere in my
> pipelines the command: iconv -c
>
> This works reasonably well, but just doesn't quite feel entirely correct;
> hence my reason for posting this thread. Note that I don't really
> understand the full logic or point of iconv, and I think there are lots of
> other command line options and/or environment variables that you can set to
> control it - but it seems to work well enough for me just using the "-c"
> option.

You are using iconv but don't feel it's correct but haven't shown us any
examples of what you are trying to do or how it's failing.

I'll have to guess you've asked it to convert utf8 to ascii.
Maybe you did something else. The man page shows an example making the
target "ASCII//TRANSLIT". The translit part sounds like it might help.

iconv sounds to me like the right tool. If there are other multi-byte
sequences you want to handle, something like sed can do the job.

--
Dan Espen

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<87fsno2qxf.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=5010&group=comp.unix.shell#5010

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 20:08:12 +0000
Organization: A noiseless patient Spider
Lines: 59
Message-ID: <87fsno2qxf.fsf@bsb.me.uk>
References: <t0fqcg$1gi29$1@news.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="4f06ec671da2a5ebff543b9e3793624b";
logging-data="7036"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+cLk6cPLyAkmqrGOrVCUyMi7ilWPUkz78="
Cancel-Lock: sha1:P0LkUuJTeWx8cQs5myln8Oii45g=
sha1:OIyHj4JgECVIOcKsRqx4pZ4AGjM=
X-BSB-Auth: 1.d32c4b637cd2b1037443.20220311200812GMT.87fsno2qxf.fsf@bsb.me.uk
 by: Ben Bacarisse - Fri, 11 Mar 2022 20:08 UTC

gazelle@shell.xmission.com (Kenny McCormack) writes:

> Cleaning up the junk you get from the web these days.
> Is there a normalized way to do it?
>
> Back in the good old days, the Internet was simple, 7 bit ASCII, and
> everything was good and proper.

Maybe you didn't look very hard. In the bad old days there were
inconsistent 8-bit versions of just about everything and (though there
were more) two common, widely used, incompatible character sets.

And don't get started on file types!

> But those days are gone.

Now those bad days are largely gone. Almost everything is a stream of
bytes, and there is one almost universally agreed character set. And
almost all protocols get to announce the encoding they are using so you
don't have to guess anymore.

> Nowadays, there
> is all this i18n glop in the strings we get from The Internet/The Web.

What on earth is i18n glop?

> In
> particular, there seems to be about 9 different ways to represent the
> simple "single quote" character (normally represented as "\047").

"Good old ASCII" had two single quotes -- an opening one and a closing
one -- though these were secondary meanings. The "closing quote" (also
called acute accent) is more usually referred to as apostrophe. The
modern rendering of it as vertical belies its original purpose.

There are many ways to represent that character (&#39; for example), but
iconv -c won't handle these different representations of code point 39
(\047).

But there are also other characters that better fit the description of
"single quote". These used to be very common on the Web because, in a
twist of fate, Windows software often uses a closing single quote as an
apostrophe. I don't see that nearly as often these days. Maybe this is
one of the 9 you see?
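
As a rough illustration of the point above, the different spellings of the apostrophe can be folded into plain \047 with a handful of sed substitutions. This is only a sketch: the function name normalize_apostrophes is made up for the example, and the list of look-alikes (three HTML entities for code point 39, two named entities, and the UTF-8 forms of U+2018 and U+2019) is an assumption about what typically turns up, not a complete inventory.

```sh
#!/bin/sh
# Hypothetical helper: fold a few common apostrophe look-alikes into ASCII \047.
# The list of spellings is illustrative, not exhaustive.

# Build the UTF-8 byte sequences with printf so the script itself stays ASCII.
lsquo=$(printf '\342\200\230')   # U+2018 LEFT SINGLE QUOTATION MARK
rsquo=$(printf '\342\200\231')   # U+2019 RIGHT SINGLE QUOTATION MARK

normalize_apostrophes() {
    # LC_ALL=C makes sed work on plain bytes, whatever the caller's locale is.
    LC_ALL=C sed \
        -e "s/&#39;/'/g" \
        -e "s/&#x27;/'/g" \
        -e "s/&apos;/'/g" \
        -e "s/&lsquo;/'/g" \
        -e "s/&rsquo;/'/g" \
        -e "s/$lsquo/'/g" \
        -e "s/$rsquo/'/g"
}

# Demo: one HTML entity and one UTF-8 right single quote.
printf 'It&#39;s Bob\342\200\231s file\n' | normalize_apostrophes
# prints: It's Bob's file
```

Running this before any lossy step means the apostrophes survive instead of being deleted.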

> So, it becomes a normal part of my processing to get rid of this glop. The
> tool that I've ended up using is "iconv", and I usually put somewhere in my
> pipelines the command: iconv -c
>
> This works reasonably well, but just doesn't quite feel entirely correct;
> hence my reason for posting this thread.

It's not clear what you want and it's not clear what the source data
looks like. Do you take into account any declared character set
headers? If so, converting to UTF-8 would probably avoid the need to
discard anything in the input.
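
A minimal sketch of that idea, assuming the declared charset is already known and is passed in by hand (the helper name to_utf8 and the choice of WINDOWS-1252 for the demo are assumptions for illustration):

```sh
#!/bin/sh
# Hypothetical helper: convert from a declared charset to UTF-8.
to_utf8() {
    # $1 is the source charset as announced by the sender,
    # e.g. taken from a Content-Type header (an assumption in this sketch).
    iconv -f "$1" -t UTF-8
}

# Byte 0x92 (octal 222) is the right single quote in Windows-1252.
# After conversion it becomes the three-byte UTF-8 sequence for U+2019.
printf 'It\222s here\n' | to_utf8 WINDOWS-1252
```

Nothing is thrown away here; the quote is still a quote, just in a different encoding.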

--
Ben.

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0gbu6$1gr1u$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5011&group=comp.unix.shell#5011

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 20:37:26 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0gbu6$1gr1u$1@news.xmission.com>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me>
Injection-Date: Fri, 11 Mar 2022 20:37:26 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="1600574"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Fri, 11 Mar 2022 20:37 UTC

In article <t0g8o2$k88$1@dont-email.me>,
Dan Espen <dan1espen@gmail.com> wrote:
>gazelle@shell.xmission.com (Kenny McCormack) writes:
>
>> Cleaning up the junk you get from the web these days.
>> Is there a normalized way to do it?
>>
>> Back in the good old days, the Internet was simple, 7 bit ASCII, and
>> everything was good and proper. But those days are gone. Nowadays, there
>> is all this i18n glop in the strings we get from The Internet/The Web. In
>> particular, there seems to be about 9 different ways to represent the
>> simple "single quote" character (normally represented as "\047").
>>
>> So, it becomes a normal part of my processing to get rid of this glop. The
>> tool that I've ended up using is "iconv", and I usually put somewhere in my
>> pipelines the command: iconv -c
>>
>> This works reasonably well, but just doesn't quite feel entirely correct;
>> hence my reason for posting this thread. Note that I don't really
>> understand the full logic or point of iconv, and I think there are lots of
>> other command line options and/or environment variables that you can set to
>> control it - but it seems to work well enough for me just using the "-c"
>> option.
>
>You are using iconv but don't feel it's correct but haven't shown us any
>examples of what you are trying to do or how it's failing.
>
>I'll have to guess you've asked it to convert utf8 to ascii.
>Maybe you did something else. The man page shows an example making the
>target "ASCII//TRANSLIT". The translit part sounds like it might help.
>
>iconv sounds to me like the right tool. If there are other multi-byte
>sequences you want to handle, something like sed can do the job.

I get what you are saying. But I just want something that will remove
any/all high-ASCII junk. It *might* be as simple as simply writing a
simple search-and-replace script in your-favorite-scripting-language (in my
case, that would be AWK) to remove any character with ASCII value > 127.
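
For what it is worth, such an AWK filter might look roughly like the following. It is a sketch only: the name strip_high_awk is invented for the example, and it assumes an awk that treats input as bytes when run with LC_ALL=C.

```sh
#!/bin/sh
# Hypothetical helper: delete every byte with a value above 127.
strip_high_awk() {
    # LC_ALL=C asks awk to treat the input as single bytes rather than
    # multi-byte characters, so the octal range matches individual bytes.
    LC_ALL=C awk '{ gsub(/[\200-\377]/, ""); print }'
}

# Demo: the e-acute is two bytes in UTF-8 (\303\251); both are removed.
printf 'caf\303\251 au lait\n' | strip_high_awk
```

Note that this deletes rather than replaces, so an accented word simply loses a letter.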

But, my feeling is that there must be something better.

Also, my sense is that iconv doesn't do enough. Sometimes, even after
running it through iconv, you'll still see non-ASCII junk in the file.
Also, as I mentioned, the main character that seems to have a problem is
the ' character. It'd be nice if some commonly accepted solution would at
least handle all the mis-codings of that character.

Anyway, I was curious to find out what other people use, and how they have
fared with this problem.

--
"There's no chance that the iPhone is going to get any significant market share. No chance." - Steve Ballmer

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<87a6dw2n6m.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=5012&group=comp.unix.shell#5012

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 21:29:05 +0000
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <87a6dw2n6m.fsf@bsb.me.uk>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me>
<t0gbu6$1gr1u$1@news.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="4f06ec671da2a5ebff543b9e3793624b";
logging-data="6726"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rAlaNkoH8G7BTvk3gXC2Wr10Jrgzd5ps="
Cancel-Lock: sha1:XRCcE5dDYdHAkgaxVPsAAzIKwOc=
sha1:5mpuWBqev7mfU5Oi18VclUOQmjA=
X-BSB-Auth: 1.5ad6ac803d19a2076e7f.20220311212905GMT.87a6dw2n6m.fsf@bsb.me.uk
 by: Ben Bacarisse - Fri, 11 Mar 2022 21:29 UTC

gazelle@shell.xmission.com (Kenny McCormack) writes:

> ... But I just want something that will remove
> any/all high-ASCII junk.

That's a clearer statement of what you want (minus the trolling "junk").
Isn't

tr -d '\200-\377'

what you want?
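
A small sketch of the range form, wrapped in a function so it can sit in a pipeline (strip_high is a name made up here; LC_ALL=C is added on the assumption that some tr implementations object to stray high bytes in a UTF-8 locale):

```sh
#!/bin/sh
# Hypothetical helper: delete every byte from octal 200 to 377 (128 to 255).
strip_high() {
    LC_ALL=C tr -d '\200-\377'
}

# Demo: the three bytes of U+2019 (\342\200\231) are all in that range.
printf 'It\342\200\231s\n' | strip_high
# prints: Its
```

The hyphen matters: without it, tr would delete only the two bytes \200 and \377.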

> Anyway, I was curious to find out what other people use, and how they have
> fared with this problem.

I've never come across a need for throwing characters away. Very often
the "junk" is there for a purpose.

--
Ben.

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0gf09$12jr$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=5013&group=comp.unix.shell#5013

Path: i2pn2.org!i2pn.org!aioe.org!wn3sXX9SOBF44+pwmgXrQQ.user.46.165.242.75.POSTED!not-for-mail
From: not...@telling.you.invalid (Computer Nerd Kev)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 21:29:46 -0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <t0gf09$12jr$1@gioia.aioe.org>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me> <t0gbu6$1gr1u$1@news.xmission.com>
Injection-Info: gioia.aioe.org; logging-data="35451"; posting-host="wn3sXX9SOBF44+pwmgXrQQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: tin/2.0.1-20111224 ("Achenvoir") (UNIX) (Linux/2.4.31 (i586))
X-Notice: Filtered by postfilter v. 0.9.2
 by: Computer Nerd Kev - Fri, 11 Mar 2022 21:29 UTC

Kenny McCormack <gazelle@shell.xmission.com> wrote:
> In article <t0g8o2$k88$1@dont-email.me>,
> Dan Espen <dan1espen@gmail.com> wrote:
>>gazelle@shell.xmission.com (Kenny McCormack) writes:
>>
>>> Cleaning up the junk you get from the web these days.
>>> Is there a normalized way to do it?
>>>
>>> Back in the good old days, the Internet was simple, 7 bit ASCII, and
>>> everything was good and proper. But those days are gone. Nowadays, there
>>> is all this i18n glop in the strings we get from The Internet/The Web. In
>>> particular, there seems to be about 9 different ways to represent the
>>> simple "single quote" character (normally represented as "\047").
>>>
>>> So, it becomes a normal part of my processing to get rid of this glop. The
>>> tool that I've ended up using is "iconv", and I usually put somewhere in my
>>> pipelines the command: iconv -c
>>>
>>> This works reasonably well, but just doesn't quite feel entirely correct;
>>> hence my reason for posting this thread. Note that I don't really
>>> understand the full logic or point of iconv, and I think there are lots of
>>> other command line options and/or environment variables that you can set to
>>> control it - but it seems to work well enough for me just using the "-c"
>>> option.
>>
>>You are using iconv but don't feel it's correct but haven't shown us any
>>examples of what you are trying to do or how it's failing.
>>
>>I'll have to guess you've asked it to convert utf8 to ascii.
>>Maybe you did something else. The man page shows an example making the
>>target "ASCII//TRANSLIT". The translit part sounds like it might help.
>>
>>iconv sounds to me like the right tool. If there are other multi-byte
>>sequences you want to handle, something like sed can do the job.
>
> I get what you are saying. But I just want something that will remove
> any/all high-ASCII junk. It *might* be as simple as simply writing a
> simple search-and-replace script in your-favorite-scripting-language (in my
> case, that would be AWK) to remove any character with ASCII value > 127.

If you mean more like "search and delete", then I've seen this in
the past and kept it in mind for any case where iconv isn't
available:
tr -cd "\11\12\15\40-\176"
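
Spelled out as a sketch (keep_printable is just a name picked for the example), that whitelist keeps tab, newline, carriage return and the printable ASCII range, and deletes every other byte, including control characters such as ESC:

```sh
#!/bin/sh
# Hypothetical helper: keep only tab (\11), LF (\12), CR (\15) and
# printable ASCII (\40 through \176); -c complements the set, -d deletes.
keep_printable() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Demo: the ESC byte (\033) and the three bytes of U+2019 are removed;
# the tab and the visible "[0m" text stay.
printf 'a\tb\033[0m c\342\200\231d\n' | keep_printable
```

Because it is a whitelist, it is stricter than deleting \200-\377 alone.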

> But, my feeling is that there must be something better.
>
> Also, my sense is that iconv doesn't do enough. Sometimes, even after
> running it through iconv, you'll still see non-ASCII junk in the file.
> Also, as I mentioned, the main character that seems to have a problem is
> the ' character. It'd be nice if some commonly accepted solution would at
> least handle all the mis-codings of that character.

I find iconv does the job perfectly, running it like this:
iconv -f utf-8 -t ASCII//TRANSLIT

Compared to the tr command, the TRANSLIT functionality (search and
replace instead of search and delete) is very nice. Perhaps a way
to add custom rules for converting characters would be an
improvement, though not one that I frequently desire.
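
A sketch of how that could be wrapped for use in a pipeline follows. The function name to_ascii is made up for the example, and //TRANSLIT is an extension that not every iconv supports; which replacement is chosen for a given character depends on the implementation and locale, so no particular output is promised here.

```sh
#!/bin/sh
# Hypothetical helper: UTF-8 in, ASCII out, replacing characters with
# ASCII look-alikes where the local iconv knows one.
to_ascii() {
    iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Demo: a right single quote and a pair of curly double quotes.
# The fallback message covers iconv builds without //TRANSLIT.
printf 'It\342\200\231s \342\200\234fine\342\200\235\n' | to_ascii ||
    echo 'iconv could not convert the input' >&2
```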

> Anyway, I was curious to find out what other people use, and how they have
> fared with this problem.

Sorry, just another iconv person, but faring quite well with it
anyway.

--
__ __
#_ < |\| |< _#

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0ggmh$1gt71$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5014&group=comp.unix.shell#5014

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Fri, 11 Mar 2022 21:58:41 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0ggmh$1gt71$1@news.xmission.com>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me> <t0gbu6$1gr1u$1@news.xmission.com> <t0gf09$12jr$1@gioia.aioe.org>
Injection-Date: Fri, 11 Mar 2022 21:58:41 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="1602785"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Fri, 11 Mar 2022 21:58 UTC

In article <t0gf09$12jr$1@gioia.aioe.org>,
Computer Nerd Kev <not@telling.you.invalid> wrote:
....
>I find iconv does the job perfectly, running it like this:
>iconv -f utf-8 -t ASCII//TRANSLIT

Thanks. I'll try that at some point.

(I think you still need -c, or else it will error out when it sees
something unexpected - which, of course, can and does always happen in real
life).
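
Roughly, combining the two might look like this. It is a sketch under assumptions: to_ascii_lossy is an invented name, and how -c treats malformed input, and what exit status iconv returns afterwards, differs between implementations, so the script checks the status instead of assuming success.

```sh
#!/bin/sh
# Hypothetical helper: transliterate where possible, and with -c drop
# whatever still cannot be converted instead of stopping.
to_ascii_lossy() {
    iconv -c -f UTF-8 -t ASCII//TRANSLIT
}

# Demo: byte \377 can never appear in valid UTF-8, so this input is malformed.
if printf 'ok\377 then\n' | to_ascii_lossy; then
    echo 'converted cleanly' >&2
else
    echo 'iconv exited non-zero; output may be missing characters' >&2
fi
```

Checking the exit status at least tells the pipeline that something was dropped.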

--
"If our country is going broke, let it be from feeding the poor and caring for
the elderly. And not from pampering the rich and fighting wars for them."

--Living Blue in a Red State--

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0ngs4$1kc45$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5028&group=comp.unix.shell#5028

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Mon, 14 Mar 2022 13:44:36 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0ngs4$1kc45$1@news.xmission.com>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me> <t0gbu6$1gr1u$1@news.xmission.com> <t0gf09$12jr$1@gioia.aioe.org>
Injection-Date: Mon, 14 Mar 2022 13:44:36 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="1716357"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Mon, 14 Mar 2022 13:44 UTC

In article <t0gf09$12jr$1@gioia.aioe.org>,
Computer Nerd Kev <not@telling.you.invalid> wrote:
....
>I find iconv does the job perfectly, running it like this:
>iconv -f utf-8 -t ASCII//TRANSLIT

I have put this line into production and it seems to work well. Thanks again.

However, it still just doesn't feel right. I mean, look at all those
"funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What
is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is
that a novice (and for the purposes of this particular area of discussion,
you can consider me to be a novice) would have no idea what these things
mean or what will need to be changed as time goes on.

That was the point (the "Reason for posting") of this thread. That there
should be a more "macro" type solution. More of a "You're the computer;
you figure it out" type solution.

But, apparently, there isn't.

>Compared to the tr command, the TRANSLIT functionality (search and
>replace instead of search and delete) is very nice. Perhaps a way
>to add custom rules for converting characters would be an
>improvement, though not one that I frequently desire.

As I said, it works. As long as you are going utf8 (whatever that is; yes,
I'm kidding) to ASCII (I know what that is). What if the next time I get
some data for this system, it is in utf-9 (or utf-10 or whatever) ?

>> Anyway, I was curious to find out what other people use, and how they have
>> fared with this problem.
>
>Sorry, just another iconv person, but faring quite well with it
>anyway.

Alas, that seems to be as far as it goes...

--
http://www.rollingstone.com/politics/news/the-10-dumbest-things-ever-said-about-global-warming-20130619

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0nhuc$3jp$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=5029&group=comp.unix.shell#5029

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: dan1es...@gmail.com (Dan Espen)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Mon, 14 Mar 2022 10:02:52 -0400
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <t0nhuc$3jp$1@dont-email.me>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me>
<t0gbu6$1gr1u$1@news.xmission.com> <t0gf09$12jr$1@gioia.aioe.org>
<t0ngs4$1kc45$1@news.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="b21afcdf3da9d748feb233972e4024dc";
logging-data="3705"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/5zfeAXrSq3tCVY2MBZvPcnvE1OVXbTfM="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Cancel-Lock: sha1:YJxc3yYLHD+7KLQFzHIrPwuDjpQ=
 by: Dan Espen - Mon, 14 Mar 2022 14:02 UTC

gazelle@shell.xmission.com (Kenny McCormack) writes:

> In article <t0gf09$12jr$1@gioia.aioe.org>,
> Computer Nerd Kev <not@telling.you.invalid> wrote:
> ...
>>I find iconv does the job perfectly, running it like this:
>>iconv -f utf-8 -t ASCII//TRANSLIT
>
> I have put this line into production and it seems to work well. Thanks again.
>
> However, it still just doesn't feel right. I mean, look at all those
> "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What
> is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is
> that a novice (and for the purposes of this particular area of discussion,
> you can consider me to be a novice) would have no idea what these things
> mean or what will need to be changed as time goes on.
>
> That was the point (the "Reason for posting") of this thread. That there
> should be a more "macro" type solution. More of a "You're the computer;
> you figure it out" type solution.
>
> But, apparently, there isn't.

Since the solution was staring you in the face in the man page
I disagree. All that may be lacking is a more detailed explanation
of what the example does, but
"The next example converts from UTF-8 to ASCII, transliterating when
possible:"
Seems pretty clear to me.

>>Compared to the tr command, the TRANSLIT functionality (search and
>>replace instead of search and delete) is very nice. Perhaps a way
>>to add custom rules for converting characters would be an
>>improvement, though not one that I frequently desire.
>
> As I said, it works. As long as you are going utf8 (whatever that is; yes,
> I'm kidding) to ASCII (I know what that is). What if the next time I get
> some data for this system, it is in utf-9 (or utf-10 or whatever) ?

If you knew what utf-8 was, it's hard to imagine why you would mention
non-existing code pages.

>>> Anyway, I was curious to find out what other people use, and how they have
>>> fared with this problem.
>>
>>Sorry, just another iconv person, but faring quite well with it
>>anyway.
>
> Alas, that seems to be as far as it goes...

Declaring a problem when none exists? Submit the man page correction if
you think something is missing.

--
Dan Espen

Totally missing the point again (Was: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?)

<t0nid4$1kc45$2@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=5030&group=comp.unix.shell#5030

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Totally missing the point again (Was: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?)
Date: Mon, 14 Mar 2022 14:10:44 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <t0nid4$1kc45$2@news.xmission.com>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0gf09$12jr$1@gioia.aioe.org> <t0ngs4$1kc45$1@news.xmission.com> <t0nhuc$3jp$1@dont-email.me>
Injection-Date: Mon, 14 Mar 2022 14:10:44 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4"; logging-data="1716357"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Mon, 14 Mar 2022 14:10 UTC

In article <t0nhuc$3jp$1@dont-email.me>,
Dan Espen <dan1espen@gmail.com> demonstrates that he has totally missed
the point of my subtle, but amusing little contribution:

etc, etc
--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Seriously

Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

<t0o19e$7m0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=5031&group=comp.unix.shell#5031

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Cleaning up the junk you get from the web these days. Is there a normalized way to do it?
Date: Mon, 14 Mar 2022 19:24:45 +0100
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <t0o19e$7m0$1@dont-email.me>
References: <t0fqcg$1gi29$1@news.xmission.com> <t0g8o2$k88$1@dont-email.me> <t0gbu6$1gr1u$1@news.xmission.com> <t0gf09$12jr$1@gioia.aioe.org> <t0ngs4$1kc45$1@news.xmission.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 14 Mar 2022 18:24:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5b6b7f577aad48b393057b7b8f7867eb"; logging-data="7872"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18QzCiIEdaK5bw3MYSsgd7r"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
Cancel-Lock: sha1:o4Y+hfGUzSnkyPzxDw1/aciWkYY=
In-Reply-To: <t0ngs4$1kc45$1@news.xmission.com>
X-Enigmail-Draft-Status: N1110
 by: Janis Papanagnou - Mon, 14 Mar 2022 18:24 UTC

On 14.03.2022 14:44, Kenny McCormack wrote:
> In article <t0gf09$12jr$1@gioia.aioe.org>,
> Computer Nerd Kev <not@telling.you.invalid> wrote:
> ...
>> I find iconv does the job perfectly, running it like this:
>> iconv -f utf-8 -t ASCII//TRANSLIT
>
> However, it still just doesn't feel right. I mean, look at all those
> "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What
> is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is
> that a novice (and for the purposes of this particular area of discussion,
> you can consider me to be a novice) would have no idea what these things
> mean or what will need to be changed as time goes on.

And in your original post you wrote: "Back in the good old days, the
Internet was simple, 7 bit ASCII, and everything was good and proper."

In my book that boils down to two observations.
First, from an isolated, US-centric/US-only point of view, that may
make sense. From a, say, EU view we can say we left the Stone Age and
are now able to express our languages with Unicode and communicate
all over the world and across borders. There's a universal character
set defined, and a quasi-standard encoding based on (quasi-standard)
units (octets, or "bytes" if someone prefers a less precise term).
The second observation is the inherent complexity; the character set
topic is not trivial (and still not every tool supports it correctly).
And a tool like iconv reflects that complexity (to a degree).
While "code-page" mappings (say, Windows to ISO Latin) are simple[*],
other "conversions" like transliterations, which go beyond a
technical mapping, are generally not that trivial. (And yes, the '//'
delimiter is not common (and maybe there are better choices?). OTOH,
Unix is full of inconsistent syntax, and this one is harmless compared
to some other syntax variants, like inconsistent option specification
(-o, -opt, --opt, opt=, etc.) across many of the Unix tools.)

[*] Remember those "good ol' days" when for conversion we had (only?)
the 'dd' command, which was able to convert from/to EBCDIC.
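
To make the mapping-versus-transliteration distinction concrete, here is a rough sketch. It assumes an iconv that knows the names CP1252 and ISO-8859-1 and supports the //TRANSLIT suffix (glibc and GNU libiconv do; other implementations may behave differently). The helper name to_hex is made up for this example.

```shell
# Print stdin as a compact hex string (helper name is hypothetical).
to_hex() { od -An -tx1 | tr -d ' \n'; }

# A plain code-page mapping is a table lookup: byte 0xE9 is "e with acute"
# in Windows-1252 and has the same code in ISO Latin-1.
printf '\351' | iconv -f CP1252 -t ISO-8859-1 | to_hex; echo    # prints e9

# Byte 0x93 is a curly opening quote in Windows-1252. Latin-1 has no such
# character, so a strict mapping has no table entry for it (glibc's iconv
# reports an error and exits non-zero here).
printf '\223' | iconv -f CP1252 -t ISO-8859-1 >/dev/null 2>&1 ||
    echo "no Latin-1 equivalent"

# Transliteration goes beyond the table and substitutes a look-alike;
# which one is up to the implementation and the locale.
printf '\223' | iconv -f CP1252 -t ISO-8859-1//TRANSLIT
echo
```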

>
> That was the point (the "Reason for posting") of this thread. That there
> should be a more "macro" type solution.

What would a "macro" type solution look like? (I mean, we can hide
ugly syntax issues for special-purpose applications in wrapper scripts
or functions.)
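
One possible shape for such a wrapper, as a sketch only: the function name web2ascii and its optional argument are invented here, and the UTF-8 default is an assumption about the input, not something the data can confirm.

```shell
# web2ascii [FROM_ENCODING]
# Filter stdin to plain ASCII, hiding the iconv syntax behind one name.
# FROM_ENCODING defaults to UTF-8 (an assumption; pass another name when
# the source of the data says otherwise).
web2ascii() {
    iconv -f "${1:-UTF-8}" -t ASCII//TRANSLIT
}

# Usage: the caller no longer sees any of the "funny constants".
printf 'Hello, world\n' | web2ascii
```

The wrapper only hides the syntax; somebody still has to know, or be told, what encoding the input is in.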

> More of a "You're the computer; you figure it out" type solution.

There's not enough information in the data that allows "the computer"
to determine the code page; you need meta-data for it; the from/to
arguments in iconv, for example.
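
A small illustration of why the metadata matters, assuming an iconv that knows the names CP1252 and ISO-8859-1: the same input byte yields different characters depending on the -f argument, and nothing in the byte itself says which reading is right. The function name below is made up for this example.

```shell
# Decode the single byte 0x93 under a given source encoding and print the
# resulting UTF-8 bytes as hex (function name is hypothetical).
decode_0x93_as() {
    printf '\223' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

decode_0x93_as CP1252       # e2809c  (U+201C, a curly opening quote)
decode_0x93_as ISO-8859-1   # c293    (U+0093, an invisible control character)
```

Both conversions succeed, so iconv has no way to flag the wrong one.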

Yes, "the world" (including "the Internet") was simpler before, but it
did not reflect global demands.

Janis
