Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfe4os$29e$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=458&group=uk.comp.os.linux#458
Path: i2pn2.org!i2pn.org!aioe.org!f3Ja+IUlF3LLNCdyvqay1w.user.46.165.242.91.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Mon, 16 Aug 2021 18:46:19 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sfe4os$29e$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<sfc0r9$1709$1@gioia.aioe.org> <sfdaqc$1hq3$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="2350"; posting-host="f3Ja+IUlF3LLNCdyvqay1w.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Content-Language: it
X-Notice: Filtered by postfilter v. 0.9.2
 by: jak - Mon, 16 Aug 2021 16:46 UTC

On 16/08/2021 11:23, Java Jive wrote:
> On 15/08/2021 22:27, jak wrote:
>>
>> On 15/08/2021 13:57, Java Jive wrote:
>>>
>>> Can anyone suggest a sequence that will find the file, when put
>>> inside quotes as the filename in the controlling data file mentioned
>>> previously in the thread, so that it can just be treated like all the
>>> other lines? As someone here suggested the data file is now stored as
>>> UTF-8 rather than ANSI as it was formerly, and some example lines are
>>> given below in a form for easier readability in a ng  -  in reality
>>> the fields are tab separated but here are separated by double spacing
>>> and have been further abbreviated to keep them from wrapping; leading
>>> symbols such as '+' and '=' have special meanings for the program
>>> doing the work; and, yes, the commands are basically DOS commands
>>> which for Linux are translated to their bash equivalents:
>>>
>>> =ATTRIB -R  "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
>>> =RD  "./F H /_all/1o/Blessig & Heyder"
>>> REN  "./Chat Bott'$'\302\202'', Le"  "Chat Botté, Le"
>>> MOVE  "./Photo - D & M Close.png" "./Photos/D & M Close.png"
>>> [etc]
>>>
>>
>> Hi,
>> you could use the find command looking for filenames as a regular
>> expression, then use the command you need on them.
>> In this example I search for files with the extension ".o", display the
>> name with the command 'echo' and display it again converted to
>> uppercase:
>>
>>   find . -iregex ".*\.o$" -exec bash -c "echo -n original: {} && echo
>> \"     modified: {}\" | tr a-z A-Z" \;
>>
>> There should be everything you need.
>
> Thanks but no, that doesn't work.  I had considered running a
> pre-process, before the script works through the data file, to find and
> rename all these characters, but neither find nor ls will actually find
> the erroneous characters *DIRECTLY*.  The best either can do is find the
> characters either side, but that means I have to know in advance where
> all the problems are, and I'm not sure yet that I do.  Really, if I'm
> going to go down that road, I need a way of searching the entire archive
> structure directly for affected files and renaming them, as a separate
> process from working through the data file.
>
> So, for example, this works because I'm specifying and finding the
> neighbouring characters of one known instance, not because ls is finding
> the oddball characters directly ...
>     ls Chat\ Bott?,\ Le | sed 's~\xc2\x82~é~g'
> ... whereas these don't, with neither single nor double backslashes nor
> various other combinations that I've tried, because neither find nor ls
> seem able to find the oddball characters directly:
>     find . -regex ".*\\xc2\\x82.*"
>     ls -R *\\xc2\\x82*
>     ls -R *'$'\\302\\202''*
>

OK, I finally understand your problem (old age?). I tried to reproduce
it, and in my opinion you could do it this way:

Add the -b option to the ls command; it prints the bad characters as
octal escape sequences in plain text. So this file:

-rw-r--r-- 1 jak NONE 0 Aug 16 16:57 'foo'$'\302\202'

$ ls -1 foo*
'foo'$'\302\202'

will become:
$ ls -1b foo*
foo\302\202

Now you can search for it as if it were ordinary text, for example with
the grep command:

$ ls -1b foo* | grep -F "\\302"
foo\302\202

$ ls -1b | grep -F "foo\\302\\202"
foo\302\202

I hope it helps you
cheers
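
The find and ls attempts quoted above hand the literal text \xc2\x82 or
\302\202 to the tool, and as far as I know neither shell glob patterns
nor find's regular expressions turn such escapes into bytes. A minimal
sketch of one alternative, assuming a POSIX shell and find: let printf
build the two raw bytes first and pass them to -name. The helper name
find_mangled is made up for illustration; in bash, $'*\302\202*' would
build the same pattern.

```shell
# find_mangled is a hypothetical helper, not something from the thread.
# It lists every path under the given directory (default: .) whose name
# contains the byte pair 0xC2 0x82.
find_mangled() {
    # printf converts the octal escapes into the two raw bytes; the result
    # is quoted so the shell hands the pattern to find without expanding it.
    pattern=$(printf '*\302\202*')
    # LC_ALL=C asks find to compare names byte by byte, whatever the locale.
    LC_ALL=C find "${1:-.}" -name "$pattern" -print
}
```

Called as "find_mangled ." it searches recursively, which a bare shell
glob such as *$'\302\202'* does not.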

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfebr1$rvr$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=465&group=uk.comp.os.linux#465
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Mon, 16 Aug 2021 14:46:56 -0400
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <sfebr1$rvr$1@dont-email.me>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org> <inviqqFeqbiU1@mid.individual.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 16 Aug 2021 18:46:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e5ab00210629d844a9ae6a3fec4ebd62";
logging-data="28667"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1950B2TJsf/XQbmAQDme/lTr9uYs57Ayik="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:tgX3YYEOD2r72eCaUAjDM+G+JdQ=
In-Reply-To: <inviqqFeqbiU1@mid.individual.net>
 by: Paul - Mon, 16 Aug 2021 18:46 UTC

Andy Burns wrote:
> Java Jive wrote:
>
>> console listing shows a very odd character sequence instead of the e
>> acute ...
>> "Chat Bott'$'\302\202'', Le"
>
> Are you sure the filename is exactly as you say/think? What does
>
> ls -b
>
> show?

Using a Perl script, I created some examples.
File "Y" is the php-failure induced problem name the OP has.
File "Z" is the visually-correct one.

https://i.postimg.cc/gksLyGFL/rename2-output.gif

So you can create your own for a test.

*********************** rename2.ps *************************
use strict;
use warnings;
use Cwd;

printf("this is a test\n");

my $start    = "Chat Bott";
my $finish   = ", Le";
my $naughty1 = "\x{C3}\x{A9}";   # two bytes: e-acute encoded as UTF-8
my $naughty2 = "\x{E9}";         # one byte: e-acute in ISO-8859-1

my $x = $start.$finish;
my $y = $start.$naughty1.$finish;
my $z = $start.$naughty2.$finish;

# Create (or touch) one empty file per name.
for my $name ($x, $y, $z) {
    open(my $out, '>>', $name) || die("Cannot create $name: $!");
    close($out);
}

my $c = getcwd;

printf("Making a mess in %s\n", $c);

#rename( $y , $z );

exit(0);
*********************** end of rename2.ps *************************

Paul
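
For a machine without Perl, roughly the same test files can be made from
the shell. This is only a sketch: make_test_names is an invented name,
and the third file is my addition, using the byte pair C2 82 that the
quoted listing shows as '\302\202'. Paul's file Z (the single byte E9)
is left out on the assumption that some file systems refuse names that
are not valid UTF-8.

```shell
# make_test_names is a hypothetical helper; it creates empty test files
# in the given directory (default: .).
make_test_names() {
    dir=${1:-.}
    # Plain ASCII name, like Paul's file X.
    : >> "$dir/Chat Bott, Le"
    # e-acute encoded as UTF-8 (bytes C3 A9), like Paul's file Y.
    : >> "$dir/$(printf 'Chat Bott\303\251, Le')"
    # The byte pair C2 82, which ls shows as \302\202.
    : >> "$dir/$(printf 'Chat Bott\302\202, Le')"
}
```

Running it in an empty scratch directory and then "ls -b" there shows how
each name is listed.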

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfeit7$5ag$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=466&group=uk.comp.os.linux#466
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mar...@mydomain.invalid (Martin Gregorie)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Mon, 16 Aug 2021 20:47:35 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <sfeit7$5ag$1@dont-email.me>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<sfc0r9$1709$1@gioia.aioe.org> <sfdaqc$1hq3$1@gioia.aioe.org>
<zlhr6A1+YDCIh3bGX@bongo-ra.co> <sfdsve$9bg$1@gioia.aioe.org>
<sfe1v0$u60$1@dont-email.me> <sfe3mo$1h6t$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 16 Aug 2021 20:47:35 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ff94f311d8e737fbb646f0f4b3ea57fe";
logging-data="5456"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++nrp8oak3DDrbHh6GLdfpD5FjGXg2Iqk="
User-Agent: Pan/0.146 (Hic habitat felicitas; 8107378
git@gitlab.gnome.org:GNOME/pan.git)
Cancel-Lock: sha1:I3o+e0GZCiXgyuXP4utia5hhTj4=
 by: Martin Gregorie - Mon, 16 Aug 2021 20:47 UTC

On Mon, 16 Aug 2021 17:28:06 +0100, Java Jive wrote:

> On 16/08/2021 16:58, Martin Gregorie wrote:
>> On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:
>>
>>> No luck with that either ...
>>> ls: cannot access '*'$'\302\202''*': No such file or directory
>>>
>> Might be worth writing a noddy Java program to see if it can resolve
>> your problem character codes.
>>
>> The Java 'char' primitive can hold multibyte character values, and the
>> Character class provides methods to recognise character types,
>> lengths, and non-Unicode characters.
>
> But I can't be sure that any of the target machines will have Java,
> Perl, or Python installed. This has to be achieved with what will
> normally be installed on a Linux or MacOS box.

Does that matter? I thought you were treating this archived article name
sanitization as either a one-off activity or something that doesn't
happen regularly and, anyway, that it was something that you did on your
system before distributing the results round your family group.

As it happens I've just knocked up a bit of Java to see just what it can
do in the way of automated character translation, so if you'd care to
send me, martin@gregorie.org, a short file (100-500 chars max) containing
a mix of readable and non-readable example text, I'll run it through my
code.

Attaching it as a gzipped file should get it here without further
mangling.

--
Martin | martin at
Gregorie | gregorie dot org

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfen3u$8cv$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=467&group=uk.comp.os.linux#467
Path: i2pn2.org!i2pn.org!aioe.org!8YXKAhSo8fMBpI0CH1QWtw.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Mon, 16 Aug 2021 22:59:23 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sfen3u$8cv$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<inviqqFeqbiU1@mid.individual.net> <sfebr1$rvr$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="8607"; posting-host="8YXKAhSo8fMBpI0CH1QWtw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB
 by: Java Jive - Mon, 16 Aug 2021 21:59 UTC

On 16/08/2021 19:46, Paul wrote:
> Andy Burns wrote:
>> Java Jive wrote:
>>
>>> console listing shows a very odd character sequence instead of the e
>>> acute ...
>>>      "Chat Bott'$'\302\202'', Le"
>>
>> Are you sure the filename is exactly as you say/think? What does
>>
>> ls -b
>>
>> show?

Thanks to you and 'jak' for suggesting this!

While still none of the following work ...
ls -R -b | grep '\xc2\x82'
ls -R -b | grep -E '\xc2\x82'
ls -R -b | grep '\uc282'
ls -R -b | grep -E '\uc282'
ls -R -b | grep '\u82c2'
ls -R -b | grep -E '\u82c2'
... this at least finds all the files that I'm already aware of,
suggesting that I may know about all of them ...
ls -R -b | grep -E '\\[0-7]{3}'

There are 35 files or directories at fault, nearly all are e acute, but
there are a couple of e umlaut and 6 files with both an e grave and an e
acute :-(
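
The grep pattern above can be pushed one step further to show which
escape sequences occur and how often, which helps tell the e acute cases
from the e umlaut and e grave ones. A small sketch, assuming an ls that
supports -b and a grep that supports -o; escape_census is an invented
name. If the names passed through code page 437 or 850 (an assumption),
é, ë and è would be expected as \302\202, \302\211 and \302\212.

```shell
# escape_census is a hypothetical helper: it counts each distinct run of
# octal escapes that ls -b prints below the given directory (default: .).
escape_census() {
    # ls -b prints non-printable bytes as \ooo escapes; grep -o keeps only
    # the runs of such escapes; sort | uniq -c counts each distinct run.
    ls -Rb "${1:-.}" | grep -oE '(\\[0-7]{3})+' | sort | uniq -c
}
```

A name that really contains a backslash followed by three octal digits
would be counted too, so the result is a guide rather than a proof.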

Now I have to devise a method of renaming them, in other words of
ensuring that the mv command will find them. I've just tried the
following manual command to see what happens (it'll wrap, but originally
it was all one command-line):

OLDIFS=${IFS}; IFS=$'\n'; for A in $(ls -1Rb * | grep -E
'(:|\\302\\202)'); do if [ "${A: -1}" == ":" ]; then export
LASTDIR="${A/:/}"; else pushd "${LASTDIR}"; mv ${A/\\302\\202/?}
${A/\\302\\202/é}; popd; fi; done; unset LASTDIR; IFS=${OLDIFS}

Guess what now! The files were renamed, but the slashes that were
supposed to escape the spaces were included in the name! FFS, HOW
INCONSISTENT IS THAT???!!! Why are the slashes successful in escaping
the spaces in the source name but getting included as part of the target
name? Alright, so I can programme around that, but I shouldn't have to,
the illogicality of it all is just maddening!
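
One plausible reading of the stray backslashes, not confirmed anywhere in
the thread: ls -b writes a space as backslash-space, the unquoted source
argument contains a ?, so the shell treats it as a glob pattern in which
backslash-space matches a plain space, while the target argument has no
wildcard and reaches mv as it stands, backslashes included. A sketch that
avoids parsing ls output altogether, assuming a POSIX shell, find and
sed; fix_names is an invented name, and names containing newlines are
not handled.

```shell
# fix_names is a hypothetical helper: below the given directory
# (default: .) it renames every entry whose name contains the bytes
# 0xC2 0x82, replacing them with e-acute encoded as UTF-8.
fix_names() {
    # -depth visits the contents of a directory before the directory itself,
    # so renaming a directory never breaks paths that are still to come.
    LC_ALL=C find "${1:-.}" -depth -name "$(printf '*\302\202*')" -exec sh -c '
        bad=$(printf "\302\202")    # the two bytes to replace
        good=$(printf "\303\251")   # e-acute encoded as UTF-8
        for path do
            case $path in */*) ;; *) path=./$path ;; esac
            dir=${path%/*}
            base=${path##*/}
            # Edit only the last path component, as plain text.
            new=$(printf "%s" "$base" | sed "s/$bad/$good/g")
            mv -- "$path" "$dir/$new"
        done
    ' sh {} +
}
```

Every argument is quoted, so spaces, ampersands and apostrophes in the
names need no escaping.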

> Using a Perl script, I created some examples.
> File "Y" is the php-failure induced problem name the OP has.
> File "Z" is the visually-correct one.

PHP was not involved, it was WinZip that created the problem, whereas 7z
did not, but for one thing, I didn't notice at the time, and for
another, people would have had to install software to handle *.7z files,
whereas the ability to handle *.zip files is native to many/most/all
modern OSs.

> https://i.postimg.cc/gksLyGFL/rename2-output.gif
>
> So you can create your own for a test.
>
> *********************** rename2.ps *************************
> printf("this is a test\n");
>
> $start = "Chat Bott";
> $finish = ", Le";
> $naughty1 = <\x{C3}\x{A9}> ;
> $naughty2 = <\x{E9}> ;

I think this is suffering from the same problem that all the other
approaches have had, that you're creating two characters not one. BTW,
it's hex C2, followed by hex 82.

After some further thought, I remembered about the \u regular expression
syntax. Being unsure of the correct byte order, I tried both, but
neither of the following work either, whereas logically I would have
thought that one of them should:
find . -regex ".*\uc282.*"
find . -regex ".*\u82c2.*"
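
A note on the \u attempts: in UTF-8 the byte pair C2 82 encodes the
single code point U+0082, so the number to look for would be 0082 rather
than c282 or 82c2; whether a given find or grep accepts \u at all is a
separate question. A small sketch of the arithmetic for code points below
U+0800, using only shell arithmetic and printf (utf8_hex is a name
invented here):

```shell
# utf8_hex is a hypothetical helper: it prints the UTF-8 bytes, in hex,
# of a code point below U+0800 given as decimal or 0x-prefixed hex.
utf8_hex() {
    cp=$(($1))
    if [ "$cp" -lt 128 ]; then
        # One byte: the code point itself.
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # Two bytes: 110xxxxx 10xxxxxx.
        printf '%02x %02x\n' $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    else
        echo "code points from U+0800 up are not handled in this sketch" >&2
        return 1
    fi
}
```

"utf8_hex 0x82" prints "c2 82", the sequence in the bad names, and
"utf8_hex 0xE9" prints "c3 a9", the UTF-8 form of é.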

But at least now there's hope, see above.

Tx again to all.

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sferc5$1j71$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=468&group=uk.comp.os.linux#468
Path: i2pn2.org!i2pn.org!aioe.org!8YXKAhSo8fMBpI0CH1QWtw.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Tue, 17 Aug 2021 00:12:03 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sferc5$1j71$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<sfc0r9$1709$1@gioia.aioe.org> <sfdaqc$1hq3$1@gioia.aioe.org>
<zlhr6A1+YDCIh3bGX@bongo-ra.co> <sfdsve$9bg$1@gioia.aioe.org>
<sfe1v0$u60$1@dont-email.me> <sfe3mo$1h6t$1@gioia.aioe.org>
<sfeit7$5ag$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="52449"; posting-host="8YXKAhSo8fMBpI0CH1QWtw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB
 by: Java Jive - Mon, 16 Aug 2021 23:12 UTC

On 16/08/2021 21:47, Martin Gregorie wrote:
> On Mon, 16 Aug 2021 17:28:06 +0100, Java Jive wrote:
>
>> On 16/08/2021 16:58, Martin Gregorie wrote:
>>> On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:
>>>
>>>> No luck with that either ...
>>>> ls: cannot access '*'$'\302\202''*': No such file or directory
>>>>
>>> Might be worth writing a noddy Java program to see if it can resolve
>>> your problem character codes.
>>>
>>> The Java 'char' primitive can hold multibyte character values, and the
>>> Character class provides methods to recognise character types,
>>> lengths, and non-Unicode characters.
>>
>> But I can't be sure that any of the target machines will have Java,
>> Perl, or Python installed. This has to be achieved with what will
>> normally be installed on a Linux or MacOS box.
>
> Does that matter? I thought you were treating this archived article name
> sanitization as either a one-off activity or something that doesn't
> happen regularly and, anyway, that it was something that you did on your
> system before distributing the results round your family group.

No, I have to run one or other of the programs on the machine of any
family member who has already downloaded the first and, as I've now
discovered, faulty version of the archive.

> As it happens I've just knocked up a bit of Java to see just what it can
> do in the way of automated character translation, so if you'd care to
> send me, martin@gregorie.org, a short file (100-500 chars max) containing
> a mix of readable and non-readable example text, I'll run it through my
> code.
>
> Attaching it as a gzipped file should get it here without further
> mangling.

Thanks, but I'm busy writing my own solution based on what I've already
posted.

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sffm7k$1mk6$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=473&group=uk.comp.os.linux#473
Path: i2pn2.org!i2pn.org!aioe.org!f3Ja+IUlF3LLNCdyvqay1w.user.46.165.242.91.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Tue, 17 Aug 2021 08:50:28 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sffm7k$1mk6$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<inviqqFeqbiU1@mid.individual.net> <sfebr1$rvr$1@dont-email.me>
<sfen3u$8cv$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="55942"; posting-host="f3Ja+IUlF3LLNCdyvqay1w.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Content-Language: it
X-Notice: Filtered by postfilter v. 0.9.2
 by: jak - Tue, 17 Aug 2021 06:50 UTC

On 16/08/2021 23:59, Java Jive wrote:
>> Using a Perl script, I created some examples.
>> File "Y" is the php-failure induced problem name the OP has.
>> File "Z" is the visually-correct one.
>
> PHP was not involved, it was WinZip that created the problem, whereas 7z
> did not, but for one thing, I didn't notice at the time, and for
> another, people would have had to install software to handle *.7z files,
> whereas the ability to handle *.zip files is native to many/most/all
> modern OSs.
>

mmmumble...
... WinZip probably got it wrong when saving/restoring files between
systems that have different code pages. In fact, "é" (e-acute)
corresponds to position 0x82 in code page 863 (a French code page),
which is probably not the default on your system. To work around this
problem you need to enable "Store Unicode filenames in Zip files"
in the "Advanced options" of WinZip. This can also be done on systems
that have WinZip integrated.
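
The mechanism described above can be sketched with iconv. The sketch
assumes that the names were stored in code page 850 (a guess; code pages
437 and 863 put é at the same position 0x82), that each stored byte was
then turned into the Unicode code point with the same number, and that
the local iconv knows the names CP850 and ISO-8859-1. repair_name is a
hypothetical helper, and it should only be fed names already known to be
mangled, because a correctly encoded é would be damaged by it.

```shell
# repair_name is a hypothetical helper: it prints the repaired form of
# one mangled name, under the code page 850 assumption stated above.
repair_name() {
    # Step 1: UTF-8 -> ISO-8859-1 turns the code point U+0082 back into
    #         the single byte 0x82 that was stored in the archive.
    # Step 2: reading that byte as CP850 and writing UTF-8 yields e-acute.
    printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f CP850 -t UTF-8
}
```

Because the conversion is table driven, the same call would also cover
the e umlaut and e grave names if they were mangled the same way.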


Re: Character Encoding (Was: while loop taking input from file via iconv )

<io18anFog00U1@mid.individual.net>

https://www.novabbs.com/aus+uk/article-flat.php?id=474&group=uk.comp.os.linux#474
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: use...@andyburns.uk (Andy Burns)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Tue, 17 Aug 2021 08:55:03 +0100
Lines: 8
Message-ID: <io18anFog00U1@mid.individual.net>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<inviqqFeqbiU1@mid.individual.net> <sfebr1$rvr$1@dont-email.me>
<sfen3u$8cv$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: individual.net N4jRJDYSjFfOgZW1l8JPAQDDwDfG4/IxmOmVF+sGeHRJeX7DOS
Cancel-Lock: sha1:Jb/fAGRpGubKrnVJMpYprYV1HXQ=
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.0
Content-Language: en-GB
In-Reply-To: <sfen3u$8cv$1@gioia.aioe.org>
 by: Andy Burns - Tue, 17 Aug 2021 07:55 UTC

Java Jive wrote:

> Thanks to you and 'jak' for suggesting this!
>
> While still none of the following work ...

You could show us the "ls -b" output for your previous Chat Botté
filename ...

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfg07n$g8$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=475&group=uk.comp.os.linux#475
Path: i2pn2.org!i2pn.org!aioe.org!f3Ja+IUlF3LLNCdyvqay1w.user.46.165.242.91.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Tue, 17 Aug 2021 11:41:10 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sfg07n$g8$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<inviqqFeqbiU1@mid.individual.net> <sfebr1$rvr$1@dont-email.me>
<sfen3u$8cv$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="520"; posting-host="f3Ja+IUlF3LLNCdyvqay1w.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Content-Language: it
X-Notice: Filtered by postfilter v. 0.9.2
 by: jak - Tue, 17 Aug 2021 09:41 UTC

On 16/08/2021 23:59, Java Jive wrote:
> Now I have to devise a method of renaming them, in other words of
> ensuring that the mv command will find them.  I've just tried the
> following manual command to see what happens (it'll wrap, but originally
> it was all one command-line):
>
> OLDIFS=${IFS}; IFS=$'\n'; for A in $(ls -1Rb * | grep -E
> '(:|\\302\\202)'); do if [ "${A: -1}" == ":" ]; then export
> LASTDIR="${A/:/}"; else pushd "${LASTDIR}"; mv ${A/\\302\\202/?}
> ${A/\\302\\202/é}; popd; fi; done; unset LASTDIR; IFS=${OLDIFS}
>
> Guess what now!  The files were renamed, but the slashes that were
> supposed to escape the spaces were included in the name!  FFS, HOW
> INCONSISTENT IS THAT???!!!  Why are the slashes successful in escaping
> the spaces in the source name but getting included as part of the target
> name?  Alright, so I can programme around that, but I shouldn't have to,
> the illogicality of it all is just maddening!
>
>> Using a Perl script, I created some examples.
>> File "Y" is the php-failure induced problem name the OP has.
>> File "Z" is the visually-correct one.
>
> PHP was not involved, it was WinZip that created the problem, whereas 7z
> did not, but for one thing, I didn't notice at the time, and for
> another, people would have had to install software to handle *.7z files,
> whereas the ability to handle *.zip files is native to many/most/all
> modern OSs.
>
>> https://i.postimg.cc/gksLyGFL/rename2-output.gif
>>
>> So you can create your own for a test.
>>
>> *********************** rename2.ps *************************
>> printf("this is a test\n");
>>
>> $start = "Chat Bott";
>> $finish = ", Le";
>> $naughty1 = <\x{C3}\x{A9}> ;
>> $naughty2 = <\x{E9}> ;
>
> I think this is suffering from the same problem that all the other
> approaches have had, that you're creating two characters not one.  BTW,
> it's hex C2, followed by hex 82.
>
> After some further thought, I remembered about the \u regular expression
> syntax.  Being unsure of the correct byte order, I tried both, but
> neither of the following work either, whereas logically I would have
> thought that one of them should:
>     find . -regex ".*\uc282.*"
>     find . -regex ".*\u82c2.*"
>
> But at least now there's hope, see above.
>
> Tx again to all.
>

Try this to rename the file with the strange name:

$ find . -iname `echo -e "foo\0302\0202"` -exec mv {} new_name \;
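
A variation on the command above, as a minimal sketch: bash's $'...' quoting turns \302\202 into the two raw bytes itself, so no echo -e subshell is needed. The scratch directory and the names foo and fooé are only for illustration, not taken from the real archive.

```shell
#!/usr/bin/env bash
# Scratch directory holding one file whose name ends in the bytes C2 82.
demo_dir=$(mktemp -d)
touch "$demo_dir/foo"$'\302\202'

# find receives the exact byte sequence as its -name pattern.
found=$(find "$demo_dir" -name $'foo\302\202' | wc -l)
echo "matches: $found"

# The same quoting works directly as an argument to mv.
mv -- "$demo_dir/foo"$'\302\202' "$demo_dir/fooé"
renamed=$(ls "$demo_dir")
echo "now called: $renamed"

rm -r "$demo_dir"
```

Because the bytes are produced by the shell's own quoting, the name needs no further escaping for spaces or backslashes.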

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfgbe0$15ug$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=479&group=uk.comp.os.linux#479

Path: i2pn2.org!i2pn.org!aioe.org!8YXKAhSo8fMBpI0CH1QWtw.user.46.165.242.75.POSTED!not-for-mail
From: jav...@evij.com.invalid (Java Jive)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Tue, 17 Aug 2021 13:52:13 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sfgbe0$15ug$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="38864"; posting-host="8YXKAhSo8fMBpI0CH1QWtw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101
Thunderbird/68.4.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2
 by: Java Jive - Tue, 17 Aug 2021 12:52 UTC

On 15/08/2021 12:57, Java Jive wrote:
>
> Can anyone suggest a sequence that will find the file, when put inside
> quotes as the filename in the controlling data file mentioned previously
> in the thread, so that it can just be treated like all the other lines?
> As someone here suggested the data file is now stored as UTF-8 rather
> than ANSI as it was formerly, and some example lines are given below in
> a form for easier readability in a ng  -  in reality the fields are tab
> separated but here are separated by double spacing and have been further
> abbreviated to keep them from wrapping; leading symbols such as '+' and
> '=' have special meanings for the program doing the work; and, yes, the
> commands are basically DOS commands which for Linux are translated to
> their bash equivalents:
>
> =ATTRIB -R  "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
> =RD  "./F H /_all/1o/Blessig & Heyder"
> REN  "./Chat Bott'$'\302\202'', Le"  "Chat Botté, Le"
> MOVE  "./Photo - D & M Close.png" "./Photos/D & M Close.png"
> [etc]

I've completely fixed the problem with the following code inserted
before processing the data file. Thanks for all the help here that
enabled me to do this. It'll wrap of course, sorry can't help that,
beyond reducing the tabs to two spaces:

# Search for WinZip's botched accented characters
# in the main download of v1: MacFarlane-Main.zip
# 35 pathnames affected, botched characters are:
#    Intended         Stored incorrectly as
#    Char             Octal       Hex
#    é (acute)        \302\202    \xC2\x82
#    ë (diaeresis)    \302\211    \xC2\x89
#    è (grave)        \302\212    \xC2\x8A
#    Á (acute)        µ

OLDIFS=${IFS} # Normally IFS=$' \t\n'
IFS=$'\n'
LASTREN=""
for A in $(ls -1bR | grep -E '(:|µ|\\[0-7]{3}\\[0-7]{3})')
  do
    if [ -n "${Debug}" ]
      then
        echo "A = \"${A}\""
    fi
    if [ "${A: -1}" == ":" ]
      then
        THISDIR="${A/:/}"
        if [ "${THISDIR}" == "${LASTREN/ -> .*/}" ]
          then
            THISDIR="${LASTREN/.* -> /}"
        fi
        if [ -n "${Debug}" ]
          then
            echo "THISDIR = \"${THISDIR}\""
        fi
      else
        SC="${A}"
        DS="${A}"
        while [ -n "$(echo \"${SC}\" | grep -E '(µ|\\[0-7]{3}\\[0-7]{3})')" ]
          do
            case $(echo "${SC}" | sed -E 's~^.*(µ|\\[0-7]{3}\\[0-7]{3}).*$~\1~') in
              "µ")         # A acute
                    SC="${SC//µ/?}"
                    DS="${DS//µ/Á}"
                    ;;
              "\302\202")  # e acute
                    SC="${SC//\\302\\202/?}"
                    DS="${DS//\\302\\202/é}"
                    ;;
              "\302\211")  # e diaeresis
                    SC="${SC//\\302\\211/?}"
                    DS="${DS//\\302\\211/ë}"
                    ;;
              "\302\212")  # e grave
                    SC="${SC//\\302\\212/?}"
                    DS="${DS//\\302\\212/è}"
                    ;;
            esac
          done

        DS="${DS//\\/}"
        pushd "${THISDIR}"
        echo "mv ${SC} \"${DS}\""
        if [ -z "${Dummy}" ]
          then
            mv ${SC} "${DS}"
        fi
        popd

        # Remember rename in case it's a directory containing others
        LASTREN="${THISDIR}/${A//\\ / } -> ${THISDIR}/${DS}"
        if [ -n "${Debug}" ]
          then
            echo "LASTREN = \"${LASTREN}\""
        fi

    fi
  done
IFS=${OLDIFS}
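
For comparison, a rough sketch of a variant that never parses ls output, so the backslash escapes and the ? wildcard are not needed. It is not the code used above: fix_name, rename_tree and DRY_RUN are hypothetical names, and the byte-to-character mapping is simply copied from the table in the comments at the top of the script.

```shell
#!/usr/bin/env bash
# Sketch only: fix_name, rename_tree and DRY_RUN are hypothetical names.

# Replace each mis-stored sequence with the intended character.
fix_name() {
    local s=$1
    s=${s//$'\302\202'/é}    # bytes C2 82 -> e acute
    s=${s//$'\302\211'/ë}    # bytes C2 89 -> e diaeresis
    s=${s//$'\302\212'/è}    # bytes C2 8A -> e grave
    s=${s//µ/Á}              # assumes no name contains a genuine µ
    printf '%s' "$s"
}

# Walk the tree depth-first so a directory's contents are renamed
# before the directory itself, which keeps every remaining path valid.
rename_tree() {
    local root=$1 path dir base new
    find "$root" -depth -print0 |
    while IFS= read -r -d '' path; do
        [[ $path == */* ]] || continue      # skip a bare root such as "."
        dir=${path%/*}
        base=${path##*/}
        new=$(fix_name "$base")
        [[ $new == "$base" ]] && continue   # nothing to fix in this name
        if [[ -n ${DRY_RUN:-} ]]; then
            printf 'would rename: %s -> %s\n' "$path" "$dir/$new"
        else
            mv -- "$path" "$dir/$new"
        fi
    done
}

# Usage: DRY_RUN=1 rename_tree .    (preview), then: rename_tree .
```

Reading NUL-terminated paths from find keeps spaces and other odd characters intact, and renaming only the last path component means a renamed parent directory never invalidates a path that is still waiting to be processed.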

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfgf33$aa$1@gonzo.revmaps.no-ip.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=480&group=uk.comp.os.linux#480

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
From: use...@revmaps.no-ip.org (Jasen Betts)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Organization: JJ's own news server
Message-ID: <sfgf33$aa$1@gonzo.revmaps.no-ip.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<inviqqFeqbiU1@mid.individual.net> <sfebr1$rvr$1@dont-email.me>
<sfen3u$8cv$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 17 Aug 2021 13:54:43 -0000 (UTC)
Injection-Info: gonzo.revmaps.no-ip.org; posting-host="localhost:127.0.0.1";
logging-data="330"; mail-complaints-to="usenet@gonzo.revmaps.no-ip.org"
User-Agent: slrn/1.0.3 (Linux)
X-Face: ?)Aw4rXwN5u0~$nqKj`xPz>xHCwgi^q+^?Ri*+R(&uv2=E1Q0Zk(>h!~o2ID@6{uf8s;a
+M[5[U[QT7xFN%^gR"=tuJw%TXXR'Fp~W;(T"1(739R%m0Yyyv*gkGoPA.$b,D.w:z+<'"=-lVT?6
{T?=R^:W5g|E2#EhjKCa+nt":4b}dU7GYB*HBxn&Td$@f%.kl^:7X8rQWd[NTc"P"u6nkisze/Q;8
"9Z{peQF,w)7UjV$c|RO/mQW/NMgWfr5*$-Z%u46"/00mx-,\R'fLPe.)^
Lines: 34
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Tue, 17 Aug 2021 14:00:44 UTC
Date: Tue, 17 Aug 2021 13:54:43 -0000 (UTC)
X-Received-Bytes: 2109
 by: Jasen Betts - Tue, 17 Aug 2021 13:54 UTC

On 2021-08-16, Java Jive <java@evij.com.invalid> wrote:
> On 16/08/2021 19:46, Paul wrote:
>> Andy Burns wrote:
>>> Java Jive wrote:
>>>
>>>> console listing shows a very odd character sequence instead of the e
>>>> acute ...
>>>>      "Chat Bott'$'\302\202'', Le"

That's the UTF-8 encoding (bytes C2 82) of the control character U+0082,
"break permitted here".
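
The arithmetic behind that can be checked in bash itself. This is only a sketch, and decode_utf8_pair is a hypothetical helper name: a two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy, and the code point is the x and y bits joined together.

```shell
#!/usr/bin/env bash
# Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
# Arguments may be octal (leading 0) or hex (leading 0x).
decode_utf8_pair() {
    local b1=$1 b2=$2
    printf 'U+%04X\n' $(( ((b1 & 0x1F) << 6) | (b2 & 0x3F) ))
}

decode_utf8_pair 0302 0202    # the octal values as shown by ls -b
decode_utf8_pair 0xC2 0x82    # the same bytes written in hex
```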

>>> Are you sure the filename is exactly as you say/think? What does
>>>
>>> ls -b
>>>
>>> show?
>
> Thanks to you and 'jak' for suggesting this!
>
> While still none of the following work ...
> ls -R -b | grep '\xc2\x82'
> ls -R -b | grep -E '\xc2\x82'

There's no chance of that working; try fgrep instead, or double up
the backslashes.
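
A small sketch of both suggestions, using a sample line rather than a real directory listing. grep -F is the current spelling of fgrep; the variable name line is just for illustration.

```shell
#!/usr/bin/env bash
# A line as "ls -b" prints it: the backslashes and digits are literal text.
line='Chat\ Bott\302\202,\ Le'

# Fixed-string search: every character of the pattern is taken literally.
printf '%s\n' "$line" | grep -F '\302\202'

# Regex search: each literal backslash has to be escaped with another one.
printf '%s\n' "$line" | grep -E '\\302\\202'
```

Both commands print the sample line, because each pattern matches the text of the escape sequence rather than the raw bytes.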

What does "ls -b" show?

--
Jasen.

Re: Character Encoding (Was: while loop taking input from file via iconv )

<sfhdmm$tcp$1@gioia.aioe.org>

https://www.novabbs.com/aus+uk/article-flat.php?id=483&group=uk.comp.os.linux#483

Path: i2pn2.org!i2pn.org!aioe.org!f3Ja+IUlF3LLNCdyvqay1w.user.46.165.242.91.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: alt.os.linux,uk.comp.os.linux
Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )
Date: Wed, 18 Aug 2021 00:37:10 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sfhdmm$tcp$1@gioia.aioe.org>
References: <sf6h49$15o3$1@gioia.aioe.org> <sfavf7$r6h$1@gioia.aioe.org>
<sfgbe0$15ug$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="30105"; posting-host="f3Ja+IUlF3LLNCdyvqay1w.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: it
 by: jak - Tue, 17 Aug 2021 22:37 UTC

On 17/08/2021 14:52, Java Jive wrote:
> On 15/08/2021 12:57, Java Jive wrote:
>>
>> Can anyone suggest a sequence that will find the file, when put inside
>> quotes as the filename in the controlling data file mentioned
>> previously in the thread, so that it can just be treated like all the
>> other lines? As someone here suggested the data file is now stored as
>> UTF-8 rather than ANSI as it was formerly, and some example lines are
>> given below in a form for easier readability in a ng  -  in reality
>> the fields are tab separated but here are separated by double spacing
>> and have been further abbreviated to keep them from wrapping; leading
>> symbols such as '+' and '=' have special meanings for the program
>> doing the work; and, yes, the commands are basically DOS commands
>> which for Linux are translated to their bash equivalents:
>>
>> =ATTRIB -R  "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
>> =RD  "./F H /_all/1o/Blessig & Heyder"
>> REN  "./Chat Bott'$'\302\202'', Le"  "Chat Botté, Le"
>> MOVE  "./Photo - D & M Close.png" "./Photos/D & M Close.png"
>> [etc]
>
> I've completely fixed the problem with the following code inserted
> before processing the data file.  Thanks for all the help here that
> enabled me to do this.  It'll wrap of course, sorry can't help that,
> beyond reducing the tabs to two spaces:
>
> # Search for WinZip's botched accented characters
> # in the main download of v1: MacFarlane-Main.zip
> # 35 pathnames affected, botched characters are:
> #    Intended    Stored incorrectly as
> #    Char        Octal        Hex
> #    é (acute)    \302\202    \xC2\x82
> #    ë (diaeresis)    \302\211    \xC2\x89
> #    è (grave)    \302\212    \xC2\x8A
> #    Á (acute)    µ
>
> OLDIFS=${IFS} # Normally IFS=$' \t\n'
> IFS=$'\n'
> LASTREN=""
> for A in $(ls -1bR | grep -E '(:|µ|\\[0-7]{3}\\[0-7]{3})')
>   do
>     if [ -n "${Debug}" ]
>       then
>         echo "A = \"${A}\""
>     fi
>     if [ "${A: -1}" == ":" ]
>       then
>         THISDIR="${A/:/}"
>         if [ "${THISDIR}" == "${LASTREN/ -> .*/}" ]
>           then
>             THISDIR="${LASTREN/.* -> /}"
>         fi
>         if [ -n "${Debug}" ]
>           then
>             echo "THISDIR = \"${THISDIR}\""
>         fi
>       else
>         SC="${A}"
>         DS="${A}"
>         while [ -n "$(echo \"${SC}\" | grep -E
> '(µ|\\[0-7]{3}\\[0-7]{3})')" ]
>           do
>             case $(echo "${SC}" | sed -E
> 's~^.*(µ|\\[0-7]{3}\\[0-7]{3}).*$~\1~') in
>               "µ")         # A acute
>                     SC="${SC//µ/?}"
>                     DS="${DS//µ/Á}"
>                     ;;
>               "\302\202")  # e acute
>                     SC="${SC//\\302\\202/?}"
>                     DS="${DS//\\302\\202/é}"
>                     ;;
>               "\302\211")  # e diaeresis
>                     SC="${SC//\\302\\211/?}"
>                     DS="${DS//\\302\\211/ë}"
>                     ;;
>               "\302\212")  # e grave
>                     SC="${SC//\\302\\212/?}"
>                     DS="${DS//\\302\\212/è}"
>                     ;;
>             esac
>           done
>
>         DS="${DS//\\/}"
>         pushd "${THISDIR}"
>         echo "mv ${SC} \"${DS}\""
>         if [ -z "${Dummy}" ]
>           then
>             mv ${SC} "${DS}"
>         fi
>         popd
>
>         # Remember rename in case it's a directory containing others
>         LASTREN="${THISDIR}/${A//\\ / } -> ${THISDIR}/${DS}"
>         if [ -n "${Debug}" ]
>           then
>             echo "LASTREN = \"${LASTREN}\""
>         fi
>
>     fi
>   done
> IFS=${OLDIFS}
>

Since I had also tried writing my own version of the shell script:

These are the files I created for testing:

$ ls -1 jak/foo*
'jak/foo'$'\302\202'
'jak/foo'$'\302\202\302\202'
'jak/foo'$'\302\202\302\211'
'jak/foo'$'\302\212\302\202'
'jak/foo'$'\302\212\302\202''foo'

This is the result of the script:

$ ./renbadch
mv "./jak/foo\302\202" "./jak/fooé"
mv "./jak/foo\302\202\302\202" "./jak/fooéé"
mv "./jak/foo\302\202\302\211" "./jak/fooéë"
mv "./jak/foo\302\212\302\202" "./jak/fooèé"
mv "./jak/foo\302\212\302\202foo" "./jak/fooèéfoo"

This is the code:

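#! /usr/bin/bash

# Text before a pair of octal escapes, then the two escapes themselves
regex='([^\\]*[^0-7]*)(\\[0-7]{3})(\\[0-7]{3})'

while read -r ll
do
    orig=$ll
    transl=""
    while [[ $ll =~ $regex ]]
    do
        start=${BASH_REMATCH[1]}
        # Second escape without its backslash, e.g. 202
        goodch=$(printf %d ${BASH_REMATCH[3]:1})
        # Turn those octal digits into a byte and convert it from CP863
        newch=$(echo -e "\0${goodch}" | iconv -f 'CP863' -t 'UTF-8')
        transl="${transl}${start}${newch}"
        # Drop everything up to and including the text just matched
        m=${BASH_REMATCH[0]}
        ll=${ll##*"$m"}
    done
    # Only prints the mv command; nothing is renamed
    echo "mv \"${orig}\" \"${transl}${ll}\""
done < <(find . -type f -exec ls -1b {} + | grep -E '\\[0-7]{3}\\[0-7]{3}')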
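
The conversion step can also be tried on its own. The sketch below assumes an iconv that knows CP850 (GNU iconv does); cp_to_utf8 is a hypothetical helper, not part of the script above.

```shell
#!/usr/bin/env bash
# Convert one byte, given as octal digits, from a DOS code page to UTF-8.
# The second argument is optional and defaults to CP850 (an assumption).
cp_to_utf8() {
    local octal=$1 codepage=${2:-CP850}
    printf '%b' "\\0${octal}" | iconv -f "$codepage" -t UTF-8
}

cp_to_utf8 202; echo    # second byte of \302\202 -> é
cp_to_utf8 211; echo    # second byte of \302\211 -> ë
cp_to_utf8 212; echo    # second byte of \302\212 -> è
cp_to_utf8 265; echo    # byte 0xB5, shown as µ in Latin-1 -> Á
```

Using CP850 here instead of CP863 is a guess based on the Á row of the table quoted above: byte 0xB5 is Á in CP850 and µ in Latin-1, while the three e variants sit at the same positions in both code pages.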
