devel / comp.lang.tcl / Re: Reading Unicode text in a non-localized application

Reading Unicode text in a non-localized application

<858fe625-aa8d-4e12-bccf-2a89730e30adn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10156&group=comp.lang.tcl#10156

 by: Phillip Brooks - Tue, 15 Nov 2022 20:22 UTC

Hi,

We have noticed a problem in our application that started occurring with our transition from Tcl 8.4 to Tcl 8.6. We read some user-provided text using Tcl that eventually gets printed by our application. Although our application is not localized, enterprising users found that they can enter Unicode text into the file, and when we print it out from C++ it comes out the same way it came in. When they started using the Tcl 8.6 version of our product, that stopped working, and now garbage is printed where the nice Unicode output was printed previously.

Here is a small example program that illustrates this problem:

#include <iostream>
#include <sstream>
#include <fstream>
#include <tcl.h>
#include <string.h>

void dump_buffer( Tcl_Obj* read_obj_ptr ) {
    size_t buflen = strlen( read_obj_ptr->bytes );
    for( size_t i=0; i != buflen; ++i ) {
        if ( i > 0 && ( i % 10 == 0 )) { std::cout << std::endl; }
        unsigned char c = read_obj_ptr->bytes[i];
        std::cout << (unsigned int)c << " ";
    }
    std::cout << std::endl;
}

int main()
{
    Tcl_Interp * interp = Tcl_CreateInterp();
    //Tcl_SetSystemEncoding(interp, "utf-8");

    Tcl_Channel fc = Tcl_OpenFileChannel(interp, "file", "r", 0644);
    if (!fc)
    {
        std::cout << "ERROR: Cannot open input TVF file for reading" << std::endl;
        return 1;
    }

    Tcl_Obj *read_obj_ptr = Tcl_NewObj();
    int chars_read = Tcl_ReadChars(fc, read_obj_ptr, -1, 0);
    char* str = Tcl_GetStringFromObj( read_obj_ptr, nullptr );
    std::cout << "TCL READ String\n";
    std::cout << str << std::endl;
    dump_buffer( read_obj_ptr );

    Tcl_Close(interp, fc);

    std::ifstream fc1("file");
    if ( fc1.fail() ) {
        std::cout << "ERROR: Cannot open input TVF file for reading" << std::endl;
        fc1.close();
        return 1;
    }
    std::stringstream buffer;
    buffer << fc1.rdbuf();

    if ( fc1.fail() || buffer.str().empty() )
    {
        std::cout << "ERROR: No data read from input TVF file" << std::endl;
        fc1.close();
        return 1;
    }
    fc1.close();

    Tcl_Obj *read_obj_ptr1 = Tcl_NewObj();
    Tcl_AppendToObj(read_obj_ptr1, buffer.str().c_str(), -1);
    std::cout << "C++ READ\n";
    std::cout << read_obj_ptr1->bytes << std::endl;
    dump_buffer( read_obj_ptr1 );

    return 0;
}

The file "file" contains Unicode:
$ cat file
Korean : 서요한 가나다라 아야어여
Armenian : Թեստ
English : This line is redundant :)

A Tcl only version of the program is:

set f [ open file "r" ]
set lines [ read $f ]
puts "Tcl script READ Unicode"
puts $lines

It behaves as expected in both Tcl 8.4 and Tcl 8.6.

Note that the commented call to Tcl_SetSystemEncoding will cause the program to work the same way for Tcl 8.6 and Tcl 8.4.

The questions I have are:

What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior? It seems that with Tcl 8.4, we were able to get the original content of the strings, but that Tcl 8.6 is altering the input in some way that makes it incompatible with C++.

Is setting the Tcl_SetSystemEncoding call a reasonable fix for this, or will we run into other difficulties now or in the future (I notice that there are a lot of Unicode enhancements set up for Tcl 8.7 and Tcl 9)? What happens if someone gives us some non utf-8 encoded string? Is there a way to support that in this case?

Be patient - I am not, by any means, a Unicode expert.

Thanks!

Re: Reading Unicode text in a non-localized application

<a1503d3e-279e-4f2f-89d2-537173d16ae5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10157&group=comp.lang.tcl#10157

 by: Phillip Brooks - Tue, 15 Nov 2022 20:33 UTC

Note that I am building and running this on Red Hat Enterprise Linux 6.

When I build and run this on Red Hat Enterprise Linux 8, the Tcl 8.4 case also fails to print properly.

Re: Reading Unicode text in a non-localized application

<tl1243$252go$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=10158&group=comp.lang.tcl#10158

 by: Rich - Tue, 15 Nov 2022 21:59 UTC

Phillip Brooks <philbrks@gmail.com> wrote:
> What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?

Most likely, Tcl became more properly Unicode aware.

> It seems that with Tcl 8.4, we were able to get the original content
> of the strings, but that Tcl 8.6 is altering the input in some way
> that makes it incompatible with C++.

8.4 was likely using a setting that was transparent while 8.6 is likely
trying to convert the incoming data into Tcl's internal UTF-8 variant.

> Is setting the Tcl_SetSystemEncoding call a reasonable fix for this,
> or will we run into other difficulties now or in the future (I notice
> that there are a lot of Unicode enhancements set up for Tcl 8.7 and
> Tcl 9)?

The 'system' encoding is also used when passing strings to the OS API,
modifying it /may/ cause other strange issues.

> What happens if someone gives us some non utf-8 encoded string? Is
> there a way to support that in this case?

Unless you can:
1) be informed of what actual encoding was used; or
2) write a bunch of code to try to infer the encoding used (and this
will likely be fragile)
then there is not really a general way to 'interpret' any possible
encoding.
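
To make option 2 concrete, here is a minimal sketch of the simplest such check: testing whether a byte buffer is well-formed UTF-8. The function name `looks_like_utf8` is made up for illustration and is not part of Tcl. It is only a heuristic: plain ASCII passes, and text in some other encoding can happen to pass too, which is exactly the fragility mentioned above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Tcl API): returns true if every byte sequence
// in `s` is well-formed UTF-8. Overlong forms, surrogates (U+D800..U+DFFF)
// and values above U+10FFFF are rejected.
bool looks_like_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // number of continuation bytes expected
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > n - i - 1) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += extra + 1;
    }
    return true;
}
```

A caller could run this over the raw file bytes and fall back to some other assumed encoding when it returns false, but that fallback would still be a guess.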

However, if you just want the exact bytes present in the files to come
back out, you could set the channels to 'binary' mode and that will
disable all the translating of bytes between encodings.

You need to look at the "fconfigure" command for adjusting the encoding
used for file channels (the C API equivalent is the
Tcl_SetChannelOption function). You may simply need to set the
input and output channels to utf-8 for things to work correctly again.

Re: Reading Unicode text in a non-localized application

<20221115194424.298e9ac9@lud1.home>

https://www.novabbs.com/devel/article-flat.php?id=10159&group=comp.lang.tcl#10159

 by: Luc - Tue, 15 Nov 2022 22:44 UTC

On Tue, 15 Nov 2022 21:59:32 -0000 (UTC), Rich wrote:

> Phillip Brooks <philbrks@gmail.com> wrote:
> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?
>
> Most likely, Tcl became more properly Unicode aware.

At least that I can attest. I've had these two applications that I made for
myself for about 15 years, they use the clipboard and text widgets.
They never handled Unicode correctly in the 8.4 and 8.5 era, and I just
gave up on that, learning to live in resignation with some occasional
garbled content.

Only a few months ago I decided to try to fix them and it was very easy
because the old problems I used to have with Unicode just weren't there
anymore. I just removed the ugly kludges I had had in place to hide some
of the problem and everything just worked.

--
Luc

Re: Reading Unicode text in a non-localized application

<f61aa3e7-d09b-47d7-b325-c37c441096c2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10160&group=comp.lang.tcl#10160

 by: Phillip Brooks - Wed, 16 Nov 2022 00:34 UTC

Thanks for the response, Rich. It was very helpful.

On Tuesday, November 15, 2022 at 1:59:36 PM UTC-8, Rich wrote:
> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?

> Most likely, Tcl became more properly Unicode aware.

When I look through the 8.4 and 8.5 Tcl release notes, I am not finding anything about Unicode. Similarly for the list of TIPs - there are several Unicode TIPs for 8.7/9.0, though.

> Unless you can:
> 1) be informed of what actual encoding was used; or
> 2) write a bunch of code to try to infer the encoding used (and this
> will likely be fragile)
> then there is not really a general way to 'interpret' any possible
> encoding.

That's what I was thinking.

> you could set the channels to 'binary' mode and that will
> disable all the translating of bytes between encodings.

The binary setting didn't help - rather it breaks 8.4 in the same way that 8.6 is broken. This was after calling:

Tcl_SetChannelOption(interp, fc, "-encoding", "binary");

> You need to look at the "fconfigure" command for adjusting the encoding
> used for file channels (the C API equivalent is the
> Tcl_SetChannelOption function). You may simply need to set the
> input and output channels to utf-8 for things to work correctly again.

Thanks for that pointer, fconfigure and Tcl_Get/SetChannelOption have been very illuminating.

In Tcl 8.4, the "C" Tcl_Channel seems to have "-encoding" set to "identity" by default. In Tcl 8.6, it is set to "iso8859-1" by default. In the Tcl script, however, fconfigure shows default "-encoding" set to "utf-8" for both Tcl 8.4 and Tcl 8.6.

Setting "-encoding" to "identity" in Tcl 8.6 seems to reestablish the previous behavior. Also, setting it explicitly to "utf-8" works as well. Setting Tcl_SetSystemEncoding to "utf-8" changes the default to "utf-8" in both Tcl 8.4 and Tcl 8.6.

I see this in the fconfigure doc page under -encoding:

"The default encoding for newly opened channels is the same platform- and locale-dependent system encoding used for interfacing with the operating system, as returned by encoding system."

Does that mean that the user can alter this behavior by setting an environment variable on Unix? Any idea where I can find out more about that? I am thinking that if I can provide the user with an environment variable setting, then I won't have to worry about breaking someone else's clever use of some other international strings in some other place by forcing it to utf-8. I tried explicitly setting LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid breaking things in new ways for Tcl 8.7 and Tcl 9.

Re: Reading Unicode text in a non-localized application

<tl1df4$1ij1$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=10161&group=comp.lang.tcl#10161

 by: saitology9 - Wed, 16 Nov 2022 01:13 UTC

On 11/15/2022 3:22 PM, Phillip Brooks wrote:
> Hi,
>
> We have noticed a problem in our application that started occurring with our transition from Tcl 8.4 to Tcl 8.6. We read some user-provided text using Tcl that eventually gets printed by our application. Although our application is not localized, enterprising users found that they can enter Unicode text into the file, and when we print it out from C++ it comes out the same way it came in. When they started using the Tcl 8.6 version of our product, that stopped working, and now garbage is printed where the nice Unicode output was printed previously.
>

You seem to have access to both versions of your application.
Therefore, you could find out the exact encoding that was in place in 8.4
and enforce it in 8.6, or change it to something else.

# find out current encoding
% encoding system
cp1251

# change it to something else
% encoding system unicode
unicode

# check
% encoding system
unicode

# list all
% encoding names
....

Re: Reading Unicode text in a non-localized application

<tl1h8k$267h3$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=10162&group=comp.lang.tcl#10162

 by: Rich - Wed, 16 Nov 2022 02:17 UTC

Phillip Brooks <philbrks@gmail.com> wrote:
> Thanks for the response, Rich. It was very helpful.
>
> On Tuesday, November 15, 2022 at 1:59:36 PM UTC-8, Rich wrote:
>> > What changed between Tcl 8.4 and Tcl 8.6 to alter the behavior?
>
>> Most likely, Tcl became more properly Unicode aware.
>
> When I look through the 8.4 and 8.5 Tcl release notes, I am not
> finding anything about Unicode. Similarly for the list of TIPs -
> there are several Unicode TIPs for 8.7/9.0, though.

The change might not necessarily have referenced Unicode; it might have
referred to channel encodings, or other terms. Note I'm not saying you
are wrong, just that if changes did happen (and 8.4 to 8.6 is a wide
time window) they might not have used the word "Unicode" but still
might have been impactful.

>> Unless you can:
>> 1) be informed of what actual encoding was used; or
>> 2) write a bunch of code to try to infer the encoding used (and this
>> will likely be fragile)
>> then there is not really a general way to 'interpret' any possible
>> encoding.
>
> That's what I was thinking.
>
>> you could set the channels to 'binary' mode and that will disable
>> all the translating of bytes between encodings.
>
> The binary setting didn't help - rather it breaks 8.4 in the same way
> that 8.6 is broken. This was after calling:
>
> Tcl_SetChannelOption(interp, fc, "-encoding", "binary");

Interesting...

>> You need to look at the "fconfigure" command for adjusting the
>> encoding used for file channels (the C API equivalent is the
>> Tcl_SetChannelOption function). You may simply need to set the
>> input and output channels to utf-8 for things to work correctly
>> again.
>
> Thanks for that pointer, fconfigure and Tcl_Get/SetChannelOption have
> been very illuminating.
>
> In Tcl 8.4, the "C" Tcl_Channel seems to have "-encoding" set to
> "identity" by default. In Tcl 8.6, it is set to "iso8859-1" by
> default. In the Tcl script, however, fconfigure shows default
> "-encoding" set to "utf-8" for both Tcl 8.4 and Tcl 8.6.

If your users have been sneaking in UTF-8 encoded data, and the channel
is now set for iso8859-1, you'll get ugly messes out as a result.

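I.e., if your users entered a Unicode right single quote (U+2019) but
the channel is set to iso8859-1, its three UTF-8 bytes (0xE2 0x80 0x99)
are read as three separate characters: â followed by two unprintable
control characters, instead of a right single quote mark.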

But, if your users have been entering UTF-8 encoded text, you'd also be
safe setting the channels to UTF-8 as well.
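
As a rough illustration of that mismatch, the sketch below mimics what a channel set to iso8859-1 would do with UTF-8 input: each byte is taken as one character and stored again as UTF-8. `latin1_to_utf8` is a hypothetical stand-in written for this example, not a Tcl function, and it assumes the program's internal string form is ordinary UTF-8.

```cpp
#include <string>

// Hypothetical helper (not a Tcl API): treat each input byte as an
// ISO-8859-1 character (code points 0..255) and write that character
// back out as UTF-8. Bytes below 0x80 are copied unchanged; every other
// byte becomes a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

Feeding it the three bytes of U+2019 (0xE2 0x80 0x99) gives six bytes (0xC3 0xA2 0xC2 0x80 0xC2 0x99), so a byte dump of the string grows by one byte per non-ASCII input byte, which is one way to spot this kind of double encoding.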

> Setting "-encoding" to "identity" in Tcl 8.6 seems to reestablish the
> previous behavior. Also, setting it explicitly to "utf-8" works as
> well. Setting Tcl_SetSystemEncoding to "utf-8" changes the default
> to "utf-8" in both Tcl 8.4 and Tcl 8.6.

The Tcl wiki has this to say about the 'identity' encoding:

https://wiki.tcl-lang.org/page/encoding+system

    Can someone elaborate on the meaning of the 'identity' encoding?
    When using freewrap I get:

        % encoding system
        identity

    What is this and what is it used for?

    schlenk 2005-06-27: The identity encoding is for testing purposes,
    it should not be used without very good reasons. If you see your
    encoding system set to identity, you are missing the proper encoding
    files for your setup. This happens with tclkit-sh.exe on windows or
    other wrapped applications which do not include the right encodings
    for the local system they are running on.

    Googie 2012-08-09: The 'identity' encoding is the default encoding
    in my Tcl, even I use regular tclsh and not tclkit. Why is so? (I
    use Linux)

    PYK 2018-12-04: It is so because your Tcl configuration is borked.

Is your code running inside a 'wrapped' executable? If the Wiki
statements here are correct, the fact that you get 'identity' on 8.4
would imply that "it worked" was more a stroke of luck than anything
else.

If setting to UTF-8 'fixes things' then your likely best course is to
set the channels to UTF-8 and let it be. UTF-8 is all but the
'universal' encoding now for just about everything, so you'd be more
'future proof' to explicitly set UTF-8 than not.

> I see this in the fconfigure doc page under -encoding:
>
> "The default encoding for newly opened channels is the same platform-
> and locale-dependent system encoding used for interfacing with the
> operating system, as returned by encoding system."
>
> Does that mean that the user can alter this behavior by setting an
> environment variable on Unix? Any idea where I can find out more
> about that?

Sadly, no. And the only real mention of LANG= in the wiki is that Tcl
uses it to guess what encoding to set as 'system' when it initializes.

> I am thinking that if I can provide the user with an environment
> variable setting, then I won't have to worry about breaking someone
> else's clever use of some other international strings in some other
> place by forcing it to utf-8. I tried explicitly setting
> LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid
> breaking things in new ways for Tcl 8.7 and Tcl 9.

Try LANG=C, which might 'trick' things. But if you do want to avoid
future breakage, if switching to 'utf-8' 'fixes' things now, then that
switch should cause less breakage in the future than not. Anything
else you do would just be a band-aid over another band-aid and itself
likely to subtly break in other ways in the future.

Re: Reading Unicode text in a non-localized application

<ygaedu2c1gr.fsf@akutech.de>

https://www.novabbs.com/devel/article-flat.php?id=10173&group=comp.lang.tcl#10173

 by: Ralf Fassel - Wed, 16 Nov 2022 16:23 UTC

* Rich <rich@example.invalid>
| Phillip Brooks <philbrks@gmail.com> wrote:
| > I am thinking that if I can provide the user with an environment
| > variable setting, then I won't have to worry about breaking someone
| > else's clever use of some other international strings in some other
| > place by forcing it to utf-8. I tried explicitly setting
| > LANG=en_US.UTF-8, but that didn't help. I'd also like to avoid
| > breaking things in new ways for Tcl 8.7 and Tcl 9.
>
| Try LANG=C, which might 'trick' things. But if you do want to avoid
| future breakage, if switching to 'utf-8' 'fixes' things now, then that
| switch should cause less breakage in the future than not. Anything
| else you do would just be a band-aid over another band-aid and itself
| likely to subtly break in other ways in the future.

Linux/Opensuse 15.4:

$ env LANG=de_DE.UTF-8 tclsh
% fconfigure stdout -encoding
utf-8

$ env LANG=en_US.UTF-8 tclsh
% fconfigure stdout -encoding
utf-8

$ env LANG=C tclsh
% fconfigure stdout -encoding
iso8859-1

So LANG=C is probably not the Right Thing in the context of this thread.

If the LANG=en_US.UTF-8 did not work for the OP, most likely he had set
some other env-vars (namely LC_ALL or LC_CTYPE):

unix/tclUnixInit.c, Tcl_GetEncodingNameFromEnvironment():
    /*
     * Determine the current encoding from the LC_* or LANG environment
     * variables.
--<snip-snip>--
    encoding = getenv("LC_ALL");

    if (encoding == NULL || encoding[0] == '\0') {
        encoding = getenv("LC_CTYPE");
    }
    if (encoding == NULL || encoding[0] == '\0') {
        encoding = getenv("LANG");
    }
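
To check which variable wins in a given shell, the lookup order in the excerpt above can be reproduced outside Tcl. This is only a sketch of that order; `effective_locale_setting` is a name made up here, and how Tcl then maps the value to an encoding name is not shown.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not a Tcl API): return the value of the first of
// LC_ALL, LC_CTYPE, LANG that is set and non-empty, or an empty string
// if none of them is.
std::string effective_locale_setting() {
    const char* names[] = {"LC_ALL", "LC_CTYPE", "LANG"};
    for (const char* name : names) {
        const char* value = std::getenv(name);
        if (value != nullptr && value[0] != '\0') {
            return value;
        }
    }
    return std::string();
}
```

With LC_CTYPE=C exported somewhere in the environment, for example, this returns "C" even when LANG=en_US.UTF-8 is set, which would match the symptom of LANG seeming to have no effect.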

R'

Re: Reading Unicode text in a non-localized application

<646da78d-cf59-422e-b70c-e83d7aba6de3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10176&group=comp.lang.tcl#10176

 by: Phillip Brooks - Wed, 16 Nov 2022 17:16 UTC

The "encoding system" command doesn't seem to tell me anything meaningful about the default encoding behavior I am observing. I am finding various builds of Tcl, both 8.4 and 8.6, that set it in different ways, possibly driven by something in the install tree?

From my 8.4 product install tree:
$MGC_HOME/bin/tclsh
% encoding system
iso8859-1

From my generic 8.4 build:
$ /usr/local/tcl8.4b/bin/tclsh8.4
% encoding system
utf-8

As mentioned previously, we don't see issues in a pure Tcl script (see main.tcl in the original post), but only when creating a Tcl interpreter from C/C++ code.

Perhaps it is something that gets handled during initialization and isn't being initialized properly for Tcl 8.6?

I do note that there are a lot of references to iso8859-1 in the Tcl source tree. One of them is in unix/README regarding the configure script:

--with-encoding=ENCODING  Specifies the encoding for compile-time
                          configuration values. Defaults to iso8859-1,
                          which is also sufficient for ASCII.

Might it be that I can ask the customer to use iso8859-1 encoding instead of utf-8 for their localized comments?

Re: Reading Unicode text in a non-localized application

<tl36qf$2cnfm$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=10178&group=comp.lang.tcl#10178

 by: Harald Oehlmann - Wed, 16 Nov 2022 17:32 UTC

On 16.11.2022 at 18:16, Phillip Brooks wrote:

I don't know if this was mentioned before.
The Tcl initialization code changed. To initialize the static state, first:
Tcl_FindExecutable(argv)
should be called.

Hope this helps,
Harald

Re: Reading Unicode text in a non-localized application

 by: Phillip Brooks - Wed, 16 Nov 2022 19:03 UTC

On Wednesday, November 16, 2022 at 9:33:08 AM UTC-8, Harald Oehlmann wrote:

> I don't know if this was mentioned before.
> The Tcl initialization code changed. To initialize the static state, first:
> Tcl_FindExecutable(argv)
> should be called.

That helps immensely. If I add the Tcl_FindExecutable(argv) call before creating the interpreter, it resolves the issue in my small test case. We'll try that in the main application and see how it goes.

Thanks!

Re: Reading Unicode text in a non-localized application

 by: Harald Oehlmann - Thu, 17 Nov 2022 07:40 UTC

On 16.11.2022 at 20:03, Phillip Brooks wrote:
> On Wednesday, November 16, 2022 at 9:33:08 AM UTC-8, Harald Oehlmann wrote:
>
>> I don't know if this was mentioned before.
>> The Tcl initialization code changed. To initialize the static state, first:
>> Tcl_FindExecutable(argv)
>> should be called.
>
> That helps immensely - If I add the Tcl_FindExecutable(argv) call before creating the interpreter, it resolves the issue in my small testcase. We'll try that in the main application and see how it goes.
>
> Thanks!

Great to hear. Kudos to the Tcl designers, who worked a lot on the
embedding issue.
Harald

Re: Reading Unicode text in a non-localized application

 by: Ralf Fassel - Thu, 17 Nov 2022 10:33 UTC

* Phillip Brooks <philbrks@gmail.com>
| Might it be that I can ask the customer to use iso8859-1 encoding
| instead of utf-8 for their localized comments?

Don't. UTF-8 is the way to go. iso8859-1 will not even transfer
properly to Windows, where the default codepage for Western Europe
(cp1252) is subtly different from iso8859-1 in the range 128-159
(0x80-0x9F).

R'
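
To make the cp1252 point concrete, here is a minimal C sketch (not from the thread; all function names are made up for illustration). It decodes a single byte under each encoding. The two agree everywhere except 0x80-0x9F, where iso8859-1 has C1 control codes and cp1252 has printable characters such as the euro sign.

```c
#include <assert.h>
#include <stdint.h>

/* iso8859-1: every byte maps to the Unicode code point of the same value. */
uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* cp1252: identical to iso8859-1 except for bytes 0x80..0x9F.
 * 0xFFFD marks the five bytes that cp1252 leaves unassigned
 * (an illustrative choice for this sketch). */
uint32_t cp1252_to_unicode(unsigned char b)
{
    static const uint16_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };

    if (b >= 0x80 && b <= 0x9F) {
        return high[b - 0x80];
    }
    return b;
}

/* Returns 1 if the byte decodes to the same code point in both encodings. */
int same_in_both(unsigned char b)
{
    return latin1_to_unicode(b) == cp1252_to_unicode(b);
}

/* Counts how many of the 256 byte values decode differently. */
int count_differences(void)
{
    int b;
    int count = 0;

    for (b = 0; b < 256; b++) {
        if (!same_in_both((unsigned char)b)) {
            count++;
        }
    }
    assert(count <= 32);   /* only 0x80..0x9F can differ */
    return count;
}
```

A file written as cp1252 and read as iso8859-1 therefore looks fine until it contains one of those 32 byte values, for example curly quotes or the euro sign.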

Re: Reading Unicode text in a non-localized application

 by: Phillip Brooks - Thu, 17 Nov 2022 16:57 UTC

Unfortunately, using Tcl_FindExecutable(argv), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be more than setting the executable name - also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?

Ralf, thanks for the info. Also, in searching for info about iso8859-1, I found that it isn't suitable for Korean anyway, as it only covers languages written in the Latin alphabet.

Re: Reading Unicode text in a non-localized application

 by: Harald Oehlmann - Thu, 17 Nov 2022 17:12 UTC

On 17.11.2022 at 17:57, Phillip Brooks wrote:
> Unfortunately, using Tcl_FindExecutable(argv), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be more than setting the executable name - also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?
>
> Ralf - thanks for the info. Also, in searching for info about iso8859-1, it isn't suitable for Korean anyway as it only covers Roman alphabet derivatives.

In one project
https://wiki.tcl-lang.org/page/Embedding+TCL+program+in+DLL
I spent a lot of time debugging the embedding code.
Tcl_FindExecutable(NULL) does a lot more than find the executable.
I don't remember exactly where the system encoding gets set,
but it happens somewhere along the way.
You may need to call Tcl_Init after creating the interpreter...

Take care,
Harald

Re: Reading Unicode text in a non-localized application

 by: Ralf Fassel - Thu, 17 Nov 2022 17:13 UTC

* Phillip Brooks <philbrks@gmail.com>
| Unfortunately, using Tcl_FindExecutable(argv), which works in the
| small example program, is not working in our application. What is
| this call doing?

Read the source, Luke.

tcl8.6.13: generic/tclEncoding.c:1449
void
Tcl_FindExecutable(
    const char *argv0)          /* The value of the application's argv[0]
                                 * (native). */
{
    TclInitSubsystems();
    TclpSetInitialEncodings();
    TclpFindExecutable(argv0);
}

Could you show the relevant code from your application (i.e. the
Tcl_Open* calls, the write calls etc) together with what happens, and
what you expect to happen?

| Ralf - thanks for the info. Also, in searching for info about
| iso8859-1, it isn't suitable for Korean anyway as it only covers Roman
| alphabet derivatives.

iso8859-1 also does not even contain €, you need iso8859-15 for that ;-)

R'
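
On Unix, the TclpSetInitialEncodings() step in the listing above is presumably where the system encoding gets chosen, and the choice appears to follow the process locale (LC_ALL, LC_CTYPE, LANG). That is an assumption about Tcl internals, not something the excerpt shows. A libc-only sketch, assuming a POSIX system, that reports what the C library itself sees in the current environment; probe_codeset and print_codeset are made-up helper names:

```c
#define _POSIX_C_SOURCE 200112L  /* for nl_langinfo() under strict -std= modes */

#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Copies the character set name of the environment's locale into buf.
 * Returns 0 on success, -1 if the name does not fit. */
int probe_codeset(char *buf, size_t buflen)
{
    const char *codeset;

    assert(buf != NULL && buflen > 0);

    /* "" means: take the locale from LC_ALL / LC_CTYPE / LANG. If those
     * name a locale that is not installed, setlocale() returns NULL and
     * the process stays in the "C" locale. */
    setlocale(LC_CTYPE, "");

    codeset = nl_langinfo(CODESET);
    if (strlen(codeset) >= buflen) {
        return -1;
    }
    strcpy(buf, codeset);
    return 0;
}

/* Prints the codeset, e.g. to compare a launcher's environment with an
 * interactive shell. */
void print_codeset(void)
{
    char name[64];

    if (probe_codeset(name, sizeof name) == 0) {
        printf("locale codeset: %s\n", name);
    }
}
```

Calling print_codeset() once from the small test program and once from the real application, each in its normal launch environment, should show whether the two processes start with the same locale settings.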

Re: Reading Unicode text in a non-localized application

 by: saitology9 - Thu, 17 Nov 2022 19:23 UTC

On 11/16/2022 12:16 PM, Phillip Brooks wrote:
>
> As mentioned previously, we don't see issues in a pure Tcl script (see main.tcl in the original post), but only when creating a Tcl interpreter from C/C++ code.
>

Well, the idea was that you'd find out which encoding works on your
client side and enforce that everywhere. However, ....

> Perhaps it is something that gets handled during initialization and isn't being initialized properly for Tcl 8.6?
>

This is interesting. You are embedding Tcl in a larger C/C++
application and as you state, Tcl takes care of things fine. So, if you
still have the issue, it would behoove you to look at the rest of the
C/C++ program. Namely, I would expect that you'd have to handle the
encoding there as well. I am not sure if the embedded Tcl interpreter's
control reaches outwards into the embedding system.

Re: Reading Unicode text in a non-localized application

 by: briang - Thu, 17 Nov 2022 20:39 UTC

On Thursday, November 17, 2022 at 8:57:40 AM UTC-8, Phillip Brooks wrote:
> Unfortunately, using Tcl_FindExecutable(argv), which works in the small example program, is not working in our application. What is this call doing? Clearly it must be more than setting the executable name - also, in my small testcase, I don't see how knowing the executable name (which is nowhere near the Tcl install tree) helps with anything. Does anyone know what is going on under the covers there?

Are you running multi-threaded? Are you running multiple interps in multiple threads? I think you need to call Tcl_FindExecutable(NULL) in each thread, before creating any interps in the thread.

-Brian

Re: Reading Unicode text in a non-localized application

 by: Christian Werner - Thu, 17 Nov 2022 21:21 UTC

> Unfortunately, using Tcl_FindExecutable(argv), which works in the small example program, is not working in our application......

A larger C++ based program? Does it have global constructors? Which run before main()? Which call Tcl_SomeThing()?

Re: Reading Unicode text in a non-localized application

 by: Phillip Brooks - Mon, 28 Nov 2022 16:14 UTC

It turns out that the difference between our main C++ application and the smaller test program is a wrapper script that launches the main application and also unsets the LANG variable. I think this was done in response to an earlier case where some particular setting of LANG was causing problems with our non-localized Tk GUI code. Having LANG unset or set to an empty string also keeps the encoding setup that Tcl_FindExecutable was doing from taking effect. I think we'll need to hard-wire LANG to en_US.UTF-8 or some such.

Thanks for all the help in tracking this down.
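
The "hard-wire LANG" idea could also live in the application itself rather than in the wrapper script. A minimal sketch, assuming a POSIX system: ensure_utf8_lang and lang_is_utf8 are hypothetical helpers, en_US.UTF-8 is only the example value mentioned above and has to be installed on the target machine, and the call would need to run before Tcl_FindExecutable() so that Tcl starts with the corrected environment.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std= modes */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Gives LANG a default when the launcher left it unset or empty.
 * A non-empty LANG chosen by the user is left alone.
 * Returns 0 on success, -1 if setenv() failed. */
int ensure_utf8_lang(const char *fallback)
{
    const char *lang;

    assert(fallback != NULL && fallback[0] != '\0');

    lang = getenv("LANG");
    if (lang != NULL && lang[0] != '\0') {
        return 0;                       /* already set: do not override */
    }
    return setenv("LANG", fallback, 1) == 0 ? 0 : -1;
}

/* Rough check whether LANG names a UTF-8 locale; useful for logging a
 * warning when the user's own setting is something else. */
int lang_is_utf8(void)
{
    const char *lang = getenv("LANG");

    return lang != NULL &&
           (strstr(lang, "UTF-8") != NULL || strstr(lang, "utf8") != NULL);
}
```

Typical use would be ensure_utf8_lang("en_US.UTF-8") as the first statement in main(). Note that LC_ALL and LC_CTYPE, when set, take precedence over LANG, so a wrapper that sets either of those would still win.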
