Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

 by: Phillip Brooks - Tue, 7 Feb 2023 00:38 UTC

Hi,
We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13. We have found that both handle invalid UTF-8 byte sequences differently from 8.6.10.

In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in

In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:

"The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."

This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?

Here is the small test program that demonstrates the issue. (Hopefully, the invalid string input will make it through intact. Let me know if it does not, and I will try to find an alternate way of posting it.)

#include <stdio.h>
#include <string.h>
#include <tcl.h>

void test(const char* s1)
{
    int length = strlen(s1);
    Tcl_Obj *valuePtr;
    Tcl_Obj *objResultPtr;

    printf("Ready to determine the length of \"%s\"\n", s1);

    objResultPtr = Tcl_NewStringObj(s1, length);
    length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
    Tcl_SetObjLength(objResultPtr, length);

    printf("The length of \"%s\" is %d\n", s1, length);
}

int main (int argc, char *argv[]) {
    Tcl_FindExecutable(NULL);

    Tcl_Interp *myinterp;

    myinterp = Tcl_CreateInterp();

    test("//j The quick brown fox jumps over the lazy dog");
    test("//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣");

    printf( "Completed test.\n" );

    return 0;
}

The crash in Tcl 8.6.13 shows this stack trace:

(gdb) where
#0 0x00007ffff7d3deae in Tcl_UtfToUniChar (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, chPtr=0x7fffffffcbc6) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:409
#1 0x00007ffff7d3fd0b in TclUtfToUCS4 (src=0x491ffe "\244\261"<error: Cannot access memory at address 0x492000>, ucs4Ptr=0x7fffffffcbf4) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:2417
#2 0x00007ffff7d3e802 in Tcl_UtfToUpper (
str=0x47f330 "//J \244ʤ\353\262\304Ǽ\300\255\244\242\244ꡢ\245\301\245\247\245Å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\355\251\223\355\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261\244\261"...) at /scratch2/pbrooks/tcl8.6.13/generic/tclUtf.c:1068
#3 0x0000000000400910 in test (s1=0x400aa0 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣") at main.c:14
#4 0x0000000000400976 in main (argc=1, argv=0x7fffffffcd58) at main.c:28

 by: saitology9 - Tue, 7 Feb 2023 01:36 UTC


I ran your test input in tclsh (version 8.6.12) and it ran fine:
Do you get a different result from the tclsh/wish shell?

% encoding system
utf-8

% proc test {s} {puts "$s : [string length $s]"}

% test "//j The quick brown fox jumps over the lazy dog"

% test "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"

% puts "Completed test.\n"

This is the output:

//j The quick brown fox jumps over the lazy dog : 47
//j ¤ʤë²ÄǽÀ­¤¢¤ꡢ¥Á¥§¥å¯Êýˡ¤Îʣ»¨²½¤âÈò¤±¤ë°١¢º£²ó¤Ͻü³°¤Ȥ¹¤롣 : 58
Completed test.

 by: Christian Gollwitzer - Tue, 7 Feb 2023 06:26 UTC

Hi Phil,

On 07.02.23 at 01:38, Phillip Brooks wrote:
> We are attempting to port our application forward to Tcl 8.6.12 and have also had a look at Tcl 8.6.13 and we have found a problem with the way that invalid UTF character strings are handled between the two.
>
> In 8.6.10, there wasn't a problem at least with the testcase shown below. In 8.6.12 and 8.6.13, the test crashes in
>
> In searching for answers, I noticed that there is a statement in the background of TIP #345 that says:
>
> "The contract of string representations in Tcl states that the bytes field (the strep) of a Tcl_Obj must be a valid UTF-8 byte sequence. ..."
>
> This seems to imply that handing an invalid byte sequence to Tcl_NewStringObj and manipulating it as a valid UTF-8 byte sequence could result in this sort of crash. Is this "contract" in any of the Tcl API documentation?
>

I'm not sure I understand the point of your code. Could you
please describe what the long octal-encoded byte sequence means?
Is it a UTF-8 string with one invalid char? It is also strange that
your source code has non-ASCII chars inside the C string.
The encoding of these depends on the C compiler (!); it might be
UTF-8, Latin-1, or anything else.

In principle, the sentence you found is correct. The string
representation of a Tcl_Obj is a string in the Tcl sense, usually
stored as UTF-8, with the exception that NUL bytes are encoded as C0 80
so that the string can still be handled as NUL-terminated. If you want
to handle arbitrary binary data, use a ByteArray. If it
is a UTF-8 string with errors, do a script-level [encoding convertfrom],
or do the same from the C level. The functions for this
are described here:

https://www.tcl.tk/man/tcl/TclLib/Encoding.html

If you bypass the encodings and put chars directly into the string rep
of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF-8.

Christian
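
As an illustration of the contract described above, here is a minimal sketch in plain C of a structural check that could be run on a buffer before it is handed to Tcl_NewStringObj. The helper name is_tcl_utf8 is hypothetical and is not part of the Tcl API. It accepts the two-byte form C0 80 for NUL mentioned above and rejects stray continuation bytes, truncated sequences and other overlong forms. It does not check surrogates, and the exact rules Tcl applies to 4-byte sequences may differ between versions, so treat it as an assumption-laden sketch rather than Tcl's own definition of validity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a Tcl function): returns 1 if buf[0..len) is
 * structurally well-formed UTF-8, allowing C0 80 as the encoding of NUL,
 * and 0 otherwise.
 */
static int is_tcl_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t need, k;       /* continuation bytes required / loop index */
        unsigned long cp;     /* decoded code point */
        unsigned long min;    /* smallest code point allowed at this length */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;         /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need) {
            return 0;         /* sequence is cut off by the end of the buffer */
        }
        for (k = 1; k <= need; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80) {
                return 0;     /* expected a continuation byte */
            }
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min && !(need == 1 && cp == 0)) {
            return 0;         /* overlong form, except C0 80 for NUL */
        }
        if (cp > 0x10FFFF) {
            return 0;         /* beyond the Unicode range */
        }
        i += need + 1;
    }
    return 1;
}
```

A caller could reject or convert the input when this returns 0, instead of passing the raw bytes on as a string representation.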

 by: Phillip Brooks - Tue, 7 Feb 2023 18:44 UTC

On Monday, February 6, 2023 at 5:36:08 PM UTC-8, saitology9 wrote:
> >
> I ran your test input in tclsh (version 8.6.12) and it ran fine:
> Do you get a different result from the tclsh/wish shell?

Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:

set str1 "//j The quick brown fox jumps over the lazy dog"
set str2 "//j \244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"
puts $str1
set str1_uc [ string toupper $str1 ]
puts $str1_uc
puts $str2
set str2_uc [ string toupper $str2 ]
puts $str2_uc

I am not sure what tclsh is doing to prevent the error, though.

 by: briang - Tue, 7 Feb 2023 21:17 UTC


This code is in violation according to the manual:

objResultPtr = Tcl_NewStringObj(s1, length);
length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
Tcl_SetObjLength(objResultPtr, length);

"Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."

-Brian

 by: Phillip Brooks - Tue, 7 Feb 2023 21:48 UTC

Thanks for the response, it definitely got me farther.

On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
> I'm not sure if I understand what the point of your code is. Could you
> please describe what the meaning of the long octal encoded byte sequence
> is?
The string is basically undefined incorrect data - we know that. It is from a test suite
checking to make sure that we can handle undefined incorrect data, so it could really
be anything (I am not sure of the original source for the data).

> If you want
> to handle arbitrary binary data, then either use a ByteArray, and if it
> is an UTF8 string with errors, do a script level [encoding convertfrom],
> or you can do the same from the C level. The functions for this
> described here:
>
> https://www.tcl.tk/man/tcl/TclLib/Encoding.html

From the C level, something like this?

void test( Tcl_Interp *interp, const char* s1)
{
    Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
    int length = strlen(s1);
    printf("The length of \"%s\" is %d\n", s1, length);

    char* valid_s1 = (char*) malloc( length*2 );
    int length_read = 0;
    int length_written = 0;
    int rt = Tcl_ExternalToUtf(
        interp,
        utf8_encoding,
        s1,
        length,
        0,
        nullptr,
        valid_s1,
        length*2,
        &length_read,
        &length_written,
        nullptr );
    if ( rt != TCL_OK ) {
        return;
    }
    ....

That does stop it from crashing, but, oddly enough, it also converts the invalid data to some other invalid data, though possibly valid UTF-8 nonetheless? It doesn't seem to damage any of the few valid UTF-8 strings I passed through it.

> If you bypass the encodings and directly put chars into the string rep
> of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.
OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.
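
One plausible reading of the conversion described above, consistent with the tclsh output earlier in the thread (where \244 prints as ¤ and \353 as ë), is that each byte that is not part of a well-formed sequence is mapped to the code point with the same value and re-encoded as two bytes, while well-formed sequences pass through unchanged. The sketch below imitates that fallback in plain C. It is not Tcl's implementation; the names utf8_seq_len and sanitize_utf8 are hypothetical, and surrogates are not checked. Under this assumption the output never exceeds twice the input length plus a terminator, but the Tcl documentation should be consulted for the space Tcl_ExternalToUtf actually requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: length of the well-formed UTF-8 sequence starting
 * at p (with n >= 1 bytes available), or 0 if p[0] does not start one.
 */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char c = p[0];
    size_t need, k;

    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) need = 1;
    else if (c >= 0xE0 && c <= 0xEF) need = 2;
    else if (c >= 0xF0 && c <= 0xF4) need = 3;
    else return 0;                            /* stray or invalid lead byte */

    if (n - 1 < need) return 0;               /* truncated sequence */
    for (k = 1; k <= need; k++) {
        if ((p[k] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
    }
    if (c == 0xE0 && p[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (c == 0xF0 && p[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (c == 0xF4 && p[1] > 0x8F) return 0;   /* above U+10FFFF */
    return need + 1;
}

/*
 * Hypothetical helper: copies src to dst, keeping well-formed sequences
 * and re-encoding every other byte as the two-byte UTF-8 form of the
 * code point with the same value. dst must hold at least 2*len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
static size_t sanitize_utf8(const unsigned char *src, size_t len,
                            unsigned char *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        size_t n = utf8_seq_len(src + i, len - i);
        if (n > 0) {
            memcpy(dst + out, src + i, n);    /* valid sequence: keep as is */
            out += n;
            i += n;
        } else {
            unsigned char c = src[i++];       /* always >= 0x80 here */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

With the first three bytes of the test string after "//j " (\244 \312 \244), the stray \244 becomes C2 A4 and the well-formed pair \312\244 is kept, which would explain why valid strings come through undamaged.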

 by: briang - Tue, 7 Feb 2023 21:55 UTC

On Tuesday, February 7, 2023 at 1:48:26 PM UTC-8, Phillip Brooks wrote:
> > If you bypass the encodings and directly put chars into the string rep
> > of a Tcl_Obj, you will get undefined behaviour if it is not correct UTF8.
> OK - is that documented somewhere? I don't see anything to that effect in Tcl_NewStringObj.

https://www.tcl-lang.org/man/tcl8.6/TclLib/StringObj.htm#M4

"Points to the first byte of an array of UTF-8-encoded bytes used to set or append to a string value. This byte array may contain embedded null characters unless numChars is negative. (Applications needing null bytes should represent them as the two-byte sequence \300\200, use Tcl_ExternalToUtf to convert, or Tcl_NewByteArrayObj if the string is a collection of uninterpreted bytes.)"

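To make the \300\200 convention in the quoted documentation concrete, here is a minimal sketch in plain C that rewrites each embedded NUL byte as the two-byte sequence C0 80. The helper name encode_embedded_nuls is hypothetical and not part of the Tcl API, and the sketch assumes that every other byte in the input is already valid UTF-8, which it passes through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copies src[0..len) to dst, writing every 0x00 byte
 * as the two bytes C0 80. dst must hold at least 2*len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
static size_t encode_embedded_nuls(const char *src, size_t len, char *dst)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        if (src[i] == '\0') {
            dst[out++] = (char)0xC0;   /* first byte of the NUL encoding */
            dst[out++] = (char)0x80;   /* second byte of the NUL encoding */
        } else {
            dst[out++] = src[i];       /* all other bytes are copied as is */
        }
    }
    dst[out] = '\0';
    return out;
}
```

After this step the buffer contains no interior zero byte, so it can be treated as an ordinary NUL-terminated C string.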
 by: Phillip Brooks - Tue, 7 Feb 2023 22:10 UTC

On Tuesday, February 7, 2023 at 1:17:16 PM UTC-8, briang wrote:

> This code is in violation according to the manual:
> objResultPtr = Tcl_NewStringObj(s1, length);
> length = Tcl_UtfToUpper(Tcl_GetString(objResultPtr));
> Tcl_SetObjLength(objResultPtr, length);
> "Except for that limited purpose, the pointer returned by Tcl_GetStringFromObj or Tcl_GetString should be treated as read-only. It is recommended that this pointer be assigned to a (const char *) variable. Even in the limited situations where writing to this pointer is acceptable, one should take care to respect the copy-on-write semantics required by Tcl_Obj's, with appropriate calls to Tcl_IsShared and Tcl_DuplicateObj prior to any in-place modification of the string representation."
>
> -Brian

Right - that bit was actually copied out from inside of Tcl someplace by the person that isolated it into a standalone problem. Our application was doing something different. This (more correct) code also crashes:

char* s1_upper = (char*) malloc( length+1 );
strcpy( s1_upper, s1 );
length = Tcl_UtfToUpper(s1_upper);

 by: Christian Gollwitzer - Wed, 8 Feb 2023 06:58 UTC

Am 07.02.23 um 22:48 schrieb Phillip Brooks:
> Thanks for the response, it definitely got me farther.
>
> On Monday, February 6, 2023 at 10:26:53 PM UTC-8, Christian Gollwitzer wrote:
>> I'm not sure if I understand what the point of your code is. Could you
>> please describe what the meaning of the long octal encoded byte sequence
>> is?
> The string is basically undefined incorrect data - we know that. It is from a test suite
> checking to make sure that we can handle undefined incorrect data, so it could really
> be anything (I am not sure of the original source for the data).

OK - so it is not arbitrary binary data (that would be

>
>> If you want
>> to handle arbitrary binary data, then either use a ByteArray, and if it
>> is an UTF8 string with errors, do a script level [encoding convertfrom],
>> or you can do the same from the C level. The functions for this
>> described here:
>>
>> https://www.tcl.tk/man/tcl/TclLib/Encoding.html
>
> From the C level, something like this?
>
> void test( Tcl_Interp *interp, const char* s1)
> {
> Tcl_Encoding utf8_encoding = Tcl_GetEncoding( interp, "utf-8" );
> int length = strlen(s1);
> printf("The length of \"%s\" is %d\n", s1, length);
>
> char* valid_s1 = (char*) malloc( length*2 );
> int length_read = 0;
> int length_written = 0;
> int rt = Tcl_ExternalToUtf(
> interp,
> utf8_encoding,
> s1,
> length,
> 0,
> nullptr,
> valid_s1,
> length*2,
> &length_read,
> &length_written,
> nullptr );
> if ( rt != TCL_OK ) {
> return;
>

In principle, yes. Instead of the malloc I would use
Tcl_ExternalToUtfDString, which allocates the correct number of bytes
itself, and you can easily convert the DString into a Tcl_Obj later on
(without copying). Also consider caching the Tcl_Encoding if you call
this function more often, and free the encoding at the end of the program.

> crashing, but, oddly enough, it also converts the invalid data to some
> other invalid data - but possibly valid UTF-8 nonetheless? It
> doesn't seem to damage any of the few valid UTF-8 strings I passed
> through it.

Are you sure it is invalid data? It should have replaced all invalid
chars by the Unicode encoding for "invalid".
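
To make the replacement policy concrete: Unicode reserves U+FFFD REPLACEMENT CHARACTER for this purpose, and its UTF-8 encoding is EF BF BD. Below is a minimal sketch of such a policy in plain C. It is not Tcl's implementation and makes no claim about what a particular Tcl release emits for a given input. The names utf8_seq_len and utf8_sanitize are hypothetical, and replacing each offending byte individually is only one of several conventions in use.

```c
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence at s (n bytes available),
 * or 0 if s does not start a well-formed sequence. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char b;
    if (n == 0) return 0;
    b = s[0];
    if (b < 0x80) return 1;                          /* ASCII */
    if (b >= 0xC2 && b <= 0xDF)                      /* 2-byte lead */
        return (n >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (b >= 0xE0 && b <= 0xEF) {                    /* 3-byte lead */
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        if (b == 0xE0 && s[1] < 0xA0) return 0;      /* overlong form */
        if (b == 0xED && s[1] > 0x9F) return 0;      /* UTF-16 surrogate */
        return 3;
    }
    if (b >= 0xF0 && b <= 0xF4) {                    /* 4-byte lead */
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                  || (s[3] & 0xC0) != 0x80) return 0;
        if (b == 0xF0 && s[1] < 0x90) return 0;      /* overlong form */
        if (b == 0xF4 && s[1] > 0x8F) return 0;      /* above U+10FFFF */
        return 4;
    }
    return 0;   /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}

/* Copy src to dst, replacing every byte that is not part of a well-formed
 * sequence with U+FFFD (EF BF BD). dst must have room for 3 * n bytes.
 * Returns the number of bytes written; no terminating NUL is added. */
size_t utf8_sanitize(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_seq_len(src + i, n - i);
        if (len == 0) {
            dst[o++] = 0xEF; dst[o++] = 0xBF; dst[o++] = 0xBD;
            i++;                                     /* skip one bad byte */
        } else {
            while (len-- > 0) dst[o++] = src[i++];   /* copy good sequence */
        }
    }
    return o;
}
```

Note the worst case: every input byte can become three output bytes, so the destination buffer is sized at three times the input length in this sketch.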

If you want to handle the encoding errors differently, I'm not sure how
to do it in current Tcl version. There is a discussion going on about
improving Unicode support in Tcl 9, which will bring different failure
modes also from the script level, so that the application can decide on
which errors to reject etc.

I had a similar problem in one of my projects, and I decided to check
the data for UTF-8 compatibility manually. If it was incorrect data, I
passed it in as a ByteArray. That was the right thing to do in this
specific context. The code is here:

https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56
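
For readers who want the idea without following the link, here is a minimal sketch of such a manual check in plain C. It is not the code from the linked SWObject.hpp. The function name is_valid_utf8 is hypothetical, and the check applies the strict rules of RFC 3629, so it also rejects the two-byte form C0 80 that Tcl uses internally for U+0000.

```c
#include <stddef.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, else 0.
 * Rejects stray continuation bytes, truncated sequences, overlong forms,
 * UTF-16 surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t need;          /* continuation bytes expected after b */
        unsigned long cp;     /* code point being assembled */
        unsigned long min;    /* smallest code point allowed at this length */
        size_t k;

        if (b < 0x80) { i++; continue; }             /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;        /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (n - i - 1 < need) return 0;              /* truncated at end of input */
        for (k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;        /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong form */
        if (cp > 0x10FFFF) return 0;                 /* outside Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        i += need + 1;
    }
    return 1;
}
```

A caller could then pass data that survives the check to Tcl_NewStringObj and everything else to Tcl_NewByteArrayObj, which is the split described above. As an example of a rejection, the raw byte pair 0xC0 0xAD (octal \300\255) is refused because 0xC0 can never begin a well-formed sequence.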

Christian

Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

<ygaa61o8o2v.fsf@akutech.de>

https://www.novabbs.com/devel/article-flat.php?id=10707&group=comp.lang.tcl#10707

Newsgroups: comp.lang.tcl
Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: ralf...@gmx.de (Ralf Fassel)
Newsgroups: comp.lang.tcl
Subject: Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13
Date: Wed, 08 Feb 2023 11:10:32 +0100
Lines: 23
Message-ID: <ygaa61o8o2v.fsf@akutech.de>
References: <e9976652-46a1-4cc3-99eb-fdad4308d7e7n@googlegroups.com>
<trs9u3$39i8b$1@dont-email.me>
<3be77dd5-dfea-4c60-aa3f-51f914d2e0b1n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: individual.net zgcOn4L/a6FPBpbpu0W22gZoFx4pZtQG+ku49+Xr1XWVIvAxY=
Cancel-Lock: sha1:pPEGwtxbuvpL4dBE+/5bBYeiJCk= sha1:Av5s5F4/AOrDmmL6wK1QQGG+xLQ=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
 by: Ralf Fassel - Wed, 8 Feb 2023 10:10 UTC

* Phillip Brooks <philbrks@gmail.com>
| Yes, I similarly ran the test through tclsh and that succeeded. A more complete version is:
>
| set str2 "//j\244ʤ\353\262\304ǽ\300\255\244\242\244ꡢ\245\301\245\247\245å\257\312\375ˡ\244\316ʣ\273\250\262\275\244\342\310\362\244\261\244\353\260١\242\272\243\262\363\244Ͻ\374\263\260\244Ȥ\271\244롣"
>
| I am not sure what tclsh is doing to prevent the error, though.

% string length $str2
57
% string bytelength $str2
113

% set str1 \255
­
% string length $str1
1
% string bytelength $str1
2

I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.
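
The 1-versus-2 result for \255 follows from how UTF-8 encodes code points: octal 255 is U+00AD, which lies above U+007F and therefore takes two bytes. A minimal sketch in plain C, independent of Tcl, shows the arithmetic; the function name encode_utf8 is hypothetical.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate or a value above U+10FFFF). */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                                 /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)                /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                              /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                            /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling encode_utf8(0255, buf) returns 2 and stores the bytes C2 AD, which matches the string length of 1 and bytelength of 2 shown above.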

R'

Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

<9835a18c-1ed5-4024-97c9-42794ce3dbf4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10710&group=comp.lang.tcl#10710

Newsgroups: comp.lang.tcl
X-Received: by 2002:ac8:5b90:0:b0:3ba:18b0:ab46 with SMTP id a16-20020ac85b90000000b003ba18b0ab46mr1898767qta.176.1675906889684;
Wed, 08 Feb 2023 17:41:29 -0800 (PST)
X-Received: by 2002:a81:6a41:0:b0:52a:7563:2d30 with SMTP id
f62-20020a816a41000000b0052a75632d30mr1243281ywc.294.1675906889361; Wed, 08
Feb 2023 17:41:29 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.tcl
Date: Wed, 8 Feb 2023 17:41:28 -0800 (PST)
In-Reply-To: <ygaa61o8o2v.fsf@akutech.de>
Injection-Info: google-groups.googlegroups.com; posting-host=192.94.38.34; posting-account=SiRO5QoAAAAilq1etV9Podc0rHvm5sum
NNTP-Posting-Host: 192.94.38.34
References: <e9976652-46a1-4cc3-99eb-fdad4308d7e7n@googlegroups.com>
<trs9u3$39i8b$1@dont-email.me> <3be77dd5-dfea-4c60-aa3f-51f914d2e0b1n@googlegroups.com>
<ygaa61o8o2v.fsf@akutech.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9835a18c-1ed5-4024-97c9-42794ce3dbf4n@googlegroups.com>
Subject: Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13
From: philb...@gmail.com (Phillip Brooks)
Injection-Date: Thu, 09 Feb 2023 01:41:29 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1548
 by: Phillip Brooks - Thu, 9 Feb 2023 01:41 UTC

On Wednesday, February 8, 2023 at 2:10:37 AM UTC-8, Ralf Fassel wrote:
> * Phillip Brooks
> I think TCL simply encodes the data you pass in (even with \ooo) as valid UTF-8.

OK - probably something similar to the solution we reached using Tcl_ExternalToUtf above. Thanks for the observations.

Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

<90e79e83-ca17-4497-8998-e5403fa339b3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=10711&group=comp.lang.tcl#10711

Newsgroups: comp.lang.tcl
X-Received: by 2002:a05:622a:19a2:b0:3b8:5f47:aab6 with SMTP id u34-20020a05622a19a200b003b85f47aab6mr1495320qtc.235.1675907336909;
Wed, 08 Feb 2023 17:48:56 -0800 (PST)
X-Received: by 2002:a25:b87:0:b0:87f:dd8e:ce3f with SMTP id
129-20020a250b87000000b0087fdd8ece3fmr1164622ybl.349.1675907336659; Wed, 08
Feb 2023 17:48:56 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.tcl
Date: Wed, 8 Feb 2023 17:48:56 -0800 (PST)
In-Reply-To: <trvh6b$2g6s$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=192.94.38.34; posting-account=SiRO5QoAAAAilq1etV9Podc0rHvm5sum
NNTP-Posting-Host: 192.94.38.34
References: <e9976652-46a1-4cc3-99eb-fdad4308d7e7n@googlegroups.com>
<trsqv8$3f425$1@dont-email.me> <884f2807-d56b-4b2b-9b05-e6f7a788cd08n@googlegroups.com>
<trvh6b$2g6s$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <90e79e83-ca17-4497-8998-e5403fa339b3n@googlegroups.com>
Subject: Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13
From: philb...@gmail.com (Phillip Brooks)
Injection-Date: Thu, 09 Feb 2023 01:48:56 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3347
 by: Phillip Brooks - Thu, 9 Feb 2023 01:48 UTC

On Tuesday, February 7, 2023 at 10:58:23 PM UTC-8, Christian Gollwitzer wrote:

> In principle, yes. Instead of the malloc I would use
> Tcl_ExternalToUtfDString, which allocates the correct number of bytes
> itself, and you can easily convert the DString into a Tcl_Obj later on
> (without copying). Also consider caching the Tcl_Encoding if you call
> this function more often, and free the encoding at the end of the program.

> Are you sure it is invalid data? It should have replaced all invalid
> chars by the Unicode encoding for "invalid".

OK - maybe not invalid but meaningless, nonetheless. Interestingly, I found out that this data was generated by our previous issue with the Korean UTF-8 string that was being interpreted as something else and then junk was printed into one of our result files. That junk result file has since become part of our test suite. (Devilishly clever, some of these QA folks. Show them a bug, and they know they have spotted a weakness. They cleverly dive in anew on the same spot and lay the groundwork to find another bug in the process.)

> If you want to handle the encoding errors differently, I'm not sure how
> to do it in current Tcl version. There is a discussion going on about
> improving Unicode support in Tcl 9, which will bring different failure
> modes also from the script level, so that the application can decide on
> which errors to reject etc.

Avoiding crashing is the main objective here. We'll leave interpreting whether the UTF-8 means anything for another day (ChatGPTcl?)

> I had a similar problem in one of my projects, and I decided to check
> the data for UTF-8 compatibility manually. If it was incorrect data, I
> passed it in as a ByteArray. That was the right thing to do in this
> specific context. The code is here:
>
> https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56

Thanks, I'll take a look!

Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13

<k4j9q2FsfmsU1@mid.individual.net>

https://www.novabbs.com/devel/article-flat.php?id=10714&group=comp.lang.tcl#10714

Newsgroups: comp.lang.tcl
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!2.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: ...@ednolan (ted@loft.tnolan.com (Ted Nolan)
Newsgroups: comp.lang.tcl
Subject: Re: Invalid UTF handling behavior between 8.6.10 and 8.6.12/13
Date: 9 Feb 2023 04:06:26 GMT
Organization: loft
Lines: 54
Message-ID: <k4j9q2FsfmsU1@mid.individual.net>
References: <e9976652-46a1-4cc3-99eb-fdad4308d7e7n@googlegroups.com> <884f2807-d56b-4b2b-9b05-e6f7a788cd08n@googlegroups.com> <trvh6b$2g6s$1@dont-email.me> <90e79e83-ca17-4497-8998-e5403fa339b3n@googlegroups.com>
X-Trace: individual.net hwWavP6DGuxb7ZeZVtZUJwM12fTytmHGkZ5rLl8Q7e2H8M8DL1
X-Orig-Path: not-for-mail
Cancel-Lock: sha1:aGc+erPv0iGZJ5I8DGBq9CYSkX4=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
 by: ted@loft.tnolan.com - Thu, 9 Feb 2023 04:06 UTC

In article <90e79e83-ca17-4497-8998-e5403fa339b3n@googlegroups.com>,
Phillip Brooks <philbrks@gmail.com> wrote:
>On Tuesday, February 7, 2023 at 10:58:23 PM UTC-8, Christian Gollwitzer wrote:
>
>> In principle, yes. Instead of the malloc I would use
>> Tcl_ExternalToUtfDString, which allocates the correct number of bytes
>> itself, and you can easily convert the DString into a Tcl_Obj later on
>> (without copying). Also consider caching the Tcl_Encoding if you call
>> this function more often, and free the encoding at the end of the program.
>
>> Are you sure it is invalid data? It should have replaced all invalid
>> chars by the Unicode encoding for "invalid".
>
>OK - maybe not invalid but meaningless, nonetheless. Interestingly, I
>found out that this data was generated by our previous issue with the
>Korean UTF-8 string that was being interpreted as something else and
>then junk was printed into one of our result files. That junk result
>file has since become part of our test suite. (Devilishly clever, some
>of these QA folks. Show them a bug, and they know they have spotted
>a weakness. They cleverly dive in anew on the same spot and lay
>the groundwork to find another bug in the process.)
>
>> If you want to handle the encoding errors differently, I'm not sure how
>> to do it in current Tcl version. There is a discussion going on about
>> improving Unicode support in Tcl 9, which will bring different failure
>> modes also from the script level, so that the application can decide on
>> which errors to reject etc.
>
>Avoiding crashing is the main objective here. We'll leave interpreting
>whether the UTF-8 means anything for another day (ChatGPTcl?)
>
>> I had a similar problem in one of my projects, and I decided to check
>> the data for UTF-8 compatibility manually. If it was incorrect data, I
>> passed it in as a ByteArray. That was the right thing to do in this
>> specific context. The code is here:
>>
>>
>https://github.com/BessyHDFViewer/HDFpp/blob/c5d384b3970f0bebc47e40224dc218cc6c441cbf/generic/SWObject.hpp#L56
>
>Thanks, I'll take a look!

I discovered once that "encoding convertfrom utf-8" would not
throw an error if you invoked it on invalid (non utf-8) data, which I
had not expected. I'm not sure what it actually does, but it's
happy to hand you garbage.

I wrote a little "is_utf8" extension based on code from:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

It turned out to be pretty easy.
--
columbiaclosings.com
What's not in Columbia anymore..
