novaBBS - comp.lang.c - Re: UTF-8 strings in C

Re: UTF-8 strings in C

<e3b52fff-13c9-4537-8c7a-bfd9a180e2c2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19976&group=comp.lang.c#19976

X-Received: by 2002:ac8:5c87:: with SMTP id r7mr8165339qta.575.1642178049819;
Fri, 14 Jan 2022 08:34:09 -0800 (PST)
X-Received: by 2002:a05:622a:15c7:: with SMTP id d7mr8409618qty.476.1642178049644;
Fri, 14 Jan 2022 08:34:09 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Fri, 14 Jan 2022 08:34:09 -0800 (PST)
In-Reply-To: <srs7nj$62u$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:8076:173:6d1c:b40e;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:8076:173:6d1c:b40e
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me> <srhcti$tvd$1@dont-email.me>
<87o84eh5nv.fsf@tigger.extechop.net> <afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me> <2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org> <srs269$4n1$2@dont-email.me> <srs7nj$62u$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e3b52fff-13c9-4537-8c7a-bfd9a180e2c2n@googlegroups.com>
Subject: Re: UTF-8 strings in C
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Fri, 14 Jan 2022 16:34:09 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 24

by: Malcolm McLean - Fri, 14 Jan 2022 16:34 UTC

On Friday, 14 January 2022 at 16:18:11 UTC, Mateusz Viste wrote:
> 2022-01-14 at 15:43 +0100, Bonita Montero wrote:
> > Am 14.01.2022 um 15:10 schrieb Mateusz Viste:
> > > 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
> > >> Yes. It should compile cleanly. That's not the objection.
> > >
> > > clang with -Weverything (or -Wsign-conversion) does warn about this.
> > >
> > > warning: implicit conversion changes signedness: 'int' to 'size_t'
> > > (aka 'unsigned long') [-Wsign-conversion]
> >
> > Stupid warning. All compilers widen the data to the size of the
> > destination type and then convert it.
> The warning points out a clear inconsistency. If that's *really* what
> the programmer wanted, then an explicit cast should clear out the
> warning. Otherwise it is likely to be a human error, and warning about
> that is never stupid.
>
If you are writing a utf8strlen() you have to do something if it is passed malformed
UTF-8. -1 is a standard "error return". Really you should return -2 or -3: keep -1
for "out of memory" errors, use -2 or -3 for "parse errors", and whichever of the two
is left for "IO errors". Those cover the vast majority of error conditions that can arise.

However size_t cannot represent -1, so instead the maximum value is returned.
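
A minimal sketch of that convention, for illustration only (this is not the code posted upthread). It assumes a simplified validity check: lead bytes and continuation bytes are verified, but overlong forms and surrogates are not rejected. The name UTF8_ERR is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error value: (size_t)-1, i.e. the largest size_t. */
#define UTF8_ERR ((size_t)-1)

/* Count code points in a NUL-terminated UTF-8 string.
   Returns UTF8_ERR on a bad lead byte or a missing continuation byte.
   Simplified: overlong forms and surrogates are not rejected. */
size_t utf8strlen(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != 0) {
        int extra;                                 /* continuation bytes expected */

        if (*p < 0x80)                extra = 0;   /* 0xxxxxxx */
        else if ((*p & 0xE0) == 0xC0) extra = 1;   /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) extra = 2;   /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) extra = 3;   /* 11110xxx */
        else return UTF8_ERR;                      /* stray continuation or invalid byte */

        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)               /* also catches the terminating NUL */
                return UTF8_ERR;
            p++;
        }
        count++;
    }
    return count;
}
```

A caller compares the result against (size_t)-1 before using it as a length.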

Re: UTF-8 strings in C

<srs95u$qhf$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19977&group=comp.lang.c#19977

Path: i2pn2.org!i2pn.org!aioe.org!299gYy2nqWB43X4cCBV6zg.user.46.165.242.75.POSTED!not-for-mail
From: mate...@xyz.invalid (Mateusz Viste)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 17:42:38 +0100
Organization: . . .
Message-ID: <srs95u$qhf$1@gioia.aioe.org>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me>
<srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me>
<srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me>
<87o84eh5nv.fsf@tigger.extechop.net>
<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me>
<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org>
<srs269$4n1$2@dont-email.me>
<srs7nj$62u$1@gioia.aioe.org>
<e3b52fff-13c9-4537-8c7a-bfd9a180e2c2n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="27183"; posting-host="299gYy2nqWB43X4cCBV6zg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2

by: Mateusz Viste - Fri, 14 Jan 2022 16:42 UTC

2022-01-14 at 08:34 -0800, Malcolm McLean wrote:
> However size_t cannot represent -1, so instead the maximum value is
> returned.

Obviously, but it can still return "-1 cast to size_t", so the
caller may check for errors by doing the same cast. That's a convention
I have seen in actual code in the past. Whether or not it is good
practice and healthy design is another matter.
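
A rough sketch of that caller-side check, with hypothetical names (checked_len, len_failed) that do not come from any code in this thread:

```c
#include <stddef.h>
#include <stdint.h>   /* SIZE_MAX */

/* Hypothetical stand-in for a length function with a size_t return type. */
size_t checked_len(const char *s)
{
    size_t n = 0;

    if (s == NULL)
        return (size_t)-1;      /* error: the conversion yields SIZE_MAX */
    while (s[n] != '\0')
        n++;
    return n;
}

/* Caller-side test: apply the same cast rather than comparing with an int -1. */
int len_failed(size_t result)
{
    return result == (size_t)-1;
}
```

With a 64-bit size_t the error value is 18446744073709551615. No valid string length can equal it, because the terminating NUL also needs a byte.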

Mateusz

Re: UTF-8 strings in C

<srs9l2$th2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19978&group=comp.lang.c#19978

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 17:50:46 +0100
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <srs9l2$th2$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87o84eh5nv.fsf@tigger.extechop.net>
<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me>
<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org> <srs269$4n1$2@dont-email.me>
<srs649$1af$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 Jan 2022 16:50:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dc301fd105e5ac21c197a54d7b05fd93";
logging-data="30242"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hviNp0r6TEgkDXPmCXwZ5S80rKMWo160="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:pPyVUlVhAvPtU72vP6cFkZuWSCg=
In-Reply-To: <srs649$1af$1@dont-email.me>
Content-Language: de-DE

by: Bonita Montero - Fri, 14 Jan 2022 16:50 UTC

Am 14.01.2022 um 16:50 schrieb Bart:
> On 14/01/2022 14:43, Bonita Montero wrote:
>> Am 14.01.2022 um 15:10 schrieb Mateusz Viste:
>>> 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
>>>> Yes. It should compile cleanly. That's not the objection.
>>>
>>> clang with -Weverything (or -Wsign-conversion) does warn about this.
>>>
>>> warning: implicit conversion changes signedness: 'int' to 'size_t' (aka
>>> 'unsigned long') [-Wsign-conversion]
>>
>> Stupid warning. All compilers widen the data to the size of the
>> destination type and then convert it.
>>
>
>
> If you change int -1 to size_t then you are liable to end up with
> 18446744073709551615.
> You don't want a warning about that?

No, because the upper value is intended.

Re: UTF-8 strings in C

<srs9n2$th2$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19979&group=comp.lang.c#19979

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 17:51:50 +0100
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <srs9n2$th2$2@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87o84eh5nv.fsf@tigger.extechop.net>
<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me>
<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org> <srs269$4n1$2@dont-email.me>
<srs7nj$62u$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 Jan 2022 16:51:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dc301fd105e5ac21c197a54d7b05fd93";
logging-data="30242"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+iZ+PuIWIYvXsnXpQ3Hm+sO/LYXMDhUaI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:HAVT1k89lJC0Hd5g0oyHUtyA9iM=
In-Reply-To: <srs7nj$62u$1@gioia.aioe.org>
Content-Language: de-DE

by: Bonita Montero - Fri, 14 Jan 2022 16:51 UTC

Am 14.01.2022 um 17:17 schrieb Mateusz Viste:
> 2022-01-14 at 15:43 +0100, Bonita Montero wrote:
>> Am 14.01.2022 um 15:10 schrieb Mateusz Viste:
>>> 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
>>>> Yes. It should compile cleanly. That's not the objection.
>>>
>>> clang with -Weverything (or -Wsign-conversion) does warn about this.
>>>
>>> warning: implicit conversion changes signedness: 'int' to 'size_t'
>>> (aka 'unsigned long') [-Wsign-conversion]
>>
>> Stupid warning. All compilers widen the data to the size of the
>> destination type and then convert it.
>
> The warning points out a clear inconsistency. If that's *really* what
> the programmer wanted, then an explicit cast should clear out the
> warning. Otherwise it is likely to be a human error, and warning about
> that is never stupid.

When someone casts -1 to an unsigned value, he always intends that.
So this warning is superfluous.

Re: UTF-8 strings in C

<srsajm$1gvq$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19980&group=comp.lang.c#19980

Path: i2pn2.org!i2pn.org!aioe.org!299gYy2nqWB43X4cCBV6zg.user.46.165.242.75.POSTED!not-for-mail
From: mate...@xyz.invalid (Mateusz Viste)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 18:07:02 +0100
Organization: . . .
Message-ID: <srsajm$1gvq$1@gioia.aioe.org>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me>
<srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me>
<srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me>
<87o84eh5nv.fsf@tigger.extechop.net>
<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me>
<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org>
<srs269$4n1$2@dont-email.me>
<srs7nj$62u$1@gioia.aioe.org>
<srs9n2$th2$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="50170"; posting-host="299gYy2nqWB43X4cCBV6zg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2

by: Mateusz Viste - Fri, 14 Jan 2022 17:07 UTC

2022-01-14 at 17:51 +0100, Bonita Montero wrote:
> Am 14.01.2022 um 17:17 schrieb Mateusz Viste:
> > The warning points out a clear inconsistency. If that's *really*
> > what the programmer wanted, then an explicit cast should clear out
> > the warning. Otherwise it is likely to be a human error, and
> > warning about that is never stupid.
>
> When someone casts -1 to an unsigned value, he always intends that.

Look again at the original code. It states "return -1". No cast, other
than the implicit one due to the function being defined as returning an
unsigned value. The majority of situations where I have seen this were
human mistakes. Often because the function was initially supposed to
return a signed value and the programmer changed his mind later, but
forgot about the few negative returns he left behind.

> So this warning is superfluous.

You think that because you are used to working with reptilians. For us,
mere humans, such warnings help save man-hours of code tracking.

Mateusz

Re: UTF-8 strings in C

<srsba9$b50$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19981&group=comp.lang.c#19981

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 18:19:09 +0100
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <srsba9$b50$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87o84eh5nv.fsf@tigger.extechop.net>
<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>
<srrul5$c2h$1@dont-email.me>
<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>
<srs07u$cr$1@gioia.aioe.org> <srs269$4n1$2@dont-email.me>
<srs7nj$62u$1@gioia.aioe.org> <srs9n2$th2$2@dont-email.me>
<srsajm$1gvq$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 Jan 2022 17:19:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dc301fd105e5ac21c197a54d7b05fd93";
logging-data="11424"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19HJkR5AU1E+1vM77yUbsUXAmWXiHMkkOA="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:EEH1eaAw8UlUM+KGDsPX9Ab1DyA=
In-Reply-To: <srsajm$1gvq$1@gioia.aioe.org>
Content-Language: de-DE

by: Bonita Montero - Fri, 14 Jan 2022 17:19 UTC

Am 14.01.2022 um 18:07 schrieb Mateusz Viste:

> Look again at the original code. It states "return -1". No cast, other
> than the implicit one due to the function being defined as returning an
> unsigned value. The majority of situations where I have seen this were
> human mistakes. ...

No, this is never done by mistake; it is always intended.

Re: UTF-8 strings in C

<srtvd7$143n0$2@solani.org>

https://www.novabbs.com/devel/article-flat.php?id=19983&group=comp.lang.c#19983

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!reader5.news.weretis.net!news.solani.org!.POSTED!not-for-mail
From: pkk...@spth.de (Philipp Klaus Krause)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sat, 15 Jan 2022 09:08:07 +0100
Message-ID: <srtvd7$143n0$2@solani.org>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 15 Jan 2022 08:08:07 -0000 (UTC)
Injection-Info: solani.org;
logging-data="1183456"; mail-complaints-to="abuse@news.solani.org"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:AnK42JmXo/4GVOnxn27bWw5RRwE=
X-User-ID: eJwFwQkRADAIAzBL4wc5K7f6l7AkLCW3PCM9GEQM5qxdXQXvGz0BLOFlaZSqltvpOtg37A8pthFp
In-Reply-To: <87h7a6h0lp.fsf@tigger.extechop.net>
Content-Language: en-US

by: Philipp Klaus Krause - Sat, 15 Jan 2022 08:08 UTC

On 14.01.22 15:07, Otto J. Makela wrote:
>
> How about this? It of course does not count the slightly odder
> compositions (like flags) correctly.
>

Why not just call mblen() in the loop instead?
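
For reference, a minimal sketch of such a loop (the function name mb_count is made up here). It assumes the LC_CTYPE locale has already been set to one whose encoding matches the string, since mblen() decodes according to the current locale:

```c
#include <assert.h>
#include <stdlib.h>   /* mblen */
#include <string.h>   /* strlen */

/* Count multibyte characters in s according to the current LC_CTYPE locale.
   Returns -1 if mblen() reports an invalid or incomplete sequence. */
long mb_count(const char *s)
{
    const char *end;
    long count = 0;

    assert(s != NULL);
    end = s + strlen(s);
    (void)mblen(NULL, 0);                       /* reset any shift state */
    while (s < end) {
        int len = mblen(s, (size_t)(end - s));  /* bytes in the next character */
        if (len < 1)
            return -1;                          /* invalid in this locale */
        s += len;
        count++;
    }
    return count;
}
```

Like a hand-written byte loop, this counts multibyte characters, so a flag built from several code points still counts as several.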

Re: UTF-8 strings in C

<87k0es9kag.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=20099&group=comp.lang.c#20099

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: om...@iki.fi (Otto J. Makela)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sat, 22 Jan 2022 11:44:39 +0200
Organization: Games and Theory
Lines: 38
Message-ID: <87k0es9kag.fsf@tigger.extechop.net>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="5f272e59d6df41108b59cdf90fe95f12";
logging-data="27485"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19NbVJl03eyEMQ1yItYavwR"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:DQOEYNOxYDvQvjo6qq5yaTbSPM8=
sha1:V1u4RXiS155vCtevNEQv4F48o/Q=
X-Face: 'g'S,X"!c;\pfvl4ljdcm?cDdk<-Z;`x5;YJPI-cs~D%;_<\V3!3GCims?a*;~u$<FYl@"E
c?3?_J+Zwn~{$8<iEy}EqIn_08"`oWuqO$#(5y3hGq8}BG#sag{BL)u8(c^Lu;*{8+'Z-k\?k09ILS
X-URL: http://www.iki.fi/om/
Mail-Copies-To: never

by: Otto J. Makela - Sat, 22 Jan 2022 09:44 UTC

Philipp Klaus Krause <pkk@spth.de> wrote:

> On 14.01.22 15:07, Otto J. Makela wrote:
>> How about this? It of course does not count the slightly odder
>> compositions (like flags) correctly.
>
> Why not just call mblen() in the loop instead?

Firstly, I will fully admit I wasn't aware of its existence.

Secondly, on quick tests I can't seem to find the settings it needs to
be happy even with 2-byte characters. Apparently I don't understand it,
or my environment isn't correctly set up?

----

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned char str[10]="Ä";

    /* Dump the first three bytes, then the length mblen() reports. */
    printf("'%s' %02x %02x %02x %d\n",
           str,str[0],str[1],str[2],mblen((const char *)str,10));
    exit(0);
}

----

Produces output:

'Ä' c3 84 00 -1

--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

Re: UTF-8 strings in C

<a47139af-611d-4556-bae4-7399ea12942fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20101&group=comp.lang.c#20101

X-Received: by 2002:a05:622a:506:: with SMTP id l6mr6389073qtx.559.1642855735278;
Sat, 22 Jan 2022 04:48:55 -0800 (PST)
X-Received: by 2002:a05:622a:1104:: with SMTP id e4mr6490328qty.233.1642855735140;
Sat, 22 Jan 2022 04:48:55 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 22 Jan 2022 04:48:54 -0800 (PST)
In-Reply-To: <87k0es9kag.fsf@tigger.extechop.net>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:d5af:911c:4449:64d2;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:d5af:911c:4449:64d2
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me> <srhcti$tvd$1@dont-email.me>
<87lezih5b5.fsf@tigger.extechop.net> <87h7a6h0lp.fsf@tigger.extechop.net>
<srtvd7$143n0$2@solani.org> <87k0es9kag.fsf@tigger.extechop.net>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a47139af-611d-4556-bae4-7399ea12942fn@googlegroups.com>
Subject: Re: UTF-8 strings in C
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 22 Jan 2022 12:48:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2572

by: Malcolm McLean - Sat, 22 Jan 2022 12:48 UTC

On Saturday, 22 January 2022 at 09:44:52 UTC, Otto J. Makela wrote:
> Philipp Klaus Krause <p...@spth.de> wrote:
>
> > On 14.01.22 15:07, Otto J. Makela wrote:
> >> How about this? It of course does not count the slightly odder
> >> compositions (like flags) correctly.
> >
> > Why not just call mblen() in the loop instead?
> Firstly, I will fully admit I wasn't aware of its existence.
>
> Secondly, on quick tests I can't seem to find the settings it needs to
> be happy even with 2-byte characters. Apparently I don't understand it,
> or my environment isn't correctly set up?
>
> ----
>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main() {
> unsigned char str[10]="Ä";
>
> printf("'%s' %02x %02x %02x %d\n",
> str,str[0],str[1],str[3],mblen(str,10));
> exit(0);
> }
>
> ----
>
> Produces output:
>
> 'Ä' c3 84 00 -1
>
Something is wrong.
Try resetting the "shift state" with mblen(NULL, 0);

I doubt that's the problem, but it's worth a try.

Re: UTF-8 strings in C

<ICTGJ.53031$Y01.30902@fx45.iad>

https://www.novabbs.com/devel/article-flat.php?id=20103&group=comp.lang.c#20103

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx45.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.5.0
Subject: Re: UTF-8 strings in C
Content-Language: en-US
Newsgroups: comp.lang.c
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
<87k0es9kag.fsf@tigger.extechop.net>
<a47139af-611d-4556-bae4-7399ea12942fn@googlegroups.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <a47139af-611d-4556-bae4-7399ea12942fn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 49
Message-ID: <ICTGJ.53031$Y01.30902@fx45.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sat, 22 Jan 2022 08:27:04 -0500
X-Received-Bytes: 2848

by: Richard Damon - Sat, 22 Jan 2022 13:27 UTC

On 1/22/22 7:48 AM, Malcolm McLean wrote:
> On Saturday, 22 January 2022 at 09:44:52 UTC, Otto J. Makela wrote:
>> Philipp Klaus Krause <p...@spth.de> wrote:
>>
>>> On 14.01.22 15:07, Otto J. Makela wrote:
>>>> How about this? It of course does not count the slightly odder
>>>> compositions (like flags) correctly.
>>>
>>> Why not just call mblen() in the loop instead?
>> Firstly, I will fully admit I wasn't aware of its existence.
>>
>> Secondly, on quick tests I can't seem to find the settings it needs to
>> be happy even with 2-byte characters. Apparently I don't understand it,
>> or my environment isn't correctly set up?
>>
>> ----
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main() {
>> unsigned char str[10]="Ä";
>>
>> printf("'%s' %02x %02x %02x %d\n",
>> str,str[0],str[1],str[3],mblen(str,10));
>> exit(0);
>> }
>>
>> ----
>>
>> Produces output:
>>
>> 'Ä' c3 84 00 -1
>>
> Something is wrong.
> Try resetting the "shift state" with mblen(NULL, 0);
>
> I doubt that's the problem, but it's worth a try.

Could it be that the library isn't in a 'Unicode' mode but in an 'ASCII' mode?

I thought that mblen() was locale-dependent, and if the default 'C' locale
defines the character set as 'ASCII', then the first byte, 0xC3, is in
fact an invalid character.

One of the limits of the C language is that it does not (necessarily)
default to assuming Unicode, so the code may need a call that sets the
locale to an implementation-specific one that uses Unicode, if one
exists (which isn't required).

Re: UTF-8 strings in C

<mblen-20220122163758@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=20106&group=comp.lang.c#20106

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: 22 Jan 2022 15:39:23 GMT
Organization: Stefan Ram
Lines: 35
Expires: 1 Apr 2022 11:59:58 GMT
Message-ID: <mblen-20220122163758@ram.dialup.fu-berlin.de>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de> <srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org> <srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me> <srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net> <87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org> <87k0es9kag.fsf@tigger.extechop.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de Py34EXOwYWduFlzoXbt1lgDUrNbOSWSutK8HEuM0TE34tp
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR

by: Stefan Ram - Sat, 22 Jan 2022 15:39 UTC

om@iki.fi (Otto J. Makela) writes (non-ASCII characters guessed):
>#include <stdio.h>
>#include <stdlib.h>
>int main() {
> unsigned char str[10]="Ä";
> printf("'%s' %02x %02x %02x %d\n",
> str,str[0],str[1],str[3],mblen(str,10));
....
>'Ä' c3 84 00 -1

Here:

main.c

#include <stdint.h> /* SIZE_MAX */
#include <stdio.h> /* printf */
#include <stdlib.h> /* mblen */
#include <string.h> /* strlen */

int main( void )
{ char const s[ 10 ]= "Ä";
printf( "s = \"%s\"\n", s );
for( size_t i = 0; i < strlen( s )+ 1; ++i )
printf( "s[ %zu ]= %02X\n", i, 0xFF & s[ i ]);
printf( "mblen( s, SIZE_MAX )= %d\n", mblen( s, SIZE_MAX )); }

transcript

s = "Ä"
s[ 0 ]= C3
s[ 1 ]= 84
s[ 2 ]= 00
mblen( s, SIZE_MAX )= 1

Re: UTF-8 strings in C

<sshlmu$vqi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20111&group=comp.lang.c#20111

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jameskuy...@alumni.caltech.edu (James Kuyper)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sat, 22 Jan 2022 14:25:16 -0500
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <sshlmu$vqi$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
<87k0es9kag.fsf@tigger.extechop.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 22 Jan 2022 19:25:18 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dbae1492b85e78a8d748e7d245fceb20";
logging-data="32594"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX194+hndIadg1jbIADO1iUnSxXSTXX6Y19A="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:dj1ggeJ0AlZZCbLppTy8BTUoiJI=
In-Reply-To: <87k0es9kag.fsf@tigger.extechop.net>
Content-Language: en-US

by: James Kuyper - Sat, 22 Jan 2022 19:25 UTC

On 1/22/22 04:44, Otto J. Makela wrote:
> #include <stdio.h>
> #include <stdlib.h>
>
> int main() {
> unsigned char str[10]="Ä";
>
> printf("'%s' %02x %02x %02x %d\n",
> str,str[0],str[1],str[3],mblen(str,10));
> exit(0);
> }

The set of valid extended characters is locale-dependent (5.2.1p1). In
the "C" locale on my system, there are none - only the basic character
set is supported. It's not clear to me that the C standard requires
this, but it certainly allows it.

You need to use setlocale() to select a locale that supports UTF-8.
Which locales are supported depends upon your implementation. On my
platform, the default locale (the one chosen if the second argument of
setlocale() is "") is determined by the LANG environment variable, which I
have set to "en_US.UTF-8". When I insert setlocale(LC_CTYPE, "") in your
program before the mblen() call, I get the following results:

'Ä' c3 84 00 2
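
A hedged sketch of how a program might look for a usable locale at run time. The function name select_utf8_locale and the candidate names "C.UTF-8" and "en_US.UTF-8" are assumptions for illustration; which names exist depends on the implementation. The probe is only a heuristic: it accepts a locale if mblen() decodes the three-byte UTF-8 encoding of the euro sign as a single character.

```c
#include <locale.h>   /* setlocale, LC_CTYPE */
#include <stddef.h>
#include <stdlib.h>   /* mblen */

/* Try candidate locale names until one decodes a three-byte UTF-8 sequence.
   Returns the name that worked, or NULL if none did.
   The names after "" are guesses; availability is implementation-specific. */
const char *select_utf8_locale(void)
{
    static const char *const candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    static const char probe[] = "\xE2\x82\xAC";   /* U+20AC encoded as UTF-8 */
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) == NULL)
            continue;                             /* locale not available */
        (void)mblen(NULL, 0);                     /* reset any shift state */
        if (mblen(probe, 3) == 3)
            return candidates[i];                 /* probe decoded as one character */
    }
    setlocale(LC_CTYPE, "C");                     /* leave a known state behind */
    return NULL;
}
```

If it returns NULL, no candidate worked and LC_CTYPE is back in the "C" locale.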

Re: UTF-8 strings in C

<87lez1pg6c.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=20235&group=comp.lang.c#20235

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: om...@iki.fi (Otto J. Makela)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Thu, 27 Jan 2022 13:32:43 +0200
Organization: Games and Theory
Lines: 25
Message-ID: <87lez1pg6c.fsf@tigger.extechop.net>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
<87k0es9kag.fsf@tigger.extechop.net> <sshlmu$vqi$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="2210ec353eac979e98d0323d717205c9";
logging-data="16054"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19X4/ng4ddi9G4GVeZVaBB/"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:8zq0SU2qXL/V6kdLuz4Rqg8NzPE=
sha1:MFB1369dZGq25bAV/D/Sbj88blw=
X-Face: 'g'S,X"!c;\pfvl4ljdcm?cDdk<-Z;`x5;YJPI-cs~D%;_<\V3!3GCims?a*;~u$<FYl@"E
c?3?_J+Zwn~{$8<iEy}EqIn_08"`oWuqO$#(5y3hGq8}BG#sag{BL)u8(c^Lu;*{8+'Z-k\?k09ILS
X-URL: http://www.iki.fi/om/
Mail-Copies-To: never

by: Otto J. Makela - Thu, 27 Jan 2022 11:32 UTC

James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

> The set of valid extended characters is locale-dependent (5.2.1p1). In
> the "C" locale on my system, there are none - only the basic character
> set is supported. It's not clear to me that the C standard requires
> this, but it certainly allows it.
>
> You need to use setlocale() to select a locale that supports UTF-8.
> Which locales are supported depends upon your implementation. On my
> platform, the default locale (the one chosen if the second argument of
> setlocale() is "") is determined by the LANG environment variable, which I
> have set to "en_US.UTF-8". When I insert setlocale(LC_CTYPE, "") in
> your program before the mblen() call, I get the following results:
>
> 'Ä' c3 84 00 2

Yep, that setlocale() call did it. I actually have the semi-joke value
of LANG=en_DK.UTF-8 in my environment, which produces good results for me.

https://unix.stackexchange.com/questions/62316/why-is-there-no-euro-english-locale
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

Re: UTF-8 strings in C

<87k0elpg4l.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=20236&group=comp.lang.c#20236

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: om...@iki.fi (Otto J. Makela)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Thu, 27 Jan 2022 13:33:46 +0200
Organization: Games and Theory
Lines: 15
Message-ID: <87k0elpg4l.fsf@tigger.extechop.net>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
<87k0es9kag.fsf@tigger.extechop.net>
<mblen-20220122163758@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="2210ec353eac979e98d0323d717205c9";
logging-data="16054"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dflenM1NNFm/OBiChSVNn"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:kNYU4P64qnUl4a5LCP3jTkiPa78=
sha1:OKkxBMVSBjUzXC1Zb3c1Fw/wds4=
X-URL: http://www.iki.fi/om/
Mail-Copies-To: never

by: Otto J. Makela - Thu, 27 Jan 2022 11:33 UTC

ram@zedat.fu-berlin.de (Stefan Ram) wrote:

> Here:

The end result is exactly the same as with my program,
mblen() returns -1.

James Kuyper <jameskuyper@alumni.caltech.edu> understood it: I need to
ask for the default locale. I'm a bit stumped as to why.

Re: UTF-8 strings in C

<ssug22$pd6$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=20238&group=comp.lang.c#20238


Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jameskuy...@alumni.caltech.edu (James Kuyper)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Thu, 27 Jan 2022 11:08:33 -0500
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <ssug22$pd6$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net>
<87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org>
<87k0es9kag.fsf@tigger.extechop.net>
<mblen-20220122163758@ram.dialup.fu-berlin.de>
<87k0elpg4l.fsf@tigger.extechop.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 27 Jan 2022 16:08:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="28fe711c904b0c3d9c90180e5f4cb6f9";
logging-data="26022"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ILBg9gfj/U6WVgQo/mvVk99kY1QfhuMQ="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:vIgyquhKlV8oRh5u/wSCUjwwU8Q=
In-Reply-To: <87k0elpg4l.fsf@tigger.extechop.net>
Content-Language: en-US

by: James Kuyper - Thu, 27 Jan 2022 16:08 UTC

On 1/27/22 06:33, Otto J. Makela wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) wrote:
>
>> Here:
>
> The end result is exactly the same as with my program,
> mblen() returns -1.
>
> James Kuyper <jameskuyper@alumni.caltech.edu> understood it, I need to
> ask for the default locale. I'm a bit stumped as to why so.

Note: while I referred to it as the "default" locale, that's a bad term
for it, since the "C" locale is the true default:

"At program startup, the equivalent of
setlocale(LC_ALL, "C");
is executed." (7.11.1.1p4)
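
The current setting can be inspected without changing it, because a null
pointer as the second argument turns setlocale() into a query. A small
sketch follows; current_locale_is_c is a hypothetical helper, and since the
standard leaves the exact returned string to the implementation, accepting
"C" or "POSIX" is an assumption that holds on common Unix systems.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the program's current locale is
   reported as "C" (or the POSIX alias "POSIX"), 0 otherwise. */
int current_locale_is_c(void)
{
    const char *name = setlocale(LC_ALL, NULL);   /* query only */

    return name != NULL &&
           (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}
```

Called at the top of main(), before any other setlocale() call, this would
be expected to return 1 on such systems.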

"... a value of "" for locale specifies the locale-specific native
environment." (7.11.1.1p3), so it would be better to call it the
"native" locale.

"In the "C" locale, isalpha returns true only for the characters for
which isupper or islower is true." (7.4.1.2p2)

"In the "C" locale, islower returns true only for the lowercase letters
(as defined in 5.2.1)." (7.4.1.7p2)

"In the "C" locale, isupper returns true only for the uppercase letters
(as defined in 5.2.1)." (7.4.1.11p2)

"... the 26 _uppercase letters_ of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 _lowercase letters_ of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z" (5.2.1p2)

Note that the phrases "uppercase letters" and "lowercase letters" are in
italics, an ISO convention indicating that the clause in which they
appear constitutes the official definition for those terms within this
document.

Thus, isalpha() can only be true in the "C" locale for the 26 letters of
the Latin alphabet. That's not quite the same as requiring that those be
the only valid character encodings. Note, in particular, that it's not
possible to pass a member of the extended character set to isalpha() if
that member's encoding takes more than 1 byte, so nothing the standard
says about isalpha() can constrain how such characters are treated. Such
a character could be converted to wchar_t and passed to iswalpha(), but
that function has no special specifications for the "C" locale. However,
those facts certainly don't encourage support for additional valid
character encodings in the "C" locale.
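
The isalpha() part of this is easy to check. The sketch below (the function
name is made up for illustration) counts how many unsigned char values
isalpha() accepts once the "C" locale is selected explicitly. Given the
clauses quoted above, the count should be 52: the 26 uppercase plus the 26
lowercase letters.

```c
#include <ctype.h>
#include <limits.h>
#include <locale.h>

/* Hypothetical helper: count the single-byte values for which isalpha()
   is true in the "C" locale. */
int count_alpha_in_c_locale(void)
{
    int count = 0;

    setlocale(LC_CTYPE, "C");
    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (isalpha(c))
            count++;
    }
    return count;
}
```

Running the same loop after setlocale(LC_CTYPE, "") may give a larger count
in a single-byte locale such as ISO 8859-1, but it says nothing about
multibyte characters, which cannot be passed to isalpha() at all.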

Re: UTF-8 strings in C

<86o83nkv5b.fsf@linuxsc.com>


https://www.novabbs.com/devel/article-flat.php?id=20332&group=comp.lang.c#20332


Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Thu, 03 Feb 2022 22:23:12 -0800
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <86o83nkv5b.fsf@linuxsc.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de> <srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org> <srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me> <srhcti$tvd$1@dont-email.me> <87lezih5b5.fsf@tigger.extechop.net> <87h7a6h0lp.fsf@tigger.extechop.net> <srtvd7$143n0$2@solani.org> <87k0es9kag.fsf@tigger.extechop.net> <mblen-20220122163758@ram.dialup.fu-berlin.de> <87k0elpg4l.fsf@tigger.extechop.net> <ssug22$pd6$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader02.eternal-september.org; posting-host="e5e1159670f97e706efea2e630831efe";
logging-data="12448"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GicSjSztcParaqDg0PjC1qCaJnZjCws0="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:3Xt8QHNCQmACAU2f1dFM3Qn1pq4=
sha1:2iO0VQM2tSgw9n4u9SQHzbfMEiY=

by: Tim Rentsch - Fri, 4 Feb 2022 06:23 UTC

James Kuyper <jameskuyper@alumni.caltech.edu> writes:

> On 1/27/22 06:33, Otto J. Makela wrote:
>
>> ram@zedat.fu-berlin.de (Stefan Ram) wrote:
>>
>>> Here:
>>
>> The end result is exactly the same as with my program,
>> mblen() returns -1.
>>
>> James Kuyper <jameskuyper@alumni.caltech.edu> understood it, I need to
>> ask for the default locale. I'm a bit stumped as to why so.
>
> Note: while I referred to it as the "default" locale, that's a bad
> term for it, since the "C" locale is the true default:
>
> "At program startup, the equivalent of
> setlocale(LC_ALL, "C");
> is executed." (7.11.1.1p4)
>
> "... a value of "" for locale specifies the locale-specific native
> environment." (7.11.1.1p3), so it would be better to call it the
> "native" locale.
>
> "In the "C" locale, isalpha returns true only for the characters for
> which isupper or islower is true." (7.4.1.2p2)
>
> "In the "C" locale, islower returns true only for the lowercase
> letters (as defined in 5.2.1)." (7.4.1.7p2)
>
> "In the "C" locale, isupper returns true only for the uppercase
> letters (as defined in 5.2.1)." (7.4.1.11p2)
>
> "... the 26 _uppercase letters_ of the Latin alphabet
> A B C D E F G H I J K L M
> N O P Q R S T U V W X Y Z
>
> the 26 _lowercase letters_ of the Latin alphabet
> a b c d e f g h i j k l m
> n o p q r s t u v w x y z" (5.2.1p2)
>
> Note that the phrases "uppercase letters" and "lowercase letters"
> are in italics, an ISO convention indicating that the clause in
> which they appear constitutes the official definition for those
> terms within this document.
>
> Thus, isalpha() can only be true in the "C" locale for the 26
> letters of the Latin alphabet. [...]

What I think you mean is that in the "C" locale isalpha() can be
true only for the 26 letters of the Latin alphabet.

Re: UTF-8 strings in C

<86tudaip8s.fsf@linuxsc.com>


https://www.novabbs.com/devel/article-flat.php?id=20418&group=comp.lang.c#20418


Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 07 Feb 2022 09:02:43 -0800
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <86tudaip8s.fsf@linuxsc.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de> <srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org> <srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me> <srhcti$tvd$1@dont-email.me> <87o84eh5nv.fsf@tigger.extechop.net> <afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com> <srrul5$c2h$1@dont-email.me> <2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com> <srs07u$cr$1@gioia.aioe.org> <srs269$4n1$2@dont-email.me> <srs7nj$62u$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader02.eternal-september.org; posting-host="792c2a82c28252afd7e8a94f82466189";
logging-data="18554"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/2d/rDlMOhibAstQqe5anWkMa4jaMe5hI="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:cS1hOlp3K/4iaHSgkq9idhZYB74=
sha1:JgufvIE/5ujn0GNNzO0zfJVAY8I=

by: Tim Rentsch - Mon, 7 Feb 2022 17:02 UTC

Mateusz Viste <mateusz@xyz.invalid> writes:

> 2022-01-14 at 15:43 +0100, Bonita Montero wrote:
>
>> Am 14.01.2022 um 15:10 schrieb Mateusz Viste:
>>
>>> 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
>>>
>>>> Yes. It should compile cleanly. That's not the objection.
>>>
>>> clang with -Weverything (or -Wsign-conversion) does warn about
>>> this.
>>>
>>> warning: implicit conversion changes signedness: 'int' to 'size_t'
>>> (aka 'unsigned long') [-Wsign-conversion]
>>
>> Stupid warning. All compilers widen the data to the size of the
>> destination type and then convert it.
>
> The warning points out a clear inconsistency. If that's *really*
> what the programmer wanted, then an explicit cast should clear out
> the warning. Otherwise it is likely to be a human error, and
> warning about that is never stupid.

Using a cast to avoid a warning for mixing signed and unsigned
values is a fairly common practice. I believe that practice
leads to worse results than simply leaving the code alone (unless
of course the warning does reveal a true problem, in which case
the problem should be fixed).

First, warnings for signed/unsigned mismatch yield more false
positives than true positives IME.

Second, in most cases a cast is an accident waiting to happen.
If subsequently a type is changed there is a chance that the cast
will convert to a type that is not appropriate for the context.

Third, if a type change does cause a cast to go wrong, then
having used a cast virtually guarantees that no warning will
be generated. Using a cast is like putting black electrical
tape over a blinking red warning light - you don't see the light
any more, but that doesn't mean there is no danger.

Fourth, in some cases mixing (or potentially mixing) signed and
unsigned types is almost inevitable. To give an example,
recently I had occasion to write code something like this:

Count r = -1;

while( r++, ..something.. ){
..whatever needed doing..
}

return r;

As it happens 'Count' was a typedef for an unsigned type, but it
may be a different type in different environments, including the
possibility of being a signed type at some point in the future.
But this code doesn't need to know that: the same code works
whether Count is a signed type or an unsigned type. Furthermore
that result is not just happenstance.
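
A self-contained version of that pattern, as a sketch only: the typedef,
the function name count_leading_digits and the loop condition are stand-ins
for the parts elided above. If Count is unsigned, -1 converts to the type's
maximum value and the first r++ wraps it to 0; if Count is signed, r simply
goes from -1 to 0. Either way the first test sees r == 0.

```c
/* Assumption for this sketch: Count happens to be unsigned here, but the
   function below is meant to work unchanged if it becomes a signed type. */
typedef unsigned int Count;

/* Hypothetical example: number of decimal digits at the start of s. */
Count count_leading_digits(const char *s)
{
    Count r = -1;

    /* The comma operator sequences the increment before the test. */
    while (r++, s[r] >= '0' && s[r] <= '9') {
        /* nothing else needs doing in this sketch */
    }

    return r;
}
```

Changing the typedef to plain int leaves the results the same, which is the
point: the code does not need to know the signedness of Count, and no cast
is involved.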

It's important to distinguish between unacceptable warnings and
advisory warnings. Some warning conditions give so few false
positives that it is advisable to treat them as absolute, just
the same as though they were errors. Other warning conditions
are more an indication that there might be a problem, but also
there is a good chance that there isn't. Mixing signed values and
unsigned values definitely falls into the latter category, and
should be treated accordingly.

Corollary: any case where a cast is used to avoid a warning should
be viewed with suspicion and a skeptical eye.
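
To make the second and third points concrete, here is a hedged sketch with
made-up names (Length, store_length). Imagine the cast was written when
Length was a 16-bit type and was added only to quiet a conversion warning.
After the typedef is widened, the cast silently discards the upper bits,
and because the conversion is explicit a compiler has no reason to diagnose
it.

```c
#include <stdint.h>

/* Hypothetical: this used to be uint16_t and was widened later. */
typedef uint32_t Length;

/* The cast was harmless while Length was 16 bits wide. Now it reduces
   the value modulo 65536 without any diagnostic. */
uint16_t store_length(Length n)
{
    return (uint16_t)n;
}
```

Without the cast, an option such as -Wconversion would be expected to flag
the narrowing return once Length grew, which is the blinking red light the
cast covers up.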

"Pull the wool over your own eyes!" -- J. R. "Bob" Dobbs
