UTF-8 strings in C

<strlen-20220108201256@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=19897&group=comp.lang.c#19897

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.c
Subject: UTF-8 strings in C
Date: 8 Jan 2022 19:15:21 GMT
Organization: Stefan Ram
Lines: 22
Expires: 1 Apr 2022 11:59:58 GMT
Message-ID: <strlen-20220108201256@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de 632Tsg/gfUjs6dEpDENlHwbzmMC+U4H0eBhRFjyVEOFW27
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Accept-Language: de-DE, en-US, it, fr-FR
 by: Stefan Ram - Sat, 8 Jan 2022 19:15 UTC

Mateusz Viste <mateusz@xyz.invalid> writes:
>While UTF-8 is neat, it is also complex to decode. Even a simple
>strlen() can be challenging.

In the case of US-ASCII strings that do not contain a NUL
character, strlen gives both the number of characters and
the size of the memory required (minus one).

Even the handling of US-ASCII strings can already become
difficult in C when they contain a NUL character.

In the case of UTF-8, the two aspects "memory size" and
"character count" might differ. It even becomes difficult
to define "character" (when several code points can be
combined to form one, er, glyph).

In C, "strlen" sometimes is used to merely learn about the
memory requirements, not the number of characters. In this
case, strlen still can be used as is because except for the
encoding of NUL, UTF-8 does not use a zero byte anywhere.
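
A minimal sketch of the two lengths described above, assuming valid, NUL-terminated UTF-8 input. strlen reports the byte count; the helper utf8_codepoints (a name made up for this illustration) counts code points by skipping continuation bytes of the form 10xxxxxx. It does not attempt to count glyphs.

```c
#include <stddef.h>
#include <string.h>

/* Count code points by counting every byte that is NOT a continuation
   byte (10xxxxxx). Assumes the input is valid, NUL-terminated UTF-8;
   malformed input is not detected. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For "na\xC3\xAFve" (the word "naïve" with a precomposed ï), strlen gives 6 bytes while utf8_codepoints gives 5 code points.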

Re: UTF-8 strings in C

<y23p78lt.fsf@yahoo.com>

https://www.novabbs.com/devel/article-flat.php?id=19904&group=comp.lang.c#19904

Path: i2pn2.org!i2pn.org!aioe.org!K/GcOpVz8Y6R73fFRNUdsw.user.46.165.242.91.POSTED!not-for-mail
From: luang...@yahoo.com (Po Lu)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sun, 09 Jan 2022 13:59:10 +0800
Organization: Aioe.org NNTP Server
Message-ID: <y23p78lt.fsf@yahoo.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="39791"; posting-host="K/GcOpVz8Y6R73fFRNUdsw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (haiku)
Cancel-Lock: sha1:y853yqQx25Vg+Qw++Cv9MwfFNKs=
X-Notice: Filtered by postfilter v. 0.9.2
 by: Po Lu - Sun, 9 Jan 2022 05:59 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:

> It even becomes difficult to define "character" (when several code
> points can be combined to form one, er, glyph).

The distinction between character and glyph (text shaping, compositions,
and maybe even bidirectional reordering) is not really important for
most programmers.

IMO, it's best to use the usual definition of character in a multibyte
encoding (a single code point) unless you're specifically into what I
mentioned above.

Re: UTF-8 strings in C

<gIACJ.154456$SR4.134163@fx43.iad>

https://www.novabbs.com/devel/article-flat.php?id=19907&group=comp.lang.c#19907

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: UTF-8 strings in C
Content-Language: en-US
Newsgroups: comp.lang.c
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <y23p78lt.fsf@yahoo.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 24
Message-ID: <gIACJ.154456$SR4.134163@fx43.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 07:39:40 -0500
X-Received-Bytes: 2083
 by: Richard Damon - Sun, 9 Jan 2022 12:39 UTC

On 1/9/22 12:59 AM, Po Lu wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>
>> It even becomes difficult to define "character" (when several code
>> points can be combined to form one, er, glyph).
>
> The distinction between character and glyph (text shaping, compositions,
> and maybe even bidirectional reordering) is not really important for
> most programmers.
>
> IMO, it's best to use the usual definition of character in a multibyte
> encoding (a single code point) unless you're specifically into what I
> mentioned above.

The issue is that Unicode 'Combining Characters', where a glyph can
be represented as two code points, one being the 'Basic Character' and
the second being an accent added to it, mean that almost all of the
advantages of using 'wide' characters are gone. You can no longer just
break the string at any character boundary and not need to worry about
affecting the presentation of the string.

If you don't need that sort of operation, then multi-byte encodings
tend to be better than the wide (and supposedly single-unit) encodings,
as they tend to be more compact.
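
To make the breaking problem concrete, here is a rough sketch of a helper that moves a proposed split position back so that it neither cuts a multibyte sequence nor separates a base character from a following accent. The names are invented for this example, and it only recognises the Combining Diacritical Marks block (U+0300 to U+036F); a real implementation would need the full Unicode segmentation rules. Input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* True if p points at a UTF-8 encoded code point in U+0300..U+036F.
   Only this one block is checked; other combining ranges are
   deliberately ignored in this sketch. */
static int starts_with_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF; /* U+0300..U+033F */
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF; /* U+0340..U+036F */
    return 0;
}

/* Move a proposed byte offset back until it is at the start of a code
   point that is not a combining mark (or at offset 0). */
size_t safe_break(const char *s, size_t pos)
{
    const unsigned char *u = (const unsigned char *)s;
    size_t len = strlen(s);
    if (pos >= len)
        return len;
    for (;;) {
        while (pos > 0 && (u[pos] & 0xC0) == 0x80)
            pos--;                      /* back to the start of a code point */
        if (pos == 0 || !starts_with_combining_mark(u + pos))
            return pos;
        pos--;                          /* step into the previous code point */
    }
}
```

With the string "e\xCC\x81x" (e, U+0301 combining acute, x), a split requested at byte 1 or 2 is moved back to 0, so the accent stays attached to its base letter.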

Re: UTF-8 strings in C

<ilutt44s.fsf@yahoo.com>

https://www.novabbs.com/devel/article-flat.php?id=19909&group=comp.lang.c#19909

Path: i2pn2.org!i2pn.org!aioe.org!K/GcOpVz8Y6R73fFRNUdsw.user.46.165.242.91.POSTED!not-for-mail
From: luang...@yahoo.com (Po Lu)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sun, 09 Jan 2022 21:44:51 +0800
Organization: Aioe.org NNTP Server
Message-ID: <ilutt44s.fsf@yahoo.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com> <gIACJ.154456$SR4.134163@fx43.iad>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="44449"; posting-host="K/GcOpVz8Y6R73fFRNUdsw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (haiku)
X-Notice: Filtered by postfilter v. 0.9.2
Cancel-Lock: sha1:ose0KDqUnxYNrIVv2+UWB9NgkXo=
 by: Po Lu - Sun, 9 Jan 2022 13:44 UTC

Richard Damon <Richard@Damon-Family.org> writes:

> The issue is that Unicode 'Combining Characters', where a glyph
> can be represented as two code points, one being the 'Basic
> Character' and the second being an accent added to it, mean that
> almost all of the advantages of using 'wide' characters are gone. You
> can no longer just break the string at any character boundary and not
> need to worry about affecting the presentation of the string.

I would only worry about that if the string was going to be displayed to
the user at some point, and it was significantly likely to contain such
sequences. For most use cases, they aren't important enough to worry
about, IME.

> If you don't need that sort of operation, then multi-byte encodings
> tend to be better than the wide (and supposedly single-unit)
> encodings, as they tend to be more compact.

Agreed.

Re: UTF-8 strings in C

<4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19910&group=comp.lang.c#19910

X-Received: by 2002:a05:6214:27e5:: with SMTP id jt5mr64899981qvb.113.1641743181451;
Sun, 09 Jan 2022 07:46:21 -0800 (PST)
X-Received: by 2002:a37:9d2:: with SMTP id 201mr42172100qkj.9.1641743181299;
Sun, 09 Jan 2022 07:46:21 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 07:46:21 -0800 (PST)
In-Reply-To: <ilutt44s.fsf@yahoo.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:f0d3:a0d6:2956:d5fb;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:f0d3:a0d6:2956:d5fb
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com> <gIACJ.154456$SR4.134163@fx43.iad> <ilutt44s.fsf@yahoo.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>
Subject: Re: UTF-8 strings in C
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sun, 09 Jan 2022 15:46:21 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 19
 by: Malcolm McLean - Sun, 9 Jan 2022 15:46 UTC

On Sunday, 9 January 2022 at 13:44:38 UTC, Po Lu wrote:
> Richard Damon <Ric...@Damon-Family.org> writes:
>
> > The issue is that Unicode 'Combining Characters', where a glyph
> > can be represented as two code points, one being the 'Basic
> > Character' and the second being an accent added to it, mean that
> > almost all of the advantages of using 'wide' characters are gone. You
> > can no longer just break the string at any character boundary and not
> > need to worry about affecting the presentation of the string.
> I would only worry about that if the string was going to be displayed to
> the user at some point, and it was significantly likely to contain such
> sequences. For most use cases, they aren't important enough to worry
> about, IME.
>
The problem with that approach is that things seem to work perfectly,
then break unexpectedly when someone passes a rare character like
ash (a and e combined, you see it mainly in archaic texts in English).

Computer text processing is full of these little glitches. Usually they
are just an irritation, but occasionally the costs can be high.

Re: UTF-8 strings in C

<pmp0ml9u.fsf@yahoo.com>

https://www.novabbs.com/devel/article-flat.php?id=19921&group=comp.lang.c#19921

Path: i2pn2.org!i2pn.org!aioe.org!K/GcOpVz8Y6R73fFRNUdsw.user.46.165.242.91.POSTED!not-for-mail
From: luang...@yahoo.com (Po Lu)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 09:28:45 +0800
Organization: Aioe.org NNTP Server
Message-ID: <pmp0ml9u.fsf@yahoo.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com> <gIACJ.154456$SR4.134163@fx43.iad>
<ilutt44s.fsf@yahoo.com>
<4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="1063"; posting-host="K/GcOpVz8Y6R73fFRNUdsw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (haiku)
Cancel-Lock: sha1:40fBjonMkeGB+8mdA410FCLfB08=
X-Notice: Filtered by postfilter v. 0.9.2
 by: Po Lu - Mon, 10 Jan 2022 01:28 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> The problem with that approach is that things seem to work perfectly,
> then break unexpectedly when someone passes a rare character like
> ash (a and e combined, you see it mainly in archaic texts in English).

[...]

> Computer text processing is full of these little glitches. Usually they
> are just an irritation, but occasionally the costs can be high.

I agree with what you said above, but usually the irritation is rare
enough to be bearable.

I think your specific example is wrong, however: æ shows up here as
U+00E6, a single code point.

Or did I miss something? Thanks.
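
For reference, a small sketch of how a single code point such as U+00E6 becomes a multibyte UTF-8 sequence. The function name utf8_encode and its return convention (number of bytes written, 0 for values that cannot be encoded) are assumptions made for this example, not something from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point into out[0..3]; returns the number of bytes
   written, or 0 for surrogates and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* UTF-16 surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx and three continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding 0xE6 this way yields the two bytes C3 A6: one code point, two bytes, and no combining character involved.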

Re: UTF-8 strings in C

<87h7acz6yz.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=19923&group=comp.lang.c#19923

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Sun, 09 Jan 2022 17:59:16 -0800
Organization: None to speak of
Lines: 29
Message-ID: <87h7acz6yz.fsf@nosuchdomain.example.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com> <gIACJ.154456$SR4.134163@fx43.iad>
<ilutt44s.fsf@yahoo.com>
<4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="6a0abf6ae7cc7bd72dda36b89628ce4b";
logging-data="16447"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+5+FzOuJPaYyB5uVDlXmD"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:wdSdjYXQTvLm4i4GoVCsgefTLhw=
sha1:jE43y07COzkKCt9RvJQrpzm9nSM=
 by: Keith Thompson - Mon, 10 Jan 2022 01:59 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> On Sunday, 9 January 2022 at 13:44:38 UTC, Po Lu wrote:
>> Richard Damon <Ric...@Damon-Family.org> writes:
>> > The issue is that Unicode 'Combining Characters', where a glyph
>> > can be represented as two code points, one being the 'Basic
>> > Character' and the second being an accent added to it, mean that
>> > almost all of the advantages of using 'wide' characters are gone. You
>> > can no longer just break the string at any character boundary and not
>> > need to worry about affecting the presentation of the string.
>> I would only worry about that if the string was going to be displayed to
>> the user at some point, and it was significantly likely to contain such
>> sequences. For most use cases, they aren't important enough to worry
>> about, IME.
>>
> The problem with that approach is that things seem to work perfectly,
> then break unexpectedly when someone passes a rare character like
> ash (a and e combined, you see it mainly in archaic texts in English).

I don't think that's a particularly good example. Unicode encodes 'Æ'
as 0xc6 and 'æ' as 0xe6. As far as I know they can't even be specified
using combining characters.

> Computer text processing is full of these little glitches. Usually they
> are just an irritation, but occasionally the costs can be high.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: UTF-8 strings in C

<srgj1m$p9s$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19935&group=comp.lang.c#19935

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 07:17:26 +0100
Organization: A noiseless patient Spider
Lines: 1
Message-ID: <srgj1m$p9s$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 10 Jan 2022 06:17:26 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5e85ede43e7c7dd2a1ac376160a0e63a";
logging-data="25916"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18YhdFb5tK+0JoRai1KGsiUeBGYYinFFKs="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:8xtttnXisjuVLRbqXHaez5VM39g=
In-Reply-To: <strlen-20220108201256@ram.dialup.fu-berlin.de>
Content-Language: de-DE
 by: Bonita Montero - Mon, 10 Jan 2022 06:17 UTC

Then write your own UTF-8 strlen().

Re: UTF-8 strings in C

<srgpuo$u9o$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19937&group=comp.lang.c#19937

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 09:15:20 +0100
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <srgpuo$u9o$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com> <gIACJ.154456$SR4.134163@fx43.iad>
<ilutt44s.fsf@yahoo.com>
<4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>
<87h7acz6yz.fsf@nosuchdomain.example.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 10 Jan 2022 08:15:20 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d234e6344e91b940fc93671bf53053e0";
logging-data="31032"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Z1u4M1kkoZbXezKusp6mFYSCL2HfgST0="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:gYEcbVyOTAbcV2I1lRDKEMqLhYo=
In-Reply-To: <87h7acz6yz.fsf@nosuchdomain.example.com>
Content-Language: en-GB
 by: David Brown - Mon, 10 Jan 2022 08:15 UTC

On 10/01/2022 02:59, Keith Thompson wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>> On Sunday, 9 January 2022 at 13:44:38 UTC, Po Lu wrote:
>>> Richard Damon <Ric...@Damon-Family.org> writes:
>>>> The issue is that Unicode 'Combining Characters', where a glyph
>>>> can be represented as two code points, one being the 'Basic
>>>> Character' and the second being an accent added to it, mean that
>>>> almost all of the advantages of using 'wide' characters are gone. You
>>>> can no longer just break the string at any character boundary and not
>>>> need to worry about affecting the presentation of the string.
>>> I would only worry about that if the string was going to be displayed to
>>> the user at some point, and it was significantly likely to contain such
>>> sequences. For most use cases, they aren't important enough to worry
>>> about, IME.
>>>
>> The problem with that approach is that things seem to work perfectly,
>> then break unexpectedly when someone passes a rare character like
>> ash (a and e combined, you see it mainly in archaic texts in English).
>
> I don't think that's a particularly good example. Unicode encodes 'Æ'
> as 0xc6 and 'æ' as 0xe6. As far as I know they can't even be specified
> using combining characters.
>

It is also not that rare in some languages - it's a normal letter (the
27th in the alphabet) in Norwegian.

In English, you typically only see it in some imported Latin terms, such as
"curriculum vitæ" or "dæmon" - and then usually only in more archaic
texts, as Malcolm said.

A more interesting example is the "i" with diaeresis, "ï". The common
example in English is the word "naïve", which is typically spelt with a
normal "i" these days since most US and UK keyboards make it extremely
inconvenient to type non-ASCII letters. (Both spellings are, I think,
valid forms in British English.)

What makes this particularly fun is that when made with combining
characters of "i" plus double-dot accent, the "i" glyph must be changed
significantly.

Re: UTF-8 strings in C

<20220110112451.6008a58f71b1b174a6f29100@g{oogle}mail.com>

https://www.novabbs.com/devel/article-flat.php?id=19939&group=comp.lang.c#19939

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton....@g{oogle}mail.com (Anton Shepelev)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 11:24:51 +0300
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <20220110112451.6008a58f71b1b174a6f29100@g{oogle}mail.com>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<y23p78lt.fsf@yahoo.com>
<gIACJ.154456$SR4.134163@fx43.iad>
<ilutt44s.fsf@yahoo.com>
<4f88d005-6bf4-4edf-8d53-ee33325abf20n@googlegroups.com>
<87h7acz6yz.fsf@nosuchdomain.example.com>
<srgpuo$u9o$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="d5f8617c6c75cf53b36bf153b90122a7";
logging-data="24817"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+CZL9r3IhrgPqKK9Yr9sykoQuDpBhBTNE="
Cancel-Lock: sha1:RaTIoPNbxyujxpl8Q9LwtS8s+qc=
X-Newsreader: Sylpheed 3.5.0 (GTK+ 2.24.23; i686-pc-mingw32)
 by: Anton Shepelev - Mon, 10 Jan 2022 08:24 UTC

David Brown:

> A more interesting example is the "i" with diaeresis, "ï".
> The common example in English is the word "naïve", which
> is typically spelt with a normal "i" these days since most
> US and UK keyboards make it extremely inconvenient to type
> non-ASCII letters. (Both spellings are, I think, valid
> forms in British English.)

I think these non-ASCII letters look out of place in
English. See paragraph I, "The Naturalization of Foreign
Words", in:

Tract No. III of the Society for Pure English:
https://www.gutenberg.org/files/12390/12390-h/12390-h.htm

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Re: UTF-8 strings in C

<srgqvn$5gn$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19940&group=comp.lang.c#19940

Path: i2pn2.org!i2pn.org!aioe.org!8hiQobHKlOvsb2aWVVOzwA.user.46.165.242.75.POSTED!not-for-mail
From: mate...@xyz.invalid (Mateusz Viste)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 09:32:55 +0100
Organization: . . .
Message-ID: <srgqvn$5gn$1@gioia.aioe.org>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="5655"; posting-host="8hiQobHKlOvsb2aWVVOzwA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2
 by: Mateusz Viste - Mon, 10 Jan 2022 08:32 UTC

2022-01-10 at 07:17 +0100, Bonita Montero wrote:
> Then write your own UTF-8 strlen().

That wouldn't be CPU efficient. At this point it might be saner to
implement a special utf-8 string type that keeps the string's
render-width in a separate field.

Mateusz
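
A rough sketch of the kind of string type suggested above, so the count is computed once at construction instead of on every query. Everything here is an assumption made for illustration: the type and function names are invented, and the cached "width" is simply the code point count, which ignores combining marks and double-width characters. A true render width would need a per-character width table (something along the lines of POSIX wcwidth). Input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *bytes;   /* owned, NUL-terminated UTF-8 data */
    size_t size;    /* length in bytes, excluding the terminator */
    size_t width;   /* cached width; here just the code point count */
} utf8_string;

/* Copy src and cache both lengths. Returns 0 on success, -1 if the
   allocation fails. */
int utf8_string_init(utf8_string *s, const char *src)
{
    size_t size = strlen(src);
    char *copy = malloc(size + 1);
    if (!copy)
        return -1;
    memcpy(copy, src, size + 1);

    size_t width = 0;
    for (size_t i = 0; i < size; i++)
        if (((unsigned char)copy[i] & 0xC0) != 0x80)  /* skip continuation bytes */
            width++;

    s->bytes = copy;
    s->size  = size;
    s->width = width;
    return 0;
}

void utf8_string_free(utf8_string *s)
{
    free(s->bytes);
    s->bytes = NULL;
    s->size = s->width = 0;
}
```

The trade-off is the usual one: every operation that modifies the string has to keep the cached field in sync.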

Re: UTF-8 strings in C

<srh1kj$epo$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19943&group=comp.lang.c#19943

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 11:26:27 +0100
Organization: A noiseless patient Spider
Lines: 9
Message-ID: <srh1kj$epo$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 10 Jan 2022 10:26:27 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5e85ede43e7c7dd2a1ac376160a0e63a";
logging-data="15160"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+lzDlwX8RiTwfekgU7s0QEdB1rlbuIiMA="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:AK7zQYi8smtTcvaRWvp4l71KUZs=
In-Reply-To: <srgqvn$5gn$1@gioia.aioe.org>
Content-Language: de-DE
 by: Bonita Montero - Mon, 10 Jan 2022 10:26 UTC

Am 10.01.2022 um 09:32 schrieb Mateusz Viste:
> 2022-01-10 at 07:17 +0100, Bonita Montero wrote:
>> Then write your own UTF-8 strlen().
>
> That wouldn't be CPU efficient. ...

I'll bet it wouldn't be significantly slower than strlen(),
because memory latency is the limit here.

Re: UTF-8 strings in C

<srh3b9$q0m$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19944&group=comp.lang.c#19944

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 11:55:38 +0100
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <srh3b9$q0m$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 10 Jan 2022 10:55:37 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5e85ede43e7c7dd2a1ac376160a0e63a";
logging-data="26646"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18E/9s2dLH9vDcNZZweF27gae70L8r5VmI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:4Wi0YwhHQFhg+RWer5jgCHJgwDc=
In-Reply-To: <srh1kj$epo$1@dont-email.me>
Content-Language: de-DE
 by: Bonita Montero - Mon, 10 Jan 2022 10:55 UTC

Am 10.01.2022 um 11:26 schrieb Bonita Montero:
> Am 10.01.2022 um 09:32 schrieb Mateusz Viste:
>> 2022-01-10 at 07:17 +0100, Bonita Montero wrote:
>>> Then write your own UTF-8 strlen().
>>
>> That wouldn't be CPU efficient. ...

Ok, it would be slower, but I think this is the fastest possible
implementation (C++20):

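#include <bit>      // std::countl_zero (C++20)
#include <cstddef>  // std::size_t

using namespace std;

// Counts the code points in a NUL-terminated UTF-8 string.
// Returns (size_t)-1 if a byte cannot start a sequence.
// Continuation bytes are skipped unchecked, so a truncated
// sequence can step past the terminator.
size_t utf8Strlen( char const *str )
{
    struct encode_t
    {
        size_t lenIncr, strIncr;
    };
    // Indexed by the number of leading 1 bits in the lead byte.
    static encode_t const encodes[] =
    {
        { 1, 1 },  // 0xxxxxxx: ASCII
        { 0, 0 },  // 10xxxxxx: continuation byte, invalid as lead
        { 1, 2 },  // 110xxxxx
        { 1, 3 },  // 1110xxxx
        { 1, 4 },  // 11110xxx
        { 0, 0 },  // five or more leading ones: invalid
        { 0, 0 },
        { 0, 0 },
        { 0, 0 }
    };
    size_t len = 0;
    for( unsigned char c; (c = *str); )
    {
        encode_t const &enc = encodes[(size_t)countl_zero<unsigned char>( ~c )];
        if( !enc.lenIncr ) [[unlikely]]
            return -1;
        len += enc.lenIncr;
        str += enc.strIncr;
    }
    return len;
}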

Re: UTF-8 strings in C

<srhcti$tvd$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19952&group=comp.lang.c#19952

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Mon, 10 Jan 2022 14:38:58 +0100
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <srhcti$tvd$1@dont-email.me>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 10 Jan 2022 13:38:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5e85ede43e7c7dd2a1ac376160a0e63a";
logging-data="30701"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18u1gQ3D2k1+0KJM7Pu2gChy5BXc3SruhU="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:t+txZBnNcmtJsEYDcezZLQPje7c=
In-Reply-To: <srh3b9$q0m$1@dont-email.me>
Content-Language: de-DE
 by: Bonita Montero - Mon, 10 Jan 2022 13:38 UTC

Am 10.01.2022 um 11:55 schrieb Bonita Montero:
> Am 10.01.2022 um 11:26 schrieb Bonita Montero:
>> Am 10.01.2022 um 09:32 schrieb Mateusz Viste:
>>> 2022-01-10 at 07:17 +0100, Bonita Montero wrote:
>>>> Then write your own UTF-8 strlen().
>>>
>>> That wouldn't be CPU efficient. ...
>
> Ok, it would be slower, but I think this is the fastest possible
> implementation (C++20):

This is more correct since it also detects invalid continuation bytes:

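#include <bit>      // std::countl_zero (C++20)
#include <cstddef>  // std::size_t

using namespace std;

// Counts the code points in a NUL-terminated UTF-8 string.
// Returns (size_t)-1 on an invalid lead or continuation byte.
size_t utf8Strlen( char const *str )
{
    struct encode_t { size_t lenIncr, strIncr; };
    // Indexed by the number of leading 1 bits in the lead byte.
    static encode_t const encodes[] =
    {
        { 1, 1 },  // 0xxxxxxx: ASCII
        { 0, 0 },  // 10xxxxxx: continuation byte, invalid as lead
        { 1, 2 },  // 110xxxxx
        { 1, 3 },  // 1110xxxx
        { 1, 4 },  // 11110xxx
        { 0, 0 },  // five or more leading ones: invalid
        { 0, 0 },
        { 0, 0 },
        { 0, 0 }
    };
    size_t len = 0;
    for( unsigned char c; (c = *str); )
    {
        encode_t const &enc = encodes[(size_t)countl_zero<unsigned char>( ~c )];
        if( !enc.lenIncr ) [[unlikely]]
            return -1;
        len += enc.lenIncr;
        // Every byte after the lead byte must look like 10xxxxxx;
        // the terminator fails this test, so the loop cannot overrun.
        for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
            if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
                return -1;
    }
    return len;
}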
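
Checking the lead byte and the 10xxxxxx pattern still lets a few malformed inputs through: overlong forms such as C0 80, UTF-16 surrogates such as ED A0 80, lead bytes F5 to F7, and values above U+10FFFF. Since this is comp.lang.c, here is a hedged C sketch of a stricter counter that follows the well-formed byte ranges given in the Unicode standard. The names utf8_strlen_strict and UTF8_INVALID are invented for this example, and no claim is made about its speed.

```c
#include <stddef.h>

#define UTF8_INVALID ((size_t)-1)

/* Count code points in a NUL-terminated string, rejecting anything
   that is not well-formed UTF-8. Returns UTF8_INVALID on error. */
size_t utf8_strlen_strict(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    size_t len = 0;

    while (*p) {
        unsigned char c = *p++;
        unsigned lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */
        int extra;                       /* continuation bytes expected */

        if (c < 0x80) {
            extra = 0;
        } else if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;                   /* C0 and C1 would be overlong */
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            if (c == 0xE0) lo = 0xA0;    /* no overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;    /* no UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            if (c == 0xF0) lo = 0x90;    /* no overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;    /* nothing above U+10FFFF */
        } else {
            return UTF8_INVALID;         /* stray continuation, C0, C1, F5..FF */
        }

        for (int i = 0; i < extra; i++) {
            if (*p < lo || *p > hi)      /* also catches the terminator */
                return UTF8_INVALID;
            p++;
            lo = 0x80;                   /* later bytes use the plain range */
            hi = 0xBF;
        }
        len++;
    }
    return len;
}
```

Because the terminating NUL never falls inside any allowed range, a truncated sequence is reported as invalid rather than read past.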

Re: UTF-8 strings in C

<87o84eh5nv.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=19965&group=comp.lang.c#19965

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: om...@iki.fi (Otto J. Makela)
Newsgroups: comp.lang.c
Subject: Re: UTF-8 strings in C
Date: Fri, 14 Jan 2022 14:18:28 +0200
Organization: Games and Theory
Lines: 45
Message-ID: <87o84eh5nv.fsf@tigger.extechop.net>
References: <strlen-20220108201256@ram.dialup.fu-berlin.de>
<srgj1m$p9s$1@dont-email.me> <srgqvn$5gn$1@gioia.aioe.org>
<srh1kj$epo$1@dont-email.me> <srh3b9$q0m$1@dont-email.me>
<srhcti$tvd$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="c4f1cd7bc2e8ee2e415275dd7feee475";
logging-data="28418"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ACgpCZ4pG+4TjY/ZWuf8D"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:WqXORPFuvXyzlJAaNeA1Ec42oQs=
sha1:AuxvLcjlgzY/JUMgbC+6lUS0VkY=
X-Face: 'g'S,X"!c;\pfvl4ljdcm?cDdk<-Z;`x5;YJPI-cs~D%;_<\V3!3GCims?a*;~u$<FYl@"E
c?3?_J+Zwn~{$8<iEy}EqIn_08"`oWuqO$#(5y3hGq8}BG#sag{BL)u8(c^Lu;*{8+'Z-k\?k09ILS
X-URL: http://www.iki.fi/om/
Mail-Copies-To: never
 by: Otto J. Makela - Fri, 14 Jan 2022 12:18 UTC

Bonita Montero <Bonita.Montero@gmail.com> wrote:

> This is more correct since it completely detects encoding-errors:
>
> size_t utf8Strlen( char const *str )
> {
> struct encode_t { size_t lenIncr, strIncr; };
> static encode_t const encodes[] =
> {
> { 1, 1 },
> { 0, 0 },
> { 1, 2 },
> { 1, 3 },
> { 1, 4 },
> { 0, 0 },
> { 0, 0 },
> { 0, 0 },
> { 0, 0 }
> };
> size_t len = 0;
> for( unsigned char c; (c = *str); )
> {
> encode_t const &enc = encodes[(size_t)countl_zero<unsigned
> char>( ~c )];
> if( !enc.lenIncr ) [[unlikely]]
> return -1;
> len += enc.lenIncr;
> for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
> if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
> return -1;
> }
> return len;
> }

I must be getting a bit old here, I don't seem to understand
what that countl_zero stuff (and the syntax thereabouts) is?

PS. Also, shouldn't the first struct line be a typedef?
At least you seem to use it a bit later as if it were one.
PPS. And shouldn't const be before the type, not after?
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

Re: UTF-8 strings in C

<87lezih5b5.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=19966&group=comp.lang.c#19966

Supersedes: <87o84eh5nv.fsf@tigger.extechop.net>

 by: Otto J. Makela - Fri, 14 Jan 2022 12:26 UTC

Bonita Montero <Bonita.Montero@gmail.com> wrote:

> This is more correct since it completely detects encoding-errors:
>
> size_t utf8Strlen( char const *str )
> {
> struct encode_t { size_t lenIncr, strIncr; };
> static encode_t const encodes[] =
> {
> { 1, 1 },
> { 0, 0 },
> { 1, 2 },
> { 1, 3 },
> { 1, 4 },
> { 0, 0 },
> { 0, 0 },
> { 0, 0 },
> { 0, 0 }
> };
> size_t len = 0;
> for( unsigned char c; (c = *str); )
> {
> encode_t const &enc = encodes[(size_t)countl_zero<unsigned
> char>( ~c )];
> if( !enc.lenIncr ) [[unlikely]]
> return -1;
> len += enc.lenIncr;
> for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
> if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
> return -1;
> }
> return len;
> }

Took me a while to realize the countl_zero stuff was a C++ library call,
with the accompanying syntax.

However, this is comp.lang.c, so C solutions are appreciated.
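
For readers who want the quoted table-driven idea in plain C, here is a minimal sketch. It is not from any post in the thread: the names leading_ones and utf8_strlen_c are made up for illustration, and the helper simply stands in for what countl_zero<unsigned char>( ~c ) computes, namely the number of leading 1 bits in the byte. Like the quoted code, it only checks the lead-byte and continuation-byte patterns, so overlong forms, surrogates and values above U+10FFFF would still be counted as valid.

```c
#include <stddef.h>

/* Hypothetical helper: number of leading 1 bits in a byte (0..8). */
static unsigned leading_ones(unsigned char c)
{
    unsigned n = 0;
    while (n < 8 && (c & (0x80u >> n)))
        n++;
    return n;
}

/* Counts code points; returns (size_t)-1 on a malformed sequence. */
static size_t utf8_strlen_c(const char *str)
{
    /* Indexed by leading_ones(): sequence length in bytes, 0 = invalid lead byte. */
    static const unsigned char seq_len[9] = { 1, 0, 2, 3, 4, 0, 0, 0, 0 };
    size_t len = 0;

    while (*str) {
        unsigned n = seq_len[leading_ones((unsigned char)*str)];
        if (n == 0)
            return (size_t)-1;
        /* The terminating 0 byte fails this test too, so a string that
           ends in the middle of a sequence is rejected without overreading. */
        for (unsigned i = 1; i < n; i++)
            if (((unsigned char)str[i] & 0xC0) != 0x80)
                return (size_t)-1;
        str += n;
        len++;
    }
    return len;
}
```

A caller would compare the result against (size_t)-1 before using it as a length.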

Re: UTF-8 strings in C

<afd871a5-960a-436e-abde-382a667a55f5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19967&group=comp.lang.c#19967

 by: Malcolm McLean - Fri, 14 Jan 2022 12:41 UTC

On Friday, 14 January 2022 at 12:18:42 UTC, Otto J. Makela wrote:
> Bonita Montero <Bonita....@gmail.com> wrote:
>
> > This is more correct since it completely detects encoding-errors:
> >
> > size_t utf8Strlen( char const *str )
> > {
> > struct encode_t { size_t lenIncr, strIncr; };
> > static encode_t const encodes[] =
> > {
> > { 1, 1 },
> > { 0, 0 },
> > { 1, 2 },
> > { 1, 3 },
> > { 1, 4 },
> > { 0, 0 },
> > { 0, 0 },
> > { 0, 0 },
> > { 0, 0 }
> > };
> > size_t len = 0;
> > for( unsigned char c; (c = *str); )
> > {
> > encode_t const &enc = encodes[(size_t)countl_zero<unsigned
> > char>( ~c )];
> > if( !enc.lenIncr ) [[unlikely]]
> > return -1;
> > len += enc.lenIncr;
> > for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
> > if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
> > return -1;
> > }
> > return len;
> > }
> I must be getting a bit old here, I don't seem to understand
> what that countl_zero stuff (and the syntax thereabouts) is?
>
> PS. Also, shouldn't the first struct line be a typedef?
> At least you seem to use it a bit later as if it were one.
> PPS. And shouldn't const be before the type, not after?
>
Also, it would be safer to make the function return an int.
size_t is unsigned, and the return is -1 for error. So code like

size_t Nchars = utf8Strlen(utf8str);
size_t i;

for (i = 0; i < Nchars; i++)
{ }

will run through quintillions of iterations, instead of being
a no-op, if the utf8 is malformed.
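
The pitfall can be shown in a few lines. This is only a sketch: failing_length is a stand-in for any length function that reports failure as -1 through size_t, and the loop is capped so that the demonstration terminates; neither name comes from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a length function that reports failure as -1 via size_t. */
static size_t failing_length(void)
{
    return (size_t)-1;   /* same value as SIZE_MAX */
}

/* Runs the naive loop, but stops after 'cap' iterations so the demo ends. */
static size_t naive_iterations(size_t n, size_t cap)
{
    size_t i;
    for (i = 0; i < n && i < cap; i++)
        ;   /* empty body, as in the post */
    return i;
}

/* One possible guard: map the error value to "no characters". */
static size_t checked_length(size_t n)
{
    return n == (size_t)-1 ? 0 : n;
}
```

With the unchecked value the loop runs until the cap is reached; with checked_length it does not run at all.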

Re: UTF-8 strings in C

<srrul5$c2h$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19968&group=comp.lang.c#19968

 by: Bonita Montero - Fri, 14 Jan 2022 13:43 UTC

On 14.01.2022 at 13:41, Malcolm McLean wrote:
> On Friday, 14 January 2022 at 12:18:42 UTC, Otto J. Makela wrote:
>> Bonita Montero <Bonita....@gmail.com> wrote:
>>
>>> This is more correct since it completely detects encoding-errors:
>>>
>>> size_t utf8Strlen( char const *str )
>>> {
>>> struct encode_t { size_t lenIncr, strIncr; };
>>> static encode_t const encodes[] =
>>> {
>>> { 1, 1 },
>>> { 0, 0 },
>>> { 1, 2 },
>>> { 1, 3 },
>>> { 1, 4 },
>>> { 0, 0 },
>>> { 0, 0 },
>>> { 0, 0 },
>>> { 0, 0 }
>>> };
>>> size_t len = 0;
>>> for( unsigned char c; (c = *str); )
>>> {
>>> encode_t const &enc = encodes[(size_t)countl_zero<unsigned
>>> char>( ~c )];
>>> if( !enc.lenIncr ) [[unlikely]]
>>> return -1;
>>> len += enc.lenIncr;
>>> for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
>>> if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
>>> return -1;
>>> }
>>> return len;
>>> }
>> I must be getting a bit old here, I don't seem to understand
>> what that countl_zero stuff (and the syntax thereabouts) is?
>>
>> PS. Also, shouldn't the first struct line be a typedef?
>> At least you seem to use it a bit later as if it were one.
>> PPS. And shouldn't const be before the type, not after?
>>
> Also, it would be safer to make the function return an int.
> size_t is unsigned, and the return is -1 for error. ...

All compilers support this because it is very common.

>
> size_t Nchars = utf8Strlen(utf8str);
> size_t i;
>
> for (i =0; i < Nchars; i++)
> {
> }
>
> will run through quintillions of iterations, instead of being
> a no-op, if the utf8 is malformed.

Re: UTF-8 strings in C

<2d6fffa5-9748-4d64-8ba6-a164caef70c6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19969&group=comp.lang.c#19969

 by: Malcolm McLean - Fri, 14 Jan 2022 13:45 UTC

On Friday, 14 January 2022 at 13:43:16 UTC, Bonita Montero wrote:
> On 14.01.2022 at 13:41, Malcolm McLean wrote:
> > On Friday, 14 January 2022 at 12:18:42 UTC, Otto J. Makela wrote:
> >> Bonita Montero <Bonita....@gmail.com> wrote:
> >>
> >>> This is more correct since it completely detects encoding-errors:
> >>>
> >>> size_t utf8Strlen( char const *str )
> >>> {
> >>> struct encode_t { size_t lenIncr, strIncr; };
> >>> static encode_t const encodes[] =
> >>> {
> >>> { 1, 1 },
> >>> { 0, 0 },
> >>> { 1, 2 },
> >>> { 1, 3 },
> >>> { 1, 4 },
> >>> { 0, 0 },
> >>> { 0, 0 },
> >>> { 0, 0 },
> >>> { 0, 0 }
> >>> };
> >>> size_t len = 0;
> >>> for( unsigned char c; (c = *str); )
> >>> {
> >>> encode_t const &enc = encodes[(size_t)countl_zero<unsigned
> >>> char>( ~c )];
> >>> if( !enc.lenIncr ) [[unlikely]]
> >>> return -1;
> >>> len += enc.lenIncr;
> >>> for( char const *cpEnd = str + enc.strIncr; ++str != cpEnd; )
> >>> if( ((unsigned char)*str & 0x0C0) != 0x080 ) [[unlikely]]
> >>> return -1;
> >>> }
> >>> return len;
> >>> }
> >> I must be getting a bit old here, I don't seem to understand
> >> what that countl_zero stuff (and the syntax thereabouts) is?
> >>
> >> PS. Also, shouldn't the first struct line be a typedef?
> >> At least you seem to use it a bit later as if it were one.
> >> PPS. And shouldn't const be before the type, not after?
> >>
> > Also, it would be safer to make the function return an int.
> > size_t is unsigned, and the return is -1 for error. ...
>
> All compilers support this because it is very common.
>
Yes. It should compile cleanly. That's not the objection.

Re: UTF-8 strings in C

<87h7a6h0lp.fsf@tigger.extechop.net>

https://www.novabbs.com/devel/article-flat.php?id=19970&group=comp.lang.c#19970

 by: Otto J. Makela - Fri, 14 Jan 2022 14:07 UTC

om@iki.fi (Otto J. Makela) wrote:

> Took me a while to realize the countl_zero stuff was a C++ library
> call, with the accompanying syntax.
>
> However, this is comp.lang.c, so C solutions are appreciated.

How about this? It of course does not count the slightly odder
compositions (like flags) correctly.

----

int utf8Strlen( char const *str ) {
    int length = 0;
    int multibyte = 0;
    unsigned char c;

    while ((c = *str++)) {
        if (multibyte) {
            multibyte--;
            if ((c & 0xC0) != 0x80) // UTF8 encoding error
                return -1;
        } else {
            length++;
            if (c < 0x80) // Normal 7-bit ASCII
                multibyte = 0;
            else if ((c & 0xE0) == 0xC0) // Single following byte
                multibyte = 1;
            else if ((c & 0xF0) == 0xE0) // Two following bytes
                multibyte = 2;
            else if ((c & 0xF8) == 0xF0) // Three following bytes
                multibyte = 3;
            else // UTF8 encoding error
                return -1;
        }
    }

    if (multibyte) // Ran out of string in the middle of multibyte character
        return -1;
    return length;
}
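
The remark about flags can be made concrete. A flag emoji is a pair of regional indicator code points, and an accented letter may be a base letter plus a combining mark, so a code point count differs from the number of characters a reader sees. The sketch below is an illustration only: count_code_points is a made-up name, and it assumes its input is already valid UTF-8, so it does no error checking at all.

```c
#include <stddef.h>

/* Counts code points in a string assumed to be valid UTF-8 by counting
   every byte that is not a continuation byte (10xxxxxx). */
static size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* U+1F1EB U+1F1EE: regional indicators F and I, shown as one flag. */
static const char flag_fi[] = "\xF0\x9F\x87\xAB\xF0\x9F\x87\xAE";

/* "e" followed by U+0301 COMBINING ACUTE ACCENT, shown as one letter. */
static const char e_acute[] = "e\xCC\x81";
```

Both strings display as a single character, yet count_code_points reports 2 for each. Counting what the user sees would need grapheme cluster segmentation, which is outside the scope of a strlen-style function.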


Re: UTF-8 strings in C

<srs07u$cr$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19971&group=comp.lang.c#19971

 by: Mateusz Viste - Fri, 14 Jan 2022 14:10 UTC

2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
> Yes. It should compile cleanly. That's not the objection.

clang with -Weverything (or -Wsign-conversion) does warn about this.

warning: implicit conversion changes signedness: 'int' to 'size_t' (aka
'unsigned long') [-Wsign-conversion]

Mateusz

Re: UTF-8 strings in C

<srs24f$4n1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19972&group=comp.lang.c#19972

 by: Bonita Montero - Fri, 14 Jan 2022 14:42 UTC

On 14.01.2022 at 15:07, Otto J. Makela wrote:
> om@iki.fi (Otto J. Makela) wrote:
>
>> Took me a while to realize the countl_zero stuff was a C++ library
>> call, with the accompanying syntax.
>>
>> However, this is comp.lang.c, so C solutions are appreciated.
>
> How about this? It of course does not count the slightly odder
> compositions (like flags) correctly.
>
> ----
>
> int utf8Strlen( char const *str ) {
> int length = 0;
> int multibyte = 0;
> unsigned char c;
>
> while(c = *str++) {
> if(multibyte) {
> multibyte--;
> if((c & 0xC0) != 0x80 ) // UTF8 encoding error
> return -1;
> } else {
> length++;
> if (c < 0x80) // Normal 7-bit ASCII
> multibyte = 0;
> else if ((c & 0xE0) == 0xC0) // Single following byte
> multibyte = 1;
> else if ((c & 0xF0) == 0xE0) // Two following bytes
> multibyte = 2;
> else if ((c & 0xF8) == 0xF0) // Three following bytes
> multibyte = 3;
> else // UTF8 encoding error
> return -1;
> }
> }
>
> if(multibyte) // Ran out of string in the middle of multibyte character
> return -1;
> return length;
> }
>

A lot of unpredictable branches.
That's what I want to avoid with the table.

Re: UTF-8 strings in C

<srs269$4n1$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19973&group=comp.lang.c#19973

 by: Bonita Montero - Fri, 14 Jan 2022 14:43 UTC

On 14.01.2022 at 15:10, Mateusz Viste wrote:
> 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
>> Yes. It should compile cleanly. That's not the objection.
>
> clang with -Weverything (or -Wsign-conversion) does warn about this.
>
> warning: implicit conversion changes signedness: 'int' to 'size_t' (aka
> 'unsigned long') [-Wsign-conversion]

Stupid warning. All compilers widen the value to the size of the
destination type and then convert it.

Re: UTF-8 strings in C

<srs649$1af$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19974&group=comp.lang.c#19974

 by: Bart - Fri, 14 Jan 2022 15:50 UTC

On 14/01/2022 14:43, Bonita Montero wrote:
> On 14.01.2022 at 15:10, Mateusz Viste wrote:
>> 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
>>> Yes. It should compile cleanly. That's not the objection.
>>
>> clang with -Weverything (or -Wsign-conversion) does warn about this.
>>
>> warning: implicit conversion changes signedness: 'int' to 'size_t' (aka
>> 'unsigned long') [-Wsign-conversion]
>
> Stupid warning. All compilers widen the value to the size of the
> destination type and then convert it.
>

If you change int -1 to size_t then you are liable to end up with
18446744073709551615.

You don't want a warning about that?

Re: UTF-8 strings in C

<srs7nj$62u$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19975&group=comp.lang.c#19975

 by: Mateusz Viste - Fri, 14 Jan 2022 16:17 UTC

2022-01-14 at 15:43 +0100, Bonita Montero wrote:
> On 14.01.2022 at 15:10, Mateusz Viste wrote:
> > 2022-01-14 at 05:45 -0800, Malcolm McLean wrote:
> >> Yes. It should compile cleanly. That's not the objection.
> >
> > clang with -Weverything (or -Wsign-conversion) does warn about this.
> >
> > warning: implicit conversion changes signedness: 'int' to 'size_t'
> > (aka 'unsigned long') [-Wsign-conversion]
>
> Stupid warning. All compilers widen the value to the size of the
> destination type and then convert it.

The warning points out a clear inconsistency. If that's *really* what
the programmer wanted, then an explicit cast should clear out the
warning. Otherwise it is likely to be a human error, and warning about
that is never stupid.

Mateusz
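
One way to make the intent explicit, in the spirit of the point about casts, is a named error value with the cast written out once. This is a sketch under assumptions: UTF8_INVALID and utf8_lead_length are hypothetical names chosen for this illustration, not part of any posted code or library.

```c
#include <stddef.h>

/* Hypothetical named error value; the cast states the conversion on purpose. */
#define UTF8_INVALID ((size_t)-1)

/* Length in bytes of the sequence introduced by lead byte c,
   or UTF8_INVALID if c cannot start a sequence. */
static size_t utf8_lead_length(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return UTF8_INVALID;                /* continuation byte or 11111xxx */
}
```

Because no signed value is converted implicitly, there is nothing here for a sign-conversion warning to report, and callers can compare against UTF8_INVALID by name.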
