Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<krMCJ.117486$_Y5.84485@fx29.iad>

https://www.novabbs.com/devel/article-flat.php?id=19924&group=comp.lang.c#19924

Newsgroups: comp.lang.c
Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx29.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
<srg1jr$1rpr$1@gioia.aioe.org>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <srg1jr$1rpr$1@gioia.aioe.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 113
Message-ID: <krMCJ.117486$_Y5.84485@fx29.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 21:00:48 -0500
X-Received-Bytes: 7251
 by: Richard Damon - Mon, 10 Jan 2022 02:00 UTC

On 1/9/22 8:19 PM, Manfred wrote:
> On 1/9/2022 11:35 PM, Öö Tiib wrote:
>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>> ...
>>>>
>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>>>>> That should work unless underlying file system does not support files
>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>>>>> needed as it already works like in my example on vast majority of things.
>>>>>>>
>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>>>>> about it. Your real beef is with the implementors, not the standard.
>>>>>>
>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>>>>
>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
>>>>> should have been added?
>>>>
>>>> Something that makes it clear that it is a defect when
>>>> "Foo≡ƒÿÇBar.txt" is silently opened on a file system that fully
>>>> supports files named "Foo😀Bar.txt", I suppose.
>>>>
>>> Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
>>> representation, what's the difference? One form or the other shows up
>>> only when it is displayed in some UI - the filesystem isn't one, which
>>> leads to the implementation's runtime behavior.
>>
>> How do you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
>> "Foo😀Bar.txt" files can be in the same directory. Both have Unicode
>> names in the underlying file system precisely as posted.
>
> I mean the same byte sequence in their name, but different UI
> representation, e.g. when decoded as utf-8 or w-1252 or whatever.

But that seems to imply that the file system keeps track of the file name
encoding at the entry level, and I don't know of any that do.
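To make the "same byte sequence, different UI representation" point concrete, here is a minimal sketch of a UTF-8 encoder. The function name utf8_encode is hypothetical and not part of any standard library. For U+1F600 (the 😀 in the example file name) it produces the bytes F0 9F 98 80. A viewer that decodes those four bytes with a single-byte code page shows four unrelated characters instead; under code page 437 they come out as ≡ƒÿÇ, which matches the garbled name quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The bytes themselves carry no record of which encoding produced them, which is why a file system that stores names as plain byte strings cannot tell the two readings apart.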

>
> What you are saying assumes a Unicode-aware filesystem, that's not free
> from the point of view of the standard.
> But, in order to support utf-8, it would be enough to have a char based
> filesystem that treats names as plain 0-terminated char[]. That's
> easier, probably free on most platforms, but it's different from
> Unicode-aware (which could be UTF-16 like Windows, and there you have
> your problems).
>
>>
>>> If they are actually different in their binary sequence, and this is the
>>> result of the utf-8 string being wrongly converted multiple times, this
>>> looks like a bad implementation, rather than a problem with the
>>> standard.
>>> IIUC you are advocating for some statement in the standard that prevents
>>> implementations from messing up with "character sets" in null terminated
>>> char strings?
>>
>> I mean that standard should require that all char* texts are treated as
>> UTF-8 by standard library unless said otherwise. If implementation needs
>> some other encoding of such byte sequence then it provides
>> platform-specific functions or compiler switches and/or extends language
>> with implementation-defined char_iso8859_1_t character types and
>> prefixes. If it is noteworthy handy type then add it to standards too, I
>> don't care.
>
> I see this hard to win, and probably not ideal - suppose in 10 years
> some better encoding than utf-8 shows up, then you are screwed again.
>
> I'd rather stick to the fact that utf-8 is compatible with 0-terminated
> char[], and so a plausible wish would be that such strings are not
> screwed by the implementation; for example when you store a file name in
> a filesystem with fopen() and the name is given as char[], then the
> standard could mandate that reading back that same name as char[] gives
> back the same byte sequence.
>
> Currently I guess one could use a utf-8 string as a name to fopen() on
> Windows, then the OS assumes it is W-1252 and converts it into UTF-16,
> at which point it is screwed, and when you read it back into char[] it
> is garbage.
>
>>
>> If standard can define that overflow in signed atomics is well defined
>> and two's complement is mandated there then it also can define that all
>> char* texts are UTF-8. The only question is if what I suggest is
>> reasonable
>> or not. From viewpoint of implementer of standard library or users it
>> is likely blessing ... so I think it is question of
>> business/politics/religions.
>
> I agree with Richard here. Two's complement is not like utf-8.
> I still think it's technical rather than business/politics/religions in
> this case - as I said above I'm not sure it would even be ideal.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<d10748b6-4244-454d-b2f6-ed3c72618e2dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19925&group=comp.lang.c#19925

Newsgroups: comp.lang.c
X-Received: by 2002:ae9:eb54:: with SMTP id b81mr14333962qkg.747.1641780496279;
Sun, 09 Jan 2022 18:08:16 -0800 (PST)
X-Received: by 2002:ad4:5d65:: with SMTP id fn5mr64801005qvb.10.1641780496159;
Sun, 09 Jan 2022 18:08:16 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 18:08:15 -0800 (PST)
In-Reply-To: <srg1jr$1rpr$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com> <srg1jr$1rpr$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d10748b6-4244-454d-b2f6-ed3c72618e2dn@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 02:08:16 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 152
 by: Öö Tiib - Mon, 10 Jan 2022 02:08 UTC

On Monday, 10 January 2022 at 03:20:06 UTC+2, Manfred wrote:
> On 1/9/2022 11:35 PM, Öö Tiib wrote:
> > On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> >> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> >>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> > ...
> >>>
> >>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>>>> That should work unless underlying file system does not support files
> >>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>>>> needed as it already works like in my example on vast majority of things.
> >>>>>>
> >>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>>>> about it. Your real beef is with the implementors, not the standard.
> >>>>>
> >>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>>>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>>>
> >>>> Also agreed, but since utf-8 is transparent to ascii functions, what
> >>>> should have been added?
> >>>
> >>> Something that makes it clear that it is a defect when "Foo≡ƒÿÇBar.txt" is silently opened
> >>> on a file system that fully supports files named "Foo😀Bar.txt", I suppose.
> >>>
> >> Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> >> representation, what's the difference? One form or the other shows up
> >> only when it is displayed in some UI - the filesystem isn't one, which
> >> leads to the implementation's runtime behavior.
> >
> > How do you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> > "Foo😀Bar.txt" files can be in the same directory. Both have Unicode
> > names in the underlying file system precisely as posted.
> I mean the same byte sequence in their name, but different UI
> representation, e.g. when decoded as utf-8 or w-1252 or whatever.

Nope. NTFS, for example, stores file names as UTF-16, and Windows uses hard links
to also give legacy Radix-50 style short (8.3) filenames to all files. Trivia
question: why is it named "Radix-50" when there are only 40 characters in it?

> What you are saying assumes a Unicode-aware filesystem, that's not free
> from the point of view of the standard.
> But, in order to support utf-8, it would be enough to have a char based
> filesystem that treats names as plain 0-terminated char[]. That's
> easier, probably free on most platforms, but it's different from
> Unicode-aware (which could be UTF-16 like Windows, and there you have
> your problems).

The problems were with Japanese Shift JIS or EUC encodings in file
systems: it was expensive to guess which one was in use, so they switched
to Unicode. With UTF-16 one needs to know or detect its endianness;
otherwise converting to UTF-8 and back is absurdly trivial.
Certainly less code than converting between Windows-1252 and UTF-16.
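As a rough illustration of how small that conversion is, the following sketch turns a NUL-terminated UTF-8 string into UTF-16 code units in native byte order. The name utf8_to_utf16 and its error convention are assumptions made for this example, not an existing API. On Windows the result is the kind of wide string that the Microsoft-specific _wfopen() accepts, whereas plain fopen() interprets its argument in the current code page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to UTF-16.
   dst receives at most cap code units including the terminating 0.
   Returns the number of units written (excluding the terminator),
   or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;
    while (*s) {
        uint32_t cp, min;   /* min: smallest value legal for this length */
        int extra;          /* continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;       /* stray continuation or bad lead byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)  /* also catches a premature NUL */
                return (size_t)-1;
            cp = (cp << 6) | (*s++ & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {                      /* split into a surrogate pair */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    dst[n] = 0;
    return n;
}
```

Most of the code is validation; the arithmetic that maps one encoding to the other is only a few shifts and masks.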

> >
> >> If they are actually different in their binary sequence, and this is the
> >> result of the utf-8 string being wrongly converted multiple times, this
> >> looks like a bad implementation, rather than a problem with the standard.
> >> IIUC you are advocating for some statement in the standard that prevents
> >> implementations from messing up with "character sets" in null terminated
> >> char strings?
> >
> > I mean that standard should require that all char* texts are treated as
> > UTF-8 by standard library unless said otherwise. If implementation needs
> > some other encoding of such byte sequence then it provides
> > platform-specific functions or compiler switches and/or extends language
> > with implementation-defined char_iso8859_1_t character types and
> > prefixes. If it is noteworthy handy type then add it to standards too, I
> > don't care.
>
> I see this hard to win, and probably not ideal - suppose in 10 years
> some better encoding than utf-8 shows up, then you are screwed again.
>
> I'd rather stick to the fact that utf-8 is compatible with 0-terminated
> char[], and so a plausible wish would be that such strings are not
> screwed by the implementation; for example when you store a file name in
> a filesystem with fopen() and the name is given as char[], then the
> standard could mandate that reading back that same name as char[] gives
> back the same byte sequence.
>
> Currently I guess one could use a utf-8 string as a name to fopen() on
> Windows, then the OS assumes it is W-1252 and converts it into UTF-16,
> at which point it is screwed, and when you read it back into char[] it
> is garbage.

Yes, something like that happens. Microsoft was amazingly innovative
and wanted to push all kinds of good things up to 1995. But then some
kind of browser, compiler and other incompatibility wars within its
own operating system started ... and its position started to shrink and
take damage. But it is their own business, so they have every right to burn
it however they please.

> >
> > If standard can define that overflow in signed atomics is well defined
> > and two's complement is mandated there then it also can define that all
> > char* texts are UTF-8. The only question is if what I suggest is reasonable
> > or not. From viewpoint of implementer of standard library or users it
> > is likely blessing ... so I think it is question of business/politics/religions.
>
> I agree with Richard here. Two's complement is not like utf-8.
> I still think it's technical rather than business/politics/religions in
> this case - as I said above I'm not sure it would even be ideal.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<87tuec49lz.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=19926&group=comp.lang.c#19926

Newsgroups: comp.lang.c
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
Date: Mon, 10 Jan 2022 02:18:00 +0000
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <87tuec49lz.fsf@bsb.me.uk>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="d66b7acf84c3753c59039a01afb9fa20";
logging-data="13847"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19/EnZMs6+FB1Vi4bKicQyUryUBckQoqRY="
Cancel-Lock: sha1:ihY+rdBkFrnUfkgeHE8AhvPDwH0=
sha1:25jdeUP/OldHurBlnWk82rvAM7w=
X-BSB-Auth: 1.38d7b32e68281842ce18.20220110021800GMT.87tuec49lz.fsf@bsb.me.uk
 by: Ben Bacarisse - Mon, 10 Jan 2022 02:18 UTC

Mateusz Viste <mateusz@xyz.invalid> writes:

> While UTF-8 is neat, it is also complex to decode. Even a simple
> strlen() can be challenging.

Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points, I
would interpret a "UTF-8 strlen" as being a function that counts the
number of encoded code points, and that's simple enough. Every byte,
before the null, that does not have 10 in its top two bits is the
start of a code point:

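#include <stddef.h>

/* Count code points: every byte that is not a 10xxxxxx
   continuation byte starts a new one. No validation is done. */
size_t ustrlen(const char *s)
{
    size_t len = 0;
    while (*s) len += (*s++ & 0xc0) != 0x80;
    return len;
}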

Obviously, for some uses, this is too simple as it does not detect
incorrect encodings.

--
Ben.
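For the uses where incorrect encodings do matter, a checking variant might look like the sketch below. The name ustrlen_checked and the (size_t)-1 error return are assumptions made for this illustration. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and values above U+10FFFF by restricting the range allowed for the byte that follows each lead byte, following the table of well-formed UTF-8 byte sequences in the Unicode Standard.

```c
#include <stddef.h>

/* Hypothetical checking variant: returns the number of code points in
   str, or (size_t)-1 if str is not well-formed UTF-8. */
size_t ustrlen_checked(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t len = 0;

    while (*s) {
        unsigned char lead = *s++;
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */
        int extra;                           /* continuation bytes expected */

        if (lead < 0x80) {
            extra = 0;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;                       /* C0 and C1 would be overlong */
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
            if (lead == 0xE0) lo = 0xA0;     /* reject overlong forms */
            if (lead == 0xED) hi = 0x9F;     /* reject surrogates */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
            if (lead == 0xF0) lo = 0x90;     /* reject overlong forms */
            if (lead == 0xF4) hi = 0x8F;     /* reject values > U+10FFFF */
        } else {
            return (size_t)-1;               /* stray continuation, F5..FF */
        }
        for (int i = 0; i < extra; i++, s++) {
            if (*s < lo || *s > hi)          /* also catches a premature NUL */
                return (size_t)-1;
            lo = 0x80;                       /* later bytes use the plain */
            hi = 0xBF;                       /* continuation range */
        }
        len++;
    }
    return len;
}
```

Like the simple version, this counts code points, not user-perceived characters; combining marks still count separately.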

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<87o84k48w0.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=19927&group=comp.lang.c#19927

Newsgroups: comp.lang.c
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
Date: Mon, 10 Jan 2022 02:33:35 +0000
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <87o84k48w0.fsf@bsb.me.uk>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
<GhFCJ.204983$831.61812@fx40.iad>
<b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="d66b7acf84c3753c59039a01afb9fa20";
logging-data="13847"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX188X586BmP3S4DzqJGGFauKeEG8mUWET+A="
Cancel-Lock: sha1:w5Zn/vASkJuY8H7NTYClOv8MpkM=
sha1:i+KOv8Y826TNfOJKxTm86USuf2A=
X-BSB-Auth: 1.cbf562319e99963a811a.20220110023335GMT.87o84k48w0.fsf@bsb.me.uk
 by: Ben Bacarisse - Mon, 10 Jan 2022 02:33 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>>
>> On the other hand, some languages add things like 'vowel points' to
>> characters, and those are separate graphemes even though they are added
>> in a similar manner. This comes down to what the original language
>> thought of as a 'character', which just makes things even more complicated.
>>
> In Hebrew the "vowel points" are optional. They are used in beginners' and
> religious texts, but not in general use. So if we take a text from scripture,
> and represent it with and without vowels, is that the same text or a different
> text? Almost all Hebrew speakers would say "It's the same text". So
> strcmp() doesn't necessarily work in a Hebrew context.

strcmp fails much closer to home (at least closer to my geographic home)
because, in Spanish, ch and ll are, traditionally, considered separate
letters. All c* words collate before any ch* words, and all l* words
before any ll* ones.

This has proved so inconvenient that I believe that the Real Academia
Española has ruled that, now, only ñ must be considered separately.

--
Ben.
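The traditional rule can be sketched as a comparison over collation elements instead of bytes. The function names below are hypothetical, and the code handles only lowercase ASCII letters, so it illustrates the idea and is not a usable Spanish collation. The standard library's locale-aware counterpart is strcoll(), but its result depends on which locales are installed, so the rule is hard-coded here.

```c
#include <stddef.h>

/* Hypothetical helper: consume one collation element from *p and
   return its rank. "ch" ranks just after 'c' and "ll" just after 'l';
   every other character keeps its byte order. */
static int next_rank(const char **p)
{
    const char *s = *p;
    int rank;

    if (s[0] == 'c' && s[1] == 'h') {
        rank = 2 * 'c' + 1;
        s += 2;
    } else if (s[0] == 'l' && s[1] == 'l') {
        rank = 2 * 'l' + 1;
        s += 2;
    } else {
        rank = 2 * (unsigned char)s[0];
        s += 1;
    }
    *p = s;
    return rank;
}

/* Compare a and b under the traditional rule. Returns a negative,
   zero or positive value, like strcmp(). */
int trad_es_cmp(const char *a, const char *b)
{
    while (*a && *b) {
        int ra = next_rank(&a);
        int rb = next_rank(&b);
        if (ra != rb)
            return ra < rb ? -1 : 1;
    }
    return (*a != 0) - (*b != 0);   /* the shorter string sorts first */
}
```

With this rule "cuna" sorts before "chico" and "luna" before "llama", whereas strcmp() orders each pair the other way round.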

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19928&group=comp.lang.c#19928

Newsgroups: comp.lang.c
X-Received: by 2002:a05:620a:25ca:: with SMTP id y10mr50107739qko.526.1641782432402;
Sun, 09 Jan 2022 18:40:32 -0800 (PST)
X-Received: by 2002:a05:622a:491:: with SMTP id p17mr5871659qtx.300.1641782432257;
Sun, 09 Jan 2022 18:40:32 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 18:40:32 -0800 (PST)
In-Reply-To: <qpMCJ.117485$_Y5.68107@fx29.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com> <GvKCJ.77541$KV.71777@fx14.iad>
<471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com> <qpMCJ.117485$_Y5.68107@fx29.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 02:40:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 203
 by: Öö Tiib - Mon, 10 Jan 2022 02:40 UTC

On Monday, 10 January 2022 at 03:59:00 UTC+2, Richard Damon wrote:
> On 1/9/22 7:51 PM, Öö Tiib wrote:
> > On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
> >> On 1/9/22 5:35 PM, Öö Tiib wrote:
> >>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> >>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> >>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> >>> ...
> >>>>>
> >>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>>>>>> That should work unless underlying file system does not support files
> >>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>>>>>> needed as it already works like in my example on vast majority of things.
> >>>>>>>>
> >>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>>>>>> about it. Your real beef is with the implementors, not the standard.
> >>>>>>>
> >>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>>>>>
> >>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
> >>>>>> should have been added?
> >>>>>
> >>>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> >>>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> >>>>>
> >>>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> >>>> representation, what's the difference? One form or the other shows up
> >>>> only when it is displayed in some UI - the filesystem isn't one, which
> >>>> leads to the implementation's runtime behavior.
> >>>
> >>> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> >>> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> >>> names in underlying file system precisely as posted.
> >>>
> >>>> If they are actually different in their binary sequence, and this is the
> >>>> result of the utf-8 string being wrongly converted multiple times, this
> >>>> looks like a bad implementation, rather than a problem with the standard.
> >>>> IIUC you are advocating for some statement in the standard that prevents
> >>>> implementations from messing up with "character sets" in null terminated
> >>>> char strings?
> >>>
> >>> I mean that standard should require that all char* texts are treated as
> >>> UTF-8 by standard library unless said otherwise. If implementation needs
> >>> some other encoding of such byte sequence then it provides
> >>> platform-specific functions or compiler switches and/or extends language
> >>> with implementation-defined char_iso8859_1_t character types and
> >>> prefixes. If it is noteworthy handy type then add it to standards too, I
> >>> don't care.
> >>>
> >>> If standard can define that overflow in signed atomics is well defined
> >>> and two's complement is mandated there then it also can define that all
> >>> char* texts are UTF-8. The only question is if what I suggest is reasonable
> >>> or not. From viewpoint of implementer of standard library or users it
> >>> is likely blessing ... so I think it is question of business/politics/religions.
> >>
> >> The difference is that in these days, the existence of computers that
> >> aren't going to be able to support two's complement that will still want
> >> to support modern 'C' is effectively non-existent.
> >>
> >> The existence of machines that might still want to be able to support
> >> non-UTF-8 strings is not.
> >
> > But do there exist machines that do want to support texts as char* but
> > do not want to support UTF-8? Describe those machines, give examples.
> Small embedded micros with no need for large character sets.

You diligently avoid giving examples?
You mean that if it displays only Arabic numerals then it needs only 10
characters, and 12 if it also shows minus signs and dots. ASCII and UTF-8
are byte-for-byte identical for that processing, so UTF-8 adds no extra
bytes to such a system.

> >
> >> Perhaps the biggest is the embedded market where needing to support
> >
> > You mean tiny things like SD cards or flash sticks? I can store there
> > "Foo😀Bar.txt" just fine. Either embedded system does not need to
> > communicate in char* text at all, can fully ignore its encoding or has
> > to deal with UTF-8 anyway. I know of no other examples, despite I've
> > participated in programming whole pile of embedded systems over
> > the decades.
>
> Many such system communicate in command strings, maybe even with a
> minimal TCP/IP but have no need for processing data beyond pure ASCII.

Same as with numbers: if there is no need to show the degree sign in 74.3°F,
then there is no need to process anything beyond pure ASCII. Otherwise the
software needs to detect the bytes C2 B0 in order to show °, which is also
no biggie.
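
As a rough illustration of that last point, here is a minimal sketch of spotting the degree sign in a byte string. It assumes the text is NUL-terminated UTF-8; the helper name degree_sign_offset is made up for this example. Because UTF-8 never reuses ASCII byte values inside a multi-byte sequence, the ordinary byte-oriented strstr() is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the byte offset of the first
   U+00B0 DEGREE SIGN (encoded in UTF-8 as the bytes C2 B0) in the
   NUL-terminated string s, or -1 if the string does not contain one. */
static long degree_sign_offset(const char *s)
{
    assert(s != NULL);
    const char *hit = strstr(s, "\xC2\xB0");   /* plain byte search */
    return hit ? (long)(hit - s) : -1L;
}
```

A pure ASCII reading such as "74.3F" yields -1, so a device that never receives the two-byte sequence never needs to handle it.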

> >> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> >> UTF-8 rules adds a LOT of complications for some operations that just
> >> aren't needed on many of the systems.
> >
> > WHAT complications? Give examples? Both ASCII and UTF-8 are row of
> > bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
> > use-case where UTF-8 hurts? Human languages and typography
> > are horribly complicated but UTF-8 is genially trivial. Either embedded
> > system does not do linguistic analyses of poems or if it does then
> > it needs to use Unicode anyway. But commonly if it can't display
> > something then it shows � and done.
>
> Once you have your char as being defined as a Multi-Byte Character Set,
> then wchar_t must be big enough to hold any of them. If you just support
> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
> the wchar_t stuff can just be alias for the char stuff.
>
> Thus forcing char to be UTF-8 adds a lot of complexity to the system.

Most embedded systems that I programmed used wchar_t for nothing.
So the compiler generated precisely 0 bytes of wchar_t processing
into the image that was flashed into them.

> >
> >> The Standard does ALLOW a system to define char to be UTF-8 (at least
> >> until you get into issues of what it requires for wide characters).
> >
> > Allowing is apparently not enough as the support rots in standards.
> > Wide characters are wchar_t, char16_t and char32_t. These are
> > in horrible state too but I ignore it for now. Not related to issues
> > with char* and far less important in industry.
> >
> But that is part of the problem with supporting UTF-8, as that by
> definition brings all the wide character issues into play.
>
> If you define that your character set is ASCII, then wchar_t becomes
> trivial.
>
> A big part of the issue with char16_t is that it is fundamentally broken
> with Unicode, but lives on due to trying to maintain the backwards
> bandaids that basically can't be removed without admitting that a large
> segment of code just will live as being openly non-compliant.
>
> Too much legacy code assumes that 16 bit characters are 'big enough' for
> most people, and pretty much do work if you aren't being a stickler for
> full conformance to the rules, which no one is because you can't be.


Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<SvNCJ.12005$jW.9864@fx05.iad>

https://www.novabbs.com/devel/article-flat.php?id=19929&group=comp.lang.c#19929

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx05.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org> <87tuec49lz.fsf@bsb.me.uk>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <87tuec49lz.fsf@bsb.me.uk>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 29
Message-ID: <SvNCJ.12005$jW.9864@fx05.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 22:13:53 -0500
X-Received-Bytes: 2664
 by: Richard Damon - Mon, 10 Jan 2022 03:13 UTC

On 1/9/22 9:18 PM, Ben Bacarisse wrote:
> Mateusz Viste <mateusz@xyz.invalid> writes:
>
>> While UTF-8 is neat, it is also complex to decode. Even a simple
>> strlen() can be challenging.
>
> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points, I
> would interpret a "UTF-8 strlen" as being a function that counts the
> number of encoded code points, and that's simple enough. Every byte,
> before the null, that does not have 10 in its top two bits is the
> start of a code point:
>
> size_t ustrlen(char *s)
> {
> size_t len = 0;
> while (*s) len += (*s++ & 0xc0) != 0x80;
> return len;
> }
>
> Obviously, for some uses, this is too simple as it does not detect
> incorrect encodings.
>

My understanding is that for MBCS the function strlen returns the number
of BYTES in the string, not the number of Multi-Byte Characters in the
string.

This means that strlen can be used to determine how much space is needed
to store the string (plus one byte for the terminating null).
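
To make the distinction concrete, here is a small sketch that puts the two counts side by side. The helper name utf8_codepoints is made up for this example; it uses the same continuation-byte test as the ustrlen quoted above and, like it, assumes the input is already well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8
   string by counting every byte that is not a continuation byte
   (continuation bytes have the bit pattern 10xxxxxx).
   It does not validate the encoding. */
static size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        n += (*p & 0xC0) != 0x80;
    return n;
}
```

For the file name from this thread, written as "Foo\xF0\x9F\x98\x80" "Bar.txt", strlen reports 14 bytes while utf8_codepoints reports 11 code points. The buffer size needed is governed by the first number.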

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srg90u$cbf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19930&group=comp.lang.c#19930

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jameskuy...@alumni.caltech.edu (James Kuyper)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 22:26:21 -0500
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <srg90u$cbf$1@dont-email.me>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<lezpdwdw.fsf@yahoo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 10 Jan 2022 03:26:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="70101d842d4c3689c09b40dcee90df43";
logging-data="12655"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/+tobEZrRVR14BCgooJTLPh4zKfu9wyos="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:1VkfI2xhccM18PswRNUjss69138=
In-Reply-To: <lezpdwdw.fsf@yahoo.com>
Content-Language: en-US
 by: James Kuyper - Mon, 10 Jan 2022 03:26 UTC

On 1/9/22 5:41 AM, Po Lu wrote:
> "james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu> writes:
>
>> No, I am quite accurately and honestly expressing my confusion. You
>> object to something being prohibited by the standards that is, to the
>> best of my understanding, allowed. It would make more sense if you
>> were objecting the fact that it isn't mandatory, and if you were
>> making such claims, I would disagree with you about whether it would
>> be a good idea to make it mandatory - but as far as I can tell, you're
>> claiming it isn't allowed.
>
> AFAICT, he's complaining about Microsoft's specific implementations of
> some standards.

He's repeatedly asserted that it's the standards themselves that he's
complaining about. His actual complaints, however, seem to be about
Microsoft-specific behavior. That's not a contradiction - he's
complaining about the fact that the standards allow that behavior.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<APNCJ.282688$3q9.253297@fx47.iad>

https://www.novabbs.com/devel/article-flat.php?id=19931&group=comp.lang.c#19931

Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me> <sr53qo$vbl$1@dont-email.me>
<_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
<GvKCJ.77541$KV.71777@fx14.iad>
<471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>
<qpMCJ.117485$_Y5.68107@fx29.iad>
<b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 200
Message-ID: <APNCJ.282688$3q9.253297@fx47.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 22:34:56 -0500
X-Received-Bytes: 12530
 by: Richard Damon - Mon, 10 Jan 2022 03:34 UTC

On 1/9/22 9:40 PM, Öö Tiib wrote:
> On Monday, 10 January 2022 at 03:59:00 UTC+2, Richard Damon wrote:
>> On 1/9/22 7:51 PM, Öö Tiib wrote:
>>> On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
>>>> On 1/9/22 5:35 PM, Öö Tiib wrote:
>>>>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
>>>>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
>>>>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>>>>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>>>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>>>>> ...
>>>>>>>
>>>>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>>>>>>>> That should work unless underlying file system does not support files
>>>>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>>>>>>>> needed as it already works like in my example on vast majority of things.
>>>>>>>>>>
>>>>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>>>>>>>> about it. Your real beef is with the implementors, not the standard.
>>>>>>>>>
>>>>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>>>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>>>>>>>
>>>>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
>>>>>>>> should have been added?
>>>>>>>
>>>>>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
>>>>>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>>>>>>>
>>>>>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
>>>>>> representation, what's the difference? One form or the other shows up
>>>>>> only when it is displayed in some UI - the filesystem isn't one, which
>>>>>> leads to the implementation's runtime behavior.
>>>>>
>>>>> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
>>>>> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
>>>>> names in underlying file system precisely as posted.
>>>>>
>>>>>> If they are actually different in their binary sequence, and this is the
>>>>>> result of the utf-8 string being wrongly converted multiple times, this
>>>>>> looks like a bad implementation, rather than a problem with the standard.
>>>>>> IIUC you are advocating for some statement in the standard that prevents
>>>>>> implementations from messing up with "character sets" in null terminated
>>>>>> char strings?
>>>>>
>>>>> I mean that standard should require that all char* texts are treated as
>>>>> UTF-8 by standard library unless said otherwise. If implementation needs
>>>>> some other encoding of such byte sequence then it provides
>>>>> platform-specific functions or compiler switches and/or extends language
>>>>> with implementation-defined char_iso8859_1_t character types and
>>>>> prefixes. If it is noteworthy handy type then add it to standards too, I
>>>>> don't care.
>>>>>
>>>>> If standard can define that overflow in signed atomics is well defined
>>>>> and two's complement is mandated there then it also can define that all
>>>>> char* texts are UTF-8. The only question is if what I suggest is reasonable
>>>>> or not. From viewpoint of implementer of standard library or users it
>>>>> is likely blessing ... so I think it is question of business/politics/religions.
>>>>
>>>> The difference is that in these days, the existence of computers that
>>>> aren't going to be able to support two's complement that will still want
>>>> to support modern 'C' is effectively non-existent.
>>>>
>>>> The existence of machines that might still want to be able to support
>>>> non-UTF-8 strings is not.
>>>
>>> But do there exist machines that do want to support texts as char* but
>>> do not want to support UTF-8? Describe those machines, give examples.
>> Small embedded micros with no need for large character sets.
>
> You diligently avoid giving examples?
> You mean if it displays only Arabic numbers then it needs only 10
> characters, if minuses and dots too then 12. ASCII and UTF-8 are identical
> in that processing. So UTF-8 adds no extra bytes to such system.

The problem is that once you define that your 'Character Set' is UTF-8,
and thus wide characters are wider than 8 bits, then a number of
mechanisms need to be provided by the library, and it can be hard to
keep that down to zero cost if not used.

A big issue is locales. An ASCII only system can easily just define very
crude locale support that is very cheap. Once you introduce UTF-8, it
becomes a very slippery slope that can make it hard to keep the size of
the library code brought in under control.

Remember, even simple things like printf pull in some locale code, even
if you never actually set a locale.

>
>>>
>>>> Perhaps the biggest is the embedded market where needing to support
>>>
>>> You mean tiny things like SD cards or flash sticks? I can store there
>>> "Foo😀Bar.txt" just fine. Either embedded system does not need to
>>> communicate in char* text at all, can fully ignore its encoding or has
>>> to deal with UTF-8 anyway. I know of no other examples, despite I've
>>> participated in programming whole pile of embedded systems over
>>> the decades.
>>
>> Many such system communicate in command strings, maybe even with a
>> minimal TCP/IP but have no need for processing data beyond pure ASCII.
>
> Same as with numbers, if no need to show the degree in 74.3°F so no
> need for to process anything beyond pure ASCII. Otherwise the software
> needs to detect that there are bytes C2 B0 for to show ° also no biggie.

Again, the problem is that once you have defined that Multi-byte
characters exist, things like printf will use locale support that might
pull in classification routines that might need to classify which
characters are 'digits' or 'letters' in the full Unicode range.

For a PC, with a large OS, that support is fairly cheap, and might even
be just built in, but in a small embedded system that can be costly.

I HAVE had systems that defined that characters were UTF-8 and the
result was I couldn't use a lot of the library because it pulled in too
much locale code to fit into my machine.

>
>>>> beyond plain ASCII isn't needed, and DEFINING that strings will follow
>>>> UTF-8 rules adds a LOT of complications for some operations that just
>>>> aren't needed on many of the systems.
>>>
>>> WHAT complications? Give examples? Both ASCII and UTF-8 are row of
>>> bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
>>> use-case where UTF-8 hurts? Human languages and typography
>>> are horribly complicated but UTF-8 is genially trivial. Either embedded
>>> system does not do linguistic analyses of poems or if it does then
>>> it needs to use Unicode anyway. But commonly if it can't display
>>> something then it shows � and done.
>>
>> Once you have your char as being defined as a Multi-Byte Character Set,
>> then wchar_t must be big enough to hold any of them. If you just support
>> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
>> the wchar_t stuff can just be alias for the char stuff.
>>
>> Thus forcing char to be UTF-8 adds a lot of complexity to the system.
>
> Most embedded systems that I programmed used wchar_t for nothing.
> So the compiler generated precisely 0 bytes of wchar_t processing
> into image that was flashed into those.
>


Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<9d0d96a6-4e83-4250-a5f9-16f483295277n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19932&group=comp.lang.c#19932

X-Received: by 2002:a05:6214:508b:: with SMTP id kk11mr9676825qvb.61.1641788652706;
Sun, 09 Jan 2022 20:24:12 -0800 (PST)
X-Received: by 2002:a05:620a:290d:: with SMTP id m13mr1180509qkp.151.1641788652544;
Sun, 09 Jan 2022 20:24:12 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 20:24:12 -0800 (PST)
In-Reply-To: <APNCJ.282688$3q9.253297@fx47.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <sr53qo$vbl$1@dont-email.me>
<_mpBJ.219710$qz4.56726@fx97.iad> <36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com> <27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com> <c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com> <4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com> <ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org> <c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org> <2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
<GvKCJ.77541$KV.71777@fx14.iad> <471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>
<qpMCJ.117485$_Y5.68107@fx29.iad> <b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>
<APNCJ.282688$3q9.253297@fx47.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9d0d96a6-4e83-4250-a5f9-16f483295277n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 04:24:12 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 311
 by: Öö Tiib - Mon, 10 Jan 2022 04:24 UTC

On Monday, 10 January 2022 at 05:35:08 UTC+2, Richard Damon wrote:
> On 1/9/22 9:40 PM, Öö Tiib wrote:
> > On Monday, 10 January 2022 at 03:59:00 UTC+2, Richard Damon wrote:
> >> On 1/9/22 7:51 PM, Öö Tiib wrote:
> >>> On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
> >>>> On 1/9/22 5:35 PM, Öö Tiib wrote:
> >>>>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> >>>>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> >>>>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >>>>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>>>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>> ...
> >>>>>>>
> >>>>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>>>>>>>> That should work unless underlying file system does not support files
> >>>>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>>>>>>>> needed as it already works like in my example on vast majority of things.
> >>>>>>>>>>
> >>>>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>>>>>>>> about it. Your real beef is with the implementors, not the standard.
> >>>>>>>>>
> >>>>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>>>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>>>>>>>
> >>>>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
> >>>>>>>> should have been added?
> >>>>>>>
> >>>>>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> >>>>>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> >>>>>>>
> >>>>>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> >>>>>> representation, what's the difference? One form or the other shows up
> >>>>>> only when it is displayed in some UI - the filesystem isn't one, which
> >>>>>> leads to the implementation's runtime behavior.
> >>>>>
> >>>>> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> >>>>> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> >>>>> names in underlying file system precisely as posted.
> >>>>>
> >>>>>> If they are actually different in their binary sequence, and this is the
> >>>>>> result of the utf-8 string being wrongly converted multiple times, this
> >>>>>> looks like a bad implementation, rather than a problem with the standard.
> >>>>>> IIUC you are advocating for some statement in the standard that prevents
> >>>>>> implementations from messing up with "character sets" in null terminated
> >>>>>> char strings?
> >>>>>
> >>>>> I mean that standard should require that all char* texts are treated as
> >>>>> UTF-8 by standard library unless said otherwise. If implementation needs
> >>>>> some other encoding of such byte sequence then it provides
> >>>>> platform-specific functions or compiler switches and/or extends language
> >>>>> with implementation-defined char_iso8859_1_t character types and
> >>>>> prefixes. If it is noteworthy handy type then add it to standards too, I
> >>>>> don't care.
> >>>>>
> >>>>> If standard can define that overflow in signed atomics is well defined
> >>>>> and two's complement is mandated there then it also can define that all
> >>>>> char* texts are UTF-8. The only question is if what I suggest is reasonable
> >>>>> or not. From viewpoint of implementer of standard library or users it
> >>>>> is likely blessing ... so I think it is question of business/politics/religions.
> >>>>
> >>>> The difference is that in these days, the existence of computers that
> >>>> aren't going to be able to support two's complement that will still want
> >>>> to support modern 'C' is effectively non-existent.
> >>>>
> >>>> The existence of machines that might still want to be able to support
> >>>> non-UTF-8 strings is not.
> >>>
> >>> But do there exist machines that do want to support texts as char* but
> >>> do not want to support UTF-8? Describe those machines, give examples.
> >> Small embedded micros with no need for large character sets.
> >
> > You diligently avoid giving examples?
> > You mean if it displays only Arabic numbers then it needs only 10
> > characters, if minuses and dots too then 12. ASCII and UTF-8 are identical
> > in that processing. So UTF-8 adds no extra bytes to such system.
>
> The problem is that once you define that your 'Character Set' is UTF-8,
> and thus wide characters are wider than 8 bits, then a number of
> mechanisms need to be provided by the library, and it can be hard to
> keep that down to zero cost if not used.

You must be specific.
> A big issue is locales. An ASCII only system can easily just define very
> crude locale support that is very cheap. Once you introduce UTF-8, it
> becomes a very slippery slope that can make it hard to keep the size of
> the library code brought in under control.
>
> Remember, even simple things like printf pull in some locale code, even
> if you never actually set a locale.

No, that is a different topic. It is fully implementation-defined. A conforming
implementation may implement only the locale named "C" and be done with it.
setlocale() and localeconv() (with its struct lconv) can be trivial stubs that
behave by the letter of the standard and are not worth ever calling. It would
be nice of an implementer to support localization, but it is not really
required and not something I complain about.
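
A rough sketch of how little a program can portably rely on here, assuming nothing beyond the "C" locale that every hosted implementation must provide. The helper name c_locale_works is made up for this example; setlocale(), localeconv() and the decimal_point member are standard.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the "C" locale can be selected and
   reports "." as its decimal point, which the standard requires of the
   "C" locale. A stub-only locale implementation can satisfy this. */
static int c_locale_works(void)
{
    if (setlocale(LC_ALL, "C") == NULL)
        return 0;
    const struct lconv *lc = localeconv();
    assert(lc != NULL);
    return strcmp(lc->decimal_point, ".") == 0;
}
```

Nothing in this check depends on how char* text is encoded, which is why I treat locale support as a separate question from UTF-8.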

> >
> >>>
> >>>> Perhaps the biggest is the embedded market where needing to support
> >>>
> >>> You mean tiny things like SD cards or flash sticks? I can store there
> >>> "Foo😀Bar.txt" just fine. Either embedded system does not need to
> >>> communicate in char* text at all, can fully ignore its encoding or has
> >>> to deal with UTF-8 anyway. I know of no other examples, despite I've
> >>> participated in programming whole pile of embedded systems over
> >>> the decades.
> >>
> >> Many such system communicate in command strings, maybe even with a
> >> minimal TCP/IP but have no need for processing data beyond pure ASCII.
> >
> > Same as with numbers: if there is no need to show the degree sign in
> > 74.3°F, there is no need to process anything beyond pure ASCII. Otherwise
> > the software needs to detect the bytes C2 B0 in order to show °, also no biggie.
>
> Again, the problem is that once you have defined that multi-byte
> characters exist, things like printf will use locale support that might
> pull in classification routines that might need to classify which
> characters are 'digits' or 'letters' in the full Unicode range.

Stay with UTF-8? The device can keep the locale "en_US". It has to show □ for
each symbol missing from its font and � for each illegal UTF-8 byte sequence
(which is trivial to detect). There is likely no font in existence that has
all Unicode symbols, so a font with 20 symbols is fully conformant for an
embedded device that does not need to analyze Hebrew manuscripts but only to
show the temperature to a desperate housewife.
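
To illustrate how little code the detection takes, here is a hedged sketch of a one-code-point decoder that substitutes U+FFFD (�) for malformed input. The function name utf8_decode and its interface are assumptions made for this example, not something from the thread or from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s, where n > 0 bytes are available.
 * Stores the code point in *cp and returns the number of bytes consumed.
 * Malformed input yields U+FFFD and consumes exactly one byte. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len;
    uint32_t c = s[0];

    if (c < 0x80)                { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else                         { *cp = 0xFFFD; return 1; }

    if (len > n) { *cp = 0xFFFD; return 1; }      /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates and values past U+10FFFF. */
    if (c < min_cp[len] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
        *cp = 0xFFFD;
        return 1;
    }
    *cp = c;
    return len;
}
```

Fed the bytes C2 B0 it reports U+00B0 (°) and consumes two bytes; a device with a 20-symbol font would then look the code point up in its glyph table and draw □ when the lookup fails.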


Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<35c5adf1-a311-4832-a1a9-1090074b2621n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19933&group=comp.lang.c#19933
X-Received: by 2002:a05:6214:508b:: with SMTP id kk11mr9770240qvb.61.1641792235553;
Sun, 09 Jan 2022 21:23:55 -0800 (PST)
X-Received: by 2002:a05:622a:11ce:: with SMTP id n14mr64880571qtk.432.1641792235346;
Sun, 09 Jan 2022 21:23:55 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 21:23:55 -0800 (PST)
In-Reply-To: <2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=108.48.119.9; posting-account=Ix1u_AoAAAAILVQeRkP2ENwli-Uv6vO8
NNTP-Posting-Host: 108.48.119.9
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <35c5adf1-a311-4832-a1a9-1090074b2621n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: jameskuy...@alumni.caltech.edu (james...@alumni.caltech.edu)
Injection-Date: Mon, 10 Jan 2022 05:23:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 36
 by: james...@alumni.calt - Mon, 10 Jan 2022 05:23 UTC

On Sunday, January 9, 2022 at 5:35:42 PM UTC-5, Öö Tiib wrote:
> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> > On 1/9/2022 1:44 PM, Öö Tiib wrote:
....
> > > Something that makes it clear that it is a defect when "Foo≡ƒÿÇBar.txt" is silently opened
> > > on a file system that fully supports files named "Foo😀Bar.txt", I suppose.
> > >
> > Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> > representation, what's the difference? One form or the other shows up
> > only when it is displayed in some UI - the filesystem isn't one, which
> > leads to the implementation's runtime behavior.
> How do you mean, the same binary representation? Both "Foo😀Bar.txt" and
> "Foo≡ƒÿÇBar.txt" files can be in the same directory. Both have Unicode
> names in the underlying file system precisely as posted.

Have you checked to make sure? Any system where passing the UTF-8 string
"Foo😀Bar.txt" to fopen() opens a file whose name displays as Foo≡ƒÿÇBar.txt
is likely to be a system where the file names are displayed using some single-
byte encoding. The UTF-8 encoding of "Foo😀Bar.txt" is

0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

After quite a bit of searching, I found Code page 865 (MS-DOS Nordic), which
has 0XF0 = '≡', 0x9F = 'ƒ' and 0X98 = 'ÿ'. If the utilities that you use to display file
names used that encoding to interpret the file name, that would explain your results.
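
The byte sequence above can be reproduced with a few lines of C. This is only a sketch; the helper name hex_bytes is made up for the example. Note that the string literal has to be split after "\x80", because the following "B" and "a" are hex digits and would otherwise be absorbed into the escape sequence.

```c
#include <stdio.h>
#include <string.h>

/* Write the bytes of s into out as space-separated hex pairs.
 * Returns 0 on success, -1 if out is too small (the check is conservative:
 * 4 free bytes are required for each input byte). */
int hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t n = strlen(s), pos = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (outsz - pos < 4)
            return -1;
        pos += (size_t)snprintf(out + pos, outsz - pos, "%02X%s",
                                (unsigned char)s[i], i + 1 < n ? " " : "");
    }
    return 0;
}
```

Calling hex_bytes("Foo\xF0\x9F\x98\x80" "Bar.txt", out, sizeof out) with a 64-byte buffer fills out with the 14 byte values listed above; a tool that maps each of the four middle bytes through a single-byte code page shows four characters where a UTF-8 aware tool shows one.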

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<91da42d7-2503-4525-9ea4-8764a475b2b3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19934&group=comp.lang.c#19934
X-Received: by 2002:a05:620a:17a8:: with SMTP id ay40mr4494863qkb.485.1641795061569;
Sun, 09 Jan 2022 22:11:01 -0800 (PST)
X-Received: by 2002:a05:620a:199d:: with SMTP id bm29mr12263522qkb.450.1641795061433;
Sun, 09 Jan 2022 22:11:01 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 22:11:01 -0800 (PST)
In-Reply-To: <35c5adf1-a311-4832-a1a9-1090074b2621n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com> <35c5adf1-a311-4832-a1a9-1090074b2621n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <91da42d7-2503-4525-9ea4-8764a475b2b3n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 06:11:01 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 66
 by: Öö Tiib - Mon, 10 Jan 2022 06:11 UTC

On Monday, 10 January 2022 at 07:24:01 UTC+2, james...@alumni.caltech.edu wrote:
> On Sunday, January 9, 2022 at 5:35:42 PM UTC-5, Öö Tiib wrote:
> > On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> > > On 1/9/2022 1:44 PM, Öö Tiib wrote:
> ...
> > > > Something that makes it clear that it is a defect when "Foo≡ƒÿÇBar.txt" is silently opened
> > > > on a file system that fully supports files named "Foo😀Bar.txt", I suppose.
> > > >
> > > Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> > > representation, what's the difference? One form or the other shows up
> > > only when it is displayed in some UI - the filesystem isn't one, which
> > > leads to the implementation's runtime behavior.
> > How do you mean, the same binary representation? Both "Foo😀Bar.txt" and
> > "Foo≡ƒÿÇBar.txt" files can be in the same directory. Both have Unicode
> > names in the underlying file system precisely as posted.
> Have you checked to make sure? Any system where passing the UTF-8 string
> "Foo😀Bar.txt" to fopen() opens a file whose name displays as Foo≡ƒÿÇBar.txt
> is likely to be a system where the file names are displayed using some single-
> byte encoding. The UTF-8 encoding of "Foo😀Bar.txt" is
>
> 0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74
>
> After quite a bit of searching, I found Code page 865 (MS-DOS Nordic), which
> has 0XF0 = '≡', 0x9F = 'ƒ' and 0X98 = 'ÿ'. If the utilities that you use to display file
> names used that encoding to interpret the file name, that would explain your results.

Do you need a screenshot with files named "Foo😀Bar.txt" and "Foo≡ƒÿÇBar.txt"
side by side? It is just that to get "Foo😀Bar.txt" one needs to use the
non-standard _wfopen(L"Foo😀Bar.txt", L"w"). That is the default behavior
with both MSVC and MinGW gcc.

People have apparently lamented about it, so the behavior of fopen can be
repaired with a butt-ugly XML file (an application manifest) linked into the
program in a specific way, or by providing one next to your program with
yourprogramname.xml as its name. That trick works from the Windows 10 May
2019 Update onward.

But reading UTF-8 (for example a password) from the console is still
impossible with the standard functions. One has to write platform-specific
code roughly like this:

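// Assumed context: HANDLE hi is the console input handle and
// char *buf has room for len bytes.
DWORD wlen = len - 1 + 2;                 // text plus the trailing "\r\n"
SIZE_T wbuf_len = wlen * sizeof(WCHAR);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread = 0;
ReadConsoleW(hi, wbuf, wlen, &nread, 0);
wbuf[nread >= 2 ? nread - 2 : 0] = 0;     // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0); // 0 on failure
SecureZeroMemory(wbuf, wbuf_len);         // wipe the UTF-16 copy of the secret
HeapFree(GetProcessHeap(), 0, wbuf);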
That way we have UTF-8 read into buf.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srgp6r$n6f$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19936&group=comp.lang.c#19936
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Mon, 10 Jan 2022 09:02:35 +0100
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <srgp6r$n6f$1@dont-email.me>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
<GhFCJ.204983$831.61812@fx40.iad>
<b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
<87o84k48w0.fsf@bsb.me.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 10 Jan 2022 08:02:35 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d234e6344e91b940fc93671bf53053e0";
logging-data="23759"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18uDpuJ8f0pVeLRAkPB30UTYifbDx+oVyU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:AwBe1LuYosmUbQ04hqQUCjB5gnA=
In-Reply-To: <87o84k48w0.fsf@bsb.me.uk>
Content-Language: en-GB
 by: David Brown - Mon, 10 Jan 2022 08:02 UTC

On 10/01/2022 03:33, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>
>> On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>>>
>>> On the other hand, some languages add things like 'vowel points' to
>>> characters, and those are separate graphemes even though they are added
>>> in a similar manner. This comes down to what the original language
>>> thought of as a 'character', which just makes things even more complicated.
>>>
>> In Hebrew the "vowel points" are optional. They are used in beginners' and
>> religious texts, but not in general use. So if we take a text from scripture,
>> and represent it with and without vowels, is that the same text or a different
>> text? Almost all Hebrew speakers would say "It's the same text". So
>> strcmp() doesn't necessarily work in a Hebrew context.
>
> strcmp fails much closer to home (at least closer to my geographic home)
> because, in Spanish, ch and ll are, traditionally, considered separate
> letters. All c* words collate before any ch* words, and all l* words
> before any ll* ones.
>
> This has proved so inconvenient that I believe that the Real Academia
> Española has ruled that, now, only ñ must be considered separately.
>

In a more extreme case, in Norwegian "aa" is sometimes sorted very early
alphabetically, and sometimes very late as it is a transliteration of
the Norwegian letter "å", which is the last letter in our alphabet.

Sorting for human use (as distinct from, say, making a binary tree for
lookups, in which case pure data-based sorting is fine) is complicated
business!
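
The distinction can be sketched in standard C with strcmp() for data-based ordering and strcoll() for ordering by the current LC_COLLATE locale. The helper names below are invented for the example, and which locales exist (and how they treat "aa" or "ch") depends entirely on the implementation.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: plain byte order, fine for lookup structures. */
int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort() comparator: order defined by the current LC_COLLATE locale. */
int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of n strings, either for people or for machines. */
void sort_words(const char **words, size_t n, int for_humans)
{
    qsort(words, n, sizeof *words, for_humans ? cmp_locale : cmp_bytes);
}
```

A program that wants human ordering would call setlocale(LC_COLLATE, "") once at startup; without that call the locale is "C" and both comparators give the same result.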

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srgsnj$5gn$3@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19941&group=comp.lang.c#19941
Path: i2pn2.org!i2pn.org!aioe.org!8hiQobHKlOvsb2aWVVOzwA.user.46.165.242.75.POSTED!not-for-mail
From: mate...@xyz.invalid (Mateusz Viste)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Mon, 10 Jan 2022 10:02:43 +0100
Organization: . . .
Message-ID: <srgsnj$5gn$3@gioia.aioe.org>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me>
<_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<87tuec49lz.fsf@bsb.me.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="5655"; posting-host="8hiQobHKlOvsb2aWVVOzwA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
X-Notice: Filtered by postfilter v. 0.9.2
 by: Mateusz Viste - Mon, 10 Jan 2022 09:02 UTC

2022-01-10 at 02:18 +0000, Ben Bacarisse wrote:
> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points,
> I would interpret a "UTF-8 strlen" as being a function that counts the
> number of encoded code points, and that's simple enough.
> (...)
> Obviously, for some uses, this is too simple as it does not detect
> incorrect encodings.

I'm not sure what your point was supposed to be. "It's simple to write
non-practical, prototype-grade code"? Yes, it is.

Now, I am not saying that writing a utf-8 strlen() is incredibly
difficult of course. I am only saying it is an extra layer of
complexity compared to UCS-2 or UTF-32. And that is why I understand
why people often choose to internally store strings in one of these
encodings instead of utf-8 (esp. if dealing with fixed-width character
outputs). It's simply easier to deal with an array of values that maps
directly to codepoints rather than parse a utf-8 string taking care not
to explode on encoding errors or edge cases.

Mateusz

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<3191612e-f7bb-4a81-b59d-2b4ddbac18d7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19942&group=comp.lang.c#19942
X-Received: by 2002:a05:6214:518f:: with SMTP id kl15mr68024571qvb.4.1641807408831;
Mon, 10 Jan 2022 01:36:48 -0800 (PST)
X-Received: by 2002:ae9:eb03:: with SMTP id b3mr14420503qkg.100.1641807408697;
Mon, 10 Jan 2022 01:36:48 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Mon, 10 Jan 2022 01:36:48 -0800 (PST)
In-Reply-To: <srg90u$cbf$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=84.50.190.130; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 84.50.190.130
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<lezpdwdw.fsf@yahoo.com> <srg90u$cbf$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3191612e-f7bb-4a81-b59d-2b4ddbac18d7n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 09:36:48 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 44
 by: Öö Tiib - Mon, 10 Jan 2022 09:36 UTC

On Monday, 10 January 2022 at 05:26:33 UTC+2, james...@alumni.caltech.edu wrote:
> On 1/9/22 5:41 AM, Po Lu wrote:
> > "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> writes:
> >
> >> No, I am quite accurately and honestly expressing my confusion. You
> >> object to something being prohibited by the standards that is, to the
> >> best of my understanding, allowed. It would make more sense if you
> >> were objecting the fact that it isn't mandatory, and if you were
> >> making such claims, I would disagree with you about whether it would
> >> be a good idea to make it mandatory - but as far as I can tell, you're
> >> claiming it isn't allowed.
> >
> > AFAICT, he's complaining about Microsoft's specific implementations of
> > some standards.
> He's repeatedly asserted that it's the standards themselves that he's
> complaining about. His actual complaints, however, seem to be about
> Microsoft-specific behavior. That's not a contradiction - he's
> complaining about the fact that the standards allow that behavior.

My complaints arise because several events happened to coincide.

On one hand, issues with electronic components caused a
shortage of prototype devices to run tests on. So some
cooperation partners decided to run unit tests on Windows
boxes. Since the C standard library on Windows is crappy, these
unit tests now mostly test ad-hoc hacks that simulate a proper
standard library, and pointless man-months are wasted on those.

On the other hand, C++20 broke the u8 prefix, indicating a
dedication to push UTF-8 into that char8_t* garbage that no
standard library function uses. The char8_t type would also
break constexpr processing of such text, as casts to char are
illegal in a constexpr context.

It smells like some new family of non-standard functions is coming
soon, in the style of _u8fopen(u8"Foo😀Bar.txt", u8"w"). The story of
Annex K repeated. Oh, I hope I am just being paranoid and the UCRT with
Visual Studio 2022 turns out plain great, but what is the chance of
that?

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<87ilur4xx8.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=19945&group=comp.lang.c#19945
Path: i2pn2.org!i2pn.org!paganini.bofh.team!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
Date: Mon, 10 Jan 2022 11:45:07 +0000
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <87ilur4xx8.fsf@bsb.me.uk>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org> <87tuec49lz.fsf@bsb.me.uk>
<srgsnj$5gn$3@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="d66b7acf84c3753c59039a01afb9fa20";
logging-data="6981"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fPTou/wvNPUgE8w9+w8pJ2Cu2/KB3KAA="
Cancel-Lock: sha1:61m+GTik2h9QSXfTy7b+XesFsQ8=
sha1:Vn3jPqFyAeTAm+6ctDA/vJGoygM=
X-BSB-Auth: 1.f9676387f43fef44e222.20220110114507GMT.87ilur4xx8.fsf@bsb.me.uk
 by: Ben Bacarisse - Mon, 10 Jan 2022 11:45 UTC

Mateusz Viste <mateusz@xyz.invalid> writes:

> 2022-01-10 at 02:18 +0000, Ben Bacarisse wrote:
>> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points,
>> I would interpret a "UTF-8 strlen" as being a function that counts the
>> number of encoded code points, and that's simple enough.
>> (...)
>> Obviously, for some uses, this is too simple as it does not detect
>> incorrect encodings.
>
> I'm not sure what your point was supposed to be. "It's simple to write
> non-practical, prototype-grade code"? Yes, it is.

Yes, that was my point. A lot of people think UTF-8 is more complex
than it is so I think it helps to demystify it a bit.

--
Ben.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<87czkz4xtn.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=19946&group=comp.lang.c#19946
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
Date: Mon, 10 Jan 2022 11:47:16 +0000
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <87czkz4xtn.fsf@bsb.me.uk>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org> <87tuec49lz.fsf@bsb.me.uk>
<SvNCJ.12005$jW.9864@fx05.iad>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="d66b7acf84c3753c59039a01afb9fa20";
logging-data="6981"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18g/xQpOd9i+AxewACK1dCCHYej8gxXSjo="
Cancel-Lock: sha1:qQ6WTPh/WWPosDbZSfrQbIhwblw=
sha1:4Ws31uyIuz/YB4lBiuSiN1boWnU=
X-BSB-Auth: 1.8d84d37ff11ed900b9b9.20220110114716GMT.87czkz4xtn.fsf@bsb.me.uk
 by: Ben Bacarisse - Mon, 10 Jan 2022 11:47 UTC

Richard Damon <Richard@Damon-Family.org> writes:

> On 1/9/22 9:18 PM, Ben Bacarisse wrote:
>> Mateusz Viste <mateusz@xyz.invalid> writes:
>>
>>> While UTF-8 is neat, it is also complex to decode. Even a simple
>>> strlen() can be challenging.
>> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points, I
>> would interpret a "UTF-8 strlen" as being a function that counts the
>> number of encoded code points, and that's simple enough. Every byte,
>> before the null, that does not have 10 in its top two bits is the
>> start of a code point:
>> size_t ustrlen(char *s)
>> {
>>     size_t len = 0;
>>     while (*s) len += (*s++ & 0xc0) != 0x80;
>>     return len;
>> }
>> Obviously, for some uses, this is too simple as it does not detect
>> incorrect encodings.
>
> My understanding is that for MBCS the function strlen returns the
> number of BYTES in the string, not the number of Multi-Byte Characters
> in the string.
>
> This means that strlen can be used to determine how much space is
> needed to store the string.

Yes. But that's not what "a simple strlen()" for UTF-8 appeared to be
referring to. After all, as you say, strlen /is/ strlen for UTF-8
strings.

--
Ben.
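
For the uses where the simple counter is too simple, a checking variant is still short. The following is a sketch under stated assumptions: the name ustrlen_checked and the (size_t)-1 error convention are invented here, and overlong forms and surrogates are deliberately not rejected, to keep it brief.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Returns (size_t)-1 if a lead byte is invalid or a continuation byte
 * is missing. Simplification: overlong forms and surrogates pass. */
size_t ustrlen_checked(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len = 0;

    while (*p) {
        size_t extra;                      /* continuation bytes expected */
        if (*p < 0x80)                extra = 0;
        else if ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else return (size_t)-1;            /* stray continuation or bad lead */
        p++;
        while (extra--) {
            if ((*p & 0xC0) != 0x80)       /* also catches an early NUL */
                return (size_t)-1;
            p++;
        }
        len++;
    }
    return len;
}
```

The extra work over the one-line loop is a lead-byte classification and a bounded inner loop, so the cost of "not exploding" on bad input stays modest.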

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<8f6d1594-3913-4f8d-9938-83a62b6790c0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19947&group=comp.lang.c#19947
X-Received: by 2002:a05:622a:14c6:: with SMTP id u6mr65315447qtx.195.1641818072374;
Mon, 10 Jan 2022 04:34:32 -0800 (PST)
X-Received: by 2002:a05:620a:199d:: with SMTP id bm29mr12883032qkb.450.1641818072236;
Mon, 10 Jan 2022 04:34:32 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Mon, 10 Jan 2022 04:34:32 -0800 (PST)
In-Reply-To: <87czkz4xtn.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:d04a:4768:e599:25e2;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:d04a:4768:e599:25e2
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com> <srcll6$14f3$1@gioia.aioe.org>
<87tuec49lz.fsf@bsb.me.uk> <SvNCJ.12005$jW.9864@fx05.iad> <87czkz4xtn.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8f6d1594-3913-4f8d-9938-83a62b6790c0n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Mon, 10 Jan 2022 12:34:32 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 33
 by: Malcolm McLean - Mon, 10 Jan 2022 12:34 UTC

On Monday, 10 January 2022 at 11:47:27 UTC, Ben Bacarisse wrote:
> Richard Damon <Ric...@Damon-Family.org> writes:
>
> > On 1/9/22 9:18 PM, Ben Bacarisse wrote:
> >> Mateusz Viste <mat...@xyz.invalid> writes:
> >>
> >>> While UTF-8 is neat, it is also complex to decode. Even a simple
> >>> strlen() can be challenging.
> >> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points, I
> >> would interpret a "UTF-8 strlen" as being a function that counts the
> >> number of encoded code points, and that's simple enough. Every byte,
> >> before the null, that does not have 10 in its top two bits is the
> >> start of a code point:
> >> size_t ustrlen(char *s)
> >> {
> >>     size_t len = 0;
> >>     while (*s) len += (*s++ & 0xc0) != 0x80;
> >>     return len;
> >> }
> >> Obviously, for some uses, this is too simple as it does not detect
> >> incorrect encodings.
> >
> > My understanding is that for MBCS the function strlen returns the
> > number of BYTES in the string, not the number of Multi-Byte Characters
> > in the string.
> >
> > This means that strlen can be used to determine how much space is
> > needed to store the string.
> Yes. But that's not what "a simple strlen()" for UTF-8 appeared to be
> referring to. After all, as you say, strlen /is/ strlen for UTF-8
> strings.
>
Yes. It's not obvious from the name "strlen" what it should do when fed
UTF-8.
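
The two readings of "length" can be put side by side. This sketch reuses the ustrlen() quoted above (with a const parameter) next to the standard strlen(); the struct text_size and the function measure are names invented for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Code point counter as quoted in the thread, with a const parameter. */
size_t ustrlen(const char *s)
{
    size_t len = 0;
    while (*s) len += (*s++ & 0xc0) != 0x80;
    return len;
}

/* Both answers to "how long is this string?". */
struct text_size {
    size_t bytes;        /* storage needed, excluding the terminator */
    size_t code_points;  /* encoded Unicode code points */
};

struct text_size measure(const char *s)
{
    struct text_size t;
    t.bytes = strlen(s);
    t.code_points = ustrlen(s);
    return t;
}
```

For the file name used elsewhere in this thread, measure("Foo\xF0\x9F\x98\x80" "Bar.txt") gives 14 bytes and 11 code points: strlen() answers the allocation question, ustrlen() the counting one.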

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<1e59a2a7-baaf-4648-afa5-9bc636f7883en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19948&group=comp.lang.c#19948
X-Received: by 2002:a05:620a:16b4:: with SMTP id s20mr11871304qkj.229.1641820558656;
Mon, 10 Jan 2022 05:15:58 -0800 (PST)
X-Received: by 2002:a05:622a:93:: with SMTP id o19mr11611992qtw.379.1641820558499;
Mon, 10 Jan 2022 05:15:58 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Mon, 10 Jan 2022 05:15:58 -0800 (PST)
In-Reply-To: <srgsnj$5gn$3@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=84.50.190.130; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 84.50.190.130
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com> <srcll6$14f3$1@gioia.aioe.org>
<87tuec49lz.fsf@bsb.me.uk> <srgsnj$5gn$3@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1e59a2a7-baaf-4648-afa5-9bc636f7883en@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 13:15:58 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 19
 by: Öö Tiib - Mon, 10 Jan 2022 13:15 UTC

On Monday, 10 January 2022 at 11:02:54 UTC+2, Mateusz Viste wrote:
>
> Now, I am not saying that writing a utf-8 strlen() is incredibly
> difficult of course. I am only saying it is an extra layer of
> complexity compared to UCS-2 or UTF-32. And that is why I understand
> why people often choose to internally store strings in one of these
> encodings instead of utf-8 (esp. if dealing with fixed-width character
> outputs). It's simply easier to deal with an array of values that maps
> directly to codepoints rather than parse a utf-8 string taking care not
> to explode on encoding errors or edge cases.

That argument feels like the result of a misunderstanding of how glyphs
are formed from UCS-2 and UTF-32 text. Both encodings contain a variety
of combining characters, modifiers, accents and tabulators. So even with
a monospaced font (in a world where proportional fonts are more
frequently used) one can't decide the width of the result on screen
without examining all characters in the sequence. But if all characters
of the sequence have to be examined anyway, then UTF-8 is often just a
bit less memory to examine. Somehow it does not look like the other
options are "simply easier".
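
For reference, a UTF-8 strlen() of the kind discussed above could look roughly like the sketch below. The name utf8_codepoints is made up for illustration and does not come from any post in this thread. It counts code points and rejects malformed input, which is about all the "extra layer" amounts to; it still says nothing about glyphs or width on screen.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): count the code points in a
 * NUL-terminated UTF-8 string. Returns (size_t)-1 on malformed input. */
size_t utf8_codepoints(const char *s)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes. */
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p) {
        unsigned long cp;
        int extra;                      /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;         /* stray continuation or bad lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;      /* truncated sequence (also stops at NUL) */
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong forms, surrogates and values above U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        count++;
    }
    return count;
}
```

For the 14-byte name "Foo😀Bar.txt", strlen() reports 14 while this function reports 11.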

 by: Mateusz Viste - Mon, 10 Jan 2022 13:25 UTC

2022-01-10 at 05:15 -0800, Öö Tiib wrote:
> That argument feels like the result of a misunderstanding of how glyphs
> are formed from UCS-2 and UTF-32 text. Both encodings contain a variety
> of combining characters, modifiers, accents and tabulators. So even
> with a monospaced font (in a world where proportional fonts are more
> frequently used) one can't decide the width of the result on screen
> without examining all characters in the sequence.

That is true in a perfect world, yes. In practice, terminal-based
implementations often naively assume that 1 codepoint = 1 character on
screen. And this works fairly well, as far as I can tell, even if it's
not a 100% correct implementation.

I agree that a full-blown Unicode implementation can be quite complex
(handling text direction, combining characters, separators, control
characters, etc). In such a context the extra complexity of parsing utf-8
strings may seem irrelevant. It all depends on what the implementation's
goal is, I guess.

Mateusz
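
To make the two positions concrete, here is a rough sketch that works on an array of UTF-32 values, so no UTF-8 parsing is involved at all. The function names are invented for this illustration. The second function skips only the Combining Diacritical Marks block (U+0300 to U+036F); that is an assumption chosen to keep the example short, and a real terminal would need proper Unicode tables (more combining ranges, double-width characters, and so on).

```c
#include <stddef.h>
#include <stdint.h>

/* The naive terminal rule: every code point takes one column. */
size_t columns_naive(const uint32_t *s, size_t n)
{
    (void)s;                    /* the content is never looked at */
    return n;
}

/* Illustrative only: treats just U+0300..U+036F as zero-width. */
static int is_combining_mark(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Slightly less naive: combining marks do not advance the cursor. */
size_t columns_skip_combining(const uint32_t *s, size_t n)
{
    size_t cols = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_combining_mark(s[i]))
            cols++;
    return cols;
}
```

With the decomposed form of "é" (U+0065 followed by U+0301) the naive rule reports 2 columns and the other reports 1; with the precomposed U+00E9 both report 1. Either way the array has to be walked, even though every element is a whole code point.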

 by: Malcolm McLean - Mon, 10 Jan 2022 13:33 UTC

On Monday, 10 January 2022 at 13:25:33 UTC, Mateusz Viste wrote:
> 2022-01-10 at 05:15 -0800, Öö Tiib wrote:
> > That argument feels like the result of a misunderstanding of how glyphs
> > are formed from UCS-2 and UTF-32 text. Both encodings contain a variety
> > of combining characters, modifiers, accents and tabulators. So even
> > with a monospaced font (in a world where proportional fonts are more
> > frequently used) one can't decide the width of the result on screen
> > without examining all characters in the sequence.
> That is true in a perfect world, yes. In practice, terminal-based
> implementations often naively assume that 1 codepoint = 1 character on
> screen. And this works fairly well, as far as I can tell, even if it's
> not a 100% correct implementation.
>
> I agree that a full-blown Unicode implementation can be quite complex
> (handling text direction, combining characters, separators, control
> characters, etc). In such a context the extra complexity of parsing utf-8
> strings may seem irrelevant. It all depends on what the implementation's
> goal is, I guess.
>
You can't really separate Unicode handling from font handling. And that
is notoriously difficult, even if you restrict yourself to English.

 by: Richard Damon - Mon, 10 Jan 2022 13:38 UTC

On 1/10/22 8:15 AM, Öö Tiib wrote:
> On Monday, 10 January 2022 at 11:02:54 UTC+2, Mateusz Viste wrote:
>>
>> Now, I am not saying that writing a utf-8 strlen() is incredibly
>> difficult of course. I am only saying it is an extra layer of
>> complexity compared to UCS-2 or UTF-32. And that is why I understand
>> why people often choose to internally store strings in one of these
>> encodings instead of utf-8 (esp. if dealing with fixed-width character
>> outputs). It's simply easier to deal with an array of values that maps
>> directly to codepoints rather than parse a utf-8 string taking care not
>> to explode on encoding errors or edge cases.
>
> That argument feels like the result of a misunderstanding of how glyphs
> are formed from UCS-2 and UTF-32 text. Both encodings contain a variety
> of combining characters, modifiers, accents and tabulators. So even with
> a monospaced font (in a world where proportional fonts are more
> frequently used) one can't decide the width of the result on screen
> without examining all characters in the sequence. But if all characters
> of the sequence have to be examined anyway, then UTF-8 is often just a
> bit less memory to examine. Somehow it does not look like the other
> options are "simply easier".
>

And this is the reason that Unicode doesn't really meet the requirements
of a C 'Wide Character Type'. Wide characters are supposed to be 1
character = 1 storage unit. Because of combining characters Unicode
doesn't meet this requirement.

Ultimately, we have to live with it and accept that programming for
full compliance with the rules of the character set is going to
add complexity.

 by: james...@alumni.calt - Mon, 10 Jan 2022 17:01 UTC

On Monday, January 10, 2022 at 1:11:08 AM UTC-5, Öö Tiib wrote:
> On Monday, 10 January 2022 at 07:24:01 UTC+2, james...@alumni.caltech.edu wrote:
> > On Sunday, January 9, 2022 at 5:35:42 PM UTC-5, Öö Tiib wrote:
> > > On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> > > > On 1/9/2022 1:44 PM, Öö Tiib wrote:
> > ...
> > > > > Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> > > > > on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> > > > >
> > > > Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> > > > representation, what's the difference? One form or the other shows up
> > > > only when it is displayed in some UI - the filesystem isn't one, which
> > > > leads to the implementation's runtime behavior.
> > > How you mean same binary representation? Both "Foo😀Bar.txt" and
> > > "Foo≡ƒÿÇBar.txt" files can be in same directory. Both have Unicode
> > > names in underlying file system precisely as posted.
> > Have you checked to make sure? Any system where passing the UTF-8 string
> > "Foo😀Bar.txt" to fopen() opens a file whose name displays as Foo≡ƒÿÇBar.txt
> > is likely to be a system where the file names are displayed using some single-
> > byte encoding. The UTF-8 encoding of "Foo😀Bar.txt" is
> >
> > 0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74
> >
> > After quite a bit of searching, I found Code page 865 (MS-DOS Nordic), which
> > has 0XF0 = '≡', 0x9F = 'ƒ' and 0X98 = 'ÿ'. If the utilities that you use to display file
> > names used that encoding to interpret the file name, that would explain your results.
> Need a screenshot with files named "Foo😀Bar.txt" and "Foo≡ƒÿÇBar.txt"
> side-by-side? ...

No, that wouldn't be relevant. I realized shortly after I posted my message that you
might not understand what I meant by "Have you checked to make sure?" I started
composing a message in my head explaining in more detail. However, it was very late,
I had to get to bed, and when I checked this morning you'd already confirmed that you
didn't realize what I meant.

As I understand it, you've opened a file using "Foo😀Bar.txt" as the file name, and
somehow determined that the "actual" name of the file that got opened was
"Foo≡ƒÿÇBar.txt". You didn't specify, but I presume you reached that conclusion by
doing something like getting a directory listing at the command line or using a GUI file
browser to look at the directory.

The UTF-8 encoding of "Foo😀Bar.txt" is

0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

Those same bytes, if interpreted using Code page 865, represent the string
"Foo≡ƒÿÇBar.txt". I know very little about Windows internals, so I'm not sure why that
might be relevant. It could be just a coincidence, but that seems excessively
unlikely. What I was suggesting, and what I think Manfred was hinting at, is that the
string you provide as the file name is stored by the file system using UTF-8 encoding.
Whatever method you used to determine the "actual" file name interpreted those
bytes using a single-byte encoding, which could be Code Page 865, or possibly some
other encoding that encodes those particular characters the same way as Code Page
865. There are a lot of different code pages out there, so I couldn't check them all, but of
the dozen or so I checked, that is the only one where 0xF0 represents '≡'. The "MS-DOS
Nordic" code page was one of the first ones I checked, based upon your e-mail
address ootiib@hot.ee, where I presume "ee" refers to Estonia.

If that is indeed the case, consider what should happen if you try to open a file using
the name "Foo≡ƒÿÇBar.txt". If I understand you correctly, I believe that you would
expect it to open the same file that got opened when you specified "Foo😀Bar.txt".
However, the UTF-8 encoding of "Foo≡ƒÿÇBar.txt" is

0X46 0X6F 0X6F 0XE2 0X89 0XA1 0XC6 0X92 0XC3 0XBF 0XC3 0X87 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

I expect that you will end up opening a different file. If the file name is being displayed
using a single-byte encoding, it should have 19 characters. If that encoding is in fact
Code Page 865, then that name should be "FooΓëí╞Æ├┐├çBar.txt". So, what result do
you get?
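
The byte arithmetic above can be checked mechanically. The sketch below re-reads a UTF-8 byte string as if it were Code page 865 and writes the result back out as UTF-8. The function names are hypothetical, and the lookup table is deliberately partial: it covers only the high bytes that occur in the two file names discussed here, not the whole code page.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 spellings of a few Code page 865 characters. Partial table,
 * limited to the bytes that occur in this thread's examples. */
static const char *cp865_to_utf8(unsigned char b)
{
    switch (b) {
    case 0x80: return "\xC3\x87";       /* U+00C7 */
    case 0x87: return "\xC3\xA7";       /* U+00E7 */
    case 0x89: return "\xC3\xAB";       /* U+00EB */
    case 0x92: return "\xC3\x86";       /* U+00C6 */
    case 0x98: return "\xC3\xBF";       /* U+00FF */
    case 0x9F: return "\xC6\x92";       /* U+0192 */
    case 0xA1: return "\xC3\xAD";       /* U+00ED */
    case 0xBF: return "\xE2\x94\x90";   /* U+2510 */
    case 0xC3: return "\xE2\x94\x9C";   /* U+251C */
    case 0xC6: return "\xE2\x95\x9E";   /* U+255E */
    case 0xE2: return "\xCE\x93";       /* U+0393 */
    case 0xF0: return "\xE2\x89\xA1";   /* U+2261 */
    default:   return NULL;             /* not in this partial table */
    }
}

/* Hypothetical helper: treat each byte of `in` as one CP865 character and
 * write the resulting text to `out` as UTF-8. Returns the number of bytes
 * written (excluding the NUL), or (size_t)-1 if a byte is outside the
 * partial table or the buffer is too small. */
size_t reread_as_cp865(const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        char ascii[2] = { (char)*p, '\0' };
        const char *piece = (*p < 0x80) ? ascii : cp865_to_utf8(*p);
        size_t len;

        if (piece == NULL)
            return (size_t)-1;
        len = strlen(piece);
        if (used + len + 1 > outsize)
            return (size_t)-1;
        memcpy(out + used, piece, len);
        used += len;
    }
    if (used + 1 > outsize)
        return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

Feeding it the 14 bytes of "Foo😀Bar.txt" yields the 19-byte UTF-8 string for "Foo≡ƒÿÇBar.txt". Feeding that result through a second time yields a name of 19 characters, which is the "different file" predicted above.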

If the problem is in fact that the file name is being interpreted using a single-byte
encoding by whatever utility you're using to determine what the actual name is, then
there's absolutely nothing the standards can do about that - the behavior of any such
utility is completely outside the scope of either standard.

> ... Just that for "Foo😀Bar.txt" one needs to use non-standard
> _wfopen( L"Foo😀Bar.txt", L"w"). That is default both with MSVC and MinGW
> gcc.

As you say, it's non-standard. Therefore, nothing the C standard says could do
anything to constrain its behavior. If your complaint is indeed about the behavior of
_wfopen(), it's not relevant to either the C or C++ standards, and should be posted to a
Windows-specific forum.
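
For anyone who does end up on the Windows-specific route, a wrapper along the following lines is a common workaround. This is only a sketch: fopen_utf8 and utf8_to_wide are names invented here, while MultiByteToWideChar and _wfopen are the Windows calls being wrapped. The non-Windows branch simply assumes that the bytes passed to fopen() reach the file system unchanged.

```c
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>

/* Convert a UTF-8 string to a newly allocated UTF-16 string.
 * The caller frees the result; NULL means invalid input or no memory. */
static wchar_t *utf8_to_wide(const char *s)
{
    wchar_t *w;
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);

    if (n <= 0)
        return NULL;
    w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical helper: open a file whose name is given as UTF-8. */
FILE *fopen_utf8(const char *name, const char *mode)
{
#ifdef _WIN32
    FILE *f = NULL;
    wchar_t *wname = utf8_to_wide(name);
    wchar_t *wmode = utf8_to_wide(mode);

    if (wname != NULL && wmode != NULL)
        f = _wfopen(wname, wmode);      /* non-standard, Windows only */
    free(wname);
    free(wmode);
    return f;
#else
    /* Assumption: the byte string is passed through to the file system as is. */
    return fopen(name, mode);
#endif
}
```

None of this is covered by the C standard, which is the point made above: the conversion and the wide-character open are both platform extensions.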

 by: Vir Campestris - Mon, 10 Jan 2022 21:35 UTC

On 10/01/2022 00:51, Öö Tiib wrote:
> But do there exist machines that do want to support texts as char* but
> do not want to support UTF-8? Describe those machines, give examples.

All the mainframes that run EBCDIC. There are a lot of them still.

Andy
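
Why EBCDIC matters for char* text can be seen from the character constants alone. UTF-8 requires the basic Latin characters to keep their ASCII values, and an EBCDIC execution character set gives them different ones ('A' is 0xC1 there, and the lowercase letters are not contiguous). A small check, with function names made up for this sketch:

```c
/* Returns 1 if the execution character set gives these basic characters
 * their ASCII values, as UTF-8 text would require. */
int execution_charset_looks_ascii(void)
{
    return 'A' == 0x41 && 'a' == 0x61 && '0' == 0x30 && ' ' == 0x20;
}

/* Returns 1 if 'a'..'z' are contiguous. True for ASCII; false for EBCDIC,
 * where the lowercase letters sit in three separate runs. */
int lowercase_is_contiguous(void)
{
    return 'z' - 'a' == 25;
}
```

On an ASCII-based implementation both functions return 1; on an EBCDIC one both return 0, so ordinary narrow string literals there are not valid UTF-8 for the same text.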

 by: Scott Lurndal - Mon, 10 Jan 2022 22:09 UTC

Vir Campestris <vir.campestris@invalid.invalid> writes:
>On 10/01/2022 00:51, Öö Tiib wrote:
>> But do there exist machines that do want to support texts as char* but
>> do not want to support UTF-8? Describe those machines, give examples.
>
>All the mainframes that run EBCDIC. There are a lot of them still.

Here's the implementation guide for one:

https://public.support.unisys.com/aseries/docs/ClearPath-MCP-18.0/86002268-207.pdf

See appendix E for I18N.

 by: Öö Tiib - Tue, 11 Jan 2022 02:03 UTC

On Tuesday, 11 January 2022 at 00:09:48 UTC+2, Scott Lurndal wrote:
> Vir Campestris <vir.cam...@invalid.invalid> writes:
> >On 10/01/2022 00:51, Öö Tiib wrote:
> >> But do there exist machines that do want to support texts as char* but
> >> do not want to support UTF-8? Describe those machines, give examples.
> >
> >All the mainframes that run EBCDIC. There are a lot of them still.
> Here's the implementation guide for one:
>
> https://public.support.unisys.com/aseries/docs/ClearPath-MCP-18.0/86002268-207.pdf
>
> See appendix E for I18N.

Seems out of context, as these guys do not look like they want to upgrade to C99 or
something. But maybe in 2060 or so they will start to think about the usefulness of
UTF-8 too. EBCDIC is the usual, but irrelevant, red herring in discussions like this.
