novaBBS - comp.lang.c - Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19891&group=comp.lang.c#19891

by: Öö Tiib - Sat, 8 Jan 2022 18:20 UTC

On Saturday, 8 January 2022 at 19:17:19 UTC+2, Malcolm McLean wrote:
> On Saturday, 8 January 2022 at 16:17:53 UTC, Öö Tiib wrote:
> > On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> >
> > > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > > unprefixed string literals, even for implementations targeting platforms where that's
> > > contrary to the conventions for that platform?
> > Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
> > Why? On what? For what? Must be odd corner case.
> >
> Where you've got an 8 bit character-mapped display that supports ascii plus some
> extended characters. That used to be almost every microcomputer, and it still lives
> on to a bit in modern PCs.

When? How long ago? In the eighties?
I had the fun of helping to program such a panel more than a decade ago, to
support showing any text, including the 8,105 "simplified" Chinese characters if
needed. It wasn't that big a project and the panel was dirt cheap. I disliked that it
was pointlessly required to use UTF-16 as UTF-8 could make it even simpler. What
is the reason to use for character-mapping in our current world anything but UTF-8?
In current PCs it lives on as deliberate sabotage by the platform vendor, as there is
no reason for it other than the desire to make their proprietary programming language
look better than C.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srcll6$14f3$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19892&group=comp.lang.c#19892

by: Mateusz Viste - Sat, 8 Jan 2022 18:37 UTC

2022-01-08 at 10:20 -0800, Öö Tiib wrote:
> I disliked that it was pointlessly required to use UTF-16 as UTF-8
> could make it even simpler. What is the reason to use for
> character-mapping in our current world anything but UTF-8?

While UTF-8 is neat, it is also complex to decode. Even a simple
strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
since, there is no decoding required and every glyph has a fixed
byte length.

Mateusz

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19893&group=comp.lang.c#19893

by: james...@alumni.calt - Sat, 8 Jan 2022 18:48 UTC

On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
....
> > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > unprefixed string literals, even for implementations targeting platforms where that's
> > contrary to the conventions for that platform?
> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?

Well, let me ask you - does the implementation you use most often use UTF-8
encoding for unprefixed string literals? Since you're complaining about the difficulty of
using UTF-8, I presume that it doesn't. If not, why not? The standard doesn't say
anything to prevent that implementation from doing so. If they don't, it can only be
because they don't want to. So why don't you ask the implementors why they made
that decision? They've got a reason that seemed sufficiently good for them, find out
what it is.

....
> FILE *f = fopen( "Foo😀Bar.txt", "w");
> That should work unless underlying file system does not support files
> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> bad standard that allows implementations to weasel away. No garbage like
> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> needed as it already works like in my example on vast majority of things.

Nothing in the standard prevents an implementation from doing that. If one doesn't
already do so, that's a choice made by the implementors, and you should ask them
about it. Your real beef is with the implementors, not the standard.
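
As an illustration of the fopen() case discussed above, here is a minimal sketch. It assumes an ASCII-compatible execution character set and spells the emoji U+1F600 as explicit UTF-8 bytes, so the array contents do not depend on how the compiler encodes unprefixed literals. The names utf8_name and create_utf8_named_file are hypothetical, and whether the platform treats those bytes as UTF-8 when naming the file is exactly the implementation-dependent part.

```c
#include <assert.h>
#include <stdio.h>

/* "Foo<U+1F600>Bar.txt" with the emoji written as its four UTF-8 bytes.
   The literal is split so that 'B' is not read as part of the \x80 escape. */
const char utf8_name[] = "Foo\xF0\x9F\x98\x80" "Bar.txt";

/* Hypothetical helper: tries to create the file and returns 1 on success,
   0 on failure. How the byte string maps to a file name is decided by the
   C library and the operating system, not by this code. */
int create_utf8_named_file(const char *name)
{
    assert(name != NULL);
    FILE *f = fopen(name, "w");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}
```

Calling create_utf8_named_file(utf8_name) on a system whose file APIs take UTF-8 byte strings would be expected to create the file; on other systems the resulting name may differ or the call may fail.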

....
> You never answered why should they use obscure extensions for what they need on
> majority of cases. Why UTF-8 must be obscure extension?

They shouldn't. It isn't. It's an implementation-defined choice, and if an
implementation you want to use forces you to use an obscure extension, in order to
work with UTF-8, you should ask them why - it's nothing the standard forced them to
do. I don't need to use any extensions to work with UTF-8 on my desktop. I also don't
need to use UTF-8, but that's a separate matter.

....
> It is used on close to 100% of cases anyway. I am objecting that it is deliberately
> standardized (or more like pseudo-standardized/non-standardized) to be
> inconvenient to use.

How would you like it to be more convenient? printf("Foo😀Bar.txt") works fine on my
system. If it doesn't work on yours, talk with your implementor.

....
> Agreed. So do you have numbers how many C programmers *want* to use UTF-16? I
> think that it is little, but I do not have any sources. They may *need* to for legacy
> reasons I already mentioned but even there it is most likely small number.. Their
> pain with support to their u"string", L"string" and \x \u \U character references
> might need relieving too but is bit different topic.

If they don't want it, and in particular, don't need it, then it should be pretty easy to
convince implementors to use UTF-8 instead. Have you tried? If they refuse, their
response is likely to give you far more relevant information than I could give you, since
I don't work on that platform any more, and didn't work on it very long.

The fundamental problem is - why should people who want to use some other
encoding for unprefixed string literals be forced to use UTF-8 instead, just because
you disagree? How does the existence of implementations catering to their needs hurt
you? Those implementations aren't the reason why using UTF-8 is complicated on the
implementations you use - that's entirely due to decisions made by your implementor
- so talk to the implementor and try to convince them to change.

If it seems unreasonable to you that you should have to convince one implementor to
adopt UTF-8, keep this in mind: however hard it is to convince a single implementor to
change, it would be much harder to convince the C and C++ committees to make such
a change. If you do want to convince those committees, a good way to start is by
convincing a single implementation to change.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19894&group=comp.lang.c#19894

by: Öö Tiib - Sat, 8 Jan 2022 18:53 UTC

On Saturday, 8 January 2022 at 20:37:38 UTC+2, Mateusz Viste wrote:
> 2022-01-08 at 10:20 -0800, Öö Tiib wrote:
> > I disliked that it was pointlessly required to use UTF-16 as UTF-8
> > could make it even simpler. What is the reason to use for
> > character-mapping in our current world anything but UTF-8?
>
> While UTF-8 is neat, it is also complex to decode. Even a simple
> strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
> since, there is no decoding required and every glyph has a fixed
> byte length.

That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
varying length. That makes UTF-8 the sole sane one.
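
The claim can be checked with a small C11 sketch. It writes the sequence U+1F44C U+1F3FD with universal character names, so it does not depend on the source file's encoding, and counts code units in each encoding form. The array names and the UNITS macro are made up for this illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <uchar.h>    /* char16_t, char32_t (C11) */

/* One visible glyph: OK HAND SIGN followed by a skin tone modifier. */
static const char     ok_utf8[]  = u8"\U0001F44C\U0001F3FD";
static const char16_t ok_utf16[] = u"\U0001F44C\U0001F3FD";
static const char32_t ok_utf32[] = U"\U0001F44C\U0001F3FD";

/* Number of code units in an array, not counting the terminator. */
#define UNITS(a) (sizeof (a) / sizeof (a)[0] - 1)

static_assert(UNITS(ok_utf8)  == 8, "two 4-byte UTF-8 sequences");
static_assert(UNITS(ok_utf16) == 4, "two surrogate pairs");
static_assert(UNITS(ok_utf32) == 2, "two code points for one glyph");
```

In all three forms the single glyph occupies more than one code unit, so none of them gives a fixed size per glyph.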

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srcnm2$5h9$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19895&group=comp.lang.c#19895

by: Mateusz Viste - Sat, 8 Jan 2022 19:12 UTC

2022-01-08 at 10:53 -0800, Öö Tiib wrote:
> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
> varying length. That makes UTF-8 the sole sane one.

You are right, the implementations of UTF-16 I worked on were limited
to the BMP (ie. always 2 bytes), hence my simplified view.

Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?

Mateusz

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srcnpl$b93$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19896&group=comp.lang.c#19896

by: Manfred - Sat, 8 Jan 2022 19:13 UTC

On 1/8/22 7:37 PM, Mateusz Viste wrote:
> 2022-01-08 at 10:20 -0800, Öö Tiib wrote:
>> I disliked that it was pointlessly required to use UTF-16 as UTF-8
>> could make it even simpler. What is the reason to use for
>> character-mapping in our current world anything but UTF-8?
>
> While UTF-8 is neat, it is also complex to decode. Even a simple
> strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
> since, there is no decoding required and every glyph has a fixed
> byte length.

As Öö Tiib wrote, UTF-16 is variable length. Its predecessor (under
Windows), UCS-2, is fixed length, but it failed to keep the promise to
accommodate the entire universe of language glyphs.
Furthermore, the recent fashion of adding emojis to Unicode has made
UTF-32 no longer fixed length per glyph either: each code point is still
4 bytes, but one glyph can take several code points.

However, the problem with strlen() is most often a false problem:
usually you need to know the size of the string in memory, and that's
bytes, rather than the count of glyphs in the string. You can still
count glyphs, by the way, but that operation doesn't need to meet the
performance requirements of strlen, for example.
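
A minimal sketch of that distinction, assuming the input is valid, NUL-terminated UTF-8. strlen() already gives the size in bytes; the hypothetical helper utf8_codepoints() walks the string once and counts code points, which is still not the same as counting glyphs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated string assumed to be valid UTF-8.
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the byte string "Foo\xF0\x9F\x98\x80" "Bar.txt", strlen() reports 14 bytes while utf8_codepoints() reports 11 code points.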

>
> Mateusz
>

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<fbdb61bd-ad36-4e94-8fbc-f7f815bd1ee2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19899&group=comp.lang.c#19899

by: Malcolm McLean - Sat, 8 Jan 2022 19:22 UTC

On Saturday, 8 January 2022 at 19:12:14 UTC, Mateusz Viste wrote:
> 2022-01-08 at 10:53 -0800, Öö Tiib wrote:
> > That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
> > varying length. That makes UTF-8 the sole sane one.
> You are right, the implementations of UTF-16 I worked on were limited
> to the BMP (ie. always 2 bytes), hence my simplified view.
>
> Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?
>
The problem is that not all languages fit into the Latin mould, where
you have one letter taking up one physical rectangle of space in the
writing area.
In some languages, there are combining forms. We see this a bit
in European scripts, where you have accents. But in other languages
it can go much deeper, and you can't really provide one code per glyph.
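
A small sketch of the accent case, even in UTF-32: the same letter can be stored as one precomposed code point or as a base letter plus a combining mark. The helper names are hypothetical, and u32_base_count() only knows the Combining Diacritical Marks block (U+0300 to U+036F); real text segmentation needs the full Unicode rules.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>    /* char32_t (C11) */

static const char32_t e_precomposed[] = U"\u00E9";   /* e with acute, one code point */
static const char32_t e_decomposed[]  = U"e\u0301";  /* 'e' + COMBINING ACUTE ACCENT */

/* Length in code points of a zero-terminated UTF-32 string. */
size_t u32_len(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* Rough count of base letters: skips code points in U+0300..U+036F only.
   This is a deliberate simplification, not a grapheme cluster algorithm. */
size_t u32_base_count(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; ++s) {
        if (*s < 0x0300 || *s > 0x036F)
            ++n;
    }
    return n;
}
```

Both arrays render as the same letter, yet u32_len() gives 1 for the first and 2 for the second, while u32_base_count() gives 1 for both.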

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srconh$6u7$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19900&group=comp.lang.c#19900

by: Bart - Sat, 8 Jan 2022 19:29 UTC

On 08/01/2022 18:53, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 20:37:38 UTC+2, Mateusz Viste wrote:
>> 2022-01-08 at 10:20 -0800, Öö Tiib wrote:
>>> I disliked that it was pointlessly required to use UTF-16 as UTF-8
>>> could make it even simpler. What is the reason to use for
>>> character-mapping in our current world anything but UTF-8?
>>
>> While UTF-8 is neat, it is also complex to decode. Even a simple
>> strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
>> since, there is no decoding required and every glyph has a fixed
>> byte length.
>
> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
> varying length. That makes UTF-8 the sole sane one.

This is Unicode crossing the line into typography, markup and clip-art.

The first is an actual character, or rather, a symbol (especially if by
'glyph' is meant only the shape or design). The second is a modifier, I
believe of the colour, which IMO doesn't belong in Unicode (along with font,
height, aspect and weight, among other attributes).

I'm sure there were plenty of such schemes in ASCII too, although more
recently they take the form of explicit tag strings. But it's still the case
that a linebreak in ASCII can be CR,LF or just LF; so is that one
character or two?

At the heart, though, everyone knows that a plain ASCII string of
printable characters, one containing no control codes, attributes
or other meta-data, can be represented by an array of bytes, one per
character.

Similarly, most such strings of full 21-bit (ie. 32 bits in practice)
Unicode codes can be represented by an indexable array of 32-bit values.
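
A minimal sketch of getting such an indexable array from UTF-8 input. The function name utf8_to_utf32 is hypothetical, and the decoder is simplified: it rejects bad lead bytes and truncated sequences, but does not reject overlong forms or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes NUL-terminated UTF-8 into out[], writing at most cap code points.
   Returns the number written, or (size_t)-1 on malformed input. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes that must follow */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        ++p;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;    /* truncated sequence (also stops at NUL) */
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            ++p;
        }
        out[n++] = cp;
    }
    return n;
}
```

After decoding, out[i] gives the i-th code point in constant time, at the cost of 4 bytes per code point.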

If you really, really need those multi-Unicode sequences, then you can
choose to represent a string as an array of variable-length short
strings, most of which will be one 32-bit character long.

Although there will doubtless be other special requirements that would
make that impractical too. But then, the very definition of what is a
character or word will be blurred as well.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srcons$n00$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19901&group=comp.lang.c#19901

by: Guillaume - Sat, 8 Jan 2022 19:29 UTC

Le 08/01/2022 à 19:37, Mateusz Viste a écrit :
> 2022-01-08 at 10:20 -0800, Öö Tiib wrote:
>> I disliked that it was pointlessly required to use UTF-16 as UTF-8
>> could make it even simpler. What is the reason to use for
>> character-mapping in our current world anything but UTF-8?
>
> While UTF-8 is neat, it is also complex to decode. Even a simple
> strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
> since, there is no decoding required and every glyph has a fixed
> byte length.

Yeah. While fixed-width characters are certainly easier (and, most of
all, faster) to handle, UTF-8 is not rocket science. The encoding is
pretty simple.

The downside is more speed than complexity.
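
To show how small the encoding side is, here is a sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is hypothetical; the bit layout follows the standard 1 to 4 byte UTF-8 forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;  /* surrogates are not valid scalar values */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The branching per code point is where the speed cost mentioned above comes from, compared with copying fixed-width units.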

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19902&group=comp.lang.c#19902

by: Öö Tiib - Sat, 8 Jan 2022 22:50 UTC

On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> > On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> ...
> > > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > > unprefixed string literals, even for implementations targeting platforms where that's
> > > contrary to the conventions for that platform?
> > Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
>
> Well, let me ask you - does the implementation you use most often use UTF-8
> encoding for unprefixed string literals? Since you're complaining about the difficulty of
> using UTF-8, I presume that it doesn't. If not, why not?

All compilers that I have used have done so for some time, or at least could be configured to.
Sometimes the configuration had to be done in an inconvenient manner, but those
are the easy parts of our work, so I ignored that. I also ignored the unneeded u8
prefixes: when some confused novice added one, it did not matter. After all,
these two worked the same in C++17:

const char crap[] = "Öö Tiib 😀";
const char crap8[] = u8"Öö Tiib 😀";

But C++20 gives error about second line. Also cast does not compile there.
So that makes me angry opponent of the whole u8 prefix. It should be gone from
language. They may add their char_iso8859_1_t and iso8859_1"strings" if they
want to but should raise the privileges of the UTF-8 to be always supported
as char array and not fucked with.
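
For the C side of this, here is a minimal sketch of what "working the same" means at the byte level. It assumes a compiler that accepts C11 u8 literals; the names plain and u8_matches_plain are made up for the illustration. The emoji is written with escapes so the result does not depend on the source or execution character set.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encoding of U+1F600, spelled out byte by byte. */
static const char plain[] = "\xF0\x9F\x98\x80";

/* Returns 1 when the u8 literal for U+1F600 holds exactly the same bytes
   (including the terminating NUL) as the hand-written array above. */
int u8_matches_plain(void)
{
    /* Sanity check: four encoded bytes plus the terminating NUL. */
    assert(sizeof plain == 5);

    /* sizeof and memcmp do not care whether the literal's element type
       is char (C11/C17) or char8_t (C23). */
    return sizeof u8"\U0001F600" == sizeof plain
        && memcmp(u8"\U0001F600", plain, sizeof plain) == 0;
}
```

So the prefix only pins down the encoding; when the execution character set is already UTF-8, an unprefixed literal should contain the same bytes.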

> The standard doesn't say
> anything to prevent that implementation from doing so. If they don't, it can only be
> because they don't want to. So why don't you ask the implementors why they made
> that decision? They've got a reason that seemed sufficiently good for them, find out
> what it is.

No, that fish rots from the head, IOW from the standards. MS abuses it more than others,
but only because they are a bit bigger assholes.
> ...
> > FILE *f = fopen( "Foo😀Bar.txt", "w");
> > That should work unless underlying file system does not support files
> > named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> > bad standard that allows implementations to weasel away. No garbage like
> > u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> > needed as it already works like in my example on vast majority of things.
>
> Nothing in the standard prevents an implementation from doing that. If one doesn't
> already do so, that's a choice made by the implementors, and you should ask them
> about it. Your real beef is with the implementors, not the standard.

My beef is with the standards. Adding garbage that does not work to the standard is wrong,
and not adding to the standard what everybody at least half sane already uses is also wrong.

>
> They shouldn't. It isn't. It's an implementation-defined choice, and if an
> implementation you want to use forces you to use an obscure extension, in order to
> work with UTF-8, you should ask them why - it's nothing the standard forced them to
> do. I don't need to use any extensions to work with UTF-8 on my desktop. I also don't
> need to use UTF-8, but that's a separate matter.

Oh, if I can't convince even an experienced person like you that the obfuscation
around UTF-8 in the standards is evil, then there is no point in discussing that
position with any implementer.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srdd2b$om7$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19903&group=comp.lang.c#19903

Path: i2pn2.org!i2pn.org!aioe.org!TvNXSfOF8nxwhe91w101ag.user.46.165.242.75.POSTED!not-for-mail
From: non...@add.invalid (Manfred)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 02:16:59 +0100
Organization: Aioe.org NNTP Server
Message-ID: <srdd2b$om7$1@gioia.aioe.org>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="25287"; posting-host="TvNXSfOF8nxwhe91w101ag.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Content-Language: en-US
X-Notice: Filtered by postfilter v. 0.9.2

by: Manfred - Sun, 9 Jan 2022 01:16 UTC

On 1/8/2022 11:50 PM, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>> ...
>>>> So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
>>>> unprefixed string literals, even for implementations targeting platforms where that's
>>>> contrary to the conventions for that platform?
>>> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
>>
>> Well, let me ask you - does the implementation you use most often use UTF-8
>> encoding for unprefixed string literals? Since you're complaining about the difficulty of
>> using UTF-8, I presume that it doesn't. If not, why not?
>
> All compilers that I have used did it for some time. Or at least could be configured to.
> Sometimes the configuration had to be done in inconvenient manner but these
> are the easy parts of our work. I ignored that. Also I did ignore the unneeded u8
> prefixes. When some confused novice added it then it did not matter. After all
> these two were working same in C++17:
>
> const char crap[] = "Öö Tiib 😀";
> const char crap8[] = u8"Öö Tiib 😀";
>
> But C++20 gives error about second line. Also cast does not compile there.
> So that makes me angry opponent of the whole u8 prefix. It should be gone from
> language. They may add their char_iso8859_1_t and iso8859_1"strings" if they
> want to but should raise the privileges of the UTF-8 to be always supported
> as char array and not fucked with.

The argument you are making here is more than convincing to me, but let
me try the devil's advocate role here.

Granted, the Venerable Luminaries of the Holy Committee screwed up big
time, but they did it in C++17 (how surprising) rather than in C++20.

In principle, I could imagine a use for u8"" strings that are compatible
only with some family of printf8() functions, a sort of tight type
constraint for character types. This would probably have ended up like
Annex K, but it could still have made sense to some Unicode purists and,
more importantly, it would have done no harm to the sane world.

BUT the fact that C++17 allowed your second string, so that people
started naively using it, and THEN C++20 prohibited it, thus breaking said
naïve but so far legal code, denotes some serious dickheadedness, yes.

This is to say that the solution might be to consider C++17 a sad
parenthesis (once again), and only use char, char16_t (wchar_t?) and
char32_t where needed.
Applications where a distinct separation between utf_8 and generic char
is important can pay the price of using u8, but the majority of
applications would most probably ignore it.

>
>> The standard doesn't say
>> anything to prevent that implementation from doing so. If they don't, it can only be
>> because they don't want to. So why don't you ask the implementors why they made
>> that decision? They've got a reason that seemed sufficiently good for them, find out
>> what it is.
>
> No, that fish rots from the head, IOW from standards. MS abuses it more than others
> but only because they are bit bigger assholes.
>

Agreed.

>> ...
>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>> That should work unless underlying file system does not support files
>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>> bad standard that allows implementations to weasel away. No garbage like
>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>> needed as it already works like in my example on vast majority of things.
>>
>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>> already do so, that's a choice made by the implementors, and you should ask them
>> about it. Your real beef is with the implementors, not the standard.
>
> My beef is with standards. Adding garbage that does not work to standard is wrong
> and not adding what everybody at least half sane does use to standard is also wrong.
>

Also agreed, but since utf-8 is transparent to ascii functions, what
should have been added?
I mean, if printf can't print utf-8, it is a problem of the console
rather than printf itself, right? So some way to set the console in
utf-8 mode? But that is outside the scope of the standard, isn't it?

>>
>> They shouldn't. It isn't. It's an implementation-defined choice, and if an
>> implementation you want to use forces you to use an obscure extension, in order to
>> work with UTF-8, you should ask them why - it's nothing the standard forced them to
>> do. I don't need to use any extensions to work with UTF-8 on my desktop. I also don't
>> need to use UTF-8, but that's a separate matter.
>
> Oh, if I can't convince even experienced person like you that the obfuscation
> around UTF-8 in standards is evil then there are no point to discuss that
> position with any implementer.
>

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<lezpdwdw.fsf@yahoo.com>

https://www.novabbs.com/devel/article-flat.php?id=19905&group=comp.lang.c#19905

Path: i2pn2.org!i2pn.org!aioe.org!K/GcOpVz8Y6R73fFRNUdsw.user.46.165.242.91.POSTED!not-for-mail
From: luang...@yahoo.com (Po Lu)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
Date: Sun, 09 Jan 2022 18:41:15 +0800
Organization: Aioe.org NNTP Server
Message-ID: <lezpdwdw.fsf@yahoo.com>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: gioia.aioe.org; logging-data="20902"; posting-host="K/GcOpVz8Y6R73fFRNUdsw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (haiku)
X-Notice: Filtered by postfilter v. 0.9.2
Cancel-Lock: sha1:HNTNESJiiLah0mJG8Dz6QbmzyTE=

by: Po Lu - Sun, 9 Jan 2022 10:41 UTC

"james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu> writes:

> No, I am quite accurately and honestly expressing my confusion. You
> object to something being prohibited by the standards that is, to the
> best of my understanding, allowed. It would make more sense if you
> were objecting the fact that it isn't mandatory, and if you were
> making such claims, I would disagree with you about whether it would
> be a good idea to make it mandatory - but as far as I can tell, you're
> claiming it isn't allowed.

AFAICT, he's complaining about Microsoft's specific implementations of
some standards.

I'm not an MS-Windows programmer, but from a Unix point-of-view their
way of doing things is indeed confusing -- at least when I looked into
porting some programs.

It's probably a matter of habit: I'm sure the distinction between wide
and ASCII system calls, code pages, and text and binary streams comes
naturally to MS-Windows programmers, who in turn find the lack of
explicit text streams in Unix confusing.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srekmi$h0p$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19906&group=comp.lang.c#19906

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 13:33:21 +0100
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <srekmi$h0p$1@dont-email.me>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 9 Jan 2022 12:33:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecdcffdc11e458aefbe6267c003aafae";
logging-data="17433"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+M8A9ML401MsQkEV3ii3H3ijkNANpGQig="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:wp2b7mp6gyAw7W+86IpruUsJ2Gs=
In-Reply-To: <srcnm2$5h9$1@gioia.aioe.org>
Content-Language: en-GB

by: David Brown - Sun, 9 Jan 2022 12:33 UTC

On 08/01/2022 20:12, Mateusz Viste wrote:
> 2022-01-08 at 10:53 -0800, Öö Tiib wrote:
>> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
>> varying length. That makes UTF-8 the sole sane one.
>
> You are right, the implementations of UTF-16 I worked on were limited
> to the BMP (ie. always 2 bytes), hence my simplified view.
>

When Unicode was young, the intention was that every glyph was one
character, and it would all fit in 16 bits - that was UCS-2, as used
originally by Windows NT, Java, Python, Qt, and other systems, languages
and libraries. But it was quickly discovered that this was far from
sufficient.

> Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?
>

The terminology of Unicode can be a little confusing. (And I'm sure
someone will correct me if I get it wrong.)

A "code point" is an entry in the Unicode tables. Each code point is
uniquely identified by a number in the range 0 .. 0x10ffff, which fits
comfortably in 32 bits. The code points are organised
in "planes" for convenience, and designed so that the first 128 code
points match ASCII and that a wide range of languages can be covered by
the code points in the range 0x0000 .. 0xffff (excluding 0xd800 ..
0xdfff) so that 16 bits would often be enough.

A "code unit" is the container for the bits of the encoding. In UTF-8,
a code unit is an 8-bit unit. In UTF-16, it is 16-bit, in UTF-32 it is
32-bit.

UTF-8 takes up to four code units (32 bits total) per code point, UTF-16
takes up to two code units, and UTF-32 takes exactly one code unit per
code point. UTF-8 is always at least as compact as UTF-32, and will be
more or less compact than UTF-16 depending on the content. These are
just different encodings - different ways to write the code points.
There are others, such as GB18030, a variable-width encoding (one, two
or four bytes per code point) popular in China because it is compatible
with their traditional GB encodings in the same way UTF-8 is compatible
with ASCII.
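
A small C sketch of the code-unit counts described above, for anyone who wants to check the arithmetic. The function names are invented for the illustration, and the thresholds are the usual UTF-8 and UTF-16 range boundaries; surrogates and values above 0x10FFFF are treated as a caller error.

```c
#include <assert.h>
#include <stdint.h>

/* A code point can be encoded if it is at most 0x10FFFF and is not
   a UTF-16 surrogate (0xD800 .. 0xDFFF). */
static int is_scalar(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 8-bit code units UTF-8 needs for cp (1 to 4). */
int utf8_units(uint32_t cp)
{
    assert(is_scalar(cp));
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 16-bit code units UTF-16 needs for cp (1 or 2).
   UTF-32 needs no function: it is always one 32-bit unit. */
int utf16_units(uint32_t cp)
{
    assert(is_scalar(cp));
    return cp < 0x10000 ? 1 : 2;
}
```

For the 👌🏽 example from earlier in the thread (U+1F44C U+1F3FD), this gives 4 + 4 bytes in UTF-8, 2 + 2 sixteen-bit units in UTF-16, and two units in UTF-32, so all three happen to need 8 bytes for that one visible symbol.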

A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
that conveys meaning. Sometimes it is useful to break them down,
sometimes it is useful to treat them separately. For example, "é" can
be considered as a single grapheme, or as a grapheme "e" followed by a
combining acute accent grapheme. Different code points can also share
what looks like the same grapheme - a Latin alphabet capital A looks the
same as a Greek alphabet capital Alpha.

A "glyph" is a rendering of a grapheme - the letter "A" in different
fonts are different glyphs of the same grapheme.

What the reader perceives as a "character" is often a single grapheme,
but might be several graphemes together.

So, with that in mind, all three UTF formats require multiple code units
to cover all graphemes. But UTF-32 always gets one code point per code
unit, making it simpler and more consistent for processing Unicode text.
As a file or transfer encoding, it has the big inconvenience of being
endian-specific as well as being bulkier than UTF-8. UTF-16 combines
the worst features of UTF-8 with the worst features of UTF-32, with none
of the benefits - it exists solely because early Unicode adopters
committed too strongly to UCS-2.

People are often concerned that UTF-8 is difficult or complex to decode
or split up. It is not, in practice. It is actually quite rare that
you need to divide up a string based on characters or even find its
length in code points - for most uses of strings, you just pass them
around without bothering about the details of the contents. You need to
know how much memory the string takes, not how many code points it has.
And simply treating it as an abstract stream of data terminated by a
zero character can be enough to give you a useable sorting and
uniqueness comparison for many uses. The point where you need to decode
the code units and know what they mean is when you are doing rendering,
sorting, or other human interaction - and then you have such a vastly
bigger task that turning UTF-8 coding into UTF-32 code points is
negligible effort in comparison.

(And UTF-8 is not much harder to encode or decode than UTF-16.)
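
To give an idea of how little code the decoding step takes, here is a sketch of a single-code-point decoder in C. It is an illustration under assumptions, not a reference implementation: the name utf8_decode and its return convention are made up here, and the input is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the NUL-terminated UTF-8 string s.
   Stores it in *out and returns the number of bytes consumed,
   or 0 if the sequence is malformed (bad lead byte, missing
   continuation byte, overlong form, surrogate, or above 0x10FFFF).
   A NUL at s[0] decodes as code point 0 with length 1. */
size_t utf8_decode(const char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp, min;
    size_t len;

    if (p[0] < 0x80)                { *out = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        /* A NUL is not a continuation byte, so this never reads past it. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

Counting code points is then a loop that calls this until it reaches the terminating NUL or gets a 0 back. Everything else (copying, concatenating, comparing for equality) can keep treating the string as plain bytes.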

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19908&group=comp.lang.c#19908

X-Received: by 2002:a05:620a:9c3:: with SMTP id y3mr48678668qky.367.1641732242575;
Sun, 09 Jan 2022 04:44:02 -0800 (PST)
X-Received: by 2002:a05:622a:93:: with SMTP id o19mr8432746qtw.379.1641732242436;
Sun, 09 Jan 2022 04:44:02 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 04:44:02 -0800 (PST)
In-Reply-To: <srdd2b$om7$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Sun, 09 Jan 2022 12:44:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 122

by: Öö Tiib - Sun, 9 Jan 2022 12:44 UTC

On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> > On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> >> ...
> >>>> So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> >>>> unprefixed string literals, even for implementations targeting platforms where that's
> >>>> contrary to the conventions for that platform?
> >>> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
> >>
> >> Well, let me ask you - does the implementation you use most often use UTF-8
> >> encoding for unprefixed string literals? Since you're complaining about the difficulty of
> >> using UTF-8, I presume that it doesn't. If not, why not?
> >
> > All compilers that I have used did it for some time. Or at least could be configured to.
> > Sometimes the configuration had to be done in inconvenient manner but these
> > are the easy parts of our work. I ignored that. Also I did ignore the unneeded u8
> > prefixes. When some confused novice added it then it did not matter. After all
> > these two were working same in C++17:
> >
> > const char crap[] = "Öö Tiib 😀";
> > const char crap8[] = u8"Öö Tiib 😀";
> >
> > But C++20 gives error about second line. Also cast does not compile there.
> > So that makes me angry opponent of the whole u8 prefix. It should be gone from
> > language. They may add their char_iso8859_1_t and iso8859_1"strings" if they
> > want to but should raise the privileges of the UTF-8 to be always supported
> > as char array and not fucked with.
> The argument you are making here is more than convincing to me, but let
> me try the devil's advocate role here.
>
> Granted, the Venerable Luminaries of the Holy Committee screwed up big
> time, but they did it in C++17 (how surprising) rather than in C++20.
>
> In principle, I could imagine a use for u8"" strings that are compatible
> with some family of printf8() functions only, a sort of tight type
> constraint for character types. This would have probably ended up like
> annex K, but still it could have made sense to some Unicode purists, and
> more importantly it would have made no harm to the sane world.
>
> BUT the fact that C++17 allowed your second string, and thus people
> started naively using it, and THEN C++ prohibited it, thus breaking said
> naïve but so far legal code, denotes some serious dickheadedness, yes.
>
> This is to say that the solution might be to consider C++17 a sad
> parenthesis (once again), and only use char, char16_t (wchar_t?) and
> char32_t where needed.
> Applications where a distinct separation between utf_8 and generic char
> is important can pay the price of using u8, but the majority of
> applications would most probably ignore it.

Hmm. It is great that at least you are convinced. For me, C++17 and
C++20 both added too many changes that silently alter behavior or
silently turn it into undefined behavior, so a noisy one is even better
than the rest of it. It is just that the importance of UTF-8 in the software
development industry is hard to overestimate.

....

> >>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>> That should work unless underlying file system does not support files
> >>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>> bad standard that allows implementations to weasel away. No garbage like
> >>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>> needed as it already works like in my example on vast majority of things.
> >>
> >> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >> already do so, that's a choice made by the implementors, and you should ask them
> >> about it. Your real beef is with the implementors, not the standard.
> >
> > My beef is with standards. Adding garbage that does not work to standard is wrong
> > and not adding what everybody at least half sane does use to standard is also wrong.
> >
> Also agreed, but since utf-8 is transparent to ascii functions, what
> should have been added?

Something that makes it clear that it is a defect when "Foo≡ƒÿÇBar.txt" is silently opened
on a file system that fully supports files named "Foo😀Bar.txt", I suppose.

> I mean, if printf can't print utf-8, it is a problem of the console
> rather than printf itself, right? So some way to set the console in
> utf-8 mode? But that is outside the scope of the standard, isn't it?

The console output can be set to UTF-8 mode with a few lines of platform-specific
code ... its keyboard input can't, but that is all down to the vendor ... I agree with James there.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srf2q6$c63$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19911&group=comp.lang.c#19911

Path: i2pn2.org!i2pn.org!aioe.org!FW82V+DVgp9hOrBwDPQtKg.user.46.165.242.75.POSTED!not-for-mail
From: non...@add.invalid (Manfred)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 17:34:14 +0100
Organization: Aioe.org NNTP Server
Message-ID: <srf2q6$c63$1@gioia.aioe.org>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="12483"; posting-host="FW82V+DVgp9hOrBwDPQtKg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-US

by: Manfred - Sun, 9 Jan 2022 16:34 UTC

On 1/9/2022 1:44 PM, Öö Tiib wrote:
> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>>>> ...
>>>>>> So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
>>>>>> unprefixed string literals, even for implementations targeting platforms where that's
>>>>>> contrary to the conventions for that platform?
>>>>> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
>>>>
>>>> Well, let me ask you - does the implementation you use most often use UTF-8
>>>> encoding for unprefixed string literals? Since you're complaining about the difficulty of
>>>> using UTF-8, I presume that it doesn't. If not, why not?
>>>
>>> All compilers that I have used did it for some time. Or at least could be configured to.
>>> Sometimes the configuration had to be done in inconvenient manner but these
>>> are the easy parts of our work. I ignored that. Also I did ignore the unneeded u8
>>> prefixes. When some confused novice added it then it did not matter. After all
>>> these two were working same in C++17:
>>>
>>> const char crap[] = "Öö Tiib 😀";
>>> const char crap8[] = u8"Öö Tiib 😀";
>>>
>>> But C++20 gives error about second line. Also cast does not compile there.
>>> So that makes me angry opponent of the whole u8 prefix. It should be gone from
>>> language. They may add their char_iso8859_1_t and iso8859_1"strings" if they
>>> want to but should raise the privileges of the UTF-8 to be always supported
>>> as char array and not fucked with.
>> The argument you are making here is more than convincing to me, but let
>> me try the devil's advocate role here.
>>
>> Granted, the Venerable Luminaries of the Holy Committee screwed up big
>> time, but they did it in C++17 (how surprising) rather than in C++20.
>>
>> In principle, I could imagine a use for u8"" strings that are compatible
>> with some family of printf8() functions only, a sort of tight type
>> constraint for character types. This would have probably ended up like
>> annex K, but still it could have made sense to some Unicode purists, and
>> more importantly it would have made no harm to the sane world.
>>
>> BUT the fact that C++17 allowed your second string, and thus people
>> started naively using it, and THEN C++ prohibited it, thus breaking said
>> naïve but so far legal code, denotes some serious dickheadedness, yes.
>>
>> This is to say that the solution might be to consider C++17 a sad
>> parenthesis (once again), and only use char, char16_t (wchar_t?) and
>> char32_t where needed.
>> Applications where a distinct separation between utf_8 and generic char
>> is important can pay the price of using u8, but the majority of
>> applications would most probably ignore it.
>
> Hmm. Great is that at least you are convinced. For me C++17 and
> C++20 both added too big number of silently changing or silently
> turning into undefined behaviors so the noisy one is even better
> than the rest of it.

Yes, and the silent one is in C++17. From your example, in C++20 the
compiler doesn't allow you to pass a u8"string" to printf, does it?
If u8 had worked this way from the beginning, then the problem you
mention above wouldn't exist.

> Just that the importance of UTF-8 in software
> development industry is hard to overestimate.
>
> ...
>
>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>> That should work unless underlying file system does not support files
>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>> needed as it already works like in my example on vast majority of things.
>>>>
>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>> about it. Your real beef is with the implementors, not the standard.
>>>
>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>
>> Also agreed, but since utf-8 is transparent to ascii functions, what
>> should have been added?
>
> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>

Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
representation, what's the difference? One form or the other shows up
only when it is displayed in some UI - the filesystem isn't one, which
leads to the implementation's runtime behavior.

If they are actually different in their binary sequence, and this is the
result of the utf-8 string being wrongly converted multiple times, this
looks like a bad implementation, rather than a problem with the standard.
IIUC you are advocating for some statement in the standard that prevents
implementations from messing with "character sets" in null terminated
char strings?
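
One way to check which of the two cases applies is to dump the bytes the program actually holds. The sketch below is a hypothetical helper (hex_bytes is not from any library). For what it is worth, the garbled name looks like the four UTF-8 bytes of U+1F600 (F0 9F 98 80) shown through code page 437, where those byte values are the characters ≡, ƒ, ÿ and Ç.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the bytes of s into out as space-separated uppercase hex and
   return out. outsize must be at least 3 * strlen(s) + 1. */
char *hex_bytes(const char *s, char *out, size_t outsize)
{
    assert(s != NULL && out != NULL);
    size_t n = strlen(s);
    assert(outsize >= 3 * n + 1);

    char *w = out;
    for (size_t i = 0; i < n; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        w += sprintf(w, i ? " %02X" : "%02X", (unsigned)(unsigned char)s[i]);
    }
    *w = '\0';
    return out;
}
```

If the name passed to fopen dumps as 46 6F 6F F0 9F 98 80 42 61 72 ... the program side is fine, and whatever turns that into four separate characters sits further down, in the runtime or the OS layer.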

>> I mean, if printf can't print utf-8, it is a problem of the console
>> rather than printf itself, right? So some way to set the console in
>> utf-8 mode? But that is outside the scope of the standard, isn't it?
>
> The console output can be set to UTF-8 mode with few lines of platform specific
> code ... its keyboard input can't but that is all about vendor ... I agree with James there.
>

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<GhFCJ.204983$831.61812@fx40.iad>

https://www.novabbs.com/devel/article-flat.php?id=19912&group=comp.lang.c#19912

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx40.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <srekmi$h0p$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 35
Message-ID: <GhFCJ.204983$831.61812@fx40.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 12:52:38 -0500
X-Received-Bytes: 3387

by: Richard Damon - Sun, 9 Jan 2022 17:52 UTC

On 1/9/22 7:33 AM, David Brown wrote:

> A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
> that conveys meaning. Sometimes it is useful to break them down,
> sometimes it is useful to treat them separately. For example, "é" can
> be considered as a single grapheme, or as a grapheme "e" followed by a
> combining grapheme "'" acute accent. The same grapheme can match
> multiple code points - a Latin alphabet capital A is the same as a Greek
> alphabet capital Alpha.
>
> A "glyph" is a rendering of a grapheme - the letter "A" in different
> fonts are different glyphs of the same grapheme.
>
> What the reader perceives as a "character" is often a single grapheme,
> but might be several graphemes together.
>

No, a grapheme, from my understanding, is the character as perceived by
the reader. Thus the adding of accents to a base character builds a
single grapheme from several codepoints.
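
A small sketch of what "a single grapheme from several codepoints" means at the byte level, assuming well-formed UTF-8. The function name utf8_codepoints and the two constants are mine, for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Count code points by counting every byte that is not a UTF-8
   continuation byte (10xxxxxx). Assumes well-formed input. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* The same perceived character, spelled two ways. */
const char e_precomposed[] = "\xC3\xA9";  /* U+00E9 */
const char e_decomposed[]  = "e\xCC\x81"; /* U+0065 followed by U+0301 */
```

The reader sees "é" in both cases, but the first string is one code point in two bytes and the second is two code points in three bytes, so strcmp() reports them as different.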

The grapheme doesn't include 'font' information like which font to use,
the size, additions like bold or italics and such, which add on to make
the final glyph, but does include all the jots and tildes that are part
of the character.

On the other hand, some languages add things like 'vowel points' to
characters, and those are separate graphemes even though they are added
in a similar manner. This comes down to what the original language
thought of as a 'character', which just makes things even more complicated.

Then as you said, there are the 'look-alike' characters which are
considered (generally) to be separate, but some canonicalizations will
convert to a common character.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19913&group=comp.lang.c#19913

X-Received: by 2002:a37:c50:: with SMTP id 77mr51700464qkm.717.1641751986512;
Sun, 09 Jan 2022 10:13:06 -0800 (PST)
X-Received: by 2002:a37:315:: with SMTP id 21mr49533354qkd.52.1641751986378;
Sun, 09 Jan 2022 10:13:06 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 10:13:06 -0800 (PST)
In-Reply-To: <GhFCJ.204983$831.61812@fx40.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:302a:f794:3f85:b836;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:302a:f794:3f85:b836
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com> <srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com> <srcnm2$5h9$1@gioia.aioe.org>
<srekmi$h0p$1@dont-email.me> <GhFCJ.204983$831.61812@fx40.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sun, 09 Jan 2022 18:13:06 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 11

by: Malcolm McLean - Sun, 9 Jan 2022 18:13 UTC

On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>
> On the other hand, some languages add things like 'vowel points' to
> characters, and those are separate graphemes even though they are added
> in a similar manner. This comes down to what the original language
> thought of as a 'character', which just makes things even more complicated.
>
In Hebrew the "vowel points" are optional. They are used in beginners' and
religious texts, but not in general use. So if we take a text from scripture,
and represent it with and without vowels, is that the same text or a different
text? Almost all Hebrew speakers would say "It's the same text". So
strcmp() doesn't necessarily work in a Hebrew context.
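
As a rough sketch of what a points-insensitive comparison could look like: strip the combining points before calling strcmp(). The function names are hypothetical, and the set of code points removed (U+05B0..U+05BD, U+05BF, U+05C1, U+05C2, U+05C7) is my own selection of Hebrew points; cantillation marks are not handled, and input is assumed to be well-formed UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* 1 if the two bytes form a two-byte UTF-8 sequence encoding one of the
   Hebrew points this sketch chooses to ignore. */
static int is_hebrew_point(unsigned char b0, unsigned char b1)
{
    unsigned cp;

    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return 0;
    cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
    return (cp >= 0x05B0 && cp <= 0x05BD) || cp == 0x05BF ||
           cp == 0x05C1 || cp == 0x05C2 || cp == 0x05C7;
}

/* Copy `in` to `out` without the points. `out` needs strlen(in) + 1 bytes. */
void strip_hebrew_points(const char *in, char *out)
{
    while (*in != '\0') {
        if (is_hebrew_point((unsigned char)in[0], (unsigned char)in[1])) {
            in += 2;
            continue;
        }
        *out++ = *in++;
    }
    *out = '\0';
}

/* 1 if the strings are byte-identical once the points are removed. */
int hebrew_equal_ignoring_points(const char *a, const char *b)
{
    char *sa = malloc(strlen(a) + 1);
    char *sb = malloc(strlen(b) + 1);
    int equal = 0;

    if (sa != NULL && sb != NULL) {
        strip_hebrew_points(a, sa);
        strip_hebrew_points(b, sb);
        equal = strcmp(sa, sb) == 0;
    }
    free(sa);
    free(sb);
    return equal;
}
```

With this, a pointed and an unpointed spelling of the same word compare equal, while plain strcmp() on the original strings does not. Whether that is the right notion of "same text" is of course the question raised above.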

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<DLFCJ.154462$SR4.45518@fx43.iad>

https://www.novabbs.com/devel/article-flat.php?id=19914&group=comp.lang.c#19914

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
<GhFCJ.204983$831.61812@fx40.iad>
<b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 20
Message-ID: <DLFCJ.154462$SR4.45518@fx43.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 13:24:35 -0500
X-Received-Bytes: 2966

by: Richard Damon - Sun, 9 Jan 2022 18:24 UTC

On 1/9/22 1:13 PM, Malcolm McLean wrote:
> On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>>
>> On the other hand, some languages add things like 'vowel points' to
>> characters, and those are separate graphemes even though they are added
>> in a similar manner. This comes down to what the original language
>> thought of as a 'character', which just makes things even more complicated.
>>
> In Hebrew the "vowel points" are optional. They are used in beginners' and
> religious texts, but not in general use. So if we take a text from scripture,
> and represent it with and without vowels, is that the same text or a different
> text? Almost all Hebrew speakers would say "It's the same text". So
> strcmp() doesn't necessarily work in a Hebrew context.

Hebrew isn't the only language to use 'vowel points'; they also
occur in a number of other languages.

The key point that I was making is that in many of these
languages, the points are NOT considered a part of the letter they are
'attached' to, but a separate letter, even if typographically connected.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srfmqt$9id$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19915&group=comp.lang.c#19915

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 23:15:56 +0100
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <srfmqt$9id$1@dont-email.me>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
<GhFCJ.204983$831.61812@fx40.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 9 Jan 2022 22:15:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecdcffdc11e458aefbe6267c003aafae";
logging-data="9805"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+zBvKWTzpiBRUhPRg5zUEJxRaRY3DavbY="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:iJa2KK4u4PuFkbRvqqZWQhghoOU=
In-Reply-To: <GhFCJ.204983$831.61812@fx40.iad>
Content-Language: en-GB

by: David Brown - Sun, 9 Jan 2022 22:15 UTC

On 09/01/2022 18:52, Richard Damon wrote:
>
> On 1/9/22 7:33 AM, David Brown wrote:
>
>> A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
>> that conveys meaning. Sometimes it is useful to break them down,
>> sometimes it is useful to treat them separately. For example, "é" can
>> be considered as a single grapheme, or as a grapheme "e" followed by a
>> combining grapheme "'" acute accent. The same grapheme can match
>> multiple code points - a Latin alphabet capital A is the same as a Greek
>> alphabet capital Alpha.
>>
>> A "glyph" is a rendering of a grapheme - the letter "A" in different
>> fonts are different glyphs of the same grapheme.
>>
>> What the reader perceives as a "character" is often a single grapheme,
>> but might be several graphemes together.
>>
>
> No, a grapheme, from my understanding, is the character as perceived by
> the reader. Thus the adding of accents to a base character builds a
> single grapheme from several codepoints.

The letter "o" is a grapheme, and an umlaut accent " is a grapheme. The
combination ö may be considered a single grapheme, or a combination of
graphemes. A German reader might consider it two graphemes - an
accented letter "o". A Swedish reader would consider it to be one
grapheme, as "ö" is a distinct letter in Swedish.

>
> The grapheme doesn't include 'font' information like which font to use,
> the size, additions like bold or italics and such, which add on to make
> the final glyph, but does include all the jots and tildes that are part
> of the character.

Correct. These are included in the glyph - the actual ink pattern on
the page.

>
> On the other hand, some languages add things like 'vowel points' to
> characters, and those are separate graphemes even though they are added
> in a similar manner. This comes down to what the original language
> thought of as a 'character', which just makes things even more complicated.
>

Yes, I think that is correct. (And "it's complicated" is /certainly/
correct!)

> Then as you said, there are the 'look-alike' characters which are
> considered (generally) to be separate, but some canonicalizations will
> convert to a common character.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srfn8d$c4k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=19916&group=comp.lang.c#19916

Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Sun, 9 Jan 2022 23:23:09 +0100
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <srfn8d$c4k$1@dont-email.me>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<314f4088-9ea3-4117-b034-356d77a705cen@googlegroups.com>
<73f9b4a9-fa69-4a99-a9cb-15daa9725048n@googlegroups.com>
<srcll6$14f3$1@gioia.aioe.org>
<c0a5717c-6202-4f3a-93a9-f9d8a5b5293cn@googlegroups.com>
<srcnm2$5h9$1@gioia.aioe.org> <srekmi$h0p$1@dont-email.me>
<GhFCJ.204983$831.61812@fx40.iad>
<b1027848-b4b3-4602-8623-6c2c4fc6dc97n@googlegroups.com>
<DLFCJ.154462$SR4.45518@fx43.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 9 Jan 2022 22:23:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecdcffdc11e458aefbe6267c003aafae";
logging-data="12436"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Zv+wlDSf25i2v+B4xnd7t6GrOcrNTcqg="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:3+X6X3YkPAnsvX1CEUPmcqDHTzg=
In-Reply-To: <DLFCJ.154462$SR4.45518@fx43.iad>
Content-Language: en-GB

by: David Brown - Sun, 9 Jan 2022 22:23 UTC

On 09/01/2022 19:24, Richard Damon wrote:
> On 1/9/22 1:13 PM, Malcolm McLean wrote:
>> On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>>>
>>> On the other hand, some languages add things like 'vowel points' to
>>> characters, and those are separate graphemes even though they are added
>>> in a similar manner. This comes down to what the original language
>>> thought of as a 'character', which just makes things even more
>>> complicated.
>>>
>> In Hebrew the "vowel points" are optional. They are used in beginners'
>> and
>> religious texts, but not in general use. So if we take a text from
>> scripture,
>> and represent it with and without vowels, is that the same text or a
>> different
>> text? Almost all Hebrew speakers would say "It's the same text". So
>> strcmp() doesn't necessarily work in a Hebrew context.
>
> Hebrew isn't the only language to use 'vowel points'; they also
> occur in a number of other languages.
>
> The key point that I was making is that in many of these
> languages, the points are NOT considered a part of the letter they are
> 'attached' to, but a separate letter, even if typographically connected.

That's true. The opposite is true also - a ligature like ﬁ might be
typographically connected (depending on the font and typesetting),
despite being two independent letters and two characters. Unicode has a
number of code points for such ligatures, but there are many more that
are sometimes used in typography, especially historical documents.
(Unicode is missing an "fj" ligature, for example.)

And while "æ" is a ligature of two letters forming a diphthong (used in
Latin and related languages), it is an independent letter in Norwegian.
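
For the ligature code points Unicode does have, code that wants "ﬁ" to match "fi" has to fold them explicitly. Here is a minimal sketch for just two of them, U+FB01 and U+FB02; Unicode compatibility normalization (NFKC) performs this same mapping among many others. The function name expand_f_ligatures is hypothetical.

```c
#include <string.h>

/* Copy `in` to `out`, replacing the ligature code points U+FB01 and
   U+FB02 with the letter pairs "fi" and "fl". The output is never longer
   than the input, so `out` needs strlen(in) + 1 bytes. */
void expand_f_ligatures(const char *in, char *out)
{
    while (*in != '\0') {
        if (strncmp(in, "\xEF\xAC\x81", 3) == 0) {        /* U+FB01 */
            *out++ = 'f';
            *out++ = 'i';
            in += 3;
        } else if (strncmp(in, "\xEF\xAC\x82", 3) == 0) { /* U+FB02 */
            *out++ = 'f';
            *out++ = 'l';
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

A letter like "æ" is deliberately left alone here, since whether it is a ligature or an independent letter depends on the language.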

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19917&group=comp.lang.c#19917

X-Received: by 2002:a05:620a:111c:: with SMTP id o28mr49930628qkk.328.1641767734714;
Sun, 09 Jan 2022 14:35:34 -0800 (PST)
X-Received: by 2002:a05:6214:29ee:: with SMTP id jv14mr63924212qvb.73.1641767734548;
Sun, 09 Jan 2022 14:35:34 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 14:35:34 -0800 (PST)
In-Reply-To: <srf2q6$c63$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Sun, 09 Jan 2022 22:35:34 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 82

by: Öö Tiib - Sun, 9 Jan 2022 22:35 UTC

On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> > On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
....
> >
> >>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>> That should work unless underlying file system does not support files
> >>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>> needed as it already works like in my example on vast majority of things.
> >>>>
> >>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>> about it. Your real beef is with the implementors, not the standard.
> >>>
> >>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>
> >> Also agreed, but since utf-8 is transparent to ascii functions, what
> >> should have been added?
> >
> > Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> > on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> >
> Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> representation, what's the difference? One form or the other shows up
> only when it is displayed in some UI - the filesystem isn't one, which
> leads to the implementation's runtime behavior.

What do you mean by the same binary representation? Both "Foo≡ƒÿÇBar.txt" and
"Foo😀Bar.txt" files can exist in the same directory. Both have Unicode
names in the underlying file system, precisely as posted.

> If they are actually different in their binary sequence, and this is the
> result of the utf-8 string being wrongly converted multiple times, this
> looks like a bad implementation, rather than a problem with the standard.
> IIUC you are advocating for some statement in the standard that prevents
> implementations from messing up with "character sets" in null terminated
> char strings?

I mean that the standard should require that all char* texts are treated as
UTF-8 by the standard library unless said otherwise. If an implementation needs
some other encoding of such a byte sequence, then it provides
platform-specific functions or compiler switches and/or extends the language
with implementation-defined char_iso8859_1_t character types and
prefixes. If it is a noteworthy, handy type, then add it to the standards too; I
don't care.
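
Until something like that is required, the usual workaround is a small wrapper, sketched here under some assumptions. MultiByteToWideChar, CP_UTF8, MB_ERR_INVALID_CHARS and _wfopen are real Windows names; the wrapper name fopen_utf8 and the fixed buffer sizes are hypothetical choices for illustration, and paths longer than MAX_PATH are simply rejected.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/* Hypothetical wrapper: open a file whose name is given as UTF-8.
   On Windows the name and mode are converted to UTF-16 and passed to
   _wfopen; elsewhere the bytes are handed to fopen unchanged, on the
   assumption that the platform passes them through to the file system. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];

    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path, -1, wpath, MAX_PATH) == 0)
        return NULL; /* invalid UTF-8 or name too long */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            mode, -1, wmode, 16) == 0)
        return NULL;
    return _wfopen(wpath, wmode);
#else
    return fopen(path, mode);
#endif
}
```

A call such as fopen_utf8("Foo\xF0\x9F\x98\x80" "Bar.txt", "w") then asks for the emoji file name regardless of the active code page.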

If the standard can define that overflow in signed atomics is well defined
and mandate two's complement there, then it can also define that all
char* texts are UTF-8. The only question is whether what I suggest is reasonable
or not. From the viewpoint of standard library implementers or users it
is likely a blessing ... so I think it is a question of business/politics/religion.
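
To illustrate the atomics point: as I read C11, atomic_fetch_add on a signed integer type is specified to wrap around silently in two's complement, whereas the same overflow on a plain int is undefined. A minimal sketch, with a function name invented for the example:

```c
#include <limits.h>
#include <stdatomic.h>

/* Returns 1 if incrementing an atomic int holding INT_MAX wraps around
   to INT_MIN, which is what the standard's wording for the atomic
   fetch-and-modify functions describes. */
int atomic_increment_wraps(void)
{
    atomic_int counter;

    atomic_init(&counter, INT_MAX);
    atomic_fetch_add(&counter, 1); /* defined wrap-around, not UB */
    return atomic_load(&counter) == INT_MIN;
}
```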

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<GvKCJ.77541$KV.71777@fx14.iad>

https://www.novabbs.com/devel/article-flat.php?id=19918&group=comp.lang.c#19918

Path: i2pn2.org!i2pn.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!feeder5.feed.usenet.farm!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 75
Message-ID: <GvKCJ.77541$KV.71777@fx14.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 18:48:54 -0500
X-Received-Bytes: 5968

by: Richard Damon - Sun, 9 Jan 2022 23:48 UTC

On 1/9/22 5:35 PM, Öö Tiib wrote:
> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> ...
>>>
>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>>>> That should work unless underlying file system does not support files
>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>>>> needed as it already works like in my example on vast majority of things.
>>>>>>
>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>>>> about it. Your real beef is with the implementors, not the standard.
>>>>>
>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>>>
>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
>>>> should have been added?
>>>
>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>>>
>> Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
>> representation, what's the difference? One form or the other shows up
>> only when it is displayed in some UI - the filesystem isn't one, which
>> leads to the implementation's runtime behavior.
>
> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> names in underlying file system precisely as posted.
>
>> If they are actually different in their binary sequence, and this is the
>> result of the utf-8 string being wrongly converted multiple times, this
>> looks like a bad implementation, rather than a problem with the standard.
>> IIUC you are advocating for some statement in the standard that prevents
>> implementations from messing up with "character sets" in null terminated
>> char strings?
>
> I mean that standard should require that all char* texts are treated as
> UTF-8 by standard library unless said otherwise. If implementation needs
> some other encoding of such byte sequence then it provides
> platform-specific functions or compiler switches and/or extends language
> with implementation-defined char_iso8859_1_t character types and
> prefixes. If it is noteworthy handy type then add it to standards too, I
> don't care.
>
> If standard can define that overflow in signed atomics is well defined
> and two's complement is mandated there then it also can define that all
> char* texts are UTF-8. The only question is if what I suggest is reasonable
> or not. From viewpoint of implementer of standard library or users it
> is likely blessing ... so I think it is question of business/politics/religions.

The difference is that, these days, computers that can't support two's
complement but still want to support modern C are effectively
non-existent.

Machines that might still want to support non-UTF-8 strings are not.

Perhaps the biggest case is the embedded market, where support beyond
plain ASCII isn't needed, and DEFINING that strings will follow
UTF-8 rules adds a LOT of complications for some operations that just
aren't needed on many of the systems.

The Standard does ALLOW a system to define char to be UTF-8 (at least
until you get into issues of what it requires for wide characters).
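
One example of the kind of complication I mean, as a sketch: with plain ASCII, truncating a string to fit a fixed buffer is a single store, but with UTF-8 the cut has to land on a code point boundary. The function name utf8_truncate is made up, and well-formed input is assumed.

```c
#include <stddef.h>
#include <string.h>

/* Shorten `s` in place to at most `max_bytes` bytes without splitting a
   multi-byte UTF-8 sequence. Returns the new length in bytes. */
size_t utf8_truncate(char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut;

    if (len <= max_bytes)
        return len;
    cut = max_bytes;
    /* Back up while the byte at the cut is a continuation byte
       (10xxxxxx), so the whole partial sequence is dropped. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        --cut;
    s[cut] = '\0';
    return cut;
}
```

Even this ignores combining marks, so it can still separate an accent from its base letter.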

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=19919&group=comp.lang.c#19919

X-Received: by 2002:a05:622a:118d:: with SMTP id m13mr9017464qtk.507.1641775877567;
Sun, 09 Jan 2022 16:51:17 -0800 (PST)
X-Received: by 2002:a05:6214:21ec:: with SMTP id p12mr66040975qvj.82.1641775877431;
Sun, 09 Jan 2022 16:51:17 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 16:51:17 -0800 (PST)
In-Reply-To: <GvKCJ.77541$KV.71777@fx14.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com> <GvKCJ.77541$KV.71777@fx14.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 00:51:17 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 136

by: Öö Tiib - Mon, 10 Jan 2022 00:51 UTC

On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
> On 1/9/22 5:35 PM, Öö Tiib wrote:
> > On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> >> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> >>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> > ...
> >>>
> >>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>>>> That should work unless underlying file system does not support files
> >>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>>>> needed as it already works like in my example on vast majority of things.
> >>>>>>
> >>>>>> Nothing in the standard prevents an implementation from doing that.. If one doesn't
> >>>>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>>>> about it. Your real beef is with the implementors, not the standard.
> >>>>>
> >>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>>>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>>>
> >>>> Also agreed, but since utf-8 is transparent to ascii functions, what
> >>>> should have been added?
> >>>
> >>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> >>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> >>>
> >> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> >> representation, what's the difference? One form or the other shows up
> >> only when it is displayed in some UI - the filesystem isn't one, which
> >> leads to the implementation's runtime behavior.
> >
> > How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> > "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> > names in underlying file system precisely as posted.
> >
> >> If they are actually different in their binary sequence, and this is the
> >> result of the utf-8 string being wrongly converted multiple times, this
> >> looks like a bad implementation, rather than a problem with the standard.
> >> IIUC you are advocating for some statement in the standard that prevents
> >> implementations from messing up with "character sets" in null terminated
> >> char strings?
> >
> > I mean that standard should require that all char* texts are treated as
> > UTF-8 by standard library unless said otherwise. If implementation needs
> > some other encoding of such byte sequence then it provides
> > platform-specific functions or compiler switches and/or extends language
> > with implementation-defined char_iso8859_1_t character types and
> > prefixes. If it is noteworthy handy type then add it to standards too, I
> > don't care.
> >
> > If standard can define that overflow in signed atomics is well defined
> > and two's complement is mandated there then it also can define that all
> > char* texts are UTF-8. The only question is if what I suggest is reasonable
> > or not. From viewpoint of implementer of standard library or users it
> > is likely blessing ... so I think it is question of business/politics/religions.
>
> The difference is that in these days, the existence of computers that
> aren't going to be able to support two's complement that will still want
> to support modern 'C' is effectively non-existent.
>
> The existance of machines that might still want to be able to support
> non-UTF-8 strings is not.

But do machines exist that want to support text as char* yet do not
want to support UTF-8? Describe those machines; give examples.

> Perhaps the biggest is the embedded market where needing to support

You mean tiny things like SD cards or flash sticks? I can store
"Foo😀Bar.txt" there just fine. An embedded system either does not need
to communicate in char* text at all, can fully ignore its encoding, or
has to deal with UTF-8 anyway. I know of no other cases, even though I
have participated in programming a whole pile of embedded systems over
the decades.

> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> UTF-8 rules adds a LOT of complications for some operations that just
> aren't needed on many of the systems.

WHAT complications? Give examples. Both ASCII and UTF-8 are a row of
bytes that ends with zero. ASCII is a proper subset of UTF-8. Tell me
about a use case where UTF-8 hurts. Human languages and typography
are horribly complicated, but UTF-8 is ingeniously trivial. An embedded
system either does not do linguistic analysis of poems or, if it does,
it needs to use Unicode anyway. And commonly, if it can't display
something, it shows � and is done.
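
To make the "row of bytes that ends with zero" point concrete, here is a minimal sketch in C. The helper names utf8_is_ascii and utf8_count_code_points are hypothetical, not from any standard library, and the counter assumes its input is already well-formed UTF-8 (it does not validate).

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if every byte is below 0x80, i.e. the string is plain ASCII
   (and therefore also valid UTF-8 with identical bytes). */
int utf8_is_ascii(const char *s)
{
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s >= 0x80) {
            return 0;
        }
    }
    return 1;
}

/* Counts code points by skipping continuation bytes (10xxxxxx).
   Assumes well-formed UTF-8; this sketch does not validate. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With the name written out as bytes, "Foo\xF0\x9F\x98\x80" "Bar.txt", strlen() reports 14 bytes while the counter reports 11 code points. Byte-oriented functions such as strlen() and strcpy() need no changes, because no byte of a multi-byte UTF-8 sequence is ever zero.
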
> The Standard does ALLOW a system to define char to be UTF-8 (at least
> until you get into issues of what it requires for wide characters).

Allowing is apparently not enough, as the support rots in the standards.
Wide characters are wchar_t, char16_t and char32_t. These are
in a horrible state too, but I ignore that for now: it is not related to
the issues with char* and is far less important in industry.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<srg1jr$1rpr$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=19920&group=comp.lang.c#19920

Newsgroups: comp.lang.c

Path: i2pn2.org!i2pn.org!aioe.org!FW82V+DVgp9hOrBwDPQtKg.user.46.165.242.75.POSTED!not-for-mail
From: non...@add.invalid (Manfred)
Newsgroups: comp.lang.c
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Date: Mon, 10 Jan 2022 02:19:55 +0100
Organization: Aioe.org NNTP Server
Message-ID: <srg1jr$1rpr$1@gioia.aioe.org>
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="61243"; posting-host="FW82V+DVgp9hOrBwDPQtKg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Content-Language: en-US
X-Notice: Filtered by postfilter v. 0.9.2

by: Manfred - Mon, 10 Jan 2022 01:19 UTC

On 1/9/2022 11:35 PM, Öö Tiib wrote:
> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> ...
>>>
>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>>>> That should work unless underlying file system does not support files
>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>>>> needed as it already works like in my example on vast majority of things.
>>>>>>
>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>>>> about it. Your real beef is with the implementors, not the standard.
>>>>>
>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>>>
>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
>>>> should have been added?
>>>
>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>>>
>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
>> representation, what's the difference? One form or the other shows up
>> only when it is displayed in some UI - the filesystem isn't one, which
>> leads to the implementation's runtime behavior.
>
> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> names in underlying file system precisely as posted.

I mean the same byte sequence in their name, but different UI
representation, e.g. when decoded as utf-8 or w-1252 or whatever.

What you are saying assumes a Unicode-aware filesystem, and that's not
free from the point of view of the standard.
But, in order to support utf-8, it would be enough to have a char based
filesystem that treats names as plain 0-terminated char[]. That's
easier, probably free on most platforms, but it's different from
Unicode-aware (which could be UTF-16 like Windows, and there you have
your problems).

>
>> If they are actually different in their binary sequence, and this is the
>> result of the utf-8 string being wrongly converted multiple times, this
>> looks like a bad implementation, rather than a problem with the standard.
>> IIUC you are advocating for some statement in the standard that prevents
>> implementations from messing up with "character sets" in null terminated
>> char strings?
>
> I mean that standard should require that all char* texts are treated as
> UTF-8 by standard library unless said otherwise. If implementation needs
> some other encoding of such byte sequence then it provides
> platform-specific functions or compiler switches and/or extends language
> with implementation-defined char_iso8859_1_t character types and
> prefixes. If it is noteworthy handy type then add it to standards too, I
> don't care.

I see this as hard to win, and probably not ideal: suppose in 10 years
some better encoding than utf-8 shows up; then you are screwed again.

I'd rather stick to the fact that utf-8 is compatible with 0-terminated
char[], and so a plausible wish would be that such strings are not
screwed by the implementation; for example when you store a file name in
a filesystem with fopen() and the name is given as char[], then the
standard could mandate that reading back that same name as char[] gives
back the same byte sequence.

Currently I guess one could use a utf-8 string as a name to fopen() on
Windows, then the OS assumes it is W-1252 and converts it into UTF-16,
at which point it is screwed, and when you read it back into char[] it
is garbage.
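
A minimal sketch of that mechanism in C: the four UTF-8 bytes of 😀 (F0 9F 98 80) are decoded one byte at a time through a legacy code page, giving one UTF-16 unit per byte. The table below is an assumption on my part. It covers only those four bytes and uses code page 437 values, because that is the code page whose glyphs match the "≡ƒÿÇ" seen earlier in the thread; Windows-1252 would give different but equally wrong characters. The function names are hypothetical.

```c
#include <stddef.h>

/* Partial code page 437 table: only the four bytes that make up the
   UTF-8 encoding of U+1F600. Other non-ASCII bytes map to 0. */
static unsigned cp437_to_unicode(unsigned char b)
{
    switch (b) {
    case 0x80: return 0x00C7; /* Ç */
    case 0x98: return 0x00FF; /* ÿ */
    case 0x9F: return 0x0192; /* ƒ */
    case 0xF0: return 0x2261; /* ≡ */
    default:   return b < 0x80 ? b : 0; /* ASCII maps to itself */
    }
}

/* Misdecodes a byte string as CP437, one UTF-16 unit per input byte.
   Writes at most cap units and returns the number written. */
size_t misdecode_as_cp437(const char *bytes, unsigned short *out, size_t cap)
{
    size_t n = 0;
    while (*bytes != '\0' && n < cap) {
        out[n++] = (unsigned short)cp437_to_unicode((unsigned char)*bytes++);
    }
    return n;
}
```

Feeding it "\xF0\x9F\x98\x80" yields the four units U+2261, U+0192, U+00FF, U+00C7, which is a different name from the single code point U+1F600 that the caller meant.
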

>
> If standard can define that overflow in signed atomics is well defined
> and two's complement is mandated there then it also can define that all
> char* texts are UTF-8. The only question is if what I suggest is reasonable
> or not. From viewpoint of implementer of standard library or users it
> is likely blessing ... so I think it is question of business/politics/religions.

I agree with Richard here. Two's complement is not like utf-8.
I still think it's technical rather than business/politics/religions in
this case - as I said above I'm not sure it would even be ideal.

Re: "Some sanity for C and C++ development on Windows" by Chris Wellons

<qpMCJ.117485$_Y5.68107@fx29.iad>

https://www.novabbs.com/devel/article-flat.php?id=19922&group=comp.lang.c#19922

Newsgroups: comp.lang.c

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx29.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.4.1
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris
Wellons
Content-Language: en-US
Newsgroups: comp.lang.c
References: <sr0psj$g2d$1@dont-email.me>
<761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com>
<b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com>
<884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com>
<74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com>
<000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com>
<srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com>
<srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com>
<GvKCJ.77541$KV.71777@fx14.iad>
<471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 132
Message-ID: <qpMCJ.117485$_Y5.68107@fx29.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sun, 9 Jan 2022 20:58:46 -0500
X-Received-Bytes: 8782
X-Original-Bytes: 8649

by: Richard Damon - Mon, 10 Jan 2022 01:58 UTC

On 1/9/22 7:51 PM, Öö Tiib wrote:
> On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
>> On 1/9/22 5:35 PM, Öö Tiib wrote:
>>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
>>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
>>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
>>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
>>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>>> ...
>>>>>
>>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>>>>>> That should work unless underlying file system does not support files
>>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>>>>>> needed as it already works like in my example on vast majority of things.
>>>>>>>>
>>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>>>>>> about it. Your real beef is with the implementors, not the standard.
>>>>>>>
>>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>>>>>
>>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
>>>>>> should have been added?
>>>>>
>>>>> Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
>>>>> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>>>>>
>>>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
>>>> representation, what's the difference? One form or the other shows up
>>>> only when it is displayed in some UI - the filesystem isn't one, which
>>>> leads to the implementation's runtime behavior.
>>>
>>> How you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
>>> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
>>> names in underlying file system precisely as posted.
>>>
>>>> If they are actually different in their binary sequence, and this is the
>>>> result of the utf-8 string being wrongly converted multiple times, this
>>>> looks like a bad implementation, rather than a problem with the standard.
>>>> IIUC you are advocating for some statement in the standard that prevents
>>>> implementations from messing up with "character sets" in null terminated
>>>> char strings?
>>>
>>> I mean that standard should require that all char* texts are treated as
>>> UTF-8 by standard library unless said otherwise. If implementation needs
>>> some other encoding of such byte sequence then it provides
>>> platform-specific functions or compiler switches and/or extends language
>>> with implementation-defined char_iso8859_1_t character types and
>>> prefixes. If it is noteworthy handy type then add it to standards too, I
>>> don't care.
>>>
>>> If standard can define that overflow in signed atomics is well defined
>>> and two's complement is mandated there then it also can define that all
>>> char* texts are UTF-8. The only question is if what I suggest is reasonable
>>> or not. From viewpoint of implementer of standard library or users it
>>> is likely blessing ... so I think it is question of business/politics/religions.
>>
>> The difference is that in these days, the existence of computers that
>> aren't going to be able to support two's complement that will still want
>> to support modern 'C' is effectively non-existent.
>>
>> The existance of machines that might still want to be able to support
>> non-UTF-8 strings is not.
>
> But do there exist machines that do want to support texts as char* but
> do not want to support UTF-8? Describe those machines, give examples.

Small embedded micros with no need for large character sets.

>
>> Perhaps the biggest is the embedded market where needing to support
>
> You mean tiny things like SD cards or flash sticks? I can store there
> "Foo😀Bar.txt" just fine. Either embedded system does not need to
> communicate in char* text at all, can fully ignore its encoding or has
> to deal with UTF-8 anyway. I know of no other examples, despite I've
> participated in programming whole pile of embedded systems over
> the decades.
>

Many such systems communicate in command strings, maybe even with a
minimal TCP/IP stack, but have no need to process data beyond pure ASCII.

>> beyond plain ASCII isn't needed, and DEFINING that strings will follow
>> UTF-8 rules adds a LOT of complications for some operations that just
>> aren't needed on many of the systems.
>
> WHAT complications? Give examples? Both ASCII and UTF-8 are row of
> bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
> use-case where UTF-8 hurts? Human languages and typography
> are horribly complicated but UTF-8 is genially trivial. Either embedded
> system does not do linguistic analyses of poems or if it does then
> it needs to use Unicode anyway. But commonly if it can't display
> something then it shows � and done.

Once char is defined as holding a multi-byte character set, wchar_t
must be big enough to hold any of its characters. If you just support
ASCII, then wchar_t can be just 8 bits if you want, and (almost?) all
the wchar_t stuff can just be an alias for the char stuff.

Thus forcing char to be UTF-8 adds a lot of complexity to the system.
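
As a rough illustration of the size requirement, here is a sketch of a decoder for a single UTF-8 sequence. The name utf8_decode_one is hypothetical, and to stay short the sketch does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 if the lead byte or a
   continuation byte is malformed. Overlong forms and surrogates are
   not rejected, to keep the sketch short. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else return 0;

    for (i = 1; i < len; ++i) {
        /* A terminating zero also fails this check, so we never read
           past the end of a truncated string. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}
```

Decoding the bytes F0 9F 98 80 gives U+1F600, which does not fit in 8 or 16 bits. A wide character type that can hold every character of a UTF-8 char encoding therefore needs at least 21 bits.
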

>
>> The Standard does ALLOW a system to define char to be UTF-8 (at least
>> until you get into issues of what it requires for wide characters).
>
> Allowing is apparently not enough as the support rots in standards.
> Wide characters are wchar_t, char16_t and char32_t. These are
> in horrible state too but I ignore it for now. Not related to issues
> with char* and far less important in industry.
>

But that is part of the problem with supporting UTF-8, as that by
definition brings all the wide character issues into play.

If you define that your character set is ASCII, then wchar_t becomes
trivial.

A big part of the issue with char16_t is that it is fundamentally broken
with Unicode, but it lives on as a backwards-compatibility band-aid that
basically can't be removed without admitting that a large segment of
code will just live on as openly non-compliant.

Too much legacy code assumes that 16-bit characters are 'big enough' for
most people, and they pretty much do work if you aren't being a stickler
for full conformance to the rules, which no one is because you can't be.
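
One way to read the "16 bits is big enough" assumption is that it fails for any code point above U+FFFF, which UTF-16 has to split into a surrogate pair. The sketch below shows that encoding step; utf16_encode is a hypothetical helper, not a standard function.

```c
#include <stdint.h>

/* Encodes code point cp as UTF-16 into out[0..1].
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   itself a surrogate or lies above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F600 this produces the two units D83D DE00, so code that treats one 16-bit unit as one character sees two "characters" where there is one.
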

If it has syntax, it isn't user friendly.
