X-Received: by 2002:a05:620a:25ca:: with SMTP id y10mr50107739qko.526.1641782432402;
Sun, 09 Jan 2022 18:40:32 -0800 (PST)
X-Received: by 2002:a05:622a:491:: with SMTP id p17mr5871659qtx.300.1641782432257;
Sun, 09 Jan 2022 18:40:32 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 9 Jan 2022 18:40:32 -0800 (PST)
In-Reply-To: <qpMCJ.117485$_Y5.68107@fx29.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=94.246.251.164; posting-account=pysjKgkAAACLegAdYDFznkqjgx_7vlUK
NNTP-Posting-Host: 94.246.251.164
References: <sr0psj$g2d$1@dont-email.me> <761b391e-f071-484e-8507-f58eeb44a8e9n@googlegroups.com>
<sr53qo$vbl$1@dont-email.me> <_mpBJ.219710$qz4.56726@fx97.iad>
<36c23681-a90b-4de4-8451-e31e74f6c838n@googlegroups.com> <b13c9427-f475-4bcc-98c8-5de476b4e75bn@googlegroups.com>
<27fc916b-9aee-4a76-85e8-6d4a2281b74bn@googlegroups.com> <884c9725-5b12-4727-98a1-6b7c46efb4aen@googlegroups.com>
<c52c7902-0ce0-4db2-af97-1f9fc5c2a9fan@googlegroups.com> <74dd4f1f-c5ff-4c9e-9a04-3616a978fb04n@googlegroups.com>
<4a405512-8c50-479a-9928-857fc7d5fac4n@googlegroups.com> <000e93e1-4d5e-4dda-91da-67ded6d70f83n@googlegroups.com>
<ccc7a06a-6454-4ae4-b29c-075cd76494f9n@googlegroups.com> <srdd2b$om7$1@gioia.aioe.org>
<c8b3d237-403a-44a4-a74c-91a3ae26605an@googlegroups.com> <srf2q6$c63$1@gioia.aioe.org>
<2f16854b-ab61-4aa1-af0a-d976535eaa00n@googlegroups.com> <GvKCJ.77541$KV.71777@fx14.iad>
<471b4523-4568-4f46-9bdd-5fb5bcc7cee3n@googlegroups.com> <qpMCJ.117485$_Y5.68107@fx29.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b3f283d2-c7d2-4935-91f9-addc7e7322d1n@googlegroups.com>
Subject: Re: "Some sanity for C and C++ development on Windows" by Chris Wellons
From: oot...@hot.ee (Öö Tiib)
Injection-Date: Mon, 10 Jan 2022 02:40:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 203

On Monday, 10 January 2022 at 03:59:00 UTC+2, Richard Damon wrote:
> On 1/9/22 7:51 PM, Öö Tiib wrote:
> > On Monday, 10 January 2022 at 01:49:07 UTC+2, Richard Damon wrote:
> >> On 1/9/22 5:35 PM, Öö Tiib wrote:
> >>> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> >>>> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> >>>>> On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >>>>>> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>>>>>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>>>>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>>>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> >>> ...
> >>>>>
> >>>>>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>>>>>> That should work unless underlying file system does not support files
> >>>>>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>>>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>>>>>> needed as it already works like in my example on vast majority of things.
> >>>>>>>>
> >>>>>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>>>>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>>>>>> about it. Your real beef is with the implementors, not the standard.
> >>>>>>>
> >>>>>>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>>>>>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>>>>>
> >>>>>> Also agreed, but since utf-8 is transparent to ascii functions, what
> >>>>>> should have been added?
> >>>>>
> >>>>> Something that makes it clear that it is a defect when "Foo≡ƒÿÇBar.txt" is silently opened
> >>>>> on a file-system that fully supports files named "Foo😀Bar.txt", I suppose.
> >>>>>
> >>>> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> >>>> representation, what's the difference? One form or the other shows up
> >>>> only when it is displayed in some UI - the filesystem isn't one, which
> >>>> leads to the implementation's runtime behavior.
> >>>
> >>> How do you mean same binary representation? Both "Foo≡ƒÿÇBar.txt" and
> >>> "Foo😀Bar.txt" files can be in the same directory. Both have Unicode
> >>> names in the underlying file system precisely as posted.
> >>>
> >>>> If they are actually different in their binary sequence, and this is the
> >>>> result of the utf-8 string being wrongly converted multiple times, this
> >>>> looks like a bad implementation, rather than a problem with the standard.
> >>>> IIUC you are advocating for some statement in the standard that prevents
> >>>> implementations from messing up with "character sets" in null terminated
> >>>> char strings?
> >>>
> >>> I mean that standard should require that all char* texts are treated as
> >>> UTF-8 by standard library unless said otherwise. If implementation needs
> >>> some other encoding of such byte sequence then it provides
> >>> platform-specific functions or compiler switches and/or extends language
> >>> with implementation-defined char_iso8859_1_t character types and
> >>> prefixes. If it is noteworthy handy type then add it to standards too, I
> >>> don't care.
> >>>
> >>> If standard can define that overflow in signed atomics is well defined
> >>> and two's complement is mandated there then it also can define that all
> >>> char* texts are UTF-8. The only question is if what I suggest is reasonable
> >>> or not. From viewpoint of implementer of standard library or users it
> >>> is likely blessing ... so I think it is question of business/politics/religions.
> >>
> >> The difference is that, these days, computers that aren't going to be
> >> able to support two's complement but still want to support modern 'C'
> >> are effectively non-existent.
> >>
> >> The existence of machines that might still want to be able to support
> >> non-UTF-8 strings is not.
> >
> > But do there exist machines that do want to support texts as char* but
> > do not want to support UTF-8? Describe those machines, give examples.
> Small embedded micros with no need for large character sets.

You diligently avoid giving examples?
You mean that if it displays only Arabic numerals then it needs only 10
characters, and if minuses and dots too, then 12. ASCII and UTF-8 are
identical in that processing, so UTF-8 adds no extra bytes to such a system.
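
To show what I mean (just a sketch, the reply string is made up): parsing
such a number is byte for byte the same code whether the buffer is labelled
ASCII or UTF-8, because digits, minus and dot are single bytes below 0x80
in both encodings.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *reply = "-74.3";          /* hypothetical device reply */
    char *end;
    double value = strtod(reply, &end);   /* same bytes in ASCII and UTF-8 */
    if (end != reply)
        printf("parsed %.1f\n", value);
    return 0;
}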

> >
> >> Perhaps the biggest is the embedded market where needing to support
> >
> > You mean tiny things like SD cards or flash sticks? I can store there
> > "Foo😀Bar.txt" just fine. Either embedded system does not need to
> > communicate in char* text at all, can fully ignore its encoding, or has
> > to deal with UTF-8 anyway. I know of no other examples, even though I've
> > participated in programming a whole pile of embedded systems over
> > the decades.
>
> Many such systems communicate in command strings, maybe even with a
> minimal TCP/IP, but have no need for processing data beyond pure ASCII.

Same as with the numbers: if there is no need to show the degree sign in
74.3°F then there is no need to process anything beyond pure ASCII.
Otherwise the software needs to detect the byte sequence C2 B0 in order
to show the °, which is also no biggie.
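
Detecting that is about this much code (a sketch, the helper name is made
up): U+00B0 is encoded in UTF-8 as the two bytes 0xC2 0xB0, so a plain
byte scan finds it.

#include <stdio.h>

/* Sketch: find the degree sign U+00B0, which UTF-8 encodes as
   the two-byte sequence 0xC2 0xB0. */
static const char *find_degree(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)s[0] == 0xC2 && (unsigned char)s[1] == 0xB0)
            return s;
    return NULL;
}

int main(void)
{
    const char *reply = "74.3\xC2\xB0" "F";   /* "74.3°F" */
    puts(find_degree(reply) ? "degree sign present" : "plain ASCII");
    return 0;
}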

> >> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> >> UTF-8 rules adds a LOT of complications for some operations that just
> >> aren't needed on many of the systems.
> >
> > WHAT complications? Give examples. Both ASCII and UTF-8 are rows of
> > bytes that end with zero, and ASCII is a proper subset of UTF-8. Tell me
> > about a use-case where UTF-8 hurts. Human languages and typography
> > are horribly complicated, but UTF-8 is ingeniously trivial. Either an embedded
> > system does not do linguistic analyses of poems, or if it does then
> > it needs to use Unicode anyway. But commonly, if it can't display
> > something then it shows � and is done.
>
> Once you have your char as being defined as a Multi-Byte Character Set,
> then wchar_t must be big enough to hold any of them. If you just support
> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
> the wchar_t stuff can just be alias for the char stuff.
>
> Thus forcing char to be UTF-8 adds a lot of complexity to the system.

Most embedded systems that I have programmed used wchar_t for nothing,
so the compiler generated precisely 0 bytes of wchar_t processing into
the image that was flashed onto them.

> >
> >> The Standard does ALLOW a system to define char to be UTF-8 (at least
> >> until you get into issues of what it requires for wide characters).
> >
> > Allowing is apparently not enough as the support rots in standards.
> > Wide characters are wchar_t, char16_t and char32_t. These are
> > in horrible state too but I ignore it for now. Not related to issues
> > with char* and far less important in industry.
> >
> But that is part of the problem with supporting UTF-8, as that by
> definition brings all the wide character issues into play.
>
> If you define that your character set is ASCII, then wchar_t becomes
> trivial.
>
> A big part of the issue with char16_t is that it is fundamentally broken
> with Unicode, but lives on due to trying to maintain the backwards
> bandaids that basically can't be removed without admitting that a large
> segment of code will just live on as being openly non-compliant.
>
> Too much legacy code assumes that 16 bit characters are 'big enough' for
> most people, and it pretty much does work if you aren't being a stickler for
> full conformance to the rules, which no one is because you can't be.

But all of that is far from true. The wchar_t on Windows is 16 bits, yet
Windows supports UTF-16 fully, so a lot of characters take multiple
wchar_t's to represent. Microsoft just violates the standard with a
straight face, and I do not care about that. I care about UTF-8.
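
For example (a sketch, assuming a Windows toolchain where wchar_t is 16
bits), one emoji already takes two wchar_t's, stored as a UTF-16 surrogate
pair:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* U+1F600 does not fit in a 16-bit wchar_t, so on Windows this
       literal is stored as the surrogate pair 0xD83D 0xDE00 and
       wcslen() reports 2 code units, not 1 character. */
    const wchar_t *smiley = L"\U0001F600";
    printf("wchar_t units: %u\n", (unsigned)wcslen(smiley));
    return 0;
}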
