novaBBS - comp.lang.c - Re: Unicode test suite

Re: Unicode test suite

<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26804&group=comp.lang.c#26804

X-Received: by 2002:a05:6214:104d:b0:63c:f3e3:8220 with SMTP id l13-20020a056214104d00b0063cf3e38220mr40564qvr.0.1690656933750;
Sat, 29 Jul 2023 11:55:33 -0700 (PDT)
X-Received: by 2002:a9d:7f0a:0:b0:6b7:45a8:a80c with SMTP id
j10-20020a9d7f0a000000b006b745a8a80cmr6991944otq.3.1690656933559; Sat, 29 Jul
2023 11:55:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 11:55:33 -0700 (PDT)
In-Reply-To: <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=81.143.231.9; posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 81.143.231.9
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com> <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 29 Jul 2023 18:55:33 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2962

by: Malcolm McLean - Sat, 29 Jul 2023 18:55 UTC

On Saturday, 29 July 2023 at 19:20:10 UTC+1, fir wrote:
> On Thursday, 27 July 2023 at 20:50:09 UTC+2, Kaz Kylheku wrote:
> > On 2023-07-27, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > > Lynn's comment inspired me to add Unicode support to the Baby X
> > > resource compiler. But despite searching for quite a long time, I
> > > can't find a test suite of Unicode files in various formats. In fact
> > > it's hard to find any Unicode files at all which are not UTF-8.
> > It's a fool's errand to support Unicode formats other than UTF-8.
> >
> I don't know what "fool's errand" means, but it probably seems so. I don't remember what
> I was saying back then on this, but if UTF-8 becomes the standard in use it is
> probably fortunate to stay with it, which is kind of fortunate (most compatible with
> ASCII).
>
> BTW, if UTF-8 becomes the de facto standard, is there also a de facto standard for a single character?
>
> Is this just a 32-bit integer which is ASCII below 127 and the official number from the Unicode tables
> above that? I mean, do things like a left arrow have one official integer value?
>
There's an almost-standard, which is the Unicode "code point". However, whilst saying that
a code point is a character is good enough for most alphabets and texts, you do have
problems, like "pointing" in Hebrew texts (vowels and hard / soft signs are optional,
and indicated by small annotations around the letters). Unicode has a concept of
"combining characters" to handle this sort of situation. Unfortunately it means that
Unicode-aware routines are quite hard to write if they are to handle everything correctly.
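
To make the code point idea concrete, here is a minimal sketch in C of turning UTF-8 bytes into code points. The names utf8_decode and utf8_count are made up for illustration, and the sketch assumes the input is already well-formed UTF-8 (it does no validation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from well-formed UTF-8 at s.
   Stores the code point in *cp and returns the number of bytes consumed.
   No validation is done; malformed input is outside this sketch. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80) {                 /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    /* 11110xxx followed by three continuation bytes */
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Count the code points in a NUL-terminated, well-formed UTF-8 string. */
static size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    uint32_t cp;

    while (*p != 0) {
        p += utf8_decode(p, &cp);
        n++;
    }
    return n;
}
```

Decoding the three bytes E2 86 90 gives the single value U+2190, the left arrow, so that symbol does have one official integer. A Hebrew bet with a dagesh point (bytes D7 91 D6 BC) counts as two code points, although a reader sees one letter.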

Re: Unicode test suite

<87y1iyy0q1.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=26805&group=comp.lang.c#26805

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 20:30:30 +0100
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <87y1iyy0q1.fsf@bsb.me.uk>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
<87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="7dc7094286c70ffc9164d202debea076";
logging-data="2837057"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+2OkcZrXzce3R+p4g/BTrdFifmLCz1wWU="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:mmwtRs00VyhC+hQDbwvrk/DnU6c=
sha1:hhdehvPJsJ4aA1EX77qkb8wWAEQ=
X-BSB-Auth: 1.96395176556510e1efad.20230729203030BST.87y1iyy0q1.fsf@bsb.me.uk

by: Ben Bacarisse - Sat, 29 Jul 2023 19:30 UTC

Bart <bc@freeuk.com> writes:

> In C11, "u8" is a valid user identifier, so that you can do this:
>
> typedef unsigned byte u8;
>
> But it can also be a prefix to a string literal. I believe it's called a
> 'contextual keyword'.

No, it's not called that. For the sake of simplicity, C does not give
any special name to something that is simply part of some other lexical
entity. Likewise, the L in L"abc" and 0x1L have no special name.
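
A small sketch of the point, assuming a C11 or later compiler. The function name prefix_demo is made up for illustration; the typedef uses unsigned char, since C has no built-in "byte" type.

```c
#include <assert.h>
#include <stddef.h>

/* "u8" as an ordinary identifier: here, a typedef name. */
typedef unsigned char u8;

/* Hypothetical demo: uses u8 and L both as identifiers and as parts of
   literals in the same scope. Returns the size of the UTF-8 literal. */
static size_t prefix_demo(void)
{
    u8 first = 0x63;                             /* u8 names a type here */
    const unsigned char text[] = u8"caf\u00E9";  /* u8" begins a string literal */
    long L = 0x1L;                               /* L as identifier and as suffix */
    const wchar_t wide[] = L"abc";               /* L" begins a wide string literal */

    assert(text[0] == first);   /* 'c' is 0x63 in UTF-8 */
    assert(L == 1 && wide[3] == L'\0');
    return sizeof text;         /* c, a, f, C3, A9, NUL */
}
```

The lexer takes u8" and L" as the start of a string literal, so neither use disturbs the identifier of the same spelling.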

--
Ben.

Re: Unicode test suite

<e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26808&group=comp.lang.c#26808

X-Received: by 2002:ad4:5885:0:b0:63c:f31d:7471 with SMTP id dz5-20020ad45885000000b0063cf31d7471mr21034qvb.13.1690663101547;
Sat, 29 Jul 2023 13:38:21 -0700 (PDT)
X-Received: by 2002:a05:6808:13c2:b0:3a4:4b42:5ce1 with SMTP id
d2-20020a05680813c200b003a44b425ce1mr11433800oiw.3.1690663101216; Sat, 29 Jul
2023 13:38:21 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 13:38:20 -0700 (PDT)
In-Reply-To: <811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.80; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.80
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com> <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sat, 29 Jul 2023 20:38:21 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: fir - Sat, 29 Jul 2023 20:38 UTC

On Saturday, 29 July 2023 at 20:55:42 UTC+2, Malcolm McLean wrote:
> On Saturday, 29 July 2023 at 19:20:10 UTC+1, fir wrote:
> > On Thursday, 27 July 2023 at 20:50:09 UTC+2, Kaz Kylheku wrote:
> > > On 2023-07-27, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > > > Lynn's comment inspired me to add Unicode support to the Baby X
> > > > resource compiler. But despite searching for quite a long time, I
> > > > can't find a test suite of Unicode files in various formats. In fact
> > > > it's hard to find any Unicode files at all which are not UTF-8.
> > > It's a fool's errand to support Unicode formats other than UTF-8.
> > >
> > I don't know what "fool's errand" means, but it probably seems so. I don't remember what
> > I was saying back then on this, but if UTF-8 becomes the standard in use it is
> > probably fortunate to stay with it, which is kind of fortunate (most compatible with
> > ASCII).
> >
> > BTW, if UTF-8 becomes the de facto standard, is there also a de facto standard for a single character?
> >
> > Is this just a 32-bit integer which is ASCII below 127 and the official number from the Unicode tables
> > above that? I mean, do things like a left arrow have one official integer value?
> >
> There's an almost-standard, which is the Unicode "code point". However, whilst saying that
> a code point is a character is good enough for most alphabets and texts, you do have
> problems, like "pointing" in Hebrew texts (vowels and hard / soft signs are optional,
> and indicated by small annotations around the letters). Unicode has a concept of
> "combining characters" to handle this sort of situation. Unfortunately it means that
> Unicode-aware routines are quite hard to write if they are to handle everything correctly.

So you mean that in most cases there is "odpowiedniosc" (how is it in English? strict correspondence):
character <---> some 32-bit integer value?

Re: Unicode test suite

<6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26809&group=comp.lang.c#26809

X-Received: by 2002:a05:622a:214:b0:402:6230:7cfc with SMTP id b20-20020a05622a021400b0040262307cfcmr19984qtx.8.1690663719535;
Sat, 29 Jul 2023 13:48:39 -0700 (PDT)
X-Received: by 2002:a05:6830:1daa:b0:6b8:6d21:d2fd with SMTP id
z10-20020a0568301daa00b006b86d21d2fdmr6907543oti.7.1690663719232; Sat, 29 Jul
2023 13:48:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 13:48:38 -0700 (PDT)
In-Reply-To: <e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com> <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com> <e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 29 Jul 2023 20:48:39 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2155

by: Malcolm McLean - Sat, 29 Jul 2023 20:48 UTC

On Saturday, 29 July 2023 at 21:38:30 UTC+1, fir wrote:
> So you mean that in most cases there is "odpowiedniosc" (how is it in English? strict correspondence):
> character <---> some 32-bit integer value?
>
I think it depends which language you speak. For many people in the world,
Hebrew is something they would never use. For many Israelis, the pointing
system is something they rarely use and can do without. But for people interested
in Hebrew religious texts, it's essential, and they will need it every day.

But for a program written in a Latin alphabet based language, and with the
assumption that it might be translated as long as the burden isn't too high,
then saying that codepoints are C "characters" and UI "glyphs" is reasonable
enough.

Re: Unicode test suite

<dfbfbac2-ce37-498c-8769-ec99ca30fe04n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26810&group=comp.lang.c#26810

X-Received: by 2002:a37:9347:0:b0:765:ada5:fdd4 with SMTP id v68-20020a379347000000b00765ada5fdd4mr18490qkd.12.1690665159698;
Sat, 29 Jul 2023 14:12:39 -0700 (PDT)
X-Received: by 2002:a05:6870:76a9:b0:1bb:9288:44a0 with SMTP id
dx41-20020a05687076a900b001bb928844a0mr7277358oab.1.1690665159385; Sat, 29
Jul 2023 14:12:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 14:12:38 -0700 (PDT)
In-Reply-To: <6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.48; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.48
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com> <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com> <e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
<6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <dfbfbac2-ce37-498c-8769-ec99ca30fe04n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sat, 29 Jul 2023 21:12:39 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3111

by: fir - Sat, 29 Jul 2023 21:12 UTC

On Saturday, 29 July 2023 at 22:48:46 UTC+2, Malcolm McLean wrote:
> On Saturday, 29 July 2023 at 21:38:30 UTC+1, fir wrote:
> > So you mean that in most cases there is "odpowiedniosc" (how is it in English? strict correspondence):
> > character <---> some 32-bit integer value?
> >
> I think it depends which language you speak. For many people in the world,
> Hebrew is something they would never use. For many Israelis, the pointing
> system is something they rarely use and can do without. But for people interested
> in Hebrew religious texts, it's essential, and they will need it every day.
>
> But for a program written in a Latin alphabet based language, and with the
> assumption that it might be translated as long as the burden isn't too high,
> then saying that codepoints are C "characters" and UI "glyphs" is reasonable
> enough.

This probably leads to the concept of an encoded array: i is an index by character and
the characters are 32-bit ints, but the array itself is partly 8-bit and sometimes encoded.

The question is whether the language should offer this type of array for Unicode:

unicode string textExample;
wide char = textExample[i];

then some operators for this...

Possibly this should be built in. Some time ago I think I opted for using 32-bit Unicode
for simplicity; then there is no need to introduce encoded arrays like the above.
But in fact maybe UTF-8 is better, and then one may cover it by introducing this kind of encoded
array.

BTW, this shows that access such as txt[i] is not optimal for such encoded arrays,
and it is better to use something like txt.next, txt.prev or similar.
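
The next/prev idea can be sketched in plain C over a UTF-8 buffer. The names utf8_next and utf8_prev are made up for illustration, and the sketch assumes the buffer is well-formed UTF-8 and that the offset passed in sits on the first byte of a character.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the form 10xxxxxx. */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: byte offset of the character that follows the one
   starting at offset i. Assumes well-formed UTF-8 of length len, i < len. */
static size_t utf8_next(const char *s, size_t len, size_t i)
{
    assert(s != NULL && i < len);
    i++;
    while (i < len && is_cont((unsigned char)s[i]))
        i++;
    return i;
}

/* Hypothetical helper: byte offset of the character that precedes the one
   starting at offset i. Assumes well-formed UTF-8 and i > 0. */
static size_t utf8_prev(const char *s, size_t i)
{
    assert(s != NULL && i > 0);
    i--;
    while (i > 0 && is_cont((unsigned char)s[i]))
        i--;
    return i;
}
```

Each step costs at most a few byte tests, whereas a character-indexed txt[i] on the same buffer would have to scan from the start.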

Re: Unicode test suite

<878rayxvke.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=26811&group=comp.lang.c#26811

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 22:21:53 +0100
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <878rayxvke.fsf@bsb.me.uk>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com>
<6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com>
<e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
<6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: dont-email.me; posting-host="7dc7094286c70ffc9164d202debea076";
logging-data="2860898"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18T8Ni060XVLfpS65dOQt0MRpTbD1KN8co="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
Cancel-Lock: sha1:m9SBzKnTjJ9sb3EsxkciEsFNCHw=
sha1:b35aNLUPzqCjdvfIpUvymoqLswk=
X-BSB-Auth: 1.d6cbe18efd7bc435c9fa.20230729222153BST.878rayxvke.fsf@bsb.me.uk

by: Ben Bacarisse - Sat, 29 Jul 2023 21:21 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

> On Saturday, 29 July 2023 at 21:38:30 UTC+1, fir wrote:
>> So you mean that in most cases there is "odpowiedniosc" (how is it in English? strict correspondence):
>> character <---> some 32-bit integer value?
>>
> I think it depends which language you speak. For many people in the
> world, Hebrew is something they would never use. For many Israelis,
> the pointing system is something they rarely use and can do
> without. But for people interested in Hebrew religious texts, it's
> essential, and they will need it every day.
>
> But for a program written in a Latin alphabet based language, and with
> the assumption that it might be translated as long as the burden isn't
> too high, then saying that codepoints are C "characters" and UI
> "glyphs" is reasonable enough.

I suppose it depends on what too high a burden is, but the routines
might still have to cope with things like combining accents. A simple
test. How many newsreaders get these two words right:

café café

Some software might need these two to compare equal, and some might want
them both to match "cafe" when a user does a search.
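
A sketch of why a byte-wise comparison sees two different words, assuming one spelling uses the precomposed U+00E9 and the other a plain 'e' followed by the combining acute accent U+0301. The names below are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* "café" with the precomposed code point U+00E9 (UTF-8 bytes C3 A9). */
static const char cafe_precomposed[] = "caf\xC3\xA9";

/* "café" as 'e' plus combining acute U+0301 (UTF-8 bytes CC 81). */
static const char cafe_combining[] = "cafe\xCC\x81";

/* Byte-wise equality, which is all that strcmp can offer. */
static int bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

The two arrays differ in both length and content, so software that needs them to compare equal has to normalise both sides to the same form first, which takes Unicode data tables or a library.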

--
Ben.

Re: Unicode test suite

<5bf22246-162e-4d89-9e80-4d5c8a69953en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26814&group=comp.lang.c#26814

X-Received: by 2002:a05:622a:4cf:b0:403:27b2:85b5 with SMTP id q15-20020a05622a04cf00b0040327b285b5mr22716qtx.12.1690667009594;
Sat, 29 Jul 2023 14:43:29 -0700 (PDT)
X-Received: by 2002:a05:6808:bcc:b0:3a3:8c81:a887 with SMTP id
o12-20020a0568080bcc00b003a38c81a887mr11349269oik.6.1690667009294; Sat, 29
Jul 2023 14:43:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 14:43:28 -0700 (PDT)
In-Reply-To: <878rayxvke.fsf@bsb.me.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<20230727105813.33@kylheku.com> <6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>
<811fdcee-65f8-4ef8-ac84-a80086cc35c1n@googlegroups.com> <e4f0284f-1239-4656-9969-3c75d8637454n@googlegroups.com>
<6584ec68-2650-4d77-b2aa-9198fd33f97dn@googlegroups.com> <878rayxvke.fsf@bsb.me.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5bf22246-162e-4d89-9e80-4d5c8a69953en@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 29 Jul 2023 21:43:29 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3530

by: Malcolm McLean - Sat, 29 Jul 2023 21:43 UTC

On Saturday, 29 July 2023 at 22:22:08 UTC+1, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
> > On Saturday, 29 July 2023 at 21:38:30 UTC+1, fir wrote:
> >> So you mean that in most cases there is "odpowiedniosc" (how is it in English? strict correspondence):
> >> character <---> some 32-bit integer value?
> >>
> > I think it depends which language you speak. For many people in the
> > world, Hebrew is something they would never use. For many Israelis,
> > the pointing system is something they rarely use and can do
> > without. But for people interested in Hebrew religious texts, it's
> > essential, and they will need it every day.
> >
> > But for a program written in a Latin alphabet based language, and with
> > the assumption that it might be translated as long as the burden isn't
> > too high, then saying that codepoints are C "characters" and UI
> > "glyphs" is reasonable enough.
> I suppose it depends on what too high a burden is, but the routines
> might still have to cope with things like combining accents. A simple
> test. How many newsreaders get these two words right:
>
> café café
>
> Some software might need these two to compare equal, and some might want
> them both to match "cafe" when a user does a search.
>
A naive routine will match "café" and "cafe", but not "élan" and "elan", if you use
combining accents. And it will match neither if you use codepoint U+00E9. Whilst
if you speak French then a routine which will work for French is relatively simple
to write, the problem is that other languages have subtly different rules, and
it's a huge undertaking to support them all, or even a subset of likely customer
languages.
(We considered translating our software, but rejected the idea as far too expensive
and problematic. Plus even Japanese customers don't seem to mind UIs with
small amounts of English text).
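
The four cases can be checked with a sketch that takes "naive routine" to mean a plain byte search with strstr. That reading, and the name naive_match, are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Naive search: does the byte string needle occur anywhere in haystack? */
static int naive_match(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}
```

With a combining accent, "cafe\xCC\x81" still starts with the bytes of "cafe", so the search succeeds, while "e\xCC\x81lan" has the accent bytes sitting between 'e' and 'l', so "elan" is not found. With the precomposed form, "caf\xC3\xA9" and "\xC3\xA9lan" contain no 'e' byte at that position and match neither.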

Re: Unicode test suite

<871qgqo06r.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26817&group=comp.lang.c#26817

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 14:52:12 -0700
Organization: None to speak of
Lines: 24
Message-ID: <871qgqo06r.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
<87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="d3b018c39647936c2599e5bfe2349293";
logging-data="2863055"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+U64ma1doz9jSjDeJdOTTR"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:Iid452tEVaaGZKrfvlwAr/S+qPE=
sha1:4v+T/qJDAE3Enh9Aqr78JU6GOfU=

by: Keith Thompson - Sat, 29 Jul 2023 21:52 UTC

Bart <bc@freeuk.com> writes:
[...]
> In C11, "u8" is a valid user identifier, so that you can do this:
>
> typedef unsigned byte u8;
>
> But it can also be a prefix to a string literal. I believe it's called
> a 'contextual keyword'.

No, as Ben pointed out, it's just part of the syntax of a string
literal. The L in L"abc" doesn't conflict with L as an identifier.

[...]
> Actually, I can't see the point of such a prefix in C. All you have to
> do is to decree that string literals are UTF8 anyway.

Bad idea. It would restrict C's portability to systems that support
UTF-8, excluding both small embedded systems with limited character set
support and systems that still use EBCDIC.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<ua42v3$2nihd$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26818&group=comp.lang.c#26818

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 23:13:57 +0100
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <ua42v3$2nihd$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 29 Jul 2023 22:13:55 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ceb74387feb450401c92448496a7cc87";
logging-data="2869805"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/MCvEbEp/G3iXMGk/3uloURxJ+L766cuQ="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:Xb19B7r9jvYet35WNp/XLHq0ytU=
In-Reply-To: <871qgqo06r.fsf@nosuchdomain.example.com>

by: Bart - Sat, 29 Jul 2023 22:13 UTC

On 29/07/2023 22:52, Keith Thompson wrote:
> Bart <bc@freeuk.com> writes:
> [...]
>> In C11, "u8" is a valid user identifier, so that you can do this:
>>
>> typedef unsigned byte u8;
>>
>> But it can also be a prefix to a string literal. I believe it's called
>> a 'contextual keyword'.
>
> No, as Ben pointed out, it's just part of the syntax of a string
> literal. The L in L"abc" doesn't conflict with L as an identifier.
>
> [...]
>> Actually, I can't see the point of such a prefix in C. All you have to
>> do is to decree that string literals are UTF8 anyway.
>
> Bad idea. It would restrict C's portability to systems that support
> UTF-8, excluding both small embedded systems with limited character set
> support

In what way would it restrict those small systems?

Support for UTF8 means merely that characters in a string can have
values outside of 0 to 127, which they could have anyway whether UTF8 is
involved or not.

How exactly is a u8"€" string different from "€" anyway?

If your text editor saves as UTF8, then you will have a 3-byte sequence
plus nul in either case.
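
The 3-byte claim can be checked with a short sketch, assuming C11 or later. The euro sign is written as a universal character name so the result does not depend on how the source file is saved; is_utf8_euro is a made-up helper name.

```c
#include <assert.h>
#include <string.h>

/* U+20AC EURO SIGN in a u8 literal: always stored as UTF-8,
   whatever the execution character set is. */
static const unsigned char euro_u8[] = u8"\u20AC";

/* Hypothetical helper: 1 if buf holds the 3-byte UTF-8 euro plus NUL. */
static int is_utf8_euro(const unsigned char *buf, size_t size)
{
    static const unsigned char expected[] = { 0xE2, 0x82, 0xAC, 0x00 };

    assert(buf != NULL);
    return size == sizeof expected && memcmp(buf, expected, size) == 0;
}
```

The u8 form holds for every conforming compiler. For a plain "€" literal the bytes depend on the source and execution character sets the implementation is using, so the two agree only when both of those happen to be UTF-8.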

> and systems that still use EBCDIC.
>

Re: Unicode test suite

<58ba36ae-ef12-4636-a643-14fc2bfb26d0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26819&group=comp.lang.c#26819

X-Received: by 2002:a05:620a:1721:b0:769:89bf:c7b7 with SMTP id az33-20020a05620a172100b0076989bfc7b7mr18126qkb.9.1690669177860;
Sat, 29 Jul 2023 15:19:37 -0700 (PDT)
X-Received: by 2002:a05:6808:13c2:b0:3a1:eb8a:203d with SMTP id
d2-20020a05680813c200b003a1eb8a203dmr11536589oiw.11.1690669177656; Sat, 29
Jul 2023 15:19:37 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sat, 29 Jul 2023 15:19:37 -0700 (PDT)
In-Reply-To: <871qgqo06r.fsf@nosuchdomain.example.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:e8de:bdc2:dfd9:7264
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <58ba36ae-ef12-4636-a643-14fc2bfb26d0n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Sat, 29 Jul 2023 22:19:37 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3764

by: Malcolm McLean - Sat, 29 Jul 2023 22:19 UTC

On Saturday, 29 July 2023 at 22:52:26 UTC+1, Keith Thompson wrote:
> Bart <b...@freeuk.com> writes:
> [...]
> > In C11, "u8" is a valid user identifier, so that you can do this:
> >
> > typedef unsigned byte u8;
> >
> > But it can also be a prefix to a string literal. I believe it's called
> > a 'contextual keyword'.
> No, as Ben pointed out, it's just part of the syntax of a string
> literal. The L in L"abc" doesn't conflict with L as an identifier.
>
> [...]
> > Actually, I can't see the point of such a prefix in C. All you have to
> > do is to decree that string literals are UTF8 anyway.
> Bad idea. It would restrict C's portability to systems that support
> UTF-8, excluding both small embedded systems with limited character set
> support and systems that still use EBCDIC.
>
By coincidence, fortunate or unfortunate, computers were first developed
in the United States, and English lends itself to representation by a sequence
of octets. But English is the exception. Most languages have some features
that make things difficult, whether that is accents or special forms for letters
in combination or at initial or end positions in words. (English does have
capitals, but that's the only real complication in English orthography).

But you can't win. Allow UTF-8 and the system breaks on small machines
or where the programmers haven't handled the special rules of every single
language out there. Ban it and the system breaks when you try to sell it to
the French.

The other thing is that most programmers can read English text, but not
most other languages. They might have a native language or studied one
foreign language at school. But the vast majority of foreign text might as
well be Linear A for all the use it is to most programmers. By restricting
things to ASCII, you are discouraging the use of non-English in comments
and identifiers.

Re: Unicode test suite

<20230729145800.354@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26820&group=comp.lang.c#26820

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 22:21:36 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <20230729145800.354@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
Injection-Date: Sat, 29 Jul 2023 22:21:36 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e03ffdfa499b0f433ff65455cb09cf27";
logging-data="2871343"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QMy6tVkHtkqKqq0wo7K/pUCPPUm0JkNo="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:RC0Ps5RbWGqIpPivUZRkYmsIgo8=

by: Kaz Kylheku - Sat, 29 Jul 2023 22:21 UTC

On 2023-07-29, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
> Bart <bc@freeuk.com> writes:
> [...]
>> In C11, "u8" is a valid user identifier, so that you can do this:
>>
>> typedef unsigned char u8;
>>
>> But it can also be a prefix to a string literal. I believe it's called
>> a 'contextual keyword'.
>
> No, as Ben pointed out, it's just part of the syntax of a string
> literal. The L in L"abc" doesn't conflict with L as an identifier.
>
> [...]
>> Actually, I can't see the point of such a prefix in C. All you have to
>> do is to decree that string literals are UTF8 anyway.
>
> Bad idea. It would restrict C's portability to systems that support
> UTF-8, excluding both small embedded systems with limited character set
> support and systems that still use EBCDIC.

The downside if we define a string literal as UTF-8 is this: there are
invalid UTF-8 sequences, so we have lost the ability to dump arbitrary
bytes between the quotes. (Which we don't exactly have, since we can't
dump a double quote, backslash or newline, but I mean those bytes that
don't require escaping.)

If a string literal is defined as UTF-8, it means the compiler can
terminate with a diagnostic if there are invalid bytes, like an overlong
form, or a continuation byte without a start byte or whatever.

This is a problem for programs that have, say, ISO-Latin-1 string
literals; they have to be converted. Converting the code to UTF-8
may not be an option if the literal data is deliberately ISO-Latin-1,
because that's what the target system understands (e.g. can display
on its display).

If a string literal is restricted to UTF-8, then even numeric hex escape
sequences have to constitute valid UTF-8. (Otherwise what is the point
of the definition?)

That's going to cause a problem in any situation (small embedded
system or not) in which string literals are used for defining
special byte sequences for serial communication protocols and whatnot.
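
For illustration only (this is not from the post): a minimal C sketch of the kind of check that would reject the cases mentioned above, such as an overlong form or a continuation byte without a start byte. The function name utf8_is_valid is made up for this example, and the rules applied are the well-formedness rules of RFC 3629 as I understand them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n bytes at s are well-formed UTF-8:
   no stray continuation bytes, no truncated or overlong forms,
   no UTF-16 surrogates, nothing above U+10FFFF. */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;         /* total bytes in this sequence */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80)                { i++; continue; }   /* ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;  /* continuation byte without a start byte, or 0xF8..0xFF */

        if (n - i < len) return false;                    /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                       /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return false;                  /* beyond the Unicode range */
        i += len;
    }
    return true;
}
```

A serial-protocol byte string such as "\x8F\xFF" fails this check, which is exactly the situation described above: under a strict "literals are UTF-8" rule such data could no longer be written as a string literal.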

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<AZgxM.198095$U3w1.28759@fx09.iad>

https://www.novabbs.com/devel/article-flat.php?id=26821&group=comp.lang.c#26821

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Unicode test suite
Content-Language: en-US
Newsgroups: comp.lang.c
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <871qgqo06r.fsf@nosuchdomain.example.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 36
Message-ID: <AZgxM.198095$U3w1.28759@fx09.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sat, 29 Jul 2023 19:08:15 -0400
X-Received-Bytes: 2957

by: Richard Damon - Sat, 29 Jul 2023 23:08 UTC

On 7/29/23 5:52 PM, Keith Thompson wrote:
> Bart <bc@freeuk.com> writes:
> [...]
>> In C11, "u8" is a valid user identifier, so that you can do this:
>>
>> typedef unsigned char u8;
>>
>> But it can also be a prefix to a string literal. I believe it's called
>> a 'contextual keyword'.
>
> No, as Ben pointed out, it's just part of the syntax of a string
> literal. The L in L"abc" doesn't conflict with L as an identifier.
>
> [...]
>> Actually, I can't see the point of such a prefix in C. All you have to
>> do is to decree that string literals are UTF8 anyway.
>
> Bad idea. It would restrict C's portability to systems that support
> UTF-8, excluding both small embedded systems with limited character set
> support and systems that still use EBCDIC.
>

Actually, it would cause problems on today's computers, with files that
were written a while ago using some "code page" as their encoding (like
some version of Latin-1).

As I understand it, a u8"string" will convert the string from its source
character set (which might be set to Latin-1) into UTF-8, while a plain
"string" will convert it to what is defined as the "execution" character
set, which might be defined as the same as the input set of Latin-1.

Thus u8 strings MUST use UTF-8 encoding, while regular strings use
whatever encoding the implementation defines as its narrow character set.

There are reasons to allow the narrow character set to be something
other than a UTF-8 encoding.
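
A small C sketch of that difference, added for illustration. The function name check_u8_literal is hypothetical. The character é is written as the universal character name \u00E9 so that the result does not depend on how the source file itself is encoded.

```c
#include <assert.h>

/* Checks the bytes produced by a u8 string literal. */
void check_u8_literal(void)
{
    /* The cast keeps this valid whether the element type of a u8
       literal is char (C11) or char8_t (C23). */
    const unsigned char *s = (const unsigned char *)u8"\u00E9";

    /* U+00E9 encoded as UTF-8 is 0xC3 0xA9, plus the terminating 0. */
    assert(sizeof u8"\u00E9" == 3);
    assert(s[0] == 0xC3 && s[1] == 0xA9 && s[2] == 0);

    /* A plain "\u00E9" is encoded in the execution character set instead:
       one byte (0xE9) if that set is Latin-1, two bytes if it is UTF-8.
       Its size is implementation-specific, so it is not asserted here. */
}
```

With gcc, the execution character set for plain literals can be chosen with -fexec-charset; the u8 literal is unaffected by that choice.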

Re: Unicode test suite

<20230729172526.387@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26822&group=comp.lang.c#26822

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 00:28:04 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <20230729172526.387@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 30 Jul 2023 00:28:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e03ffdfa499b0f433ff65455cb09cf27";
logging-data="2885696"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/lVuQZ7NNvqdWZ6htVPxgcjV35gY9Wi/g="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:4EmntdWTL2UsdfyJptVMuhPRCgY=

by: Kaz Kylheku - Sun, 30 Jul 2023 00:28 UTC

On 2023-07-29, Bart <bc@freeuk.com> wrote:
> On 29/07/2023 22:52, Keith Thompson wrote:
>> Bad idea. It would restrict C's portability to systems that support
>> UTF-8, excluding both small embedded systems with limited character set
>> support
>
> In what way would it restrict those small systems?
>
> Support for UTF8 means merely that characters in a string can have
> values outside of 0 to 127, which they could have anyway whether UTF8 is
> involved or not.

If a spec says that the object specified by a string literal has to be
UTF-8, it means there cannot be non-UTF-8 sequences.

(If any byte string were allowed, what would be the point of mentioning
UTF-8?)

> How exactly is a u8"€" string different from "€" anyway?

That one probably isn't, other than the silly char8_t thing,
but I imagine u8"\x8F\xff" can be diagnosed.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<87wmyimcwr.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26823&group=comp.lang.c#26823

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 18:00:20 -0700
Organization: None to speak of
Lines: 61
Message-ID: <87wmyimcwr.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
<87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: dont-email.me; posting-host="0f6fc3f0df3802232e8b4af3a2646543";
logging-data="2890165"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/UAMENQf6rg2pHzBd3xhsu"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:xZh4udyreyKeqVNtydBQW8kmifQ=
sha1:/7iQ1sF20KYU8oJrgCg1TuRVcq8=

by: Keith Thompson - Sun, 30 Jul 2023 01:00 UTC

Bart <bc@freeuk.com> writes:
> On 29/07/2023 22:52, Keith Thompson wrote:
>> Bart <bc@freeuk.com> writes:
>> [...]
>>> In C11, "u8" is a valid user identifier, so that you can do this:
>>>
>>> typedef unsigned char u8;
>>>
>>> But it can also be a prefix to a string literal. I believe it's called
>>> a 'contextual keyword'.
>> No, as Ben pointed out, it's just part of the syntax of a string
>> literal. The L in L"abc" doesn't conflict with L as an identifier.
>> [...]
>>> Actually, I can't see the point of such a prefix in C. All you have to
>>> do is to decree that string literals are UTF8 anyway.
>> Bad idea. It would restrict C's portability to systems that support
>> UTF-8, excluding both small embedded systems with limited character set
>> support
>
> In what way would it restrict those small systems?

I'm not sure. See below.

> Support for UTF8 means merely that characters in a string can have
> values outside of 0 to 127, which they could have anyway whether UTF8
> is involved or not.
>
> How exactly is a u8"€" string different from "€" anyway?

I haven't actually used UTF-8 string literals. As of C11/N1570, the
standard is a bit vague about the difference.

N1570 6.4.5p3:

A *character string literal* is a sequence of zero or more multibyte
characters enclosed in double-quotes, as in "xyz". A *UTF−8 string
literal* is the same, except prefixed by u8.

p6:

For character string literals, the array elements have type char,
and are initialized with the individual bytes of the multibyte
character sequence. For UTF−8 string literals, the array elements
have type char, and are initialized with the characters of the
multibyte character sequence, as encoded in UTF−8.

I would guess that something like u8"\xff", which specifies an invalid
UTF-8 sequence, would be invalid, but it doesn't seem to violate any
constraint or syntax rule, and neither gcc nor clang complains about it.

I guess that for a compiler that uses EBCDIC for source code, "x" would
be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
equivalent to "\x78".

Either the standard is insufficiently clear, or I'm missing something.
I definitely wouldn't bet against the latter.
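
For illustration, the two guesses above can be written down as a small C check. This is a sketch, not a statement of what the standard requires: the \xff part records the behaviour reported above for gcc and clang, and the function name check_u8_escapes is made up.

```c
#include <assert.h>

/* Checks how a u8 literal treats an ordinary character and a hex escape. */
void check_u8_escapes(void)
{
    /* Casts keep this valid for both char (C11) and char8_t (C23) elements. */
    const unsigned char *x  = (const unsigned char *)u8"x";
    const unsigned char *ff = (const unsigned char *)u8"\xff";

    /* 'x' is U+0078, so its UTF-8 encoding is the single byte 0x78,
       whatever the source or execution character set is. */
    assert(x[0] == 0x78 && x[1] == 0);

    /* As observed with gcc and clang: the hex escape is stored as one
       raw byte, even though 0xFF never occurs in well-formed UTF-8. */
    assert(sizeof u8"\xff" == 2);
    assert(ff[0] == 0xFF && ff[1] == 0);
}
```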

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<20230729194002.561@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26824&group=comp.lang.c#26824

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 02:46:27 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <20230729194002.561@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
Injection-Date: Sun, 30 Jul 2023 02:46:27 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="e03ffdfa499b0f433ff65455cb09cf27";
logging-data="3024340"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+yIZyZA0QeKY/2axMYmXquN+ZOJr2w7ww="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:ZREfVA/zeijv++hq8W3tpV1FFLg=

by: Kaz Kylheku - Sun, 30 Jul 2023 02:46 UTC

On 2023-07-30, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
> I would guess that something like u8"\xff", which specifies an invalid
> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
> constraint or syntax rule, and neither gcc nor clang complains about it.

The requirement "encoded in UTF-8" could be regarded as a syntax rule.

> I guess that for a compiler that uses EBCDIC for source code, "x" would
> be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
> equivalent to "\x78".

If the u8 literal maps translation characters like 'x' to UTF-8,
that is actually very useful to the EBCDIC people; they can directly
encode strings needed for communicating with ASCII systems.

E.g.

char *dial_prefix = u8"ATDT";

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<W7lxM.161741$qnnb.72702@fx11.iad>

https://www.novabbs.com/devel/article-flat.php?id=26825&group=comp.lang.c#26825

Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx11.iad.POSTED!not-for-mail
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Unicode test suite
Content-Language: en-US
Newsgroups: comp.lang.c
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
From: Rich...@Damon-Family.org (Richard Damon)
In-Reply-To: <87wmyimcwr.fsf@nosuchdomain.example.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 65
Message-ID: <W7lxM.161741$qnnb.72702@fx11.iad>
X-Complaints-To: abuse@easynews.com
Organization: Forte - www.forteinc.com
X-Complaints-Info: Please be sure to forward a copy of ALL headers otherwise we will be unable to process your complaint properly.
Date: Sat, 29 Jul 2023 23:52:21 -0400
X-Received-Bytes: 4269

by: Richard Damon - Sun, 30 Jul 2023 03:52 UTC

On 7/29/23 9:00 PM, Keith Thompson wrote:
> Bart <bc@freeuk.com> writes:
>> On 29/07/2023 22:52, Keith Thompson wrote:
>>> Bart <bc@freeuk.com> writes:
>>> [...]
>>>> In C11, "u8" is a valid user identifier, so that you can do this:
>>>>
>>>> typedef unsigned char u8;
>>>>
>>>> But it can also be a prefix to a string literal. I believe it's called
>>>> a 'contextual keyword'.
>>> No, as Ben pointed out, it's just part of the syntax of a string
>>> literal. The L in L"abc" doesn't conflict with L as an identifier.
>>> [...]
>>>> Actually, I can't see the point of such a prefix in C. All you have to
>>>> do is to decree that string literals are UTF8 anyway.
>>> Bad idea. It would restrict C's portability to systems that support
>>> UTF-8, excluding both small embedded systems with limited character set
>>> support
>>
>> In what way would it restrict those small systems?
>
> I'm not sure. See below.
>
>> Support for UTF8 means merely that characters in a string can have
>> values outside of 0 to 127, which they could have anyway whether UTF8
>> is involved or not.
>>
>> How exactly is a u8"€" string different from "€" anyway?
>
> I haven't actually used UTF-8 string literals. As of C11/N1570, the
> standard is a bit vague about the difference.
>
> N1570 6.4.5p3:
>
> A *character string literal* is a sequence of zero or more multibyte
> characters enclosed in double-quotes, as in "xyz". A *UTF−8 string
> literal* is the same, except prefixed by u8.
>
> p6:
>
> For character string literals, the array elements have type char,
> and are initialized with the individual bytes of the multibyte
> character sequence. For UTF−8 string literals, the array elements
> have type char, and are initialized with the characters of the
> multibyte character sequence, as encoded in UTF−8.
>
> I would guess that something like u8"\xff", which specifies an invalid
> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
> constraint or syntax rule, and neither gcc nor clang complains about it.
>
> I guess that for a compiler that uses EBCDIC for source code, "x" would
> be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
> equivalent to "\x78".
>
> Either the standard is insufficiently clear, or I'm missing something.
> I definitely wouldn't bet against the latter.
>

I think the key is that characters specified by numeric escapes are
specified in the "execution character encoding", which is
implementation-defined and doesn't need to be Unicode.

Thus u8"\0xFF" might generate the byte sequence 0xC3 0xBF 0x00 to
represent the Unicode character U+00FF.
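
For reference, the UTF-8 bytes for a given code point can be worked out mechanically. The following is a minimal C sketch of a standard UTF-8 encoder (the name utf8_encode is made up for this example); for U+00FF it yields 0xC3 0xBF. It says nothing about whether a compiler actually treats a hex escape inside a u8 literal as a code point.

```c
#include <stddef.h>

/* Encodes code point cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a UTF-16
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                            /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                           /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                         /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                           /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                       /* 11110xxx + three tail bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                   /* outside the Unicode range */
}
```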

Re: Unicode test suite

<ua5gf0$2u9ha$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26826&group=comp.lang.c#26826

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sun, 30 Jul 2023 12:10:26 +0100
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <ua5gf0$2u9ha$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
<87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>
<ua34qf$2kl06$1@dont-email.me> <871qgqo06r.fsf@nosuchdomain.example.com>
<ua42v3$2nihd$1@dont-email.me> <87wmyimcwr.fsf@nosuchdomain.example.com>
<W7lxM.161741$qnnb.72702@fx11.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 30 Jul 2023 11:10:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ceb74387feb450401c92448496a7cc87";
logging-data="3089962"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19o4omDa1xs3GhxhkAWuNFWq8zlpJ8kKhM="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:8XLheFrmyv2OBBEqgkXH9QDMLI0=
In-Reply-To: <W7lxM.161741$qnnb.72702@fx11.iad>

by: Bart - Sun, 30 Jul 2023 11:10 UTC

On 30/07/2023 04:52, Richard Damon wrote:
> On 7/29/23 9:00 PM, Keith Thompson wrote:

>> I haven't actually used UTF-8 string literals. As of C11/N1570, the
>> standard is a bit vague about the difference.
>>
>> N1570 6.4.5p3:
>>
>> A *character string literal* is a sequence of zero or more multibyte
>> characters enclosed in double-quotes, as in "xyz". A *UTF−8 string
>> literal* is the same, except prefixed by u8.
>>
>> p6:
>>
>> For character string literals, the array elements have type char,
>> and are initialized with the individual bytes of the multibyte
>> character sequence. For UTF−8 string literals, the array elements
>> have type char, and are initialized with the characters of the
>> multibyte character sequence, as encoded in UTF−8.
>>
>> I would guess that something like u8"\xff", which specifies an invalid
>> UTF-8 sequence, would be invalid, but it doesn't seem to violate any
>> constraint or syntax rule, and neither gcc nor clang complains about it.
>>
>> I guess that for a compiler that uses EBCDIC for source code, "x" would
>> be equivalent to "\xa7" (the EBCDIC code for 'x') and u8"x" would be
>> equivalent to "\x78".
>>
>> Either the standard is insufficiently clear, or I'm missing something.
>> I definitely wouldn't bet against the latter.
>>
>
> I think the key is that characters specified by numeric escapes are
> specified in the "execution character encoding", which is
> implementation-defined and doesn't need to be Unicode.
>
> Thus u8"\0xFF" might generate the byte sequence 0xC3 0xBF 0x00 to
> represent the Unicode character U+00FF.

(Do you mean \0xFF or \xFF ?)

I thought these were limited to 2 hex digits, but apparently you can
have as many as you like. However that looks to be ambiguous:

"\x20AC"

Is that a 4-digit hex code, or is it a 2-digit one followed by normal
characters A and C?
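
As far as I know, the grammar settles this: a hexadecimal escape takes every hex digit that follows it, so "\x20AC" is a single escape, and in a narrow string its value is out of range for one char. The usual way to get a space followed by A and C is to split the literal, because escapes are converted before adjacent literals are concatenated. A minimal C sketch (the function name check_hex_escape_split is made up, and the wide-literal line assumes wchar_t is at least 16 bits):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shows how to stop a hex escape from swallowing the digits after it. */
void check_hex_escape_split(void)
{
    /* Three characters: 0x20, 'A', 'C'. The split ends the escape. */
    const char *s = "\x20" "AC";

    assert(strlen(s) == 3);
    assert((unsigned char)s[0] == 0x20);
    assert(s[1] == 'A' && s[2] == 'C');

    /* In a wide literal the same four digits fit in one element,
       so this is one wide character plus the terminator. */
    assert(sizeof L"\x20AC" / sizeof(wchar_t) == 2);
}
```

To name the character U+20AC itself rather than a raw value, a universal character name such as "\u20AC" avoids the question altogether.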

Re: Unicode test suite

<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26827&group=comp.lang.c#26827

X-Received: by 2002:a05:622a:148d:b0:40d:b839:b5bb with SMTP id t13-20020a05622a148d00b0040db839b5bbmr11239qtx.2.1690716658272;
Sun, 30 Jul 2023 04:30:58 -0700 (PDT)
X-Received: by 2002:a05:6808:13c2:b0:3a4:24bc:125f with SMTP id
d2-20020a05680813c200b003a424bc125fmr13853123oiw.1.1690716657977; Sun, 30 Jul
2023 04:30:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 04:30:57 -0700 (PDT)
In-Reply-To: <20230729145800.354@kylheku.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.49; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.49
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 11:30:58 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4304

by: fir - Sun, 30 Jul 2023 11:30 UTC

I looked at how UTF-8 works, and here are some thoughts.

I) The decision to encode 0-127 the classic way seems OK.
II) The decision to use abstract 32-bit numbers as the character numbers seems OK.

III) There is an additional decision to encode the 128+ characters so that every byte of the
encoding is itself 128+ (for some kind of compatibility, I guess). I'm not sure about this one.
Using the bytes the classical way would save more space, something like:
128 XX - an additional 256 (or 255 if 0 is excluded)
129 XX - another additional 256 (or 255 if 0 is excluded)
200 XX XX - an additional 256*256 (-256-1)
201 XX XX XX - additional

However, maybe this 128+ rule is good too.

IV) However, if so, there is the question of how to do it.

They chose to encode every 'tail' byte as "10" plus 6 bits (that is, 128-191)
and the 'head' byte as 192+. This is probably right. They could, though, a) encode more than 64
values in the tail bytes, or b) number the tail bytes so that you can see which part of the tail
each one is, but I'm not sure the latter is better (numbering the tail bytes would eat further
into the capacity).

As to the first, I'm not sure whether increasing the tail capacity to more than 64 wouldn't be
better. The maximum possible is a capacity of up to 127. For 4 bytes that would give 28 bits
against 24 bits (25 bits seems acceptable). The problem is that it couldn't be all 128 values,
as some are needed to mark the head and the head type. There is also the question of whether
the head should have any capacity at all: one could assume that 128-191 is for tails and
192-255 for heads, but that the head shouldn't carry normal capacity.

Overall it all seems OK, except maybe the choice that the heads take up this capacity. There are
5 heads and each one has a capacity of from 5 bits down to 1 bit. It may be useful with the
2-byte encoding, which gives 5+6 = 11 bits for 2 bytes; 11 bits is 2k values, and I wonder
whether they put good things there.

For 4 bytes it's 3+18 = 21 bits. The maximum capacity, for 6 bytes, is 31 bits.
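
The arithmetic above follows one formula, sketched here in C for illustration (the function name utf8_payload_bits is made up). It describes the original layout with sequences of up to 6 bytes; current UTF-8 (RFC 3629) stops at 4 bytes.

```c
/* Payload bits carried by an n-byte sequence (2 <= n <= 6) in the
   original UTF-8 layout: the head byte holds 7 - n bits and each of
   the n - 1 tail bytes holds 6 bits. Returns 0 for any other n. */
unsigned utf8_payload_bits(unsigned n)
{
    if (n < 2 || n > 6)
        return 0;
    return (7 - n) + 6 * (n - 1);
}
```

This gives 11, 16, 21, 26 and 31 bits for 2 to 6 bytes.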

Well, overall it seems acceptable. I guess it is in fact probably surprisingly easy to write C
code that steps forward through such a Unicode string and yields the wide character values,
using masks, shifts and ifs, and that is robust against jumping into the middle of a wide character.
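
A minimal sketch of such a decoder in C, added for illustration. The name utf8_next and its interface are made up. It handles sequences of up to 4 bytes, skips leading tail bytes if it is started in the middle of a character, and for brevity does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Decodes one code point from the n bytes at s using masks, shifts and ifs.
   If s points into the middle of a sequence, the leading tail bytes
   (10xxxxxx) are skipped first. Stores the code point in *cp and returns
   the number of bytes consumed, or 0 at end of input or on a malformed
   or truncated sequence (*cp is then unspecified). */
size_t utf8_next(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t i = 0, len;

    while (i < n && (s[i] & 0xC0) == 0x80)   /* resynchronise on a head byte */
        i++;
    if (i == n)
        return 0;

    unsigned char b = s[i];
    if (b < 0x80)                { *cp = b;        len = 1; }
    else if ((b & 0xE0) == 0xC0) { *cp = b & 0x1F; len = 2; }
    else if ((b & 0xF0) == 0xE0) { *cp = b & 0x0F; len = 3; }
    else if ((b & 0xF8) == 0xF0) { *cp = b & 0x07; len = 4; }
    else return 0;                           /* 0xF8..0xFF is not a valid head */

    if (n - i < len)
        return 0;                            /* truncated sequence */
    for (size_t k = 1; k < len; k++) {
        if ((s[i + k] & 0xC0) != 0x80)
            return 0;                        /* expected a tail byte */
        *cp = (*cp << 6) | (s[i + k] & 0x3F);
    }
    return i + len;
}
```

Calling it repeatedly and advancing by the return value walks the whole string; started one byte into the three-byte sequence for U+20AC, it skips the two remaining tail bytes and returns the next character instead.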

Overall, the biggest problem was probably the confusion created by providing all of UTF-8, UTF-16 and UTF-32,
but if in practice mainly one is used on a given system, it's OK. Still, IMO it's probably best to stick to UTF-8, where ASCII is used as well.

Re: Unicode test suite

<1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26828&group=comp.lang.c#26828

X-Received: by 2002:a05:620a:2b4a:b0:767:f1fc:5297 with SMTP id dp10-20020a05620a2b4a00b00767f1fc5297mr22141qkb.15.1690718397718;
Sun, 30 Jul 2023 04:59:57 -0700 (PDT)
X-Received: by 2002:a05:6870:3a03:b0:1bb:6acc:7f31 with SMTP id
du3-20020a0568703a0300b001bb6acc7f31mr8765132oab.10.1690718397382; Sun, 30
Jul 2023 04:59:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 04:59:56 -0700 (PDT)
In-Reply-To: <ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.49; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.49
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com> <ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 11:59:57 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3307

by: fir - Sun, 30 Jul 2023 11:59 UTC

On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> I looked at how UTF-8 works, and here are some thoughts.
>
> I) The decision to encode 0-127 the classic way seems OK.
> II) The decision to use abstract 32-bit numbers as the character numbers seems OK.
>
> III) There is an additional decision to encode the 128+ characters so that every byte of the
> encoding is itself 128+ (for some kind of compatibility, I guess). I'm not sure about this one.
> Using the bytes the classical way would save more space, something like:
> 128 XX - an additional 256 (or 255 if 0 is excluded)
> 129 XX - another additional 256 (or 255 if 0 is excluded)
> 200 XX XX - an additional 256*256 (-256-1)
> 201 XX XX XX - additional
>

The above in fact can't work if one needs to tell tails from heads, but IMO the alternative
to the Unicode scheme could be:
0-127 ASCII
0-247 tail (where this 247 is to be chosen)
248-255 heads (say eight, though it could also be 4)

It would then look like:
250 XX XX
251 XX XX XX
252 XX XX XX XX (where the capacity of XX is almost full, except that head values can't be used)

I think it is the most competitive pattern to the Unicode one. The advantage is that it has more
capacity and needs no shifts; one only needs to be aware that XXXXXX doesn't contain heads,
so it's not strictly 0-0xffffff but excludes some 0xfxfxfx values, which is not much of a deal IMO.

Re: Unicode test suite

<4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26829&group=comp.lang.c#26829

X-Received: by 2002:ad4:590c:0:b0:635:e6f8:b28d with SMTP id ez12-20020ad4590c000000b00635e6f8b28dmr26786qvb.12.1690718965308;
Sun, 30 Jul 2023 05:09:25 -0700 (PDT)
X-Received: by 2002:a4a:4f04:0:b0:56c:a8c1:bff8 with SMTP id
c4-20020a4a4f04000000b0056ca8c1bff8mr1271441oob.1.1690718965011; Sun, 30 Jul
2023 05:09:25 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 05:09:24 -0700 (PDT)
In-Reply-To: <1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.49; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.49
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com> <1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 12:09:25 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4003

by: fir - Sun, 30 Jul 2023 12:09 UTC

On Sunday, 30 July 2023 at 14:00:06 UTC+2, fir wrote:
> On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> > I looked at how UTF-8 works, and have some thoughts.
> >
> > I) In fact the decision to encode 0-127 the classic way seems to be OK.
> > II) The decision to use abstract 32-bit numbers as the numbers of the signs seems to be OK.
> >
> > III) There is an additional decision to encode the 128+ characters in such a way that each byte is then
> > 128+ (for some kind of compatibility, I guess). Of this I'm not sure; it would save more space
> > to use the classical way, something like:
> > 128 XX - additional 256 (or 255 if excluding 0)
> > 129 XX - another additional 256 (or 255 if excluding 0)
> > 200 XX XX - additional 256*256 (-256-1)
> > 201 XX XX XX - additional
> >
> The above in fact can't stand if one needs to tell tail bytes from head bytes, but IMO the alternative
> to Unicode could be:
> 0-127 ASCII
> 0-247 tail (where this 247 is to be chosen)
> 248-255 heads (say eight, though it could also be 4)
>
> It would then look like:
> 250 XX XX
> 251 XX XX XX
> 252 XX XX XX XX (where the capacity of XX is almost full, except that head values can't be used)
>
> I think this is the most competitive alternative pattern to Unicode. The advantage is that it has more capacity
> and needs no shifts; one only needs to be aware that XXXXXX doesn't contain heads,
> so the range is not strictly 0-0xffffff but excludes some 0xfxfxfx values, which is not much of a deal IMO.

For simplicity, I even think 16 heads could be used, writing
0xf2xxxx (where the f marks a head, the 2 means 2 bytes of tail, and each xx is from 0x00 to 0xef).
So this alternative Unicode would look like f2 xx xx f3 xx xx xx xx xx xx f2 xx xx xx xx f3 xx xx xx xx xx.
f1 should rather not be used (f0 and f1 reserved), and if someone jumps into the middle of a string they need to walk back n bytes searching for an fx code...
Maybe name this style f-code. I'm not sure this isn't better than Unicode.
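
A minimal C sketch of how such an f-code could be encoded and decoded. This is one reading of the description above, not a settled format: values below 128 are stored as a single ASCII byte, everything else as a head byte 0xf0+n followed by the value's n raw bytes (n = 2, 3 or 4, most significant first), and a value is simply refused if any of its bytes falls in 0xf0-0xff. The names `fcode_encode`, `fcode_decode` and `fcode_is_tail` are hypothetical. The shifts below only split a value into bytes; no bit-packing is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* A tail byte may be anything from 0x00 to 0xef. */
static int fcode_is_tail(unsigned char b) { return b < 0xf0; }

/* Encode v into out (room for 5 bytes). Returns the number of bytes
   written, or 0 if v contains a byte in 0xf0..0xff and cannot be stored. */
static size_t fcode_encode(uint32_t v, unsigned char *out)
{
    unsigned char raw[4];
    size_t n, i;

    if (v < 0x80) { out[0] = (unsigned char)v; return 1; }

    raw[0] = (unsigned char)(v >> 24);
    raw[1] = (unsigned char)(v >> 16);
    raw[2] = (unsigned char)(v >> 8);
    raw[3] = (unsigned char)v;

    /* f0 and f1 are reserved, so the shortest tail is 2 bytes */
    n = v > 0xffffffu ? 4 : v > 0xffffu ? 3 : 2;
    for (i = 4 - n; i < 4; i++)
        if (!fcode_is_tail(raw[i])) return 0;

    out[0] = (unsigned char)(0xf0 + n);
    for (i = 0; i < n; i++)
        out[1 + i] = raw[4 - n + i];
    return n + 1;
}

/* Decode one unit from s (len bytes available) into *v.
   Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t fcode_decode(const unsigned char *s, size_t len, uint32_t *v)
{
    size_t n, i;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *v = s[0]; return 1; }
    if (s[0] < 0xf2 || s[0] > 0xf4) return 0;   /* not a head used here */

    n = (size_t)(s[0] - 0xf0);
    if (len < n + 1) return 0;                  /* truncated tail */

    *v = 0;
    for (i = 1; i <= n; i++) {
        if (!fcode_is_tail(s[i])) return 0;
        *v = (*v << 8) | s[i];
    }
    return n + 1;
}
```

With this reading, 0x105 is stored as f2 01 05, while a value such as 0xf5 has no encoding at all, which is the gap in the number range mentioned earlier.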

Re: Unicode test suite

<69b9835b-d547-4909-839a-d4c20a1a4028n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26830&group=comp.lang.c#26830

X-Received: by 2002:a05:622a:20c:b0:404:c707:88e8 with SMTP id b12-20020a05622a020c00b00404c70788e8mr23099qtx.8.1690719168198;
Sun, 30 Jul 2023 05:12:48 -0700 (PDT)
X-Received: by 2002:a05:6870:a8aa:b0:1bb:8ad0:1fa7 with SMTP id
eb42-20020a056870a8aa00b001bb8ad01fa7mr8409491oab.7.1690719167848; Sun, 30
Jul 2023 05:12:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 05:12:47 -0700 (PDT)
In-Reply-To: <4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.49; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.49
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com> <1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>
<4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <69b9835b-d547-4909-839a-d4c20a1a4028n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 12:12:48 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4503

by: fir - Sun, 30 Jul 2023 12:12 UTC

On Sunday, 30 July 2023 at 14:09:33 UTC+2, fir wrote:
> On Sunday, 30 July 2023 at 14:00:06 UTC+2, fir wrote:
> > On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> > > I looked at how UTF-8 works, and have some thoughts.
> > >
> > > I) In fact the decision to encode 0-127 the classic way seems to be OK.
> > > II) The decision to use abstract 32-bit numbers as the numbers of the signs seems to be OK.
> > >
> > > III) There is an additional decision to encode the 128+ characters in such a way that each byte is then
> > > 128+ (for some kind of compatibility, I guess). Of this I'm not sure; it would save more space
> > > to use the classical way, something like:
> > > 128 XX - additional 256 (or 255 if excluding 0)
> > > 129 XX - another additional 256 (or 255 if excluding 0)
> > > 200 XX XX - additional 256*256 (-256-1)
> > > 201 XX XX XX - additional
> > >
> > The above in fact can't stand if one needs to tell tail bytes from head bytes, but IMO the alternative
> > to Unicode could be:
> > 0-127 ASCII
> > 0-247 tail (where this 247 is to be chosen)
> > 248-255 heads (say eight, though it could also be 4)
> >
> > It would then look like:
> > 250 XX XX
> > 251 XX XX XX
> > 252 XX XX XX XX (where the capacity of XX is almost full, except that head values can't be used)
> >
> > I think this is the most competitive alternative pattern to Unicode. The advantage is that it has more capacity
> > and needs no shifts; one only needs to be aware that XXXXXX doesn't contain heads,
> > so the range is not strictly 0-0xffffff but excludes some 0xfxfxfx values, which is not much of a deal IMO.
> For simplicity, I even think 16 heads could be used, writing
> 0xf2xxxx (where the f marks a head, the 2 means 2 bytes of tail, and each xx is from 0x00 to 0xef).
> So this alternative Unicode would look like f2 xx xx f3 xx xx xx xx xx xx f2 xx xx xx xx f3 xx xx xx xx xx.
> f1 should rather not be used (f0 and f1 reserved), and if someone jumps into the middle of a string they need to walk back n bytes searching for an fx code...
> Maybe name this style f-code. I'm not sure this isn't better than Unicode.

But UTF-8 is quite tolerable, I guess. Maybe it is even compatible with bytes other than 8-bit ones,
which this f-code is not.
Besides, I think that nowadays there is probably a need for an architecture that makes bytes 32 bits rather than 8 bits, and then wide chars would be straightforward.

Re: Unicode test suite

<c631b699-db9f-49c8-8e2c-10f90114f735n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26831&group=comp.lang.c#26831

X-Received: by 2002:a05:6214:5652:b0:63c:f38d:e0d5 with SMTP id mh18-20020a056214565200b0063cf38de0d5mr23069qvb.0.1690719414648;
Sun, 30 Jul 2023 05:16:54 -0700 (PDT)
X-Received: by 2002:a05:6870:b79c:b0:1bb:4da2:9edc with SMTP id
ed28-20020a056870b79c00b001bb4da29edcmr9599961oab.1.1690719414309; Sun, 30
Jul 2023 05:16:54 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 05:16:53 -0700 (PDT)
In-Reply-To: <4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.49; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.49
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com> <1c512476-49d8-4613-ab89-d74edad922d0n@googlegroups.com>
<4c3b7ce2-71e5-4543-bae4-14e1a389cadan@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c631b699-db9f-49c8-8e2c-10f90114f735n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 12:16:54 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3152

by: fir - Sun, 30 Jul 2023 12:16 UTC

On Sunday, 30 July 2023 at 14:09:33 UTC+2, fir wrote:
> > 252 XX XX XX XX (where the capacity of XX is almost full, except that head values can't be used)
> >
> > I think this is the most competitive alternative pattern to Unicode. The advantage is that it has more capacity
> > and needs no shifts; one only needs to be aware that XXXXXX doesn't contain heads,
> > so the range is not strictly 0-0xffffff but excludes some 0xfxfxfx values, which is not much of a deal IMO.
> For simplicity, I even think 16 heads could be used, writing
> 0xf2xxxx (where the f marks a head, the 2 means 2 bytes of tail, and each xx is from 0x00 to 0xef).
> So this alternative Unicode would look like f2 xx xx f3 xx xx xx xx xx xx f2 xx xx xx xx f3 xx xx xx xx xx.

Here above, of course, the standalone xx bytes in between are ASCII:
(f2 xx xx) (f3 xx xx xx) (xx) (xx) (xx) (f2 xx xx) (xx) (xx) (f3 xx xx xx) (xx) (xx)
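
The grouping above can be sketched in C as a small scanner, together with the backward search needed when landing in the middle of a string. This assumes a head byte 0xf0+n announces exactly n tail bytes (n = 2..15, with f0 and f1 reserved) and that the string is well formed; the function names are invented for this sketch.

```c
#include <stddef.h>

/* Length of the unit that starts with byte b: 1 for ASCII, 1 + n for a
   head 0xf0+n (n = 2..15), 0 if b cannot start a unit. */
static size_t fcode_unit_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xf2) return 1 + (size_t)(b - 0xf0);
    return 0;
}

/* Split s[0..len) into units and store each unit's length in lens[]
   (capacity max). Returns the unit count, or (size_t)-1 if the input
   is malformed, truncated, or has more than max units. */
static size_t fcode_split(const unsigned char *s, size_t len,
                          size_t *lens, size_t max)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        size_t u = fcode_unit_len(s[pos]);
        if (u == 0 || u > len - pos || count == max)
            return (size_t)-1;
        lens[count++] = u;
        pos += u;
    }
    return count;
}

/* Index of the first byte of the unit containing s[pos]. Walks back at
   most 15 bytes to the nearest head and checks whether its tail reaches
   pos; if not, s[pos] is a standalone ASCII byte. */
static size_t fcode_unit_start(const unsigned char *s, size_t pos)
{
    size_t back;
    if (s[pos] >= 0xf2) return pos;             /* already on a head */
    for (back = 1; back <= 15 && back <= pos; back++) {
        unsigned char b = s[pos - back];
        if (b >= 0xf2)
            return (size_t)(b - 0xf0) >= back ? pos - back : pos;
    }
    return pos;
}
```

Because tail bytes share the range 0x00-0x7f with ASCII, a single byte cannot be classified on its own, which is why `fcode_unit_start` has to look backward. UTF-8 avoids this by giving continuation bytes their own range.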

> f1 should rather not be used (f0 and f1 reserved), and if someone jumps into the middle of a string they need to walk back n bytes searching for an fx code...
> Maybe name this style f-code. I'm not sure this isn't better than Unicode.

Re: Unicode test suite

<6c86373c-45bc-4505-98bf-d120f7d65728n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26832&group=comp.lang.c#26832

X-Received: by 2002:a05:622a:1a84:b0:403:f7c6:578c with SMTP id s4-20020a05622a1a8400b00403f7c6578cmr24812qtc.10.1690724522210;
Sun, 30 Jul 2023 06:42:02 -0700 (PDT)
X-Received: by 2002:a05:6870:a89b:b0:1bb:6519:d254 with SMTP id
eb27-20020a056870a89b00b001bb6519d254mr9043048oab.3.1690724521430; Sun, 30
Jul 2023 06:42:01 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 06:42:00 -0700 (PDT)
In-Reply-To: <ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.167; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.167
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com> <ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6c86373c-45bc-4505-98bf-d120f7d65728n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 13:42:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 3533

by: fir - Sun, 30 Jul 2023 13:42 UTC

On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> Overall it all seems OK, except maybe the choice that the heads take up this capacity. There are 5 heads and each one has a capacity of from 5 down to 1 bit. It may be useful when using the 2-byte encoding:
> that gives 5+6=11 bits for 2 bytes. 11 bits is 2k values; I wonder if they put good things there.
>
BTW, if so, this means that it's maybe somewhat important to look up what is in these
chars (they probably call them code points, but IMO they could be called
wide chars or just chars / characters) up to 255 and up to 11 bits (2047).
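
For reference, the 5+6 bit split of the 2-byte form looks roughly like this in C. It is a minimal sketch that only handles code points below 0x800 (the 11-bit range discussed here); the function name `utf8_encode_short` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 if it fits in one or two bytes (cp < 0x800).
   Returns the number of bytes written, or 0 if cp is out of this range. */
static size_t utf8_encode_short(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                    /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx: top 5 bits */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx: low 6 bits */
        return 2;
    }
    return 0;
}
```

For example, U+0105 (ą) comes out as the bytes C4 85, and U+07FF, the last 11-bit value, as DF BF.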

As there is probably no reason to reject this wide character table, and if so it
becomes the new ASCII, it's probably better to learn it.

https://en.wikipedia.org/wiki/Latin-1_Supplement

Sadly I don't see the Polish letters ąę ćś źż ółń and the capital ones (18 in total) in the 256 set, yet I see some really weird ones that are not very useful IMO, but some useful ones too.

What the hell is the use of the 0x8X and 0x9X control codes? Does anything really use them? (Like most of 0x0X
and 0x1X.) About 60 characters are as good as wasted (or someone would need to discover some good use for them).

Some of 0xAX and 0xBX I could use.

Is there any real use for these 7 A, 4 E, 4 I, 6 O, 4 U (x2)? If not, I would rather like to see more additional punctuation there (but as I say, there is no way to question it, I guess).

Only two blocks (0xAX, 0xBX) of punctuation, plus the mul and div signs? Then all letters.

Fortunately the Polish ones are among 256-383.

But where is the Russian?

Re: Unicode test suite

<ca6d33b6-600a-4f9f-8c48-b62a6f619033n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26833&group=comp.lang.c#26833

X-Received: by 2002:a37:8782:0:b0:76c:891f:1be8 with SMTP id j124-20020a378782000000b0076c891f1be8mr20397qkd.12.1690725036018;
Sun, 30 Jul 2023 06:50:36 -0700 (PDT)
X-Received: by 2002:a05:6808:1891:b0:3a6:e045:410f with SMTP id
bi17-20020a056808189100b003a6e045410fmr13533576oib.11.1690725035644; Sun, 30
Jul 2023 06:50:35 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 06:50:35 -0700 (PDT)
In-Reply-To: <6c86373c-45bc-4505-98bf-d120f7d65728n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.167; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.167
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com> <6c86373c-45bc-4505-98bf-d120f7d65728n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ca6d33b6-600a-4f9f-8c48-b62a6f619033n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 13:50:36 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: fir - Sun, 30 Jul 2023 13:50 UTC

On Sunday, 30 July 2023 at 15:42:09 UTC+2, fir wrote:
> On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> > Overall it all seems OK, except maybe the choice that the heads take up this capacity. There are 5 heads and each one has a capacity of from 5 down to 1 bit. It may be useful when using the 2-byte encoding:
> > that gives 5+6=11 bits for 2 bytes. 11 bits is 2k values; I wonder if they put good things there.
> >
> BTW, if so, this means that it's maybe somewhat important to look up what is in these
> chars (they probably call them code points, but IMO they could be called
> wide chars or just chars / characters) up to 255 and up to 11 bits (2047).
>
> As there is probably no reason to reject this wide character table, and if so it
> becomes the new ASCII, it's probably better to learn it.
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement
>
> Sadly I don't see the Polish letters ąę ćś źż ółń and the capital ones (18 in total) in the 256 set, yet I see some really weird ones that are not very useful IMO, but some useful ones too.
>
> What the hell is the use of the 0x8X and 0x9X control codes? Does anything really use them? (Like most of 0x0X
> and 0x1X.) About 60 characters are as good as wasted (or someone would need to discover some good use for them).
>
> Some of 0xAX and 0xBX I could use.
>
> Is there any real use for these 7 A, 4 E, 4 I, 6 O, 4 U (x2)? If not, I would rather like to see more additional punctuation there (but as I say, there is no way to question it, I guess).
>
> Only two blocks (0xAX, 0xBX) of punctuation, plus the mul and div signs? Then all letters.
>
> Fortunately the Polish ones are among 256-383.
>
> But where is the Russian?

Overall, a lot of letters and a lack of useful punctuation (though I have mostly looked at 0-500 so far,
and haven't really looked at 500-2000 yet).

Re: Unicode test suite

<c643e4f2-e398-4c4a-a1e5-1d5e03a6c6f0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26834&group=comp.lang.c#26834

X-Received: by 2002:a05:622a:189c:b0:403:fb10:28f8 with SMTP id v28-20020a05622a189c00b00403fb1028f8mr23604qtc.4.1690725812433;
Sun, 30 Jul 2023 07:03:32 -0700 (PDT)
X-Received: by 2002:a05:6870:76a5:b0:1bb:7126:4ddc with SMTP id
dx37-20020a05687076a500b001bb71264ddcmr9317097oab.2.1690725812049; Sun, 30
Jul 2023 07:03:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Sun, 30 Jul 2023 07:03:31 -0700 (PDT)
In-Reply-To: <ca6d33b6-600a-4f9f-8c48-b62a6f619033n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=5.172.255.167; posting-account=Sb6m8goAAABbWsBL7gouk3bfLsuxwMgN
NNTP-Posting-Host: 5.172.255.167
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com> <87fs57op51.fsf@nosuchdomain.example.com>
<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com> <ua34qf$2kl06$1@dont-email.me>
<871qgqo06r.fsf@nosuchdomain.example.com> <20230729145800.354@kylheku.com>
<ddda5fa4-2430-45c5-906b-67a2007011f3n@googlegroups.com> <6c86373c-45bc-4505-98bf-d120f7d65728n@googlegroups.com>
<ca6d33b6-600a-4f9f-8c48-b62a6f619033n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c643e4f2-e398-4c4a-a1e5-1d5e03a6c6f0n@googlegroups.com>
Subject: Re: Unicode test suite
From: profesor...@gmail.com (fir)
Injection-Date: Sun, 30 Jul 2023 14:03:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4171

by: fir - Sun, 30 Jul 2023 14:03 UTC

On Sunday, 30 July 2023 at 15:50:44 UTC+2, fir wrote:
> On Sunday, 30 July 2023 at 15:42:09 UTC+2, fir wrote:
> > On Sunday, 30 July 2023 at 13:31:06 UTC+2, fir wrote:
> > > Overall it all seems OK, except maybe the choice that the heads take up this capacity. There are 5 heads and each one has a capacity of from 5 down to 1 bit. It may be useful when using the 2-byte encoding:
> > > that gives 5+6=11 bits for 2 bytes. 11 bits is 2k values; I wonder if they put good things there.
> > >
> > BTW, if so, this means that it's maybe somewhat important to look up what is in these
> > chars (they probably call them code points, but IMO they could be called
> > wide chars or just chars / characters) up to 255 and up to 11 bits (2047).
> >
> > As there is probably no reason to reject this wide character table, and if so it
> > becomes the new ASCII, it's probably better to learn it.
> >
> > https://en.wikipedia.org/wiki/Latin-1_Supplement
> >
> > Sadly I don't see the Polish letters ąę ćś źż ółń and the capital ones (18 in total) in the 256 set, yet I see some really weird ones that are not very useful IMO, but some useful ones too.
> >
> > What the hell is the use of the 0x8X and 0x9X control codes? Does anything really use them? (Like most of 0x0X
> > and 0x1X.) About 60 characters are as good as wasted (or someone would need to discover some good use for them).
> >
> > Some of 0xAX and 0xBX I could use.
> >
> > Is there any real use for these 7 A, 4 E, 4 I, 6 O, 4 U (x2)? If not, I would rather like to see more additional punctuation there (but as I say, there is no way to question it, I guess).
> >
> > Only two blocks (0xAX, 0xBX) of punctuation, plus the mul and div signs? Then all letters.
> >
> > Fortunately the Polish ones are among 256-383.
> >
> > But where is the Russian?
> Overall, a lot of letters and a lack of useful punctuation (though I have mostly looked at 0-500 so far,
> and haven't really looked at 500-2000 yet).

OK, there seem to be a few more symbols around 700, 800+, and Russian is at 1024+.
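
As a quick map of the ranges mentioned in this subthread, here is a small C lookup sketch covering the Unicode blocks from U+0000 to U+04FF. The table is deliberately partial (it stops after Cyrillic, short of the 11-bit limit 0x7FF), and the `struct block` type and `block_name` function are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct block {
    uint32_t first, last;
    const char *name;
};

/* Unicode blocks up to U+04FF; later blocks are left out on purpose. */
static const struct block blocks[] = {
    {0x0000, 0x007F, "Basic Latin"},
    {0x0080, 0x00FF, "Latin-1 Supplement"},
    {0x0100, 0x017F, "Latin Extended-A"},
    {0x0180, 0x024F, "Latin Extended-B"},
    {0x0250, 0x02AF, "IPA Extensions"},
    {0x02B0, 0x02FF, "Spacing Modifier Letters"},
    {0x0300, 0x036F, "Combining Diacritical Marks"},
    {0x0370, 0x03FF, "Greek and Coptic"},
    {0x0400, 0x04FF, "Cyrillic"},
};

/* Name of the block containing cp, or NULL if cp is beyond this table. */
static const char *block_name(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof blocks / sizeof blocks[0]; i++)
        if (cp >= blocks[i].first && cp <= blocks[i].last)
            return blocks[i].name;
    return NULL;
}
```

This matches the observations above: 256-383 is Latin Extended-A (U+0100 to U+017F), where most of the Polish letters sit, and 1024 is U+0400, the start of Cyrillic. One detail: ó and Ó are already in the Latin-1 Supplement, at U+00F3 and U+00D3.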
