Unicode test suite

<636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26752&group=comp.lang.c#26752
X-Received: by 2002:a05:622a:1a0b:b0:403:ba0f:5779 with SMTP id f11-20020a05622a1a0b00b00403ba0f5779mr306qtb.11.1690478533373;
Thu, 27 Jul 2023 10:22:13 -0700 (PDT)
X-Received: by 2002:a05:6830:619:b0:6b9:91bb:e49e with SMTP id
w25-20020a056830061900b006b991bbe49emr7981866oti.7.1690478533183; Thu, 27 Jul
2023 10:22:13 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border-2.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
Subject: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Thu, 27 Jul 2023 17:22:13 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 4
 by: Malcolm McLean - Thu, 27 Jul 2023 17:22 UTC

Lynn's comment inspired me to add Unicode support to the Baby X resource compiler. But despite searching for quite a long time, I can't find a test suite of Unicode files in various formats. In fact it's hard to find any Unicode files at all which are not UTF-8.

Re: Unicode test suite

<sk1=zVc6nIadUyDxN@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=26755&group=comp.lang.c#26755
Path: i2pn2.org!i2pn.org!paganini.bofh.team!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 17:47:13 -0000 (UTC)
Organization: To protect and to server
Message-ID: <sk1=zVc6nIadUyDxN@bongo-ra.co>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 27 Jul 2023 17:47:13 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="3392436"; posting-host="9H7U5kayiTdk7VIdYU44Rw.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
Cancel-Lock: sha256:Fao9VReVxrgFcZs4l85GI6eIqejHI+y7vApOnQSPC2M=
X-Organisation: Weyland-Yutani
X-Notice: Filtered by postfilter v. 0.9.3
X-Server-Commands: nowebcancel
 by: Spiros Bousbouras - Thu, 27 Jul 2023 17:47 UTC

On Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> Lynn's comment inspired me to add Unicode support to the Baby X resource
> compiler. But despite searching for quite a long time, I can't find a test
> suite of Unicode files in various formats. In fact it's hard to find any
> Unicode files at all which are not UTF-8.

What do you want to test and what do you mean by "formats" ? You can use
iconv to transform UTF-8 to any other encoding.

What is the C connection ?

Re: Unicode test suite

<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26756&group=comp.lang.c#26756
X-Received: by 2002:a05:6214:186f:b0:63d:38f1:fc82 with SMTP id eh15-20020a056214186f00b0063d38f1fc82mr335qvb.8.1690480536135;
Thu, 27 Jul 2023 10:55:36 -0700 (PDT)
X-Received: by 2002:a05:6808:18a1:b0:3a4:1265:67e7 with SMTP id
bi33-20020a05680818a100b003a4126567e7mr7085653oib.8.1690480535737; Thu, 27
Jul 2023 10:55:35 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.1d4.us!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 27 Jul 2023 10:55:35 -0700 (PDT)
In-Reply-To: <sk1=zVc6nIadUyDxN@bongo-ra.co>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <sk1=zVc6nIadUyDxN@bongo-ra.co>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Thu, 27 Jul 2023 17:55:36 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2344
 by: Malcolm McLean - Thu, 27 Jul 2023 17:55 UTC

On Thursday, 27 July 2023 at 18:47:28 UTC+1, Spiros Bousbouras wrote:
> On Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
> Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > Lynn's comment inspired me to add Unicode support to the Baby X resource
> > compiler. But despite searching for quite a long time, I can't find a test
> > suite of Unicode files in various formats. In fact it's hard to find any
> > Unicode files at all which are not UTF-8.
> What do you want to test and what do you mean by "formats" ? You can use
> iconv to transform UTF-8 to any other encoding.
>
> What is the C connection ?
>
The Baby X resource compiler is a tool for C programmers.
It takes data in various formats and transforms it to compilable C code.
Currently it doesn't have any support for Unicode data. That's because my
background is in games programming and it's not something I'm very
familiar with.

I'm trying to write a function char *loadasutf8(const char *filename, int *error)
which will accept any text file, smart detect the format, and return the
contents in UTF-8. So I'm looking for Unicode files in various formats to
test it on.
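
A minimal sketch of the "smart detect" step, assuming detection starts by looking for a byte order mark. The names `text_encoding` and `detect_bom` are hypothetical and are not taken from Baby X. Files without a BOM come back as unknown, so a real loader would still need a fallback (for example, assuming UTF-8).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: neither the enum nor the function comes from Baby X. */
enum text_encoding {
    ENC_UNKNOWN,   /* no BOM found; the caller must guess or assume UTF-8 */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UTF32LE,
    ENC_UTF32BE
};

/* Inspect the first bytes of a buffer for a byte order mark.
   On return, *bomlen holds the number of BOM bytes to skip (0 if none). */
enum text_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bomlen)
{
    assert(buf != NULL && bomlen != NULL);
    *bomlen = 0;

    /* Test UTF-32LE before UTF-16LE, because FF FE 00 00 also begins with
       FF FE. This is a heuristic: the same bytes could be a UTF-16LE BOM
       followed by U+0000. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
        *bomlen = 4;
        return ENC_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bomlen = 4;
        return ENC_UTF32BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bomlen = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bomlen = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bomlen = 2;
        return ENC_UTF16BE;
    }
    return ENC_UNKNOWN;
}
```

A loader along these lines would read the whole file into memory, call `detect_bom` on the buffer, skip `bomlen` bytes, and then transcode the remainder to UTF-8 according to the returned value.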

Re: Unicode test suite

<u9ucka$1ulb7$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26758&group=comp.lang.c#26758
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 19:22:02 +0100
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <u9ucka$1ulb7$2@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 27 Jul 2023 18:22:02 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="f6c6f94519e1a2521af762b44f8b9876";
logging-data="2053479"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18dWxanL7DHlH99RDAODExB/phUhuG+p3Q="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:knoI8qXaniiL/+m0KaIEyIvmjgI=
In-Reply-To: <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
 by: Bart - Thu, 27 Jul 2023 18:22 UTC

On 27/07/2023 18:55, Malcolm McLean wrote:
> On Thursday, 27 July 2023 at 18:47:28 UTC+1, Spiros Bousbouras wrote:
>> On Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
>> Malcolm McLean <malcolm.ar...@gmail.com> wrote:
>>> Lynn's comment inspired me to add Unicode support to the Baby X resource
>>> compiler. But despite searching for quite a long time, I can't find a test
>>> suite of Unicode files in various formats. In fact it's hard to find any
>>> Unicode files at all which are not UTF-8.
>> What do you want to test and what do you mean by "formats" ? You can use
>> iconv to transform UTF-8 to any other encoding.
>>
>> What is the C connection ?
>>
> The Baby X resource compiler is a tool for C programmers.
> It takes data in various formats and transforms it to compilable C code.
> Currently it doesn't have any support for Unicode data. That's because my
> background is in games programming and it's not something I'm very
> familiar with.
>
> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> which will accept any text file, smart detect the format, and return the
> contents in UTF-8. So I'm looking for Unicode files in various formats to
> test it on.

Most text files will already be either UTF8, or ASCII, which is a subset
of UTF8. So what format is there to detect? What formats might be expected?

Re: Unicode test suite

<4FywM.33886$VPEa.784@fx33.iad>

https://www.novabbs.com/devel/article-flat.php?id=26759&group=comp.lang.c#26759
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx33.iad.POSTED!not-for-mail
X-newsreader: xrn 9.03-beta-14-64bit
Sender: scott@dragon.sl.home (Scott Lurndal)
From: sco...@slp53.sl.home (Scott Lurndal)
Reply-To: slp53@pacbell.net
Subject: Re: Unicode test suite
Newsgroups: comp.lang.c
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com> <sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com> <u9ucka$1ulb7$2@dont-email.me>
Lines: 29
Message-ID: <4FywM.33886$VPEa.784@fx33.iad>
X-Complaints-To: abuse@usenetserver.com
NNTP-Posting-Date: Thu, 27 Jul 2023 18:26:08 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Thu, 27 Jul 2023 18:26:08 GMT
X-Received-Bytes: 2207
 by: Scott Lurndal - Thu, 27 Jul 2023 18:26 UTC

Bart <bc@freeuk.com> writes:
>On 27/07/2023 18:55, Malcolm McLean wrote:
>> On Thursday, 27 July 2023 at 18:47:28 UTC+1, Spiros Bousbouras wrote:
>>> On Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
>>> Malcolm McLean <malcolm.ar...@gmail.com> wrote:
>>>> Lynn's comment inspired me to add Unicode support to the Baby X resource
>>>> compiler. But despite searching for quite a long time, I can't find a test
>>>> suite of Unicode files in various formats. In fact it's hard to find any
>>>> Unicode files at all which are not UTF-8.
>>> What do you want to test and what do you mean by "formats" ? You can use
>>> iconv to transform UTF-8 to any other encoding.
>>>
>>> What is the C connection ?
>>>
>> The Baby X resource compiler is a tool for C programmers.
>> It takes data in various formats and transforms it to compilable C code.
>> Currently it doesn't have any support for Unicode data. That's because my
>> background is in games programming and it's not something I'm very
>> familiar with.
>>
>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
>> which will accept any text file, smart detect the format, and return the
>> contents in UTF-8. So I'm looking for Unicode files in various formats to
>> test it on.
>
>Most text files will already be either UTF8, or ASCII, which is a subset
>of UTF8. So what format is there to detect? What formats might be expected?

The most likely candidates would be UCS-2 or UTF-16, both standard Windows encodings.

Re: Unicode test suite

<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26760&group=comp.lang.c#26760
X-Received: by 2002:a37:5a05:0:b0:762:42ca:919e with SMTP id o5-20020a375a05000000b0076242ca919emr393qkb.9.1690482836315;
Thu, 27 Jul 2023 11:33:56 -0700 (PDT)
X-Received: by 2002:a05:6870:7695:b0:1bb:4eaa:e67a with SMTP id
dx21-20020a056870769500b001bb4eaae67amr279250oab.0.1690482835648; Thu, 27 Jul
2023 11:33:55 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 27 Jul 2023 11:33:55 -0700 (PDT)
In-Reply-To: <u9ucka$1ulb7$2@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Thu, 27 Jul 2023 18:33:56 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3132
 by: Malcolm McLean - Thu, 27 Jul 2023 18:33 UTC

On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
> On 27/07/2023 18:55, Malcolm McLean wrote:
> > On Thursday, 27 July 2023 at 18:47:28 UTC+1, Spiros Bousbouras wrote:
> >> On Thu, 27 Jul 2023 10:22:12 -0700 (PDT)
> >> Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> >>> Lynn's comment inspired me to add Unicode support to the Baby X resource
> >>> compiler. But despite searching for quite a long time, I can't find a test
> >>> suite of Unicode files in various formats. In fact it's hard to find any
> >>> Unicode files at all which are not UTF-8.
> >> What do you want to test and what do you mean by "formats" ? You can use
> >> iconv to transform UTF-8 to any other encoding.
> >>
> >> What is the C connection ?
> >>
> > The Baby X resource compiler is a tool for C programmers.
> > It takes data in various formats and transforms it to compilable C code.
> > Currently it doesn't have any support for Unicode data. That's because my
> > background is in games programming and it's not something I'm very
> > familiar with.
> >
> > I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> > which will accept any text file, smart detect the format, and return the
> > contents in UTF-8. So I'm looking for Unicode files in various formats to
> > test it on.
> Most text files will already be either UTF8, or ASCII, which is a subset
> of UTF8. So what format is there to detect? What formats might be expected?
>
Of course. But the spirit of the resource compiler is that it's quite liberal in
what it will accept. The vast majority of Unicode files are UTF-8, so it's
trivial to write a function to read them and return the contents as UTF-8.
But you can also have UTF-16 and UTF-32, with or without a byte order
mark, and in little-endian or big-endian format.

Re: Unicode test suite

<20230727105813.33@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26761&group=comp.lang.c#26761
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 18:49:55 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <20230727105813.33@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
Injection-Date: Thu, 27 Jul 2023 18:49:55 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8b46a856f685959d89dbb9806175f5b3";
logging-data="2059798"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+drizUxlIu4C0fWWXW7vzjZyz7sWfYPS8="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:6BK3Tv9O35pKRx2xa7FRFU8XYqQ=
 by: Kaz Kylheku - Thu, 27 Jul 2023 18:49 UTC

On 2023-07-27, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> Lynn's comment inspired me to add Unicode support to the Baby X
> resource compiler. But despite searching for quite a long time, I
> can't find a test suite of Unicode files in various formats. In fact
> it's hard to find any Unicode files at all which are not UTF-8.

It's a fool's errand to support Unicode formats other than UTF-8.

Also, I recommend ignoring the existence of the BOM (byte order
mark) that can be present at the start of a UTF-8 stream.

It makes no sense: UTF-8 has a single defined byte order: the high
order part of a code point is encoded first, when the code point
requires multiple bytes.
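
To illustrate the fixed byte order, here is a minimal sketch of a UTF-8 encoder (not from the thread; `utf8_encode` is a hypothetical name). The lead byte always carries the highest-order bits of the code point and each continuation byte carries the next six, regardless of the host's endianness.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. Hypothetical helper for illustration. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* top 5 bits */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* low 6 bits */
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                                        /* surrogate */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* top 4 bits */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));         /* top 3 bits */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+FEFF with this function yields the three bytes EF BB BF, which is exactly the sequence that shows up as a "BOM" at the start of some UTF-8 files.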

I've only seen BOM in UTF-8 on Windows. I suspect it's some Microsoft
bug whereby, in conversion from UTF-16, the BOM is accidentally
retained, and they may have lobbied to get that idiocy into the
Unicode spec instead of fixing their shit.

If enough programs ignore the existence of BOM and just decode and
retain it as a character, the practice of generating it will have to go
away.

The BOM could appear in a file that contains nothing but characters in
the US-ASCII range, thereby wantonly breaking the ASCII compatibility
that Pike and Thompson designed into UTF-8.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<20230727115021.336@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26762&group=comp.lang.c#26762
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 18:54:15 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 22
Message-ID: <20230727115021.336@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
Injection-Date: Thu, 27 Jul 2023 18:54:15 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8b46a856f685959d89dbb9806175f5b3";
logging-data="2059798"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+uko9BMAzXOl6puxUk4jF510OvrBrDoaU="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:5RatYCBAUmHve9K0NfIyqKacYVs=
 by: Kaz Kylheku - Thu, 27 Jul 2023 18:54 UTC

On 2023-07-27, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> which will accept any text file, smart detect the format, and return the
> contents in UTF-8.

That function is indistinguishable from:

char *loadfile(const char *filename, int *err);

if all you do is return the UTF-8 that is already in the file.

Now, UTF-8 can contain bad bytes that don't conform to UTF-8, and it
makes sense to validate that. But the "int *err" argument can be
extended to reporting UTF-8 decoding errors.
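
A minimal sketch of such a validation pass, assuming the file is already in memory. The name `utf8_validate` and the `erroff` out-parameter are hypothetical; they stand in for whatever the loader would report through `int *err`. The checks reject stray continuation bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, otherwise 0.
   On failure, if erroff is not NULL, *erroff receives the offset of the
   first byte of the offending sequence. Hypothetical helper. */
int utf8_validate(const unsigned char *buf, size_t len, size_t *erroff)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {              /* plain ASCII byte */
            i++;
            continue;
        }
        if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            goto bad;                /* stray continuation byte or F8..FF */
        }
        if (len - i - 1 < need)
            goto bad;                /* sequence truncated by end of data */
        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                goto bad;            /* expected a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3Ful);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            goto bad;                /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return 1;

bad:
    if (erroff != NULL)
        *erroff = i;
    return 0;
}
```

Pure ASCII input passes unchanged, which matches the point that a UTF-8-only policy handles ASCII as well.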

If you just make it so that Baby X resource files are UTF-8, and nothing
but UTF-8, that handles ASCII also.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<dc7290a9-27cd-4fee-bae8-4cb9ea46ed1dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26765&group=comp.lang.c#26765
X-Received: by 2002:a37:2c45:0:b0:767:420d:cec2 with SMTP id s66-20020a372c45000000b00767420dcec2mr1013qkh.5.1690488767722;
Thu, 27 Jul 2023 13:12:47 -0700 (PDT)
X-Received: by 2002:a05:6808:124e:b0:3a4:18d1:1686 with SMTP id
o14-20020a056808124e00b003a418d11686mr511687oiv.10.1690488767294; Thu, 27 Jul
2023 13:12:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 27 Jul 2023 13:12:46 -0700 (PDT)
In-Reply-To: <20230727115021.336@kylheku.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<20230727115021.336@kylheku.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <dc7290a9-27cd-4fee-bae8-4cb9ea46ed1dn@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Thu, 27 Jul 2023 20:12:47 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2510
 by: Malcolm McLean - Thu, 27 Jul 2023 20:12 UTC

On Thursday, 27 July 2023 at 19:54:29 UTC+1, Kaz Kylheku wrote:
> On 2023-07-27, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> > which will accept any text file, smart detect the format, and return the
> > contents in UTF-8.
> That function is indistinguishable from:
>
> char *loadfile(const char *filename, int *err);
>
> if all you do is return the UTF-8 that is already in the file.
>
> Now, UTF-8 can contain bad bytes that don't conform to UTF-8, and it
> makes sense to validate that. But the "int *err" argument can be
> extended to reporting UTF-8 decoding errors.
>
> If you just make it so that Baby X resource files are UTF-8, and nothing
> but UTF-8, that handles ASCII also.
>
Whilst I work with images every day, I can't remember when I last had to
work with textual data for a serious purpose. It might be that all textual
data is either UTF-8 or ASCII, or so close that it makes no difference.

The idea of the Baby X resource compiler is that you don't need to modify
your raw source files. You give it the data and it puts it into a standard
format, and emits it as compilable C source.

Re: Unicode test suite

<87v8e5nkmn.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26767&group=comp.lang.c#26767
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 13:51:28 -0700
Organization: None to speak of
Lines: 31
Message-ID: <87v8e5nkmn.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="f8957b03f6dc45951cee385c25f0e511";
logging-data="2083536"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/KPXKU50vw1nYwRuU7erzo"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:dBBGC9+R8e/SJmg95LljnGRBOqY=
sha1:k7yp9qONeIQDZf+f/rUn7CfP+68=
 by: Keith Thompson - Thu, 27 Jul 2023 20:51 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
>> On 27/07/2023 18:55, Malcolm McLean wrote:
[...]
>> > I'm trying to write a function char *loadasutf8(const char *filename, int *error)
>> > which will accept any text file, smart detect the format, and return the
>> > contents in UTF-8. So I'm looking for Unicode files in various formats to
>> > test it on.
>> Most text files will already be either UTF8, or ASCII, which is a subset
>> of UTF8. So what format is there to detect? What formats might be expected?
>>
> Of course. But the spirit of the resource compiler is that it's quite liberal in
> what it will accept. The vast majority of Unicode files are UTF-8, so it's
> trivial to write a function to read them and return the contents as UTF-8.
> But you can also have UTF-16 and UTF-32, with or without a byte order
> mark, and in little-endian or big-endian format.

UTF-8 files are easy to find. (Some of them are also ASCII.)

UTF-16 files are common on Windows, usually little-endian, usually with
a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
Multilingual Plane, characters 0-65535).
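
The difference between UCS-2 and UTF-16 comes down to surrogate pairs. A minimal sketch, assuming the 16-bit units have already been read into host byte order (`utf16_decode` is a hypothetical name): a UCS-2 reader only needs the first branch, while a UTF-16 reader must also combine a high and a low surrogate into one code point above U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from n UTF-16 code units in host byte order.
   Returns the number of units consumed (1 or 2), or 0 on empty input
   or an unpaired surrogate. Hypothetical helper for illustration. */
size_t utf16_decode(const uint16_t *u, size_t n, uint32_t *cp)
{
    assert(u != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* Anything outside D800..DFFF is a BMP character: the UCS-2 case. */
    if (u[0] < 0xD800 || u[0] > 0xDFFF) {
        *cp = u[0];
        return 1;
    }
    /* High surrogate (D800..DBFF) followed by low surrogate (DC00..DFFF). */
    if (u[0] <= 0xDBFF && n >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
        *cp = 0x10000u + (((uint32_t)(u[0] - 0xD800) << 10)
                          | (uint32_t)(u[1] - 0xDC00));
        return 2;
    }
    return 0;   /* lone low surrogate, or high surrogate without a partner */
}
```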

UTF-32 files are rare.

You can generate all these formats from UTF-8 using the "iconv" command.
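
If iconv is not at hand, small test inputs can also be produced directly. The following is a sketch of that alternative and does not use iconv; `utf16_encode` and `put16` are hypothetical names. It writes a sequence of code points as UTF-16 bytes in either byte order, with or without a leading BOM, which covers the four UTF-16 variants mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one 16-bit unit at out[0..1] in the requested byte order. */
static size_t put16(unsigned char *out, uint16_t v, int big_endian)
{
    out[0] = (unsigned char)(big_endian ? v >> 8 : v & 0xFF);
    out[1] = (unsigned char)(big_endian ? v & 0xFF : v >> 8);
    return 2;
}

/* Encode count code points as UTF-16 into out (capacity outcap bytes).
   Returns the number of bytes written, or 0 if a code point is invalid
   or the buffer is too small. Note that 0 is also the result for empty
   input without a BOM. Hypothetical helper for generating test data. */
size_t utf16_encode(const uint32_t *cps, size_t count, int big_endian,
                    int with_bom, unsigned char *out, size_t outcap)
{
    size_t n = 0, i;

    assert(cps != NULL && out != NULL);
    if (with_bom) {
        if (outcap < 2)
            return 0;
        n += put16(out + n, 0xFEFF, big_endian);
    }
    for (i = 0; i < count; i++) {
        uint32_t cp = cps[i];

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                       /* not a Unicode scalar value */
        if (cp < 0x10000) {
            if (outcap - n < 2)
                return 0;
            n += put16(out + n, (uint16_t)cp, big_endian);
        } else {
            if (outcap - n < 4)
                return 0;
            cp -= 0x10000;                  /* split into a surrogate pair */
            n += put16(out + n, (uint16_t)(0xD800 + (cp >> 10)), big_endian);
            n += put16(out + n, (uint16_t)(0xDC00 + (cp & 0x3FF)), big_endian);
        }
    }
    return n;
}
```

The resulting buffer can be written to a file with `fwrite` on a stream opened in binary mode, giving a known-good input for each variant.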

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26768&group=comp.lang.c#26768
X-Received: by 2002:a37:8685:0:b0:765:6a0f:8279 with SMTP id i127-20020a378685000000b007656a0f8279mr1411qkd.0.1690491327217;
Thu, 27 Jul 2023 13:55:27 -0700 (PDT)
X-Received: by 2002:a05:6808:158c:b0:3a4:1f25:7508 with SMTP id
t12-20020a056808158c00b003a41f257508mr837705oiw.0.1690491326905; Thu, 27 Jul
2023 13:55:26 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!newsfeed.hasname.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 27 Jul 2023 13:55:26 -0700 (PDT)
In-Reply-To: <87v8e5nkmn.fsf@nosuchdomain.example.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Thu, 27 Jul 2023 20:55:27 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2927
 by: Malcolm McLean - Thu, 27 Jul 2023 20:55 UTC

On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> > On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
> >> On 27/07/2023 18:55, Malcolm McLean wrote:
> [...]
> >> > I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> >> > which will accept any text file, smart detect the format, and return the
> >> > contents in UTF-8. So I'm looking for Unicode files in various formats to
> >> > test it on.
> >> Most text files will already be either UTF8, or ASCII, which is a subset
> >> of UTF8. So what format is there to detect? What formats might be expected?
> >>
> > Of course. But the spirit of the resource compiler is that it's quite liberal in
> > what it will accept. The vast majority of Unicode files are UTF-8, so it's
> > trivial to write a function to read them and return the contents as UTF-8.
> > But you can also have UTF-16 and UTF-32, with or without a byte order
> > mark, and in little-endian or big-endian format.
> UTF-8 files are easy to find. (Some of them are also ASCII.)
>
> UTF-16 files are common on Windows, usually little-endian, usually with
> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
> Multilingual Plane, characters 0-65535).
>
> UTF-32 files are rare.
>
> You can generate all these formats from UTF-8 using the "iconv" command.
>
Wow, iconv --list gives far more encodings than I could possibly support.
But that solves the problem.

Re: Unicode test suite

<20230727142802.187@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26769&group=comp.lang.c#26769
Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 21:31:00 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <20230727142802.187@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
Injection-Date: Thu, 27 Jul 2023 21:31:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="8b46a856f685959d89dbb9806175f5b3";
logging-data="2090845"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+NxpejDxYAOSLlrG5GJsD+skwhy52Q9aY="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:atF3KDGFw5vzrZ2cKQUH5luzzrs=
 by: Kaz Kylheku - Thu, 27 Jul 2023 21:31 UTC

On 2023-07-27, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>> You can generate all these formats from UTF-8 using the "iconv" command.
>>
> Wow, iconv --list gives far more encodings than I could possibly support.
> But that solves the problem.

It punts the problem to the user because you can say that Baby X
resource files are UTF-8, and that's it. If the user writes
the resource file in any other way, or is converting some other kind
of resource file that is in another format, they can use the iconv
tool, or something else; you never have to touch iconv.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<87r0otnhlb.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26770&group=comp.lang.c#26770

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Thu, 27 Jul 2023 14:57:04 -0700
Organization: None to speak of
Lines: 30
Message-ID: <87r0otnhlb.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<20230727142802.187@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="f8957b03f6dc45951cee385c25f0e511";
logging-data="2095128"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++hds8CoMbT5kS97haXiMi"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:c7e/M33fXOp6wu6ZxF7gRNN+RsY=
sha1:TGwZlEb1Y5fN4TmNV+NiF+jEBhI=
 by: Keith Thompson - Thu, 27 Jul 2023 21:57 UTC

Kaz Kylheku <864-117-4973@kylheku.com> writes:
> On 2023-07-27, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
>> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>>> You can generate all these formats from UTF-8 using the "iconv" command.
>>>
>> Wow, iconv --list gives far more encodings than I could possibly support.
>> But that solves the problem.
>
> It punts the problem to the user because you can say that Baby X
> resource files are UTF-8, and that's it. If the user writes
> the resource file in any other way, or is converting some other kind
> of resource file that is in another format, they can use the iconv
> tool, or something else; you never have to touch iconv.

Sure -- assuming that the iconv application is installed, and that
the user doesn't mind an extra manual step.

UTF-16 text files are (unfortunately) common on Windows. Having a
tool quietly handle such files when they're recognized is a perfectly
valid choice.

In a similar context, I've had to work with UTF-8 and UTF-16 files
in a large software project. The fact that vim quietly handles both
formats was tremendously more convenient than if I had had to convert them
manually (and convert them back before updating them in git).

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<u9va06$25ab8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26773&group=comp.lang.c#26773

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 28 Jul 2023 04:43:18 +0200
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <u9va06$25ab8$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 28 Jul 2023 02:43:19 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="26c9289b3511d7d687d476c4cb8f3dba";
logging-data="2271592"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+JNXwD0SfwEFR7cIz4lQjg"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.16
Cancel-Lock: sha1:Z31w07Ydr6vi7RGm+di2uNBGTnI=
In-Reply-To: <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
 by: jak - Fri, 28 Jul 2023 02:43 UTC

Malcolm McLean wrote:
> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
>> [...]
>>>>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
>>>>> which will accept any text file, smart detect the format, and return the
>>>>> contents in UTF-8. So I'm looking for Unicode files in various formats to
>>>>> test it on.
>>>> Most text files will already be either UTF8, or ASCII, which is a subset
>>>> of UTF8. So what format is there to detect? What formats might be expected?
>>>>
>>> Of course. But the spirit of the resource compiler is that it's quite liberal in
>>> what it will accept. The vast majority of Unicode files are UTF-8, so it's
>>> trivial to write a function to read them and return the contents as UTF-8.
>>> But you can also have UTF-16 and UTF-32, with or without a byte order
>>> marker, and in little endian or big endian format.
>> UTF-8 files are easy to find. (Some of them are also ASCII.)
>>
>> UTF-16 files are common on Windows, usually little-endian, usually with
>> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
>> Multilingual Plane, characters 0-65535).
>>
>> UTF-32 files are rare.
>>
>> You can generate all these formats from UTF-8 using the "iconv" command.
>>
> Wow, iconv --list gives far more encodings than I could possibly support.
> But that solves the problem.
>

You are starting a battle that you will lose, like all those who have
started it. There is no way to identify the format of a text
safely (especially if it is imported from the web). Allow me to
recommend that you ask your users to convert their files into UTF-8
format.

Re: Unicode test suite

<87jzuko1ta.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26774&group=comp.lang.c#26774

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 28 Jul 2023 01:52:33 -0700
Organization: None to speak of
Lines: 59
Message-ID: <87jzuko1ta.fsf@nosuchdomain.example.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="e5d63739fe1d7e4fdea2b8e1a08ea6c3";
logging-data="2329182"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+kNSiQpThmuat5muU4tVi9"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:iJZGR92Bs0+WwOtX3M8hyt4b4tE=
sha1:0JcfLeDIjpNGgP2kdswIsRwUI7E=
 by: Keith Thompson - Fri, 28 Jul 2023 08:52 UTC

jak <nospam@please.ty> writes:
> Malcolm McLean wrote:
>> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
>>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
>>> [...]
>>>>>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
>>>>>> which will accept any text file, smart detect the format, and return the
>>>>>> contents in UTF-8. So I'm looking for Unicode files in various formats to
>>>>>> test it on.
>>>>> Most text files will already be either UTF8, or ASCII, which is a subset
>>>>> of UTF8. So what format is there to detect? What formats might be expected?
>>>>>
>>>> Of course. But the spirit of the resource compiler is that it's quite liberal in
>>>> what it will accept. The vast majority of Unicode files are UTF-8, so it's
>>>> trivial to write a function to read them and return the contents as UTF-8.
>>>> But you can also have UTF-16 and UTF-32, with or without a byte order
>>>> marker, and in little endian or big endian format.
>>> UTF-8 files are easy to find. (Some of them are also ASCII.)
>>>
>>> UTF-16 files are common on Windows, usually little-endian, usually with
>>> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
>>> Multilingual Plane, characters 0-65535).
>>>
>>> UTF-32 files are rare.
>>>
>>> You can generate all these formats from UTF-8 using the "iconv" command.
>>>
>> Wow, iconv --list gives far more encodings than I could possibly support.
>> But that solves the problem.
>
> You are starting a battle that you will lose like all those who have
> started it. There is no way to identify the format of a text with
> safety (especially if it is imported from the web). Allow me to
> recommend you to ask your users to convert their files into UTF-8
> format.

There's no 100% reliable way to do so, but there are some very good
heuristics. The file(1) command does a reasonably good job.

Anything with a BOM is identifiable. (I agree that UTF-8 files
generally should not have a BOM, but if it's there you might as well
take advantage of it.) If there's no BOM and the first N bytes are in
the range 0..127, ASCII is a reasonable assumption; otherwise, if the
first N bytes are valid UTF-8, UTF-8 is a reasonable assumption.

(Background: a BOM, Byte Order Mark, is a character U+FEFF, ZERO WIDTH
NO-BREAK SPACE, at the beginning of a text file. It's represented as FE
FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in
UTF-8.)
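
As a rough illustration of the heuristic described above (BOM first, then
an all-ASCII check, then UTF-8 validity), here is a minimal sketch in C.
The names text_format, sniff_format and is_valid_utf8 are hypothetical and
are not taken from the Baby X sources or from file(1). The sketch looks
only at the buffer it is given, ignores UTF-32 and BOM-less UTF-16, and
treats a multi-byte sequence cut off at the end of the buffer as invalid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the names are illustrative only. */
enum text_format {
    FMT_ASCII, FMT_UTF8, FMT_UTF8_BOM, FMT_UTF16LE, FMT_UTF16BE, FMT_UNKNOWN
};

/* Returns 1 if buf[0..n) is well-formed UTF-8, 0 otherwise.
   Rejects overlong forms, surrogates and values above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0, k, len;
    unsigned long cp, min;

    while (i < n) {
        unsigned char c = buf[i];
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;                      /* stray continuation or bad lead */
        if (n - i < len) return 0;          /* sequence runs off the buffer */
        for (k = 1; k < len; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += len;
    }
    return 1;
}

/* Guess the format of the first n bytes of a file.
   Note: FF FE 00 00 is also the UTF-32LE BOM; this sketch does not
   distinguish it from UTF-16LE. */
enum text_format sniff_format(const unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return FMT_UTF8_BOM;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return FMT_UTF16BE;
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return FMT_UTF16LE;
    for (i = 0; i < n && buf[i] < 0x80; i++)
        ;                                   /* scan for a non-ASCII byte */
    if (i == n)
        return FMT_ASCII;
    return is_valid_utf8(buf, n) ? FMT_UTF8 : FMT_UNKNOWN;
}
```

A caller would read the first N bytes of the file into a buffer and pass
them to sniff_format; FMT_UNKNOWN is where a fallback (a legacy 8-bit
encoding, or an explicit user override) would have to take over.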

If a tool is going to be used on Windows, rejecting Windows-native text
files seems unnecessarily hostile.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26775&group=comp.lang.c#26775

X-Received: by 2002:a05:620a:49b:b0:76c:7d45:ba57 with SMTP id 27-20020a05620a049b00b0076c7d45ba57mr5402qkr.7.1690552207638;
Fri, 28 Jul 2023 06:50:07 -0700 (PDT)
X-Received: by 2002:a05:6830:33dc:b0:6b9:97f6:655 with SMTP id
q28-20020a05683033dc00b006b997f60655mr3021748ott.2.1690552207248; Fri, 28 Jul
2023 06:50:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!newsfeed.hasname.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Fri, 28 Jul 2023 06:50:06 -0700 (PDT)
In-Reply-To: <87jzuko1ta.fsf@nosuchdomain.example.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Fri, 28 Jul 2023 13:50:07 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 5454
 by: Malcolm McLean - Fri, 28 Jul 2023 13:50 UTC

On Friday, 28 July 2023 at 09:52:55 UTC+1, Keith Thompson wrote:
> jak <nos...@please.ty> writes:
> > Malcolm McLean wrote:
> >> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
> >>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> >>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
> >>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
> >>> [...]
> >>>>>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> >>>>>> which will accept any text file, smart detect the format, and return the
> >>>>>> contents in UTF-8. So I'm looking for Unicode files in various formats to
> >>>>>> test it on.
> >>>>> Most text files will already be either UTF8, or ASCII, which is a subset
> >>>>> of UTF8. So what format is there to detect? What formats might be expected?
> >>>>>
> >>>> Of course. But the spirit of the resource compiler is that it's quite liberal in
> >>>> what it will accept. The vast majority of Unicode files are UTF-8, so it's
> >>>> trivial to write a function to read them and return the contents as UTF-8.
> >>>> But you can also have UTF-16 and UTF-32, with or without a byte order
> >>>> marker, and in little endian or big endian format.
> >>> UTF-8 files are easy to find. (Some of them are also ASCII.)
> >>>
> >>> UTF-16 files are common on Windows, usually little-endian, usually with
> >>> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
> >>> Multilingual Plane, characters 0-65535).
> >>>
> >>> UTF-32 files are rare.
> >>>
> >>> You can generate all these formats from UTF-8 using the "iconv" command.
> >>>
> >> Wow, iconv --list gives far more encodings than I could possibly support.
> >> But that solves the problem.
> >
> > You are starting a battle that you will lose like all those who have
> > started it. There is no way to identify the format of a text with
> > safety (especially if it is imported from the web). Allow me to
> > recommend you to ask your users to convert their files into UTF-8
> > format.
> There's no 100% reliable way to do so, but there are some very good
> heuristics. The file(1) command does a reasonably good job.
>
> Anything with a BOM is identifiable. (I agree that UTF-8 files
> generally should not have a BOM, but if it's there you might as well
> take advantage of it.) If there's no BOM and the first N bytes are in
> the range 0..127, ASCII is a reasonable assumption; otherwise, if the
> first N bytes are valid UTF-8, UTF-8 is a reasonable assumption.
>
> (Background: a BOM, Byte Order Mark, is a character U+FEFF, ZERO WIDTH
> NO-BREAK SPACE, at the beginning of a text file. It's represented as FE
> FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in
> UTF-8.)
>
> If a tool is going to be used on Windows, rejecting Windows-native text
> files seems unnecessarily hostile.
>
The route I took was to borrow someone else's heuristic text format
detection code. It seemed to perform pretty well on tests. I might add
a "format" attribute to the utf8 tag to allow people to force it to load
in a certain format if the automatic detection fails. However it needs a
lot of documentation and isn't very discoverable.

Since the main point of Unicode is to support non-English languages,
there's now an "international" tag which supports internationalisation.
Strings and UTF-8 tags now have a "language" attribute. The resource
compiler then writes a simple little function which takes a string representing
the language wanted and returns the resource. This is a big change in
scope since the resource compiler is now writing executable code.
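
To make the idea concrete, the generated lookup function might look
something like the sketch below. This is only an illustration under my own
assumptions: the names get_greeting and greeting_xx, the language codes,
and the fallback to English when the language is not found are all
hypothetical, and the real output of the resource compiler may differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical resource data, as a resource compiler might emit it.
   The Spanish text starts with U+00A1 (inverted exclamation mark),
   spelled as UTF-8 bytes so that this source file stays pure ASCII. */
static const char greeting_en[] = "Hello!";
static const char greeting_fr[] = "Bonjour !";
static const char greeting_es[] = "\xC2\xA1Hola!";

/* Takes a string naming the language wanted and returns the resource.
   Falling back to English for an unknown language is an assumption. */
const char *get_greeting(const char *language)
{
    assert(language != NULL);
    if (strcmp(language, "fr") == 0)
        return greeting_fr;
    if (strcmp(language, "es") == 0)
        return greeting_es;
    return greeting_en;
}
```

A call such as get_greeting("es") returns a pointer to the UTF-8 bytes of
the Spanish string, which the program can pass to whatever displays text.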

As I said, I don't work with text data professionally. So I'm very open
to comments from people with more experience.

The Baby X resource compiler is here.

https://github.com/MalcolmMcLean/babyxrc

Re: Unicode test suite

<ua0mr1$299bi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26776&group=comp.lang.c#26776

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 28 Jul 2023 16:28:34 +0100
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <ua0mr1$299bi$1@dont-email.me>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 28 Jul 2023 15:28:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9af2d3b9a48242e025ef4e796a9ca380";
logging-data="2401650"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+9Yu0iGZpWzdS4ZjQGop1Pyh9M+InQwU4="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.13.0
Cancel-Lock: sha1:jBcjwTqM3eL8TuALbcYwXzqY6Vk=
In-Reply-To: <456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
 by: Bart - Fri, 28 Jul 2023 15:28 UTC

On 28/07/2023 14:50, Malcolm McLean wrote:
> On Friday, 28 July 2023 at 09:52:55 UTC+1, Keith Thompson wrote:
>> jak <nos...@please.ty> writes:
>>> Malcolm McLean wrote:
>>>> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>>>>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>>>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
>>>>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
>>>>> [...]
>>>>>>>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
>>>>>>>> which will accept any text file, smart detect the format, and return the
>>>>>>>> contents in UTF-8. So I'm looking for Unicode files in various formats to
>>>>>>>> test it on.
>>>>>>> Most text files will already be either UTF8, or ASCII, which is a subset
>>>>>>> of UTF8. So what format is there to detect? What formats might be expected?
>>>>>>>
>>>>>> Of course. But the spirit of the resource compiler is that it's quite liberal in
>>>>>> what it will accept. The vast majority of Unicode files are UTF-8, so it's
>>>>>> trivial to write a function to read them and return the contents as UTF-8.
>>>>>> But you can also have UTF-16 and UTF-32, with or without a byte order
>>>>>> marker, and in little endian or big endian format.
>>>>> UTF-8 files are easy to find. (Some of them are also ASCII.)
>>>>>
>>>>> UTF-16 files are common on Windows, usually little-endian, usually with
>>>>> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
>>>>> Multilingual Plane, characters 0-65535).
>>>>>
>>>>> UTF-32 files are rare.
>>>>>
>>>>> You can generate all these formats from UTF-8 using the "iconv" command.
>>>>>
>>>> Wow, iconv --list gives far more encodings than I could possibly support.
>>>> But that solves the problem.
>>>
>>> You are starting a battle that you will lose like all those who have
>>> started it. There is no way to identify the format of a text with
>>> safety (especially if it is imported from the web). Allow me to
>>> recommend you to ask your users to convert their files into UTF-8
>>> format.
>> There's no 100% reliable way to do so, but there are some very good
>> heuristics. The file(1) command does a reasonably good job.
>>
>> Anything with a BOM is identifiable. (I agree that UTF-8 files
>> generally should not have a BOM, but if it's there you might as well
>> take advantage of it.) If there's no BOM and the first N bytes are in
>> the range 0..127, ASCII is a reasonable assumption; otherwise, if the
>> first N bytes are valid UTF-8, UTF-8 is a reasonable assumption.
>>
>> (Background: a BOM, Byte Order Mark, is a character U+FEFF, ZERO WIDTH
>> NO-BREAK SPACE, at the beginning of a text file. It's represented as FE
>> FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in
>> UTF-8.)
>>
>> If a tool is going to be used on Windows, rejecting Windows-native text
>> files seems unnecessarily hostile.
>>
> The route I took was to borrow someone else's heuristic text format
> detection code. It seemed to perform pretty well on tests. I might add
> a "format" attribute to the utf8 tag to allow people to force it to load
> in a certain format if the automatic detection fails. However it needs a
> lot of documentation and isn't very discoverable.
>
> Since the main point of Unicode is to support non-English languages,
> there's now an "international" tag which supports internationalisation.
> Strings and UTF-8 tags now have a "language" attribute. The resource
> compiler then writes a simple little function which takes a string representing
> the language wanted and returns the resource. This is a big change in
> scope since the resource compiler is now writing executable code.
>
> As I said, I don't work with text data professionally. So I'm very open
> to comments from people with more experience.
>
> The Baby X resource compiler is here.
>
> https://github.com/MalcolmMcLean/babyxrc
>

I took hello.c and saved it variously as UTF8, UTF8 with BOM, UTF16LE
and UTF16BE (the choices from Notepad were limited).

gcc only worked properly with the UTF8 files, with or without BOM. With
the others, it saw 'nul' every other character, but the special BOM on
those files gave errors.

Since presumably your product will be used by programmers in conjunction
with C source programs, is there any point in supporting a format that
is not also supported by the C compiler that will be needed?

Or will these diverse formats be encountered at the application level by
end-users?

Re: Unicode test suite

<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26777&group=comp.lang.c#26777

X-Received: by 2002:ac8:5782:0:b0:403:54d7:e31d with SMTP id v2-20020ac85782000000b0040354d7e31dmr10605qta.8.1690562166071;
Fri, 28 Jul 2023 09:36:06 -0700 (PDT)
X-Received: by 2002:a05:6870:76ad:b0:1bb:924b:163b with SMTP id
dx45-20020a05687076ad00b001bb924b163bmr3631928oab.2.1690562165856; Fri, 28
Jul 2023 09:36:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Fri, 28 Jul 2023 09:36:05 -0700 (PDT)
In-Reply-To: <ua0mr1$299bi$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2a00:23a8:400a:5601:6c83:9fbc:2db2:919;
posting-account=Dz2zqgkAAADlK5MFu78bw3ab-BRFV4Qn
NNTP-Posting-Host: 2a00:23a8:400a:5601:6c83:9fbc:2db2:919
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co> <24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me> <5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com> <5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com> <ua0mr1$299bi$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
Subject: Re: Unicode test suite
From: malcolm....@gmail.com (Malcolm McLean)
Injection-Date: Fri, 28 Jul 2023 16:36:06 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 7183
 by: Malcolm McLean - Fri, 28 Jul 2023 16:36 UTC

On Friday, 28 July 2023 at 16:28:48 UTC+1, Bart wrote:
> On 28/07/2023 14:50, Malcolm McLean wrote:
> > On Friday, 28 July 2023 at 09:52:55 UTC+1, Keith Thompson wrote:
> >> jak <nos...@please.ty> writes:
> >>> Malcolm McLean wrote:
> >>>> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
> >>>>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> >>>>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
> >>>>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
> >>>>> [...]
> >>>>>>>> I'm trying to write a function char *loadasutf8(const char *filename, int *error)
> >>>>>>>> which will accept any text file, smart detect the format, and return the
> >>>>>>>> contents in UTF-8. So I'm looking for Unicode files in various formats to
> >>>>>>>> test it on.
> >>>>>>> Most text files will already be either UTF8, or ASCII, which is a subset
> >>>>>>> of UTF8. So what format is there to detect? What formats might be expected?
> >>>>>>>
> >>>>>> Of course. But the spirit of the resource compiler is that it's quite liberal in
> >>>>>> what it will accept. The vast majority of Unicode files are UTF-8, so it's
> >>>>>> trivial to write a function to read them and return the contents as UTF-8.
> >>>>>> But you can also have UTF-16 and UTF-32, with or without a byte order
> >>>>>> marker, and in little endian or big endian format.
> >>>>> UTF-8 files are easy to find. (Some of them are also ASCII.)
> >>>>>
> >>>>> UTF-16 files are common on Windows, usually little-endian, usually with
> >>>>> a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
> >>>>> Multilingual Plane, characters 0-65535).
> >>>>>
> >>>>> UTF-32 files are rare.
> >>>>>
> >>>>> You can generate all these formats from UTF-8 using the "iconv" command.
> >>>>>
> >>>> Wow, iconv --list gives far more encodings than I could possibly support.
> >>>> But that solves the problem.
> >>>
> >>> You are starting a battle that you will lose like all those who have
> >>> started it. There is no way to identify the format of a text with
> >>> safety (especially if it is imported from the web). Allow me to
> >>> recommend you to ask your users to convert their files into UTF-8
> >>> format.
> >> There's no 100% reliable way to do so, but there are some very good
> >> heuristics. The file(1) command does a reasonably good job.
> >>
> >> Anything with a BOM is identifiable. (I agree that UTF-8 files
> >> generally should not have a BOM, but if it's there you might as well
> >> take advantage of it.) If there's no BOM and the first N bytes are in
> >> the range 0..127, ASCII is a reasonable assumption; otherwise, if the
> >> first N bytes are valid UTF-8, UTF-8 is a reasonable assumption.
> >>
> >> (Background: a BOM, Byte Order Mark, is a character U+FEFF, ZERO WIDTH
> >> NO-BREAK SPACE, at the beginning of a text file. It's represented as FE
> >> FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in
> >> UTF-8.)
> >>
> >> If a tool is going to be used on Windows, rejecting Windows-native text
> >> files seems unnecessarily hostile.
> >>
> > The route I took was to borrow someone else's heuristic text format
> > detection code. It seemed to perform pretty well on tests. I might add
> > a "format" attribute to the utf8 tag to allow people to force it to load
> > in a certain format if the automatic detection fails. However it needs a
> > lot of documentation and isn't very discoverable.
> >
> > Since the main point of Unicode is to support non-English languages,
> > there's now an "international" tag which supports internationalisation.
> > Strings and UTF-8 tags now have a "language" attribute. The resource
> > compiler then writes a simple little function which takes a string representing
> > the language wanted and returns the resource. This is a big change in
> > scope since the resource compiler is now writing executable code.
> >
> > As I said, I don't work with text data professionally. So I'm very open
> > to comments from people with more experience.
> >
> > The Baby X resource compiler is here.
> >
> > https://github.com/MalcolmMcLean/babyxrc
> >
> I took hello.c and saved it variously as UTF8, UTF8 with BOM, UTF16LE
> and UTF16BE (the choices from Notepad were limited).
>
> gcc only worked properly with the UTF8 files, with or without BOM. With
> the others, it saw 'nul' every other character, but the special BOM on
> those files gave errors.
>
> Since presumably your product will be used by programmers in conjunction
> with C source programs, is there any point in supporting a format that
> is not also supported by the C compiler that will be needed?
>
> Or will these diverse formats be encountered at the application level by
> end-users?
>
You've got the wrong end of the stick.

The Baby X resource compiler loads resource files (images, audio, or what we
are concerned with here, text) and saves them as portable ANSI C. So the .c
output files are ASCII.
However the text data might be Unicode. So instead of using a string tag, which
creates a C string literal, the user uses a "utf8" tag which writes the data in UTF-8,
as an array of chars.
However a good question is whether most C compilers will accept UTF-8.
UTF-8 string literals are better than hex dumped UTF-8 data, because they are
human-readable. However the standard frowns on UTF-8 in strings.
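
For illustration, hex dumping UTF-8 data into an ASCII-only C source could
be done along the lines of the sketch below. The function name
emit_c_array and its buffer-based interface are hypothetical, not the
resource compiler's actual code. The sketch emits unsigned char rather
than plain char, which is my own choice: initialising a plain char with a
value above 127 is implementation-defined where char is signed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes into out (capacity cap) a C definition of an array called name
   holding data[0..n) followed by a terminating zero byte.
   Returns 0 on success, -1 if out is too small. */
int emit_c_array(char *out, size_t cap, const char *name,
                 const unsigned char *data, size_t n)
{
    size_t used, i;
    int w;

    assert(out != NULL && name != NULL && (data != NULL || n == 0));
    w = snprintf(out, cap, "const unsigned char %s[] = {", name);
    if (w < 0 || (size_t)w >= cap)
        return -1;
    used = (size_t)w;
    for (i = 0; i < n; i++) {
        /* Every byte becomes a hex constant, so the output is pure ASCII. */
        w = snprintf(out + used, cap - used, "0x%02X, ", (unsigned)data[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    w = snprintf(out + used, cap - used, "0x00};\n");
    if (w < 0 || (size_t)w >= cap - used)
        return -1;
    return 0;
}
```

Given the five UTF-8 bytes of "café", this produces a one-line array
definition that any C compiler should accept regardless of its source
character set, at the cost of not being human-readable.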

Re: Unicode test suite

<20230728095505.407@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26778&group=comp.lang.c#26778

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 28 Jul 2023 17:04:58 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 141
Message-ID: <20230728095505.407@kylheku.com>
References: <636cb864-67f8-4bfa-9bc8-b76b9ed95761n@googlegroups.com>
<sk1=zVc6nIadUyDxN@bongo-ra.co>
<24ec9add-5f1e-40c6-835e-c5733f73e683n@googlegroups.com>
<u9ucka$1ulb7$2@dont-email.me>
<5195cf3a-3288-4bd7-9fa9-e4d6d9fb6897n@googlegroups.com>
<87v8e5nkmn.fsf@nosuchdomain.example.com>
<5ed1026b-80a4-4471-9cb3-8f57c89979a7n@googlegroups.com>
<u9va06$25ab8$1@dont-email.me> <87jzuko1ta.fsf@nosuchdomain.example.com>
<456a7b5c-6f2e-43d9-bbf3-ae9839aa4a6cn@googlegroups.com>
<ua0mr1$299bi$1@dont-email.me>
<432550b0-4d6e-4854-82f6-89b40eb8756dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 28 Jul 2023 17:04:58 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="493678823fe307ff9a2907dc4cd1e122";
logging-data="2418327"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX191zXcJvMyieEVc8LuYmdve41v/HoDR3Hg="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:ttj775UIe20sfcKsernenBJcoME=
 by: Kaz Kylheku - Fri, 28 Jul 2023 17:04 UTC

On 2023-07-28, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> On Friday, 28 July 2023 at 16:28:48 UTC+1, Bart wrote:
>> On 28/07/2023 14:50, Malcolm McLean wrote:
>> > On Friday, 28 July 2023 at 09:52:55 UTC+1, Keith Thompson wrote:
>> >> jak <nos...@please.ty> writes:
>> >>> Malcolm McLean wrote:
>> >>>> On Thursday, 27 July 2023 at 21:51:47 UTC+1, Keith Thompson wrote:
>> >>>>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>> >>>>>> On Thursday, 27 July 2023 at 19:22:16 UTC+1, Bart wrote:
>> >>>>>>> On 27/07/2023 18:55, Malcolm McLean wrote:
>> >>>>> [...]
>> >>>>>>>> I'm trying to write a function
>> >>>>>>>> char *loadasutf8(const char *filename, int *error)
>> >>>>>>>> which will accept any text file, smart detect the format, and
>> >>>>>>>> return the contents in UTF-8. So I'm looking for Unicode files in
>> >>>>>>>> various formats to test it on.
>> >>>>>>> Most text files will already be either UTF8, or ASCII, which is a
>> >>>>>>> subset of UTF8. So what format is there to detect? What formats
>> >>>>>>> might be expected?
>> >>>>>>>
>> >>>>>> Of course. But the spirit of the resource compiler is that it's quite
>> >>>>>> liberal in what it will accept. The vast majority of Unicode files are
>> >>>>>> UTF-8, so it's trivial to write a function to read them and return the
>> >>>>>> contents as UTF-8. But you can also have UTF-16 and UTF-32, with or
>> >>>>>> without a byte order marker, and in little endian or big endian format.
>> >>>>> UTF-8 files are easy to find. (Some of them are also ASCII.)
>> >>>>>
>> >>>>> UTF-16 files are common on Windows, usually little-endian, usually
>> >>>>> with a BOM. (UCS-2 is UTF-16 with no characters outside the Basic
>> >>>>> Multilingual Plane, characters 0-65535).
>> >>>>>
>> >>>>> UTF-32 files are rare.
>> >>>>>
>> >>>>> You can generate all these formats from UTF-8 using the "iconv"
>> >>>>> command.
>> >>>>>
>> >>>> Wow, iconv --list gives far more encodings than I could possibly
>> >>>> support. But that solves the problem.
>> >>>
>> >>> You are starting a battle that you will lose like all those who have
>> >>> started it. There is no way to identify the format of a text with
>> >>> safety (especially if it is imported from the web). Allow me to
>> >>> recommend you to ask your users to convert their files into UTF-8
>> >>> format.
>> >> There's no 100% reliable way to do so, but there are some very good
>> >> heuristics. The file(1) command does a reasonably good job.
>> >>
>> >> Anything with a BOM is identifiable. (I agree that UTF-8 files
>> >> generally should not have a BOM, but if it's there you might as well
>> >> take advantage of it.) If there's no BOM and the first N bytes are in
>> >> the range 0..127, ASCII is a reasonable assumption; otherwise, if the
>> >> first N bytes are valid UTF-8, UTF-8 is a reasonable assumption.
>> >>
>> >> (Background: a BOM, Byte Order Mark, is a character U+FEFF, ZERO WIDTH
>> >> NO-BREAK SPACE, at the beginning of a text file. It's represented as FE
>> >> FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in
>> >> UTF-8.)
>> >>
>> >> If a tool is going to be used on Windows, rejecting Windows-native text
>> >> files seems unnecessarily hostile.
>> >>
>> > The route I took was to borrow someone else's heuristic text format
>> > detection code. It seemed to perform pretty well on tests. I might add
>> > a "format" attribute to the utf8 tag to allow people to force it to load
>> > in a certain format if the automatic detection fails. However it needs a
>> > lot of documentation and isn't very discoverable.
>> >
>> > Since the main point of Unicode is to support non-English languages,
>> > there's now an "international" tag which supports internationalisation.
>> > Strings and UTF-8 tags now have a "language" attribute. The resource
>> > compiler then writes a simple little function which takes a string
>> > representing the language wanted and returns the resource. This is a big
>> > change in scope since the resource compiler is now writing executable code.
>> >
>> > As I said, I don't work with text data professionally. So I'm very open
>> > to comments from people with more experience.
>> >
>> > The Baby X resource compiler is here.
>> >
>> > https://github.com/MalcolmMcLean/babyxrc
>> >
>> I took hello.c and saved it variously as UTF8, UTF8 with BOM, UTF16LE
>> and UTF16BE (the choices from Notepad were limited).
>>
>> gcc only worked properly with the UTF8 files, with or without BOM. With
>> the others, it saw 'nul' every other character, but the special BOM on
>> those files gave errors.
>>
>> Since presumably your product will be used by programmers in conjunction
>> with C source programs, is there any point in supporting a format that
>> is not also supported by the C compiler that will be needed?
>>
>> Or will these diverse formats be encountered at the application level by
>> end-users?
>>
> You've got the wrong end of the stick.
>
> The Baby X resource compiler loads resource files (images, audio, or what we
> are concerned with here, text) and saves them as portable ANSI C. So the .c
> output files are ascii.
> However the text data might be Unicode. So instead of using a string tag, which
> creates a C string literal, the user uses a "utf8" tag which writes the data in UTF-8,
> as an array of chars.

UTF-8 data can be a string literal. Firstly, it's commonly supported
directly nowadays, as in plonk UTF-8 between quotes and you're done.

Secondly, you can emit it as escapes, or make it optional whether
that is done or raw:

const char *utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // 日本語

The raw UTF-8 can be (optionally) in a comment.
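
A minimal sketch of how a generator could emit such a literal. The names
emit_utf8_literal and append are hypothetical and are not taken from the Baby X
sources. One detail to watch: a \x escape swallows any hex digits that follow
it, so the sketch closes and reopens the literal when an escape is followed by
an ASCII hex digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Appends s to out at offset *n; returns -1 if it does not fit. */
static int append(char *out, size_t outsize, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (*n + len + 1 > outsize)
        return -1;
    memcpy(out + *n, s, len + 1);
    *n += len;
    return 0;
}

/* Hypothetical helper: writes utf8 into out as an ASCII-only C string
   literal. Bytes >= 0x80 become \xNN escapes. Returns the number of
   characters written (excluding the NUL), or -1 if out is too small. */
int emit_utf8_literal(const char *utf8, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;
    int after_hex = 0;   /* the previous item was a \xNN escape */
    char tmp[8];

    assert(utf8 != NULL && out != NULL);
    if (append(out, outsize, &n, "\"") != 0)
        return -1;
    for (; *p != '\0'; p++) {
        if (*p >= 0x80) {
            sprintf(tmp, "\\x%02X", (unsigned int)*p);
            after_hex = 1;
        } else {
            /* "\xA5a" would be read as one escape, so split the literal. */
            if (after_hex && isxdigit(*p)) {
                if (append(out, outsize, &n, "\" \"") != 0)
                    return -1;
            }
            if (*p == '"' || *p == '\\')
                sprintf(tmp, "\\%c", *p);
            else if (*p < 0x20 || *p == 0x7F)
                sprintf(tmp, "\\%03o", (unsigned int)*p);  /* 3-digit octal */
            else
                sprintf(tmp, "%c", *p);
            after_hex = 0;
        }
        if (append(out, outsize, &n, tmp) != 0)
            return -1;
    }
    if (append(out, outsize, &n, "\"") != 0)
        return -1;
    return (int)n;
}
```

Given the bytes E6 97 A5 followed by "abc", this writes "\xE6\x97\xA5" "abc",
which the compiler concatenates back into a single string.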

Unless you plan on supporting multiple encodings in the same file,
there is no point in requiring users to annotate the encoding of
individual strings.

Unicode is supposed to eliminate that "balkanization", whereby the same
set of data wrestles with multiple encodings.

The only thing you have to indicate in Unicode resource files is the
language. You have certain strings in a UI, such as "Enter your
password", which can then have multiple translations into different
languages. You have to indicate that that one is German, that one is
Korean and so on. But everything is Unicode otherwise.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Re: Unicode test suite

<87fs57op51.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=26779&group=comp.lang.c#26779

From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Fri, 28 Jul 2023 11:40:58 -0700
 by: Keith Thompson - Fri, 28 Jul 2023 18:40 UTC

Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
[...]
> You've got the wrong end of the stick.
>
> The Baby X resource compiler loads resource files (images, audio, or
> what we are concerned with here, text) and saves them as portable ANSI
> C. So the .c output files are ascii.
> However the text data might be Unicode. So instead of using a string
> tag, which creates a C string literal, the user uses a "utf8" tag
> which writes the data in UTF-8, as an array of chars.
> However a good question is whether most C compilers will accept UTF-8.
> UTF-8 string literals are better than hex dumped UTF-8 data, because
> they are human-readable. However the standard frowns on UTF-8 in
> strings.

C11 introduced UTF-8 string literals. N1570 6.4.5.
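
A small sketch of what that looks like. The function name nihongo_utf8 is
made up for illustration. Writing the characters as universal character names
keeps the source file pure ASCII, and the cast allows for the literal being
char[] in C11 and C17 while C23 changes the element type to char8_t (an
unsigned char type).

```c
#include <assert.h>
#include <string.h>

/* Returns the UTF-8 encoding of the three characters U+65E5 U+672C U+8A9E.
   The u8 prefix asks for UTF-8 regardless of the execution character set. */
const char *nihongo_utf8(void)
{
    const char *s = (const char *)u8"\u65E5\u672C\u8A9E";

    assert(strlen(s) == 9);   /* 3 characters, 3 bytes each */
    return s;
}
```

The bytes produced are E6 97 A5 E6 9C AC E8 AA 9E.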

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Unicode test suite

<a5666e5c-404e-4e2b-ad83-58ad1370b0ebn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26793&group=comp.lang.c#26793

From: malcolm....@gmail.com (Malcolm McLean)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 05:56:03 -0700 (PDT)
 by: Malcolm McLean - Sat, 29 Jul 2023 12:56 UTC

On Friday, 28 July 2023 at 19:41:14 UTC+1, Keith Thompson wrote:
> Malcolm McLean <malcolm.ar...@gmail.com> writes:
> [...]
> > You've got the wrong end of the stick.
> >
> > The Baby X resource compiler loads resource files (images, audio, or
> > what we are concerned with here, text) and saves them as portable ANSI
> > C. So the .c output files are ascii.
> > However the text data might be Unicode. So instead of using a string
> > tag, which creates a C string literal, the user uses a "utf8" tag
> > which writes the data in UTF-8, as an array of chars.
> > However a good question is whether most C compilers will accept UTF-8.
> > UTF-8 string literals are better than hex dumped UTF-8 data, because
> > they are human-readable. However the standard frowns on UTF-8 in
> > strings.
> C11 introduced UTF-8 string literals. N1570 6.4.5.
>
I just researched this. I'd never heard of "u8" before (I don't do text processing).
Then I read this on Wikipedia:

"Since C++20 and C23, a char8_t type was added that is meant to store UTF-8
characters and the types of u8 prefixed character and string literals were
changed to char8_t and char8_t[] respectively."

If I wanted to deal with this sort of thing then I'd program in C++. I learned
about a new feature and found that it was obsolete within the space of five minutes.

I don't want the Baby X resource compiler to break.
However it would be nice to get human-readable Unicode into the C source
files. But not at the cost of things not working.

Incidentally "u8" will be of interest to Bart. Apparently it is "obviously" an
unsigned 8 bit type. In C11, it means "utf-8 data". Not quite the same thing.

Re: Unicode test suite

<ua34qf$2kl06$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26794&group=comp.lang.c#26794

From: bc...@freeuk.com (Bart)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 14:39:28 +0100
 by: Bart - Sat, 29 Jul 2023 13:39 UTC

On 29/07/2023 13:56, Malcolm McLean wrote:
> On Friday, 28 July 2023 at 19:41:14 UTC+1, Keith Thompson wrote:
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>> [...]
>>> You've got the wrong end of the stick.
>>>
>>> The Baby X resource compiler loads resource files (images, audio, or
>>> what we are concerned with here, text) and saves them as portable ANSI
>>> C. So the .c output files are ascii.
>>> However the text data might be Unicode. So instead of using a string
>>> tag, which creates a C string literal, the user uses a "utf8" tag
>>> which writes the data in UTF-8, as an array of chars.
>>> However a good question is whether most C compilers will accept UTF-8.
>>> UTF-8 string literals are better than hex dumped UTF-8 data, because
>>> they are human-readable. However the standard frowns on UTF-8 in
>>> strings.
>> C11 introduced UTF-8 string literals. N1570 6.4.5.
>>
> I just researched this. I'd never heard of "u8" before (I don't do
> text processing).
> Then I read this on Wikipedia:
>
> "Since C++20 and C23, a char8_t type was added that is meant to store
> UTF-8 characters and the types of u8 prefixed character and string
> literals were changed to char8_t and char8_t[] respectively."
>
> If I wanted to deal with this sort of thing then I'd program in C++. I
> learned about a new feature and found that it was obsolete within the
> space of five minutes.
>
> I don't want the Baby X resource compiler to break.
> However it would be nice to get human-readable Unicode into the C source
> files. But not at the cost of things not working.
>
> Incidentally "u8" will be of interest to Bart. Apparently it is
> "obviously" an unsigned 8 bit type. In C11, it means "utf-8 data". Not
> quite the same thing.

In C11, "u8" is a valid user identifier, so that you can do this:

typedef unsigned char u8;

But it can also be a prefix to a string literal. I believe it's called a
'contextual keyword'.

I have something similar:

f"c:\abc\def.g" # 'f' denotes a raw string literal

int f # f is also a variable (a 64-bit one!)

Actually, I can't see the point of such a prefix in C. All you have to
do is to decree that string literals are UTF8 anyway.

In any case, a C string literal defines a sequence of *any* byte values,
which may or may not represent text. UTF8 extension bytes are also likely to
be negative values. It's a mess you don't really want to stir up too much.
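
To illustrate the negative-value point, here is a short sketch (the function
name count_non_ascii is invented for this example). Where plain char is
signed, every byte of a multi-byte UTF-8 sequence reads as a negative number,
so a comparison against 0x80 only works after converting to unsigned char.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the bytes in s that lie outside the ASCII range. */
size_t count_non_ascii(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* With a signed char, *s >= 0x80 is never true; the cast makes
           the test behave the same on every implementation. */
        if ((unsigned char)*s >= 0x80)
            count++;
    }
    return count;
}
```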

Re: Unicode test suite

<20230729070730.784@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=26798&group=comp.lang.c#26798

From: 864-117-...@kylheku.com (Kaz Kylheku)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 14:11:51 -0000 (UTC)
 by: Kaz Kylheku - Sat, 29 Jul 2023 14:11 UTC

On 2023-07-29, Malcolm McLean <malcolm.arthur.mclean@gmail.com> wrote:
> "Since C++20 and C23, a char8_t type was added that is meant to store UTF-8
> characters and the types of u8 prefixed character and string literals were
> changed to char8_t and char8_t[] respectively."

The author of that useless cruft is still mentally living in a world
of balkanized internationalization, where one deals with multiple
encodings and character sets in the same situation.

You can handle UTF-8 in a purely C90 program. That's deliberate.
Ken Thompson and Rob Pike would not have designed an encoding scheme
that needed special literals and character types!
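
As a rough illustration of that claim, the following sketch counts the
characters in a UTF-8 string using nothing beyond C90. The name utf8_count is
hypothetical, and the function assumes its input is already valid UTF-8. It
relies on the fact that every continuation byte has the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a NUL-terminated string assumed to be valid
   UTF-8, by counting the bytes that are not continuation bytes. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
        s++;
    }
    return n;
}
```

For the nine bytes of "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" this returns 3,
while strlen returns 9.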

Re: Unicode test suite

<730a57a8-18bc-49e6-9722-7cdc6786b7ffn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26802&group=comp.lang.c#26802

From: malcolm....@gmail.com (Malcolm McLean)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 11:01:36 -0700 (PDT)
 by: Malcolm McLean - Sat, 29 Jul 2023 18:01 UTC

On Saturday, 29 July 2023 at 15:12:05 UTC+1, Kaz Kylheku wrote:
> On 2023-07-29, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > "Since C++20 and C23, a char8_t type was added that is meant to store UTF-8
> > characters and the types of u8 prefixed character and string literals were
> > changed to char8_t and char8_t[] respectively."
> The author of that useless cruft is still mentally living in a world
> of balkanized internationalization, where one deals with multiple
> encodings and character sets in the same situation.
>
> You can handle UTF-8 in a purely C90 program. That's deliberate.
> Ken Thompson and Rob Pike would not have designed an encoding scheme
> that needed special literals and character types!
>
Unfortunately UTF-8-naive routines usually, but not always, work with UTF-8
input. By using a special type, you can indicate that the function is
expecting UTF-8. However very many functions which take strings put
some restrictions on which strings are acceptable as arguments, and
UTF-8 isn't really in a special category in this respect.
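
Without a special type, one option is to check the restriction at run time
where the string enters the program. The sketch below is only an
illustration, with a made-up name utf8_is_valid. It rejects stray
continuation bytes, truncated sequences, overlong forms, surrogates, and
values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if str is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;

    assert(str != NULL);
    while (*s != 0) {
        unsigned long cp;
        int extra, i;

        if (*s < 0x80) {
            s++;
            continue;
        } else if ((*s & 0xE0) == 0xC0) {
            cp = *s & 0x1F; extra = 1;
        } else if ((*s & 0xF0) == 0xE0) {
            cp = *s & 0x0F; extra = 2;
        } else if ((*s & 0xF8) == 0xF0) {
            cp = *s & 0x07; extra = 3;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }
        s++;
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return 0;   /* also catches a NUL inside a sequence */
            cp = (cp << 6) | (unsigned long)(*s & 0x3F);
        }
        if (extra == 1 && cp < 0x80) return 0;       /* overlong forms */
        if (extra == 2 && cp < 0x800) return 0;
        if (extra == 3 && cp < 0x10000) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
    }
    return 1;
}
```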

Then ASCII files can't display human-readable Unicode. Which is a
disadvantage. But any C compiler will compile ASCII source and any
editor will display it. I'm not so sure what the situation is with UTF-8
source. clang is fine with it, of course.

Re: Unicode test suite

<6df7ccb9-c2c7-4d34-ae99-e400a73a77efn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26803&group=comp.lang.c#26803

From: profesor...@gmail.com (fir)
Newsgroups: comp.lang.c
Subject: Re: Unicode test suite
Date: Sat, 29 Jul 2023 11:20:00 -0700 (PDT)
 by: fir - Sat, 29 Jul 2023 18:20 UTC

On Thursday, 27 July 2023 at 20:50:09 UTC+2, Kaz Kylheku wrote:
> On 2023-07-27, Malcolm McLean <malcolm.ar...@gmail.com> wrote:
> > Lynn's comment inspired me to add Unicode support to the Baby X
> > resource compiler. But despite searching for quite a long time, I
> > can't find a test suite of Unicode files in various formats. In fact
> > it's hard to find any Unicode files at all which are not UTF-8.
> It's a fool's errand to support Unicode formats other than UTF-8.
>
I don't know what "fool's errand" means, but it probably seems so. I don't
remember what I was saying back then on this, but if UTF-8 becomes the standard
in use it is probably fortunate to stay with it, since it is the most compatible
with ASCII.

By the way, if UTF-8 becomes the de facto standard, is there also a de facto
standard for a single character?

Is this just a 32-bit integer which is ASCII below 127 and the official number
from the Unicode tables above that? I mean, do things like a left arrow have
one official integer value?
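
For what it's worth, each Unicode character does have one assigned number (its
code point), and the left arrow is U+2190 LEFTWARDS ARROW. The sketch below,
with a hypothetical function name utf8_encode, shows how such a number maps
to UTF-8 bytes. U+2190 comes out as E2 86 90.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes code point cp as UTF-8 into out, which must have room for
   4 bytes. Returns the number of bytes written, or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```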
