
Subject (Author)
* Character sets (Simon Clubley)
+* Re: Character sets (Arne Vajhøj)
|`* Re: Character sets (Simon Clubley)
| `* Re: Character sets (Arne Vajhøj)
|  `- Re: Character sets (Johnny Billquist)
+* Re: Character sets (Johnny Billquist)
|`- Re: Character sets (John Dallman)
+* Re: Character sets (Stephen Hoffman)
|+- Re: Character sets (Craig A. Berry)
|`* Re: Character sets (Arne Vajhøj)
| +* Re: Character sets (Stephen Hoffman)
| |`* Re: Character sets (Arne Vajhøj)
| | `- Re: Character sets (Stephen Hoffman)
| `* Re: Character sets (Johnny Billquist)
|  `* Re: Character sets (Arne Vajhøj)
|   `- Re: Character sets (Stephen Hoffman)
`* Re: Character sets (Galen)
 `- Re: Character sets (Simon Clubley)

Character sets

<teth8e$2j8nv$2@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24684&group=comp.os.vms#24684
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: club...@remove_me.eisner.decus.org-Earth.UFP (Simon Clubley)
Newsgroups: comp.os.vms
Subject: Character sets
Date: Fri, 2 Sep 2022 18:15:42 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <teth8e$2j8nv$2@dont-email.me>
Injection-Date: Fri, 2 Sep 2022 18:15:42 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="dd60c1bdccca50303f6241c6f7784ed0";
logging-data="2728703"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ONO8LXVsit2e53iuvMx4Q3bOPgVCLcKY="
User-Agent: slrn/0.9.8.1 (VMS/Multinet)
Cancel-Lock: sha1:2xTKGAfTwg+hawhQzkMIRVRMC20=
 by: Simon Clubley - Fri, 2 Sep 2022 18:15 UTC

On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
> On 2022-09-02 15:16, Simon Clubley wrote:
>> PS: I do now understand why this was done, but at the same time, for any
>> VMS systems still doing this, it could easily give the impression to people
>> not familiar with VMS of how once again "that VMS system is different from
>> all the other systems we use."
>
> I can give you a program for Linux right now, that also expects
> ISO-646-SE, in case you really insist on thinking that this has anything
> to do with VMS.
>

And how many Linux programmers would even know that such a thing exists,
let alone have any need to use it ?

You could also do a version of Emacs (for example) that outputs EBCDIC
codes instead of one of the normal character sets when run on Linux.
How useful would that be to normal Linux users ? :-)

BTW, it's to do with VMS because VMS is the host OS for the applications
that still use these 7-bit national character sets today.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.

Re: Character sets

<631271ee$0$699$14726298@news.sunsite.dk>

https://www.novabbs.com/computers/article-flat.php?id=24690&group=comp.os.vms#24690
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Date: Fri, 2 Sep 2022 17:13:17 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.13.0
Subject: Re: Character sets
Content-Language: en-US
Newsgroups: comp.os.vms
References: <teth8e$2j8nv$2@dont-email.me>
From: arn...@vajhoej.dk (Arne Vajhøj)
In-Reply-To: <teth8e$2j8nv$2@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 29
Message-ID: <631271ee$0$699$14726298@news.sunsite.dk>
Organization: SunSITE.dk - Supporting Open source
NNTP-Posting-Host: 420b5232.news.sunsite.dk
X-Trace: 1662153198 news.sunsite.dk 699 arne@vajhoej.dk/68.9.63.232:54262
X-Complaints-To: staff@sunsite.dk
 by: Arne Vajhøj - Fri, 2 Sep 2022 21:13 UTC

On 9/2/2022 2:15 PM, Simon Clubley wrote:
> On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
>> On 2022-09-02 15:16, Simon Clubley wrote:
>>> PS: I do now understand why this was done, but at the same time, for any
>>> VMS systems still doing this, it could easily give the impression to people
>>> not familiar with VMS of how once again "that VMS system is different from
>>> all the other systems we use."
>>
>> I can give you a program for Linux right now, that also expects
>> ISO-646-SE, in case you really insist on thinking that this has anything
>> to do with VMS.
>
> And how many Linux programmers would even know that such a thing exists,
> let alone have any need to use it ?
>
> You could also do a version of Emacs (for example) that outputs EBCDIC
> codes instead of one of the normal character sets when run on Linux.
> How useful would that be to normal Linux users ? :-)
>
> BTW, it's to do with VMS because VMS is the host OS for the applications
> that still use these 7-bit national character sets today.

For *some* of them including the one that triggered this sub thread.

The significance of the example running on VMS is probably small
when the discussion occurs in comp.os.vms.

Arne

Re: Character sets

<tevbmu$8el$1@news.misty.com>

https://www.novabbs.com/computers/article-flat.php?id=24696&group=comp.os.vms#24696
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!.POSTED.185.159.157.200!not-for-mail
From: bqt...@softjar.se (Johnny Billquist)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Sat, 3 Sep 2022 12:53:17 +0200
Organization: MGT Consulting
Message-ID: <tevbmu$8el$1@news.misty.com>
References: <teth8e$2j8nv$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 3 Sep 2022 10:53:18 -0000 (UTC)
Injection-Info: news.misty.com; posting-host="185.159.157.200";
logging-data="8661"; mail-complaints-to="abuse@misty.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.13.0
Content-Language: en-US
In-Reply-To: <teth8e$2j8nv$2@dont-email.me>
 by: Johnny Billquist - Sat, 3 Sep 2022 10:53 UTC

On 2022-09-02 20:15, Simon Clubley wrote:
> On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
>> On 2022-09-02 15:16, Simon Clubley wrote:
>>> PS: I do now understand why this was done, but at the same time, for any
>>> VMS systems still doing this, it could easily give the impression to people
>>> not familiar with VMS of how once again "that VMS system is different from
>>> all the other systems we use."
>>
>> I can give you a program for Linux right now, that also expects
>> ISO-646-SE, in case you really insist on thinking that this has anything
>> to do with VMS.
>>
>
> And how many Linux programmers would even know that such a thing exists,
> let alone have any need to use it ?

That has even less relevance. The point is that this has nothing to do
with the OS. It's simply a case of an application assuming that the
user has some specific character set on their terminal.

> You could also do a version of Emacs (for example) that outputs EBCDIC
> codes instead of one of the normal character sets when run on Linux.
> How useful would that be to normal Linux users ? :-)

If you had a user with a terminal that speaks EBCDIC, it could
potentially be very useful. Is that a normal Linux user? Probably not.
That doesn't make it less useful for the person in that situation, and
the fact that it happens on Linux is actually irrelevant.

> BTW, it's to do with VMS because VMS is the host OS for the applications
> that still use these 7-bit national character sets today.

It happens because we're talking about old software that has not been
rewritten, and that in turn leads to people needing terminals
that can show this correctly. PuTTY can't, and that was the
complaint against PuTTY. The same would be equally true if you found an
old program for any kind of Unix, written to use such character sets.
And yes, they do exist.

Johnny

Re: Character sets

<memo.20220903130300.16092M@jgd.cix.co.uk>

https://www.novabbs.com/computers/article-flat.php?id=24700&group=comp.os.vms#24700
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Sat, 3 Sep 2022 13:03 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <memo.20220903130300.16092M@jgd.cix.co.uk>
References: <tevbmu$8el$1@news.misty.com>
Reply-To: jgd@cix.co.uk
Injection-Info: reader01.eternal-september.org; posting-host="7bbd5229412a456cb2efa5d311a8cbb7";
logging-data="3028526"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+uRkyU70M5pjkOsSa4w954iBruOos8ijE="
Cancel-Lock: sha1:UcKk994U5jCrAxl2BH0zEQi5Diw=
 by: John Dallman - Sat, 3 Sep 2022 12:03 UTC

In article <tevbmu$8el$1@news.misty.com>, bqt@softjar.se (Johnny
Billquist) wrote:

> On 2022-09-02 20:15, Simon Clubley wrote:
> > You could also do a version of Emacs (for example) that outputs
> > EBCDIC codes instead of one of the normal character sets when run
> > on Linux. How useful would that be to normal Linux users ? :-)
>
> If you had a user with a terminal that speaks EBCDIC, it could
> potentially be very useful. Is that a normal Linux user? Probably
> not. That doesn't make it less useful for the person in that
> situation, and the fact that it happens on Linux is actually irrelevant.

Yup. When you run Linux on IBM mainframes, it uses ASCII and/or UTF-8,
like any other Linux. Trying to convert Linux to run with EBCDIC would
not have been sensible.

John

Re: Character sets

<tf0a0b$2vd02$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24705&group=comp.os.vms#24705
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: seaoh...@hoffmanlabs.invalid (Stephen Hoffman)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Sat, 3 Sep 2022 15:30:19 -0400
Organization: HoffmanLabs LLC
Lines: 44
Message-ID: <tf0a0b$2vd02$1@dont-email.me>
References: <teth8e$2j8nv$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader01.eternal-september.org; posting-host="1629460f0413cdf411aaa9236f1556ec";
logging-data="3126274"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++kMSKLB1E5zwScFQNKzsuno3yWC16gCk="
User-Agent: Unison/2.2
Cancel-Lock: sha1:/iBXju6GPJ4l4gXwY7cPRM/aB70=
 by: Stephen Hoffman - Sat, 3 Sep 2022 19:30 UTC

On 2022-09-02 18:15:42 +0000, Simon Clubley said:

> And how many Linux programmers would even know that such a thing
> (predecessors to ISO 8859, and other character sets) exists, let alone
> have any need to use it ?

I'm not aware of how widespread app usage of national character sets
might be on OpenVMS, but DEC MCS / ISO Latin 1 and ODS-5 usage is
arguably still "bleeding edge" on OpenVMS, and app and OpenVMS adoption
of UTF-8 encoding is, well, negligible.

Migrating apps away from IBM EBCDIC, UNIVAC FIELDATA, DEC RADIX 50, or
whatever encoding is entirely up to app developers and maintainers.

OpenVMS has RADIX 50 tucked away in a few very dusty corners too,
albeit still somewhat user-visible. But I digress.

Getting existing apps with older encodings migrated to current
encodings including to UTF-8 is also largely left to app developers.

Or, in the case of OpenVMS, it is left entirely to them, as OpenVMS has
neither adopted nor integrated Unicode or UTF-8 into the platform and
its run-time libraries. Nor are any VSI plans to upgrade DCL to support
UTF-8 at all likely to arise.

Little (nothing?) past the ODS-5 UTF-8 filename work exists with
OpenVMS, and—as with most of the retrofit-compatible-hackery—that's
less than easy for apps to use. You'll probably be using or porting
recent versions of ICU, libunistring, or ilk, and the OpenVMS 32- and
64-bit string descriptors are unfortunately also less than useful here
around language and encoding. (This is where the object abstraction
shines, too. It's what descriptors and itemlists evolved into, on other
platforms.)

Pedant notes: yes, I do know about wchar_t and friends in C and C++,
which is... a mess, and is also ill-suited for UTF-8. Probably better
to use char16_t and char32_t, if you do need fixed-width wide character
storage. Here also intentionally excluding Java from this discussion,
because, well, Java, and a central tenet of Java being efforts to
isolate and exclude the platform from most app-related considerations.

--
Pure Personal Opinion | HoffmanLabs LLC

Re: Character sets

<tf3aeq$3bcph$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24722&group=comp.os.vms#24722
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: craigbe...@nospam.mac.com (Craig A. Berry)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Sun, 4 Sep 2022 17:56:25 -0500
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <tf3aeq$3bcph$1@dont-email.me>
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 4 Sep 2022 22:56:27 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="972f410a7ad06a53c044db062959054a";
logging-data="3519281"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+T/9XJch+QRbNctUDOJw6cm7TtJNJCB5E="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.13.0
Cancel-Lock: sha1:po7w8crYifwlVgNQlkoE/+vsdac=
In-Reply-To: <tf0a0b$2vd02$1@dont-email.me>
Content-Language: en-US
 by: Craig A. Berry - Sun, 4 Sep 2022 22:56 UTC

On 9/3/22 2:30 PM, Stephen Hoffman wrote:

> Little (nothing?) past the ODS-5 UTF-8 filename work exists with
> OpenVMS, and—as with most of the retrofit-compatible-hackery—that's less
> than easy for apps to use. You'll probably be using or porting recent
> versions of ICU, libunistring, or ilk, and the OpenVMS 32- and 64-bit
> string descriptors are unfortunately also less than useful here around
> language and encoding. (This is where the object abstraction shines,
> too. It's what descriptors and itemlists evolved into, on other platforms.)

At the app level, yes, you have to have a library that works with
whatever language your app is written in. At the utility level, Perl is
available and has piconv, which does a bit more than the iconv that
ships with the SYS$I18N stuff. Or you can write your apps in Perl. I
forget how Python handles Unicode, but people who do Python tell me its
Unicode handling is significantly less lame with Python 3 than it was
with Python 2.

Re: Character sets

<tf4s19$3j36p$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24725&group=comp.os.vms#24725
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: club...@remove_me.eisner.decus.org-Earth.UFP (Simon Clubley)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Mon, 5 Sep 2022 13:02:33 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <tf4s19$3j36p$1@dont-email.me>
References: <teth8e$2j8nv$2@dont-email.me> <631271ee$0$699$14726298@news.sunsite.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 5 Sep 2022 13:02:33 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="3bb57999968c129d03a3ce8a5579a636";
logging-data="3771609"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19gs7dpX6rHoJOxaDxFhbUffUFGCM+/rUc="
User-Agent: slrn/0.9.8.1 (VMS/Multinet)
Cancel-Lock: sha1:w11hKT4X+uiYEriPn7JXa0x/hpg=
 by: Simon Clubley - Mon, 5 Sep 2022 13:02 UTC

On 2022-09-02, Arne Vajhøj <arne@vajhoej.dk> wrote:
> On 9/2/2022 2:15 PM, Simon Clubley wrote:
>> On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
>>> On 2022-09-02 15:16, Simon Clubley wrote:
>>>> PS: I do now understand why this was done, but at the same time, for any
>>>> VMS systems still doing this, it could easily give the impression to people
>>>> not familiar with VMS of how once again "that VMS system is different from
>>>> all the other systems we use."
>>>
>>> I can give you a program for Linux right now, that also expects
>>> ISO-646-SE, in case you really insist on thinking that this has anything
>>> to do with VMS.
>>
>> And how many Linux programmers would even know that such a thing exists,
>> let alone have any need to use it ?
>>
>> You could also do a version of Emacs (for example) that outputs EBCDIC
>> codes instead of one of the normal character sets when run on Linux.
>> How useful would that be to normal Linux users ? :-)
>>
>> BTW, it's to do with VMS because VMS is the host OS for the applications
>> that still use these 7-bit national character sets today.
>
> For *some* of them including the one that triggered this sub thread.
>
> The significance of the example running on VMS is probably small
> when the discussion occurs in comp.os.vms.
>

What other operating systems do you believe are currently hosting
applications today that require 7-bit national character sets ?

If you include Linux in that list, the next question I will ask is
how does that tie up with the fact those character sets were considered
obsolete before Linux even existed ?

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.

Re: Character sets

<6315f8d6$0$698$14726298@news.sunsite.dk>

https://www.novabbs.com/computers/article-flat.php?id=24726&group=comp.os.vms#24726
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Date: Mon, 5 Sep 2022 09:25:37 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.13.0
Subject: Re: Character sets
Content-Language: en-US
Newsgroups: comp.os.vms
References: <teth8e$2j8nv$2@dont-email.me>
<631271ee$0$699$14726298@news.sunsite.dk> <tf4s19$3j36p$1@dont-email.me>
From: arn...@vajhoej.dk (Arne Vajhøj)
In-Reply-To: <tf4s19$3j36p$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 60
Message-ID: <6315f8d6$0$698$14726298@news.sunsite.dk>
Organization: SunSITE.dk - Supporting Open source
NNTP-Posting-Host: b414e434.news.sunsite.dk
X-Trace: 1662384342 news.sunsite.dk 698 arne@vajhoej.dk/68.9.63.232:51293
X-Complaints-To: staff@sunsite.dk
 by: Arne Vajhøj - Mon, 5 Sep 2022 13:25 UTC

On 9/5/2022 9:02 AM, Simon Clubley wrote:
> On 2022-09-02, Arne Vajhøj <arne@vajhoej.dk> wrote:
>> On 9/2/2022 2:15 PM, Simon Clubley wrote:
>>> On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
>>>> On 2022-09-02 15:16, Simon Clubley wrote:
>>>>> PS: I do now understand why this was done, but at the same time, for any
>>>>> VMS systems still doing this, it could easily give the impression to people
>>>>> not familiar with VMS of how once again "that VMS system is different from
>>>>> all the other systems we use."
>>>>
>>>> I can give you a program for Linux right now, that also expects
>>>> ISO-646-SE, in case you really insist on thinking that this has anything
>>>> to do with VMS.
>>>
>>> And how many Linux programmers would even know that such a thing exists,
>>> let alone have any need to use it ?
>>>
>>> You could also do a version of Emacs (for example) that outputs EBCDIC
>>> codes instead of one of the normal character sets when run on Linux.
>>> How useful would that be to normal Linux users ? :-)
>>>
>>> BTW, it's to do with VMS because VMS is the host OS for the applications
>>> that still use these 7-bit national character sets today.
>>
>> For *some* of them including the one that triggered this sub thread.
>>
>> The significance of the example running on VMS is probably small
>> when the discussion occurs in comp.os.vms.
>>
>
> What other operating systems do you believe are currently hosting
> applications today that require 7-bit national character sets ?

There is no special relationship between those 7 bit national
variants and VMS.

So I would expect about the same percentage of applications of the
relevant type to use them across all OSes.

The relevant applications are those developed before the early 90's,
in a non-English-speaking but Latin-alphabet-using country, on a
non-EBCDIC platform.

VMS, AIX, HP-UX, SunOS, Irix, Ultrix, SCO Unix, original BSD etc.

Many of the Unix applications would be migrated to Linux later, but
for cost-saving reasons the migration was done 1:1. It happens.

> If you include Linux in that list, the next question I will ask is
> how does that tie up with the fact those character sets were considered
> obsolete before Linux even existed ?

You have already been told that such a Linux application exists.

As for why: maybe the application was originally developed
on some Unix and later migrated to Linux, or maybe the developers just
continued with old habits.

Arne

Re: Character sets

<tf58cs$e21$1@news.misty.com>

https://www.novabbs.com/computers/article-flat.php?id=24727&group=comp.os.vms#24727
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!.POSTED.185.159.157.200!not-for-mail
From: bqt...@softjar.se (Johnny Billquist)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Mon, 5 Sep 2022 18:33:31 +0200
Organization: MGT Consulting
Message-ID: <tf58cs$e21$1@news.misty.com>
References: <teth8e$2j8nv$2@dont-email.me>
<631271ee$0$699$14726298@news.sunsite.dk> <tf4s19$3j36p$1@dont-email.me>
<6315f8d6$0$698$14726298@news.sunsite.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 5 Sep 2022 16:33:32 -0000 (UTC)
Injection-Info: news.misty.com; posting-host="185.159.157.200";
logging-data="14401"; mail-complaints-to="abuse@misty.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.13.0
Content-Language: en-US
In-Reply-To: <6315f8d6$0$698$14726298@news.sunsite.dk>
 by: Johnny Billquist - Mon, 5 Sep 2022 16:33 UTC

On 2022-09-05 15:25, Arne Vajhøj wrote:
> On 9/5/2022 9:02 AM, Simon Clubley wrote:
>> On 2022-09-02, Arne Vajhøj <arne@vajhoej.dk> wrote:
>>> On 9/2/2022 2:15 PM, Simon Clubley wrote:
>>>> On 2022-09-02, Johnny Billquist <bqt@softjar.se> wrote:
>>>>> On 2022-09-02 15:16, Simon Clubley wrote:
>>>>>> PS: I do now understand why this was done, but at the same time, for any
>>>>>> VMS systems still doing this, it could easily give the impression to people
>>>>>> not familiar with VMS of how once again "that VMS system is different from
>>>>>> all the other systems we use."
>>>>>
>>>>> I can give you a program for Linux right now, that also expects
>>>>> ISO-646-SE, in case you really insist on thinking that this has anything
>>>>> to do with VMS.
>>>>
>>>> And how many Linux programmers would even know that such a thing exists,
>>>> let alone have any need to use it ?
>>>>
>>>> You could also do a version of Emacs (for example) that outputs EBCDIC
>>>> codes instead of one of the normal character sets when run on Linux.
>>>> How useful would that be to normal Linux users ? :-)
>>>>
>>>> BTW, it's to do with VMS because VMS is the host OS for the applications
>>>> that still use these 7-bit national character sets today.
>>>
>>> For *some* of them including the one that triggered this sub thread.
>>>
>>> The significance of the example running on VMS is probably small
>>> when the discussion occurs in comp.os.vms.
>>>
>>
>> What other operating systems do you believe are currently hosting
>> applications today that require 7-bit national character sets ?
>
> There is no special relationship between those 7 bit national
> variants and VMS.
>
> So I would expect about the same percentage of applications of the
> relevant type to use them across all OSes.
>
> The relevant applications are those developed before the early 90's,
> in a non-English-speaking but Latin-alphabet-using country, on a
> non-EBCDIC platform.
>
> VMS, AIX, HP-UX, SunOS, Irix, Ultrix, SCO Unix, original BSD etc.
>
> Many of the Unix applications would be migrated to Linux later, but
> for cost-saving reasons the migration was done 1:1. It happens.

Very true. And applications for some of those platforms are definitely
from the age when ISO-646 was still the thing. And I'm sure some of
those applications are still around, probably in the oddest places,
where you'd never even think of looking.

>> If you include Linux in that list, the next question I will ask is
>> how does that tie up with the fact those character sets were considered
>> obsolete before Linux even existed ?
>
> You have already been told that such a Linux application exists.

If that was from me, I was merely saying that I can certainly write one
at a moment's notice if he wants to see one.

> As for why: maybe the application was originally developed
> on some Unix and later migrated to Linux, or maybe the developers just
> continued with old habits.

I would even suggest that "migrate to Linux" is a bit misleading. For
most programs it's just a question of recompiling.

And I still know some people who still use ISO-646 in their mails to me.

Johnny

Re: Character sets

<631794b2$0$703$14726298@news.sunsite.dk>

https://www.novabbs.com/computers/article-flat.php?id=24729&group=comp.os.vms#24729
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Date: Tue, 6 Sep 2022 14:42:53 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.13.0
Subject: Re: Character sets
Content-Language: en-US
Newsgroups: comp.os.vms
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me>
From: arn...@vajhoej.dk (Arne Vajhøj)
In-Reply-To: <tf0a0b$2vd02$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 19
Message-ID: <631794b2$0$703$14726298@news.sunsite.dk>
Organization: SunSITE.dk - Supporting Open source
NNTP-Posting-Host: 7be4447a.news.sunsite.dk
X-Trace: 1662489778 news.sunsite.dk 703 arne@vajhoej.dk/68.9.63.232:50831
X-Complaints-To: staff@sunsite.dk
 by: Arne Vajhøj - Tue, 6 Sep 2022 18:42 UTC

On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
> which is... a mess, and is also ill-suited for UTF-8.  Probably better
> to use char16_t and char32_t, if you do need fixed-width wide character
> storage.

wchar_t has a typically vague C definition, whereas char16_t and char32_t are
much more clearly defined.

But wchar_t got runtime support.

C (and for that matter also C++) IO functions do not
make writing/reading UTF-8 easy.

Newer languages do much better.

Arne

Re: Character sets

<tf8amk$9q8$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24730&group=comp.os.vms#24730
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: seaoh...@hoffmanlabs.invalid (Stephen Hoffman)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Tue, 6 Sep 2022 16:31:16 -0400
Organization: HoffmanLabs LLC
Lines: 81
Message-ID: <tf8amk$9q8$1@dont-email.me>
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me> <631794b2$0$703$14726298@news.sunsite.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader01.eternal-september.org; posting-host="ab58ca4f877f79c43d67266019790af0";
logging-data="10056"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/47aRycHmzZMgZbofcSXpDe1lgrVlR9Ok="
User-Agent: Unison/2.2
Cancel-Lock: sha1:yjcFNOLrbJoiulqFii6ZTbdTokc=
 by: Stephen Hoffman - Tue, 6 Sep 2022 20:31 UTC

On 2022-09-06 18:42:53 +0000, Arne Vajhøj said:

> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>> which is... a mess, and is also ill-suited for UTF-8.  Probably better
>> to use char16_t and char32_t, if you do need fixed-width wide character
>> storage.
>
> wchar_t has a typically vague C definition, whereas char16_t and char32_t are
> much more clearly defined.
>
> But wchar_t got runtime support.

Run-time support which is less than useful for most purposes,
particularly given the definition and the ~portability issues.

> C (and for that matter also C++) IO functions do not make
> writing/reading UTF-8 easy.

The C I/O functions do ~mostly fine.

Semi-recent Clang, else-platform:

$ cc x.c -o x
$ ~/x
hello 🗺
$ cat x.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("hello 🗺\n");
    exit(EXIT_SUCCESS);
}
$

The C character functions are decidedly lacking, but then C character
functions are also lacking for existing ISO Latin 1 / DEC MCS strings,
too.

C++ does better, here.

Extracted from my previous reply and re-posted here:

>> Little (nothing?) past the ODS-5 UTF-8 filename work exists with
>> OpenVMS, and—as with most of the retrofit-compatible-hackery—that's
>> less than easy for apps to use.
>> ** You'll probably be using or porting recent versions of ICU,
>> libunistring, or ilk, **
>> and the OpenVMS 32- and 64-bit string descriptors are unfortunately
>> also less than useful here around language and encoding. (This is where
>> the object abstraction shines, too. It's what descriptors and itemlists
>> evolved into, on other platforms.)

OpenVMS does have an older version of ICU support.

Wondering why the obscurely-named I18N kit is still optional is akin to
wondering why IP networking is still optional. But I digress.

> Newer languages do much better.

Of course. Objective C does far better here too, and that language is
hardly new. As do Perl and Python, as were mentioned by others.

On OpenVMS, BASIC is probably the most obvious candidate for adding UTF-8
and a more general ooverhaul.

But there are oothers using BASIC that would never get oover that, and
would oobject to OOBASIC.

Any retrofit of UTF-8 and/or OO support into the OpenVMS platform
itself is a yet larger effort.

And for now, work with no obvious nor direct payback for VSI.

--
Pure Personal Opinion | HoffmanLabs LLC

Re: Character sets

<6317d8a5$0$705$14726298@news.sunsite.dk>

https://www.novabbs.com/computers/article-flat.php?id=24732&group=comp.os.vms#24732
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Date: Tue, 6 Sep 2022 19:32:46 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.13.0
Subject: Re: Character sets
Content-Language: en-US
Newsgroups: comp.os.vms
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me>
<631794b2$0$703$14726298@news.sunsite.dk> <tf8amk$9q8$1@dont-email.me>
From: arn...@vajhoej.dk (Arne Vajhøj)
In-Reply-To: <tf8amk$9q8$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 74
Message-ID: <6317d8a5$0$705$14726298@news.sunsite.dk>
Organization: SunSITE.dk - Supporting Open source
NNTP-Posting-Host: 782fd3ab.news.sunsite.dk
X-Trace: 1662507173 news.sunsite.dk 705 arne@vajhoej.dk/68.9.63.232:49679
X-Complaints-To: staff@sunsite.dk
 by: Arne Vajhøj - Tue, 6 Sep 2022 23:32 UTC

On 9/6/2022 4:31 PM, Stephen Hoffman wrote:
> On 2022-09-06 18:42:53 +0000, Arne Vajhøj said:
>
>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>> which is... a mess, and is also ill-suited for UTF-8.  Probably
>>> better to use char16_t and char32_t, if you do need fixed-width wide
>>> character storage.
>>
>> wchar_t is a typical C vague definition where char16_t and char32_t
>> are much more clearly defined.
>>
>> But wchar_t got runtime support.
>
> Run-time support which is less than useful for most purposes,
> particularly given the definition and the ~portability issues.

I think I would miss wcs*, isw*, w versions of IO functions.

>> C (and for that matter also C++) IO functions do not make
>> writing/reading UTF-8 easy.
>
> The C I/O functions do ~mostly fine.
>
> Semi-recent Clang, else-platform:
>
>
> $ cc x.c -o x
> $ ~/x
> hello 🗺
> $ cat x.c
> #include <stdio.h>
> #include <stdlib.h>
>
> int  main(void)
> {
>  printf("hello 🗺\n");
>  exit(EXIT_SUCCESS);
> }

That is C IO processing bytes where the application
has put UTF-8 in.
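
A minimal Python sketch of that byte pass-through, for illustration only. The names text, payload and sink are made up for this example, and the BytesIO object merely stands in for a byte-oriented output call such as printf. The point is that the sink never learns what encoding it is carrying.

```python
import io

# The text is turned into UTF-8 bytes once, up front. In the C example
# the compiler did this when it stored the string literal.
text = "hello \U0001F5FA\n"
payload = text.encode("utf-8")

# A byte-oriented sink: it stores whatever bytes it is handed and has
# no notion of which encoding (if any) those bytes are in.
sink = io.BytesIO()
sink.write(payload)

written = sink.getvalue()
print(len(text), "code points ->", len(written), "bytes")
print(written == payload)  # the bytes went through unchanged
```

Nothing in the output path converted anything. Had the application put ISO-8859-1 bytes in instead, the sink would have written those just as happily.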

What is needed is something where the application
passes unicode (wchar_t* or char16_t* or char32_t*)
to an IO function, and it converts to a specified encoding,
UTF-8 or otherwise.

>> Newer languages do much better.
>
> Of course. Objective C does far better here too, and that language is
> hardly new. As do Perl and Python, as were mentioned by others.
>
> On OpenVMS, BASIC is probably the most obvious candidate for adding UTF-8
> and a more general ooverhaul.
>
> But there are oothers using BASIC that would never get oover that, and
> would oobject to OOBASIC.
>
> Any retrofit of UTF-8 and adding UTF-8 and/or OO support into the
> OpenVMS platform is a yet larger effort.

The most obvious languages for adding OO are Basic and Pascal
(other platforms have proven that it works - unlike Fortran
and Cobol, where interest is minimal).

For UTF-8 support I would probably say Pascal, Basic and Cobol.
Not so relevant for Fortran. C/C++ will have to wait for the
standard for a nice solution, though various hacks are already
possible.

Arne

Re: Character sets

<tf8np4$1ips$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24733&group=comp.os.vms#24733

Newsgroups: comp.os.vms
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: seaoh...@hoffmanlabs.invalid (Stephen Hoffman)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Tue, 6 Sep 2022 20:14:28 -0400
Organization: HoffmanLabs LLC
Lines: 78
Message-ID: <tf8np4$1ips$1@dont-email.me>
References: <tf0a0b$2vd02$1@dont-email.me> <631794b2$0$703$14726298@news.sunsite.dk> <tf8amk$9q8$1@dont-email.me> <6317d8a5$0$705$14726298@news.sunsite.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader01.eternal-september.org; posting-host="749b8fe5401f16e34926c5da6cba3fb0";
logging-data="52028"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+f+PgV/7Zz2ndqc43Xiq1MQHiVDGUBcbs="
User-Agent: Unison/2.2
Cancel-Lock: sha1:ilFgsuC8NRsmHQu8Q6eOOWOfkPI=
 by: Stephen Hoffman - Wed, 7 Sep 2022 00:14 UTC

On 2022-09-06 23:32:46 +0000, Arne Vajhøj said:

> On 9/6/2022 4:31 PM, Stephen Hoffman wrote:
>> On 2022-09-06 18:42:53 +0000, Arne Vajhøj said:
>>
>>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>>> which is... a mess, and is also ill-suited for UTF-8.  Probably better
>>>> to use char16_t and char32_t, if you do need fixed-width wide character
>>>> storage.
>>>
>>> wchar_t is a typical C vague definition where char16_t and char32_t are
>>> much more clearly defined.
>>>
>>> But wchar_t got runtime support.
>>
>> Run-time support which is less than useful for most purposes,
>> particularly given the definition and the ~portability issues.
>
> I think I would miss wcs*, isw*, w versions of IO functions.

Other than that those functions take wchar_t and are thus still
problematic at best, sure.

Something akin to u8_strtok or u_strtok_r (UTF-8 variants of strtok or
wcstok) works rather better for most uses I have, though.

Yeah; those particular libunistring and ICU calls are not part of the C
standard.

C23 does add null-terminated multibyte calls, but the existing
selection of standard string-handling calls for UTF-8 is just... bad.
But then C string handling is bad. OpenVMS itself is also bad at UTF-8.

>>> C (and for that matter also C++) IO functions do not make
>>> writing/reading UTF-8 easy.
>>
>> The C I/O functions do ~mostly fine.
>>
>> Semi-recent Clang, else-platform:
>>
>>
>> $ cc x.c -o x
>> $ ~/x
>> hello 🗺
>> $ cat x.c
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int  main(void)
>> {
>>  printf("hello 🗺\n");
>>  exit(EXIT_SUCCESS);
>> }
>
> That is C IO processing bytes where the application has put UTF-8 in.

char (or also soon char8_t), yes. Which holds UTF-8 strings just fine.

Both the typical C string stuff and OpenVMS string descriptors can need
to carry the language and encoding separately.

> What is needed is something where the application passes unicode
> (wchar_t* or char16_t* or char32_t*) to an IO function, and it converts
> to a specified encoding, UTF-8 or otherwise.

Which would usually be the C character functions handling UTF-8, and
which would preferably seldom involve wchar_t, and probably not all
that much of char16_t or char32_t more generally.

Objective C and Swift are just vastly better at this stuff.

--
Pure Personal Opinion | HoffmanLabs LLC

Re: Character sets

<tf8slj$ghi$1@gioia.aioe.org>

https://www.novabbs.com/computers/article-flat.php?id=24734&group=comp.os.vms#24734

Newsgroups: comp.os.vms
Path: i2pn2.org!i2pn.org!aioe.org!9NDHsYWS92lSmpbnWo/VKw.user.46.165.242.75.POSTED!not-for-mail
From: no_em...@invalid.invalid (Galen)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Wed, 7 Sep 2022 01:37:55 -0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <tf8slj$ghi$1@gioia.aioe.org>
References: <teth8e$2j8nv$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="16946"; posting-host="9NDHsYWS92lSmpbnWo/VKw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:VfYpeqV1R5kYA7c7O7DVlp4bFTo=
X-Notice: Filtered by postfilter v. 0.9.2
 by: Galen - Wed, 7 Sep 2022 01:37 UTC

Simon and the rest of the bunch,

We who hang out around here constitute a very unusual set of characters. As
far as I know, there is no standard, whether defined by an international
body, a particular programming language, a manufacturer, or otherwise, that
maps even one of us (let alone the whole motley bunch) to a specific code
point.

Re: Character sets

<tfa551$d1j$1@news.misty.com>

https://www.novabbs.com/computers/article-flat.php?id=24735&group=comp.os.vms#24735

Newsgroups: comp.os.vms
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!.POSTED.185.159.157.200!not-for-mail
From: bqt...@softjar.se (Johnny Billquist)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Wed, 7 Sep 2022 15:08:49 +0200
Organization: MGT Consulting
Message-ID: <tfa551$d1j$1@news.misty.com>
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me>
<631794b2$0$703$14726298@news.sunsite.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 7 Sep 2022 13:08:50 -0000 (UTC)
Injection-Info: news.misty.com; posting-host="185.159.157.200";
logging-data="13363"; mail-complaints-to="abuse@misty.com"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
Gecko/20100101 Thunderbird/91.13.0
Content-Language: en-US
In-Reply-To: <631794b2$0$703$14726298@news.sunsite.dk>
 by: Johnny Billquist - Wed, 7 Sep 2022 13:08 UTC

On 2022-09-06 20:42, Arne Vajhøj wrote:
> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>> which is... a mess, and is also ill-suited for UTF-8.  Probably better
>> to use char16_t and char32_t, if you do need fixed-width wide
>> character storage.
>
> wchar_t is a typical C vague definition where char16_t and char32_t are
> much more clearly defined.

wchar_t was an invention from before Unicode came about. And it's fairly
incompatible with the ideas in Unicode.

> But wchar_t got runtime support.

For some definition of runtime support, sure...

> C (and for that matter also C++) IO functions do not
> make writing/reading UTF-8 easy.

Looking at the follow up comments here, what you mean is that string
processing functions lack UTF-8 variants, which is true. Especially if
we talk about the standards. As far as I/O goes, C has no problem at
all. It can read/write UTF-8 without any problems at all.

However, UTF-8 is actually not a character set. UTF-8 is an *encoding*
of Unicode. And if you were to do this properly, the canonical format is
just Unicode characters, which need 21 bits. Which probably means you'd
like to store them as arrays of 32-bit values. And then you should have
functions that take a Unicode string and convert it to UTF-8
representation and back if needed.
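
A small Python sketch of that split between code points and their encoding, as an illustration rather than a proposal for a C API. The sample string and variable names are made up for the example. "utf-32-le" is used so that each code point becomes exactly one 32-bit value with no byte order mark.

```python
# The highest Unicode code point is U+10FFFF, which fits in 21 bits.
MAX_CODE_POINT = 0x10FFFF
bits_needed = MAX_CODE_POINT.bit_length()

# Three code points of very different magnitude: A, Å, and U+1F5FA.
text = "A\u00C5\U0001F5FA"
code_points = [ord(ch) for ch in text]

# Fixed width: one 32-bit value per code point.
utf32 = text.encode("utf-32-le")
# Variable width: 1, 2 and 4 bytes respectively for these three.
utf8 = text.encode("utf-8")

print(bits_needed, [hex(cp) for cp in code_points])
print(len(utf32), "bytes as UTF-32,", len(utf8), "bytes as UTF-8")

# Converting to the UTF-8 representation and back loses nothing.
round_trip = utf8.decode("utf-8")
print(round_trip == text)
```

The 32-bit form wastes space but makes indexing by code point trivial. The UTF-8 form is compact for mostly-ASCII text but has to be walked to find the Nth code point.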

But the problem is uglier than that. Since Unicode handling also means
you should know/handle multiple codepoints that should be considered
equivalent, and for some you have combinations of characters that are
equivalent to another single character. And of course the actual
collation of it all is also language dependent, so it's not even
possible to do without some additional information. Unicode is a mess,
and we are now stuck with it, just as we're pretty stuck with x86. Not
because it's good, but because everyone uses it.

Even trying to think how ustrcmp() should be implemented makes me sick...
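
For illustration, here is a deliberately naive Python sketch of such a comparison. The name ustrcmp is borrowed from the sentence above and is hypothetical, not an existing function. It handles only canonical equivalence, by normalizing both strings to NFC, and then compares by code point order. It does nothing about language-dependent collation, which would need locale data from something like ICU.

```python
import unicodedata


def ustrcmp(a: str, b: str) -> int:
    """Hypothetical strcmp-style compare: negative, zero or positive.

    Canonically equivalent sequences compare equal. The ordering is by
    code point, not by any language's collation rules.
    """
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030A"    # A followed by COMBINING RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN, canonically equivalent to U+00C5

print(precomposed == combining)         # raw code points differ
print(ustrcmp(precomposed, combining))  # equal after normalization
print(ustrcmp(precomposed, angstrom))   # equal after normalization
print(ustrcmp("a", "b"))                # plain code point ordering
```

Even this much needs the Unicode character database at run time, which is part of why there is no small, standard C function for it.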

Or we could do as other languages, and pretend the problem doesn't exist...

> Newer languages do much better.

Sort of.

Johnny

Re: Character sets

<tfalcu$9kd8$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24736&group=comp.os.vms#24736

Newsgroups: comp.os.vms
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: club...@remove_me.eisner.decus.org-Earth.UFP (Simon Clubley)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Wed, 7 Sep 2022 17:46:06 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <tfalcu$9kd8$1@dont-email.me>
References: <teth8e$2j8nv$2@dont-email.me> <tf8slj$ghi$1@gioia.aioe.org>
Injection-Date: Wed, 7 Sep 2022 17:46:06 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="d1b671b34694a816cb819690860dd173";
logging-data="315816"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QwNN7lLRFMl8ytayo77qhQzkYRaIn3k0="
User-Agent: slrn/0.9.8.1 (VMS/Multinet)
Cancel-Lock: sha1:KVap1XjBgvPtOq313DJGg1y9uwM=
 by: Simon Clubley - Wed, 7 Sep 2022 17:46 UTC

On 2022-09-06, Galen <no_email@invalid.invalid> wrote:
> Simon and the rest of the bunch,
>
> We who hang out around here constitute a very unusual set of characters. As
> far as I know, there is no standard, whether defined by an international
> body, a particular programming language, a manufacturer, or otherwise, that
> maps even one of us (let alone the whole motley bunch) to a specific code
> point.
>

Define unusual.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.

Re: Character sets

<63192725$0$695$14726298@news.sunsite.dk>

https://www.novabbs.com/computers/article-flat.php?id=24738&group=comp.os.vms#24738

Path: i2pn2.org!i2pn.org!news.swapon.de!news.uzoreto.com!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Date: Wed, 7 Sep 2022 19:19:56 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.13.0
Subject: Re: Character sets
Content-Language: en-US
Newsgroups: comp.os.vms
References: <teth8e$2j8nv$2@dont-email.me> <tf0a0b$2vd02$1@dont-email.me>
<631794b2$0$703$14726298@news.sunsite.dk> <tfa551$d1j$1@news.misty.com>
From: arn...@vajhoej.dk (Arne Vajhøj)
In-Reply-To: <tfa551$d1j$1@news.misty.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 108
Message-ID: <63192725$0$695$14726298@news.sunsite.dk>
Organization: SunSITE.dk - Supporting Open source
NNTP-Posting-Host: d77eb8bd.news.sunsite.dk
X-Trace: 1662592805 news.sunsite.dk 695 arne@vajhoej.dk/68.9.63.232:55667
X-Complaints-To: staff@sunsite.dk
 by: Arne Vajhøj - Wed, 7 Sep 2022 23:19 UTC

On 9/7/2022 9:08 AM, Johnny Billquist wrote:
> On 2022-09-06 20:42, Arne Vajhøj wrote:
>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>> which is... a mess, and is also ill-suited for UTF-8.  Probably
>>> better to use char16_t and char32_t, if you do need fixed-width wide
>>> character storage.
>>
>> wchar_t is a typical C vague definition where char16_t and char32_t are
>> much more clearly defined.
>
> wchar_t was an invention from before Unicode came about. And it's fairly
> incompatible with the ideas in Unicode.

It is crazy vague in the C standard.

But on common platforms it is just utf-16 or utf-32.
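
A quick way to see which of the two a given platform uses, sketched in Python via ctypes. The variable names are made up for the example, and the result naturally depends on where it is run: typically 16 bits on Windows and 32 bits on most Unix-like systems.

```python
import ctypes

# Size of the C wchar_t type on the platform running this script.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is", wchar_size * 8, "bits here")

# A code point outside the Basic Multilingual Plane.
ch = "\U0001F5FA"

# In UTF-16 it needs a surrogate pair, i.e. two 16-bit units.
utf16_units = len(ch.encode("utf-16-le")) // 2
# In UTF-32 it is always a single 32-bit unit.
utf32_units = len(ch.encode("utf-32-le")) // 4

print(utf16_units, "UTF-16 units,", utf32_units, "UTF-32 unit")
```

So with a 16-bit wchar_t, one wchar_t is not always one character, and functions that classify a single wchar_t cannot see the whole code point.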

>> But wchar_t got runtime support.
>
> For some definition of runtime support, sure...

There are a bunch of w functions including wcs* and the w IO functions.

One may not like the API, but wchar_t* is approx. as supported as char*.

>> C (and for that matter also C++) IO functions do not
>> make writing/reading UTF-8 easy.
>
> Looking at the follow up comments here, what you mean is that string
> processing functions lack UTF-8 variants, which is true. Especially if
> we talk about the standards. As far as I/O goes, C has no problem at
> all. It can read/write UTF-8 without any problems at all.

I must be bad at explaining what I mean.

I know that C char* IO can read and write bytes containing UTF-8 - those
functions just pass the bytes on.

I am looking for a decoupling between internal Unicode representation
and external encoding - aka a transparent encode/decode.

That may still sound confusing.

But it should become more clear with a few examples.

Java:

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public class J {
    public static void main(String[] args) throws IOException {
        String s1 = "ÆØÅæøå";
        String s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        try (PrintWriter pw = new PrintWriter(new File("j1.txt"), "iso-8859-1")) {
            pw.printf("%s = %s\n", s1, s2);
        }
        try (PrintWriter pw = new PrintWriter(new File("j2.txt"), "utf-8")) {
            pw.printf("%s = %s\n", s1, s2);
        }
    }
}

C#:

using System;
using System.IO;
using System.Text;

public class N
{
    public static void Main(string[] args)
    {
        string s1 = "ÆØÅæøå";
        string s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        using (StreamWriter sw = new StreamWriter("n1.txt", false,
                                                  Encoding.GetEncoding("iso-8859-1")))
        {
            sw.WriteLine("{0} = {1}", s1, s2);
        }
        using (StreamWriter sw = new StreamWriter("n2.txt", false,
                                                  Encoding.GetEncoding("utf-8")))
        {
            sw.WriteLine("{0} = {1}", s1, s2);
        }
    }
}

Python:

s1 = "ÆØÅæøå"
s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"
with open("p1.txt", "w", encoding="iso-8859-1") as f:
    f.write("%s = %s\n" % (s1, s2))
with open("p2.txt", "w", encoding="utf-8") as f:
    f.write("%s = %s\n" % (s1, s2))

One simply specifies what encoding the file should be
in and the IO code handles the encoding.

And it does not matter that Java and .NET use UTF-16 internally
while Python uses its own internal representation.
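
To make the decoupling visible without touching the file system, here is a small Python sketch that pushes the same string through a text layer with two different encodings. The helper name encode_via_text_layer is hypothetical, and the BytesIO object stands in for a file.

```python
import io

text = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"  # ÆØÅæøå


def encode_via_text_layer(s: str, encoding: str) -> bytes:
    """Write s through a text wrapper and return the bytes that came out."""
    raw = io.BytesIO()
    with io.TextIOWrapper(raw, encoding=encoding, newline="\n") as out:
        out.write(s)
        out.flush()
        # Read the bytes before the wrapper closes the underlying buffer.
        return raw.getvalue()


latin1_bytes = encode_via_text_layer(text, "iso-8859-1")
utf8_bytes = encode_via_text_layer(text, "utf-8")

# Same text in, different bytes out: one byte per letter versus two.
print(len(latin1_bytes), len(utf8_bytes))

# Decoding with the matching encoding gives back the same text.
print(latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8") == text)
```

The application code only ever handles the string. Which bytes end up on disk is decided by the encoding argument alone.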

Arne

Re: Character sets

<tfdtet$r3tr$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=24742&group=comp.os.vms#24742

Newsgroups: comp.os.vms
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: seaoh...@hoffmanlabs.invalid (Stephen Hoffman)
Newsgroups: comp.os.vms
Subject: Re: Character sets
Date: Thu, 8 Sep 2022 19:22:05 -0400
Organization: HoffmanLabs LLC
Lines: 48
Message-ID: <tfdtet$r3tr$1@dont-email.me>
References: <tf0a0b$2vd02$1@dont-email.me> <631794b2$0$703$14726298@news.sunsite.dk> <tfa551$d1j$1@news.misty.com> <63192725$0$695$14726298@news.sunsite.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader01.eternal-september.org; posting-host="2f9485574bcd17778adfe9f7c71d9dc1";
logging-data="888763"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+tTC4Oc+TCRNL/en7fXTGjHoO7ArXSE8Y="
User-Agent: Unison/2.2
Cancel-Lock: sha1:TTOl63q8T6zikwrc6crhflg01S4=
 by: Stephen Hoffman - Thu, 8 Sep 2022 23:22 UTC

On 2022-09-07 23:19:56 +0000, Arne Vajhøj said:

> On 9/7/2022 9:08 AM, Johnny Billquist wrote:
>> On 2022-09-06 20:42, Arne Vajhøj wrote:
>>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>>> which is... a mess, and is also ill-suited for UTF-8.  Probably better
>>>> to use char16_t and char32_t, if you do need fixed-width wide character
>>>> storage.
>>>
>>> wchar_t is a typical C vague definition where char16_t and char32_t are
>>> much more clearly defined.
>>
>> wchar_t was an invention from before Unicode came about. And it's
>> fairly incompatible with the ideas in Unicode.
>
> It is crazy vague in the C standard.
>
> But on common platforms it is just utf-16 or utf-32.

Maybe I was unclear.

C string handling is bad.

C UTF-8 handling is worse.

As defined, wchar_t is... less than useful.

Sure, if I want piles of glue code, it can be sorta workable. Kinda. Maybe.

But I have OpenVMS for excess glue code.

And OpenVMS UTF-8 handling is ~negligible.

Including BASIC, Fortran, COBOL, Pascal, and the entirety of the
OpenVMS system APIs, past ODS-5 UTF-8.

While a step or three in the right direction, Python, Perl, and Java
won't help with OpenVMS here, either.

Not on OpenVMS.

--
Pure Personal Opinion | HoffmanLabs LLC
