devel / comp.lang.c / Re: How to use utf8 encoded strings on linux?

Re: How to use utf8 encoded strings on linux?

<733de051-d4d6-4ea3-818d-1d5b82e7c1b7n@googlegroups.com>

 by: Thiago Adams - Wed, 13 Apr 2022 23:07 UTC

On Wednesday, April 13, 2022 at 7:43:21 PM UTC-3, Siri Cruise wrote:
> In article <slrnt5dvg9...@shadow.rauhala.org>,
> Mikko Rauhala <m...@iki.fi> wrote:
>
> > > This is my test program.
> > [...]
> > > setlocale(LC_ALL,"en_US.UTF - 8");
> > > FILE* f = fopen(u8"maçã", "w");
> > [...]
> > > It creates a file ma?? instead of maçã.
> >
> > As has been at least implied by others, the file showing up as that
> > probably has more to do with what you're listing the directory with
> > and what locale settings are there.
> On unix, inside the kernel a path is a string of any possible
> bytes except '\x00'; the byte '/' is given special
> interpretation. A UTF8 byte string will have non-ASCII bytes, but
> Unix kernels should have no difficulty. A kernel might convert a
> path into a normal form NFC or NFD. Those can result in
> apparently identical paths which are actually different. Outside
> of the kernel, it's about how various software, terminal drivers,
> windows text display, etc decide to do.
> --
> :-<> Siri Seal of Disavowal #000-001. Disavowed. Denied. Deleted. @
> 'I desire mercy, not sacrifice.' /|\
> Discordia: not just a religion but also a parody. This post / \
> I am an Andrea Doria sockpuppet. insults Islam. Mohammed

Let's say I put a file called maçã.txt on a pen drive. Then I copy
the file to the Linux filesystem. I expect to see the same name. So
it is hard to understand that file names are just arrays of chars no matter
what.

Windows, by contrast, operates natively in UTF-16 (WCHAR). From Microsoft's documentation:
"Until recently, Windows has emphasized "Unicode" -W variants over -A APIs.
However, recent releases have used the ANSI code page and -A APIs as a means to introduce
UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A
APIs operate in UTF-8. This model has the benefit of supporting existing
code built with -A APIs without any code changes."

https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

I was expecting that with this move Windows would become more like Linux,
but now I am confused. I was expecting everything to be UTF-8 inside Linux
(Unicode in some way).

Re: How to use utf8 encoded strings on linux?

<t37rlb$1n5e$1@gioia.aioe.org>

 by: antis...@math.uni.wroc.pl - Thu, 14 Apr 2022 00:59 UTC

Siri Cruise <chine.bleu@yahoo.com> wrote:
> In article <slrnt5dvg9.1ocu7.mjr@shadow.rauhala.org>,
> Mikko Rauhala <mjr@iki.fi> wrote:
>
> > > This is my test program.
> > [...]
> > > setlocale(LC_ALL,"en_US.UTF - 8");
> > > FILE* f = fopen(u8"maçã", "w");
> > [...]
> > > It creates a file ma?? instead of maçã.
> >
> > As has been at least implied by others, the file showing up as that
> > probably has more to do with what you're listing the directory with
> > and what locale settings are there.
>
> On unix, inside the kernel a path is a string of any possible
> bytes except '\x00'; the byte '/' is given special
> interpretation. A UTF8 byte string will have non-ASCII bytes, but
> Unix kernels should have no difficulty.

Kernel proper will transparently ship bytes to filesystem code.

> A kernel might convert a
> path into a normal form NFC or NFD. Those can result in
> apparently identical paths which are actually different. Outside
> of the kernel, it's about how various software, terminal drivers,
> windows text display, etc decide to do.

This is very imprecise. The OP was talking about WSL. I have no
idea how WSL handles files and what it will do with filenames.
Native Linux filesystems store the bytes of filenames "as is"
(except for '/', which separates the parts of a name, and the null
terminator at the end). Other filesystems may convert the bytes to
whatever encoding is in use. Normally in Linux this is under the
control of system configuration (including a possible descriptor on
disc). So to have bytes stored correctly in file names, the
system must be properly configured. AFAIK the Mac filesystem
mandates some weirdness. The same is likely to hold
for Windows filesystem(s). So it may be impossible to
have a clean round trip

string => filename => string

OTOH, if Linux inside WSL is properly configured for Unicode, the
round trip

filename => string => filename

should be clean.

--
Waldek Hebisch

Re: How to use utf8 encoded strings on linux?

<877d7sfijq.fsf@nosuchdomain.example.com>

 by: Keith Thompson - Thu, 14 Apr 2022 03:30 UTC

Thiago Adams <thiago.adams@gmail.com> writes:
[...]
> Lets say I put a file called maçã.txt inside a pen drive.

Then the name of the file on the pen drive depends on what filesystem
the drive uses. FAT32 is most common, and it restricts which characters
can appear in file names (I don't know the details, but I've definitely
run into problems myself).

> Then I copy
> to the file to the linux filesystem.

Linux-based systems support a number of different filesystem
implementations. Most of them support all byte values other than '\0'
and '/' in file names.

> I expect to see the same name. So
> it is hard to understand that file names are just array of chars no matter
> what.

If you create a file named maçã.txt on a Unix-like system, the name in
the file system is probably (I think) going to consist of the bytes 'm',
'a', 0xc3, 0xa7, 0xc3, 0xa3, '.', 't', 'x', 't', the UTF-8
representation of that string. The file system doesn't care
whether it's valid UTF-8 or not. Other software layers might.

Of course the C standard says very little about any of this.

> As Windows operates natively in UTF-16 (WCHAR).
> "Until recently, Windows has emphasized "Unicode" -W variants over -A APIs.
> However, recent releases have used the ANSI code page and -A APIs as a means to introduce
> UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A
> APIs operate in UTF-8. This model has the benefit of supporting existing
> code built with -A APIs without any code changes."
>
> https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>
> I was expecting with these move Windows would become more like Linux
> but now it is confused. I was expecting everting utf8 inside linux.
> (unicode in some way)

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: How to use utf8 encoded strings on linux?

<chine.bleu-6A8811.21581613042022@reader.eternal-september.org>

 by: Siri Cruise - Thu, 14 Apr 2022 04:58 UTC

In article <877d7sfijq.fsf@nosuchdomain.example.com>,
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

> If you create a file named maçã.txt on a Unix-like system, the name in
> the file system is probably (I think) going to consist of the bytes 'm',
> 'a', 0xc3, 0xa7, 0xc3, 0xa3, '.', 't', 'x', 't', the UTF-8
> representation of that string. The file system doesn't care
> whether it's valid UTF-8 or not. Other software layers might.

I'm not sure, but I think macOS normalises to NFD.

--
:-<> Siri Seal of Disavowal #000-001. Disavowed. Denied. Deleted. @
'I desire mercy, not sacrifice.' /|\
Discordia: not just a religion but also a parody. This post / \
I am an Andrea Doria sockpuppet. insults Islam. Mohammed

Re: How to use utf8 encoded strings on linux?

<t38f0c$s708$1@solani.org>

 by: Philipp Klaus Krause - Thu, 14 Apr 2022 06:29 UTC

Am 13.04.22 um 21:52 schrieb James Kuyper:

>
> locale -a gives this result on my system:
> C
> […]
>
> That's 48 different national locales, even though I only live in one of
> them. I believe that they're standard Linux locales, and it just depends
> upon which language packs you've got installed. English is one of the
> most widely used languages, and "en_US.utf8" is probably one of
> the most popular locales even outside the US.
>

On the system on which I am typing this:

philipp@notebook6:~$ locale -a
C
C.UTF-8
de_DE.utf8
POSIX

I installed Debian GNU/Linux, chose German keyboard layout, and
Germany/Berlin timezone in the installer, and never bothered to install
additional locales manually on this system.

English might be used widely, but that doesn't mean the US locale is
installed. en_GB.utf8 is likely to be a popular choice, too. Note,
however, that those who just want English error messages (e.g. for bug
reports) are likely to use the C or C.UTF-8 locale.

Philipp

Re: How to use utf8 encoded strings on linux?

<slrnt5ft5i.2ude.grahn+nntp@frailea.sa.invalid>

 by: Jorgen Grahn - Thu, 14 Apr 2022 10:17 UTC

On Thu, 2022-04-14, Philipp Klaus Krause wrote:
> Am 13.04.22 um 21:52 schrieb James Kuyper:
>
>>
>> locale -a gives this result on my system:
>> C
>> […]
>>
>> That's 48 different national locales, even though I only live in one of
>> them. I believe that they're standard Linux locales, and it just depends
>> upon which language packs you've got installed. English is one of the
>> most widely used languages, and "en_US.utf8" is probably one of
>> the most popular locales even outside the US.
>>
>
> On the system on which I am typing this:
>
> philipp@notebook6:~$ locale -a
> C
> C.UTF-8
> de_DE.utf8
> POSIX
>
> I installed Debian GNU/Linux, chose German keyboard layout, and
> Germany/Berlin timezone in the installer, and never bothered to install
> additional locales manually on this system.

Mine is similar. I make a point of enabling a few sv_SE and en_US
locales, in case I want them. That's when doing an "expert" install,
though; people who install their systems and just say "I want GNOME
and everything" may get more locales.

Interestingly, my OpenBSD box has ~65 national locales, and they're
all named *.UTF-8, i.e. different from the Linux names. They
have "C", "POSIX" and "C.UTF-8" in common, though.

> English might be used widely, but that doesn't mean the US locale is
> installed. en_GB.utf8 is likely to be a popular choice, too. Not
> however, that those that just want English error messages (e.g. for bug
> reports) are likely to use the C or C.UTF-8 locale.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Re: How to use utf8 encoded strings on linux?

<t391t4$hl1$1@dont-email.me>

 by: David Brown - Thu, 14 Apr 2022 11:52 UTC

On 13/04/2022 00:23, Thiago Adams wrote:
> On Tuesday, April 12, 2022 at 6:59:22 PM UTC-3, Keith Thompson wrote:
>> Thiago Adams <thiago...@gmail.com> writes:
>>> This is my test program.
>>>
>>> #include <stdio.h>
>>> #include <locale.h>
>>> int main() {
>>> setlocale(LC_ALL,"en_US.UTF - 8");
>>> FILE* f = fopen(u8"maçã", "w");
>>> if (f)
>>> fclose(f);
>>> }
>>>
>>> It creates a file ma�� instead of maçã.
>> The name of the file it creates includes two occurrences of the
>> Unicode REPLACEMENT CHARACTER (fffd).
>>
>> The valid values of the second argument to setlocale() are not
>> specified by the C standard. On my system, "en_US.UTF-8" is a valid
>> locale name. Why do you have spaces in yours?
>>
>> What does setlocale() return? It returns a char* that is either a
>> pointer to a string or NULL if it was given an invalid locale
>> specification.
>>
>> (I haven't been able to reproduce the behavior you describe on my
>> system, Ubuntu 20.04.)
>
> Considering your answer I tried some combinations and it worked.
> Thanks.
>
> I am using WSL. (Windows Subsystem for Linux.)
>
> I saved the file using utf8 encoding.
>
> Some locales didn't work on linux.. like ".UTF-8". (the locale with spaces where wrong when I copy pasted)
> (This one ".UTF-8" works on windows when I compile with VC++.)
>

Why are you setting the locale at all? For most purposes, strings are
just lists of characters ending in a null - whatever you put in the
string is used for the filename without any kind of translation or
re-encoding. The locale is only relevant for whatever program or
terminal you are using later to look at the name of the created file.

Re: How to use utf8 encoded strings on linux?

<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>

 by: Thiago Adams - Thu, 14 Apr 2022 13:10 UTC

On Thursday, April 14, 2022 at 8:52:22 AM UTC-3, David Brown wrote:
> On 13/04/2022 00:23, Thiago Adams wrote:
> > On Tuesday, April 12, 2022 at 6:59:22 PM UTC-3, Keith Thompson wrote:
> >> Thiago Adams <thiago...@gmail.com> writes:
> >>> This is my test program.
> >>>
> >>> #include <stdio.h>
> >>> #include <locale.h>
> >>> int main() {
> >>> setlocale(LC_ALL,"en_US.UTF - 8");
> >>> FILE* f = fopen(u8"maçã", "w");
> >>> if (f)
> >>> fclose(f);
> >>> }
> >>>
> >>> It creates a file ma�� instead of maçã.
> >> The name of the file it creates includes two occurrences of the
> >> Unicode REPLACEMENT CHARACTER (fffd).
> >>
> >> The valid values of the second argument to setlocale() are not
> >> specified by the C standard. On my system, "en_US.UTF-8" is a valid
> >> locale name. Why do you have spaces in yours?
> >>
> >> What does setlocale() return? It returns a char* that is either a
> >> pointer to a string or NULL if it was given an invalid locale
> >> specification.
> >>
> >> (I haven't been able to reproduce the behavior you describe on my
> >> system, Ubuntu 20.04.)
> >
> > Considering your answer I tried some combinations and it worked.
> > Thanks.
> >
> > I am using WSL. (Windows Subsystem for Linux.)
> >
> > I saved the file using utf8 encoding.
> >
> > Some locales didn't work on linux.. like ".UTF-8". (the locale with spaces where wrong when I copy pasted)
> > (This one ".UTF-8" works on windows when I compile with VC++.)
> >
> Why are you setting the locale at all? For most purposes, strings are
> just lists of characters ending in a null - whatever you put in the
> string is used for the filename without any kind of translation or
> re-encoding. The locale is only relevant for whatever program or
> terminal you are using later to look at the name of the created file.

I am still trying to understand Linux; I understand Windows.
On Windows, file names are Unicode (encoded as UTF-16). So to open a file
saved as "maçã" with fopen instead of _wfopen, we need a code page or a UTF-8 locale.
I was expecting Linux file names to be Unicode as well, but encoded as UTF-8
instead of UTF-16. Now I don't know how it works. If file names on Linux are not Unicode,
just arrays of chars, this would make each file name differ between computers
depending on Linux internals!

Re: How to use utf8 encoded strings on linux?

<t3997e$10jt$1@gioia.aioe.org>

 by: Manfred - Thu, 14 Apr 2022 13:57 UTC

On 4/14/2022 3:10 PM, Thiago Adams wrote:
> If the files in linux are not unicode
> just arrays of chars this would make each file name be different between computers
> depending of linux internals!

No it wouldn't, and it doesn't.
The "name" that identifies the file would be identical across the Linux
filesystems in use (say e.g. EXT-X variants).
What /might/ change is how that name is shown on a display (be it a
console, a window, widget or whatever) depending on the configuration of
such display - not on linux internals.
However, as many have repeated, the vast majority of Linux installations
default to UTF-8 for their UI configuration.
The filename representation in EXT filesystems (all bytes except '\0'
and '/') is inherently compatible with UTF-8.

You seem to consider a WSL installation equivalent to a native Linux
distribution.
Although I am not familiar with the details of WSL, this doesn't seem to
be correct.
For starters, a standard Linux distribution would definitely not default
to NTFS for its filesystem. Other technologies like EXT2,3,4 and XFS
are dominant in this area.
A major difference here is that NTFS uses UTF-16 (apparently unchecked)
for its filename representation, while EXT*, XFS and others use, as
said, all bytes except '\0' and '/', which is transparent to UTF-8 and
ASCII.

Re: How to use utf8 encoded strings on linux?

<8735ifftfe.fsf@nosuchdomain.example.com>

 by: Keith Thompson - Thu, 14 Apr 2022 17:47 UTC

Thiago Adams <thiago.adams@gmail.com> writes:
[...]
> I am still trying to understand Linux... I understand Windows.
> In windows file names are Unicode (encoded utf16). So using fopen to open a file
> saved as "maçã" using fopen instead of _wfopen we need a code page or utf8 locale.
> I was expecting Linux filenames to be unicode as well but maybe encoded with utf8
> instead of utf16. Now I don't know how it works. If the files in linux are not unicode
> just arrays of chars this would make each file name be different between computers
> depending of linux internals!

The encoding of file names is (usually?) determined by the file system,
not by the operating system, though of course they interact with each
other. Windows uses NTFS by default these days, but it also supports
FAT32 and probably others.

If you copy a file from a Windows system to a Linux system, you're
creating a new file on the Linux system's file system (often ext4, but
there are other possibilities). Typically the code that copies the file
will take the Windows name (probably UTF-16), *translate* it to UTF-8
and use the resulting byte sequence as the name of the newly created
file.
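As an illustration of that translation step, here is a minimal sketch in C. The function name utf16_to_utf8 and its error convention are made up for this example (they are not a Windows or Linux API), and real copy tools may handle unpaired surrogates differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units to a NUL-terminated
 * UTF-8 string in out (capacity cap bytes).  Returns the number of bytes
 * written, not counting the terminator, or (size_t)-1 on an unpaired
 * surrogate or if out is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, char *out, size_t cap)
{
    size_t o = 0;
    if (cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {           /* high surrogate */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {    /* stray low surrogate */
            return (size_t)-1;
        }
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + len + 1 > cap)                        /* keep room for '\0' */
            return (size_t)-1;
        if (len == 1) {
            out[o++] = (char)cp;
        } else if (len == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

For the name "maçã" the four code units 0x006D 0x0061 0x00E7 0x00E3 become the six bytes 6D 61 C3 A7 C3 A3, and that byte sequence is what would be passed to open() or fopen() on the Linux side.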

It's also possible to share a file system across operating systems.
Both OS's are responsible for making that work consistently.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: How to use utf8 encoded strings on linux?

<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=21233&group=comp.lang.c#21233

X-Received: by 2002:a05:620a:4455:b0:69c:6124:21fe with SMTP id w21-20020a05620a445500b0069c612421femr2753870qkp.680.1649959141537;
Thu, 14 Apr 2022 10:59:01 -0700 (PDT)
X-Received: by 2002:ac8:7fcc:0:b0:2e0:7760:2f10 with SMTP id
b12-20020ac87fcc000000b002e077602f10mr2765010qtk.34.1649959141364; Thu, 14
Apr 2022 10:59:01 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.c
Date: Thu, 14 Apr 2022 10:59:01 -0700 (PDT)
In-Reply-To: <8735ifftfe.fsf@nosuchdomain.example.com>
Injection-Info: google-groups.googlegroups.com; posting-host=189.6.248.114; posting-account=xFcAQAoAAAAoWlfpQ6Hz2n-MU9fthxbY
NNTP-Posting-Host: 189.6.248.114
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com> <87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com> <t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com> <8735ifftfe.fsf@nosuchdomain.example.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
Subject: Re: How to use utf8 encoded strings on linux?
From: thiago.a...@gmail.com (Thiago Adams)
Injection-Date: Thu, 14 Apr 2022 17:59:01 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 30
 by: Thiago Adams - Thu, 14 Apr 2022 17:59 UTC

On Thursday, April 14, 2022 at 2:48:06 PM UTC-3, Keith Thompson wrote:
> Thiago Adams <thiago...@gmail.com> writes:
> [...]
> > I am still trying to understand Linux... I understand Windows.
> > In windows file names are Unicode (encoded utf16). So using fopen to open a file
> > saved as "maçã" using fopen instead of _wfopen we need a code page or utf8 locale.
> > I was expecting Linux filenames to be unicode as well but maybe encoded with utf8
> > instead of utf16. Now I don't know how it works. If the files in linux are not unicode
> > just arrays of chars this would make each file name be different between computers
> > depending of linux internals!
> The encoding of file names is (usually?) determined by the file system,
> not by the operating system, though of course they interact with each
> other. Windows uses NTFS by default these days, but it also supports
> FAT32 and probably others.
>
> If you copy a file from a Windows system to a Linux system, you're
> creating a new file on the Linux system's file system (often ext4, but
> there are other possibilities). Typically the code that copies the file
> will take the Windows name (probably UTF-16), *translate* it to UTF-8
> and use the resulting byte sequence as the name of the newly created
> file.

Right...but in this case can everyone agree that Linux uses Unicode and
it is encoded with utf8?

Re: How to use utf8 encoded strings on linux?

<t39p9i$21bsp$1@news.mixmin.net>


https://www.novabbs.com/devel/article-flat.php?id=21234&group=comp.lang.c#21234

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.mixmin.net!.POSTED!not-for-mail
From: samlara...@gmail.com (Sams Lara)
Newsgroups: comp.lang.c,alt.os.linux
Subject: Re: How to use utf8 encoded strings on linux?
Date: Thu, 14 Apr 2022 19:27:48 +0100
Organization: Microsoft Unofficial Representative on Newsgroups
Message-ID: <t39p9i$21bsp$1@news.mixmin.net>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com>
<t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>
<8735ifftfe.fsf@nosuchdomain.example.com>
<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 14 Apr 2022 18:31:14 -0000 (UTC)
Injection-Info: news.mixmin.net; posting-host="48d11a3b2fa45924111005cb39b9d63a5dfc88a3";
logging-data="2142105"; mail-complaints-to="abuse@mixmin.net"
In-Reply-To: <5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
Content-Language: en-US
 by: Sams Lara - Thu, 14 Apr 2022 18:27 UTC

On 14/04/2022 18:59, Thiago Adams wrote:
> Right...but in this case can everyone agree that Linux uses Unicode and
> it is encoded with utf8?

Your best bet is to ask on the Linux Newsgroup. You don't have to
subscribe to that group. Just cross-post to that group and reply will
come on this newsgroup. Their newsgroup is:

"alt.os.linux"

I have done this here so hopefully somebody will reply.

Re: How to use utf8 encoded strings on linux?

<87mtgnebuo.fsf@nosuchdomain.example.com>


https://www.novabbs.com/devel/article-flat.php?id=21237&group=comp.lang.c#21237

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c,alt.os.linux
Subject: Re: How to use utf8 encoded strings on linux?
Date: Thu, 14 Apr 2022 11:52:47 -0700
Organization: None to speak of
Lines: 27
Message-ID: <87mtgnebuo.fsf@nosuchdomain.example.com>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com>
<t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>
<8735ifftfe.fsf@nosuchdomain.example.com>
<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
<t39p9i$21bsp$1@news.mixmin.net>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="f70bdff0312c3b54d6b132549de7af94";
logging-data="11958"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19K8AzBPKLko6JEqw7drVlm"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:RHw5QAcU/7jAk3ri0607yWrKm1A=
sha1:ws86ok1CLjbNFYoT+v3zgR+j/4U=
 by: Keith Thompson - Thu, 14 Apr 2022 18:52 UTC

Sams Lara <samlara622@gmail.com> writes:
> On 14/04/2022 18:59, Thiago Adams wrote:
>> Right...but in this case can everyone agree that Linux uses Unicode and
>> it is encoded with utf8?
>
> Your best bet is to ask on the Linux Newsgroup. You don't have to
> subscribe to that group. Just cross-post to that group and reply will
> come on this newsgroup. Their newsgroup is:
>
> "alt.os.linux"
>
> I have done this here so hopefully somebody will reply.

But you didn't provide enough context. We were talking about how file
names are encoded in a file system.

My understanding is that in most Linux file system implementations, file
names are uninterpreted byte sequences (excluding '/' and '\0') that are
commonly *displayed* as their UTF-8 interpretations. I *think* that you
could create a file whose name is not a valid UTF-8 string, and the file
system wouldn't have any problem with it, but commands like "ls" might
have to do something special. I haven't tried it.
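To make "not a valid UTF-8 string" concrete, here is a minimal sketch of the kind of check a tool like "ls" or a file selector might apply before displaying a name. The function is_valid_utf8 is a hypothetical helper written for this example, not something the file system or libc provides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the n bytes at s are well-formed UTF-8.
 * Rejects overlong forms, surrogates, and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80) {                       /* plain ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;                     /* stray continuation or 0xF8..0xFF */
        }
        if (n - i < len)
            return false;                     /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                 /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}
```

A name containing a lone 0xFF or 0x80 byte fails this check, yet nothing in the description above stops a byte-oriented file system from storing it.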

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Re: How to use utf8 encoded strings on linux?

<op.1kmr9tvja3w0dxdave@hodgins.homeip.net>


https://www.novabbs.com/devel/article-flat.php?id=21238&group=comp.lang.c#21238

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: dwhodg...@nomail.afraid.org (David W. Hodgins)
Newsgroups: comp.lang.c,alt.os.linux
Subject: Re: How to use utf8 encoded strings on linux?
Date: Thu, 14 Apr 2022 15:54:55 -0400
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <op.1kmr9tvja3w0dxdave@hodgins.homeip.net>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com>
<t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>
<8735ifftfe.fsf@nosuchdomain.example.com>
<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
<t39p9i$21bsp$1@news.mixmin.net> <87mtgnebuo.fsf@nosuchdomain.example.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="0f93f19bb15b7d7c183a2c631c0250f4";
logging-data="20054"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Fk05zYt4EXaXLtoeftfhoWhGlDb1k01k="
User-Agent: Opera Mail/12.16 (Linux)
Cancel-Lock: sha1:FuykpTzUq37YZTVM5gqDywl2qxM=
 by: David W. Hodgins - Thu, 14 Apr 2022 19:54 UTC

On Thu, 14 Apr 2022 14:52:47 -0400, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
> Sams Lara <samlara622@gmail.com> writes:
>> On 14/04/2022 18:59, Thiago Adams wrote:
>>> Right...but in this case can everyone agree that Linux uses Unicode and
>>> it is encoded with utf8?
>>
>> Your best bet is to ask on the Linux Newsgroup. You don't have to
>> subscribe to that group. Just cross-post to that group and reply will
>> come on this newsgroup. Their newsgroup is:
>>
>> "alt.os.linux"
>>
>> I have done this here so hopefully somebody will reply.
>
> But you didn't provide enough context. We were talking about how file
> names are encoded in a file system.
>
> My understanding is that in most Linux file system implementations, file
> names are uninterpreted byte sequences (excluding '/' and '\0') that are
> commonly *displayed* as their UTF-8 interpretations. I *think* that you
> could create a file whose name is not a valid UTF-8 string, and the file
> system wouldn't have any problem with it, but commands like "ls" might
> have to do something special. I haven't tried it.

How they are interpreted for display depends on the locale setting of LC_CTYPE
(character type). It can be ascii, utf-8, utf-16, or utf-ebcdic.
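A program can ask which character set the current LC_CTYPE setting selects. The sketch below uses the POSIX calls setlocale() and nl_langinfo(CODESET); the wrapper name current_codeset is made up for this example, and the exact string returned depends on the C library and the environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>

/* Hypothetical wrapper: adopt LC_CTYPE from the environment (LC_ALL,
 * LC_CTYPE, LANG) and return the name of its character encoding.
 * If the requested locale is unavailable, setlocale() fails and the
 * default "C" locale stays in effect. */
const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

On a typical desktop installation with LANG set to something like en_US.UTF-8 this is expected to report "UTF-8"; with LC_ALL=C it reports whatever name the C library uses for its default encoding.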

Regards, Dave Hodgins

Re: How to use utf8 encoded strings on linux?

<slrnt5h9gj.1ocu7.mjr@shadow.rauhala.org>


https://www.novabbs.com/devel/article-flat.php?id=21240&group=comp.lang.c#21240

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mjr...@iki.fi (Mikko Rauhala)
Newsgroups: comp.lang.c
Subject: Re: How to use utf8 encoded strings on linux?
Date: Thu, 14 Apr 2022 22:54:11 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <slrnt5h9gj.1ocu7.mjr@shadow.rauhala.org>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<slrnt5dvg9.1ocu7.mjr@shadow.rauhala.org>
<chine.bleu-54692E.15424313042022@reader.eternal-september.org>
<733de051-d4d6-4ea3-818d-1d5b82e7c1b7n@googlegroups.com>
<877d7sfijq.fsf@nosuchdomain.example.com>
<chine.bleu-6A8811.21581613042022@reader.eternal-september.org>
Injection-Date: Thu, 14 Apr 2022 22:54:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7956eeb44dd2b86ee4901f8f0717af85";
logging-data="28265"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18f9I65TGGAbBGHim/iYEBh93CLpuj0ffY="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:zyP0yjbSnuNLnXN63O8xVH4R/N4=
 by: Mikko Rauhala - Thu, 14 Apr 2022 22:54 UTC

On Wed, 13 Apr 2022 21:58:24 -0700, Siri Cruise <chine.bleu@yahoo.com> wrote:
> I'm not sure, but I think macOS normalises to NFC.

It normalizes to NFD, which causes all sorts of compatibility problems.
(Except that it isn't even standard NFD, they make some exceptions for
Mac-Roman legacy support, showing that they were fully aware that their
choice was a compatibility nightmare.)
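To see why the normalization form matters, here is a small sketch showing that the same visible character has different byte sequences in NFC and NFD, so a byte-wise comparison treats them as different names. The identifiers are made up for this example.

```c
#include <string.h>

/* "é" as one precomposed code point, U+00E9 (NFC form), in UTF-8. */
static const char e_acute_nfc[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form). */
static const char e_acute_nfd[] = "e\xCC\x81";

/* Hypothetical helper: byte-wise name comparison, which is all a
 * byte-oriented file system does.  Returns 1 if the names match. */
int same_name_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

same_name_bytes(e_acute_nfc, e_acute_nfd) returns 0, so a file created under an NFD name will not be found by a program that looks it up using the NFC spelling on a file system that does no normalization.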

--
Mikko Rauhala - mjr@iki.fi - http://rauhala.org

Re: How to use utf8 encoded strings on linux?

<slrnt5ha87.1ocu7.mjr@shadow.rauhala.org>


https://www.novabbs.com/devel/article-flat.php?id=21242&group=comp.lang.c#21242

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: mjr...@iki.fi (Mikko Rauhala)
Newsgroups: comp.lang.c
Subject: Re: How to use utf8 encoded strings on linux?
Date: Thu, 14 Apr 2022 23:06:47 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <slrnt5ha87.1ocu7.mjr@shadow.rauhala.org>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com>
<t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>
<8735ifftfe.fsf@nosuchdomain.example.com>
<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
Injection-Date: Thu, 14 Apr 2022 23:06:47 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7956eeb44dd2b86ee4901f8f0717af85";
logging-data="28265"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18NPo51rVqGy3lEYsPC+vRDGQDxhY9i3PY="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:G/s6ZQVhfId+TSbNqPkThpUySlU=
 by: Mikko Rauhala - Thu, 14 Apr 2022 23:06 UTC

On Thu, 14 Apr 2022 10:59:01 -0700 (PDT), Thiago Adams
<thiago.adams@gmail.com> wrote:
> On Thursday, April 14, 2022 at 2:48:06 PM UTC-3, Keith Thompson wrote:
>> The encoding of file names is (usually?) determined by the file system,
>> not by the operating system, though of course they interact with each
>> other. Windows uses NTFS by default these days, but it also supports
>> FAT32 and probably others.
>>
>> If you copy a file from a Windows system to a Linux system, you're
>> creating a new file on the Linux system's file system (often ext4, but
>> there are other possibilities). Typically the code that copies the file
>> will take the Windows name (probably UTF-16), *translate* it to UTF-8
>> and use the resulting byte sequence as the name of the newly created
>> file.
>
> Right...but in this case can everyone agree that Linux uses Unicode and
> it is encoded with utf8?

It's somewhat annoyingly only the de facto standard, but modern (GNU/)Linux
systems will use UTF-8 unless you wonk them to do something else yourself.
(Disclaimer: I'm not sure if distributions in CJK-using countries still
prefer their own encodings; it may well be.)

The complication is that there are two kinds of filesystem. Either your
filesystem doesn't have any opinion on the matter and will accept any
byte string (not containing a zero byte or '/'), and then you don't have
guarantees except GIGO. (E.g. ext[234].)

Or your filesystem internally uses something well-specified, say,
UTF-16 (like NTFS/VFAT), and it'll have to be converted to something
POSIX API compatible, _in practice per mount point_. Generally, by default,
this is UTF-8 these days.

But you can see how there might be complications if you have a UTF-8
mount and you try to use a non-UTF-8 locale in some software.

The current best solution to this impedance matching is indeed to just
de facto always use UTF-8 on the interface and forget the 8-bit legacy
encodings.

--
Mikko Rauhala - mjr@iki.fi - http://rauhala.org

Re: How to use utf8 encoded strings on linux?

<t3btlm$e86$5@gonzo.revmaps.no-ip.org>


https://www.novabbs.com/devel/article-flat.php?id=21243&group=comp.lang.c#21243

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx09.iad.POSTED!not-for-mail
From: use...@revmaps.no-ip.org (Jasen Betts)
Newsgroups: comp.lang.c,alt.os.linux
Subject: Re: How to use utf8 encoded strings on linux?
Organization: JJ's own news server
Message-ID: <t3btlm$e86$5@gonzo.revmaps.no-ip.org>
References: <9c455595-1d12-4780-b9d5-b61c5e860509n@googlegroups.com>
<820d28bc-a67b-47d7-bf66-f4f25db7fcc4n@googlegroups.com>
<87pmlmezf7.fsf@nosuchdomain.example.com>
<a5c5171a-e0eb-4434-8bfd-366ccb639b15n@googlegroups.com>
<t391t4$hl1$1@dont-email.me>
<f66021a7-ca75-4605-8b13-c61452cd10bfn@googlegroups.com>
<8735ifftfe.fsf@nosuchdomain.example.com>
<5ed2accf-0303-46d5-979c-aadb789c643fn@googlegroups.com>
<t39p9i$21bsp$1@news.mixmin.net> <87mtgnebuo.fsf@nosuchdomain.example.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 15 Apr 2022 13:58:14 -0000 (UTC)
Injection-Info: gonzo.revmaps.no-ip.org; posting-host="localhost:127.0.0.1";
logging-data="14598"; mail-complaints-to="usenet@gonzo.revmaps.no-ip.org"
User-Agent: slrn/1.0.3 (Linux)
Lines: 32
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Fri, 15 Apr 2022 14:00:56 UTC
Date: Fri, 15 Apr 2022 13:58:14 -0000 (UTC)
X-Received-Bytes: 3032
 by: Jasen Betts - Fri, 15 Apr 2022 13:58 UTC

On 2022-04-14, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
> Sams Lara <samlara622@gmail.com> writes:
>> On 14/04/2022 18:59, Thiago Adams wrote:
>>> Right...but in this case can everyone agree that Linux uses Unicode and
>>> it is encoded with utf8?
>>
>> Your best bet is to ask on the Linux Newsgroup. You don't have to
>> subscribe to that group. Just cross-post to that group and reply will
>> come on this newsgroup. Their newsgroup is:
>>
>> "alt.os.linux"
>>
>> I have done this here so hopefully somebody will reply.
>
> But you didn't provide enough context. We were talking about how file
> names are encoded in a file system.
>
> My understanding is that in most Linux file system implementations, file
> names are uninterpreted byte sequences (excluding '/' and '\0') that are
> commonly *displayed* as their UTF-8 interpretations. I *think* that you
> could create a file whose name is not a valid UTF-8 string, and the file
> system wouldn't have any problem with it, but commands like "ls" might
> have to do something special. I haven't tried it.

I just did try it. UTF-8 is not a requirement for filenames; any
sequence of bytes (except NUL and slash) will work.
The file I created shows as "zzz�@�X (invalid encoding)" in a GTK file
selector and as "'zzz'$'\377''@'$'\200''X'" in ls output.
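The experiment can be repeated from C. This is a minimal sketch, assuming a POSIX system and a byte-transparent file system such as ext4 or tmpfs; the function name name_round_trips is made up for this example. It creates an empty file with the given name, checks that readdir() returns exactly the same bytes, and removes the file again.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create dir/name, look for the identical byte
 * sequence in the directory listing, then delete the file.  Returns
 * true only if the file was created and its name came back unchanged. */
bool name_round_trips(const char *dir, const char *name)
{
    char path[4096];
    int len = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (len < 0 || (size_t)len >= sizeof path)
        return false;

    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return false;               /* the file system refused the name */
    close(fd);

    bool found = false;
    DIR *d = opendir(dir);
    if (d != NULL) {
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, name) == 0) {   /* byte-wise match */
                found = true;
                break;
            }
        }
        closedir(d);
    }
    unlink(path);
    return found;
}
```

Calling name_round_trips("/tmp", "zzz\xFF@\x80X") uses the same bytes as the file described above. A file system that validates or normalizes names may make open() fail or hand back a different name, in which case the function returns false.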

--
Jasen.
