novaBBS - comp.lang.c - Re: Simple(?) Unicode questions

Simple(?) Unicode questions

<ul13hl$24kg5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30347&group=comp.lang.c#30347

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.c
Subject: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 08:04:20 +0100
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <ul13hl$24kg5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 07:04:21 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9332dd265ead81445844bbccdb834d57";
logging-data="2249221"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/1ppPBaJ4UAoXkiS1sjkEI"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:B23ZWcO0s/gAkl17x9GR6oaMx6Q=
X-Mozilla-News-Host: news://news.eternal-september.org:119
X-Enigmail-Draft-Status: N1110

by: Janis Papanagnou - Sat, 9 Dec 2023 07:04 UTC

After decades I'm again writing some C code and intended to use some
Unicode characters for output. I'm using C99. I have two questions.

I am able to inline the character in the code like: printf ("█\n");

But I also want to make it a printf argument: printf ("%c\n", '█');
which doesn't work (at least not in the depicted way).

And I want to declare such characters, like: char ch = '█';
which also doesn't work, and neither does: wchar_t ch = '█';
And ideally the character should not be copy/pasted into the code
but given by some standard representation like '\u2588' (or so).

Without giving all the gory details about the "problems of Unicode",
are there practical answers to those questions that "simply work"
and reliably?

I have experimented and observed that working with strings at least
*seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
Is that an acceptable/reliable and the usual way in C to tackle the
issue?

Thanks.

Janis

Re: Simple(?) Unicode questions

<ul1oel$3aems$1@i2pn2.org>

https://www.novabbs.com/devel/article-flat.php?id=30348&group=comp.lang.c#30348

Path: i2pn2.org!.POSTED!not-for-mail
From: rich...@damon-family.org (Richard Damon)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 08:01:09 -0500
Organization: i2pn2 (i2pn.org)
Message-ID: <ul1oel$3aems$1@i2pn2.org>
References: <ul13hl$24kg5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 13:01:09 -0000 (UTC)
Injection-Info: i2pn2.org;
logging-data="3488476"; mail-complaints-to="usenet@i2pn2.org";
posting-account="diqKR1lalukngNWEqoq9/uFtbkm5U+w3w6FQ0yesrXg";
User-Agent: Mozilla Thunderbird
In-Reply-To: <ul13hl$24kg5$1@dont-email.me>
Content-Language: en-US

by: Richard Damon - Sat, 9 Dec 2023 13:01 UTC

On 12/9/23 2:04 AM, Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output. I'm using C99. I have two questions.

C has several things that count as a "character":

- "char", which holds a single "narrow" character,
- character strings, which can represent multibyte characters,
- "wchar_t", which holds a "wide" character as a single unit.

>
> I am able to inline the character in the code like: printf ("█\n");

Because, while it isn't a single "narrow character", it can be
converted into a multibyte character string that represents that
character.
>
> But I also want to make it a printf argument: printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).

Because it isn't a "narrow character" and thus can't be put into a
single "char".

>
> And I want to declare such characters, like: char ch = '█';
> which also doesn't work, and neither does: wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).

You can use wchar_t ch = L'█'; or wchar_t ch = L'\u2588';
The key is that you are creating a WIDE character, not a narrow character.

>
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
>
> I have experimented and observed that working with strings at least
> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
>
> Thanks.
>
> Janis

You need to decide whether you will represent the larger set of
characters always as wide characters, or as multibyte character
strings.

Most often the choice is multibyte character strings, as wide
characters are less well supported on most systems.
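
The difference between the two representations can be seen by counting
storage units. The following is only a minimal sketch: the helper names
are made up for illustration, and the multibyte form is written out as
explicit UTF-8 bytes on the assumption that UTF-8 is the encoding in use.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Assumption: UTF-8 is the multibyte encoding, so U+2588 is spelled out
   as its three UTF-8 bytes instead of relying on the compiler's
   execution character set. */
static const char    block_mb[]   = "\xE2\x96\x88"; /* multibyte string */
static const wchar_t block_wide[] = L"\u2588";      /* wide string      */

/* Hypothetical helper: number of char units in the multibyte form. */
size_t block_mb_units(void)
{
    return strlen(block_mb);
}

/* Hypothetical helper: number of wchar_t units in the wide form. */
size_t block_wide_units(void)
{
    return wcslen(block_wide);
}
```

The multibyte form needs three char units for the one character, which
is why it fits in a string but not in a single char; the wide form
needs one wchar_t unit.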

Re: Simple(?) Unicode questions

<ul1vbr$289m4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30349&group=comp.lang.c#30349

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 15:59:08 +0100
Organization: A noiseless patient Spider
Lines: 116
Message-ID: <ul1vbr$289m4$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 14:59:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a99ef3a62fdff44e0be2aa28d81d0633";
logging-data="2369220"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+VZNFcbD79e2DwbLbPw6gq"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17.1
Cancel-Lock: sha1:3UijE7vFoqglvJGETMoTwLzvUtc=
In-Reply-To: <ul13hl$24kg5$1@dont-email.me>

by: jak - Sat, 9 Dec 2023 14:59 UTC

Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output. I'm using C99. I have two questions.
>
> I am able to inline the character in the code like: printf ("█\n");
>
> But I also want to make it a printf argument: printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).
>
> And I want to declare such characters, like: char ch = '█';
> which also doesn't work, and neither does: wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).
>
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
>
> I have experimented and observed that working with strings at least
> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
>
> Thanks.
>
> Janis
>

Hi,
You merged two questions together. I will try to separate them.
Initialization of wchar_t types:
just as char strings can be initialized with string literals:

char str[] = "Hello";

the same can be done for wchar_t type strings using the prefix L:

wchar_t wstr[] = L"Hello";
wchar_t wstr[] = L"█\n";
wchar_t wstr[] = L"\u2588\n";

A similar thing is possible for individual characters:

char ch = 'a';
wchar_t wch = L'a';

With the prefix L, it is therefore possible to use extended characters:

wchar_t wch = L'█';
or:
wchar_t wch = 0x2588;
or:
wchar_t wch = L'\u2588';
or:
wchar_t wch = L"\u2588"[0];
or:
wchar_t wch = *L"█";

printf also has a corresponding length modifier ('l') for the wchar_t
type:

printf("%s", str);
printf("%ls", wstr);

printf("%c", ch);
printf("%lc", wch);

But it would be more correct to use the wide-character version of the
function (on many occasions it is also more convenient):

wprintf(L"%ls", wstr);
wprintf(L"%lc", wch);

However, remember to configure the locale to match your terminal,
otherwise the characters you see may not be the ones you expect, or you
may see nothing at all. Calling 'setlocale' allows the program to
convert between the character it prints and the encoding that
corresponds to the locale of your terminal.
To explain myself better: if I write a program that prints an extended
Unicode character and my terminal uses UTF-8, then unless the program
converts the character from its Unicode code point to UTF-8 I will not
see anything. To prove it I will send the character to a file:

$> cat foo.c
#include <stdio.h>
#include <stddef.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wchar_t wch = L'\u2588';
    FILE *fp;

    setlocale(LC_ALL, "");

    if((fp = fopen("char.txt", "wb")) != NULL)
    {
        fwprintf(fp, L"%lc", wch);
        fclose(fp);
    }
    return 0;
}

$> hexdump -C char.txt
00000000 e2 96 88 |...|
00000003

As you can see, the bytes written are not the same as the character
code that I sent. With Python it is easy to show the conversion:

$> python
>>> u'\u2588'.encode('utf-8')
b'\xe2\x96\x88'

Re: Simple(?) Unicode questions

<=H=fRiU4BbThlUWDM@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=30350&group=comp.lang.c#30350

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!paganini.bofh.team!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 15:12:55 -0000 (UTC)
Organization: To protect and to server
Message-ID: <=H=fRiU4BbThlUWDM@bongo-ra.co>
References: <ul13hl$24kg5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 15:12:55 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="2439763"; posting-host="9H7U5kayiTdk7VIdYU44Rw.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
Cancel-Lock: sha256:Yg3MGqtvVFeVLGCcyJkFJPyAXyFmrkXI3t5cT//bksg=
X-Organisation: Weyland-Yutani
X-Server-Commands: nowebcancel
X-Notice: Filtered by postfilter v. 0.9.3

by: Spiros Bousbouras - Sat, 9 Dec 2023 15:12 UTC

On Sat, 9 Dec 2023 08:04:20 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output. I'm using C99. I have two questions.

I assume you have already read <ul1oel$3aems$1@i2pn2.org>. I will add
that for printf() and wide characters or strings you need %lc and
%ls respectively. The C99 standard says for %ls:

If an l length modifier is present, the argument shall be a pointer to
the initial element of an array of wchar_t type. Wide characters from the
array are converted to multibyte characters (each as if by a call to the
wcrtomb function, with the conversion state described by an mbstate_t
object initialized to zero before the first wide character is converted)
up to and including a terminating null wide character. The resulting
multibyte characters are written up to (but not including) the
terminating null character (byte).

I don't think there is a standard way to determine which conversions
wcrtomb() can handle. Moreover, those depend on the setting of the
LC_CTYPE locale category.
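
What a program can do is check the result of each conversion. A minimal
sketch, with a helper name invented for illustration; it relies only on
wcrtomb() returning (size_t)-1 when the wide character has no multibyte
representation in the current locale:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character to its multibyte form
   in the current LC_CTYPE locale. buf must have room for MB_LEN_MAX
   bytes. Returns the number of bytes written, or -1 if the locale
   cannot represent the character. */
int wide_to_multibyte(wchar_t wc, char *buf)
{
    mbstate_t state;
    size_t n;

    memset(&state, 0, sizeof state);   /* initial conversion state */
    n = wcrtomb(buf, wc, &state);
    if (n == (size_t)-1)
        return -1;                     /* errno is set to EILSEQ */
    return (int)n;
}
```

Whether a call with L'\u2588' succeeds depends on the locale in effect,
so a caller would normally invoke setlocale(LC_ALL, "") first and treat
-1 as "cannot be displayed here".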

My own approach would be to do as much as possible in my own code. A lot
depends on whether you need to pass your own characters (of whatever type) to
some external library which expects a specific type like wchar_t or not.
There are many different scenarios so I will cover what would be most likely
to occur in my own code.

- No external library involved.
- Output encoded in UTF-8
- The text editor I use to write the code stores everything as UTF-8.

With the above assumptions I would simply use ordinary C strings and put
UTF-8 in them like "ΑΒΓΔΕΖΗΘ..." and output them in the ordinary way.
It's not guaranteed to work but it most likely will.

If I want to use Unicode codepoints directly, I will store them as
unsigned long, which is guaranteed to be wide enough to cover the whole
range of codepoint values; in contrast, it is conforming for wchar_t
to cover no greater range than char. Converting from codepoints to UTF-8
is an easy and pleasant exercise. So I may have

typedef unsigned long codepoint ;
codepoint my_wide_string = { 0x2588 , ... } ;

Then convert from that to UTF-8 and output the UTF-8 octets.

With this approach you can store the codepoints in whatever textual
representation you want , say in some configuration file and read that
during the start up of your programme.
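
The conversion mentioned above could look roughly like this. It is a
sketch rather than anyone's actual code: the function name utf8_encode
is invented here, and it rejects surrogates and values above U+10FFFF
by returning 0.

```c
#include <stddef.h>

typedef unsigned long codepoint;

/* Encode cp as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies above U+10FFFF. */
size_t utf8_encode(codepoint cp, unsigned char *out)
{
    if (cp < 0x80UL) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                        /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x2588 this yields the three octets E2 96 88, which can then be
written with fwrite() or putchar() without involving the locale.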

[...]

> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).

Why is that ? It seems to me that it makes the code harder to understand.

> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?

What works reliably depends a lot on what you're trying to do. Unicode in
general is messy.

> I have experimented and observed that working with strings at least
> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?

If you do

char * ch = "\u2588" ;
size_t i ;
for (i = 0 ; ch[i] != 0 ; i++) {
    printf("%d " , ch[i]) ;
}
puts("") ;

what output do you get ? I will guess that you see the bytes
226 150 136 .

--
vlaho.ninja/menu

Re: Simple(?) Unicode questions

<77XeQojqcvfK7uNgV@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=30351&group=comp.lang.c#30351

Path: i2pn2.org!i2pn.org!news.samoylyk.net!paganini.bofh.team!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 15:32:44 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <77XeQojqcvfK7uNgV@bongo-ra.co>
References: <ul13hl$24kg5$1@dont-email.me> <ul1vbr$289m4$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 15:32:44 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="485723f14abef935367e2535ef31609e";
logging-data="2378821"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/J2LCpFwoQM9nIOkb8F6JG"
Cancel-Lock: sha1:EMJq/NJtww+MYdgzBamAlmlYwNc=
X-Organisation: Weyland-Yutani
X-Server-Commands: nowebcancel
In-Reply-To: <ul1vbr$289m4$1@dont-email.me>

by: Spiros Bousbouras - Sat, 9 Dec 2023 15:32 UTC

On Sat, 9 Dec 2023 15:59:08 +0100
jak <nospam@please.ty> wrote:
> To explain myself better if I write a program that prints an extended
> unicode character and my terminal uses the UTF-8 characters if the
> program does not convert the character from Unicode to UTF-8 I will not
> see anything. To prove it I will send the character to a file:
>
> $> cat foo.c
> #include <stdio.h>
> #include <stddef.h>
> #include <wchar.h>
> #include <locale.h>
>
> int main()
> {
> wchar_t wch = L'\u2588';
> FILE *fp;
>
> setlocale(LC_ALL, "");
>
> if((fp = fopen("char.txt", "wb")) != NULL)
> {
> fwprintf(fp, L"%lc", wch);
> fclose(fp);
> }
> return 0;
> }
>
> $> hexdump -C char.txt
> 00000000 e2 96 88 |...|
> 00000003
>
> As you can see the character code is not the same that I sent.

In what way is it not the same as what you sent? With hexdump you
can only hope to see octets, regardless of what the octets encode. So
you read back the octets which are the UTF-8 encoding of codepoint
U+2588. What you got is exactly what I would expect to see. If you
use a terminal which supports UTF-8 and has the necessary font and
you do

cat char.txt

what do you see ? I expect you will see the block character.
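
That e2 96 88 really is U+2588 can be checked by undoing the encoding
by hand: (0xE2 & 0x0F) << 12 is 0x2000, (0x96 & 0x3F) << 6 is 0x580,
and 0x88 & 0x3F is 0x08, which add up to 0x2588. A small sketch of the
same arithmetic, with an invented function name and an invented error
value, handling only three-byte sequences:

```c
/* Hypothetical error value returned for malformed input. */
#define UTF8_INVALID 0xFFFFFFFFUL

/* Decode one three-byte UTF-8 sequence starting at s. Returns the code
   point, or UTF8_INVALID if the bytes are not a well-formed three-byte
   sequence (wrong lead or continuation bits, overlong form, surrogate). */
unsigned long utf8_decode3(const unsigned char *s)
{
    unsigned long cp;

    if ((s[0] & 0xF0) != 0xE0 ||
        (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return UTF8_INVALID;

    cp = ((unsigned long)(s[0] & 0x0F) << 12)
       | ((unsigned long)(s[1] & 0x3F) << 6)
       |  (unsigned long)(s[2] & 0x3F);

    if (cp < 0x800UL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return UTF8_INVALID;
    return cp;
}
```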

> With python it is easy to highlight the conversion:
>
> $> python
> >>> u'\u2588'.encode('utf-8')
> b'\xe2\x96\x88'

Re: Simple(?) Unicode questions

<ul26dl$29c3i$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30352&group=comp.lang.c#30352

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 17:59:32 +0100
Organization: A noiseless patient Spider
Lines: 103
Message-ID: <ul26dl$29c3i$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <=H=fRiU4BbThlUWDM@bongo-ra.co>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 16:59:33 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="6cfad004dd3318952238aaa7c9ffb7a6";
logging-data="2404466"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19EgDYdE9q35dSQFcqDiUay"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:EFz/SppcL2xruI/Fanog2TtJlhk=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <=H=fRiU4BbThlUWDM@bongo-ra.co>

by: Janis Papanagnou - Sat, 9 Dec 2023 16:59 UTC

Thanks Richard, jak, and Spiros, for your explanations!

Some comments on the net about building wrappers around libraries,
and whatnot, irritated me.

In my initial tries I got confused about the error/warning message;
I had omitted the 'L' prefix for the character literal definition.
So that hint helped to get some assurance here.

On 09.12.2023 16:12, Spiros Bousbouras wrote:
>
> My own approach would be to do as much as possible in my own code.

Same here.

If possible, I want to avoid external libraries, unnecessary
dependencies, and language constructs that are not guaranteed to
work reliably or that are non-portable, and I like simplicity and
transparency.

> A lot
> depends on whether you need to pass your own characters (of whatever type) to
> some external library which expects a specific type like wchar_t or not.
> There are many different scenarios so I will cover what would be most likely
> to occur in my own code.

My requirements are quite trivial and there's no exchange of data
between systems, processes, or applications. It's only data to be
displayed at the local screen.

>
> - No external library involved.
> - Output encoded in UTF-8
> - The text editor I use to write the code stores everything as UTF-8.
>
> With the above assumptions I would simply use ordinary C strings and put
> UTF-8 in them like "ΑΒΓΔΕΖΗΘ..." and output them in the ordinary way.

> It's not guaranteed to work but it most likely will.

That exactly was my uncertainty.

> [...]
>
>> And ideally the character should not be copy/pasted into the code
>> but given by some standard representation like '\u2588' (or so).
>
> Why is that ? It seems to me that it makes the code harder to understand.

I'm not encoding non-latin texts (like your Greek example above).

In my case the characters are just "graphical candy", so it's not
important to "read" them; a comment behind the \u encoding appears
to me to be sufficient.

It may also be a habit to have a program coded as ASCII source;
during my first decades of programming there were no languages
that I used that supported anything else than ASCII (or EBCDIC,
or even less, like 6-bit character sets, in some cases [CDC]).

This way (so my assumption goes) also fewer things will possibly
go wrong. I also never programmed in languages where the program
could be written in one's native (non-English) language by using
Unicode or UTF-8 encoding. I think I had the possibility in Java
(but those days were nothing but an episode as seen from today).

>
> What works reliably depends a lot on what you're trying to do. Unicode in
> general is messy.

Yeah, that's why I want to keep it as simple as possible; but it
should of course work reliably.

>
>> I have experimented and observed that working with strings at least
>> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
>> Is that an acceptable/reliable and the usual way in C to tackle the
>> issue?
>
> If you do
>
> char * ch = "\u2588"
> size_t i ;
> for (i = 0 ; ch[i] != 0 ; i++) {
> printf("%d " , ch[i]) ;
> }
> puts("") ;
>
> what output do you get ? I will guess that you see the bytes
> 226 150 136 .

Almost. I get the complementary values: -30 -106 -120

But why are you asking? - To show that "\u2588" is internally
represented by a [UTF-8] code sequence? - Ideally the interface
should not make me care about internal representations. :-)

The explanations and hints were all helpful - thanks again!

Janis

Re: Simple(?) Unicode questions

<5zg=MX9oDHkwG45=8@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=30353&group=comp.lang.c#30353

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 17:19:56 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 83
Message-ID: <5zg=MX9oDHkwG45=8@bongo-ra.co>
References: <ul13hl$24kg5$1@dont-email.me> <=H=fRiU4BbThlUWDM@bongo-ra.co> <ul26dl$29c3i$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 17:19:56 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="485723f14abef935367e2535ef31609e";
logging-data="2410641"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18ccZ3/UKaZkxCHqhWy11yv"
Cancel-Lock: sha1:aLWLdP9cmM23mbvf7VCG76+8Sk8=
X-Organisation: Weyland-Yutani
X-Server-Commands: nowebcancel
In-Reply-To: <ul26dl$29c3i$1@dont-email.me>

by: Spiros Bousbouras - Sat, 9 Dec 2023 17:19 UTC

On Sat, 9 Dec 2023 17:59:32 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

> On 09.12.2023 16:12, Spiros Bousbouras wrote:
> > Why is that ? It seems to me that it makes the code harder to understand.
>
> I'm not encoding non-latin texts (like your Greek example above).
>
> In my case the characters are just "graphical candy", so it's not
> important to "read" them; a comment behind the \u encoding appears
> to me to be sufficient.

Well , it's your code. If it is some kind of block characters based
"art" then it may even be more important to be able to see it in the
source.

> It may also be a habit to have a program coded as ASCII source;
> during my first decades of programming there were no languages
> that I used that supported anything else than ASCII (or EBCDIC,
> or even less, like 6-bit character sets, in some cases [CDC]).
>
> This way (so my assumption goes) also less things will possibly
> go wrong. I also never programmed in languages where the program
> could be written in ones native (non-English) language by using
> Unicode or UTF-8 encoding. I think I had the possibility in Java
> (but these days were nothing but an episode as seen from today).

In this age , it is probably an unnecessarily restrictive habit. If
anything , you *should* try to go beyond ASCII whenever it would be
useful so that you will get to see what works and what doesn't. I
think you will find that a lot just works.

> > What works reliably depends a lot on what you're trying to do. Unicode in
> > general is messy.
>
> Yeah, that's why I want to keep it as simple as possible; but it
> should of course work reliably.
>
> >
> >> I have experimented and observed that working with strings at least
> >> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
> >> Is that an acceptable/reliable and the usual way in C to tackle the
> >> issue?
> >
> > If you do
> >
> > char * ch = "\u2588"
> > size_t i ;
> > for (i = 0 ; ch[i] != 0 ; i++) {
> > printf("%d " , ch[i]) ;
> > }
> > puts("") ;
> >
> > what output do you get ? I will guess that you see the bytes
> > 226 150 136 .
>
> Almost. I get the complementary values: -30 -106 -120

Ahh yes, of course, my mistake. By the way, that's one of the things which
is not guaranteed by the standard to work. If char has the range from -128
to 127, then converting a value >= 128 to char falls under the rule that

either the result is implementation-defined or an implementation-defined
signal is raised.

But almost certainly it will work.
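
If the unsigned octet values are what you want to see, the usual idiom
is to convert each char to unsigned char before printing. A minimal
sketch; format_bytes is a name made up for illustration, and the input
is given as explicit UTF-8 bytes so that it does not depend on how the
compiler encodes "\u2588":

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the byte values of s into out as decimal
   numbers separated by single spaces. Returns the number of characters
   written, or -1 if out is too small. */
int format_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;
    size_t i;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    for (i = 0; s[i] != '\0'; i++) {
        const char *sep = (i == 0) ? "" : " ";
        /* The cast maps a negative char such as -30 to 226. */
        int n = snprintf(out + used, outsize - used, "%s%d",
                         sep, (unsigned char)s[i]);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Called with "\xE2\x96\x88" this produces 226 150 136 whether plain char
is signed or unsigned.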

> But why are you asking? - To show that "\u2588" is internally
> represented by a [UTF-8] code sequence?

Yes. If it is (which seems to be so in your case), that's a good sign
that you can keep things simple and avoid conversions and wide
characters.

> Ideally the interface
> should not make me care about internal representations. :-)

--
My sister, also a conductor, once explained to the board of one of her
orchestras why she wouldn't let them play Mozart in her first season;
"Mozart" she said, "is the string bikini of composers, and I just think that
we, as an orchestra, don't have the body to pull it off yet."
https://kennethwoods.net/blog1/2012/06/25/which-would-you-rather-conduct-or-joining-the-mozart-protection-society/

Re: Simple(?) Unicode questions

<3wItOAziDrtvid93G@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=30354&group=comp.lang.c#30354

Path: i2pn2.org!i2pn.org!news.niel.me!news.gegeweb.eu!gegeweb.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 17:40:03 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 14
Message-ID: <3wItOAziDrtvid93G@bongo-ra.co>
References: <ul13hl$24kg5$1@dont-email.me> <=H=fRiU4BbThlUWDM@bongo-ra.co>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 9 Dec 2023 17:40:03 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="485723f14abef935367e2535ef31609e";
logging-data="2416461"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+FHvXo2ooFKdnbSS5YTOoz"
Cancel-Lock: sha1:cD2N4AQ17vrtJlcuOVnsMPFe8ZU=
In-Reply-To: <=H=fRiU4BbThlUWDM@bongo-ra.co>
X-Server-Commands: nowebcancel
X-Organisation: Weyland-Yutani

by: Spiros Bousbouras - Sat, 9 Dec 2023 17:40 UTC

On Sat, 9 Dec 2023 15:12:55 -0000 (UTC)
Spiros Bousbouras <spibou@gmail.com> wrote:
> If I want to use directly unicode codepoints I will encode them as
> unsigned long which is guaranteed to be wide enough to cover the whole
> range of codepoints values ; in contrast , it is conforming for wchar_t
> to cover no greater range than char .Converting from codepoints to UTF-8
> is an easy and pleasant exercise. So I may have
>
> typedef unsigned long codepoint ;
> codepoint my_wide_string = { 0x2588 , ... } ;

codepoint my_wide_string[...] = { 0x2588 , ... } ;

> Then convert from that to UTF-8 and output the UTF-8 octets.

Re: Simple(?) Unicode questions

<ul290e$29osl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30355&group=comp.lang.c#30355

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 18:43:41 +0100
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <ul290e$29osl$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <=H=fRiU4BbThlUWDM@bongo-ra.co>
<ul26dl$29c3i$1@dont-email.me> <5zg=MX9oDHkwG45=8@bongo-ra.co>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 9 Dec 2023 17:43:42 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="a0fa9fe624fbed563e9f5369a22a45c7";
logging-data="2417557"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX180jIgwLu0yfd69VOL8Xt02"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:65Vcw6AwU1d6dLzDNdgkAWr1ccA=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <5zg=MX9oDHkwG45=8@bongo-ra.co>

by: Janis Papanagnou - Sat, 9 Dec 2023 17:43 UTC

On 09.12.2023 18:19, Spiros Bousbouras wrote:
> On Sat, 9 Dec 2023 17:59:32 +0100
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>>
>> In my case the characters are just "graphical candy", so it's not
>> important to "read" them; a comment behind the \u encoding appears
>> to me to be sufficient.
>
> Well , it's your code. If it is some kind of block characters based
> "art" then it may even be more important to be able to see it in the
> source.

I actually do have them visibly in my code; but non-functional,
as a comment. That way I have both, the [functional] safety and
the "readability". (And I don't mind the redundancy here.)

BTW, I also had situations the other way round, where I encode
characters programmatically and add comments with their values
(in decimal, hex, or binary, as it fits best for the purpose).
As an example, I had a case with similar or even equal glyphs,
and I wanted to have them specified exactly. A copy/paste from
some Web resource would, in my book, not have been good enough
for specification purposes; you couldn't tell them apart.

Janis

Re: Simple(?) Unicode questions

<ul29pc$29sdi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30356&group=comp.lang.c#30356

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 9 Dec 2023 18:57:01 +0100
Organization: A noiseless patient Spider
Lines: 107
Message-ID: <ul29pc$29sdi$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <ul1vbr$289m4$1@dont-email.me>
<77XeQojqcvfK7uNgV@bongo-ra.co>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 9 Dec 2023 17:57:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="9e107ad516a1304684092e6d4a00a5fe";
logging-data="2421170"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+78HNA64fmFnev3sDINeQy"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17.1
Cancel-Lock: sha1:1xORu1fG7jeSdhAls1DZ86MDMSA=
In-Reply-To: <77XeQojqcvfK7uNgV@bongo-ra.co>

by: jak - Sat, 9 Dec 2023 17:57 UTC

Spiros Bousbouras wrote:
> On Sat, 9 Dec 2023 15:59:08 +0100
> jak <nospam@please.ty> wrote:
>> To explain myself better if I write a program that prints an extended
>> unicode character and my terminal uses the UTF-8 characters if the
>> program does not convert the character from Unicode to UTF-8 I will not
>> see anything. To prove it I will send the character to a file:
>>
>> $> cat foo.c
>> #include <stdio.h>
>> #include <stddef.h>
>> #include <wchar.h>
>> #include <locale.h>
>>
>> int main()
>> {
>> wchar_t wch = L'\u2588';
>> FILE *fp;
>>
>> setlocale(LC_ALL, "");
>>
>> if((fp = fopen("char.txt", "wb")) != NULL)
>> {
>> fwprintf(fp, L"%lc", wch);
>> fclose(fp);
>> }
>> return 0;
>> }
>>
>> $> hexdump -C char.txt
>> 00000000 e2 96 88 |...|
>> 00000003
>>
>> As you can see the character code is not the same that I sent.
>
> In what way is it not the same as what you sent ? With hexdump you
> can only hope to see octets regardless of what the octets encode. So
> you read back the octets which are the UTF-8 encoding of codepoint
> U+2588. What you got is exactly what I would expect to see. If you
> use a terminal which supports UTF-8 and has the necessary font and
> you do
>

Sorry but your comment is not clear to me. I gave this explanation
because it seemed to me that it was not clear to the OP that a
conversion takes place during the printf. Also I wouldn't take what
you say for granted:

$> cat foo.c
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

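int main(void)
{
    /* Overlay the wide string with a byte view of the same storage. */
    union {
        /* sized to cover w; a zero-length array is not standard C */
        unsigned char c[sizeof(wchar_t[10])];
        wchar_t w[10];
    } str = {.w = L"\u2588"};

    setlocale(LC_ALL, "");

    /* Dump the in-memory bytes up to the first zero byte. */
    printf("\nraw data: ");
    for(size_t i = 0; str.c[i] != '\0'; i++)
        printf("%02X ", str.c[i]);
    printf("\n");

    FILE *fp;
    if((fp = fopen("char.txt", "wb")))
    {
        /* The library converts the wide string on output. */
        fwprintf(fp, L"%ls", str.w);
        fclose(fp);
    }
}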

compiled with gcc:
$> gcc foo.c -o foo
$> foo

raw data: 88 25

$> od -t x1 char.txt
0000000 e2 96 88
0000003

compiled with tcc:
$> tcc foo.c
$> foo

raw data: 88 25

$> od -t x1 char.txt
0000000 88 25
0000002

oops...

> cat char.txt
>
> what do you see ? I expect you will see the block character.
>
>> With python it is easy to highlight the conversion:
>>
>> $> python
>> >>> u'\u2588'.encode('utf-8')
>> b'\xe2\x96\x88'

Re: Simple(?) Unicode questions

<877clnxd5o.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=30357&group=comp.lang.c#30357

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 09 Dec 2023 13:46:11 -0800
Organization: None to speak of
Lines: 30
Message-ID: <877clnxd5o.fsf@nosuchdomain.example.com>
References: <ul13hl$24kg5$1@dont-email.me> <=H=fRiU4BbThlUWDM@bongo-ra.co>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="d6185c06b9fbffcced2c74244103958c";
logging-data="2488052"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18jchu+DnmB/rqH5jXO5LSo"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:8FwP/WekUtx+GJ4vCzgjGphNQBE=
sha1:KEnKf7xxUed4xqQs4tHOMNKJLyw=

by: Keith Thompson - Sat, 9 Dec 2023 21:46 UTC

Spiros Bousbouras <spibou@gmail.com> writes:
[...]
> If I want to use directly unicode codepoints I will encode them as
> unsigned long which is guaranteed to be wide enough to cover the whole
> range of codepoints values ; in contrast , it is conforming for wchar_t
> to cover no greater range than char.
[...]

The C standard requires wchar_t to be: "an integer type whose range of
values can represent distinct codes for all members of the largest
extended character set specified among the supported locales".

Yes, it's conforming for wchar_t to cover a range no wider than char,
but only if the implementation has no extended character sets wider than
char.

On Linux-based systems, wchar_t is typically 32 bits, more than enough
to cover all Unicode codepoints. On Windows, however, wchar_t is
generally only 16 bits, which (I think) is non-conforming.

(Microsoft started to support Unicode when the standard specified only
up to 2**16 codepoints, so UCS-2 was sufficient. When Unicode expanded
beyond the Basic Multilingual Plane, Microsoft handled it by supporting
UTF-16, a variable-length encoding composed of 16-bit characters.
Inertia made it too difficult to expand wchar_t from 16 to 32 bits.)

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Simple(?) Unicode questions

<ulb729$3t0bp$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30358&group=comp.lang.c#30358

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: spen...@yeah.net (spender)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Wed, 13 Dec 2023 11:05:45 +0800
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <ulb729$3t0bp$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 13 Dec 2023 03:05:45 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="b0dcf317e3c7186f3962ac1a4d14ac1a";
logging-data="4096377"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/NPI1/gpSHfTPRquWCKo9O"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
Cancel-Lock: sha1:OrTWR2XEcT96pB5ZOfraIG63HCE=
In-Reply-To: <ul13hl$24kg5$1@dont-email.me>

by: spender - Wed, 13 Dec 2023 03:05 UTC

printf("%c",ch), the ch must <0xFF, <255

In c lang, The character must be a character of an ASCII table, i.e. <
(int)255. A string is a collection of characters.

On 2023/12/9 15:04, Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output. I'm using C99. I have two questions.
>
> I am able to inline the character in the code like: printf ("█\n");
>
> But I also want to make it a printf argument: printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).
>
> And I want to declare such characters, like: char ch = '█';
> which also doesn't work, and neither does: wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).
>
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
>
> I have experimented and observed that working with strings at least
> *seems* to work: char * ch = "\u2588"; printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
>
> Thanks.
>
> Janis

Re: Simple(?) Unicode questions

<ulb859$12gh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30360&group=comp.lang.c#30360

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Wed, 13 Dec 2023 04:24:25 +0100
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <ulb859$12gh$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 13 Dec 2023 03:24:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2c7ffa95e74c13f76949958ab9163ac4";
logging-data="35345"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+jqesj1DM96pBLheImJv8b"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:PCOnzSXZTHAHxSI/oDFOvLcd0XY=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <ulb729$3t0bp$1@dont-email.me>

by: Janis Papanagnou - Wed, 13 Dec 2023 03:24 UTC

> On 2023/12/9 15:04, Janis Papanagnou wrote:
>> [...] intended to use some Unicode characters for output. [...]

On 13.12.2023 04:05, spender wrote:
> printf("%c",ch), the ch must <0xFF, <255

The question was about the output of multi-octet Unicode characters,
it was not about single octet characters.

Though the question has also already been addressed by the other
replies, so don't bother.

>
> In c lang, The character must be a character of an ASCII table,
> i.e. < (int)255. A string is a collection of characters.

(Note, ASCII is 7 bit.) In the C language ordinary single-octet
characters may have values of -128..+127 or 0..255, depending on
whether the char type is defined as signed or unsigned.

And you can also output Unicode characters, as has been shown in
this thread.

Janis

Re: Simple(?) Unicode questions

<87wmtiwzkc.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=30361&group=comp.lang.c#30361

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Tue, 12 Dec 2023 19:28:51 -0800
Organization: None to speak of
Lines: 27
Message-ID: <87wmtiwzkc.fsf@nosuchdomain.example.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="c34340931401a0293bee67ad839d033e";
logging-data="34025"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX190acV50FI30PaExdOJqsD9"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:TZoOvJ0kyy7EhbBzoqK8woO6Whc=
sha1:Q1XOteYeDkrXlKFJW4w4lUP1vpw=

by: Keith Thompson - Wed, 13 Dec 2023 03:28 UTC

spender <spender@yeah.net> writes:
> printf("%c",ch), the ch must <0xFF, <255
>
> In c lang, The character must be a character of an ASCII table, i.e. <
> (int)255. A string is a collection of characters.
[...]

Not exactly.

C doesn't require ASCII; there are implementations that use EBCDIC, for
example.

The argument corresponding to a "%c" format specifier is of type int,
and is converted to unsigned char. Conversion to unsigned char is well
defined for values outside the range of unsigned char (the value wraps
around), which can be useful if the argument is a negative char value
promoted to int.

Typically UCHAR_MAX is 255, so the value after conversion will be >= 0
and <= 255 (note "<=", not "<"). Exotic implementations might have
UCHAR_MAX > 255, but such implementations are typically freestanding,
and therefore needn't support printf.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Simple(?) Unicode questions

<ulbg3i$1m9v$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30365&group=comp.lang.c#30365

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jameskuy...@alumni.caltech.edu (James Kuyper)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Wed, 13 Dec 2023 00:40:01 -0500
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <ulbg3i$1m9v$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 13 Dec 2023 05:40:02 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="90780e7acabb42838b4dc5fef4013593";
logging-data="55615"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/7osdTiW/1RT4cF2GXfl0C56SYd6vQsB4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:YW/Wxf2g5kF0a7y1JW5/gLklGKQ=
In-Reply-To: <ulb729$3t0bp$1@dont-email.me>
Content-Language: en-US

by: James Kuyper - Wed, 13 Dec 2023 05:40 UTC

On 12/12/23 22:05, spender wrote:
> printf("%c",ch), the ch must <0xFF, <255

The only 'ch' in the code that you responded to was declared as "char
*", not char, and that value was used with a "%s" format specifier, for
which char* is the appropriate type.
*ch has char type, and as such must have a value between CHAR_MIN and
CHAR_MAX. If char is signed, CHAR_MIN == SCHAR_MIN, and SCHAR_MIN <=
-128. If char is unsigned, CHAR_MAX == UCHAR_MAX, and UCHAR_MAX >= 255.
Those are inequalities, not equalities, because 8 is the minimum value
for CHAR_BIT, rather than the only permitted value, and there are
real-world systems with other sizes (not many, to be fair), with
CHAR_BIT==16 being the most common alternative.

When ch is passed to printf(), it gets converted to unsigned char. The
maximum resulting value is UCHAR_MAX, which as noted above, is allowed
to be >255.

> In c lang, The character must be a character of an ASCII table, i.e. <

There is no such requirement. The standard explicitly describes the
encoding recognized by C standard library functions such as printf() as
implementation-defined and locale-dependent, and describes it as a
multibyte encoding, though MB_CUR_MAX and MB_LEN_MAX are both allowed to
== 1.

On most Unix-like platforms, the default encoding is UTF-8. For
characters that can be represented in a single byte, that is equivalent
to 7-bit ASCII, not 8-bit, so the maximum is 127, not 255. There are
also a number of other encodings still in use, such as EBCDIC.

The standard only mentions ASCII twice, both times in non-normative
footnotes:
"17) The trigraph sequences enable the input of characters that are not
defined in the Invariant Code Set as described in ISO/IEC 646, which is
a subset of the seven-bit US ASCII code set."

In footnote 215 it mentions 7-bit ASCII as an example, not as something
that is mandated.

Re: Simple(?) Unicode questions

<ulcgm5$sopg$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30367&group=comp.lang.c#30367

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: lew.pitc...@digitalfreehold.ca (Lew Pitcher)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Wed, 13 Dec 2023 14:56:06 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <ulcgm5$sopg$1@dont-email.me>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 13 Dec 2023 14:56:06 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="ccd17c67258609c523631e4357991431";
logging-data="942896"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19kOdZkalHdvRLwwGm7FFDiEEbqF2CFT7A="
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
git://git.gnome.org/pan2)
Cancel-Lock: sha1:fpkRUc3dSr7QSoAaLDksBhzTZC0=

by: Lew Pitcher - Wed, 13 Dec 2023 14:56 UTC

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

> printf("%c",ch), the ch must <0xFF, <255

Not quite.
1) ch /must/ represent an integer value.
2) ch /should/ represent a C char value. Note that a C char /is not/
defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
with implementation-defined sign, where CHAR_BIT is /at least/
8 bits. printf() will happily /mis-interpret/ any other integer
for you, when given the '%c' format specifier.

> In c lang, The character must be a character of an ASCII table, i.e. <
> (int)255. A string is a collection of characters.

Nonsense.

1) The C language does /not/ specify the representation
of char, other than its size in bits and whether or not it carries
a sign. The C language has been implemented in EBCDIC environments
(for instance), which is not even close to ASCII.

2) ASCII is a 7-bit encoding scheme; all valid ASCII values exist between
0 and 127. /Some software/ extends ASCII to 8 bits, with the high-order
bit either extending the character set, or representing some
meta-characteristic (such as parity or sign).

--
Lew Pitcher
"In Skills We Trust"

Re: Simple(?) Unicode questions

<86wmt2tx80.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=30395&group=comp.lang.c#30395

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Mon, 25 Dec 2023 02:03:59 -0800
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <86wmt2tx80.fsf@linuxsc.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me> <ulcgm5$sopg$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="cf83539a01e1a6fb89ca8ebe59d59ac1";
logging-data="3173107"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/EtFxgbkl9sZ9xdhEwnladb+6kd6Al3qU="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:sfkiUEanSdr+EcWWNUL0wwzepSs=
sha1:KPbnKd/G18LOQUE1nZ59nRe+/M0=

by: Tim Rentsch - Mon, 25 Dec 2023 10:03 UTC

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>
>> printf("%c",ch), the ch must <0xFF, <255
>
> Not quite.
> 1) ch /must/ represent an integer value.

More specifically, it must have a type that is or promotes
to int, or a type that is or promotes to unsigned int, with
a value that is in the common range of int and unsigned int.

> 2) ch /should/ represent a C char value. Note that a C char /is not/
> defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
> with implementation-defined sign, where CHAR_BIT is /at least/
> 8 bits. [...]

This part isn't exactly right. Any value in the range of char
is okay. However, any value in the range of unsigned char is
also okay. The type 'int' for the argument is meant to include
values returned by, for example, getchar(), and such functions
always return non-negative values (not counting EOF). The rules
for character input/output functions generally convert characters
to unsigned char, and such values are meant to be admissible as
arguments for a %c conversion specifier.

Re: Simple(?) Unicode questions

<87bkadx5s6.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=30399&group=comp.lang.c#30399

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Mon, 25 Dec 2023 14:43:05 -0800
Organization: None to speak of
Lines: 23
Message-ID: <87bkadx5s6.fsf@nosuchdomain.example.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
<ulcgm5$sopg$1@dont-email.me> <86wmt2tx80.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="7ba1a8c7b6f632e582b825503d230ce2";
logging-data="3360295"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+TYPAOA5WAbILWgpaCP2E2"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:WBa41nQwpX3WEHxBDWiXnVbC14U=
sha1:TBQMhmqtox2HOIixAA0wnZO0IZs=

by: Keith Thompson - Mon, 25 Dec 2023 22:43 UTC

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>
>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>
>>> printf("%c",ch), the ch must <0xFF, <255
>>
>> Not quite.
>> 1) ch /must/ represent an integer value.
>
> More specifically, it must have a type that is or promotes
> to int, or a type that is or promotes to unsigned int, with
> a value that is in the common range of int and unsigned int.

Not quite. "If no l length modifier is present, the int argument is
converted to an unsigned char, and the resulting character is written."
For example printf("%c", -193) is equivalent to printf("%c", 63), which
assuming an ASCII-based character set will print '?'.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: Simple(?) Unicode questions

<86frytjplg.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=31234&group=comp.lang.c#31234

Path: i2pn2.org!i2pn.org!nntp.comgw.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Fri, 19 Jan 2024 07:43:39 -0800
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <86frytjplg.fsf@linuxsc.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me> <ulbg3i$1m9v$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="4317632f5246f6fddd21ccc8d476a409";
logging-data="3367669"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/8AlcnXxLe2w46g9c8eor9QyBW8ZyoXdk="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:8Hdro212wlRpPMp3MiZkV7z3DGM=
sha1:utaraTyDkhCe3/EKHWrjMnFJJYc=

by: Tim Rentsch - Fri, 19 Jan 2024 15:43 UTC

James Kuyper <jameskuyper@alumni.caltech.edu> writes:

> On 12/12/23 22:05, spender wrote:
>
>> printf("%c",ch), the ch must <0xFF, <255
>
> The only 'ch' in the code that you responded to was declared as
> "char *", not char, [...]

The posting in question also gave declarations

char ch = [...];

and

wchar_t ch = [...];

Re: Simple(?) Unicode questions

<8634urkiyx.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=31301&group=comp.lang.c#31301

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 20 Jan 2024 09:33:42 -0800
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <8634urkiyx.fsf@linuxsc.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me> <ulcgm5$sopg$1@dont-email.me> <86wmt2tx80.fsf@linuxsc.com> <87bkadx5s6.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="1385652e36adeece913480d5ff53e83e";
logging-data="3963028"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QMR0kATC6lv+xASahSQFLnIVZjcc02Ok="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:H3YxXvRvplRsWDzquD/ecKjzrZA=
sha1:vV00QcEQ1xs6pNaITk4eJRP5Tck=

by: Tim Rentsch - Sat, 20 Jan 2024 17:33 UTC

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>>
>>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>>
>>>> printf("%c",ch), the ch must <0xFF, <255
>>>
>>> Not quite.
>>> 1) ch /must/ represent an integer value.
>>
>> More specifically, it must have a type that is or promotes
>> to int, or a type that is or promotes to unsigned int, with
>> a value that is in the common range of int and unsigned int.
>
> Not quite. "If no l length modifier is present, the int argument
> is converted to an unsigned char, and the resulting character is
> written." For example printf("%c", -193) is equivalent to
> printf("%c", 63), which assuming an ASCII-based character set will
> print '?'.

The rule for arguments to printf() is the same as the rule for
accessing variadic arguments using va_arg(). That has always
been true, although not expressed clearly in early versions of
the C standard. Fortunately that shortcoming is addressed in
the upcoming C23 (is it still not yet ratified?): in N3096,
paragraph 9 in section 7.23.6.1 says in part

fprintf shall behave as if it uses va_arg with a type
argument naming the type resulting from applying the
default argument promotions to the type corresponding
to the conversion specification [...]

and the rule for va_arg (in 7.16.1.1 p2) says in part

one type is a signed integer type, the other type is
the corresponding unsigned integer type, and the value
is representable in both types

So supplying an unsigned int argument is okay, provided of
course the value is in the range of values of signed int.

Re: Simple(?) Unicode questions

<87v87nbqbk.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=31304&group=comp.lang.c#31304

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Sat, 20 Jan 2024 14:19:43 -0800
Organization: None to speak of
Lines: 75
Message-ID: <87v87nbqbk.fsf@nosuchdomain.example.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me>
<ulcgm5$sopg$1@dont-email.me> <86wmt2tx80.fsf@linuxsc.com>
<87bkadx5s6.fsf@nosuchdomain.example.com> <8634urkiyx.fsf@linuxsc.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: dont-email.me; posting-host="b0c2652ca9a7add253eb420c247e1ee2";
logging-data="4053314"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/x2C0RVVEUJnspNU7Jbkri"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:1n5tLNXM/hVQq/wNx5Vbsy0m6ag=
sha1:FzFSwvb2/ezQ+lKOQK7lbxAGDgw=

by: Keith Thompson - Sat, 20 Jan 2024 22:19 UTC

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>>> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>>>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>>>> printf("%c",ch), the ch must <0xFF, <255
>>>>
>>>> Not quite.
>>>> 1) ch /must/ represent an integer value.
>>>
>>> More specifically, it must have a type that is or promotes
>>> to int, or a type that is or promotes to unsigned int, with
>>> a value that is in the common range of int and unsigned int.
>>
>> Not quite. "If no l length modifier is present, the int argument
>> is converted to an unsigned char, and the resulting character is
>> written." For example printf("%c", -193) is equivalent to
>> printf("%c", 63), which assuming an ASCII-based character set will
>> print '?'.
>
> The rule for arguments to printf() is the same as the rule for
> accessing variadic arguments using va_arg(). That has always
> been true, although not expressed clearly in early versions of
> the C standard. Fortunately that shortcoming is addressed in
> the upcoming C23 (is it still not yet ratified?): in N3096,
> paragraph 9 in section 7.23.6.1 says in part
>
> fprintf shall behave as if it uses va_arg with a type
> argument naming the type resulting from applying the
> default argument promotions to the type corresponding
> to the conversion specification [...]
>
> and the rule for va_arg (in 7.16.1.1 p2) says in part
>
> one type is a signed integer type, the other type is
> the corresponding unsigned integer type, and the value
> is representable in both types
>
> So supplying an unsigned int argument is okay, provided of
> course the value is in the range of values of signed int.

Re-reading what you wrote, I think I misunderstood your intent (and I
think what you wrote was ambiguous).

"%c" specifies an int argument.

You wrote:

More specifically, it must have a type that is or promotes to int,
or a type that is or promotes to unsigned int, with a value that is
in the common range of int and unsigned int.

I read that as:

More specifically,
(it must have a type that is or promotes to int, or a type that is
or promotes to unsigned int),
with a value that is in the common range of int and unsigned int.

which would incorrectly imply that a negative int value is not allowed.

It's now clear to me that what you meant was:

More specifically,
(it must have a type that is or promotes to int),
or
(a type that is or promotes to unsigned int, with a value that is in
the common range of int and unsigned int).

I agree with that.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */

Re: Simple(?) Unicode questions

<86ede6do3h.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=31644&group=comp.lang.c#31644

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tr.17...@z991.linuxsc.com (Tim Rentsch)
Newsgroups: comp.lang.c
Subject: Re: Simple(?) Unicode questions
Date: Wed, 24 Jan 2024 20:38:26 -0800
Organization: A noiseless patient Spider
Lines: 79
Message-ID: <86ede6do3h.fsf@linuxsc.com>
References: <ul13hl$24kg5$1@dont-email.me> <ulb729$3t0bp$1@dont-email.me> <ulcgm5$sopg$1@dont-email.me> <86wmt2tx80.fsf@linuxsc.com> <87bkadx5s6.fsf@nosuchdomain.example.com> <8634urkiyx.fsf@linuxsc.com> <87v87nbqbk.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="233ecb52840d096af87a0e664f2c83b5";
logging-data="2297708"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+7r/IAu5lVC4QHmhpqxQ9xjD/5HIudlXY="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:3g17XEQKgFY9D8udVwDBBwsUX9A=
sha1:SLiNPdx8zMLGRO8bXnk35WF0hqY=

by: Tim Rentsch - Thu, 25 Jan 2024 04:38 UTC

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>>
>>> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>>>
>>>> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>>>>
>>>>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>>>>
>>>>>> printf("%c",ch), the ch must <0xFF, <255
>>>>>
>>>>> Not quite.
>>>>> 1) ch /must/ represent an integer value.
>>>>
>>>> More specifically, it must have a type that is or promotes
>>>> to int, or a type that is or promotes to unsigned int, with
>>>> a value that is in the common range of int and unsigned int.
>>>
>>> Not quite. "If no l length modifier is present, the int argument
>>> is converted to an unsigned char, and the resulting character is
>>> written." For example printf("%c", -193) is equivalent to
>>> printf("%c", 63), which assuming an ASCII-based character set will
>>> print '?'.
>>
>> The rule for arguments to printf() is the same as the rule for
>> accessing variadic arguments using va_arg(). That has always
>> been true, although not expressed clearly in early versions of
>> the C standard. Fortunately that shortcoming is addressed in
>> the upcoming C23 (is it still not yet ratified?): in N3096,
>> paragraph 9 in section 7.23.6.1 says in part
>>
>> fprintf shall behave as if it uses va_arg with a type
>> argument naming the type resulting from applying the
>> default argument promotions to the type corresponding
>> to the conversion specification [...]
>>
>> and the rule for va_arg (in 7.16.1.1 p2) says in part
>>
>> one type is a signed integer type, the other type is
>> the corresponding unsigned integer type, and the value
>> is representable in both types
>>
>> So supplying an unsigned int argument is okay, provided of
>> course the value is in the range of values of signed int.
>
> Re-reading what you wrote, I think I misunderstood your intent (and I
> think what you wrote was ambiguous).
>
> "%c" specifies an int argument.
>
> You wrote:
>
> More specifically, it must have a type that is or promotes to int,
> or a type that is or promotes to unsigned int, with a value that is
> in the common range of int and unsigned int.
>
> I read that as:
>
> More specifically,
> (it must have a type that is or promotes to int, or a type that is
> or promotes to unsigned int),
> with a value that is in the common range of int and unsigned int.
>
> which would incorrectly imply that a negative int value is not allowed.
>
> It's now clear to me that what you meant was:
>
> More specifically,
> (it must have a type that is or promotes to int),
> or
> (a type that is or promotes to unsigned int, with a value that is in
> the common range of int and unsigned int).
>
> I agree with that.

Right. Sorry for the confusion.

Within a computer, natural language is unnatural.

