How to read in a (long) UTF-8 file, incrementally?

<d1c5ba75-bc0a-4e7b-a2df-394bc710cbcen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6263&group=comp.lang.ada#6263

by: Marius Amado-Alves - Tue, 2 Nov 2021 17:42 UTC

As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

Now, Unicode files usually are in UTF-8.

One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. Can be a long XML object, for example.

So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

Thanks a lot.

Re: How to read in a (long) UTF-8 file, incrementally?

<slrvcr$1inu$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6264&group=comp.lang.ada#6264

by: Dmitry A. Kazakov - Tue, 2 Nov 2021 18:17 UTC

On 2021-11-02 18:42, Marius Amado-Alves wrote:

> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

You simply read a stream of Characters into a buffer. Never ever use
Wide or Wide_Wide, they are useless. Inside the buffer you must have 4
Characters ahead unless the file end is reached. Usually you read until
some separator like line end.

Then you call this:

http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Get

That will give you a code point and advance the cursor to the next UTF-8
character.
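
To illustrate what such a call does, here is a minimal hand-written sketch. It is not the Strings_Edit.UTF8.Get implementation or its exact signature; the name Get_Code_Point and its parameters are assumptions made for this example. It decodes one UTF-8 sequence at a cursor and advances the cursor. Overlong encodings and surrogates are not checked.

```ada
with Ada.Text_IO;

procedure Decode_Demo is

   subtype Code_Point is Natural range 0 .. 16#10FFFF#;

   --  Hypothetical helper (not the Strings_Edit API): decodes the UTF-8
   --  sequence starting at Source (Pointer), returns the code point in
   --  Value and moves Pointer to the first byte after the sequence.
   procedure Get_Code_Point
     (Source  : String;
      Pointer : in out Positive;
      Value   : out Code_Point)
   is
      Lead   : constant Natural := Character'Pos (Source (Pointer));
      Length : Positive;
      Result : Natural;
   begin
      if Lead < 16#80# then               --  0xxxxxxx
         Length := 1;
         Result := Lead;
      elsif Lead in 16#C2# .. 16#DF# then --  110xxxxx
         Length := 2;
         Result := Lead mod 16#20#;
      elsif Lead in 16#E0# .. 16#EF# then --  1110xxxx
         Length := 3;
         Result := Lead mod 16#10#;
      elsif Lead in 16#F0# .. 16#F4# then --  11110xxx
         Length := 4;
         Result := Lead mod 16#08#;
      else
         raise Constraint_Error with "invalid UTF-8 lead byte";
      end if;

      if Pointer + Length - 1 > Source'Last then
         raise Constraint_Error with "truncated UTF-8 sequence";
      end if;

      --  Each continuation byte (10xxxxxx) contributes six bits.
      for I in Pointer + 1 .. Pointer + Length - 1 loop
         declare
            Byte : constant Natural := Character'Pos (Source (I));
         begin
            if Byte not in 16#80# .. 16#BF# then
               raise Constraint_Error with "invalid continuation byte";
            end if;
            Result := Result * 64 + Byte mod 64;
         end;
      end loop;

      Value   := Result;  --  range check rejects values above 16#10FFFF#
      Pointer := Pointer + Length;
   end Get_Code_Point;

   --  'a', U+00B1 (C2 B1) and U+20AC (E2 82 AC) as raw UTF-8 bytes
   Text : constant String :=
     'a' & Character'Val (16#C2#) & Character'Val (16#B1#)
     & Character'Val (16#E2#) & Character'Val (16#82#)
     & Character'Val (16#AC#);
   Pos  : Positive := Text'First;
   CP   : Code_Point;
begin
   while Pos <= Text'Last loop
      Get_Code_Point (Text, Pos, CP);
      Ada.Text_IO.Put_Line (Code_Point'Image (CP));
   end loop;
end Decode_Demo;
```

For the sample bytes this prints the code points 97, 177 and 8364, one per line.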

However, normally, no text processing task needs that. Whatever you want
to do, you can accomplish it using normal String operations and normal
String-based data structures like maps and tables. You need not care
about any UTF-8 character boundaries ever.
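
A small sketch of why plain String operations are often enough: UTF-8 is self-synchronizing, so a search for a validly encoded pattern can only match at a code point boundary. The example uses the standard Ada.Strings.Fixed.Index; the sample text and the procedure name are made up for illustration.

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Search_Demo is
   --  UTF-8 bytes of U+00B1 PLUS-MINUS SIGN
   Plus_Minus : constant String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   --  A line of UTF-8 text held in an ordinary String
   Line : constant String := "x = 5 " & Plus_Minus & " 0.1";

   --  Byte index of the first occurrence, 0 if absent
   Found : constant Natural := Ada.Strings.Fixed.Index (Line, Plus_Minus);
begin
   Ada.Text_IO.Put_Line (Natural'Image (Found));
end Search_Demo;
```

The result is the byte index 7, which is also where the two-byte sequence begins.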

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: How to read in a (long) UTF-8 file, incrementally?

<90818857-b379-4c2a-81a2-f988ce8598ban@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6266&group=comp.lang.ada#6266

by: Vadim Godunko - Wed, 3 Nov 2021 07:43 UTC

On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.
>
> Now, Unicode files usually are in UTF-8.
>
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.
>
> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. Can be a long XML object, for example.
>
> So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?
>
There is a special library for processing Unicode text, see https://github.com/AdaCore/VSS; its 'contrib' directory contains the VSS.Utils.File_IO package, which loads a file into a Virtual_String. However, loading the whole file into memory is usually a bad decision.
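
For comparison, incremental reading can also be sketched with the standard library alone. The example below is an assumption-laden sketch, not VSS code: it reads fixed-size chunks with Ada.Streams.Stream_IO, decodes only the complete UTF-8 sequences with Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode, and carries an incomplete trailing sequence over to the next chunk. The buffer size, the helper names and the choice to merely count code points are illustrative.

```ada
with Ada.Command_Line;
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Read_Chunks is
   use Ada.Streams;

   File   : Stream_IO.File_Type;
   Buffer : Stream_Element_Array (1 .. 4096);
   Kept   : Stream_Element_Offset := 0;  --  bytes carried over
   Last   : Stream_Element_Offset;
   Whole  : Stream_Element_Offset;
   Count  : Natural := 0;

   --  Sequence length announced by a lead byte
   function Sequence_Length
     (Lead : Stream_Element) return Stream_Element_Offset is
     (if Lead < 16#80# then 1
      elsif Lead < 16#E0# then 2
      elsif Lead < 16#F0# then 3
      else 4);

   --  Index of the last byte of Buffer (1 .. Upto) that ends a
   --  complete sequence
   function Last_Complete
     (Upto : Stream_Element_Offset) return Stream_Element_Offset
   is
      Start : Stream_Element_Offset := Upto;
   begin
      if Upto = 0 then
         return 0;
      end if;
      --  Step back over continuation bytes to the last lead byte.
      while Start > 1 and then Buffer (Start) in 16#80# .. 16#BF# loop
         Start := Start - 1;
      end loop;
      if Start + Sequence_Length (Buffer (Start)) - 1 > Upto then
         return Start - 1;  --  last sequence is cut off
      else
         return Upto;
      end if;
   end Last_Complete;

   function To_String (Item : Stream_Element_Array) return String is
      Result : String (1 .. Item'Length);
   begin
      for I in Item'Range loop
         Result (Positive (I - Item'First + 1)) := Character'Val (Item (I));
      end loop;
      return Result;
   end To_String;

begin
   Stream_IO.Open (File, Stream_IO.In_File, Ada.Command_Line.Argument (1));
   loop
      Stream_IO.Read (File, Buffer (Kept + 1 .. Buffer'Last), Last);
      exit when Last = Kept;  --  nothing new was read: end of file
      Whole := Last_Complete (Last);
      declare
         Text : constant Wide_Wide_String :=
           Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode
             (To_String (Buffer (1 .. Whole)));
      begin
         Count := Count + Text'Length;  --  process Text here
      end;
      --  Move the incomplete tail to the front for the next round.
      Kept := Last - Whole;
      Buffer (1 .. Kept) := Buffer (Whole + 1 .. Last);
   end loop;
   Stream_IO.Close (File);
   Ada.Text_IO.Put_Line ("code points:" & Natural'Image (Count));
end Read_Chunks;
```

The program expects the file name as its first command-line argument. Invalid input makes Decode raise Encoding_Error, and a sequence cut off at the very end of the file is silently dropped in this sketch.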

Re: How to read in a (long) UTF-8 file, incrementally?

<sltigk$43o$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6267&group=comp.lang.ada#6267

by: Luke A. Guest - Wed, 3 Nov 2021 08:48 UTC

On 02/11/2021 17:42, Marius Amado-Alves wrote:
> As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

You can take a look at my simple lib: https://github.com/Lucretia/uca

> Now, Unicode files usually are in UTF-8.
>
> One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

It can read into a large string buffer.

> If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. Can be a long XML object, for example.

And can break it up into lines. There are no Unicode consistency checks.

The lib is a bit hacky, but seems to work for now. There's nothing more
than what I've mentioned so far.

Re: How to read in a (long) UTF-8 file, incrementally?

<c1973b0d-7f3e-487f-8766-586b2d8c69edn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6268&group=comp.lang.ada#6268

by: Marius Amado-Alves - Thu, 4 Nov 2021 11:43 UTC

Great libraries, thanks.

It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

if C = '±' then ...

And Wide_Wide_Character'Pos should give the codepoint.
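
A short sketch of that idea, using the standard Ada.Strings.UTF_Encoding.Wide_Wide_Strings package. To stay independent of the compiler's source encoding setting, the character is written as Wide_Wide_Character'Val (16#B1#) rather than as a literal; the names in the example are illustrative.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Pos_Demo is
   use Ada.Strings.UTF_Encoding;

   --  UTF-8 bytes of U+00B1 PLUS-MINUS SIGN, as they would come from a file
   Encoded : constant UTF_8_String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   Decoded : constant Wide_Wide_String := Wide_Wide_Strings.Decode (Encoded);

   Plus_Minus : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#B1#);

   C : constant Wide_Wide_Character := Decoded (Decoded'First);
begin
   if C = Plus_Minus then
      --  'Pos yields the code point as an integer
      Ada.Text_IO.Put_Line (Integer'Image (Wide_Wide_Character'Pos (C)));
   end if;
end Pos_Demo;
```

Here 'Pos yields 177, which is 16#B1#.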

Re: How to read in a (long) UTF-8 file, incrementally?

<sm0ion$1m0r$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6269&group=comp.lang.ada#6269

by: Dmitry A. Kazakov - Thu, 4 Nov 2021 12:13 UTC

On 2021-11-04 12:43, Marius Amado-Alves wrote:
> Great libraries, thanks.
>
> It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
>
> if C = '±' then ...

If the source supports Unicode, it should do UTF-8 as well. So, you
would write

if C = "±" then ...

where C is String.

> And Wide_Wide_Character'Pos should give the codepoint.

Yes, but you need no Wide_Wide to get an integer value and if your
objective is Unicode categorization, that is too complicated for manual
comparisons. Use a library function [generated from UnicodeData.txt]
instead.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: How to read in a (long) UTF-8 file, incrementally?

<sm0qss$1vl7$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6271&group=comp.lang.ada#6271

by: Luke A. Guest - Thu, 4 Nov 2021 14:30 UTC

On 04/11/2021 11:43, Marius Amado-Alves wrote:
> Great libraries, thanks.
>
> It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
>
> if C = '±' then ...
>
> And Wide_Wide_Character'Pos should give the codepoint.
>

Characters no longer exist as a thing as one can even be represented as
multiple utf-32 code points.

Re: How to read in a (long) UTF-8 file, incrementally?

<1c6b151b-f017-496d-b381-ba08bef1bbb7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6273&group=comp.lang.ada#6273

by: Marius Amado-Alves - Fri, 5 Nov 2021 10:56 UTC

> Characters no longer exist as a thing as one can even be represented as
> multiple utf-32 code points.

You're alluding to combining characters?

Re: How to read in a (long) UTF-8 file, incrementally?

<lymtmixtqi.fsf@pushface.org>

https://www.novabbs.com/devel/article-flat.php?id=6274&group=comp.lang.ada#6274

by: Simon Wright - Fri, 5 Nov 2021 19:55 UTC

Marius Amado-Alves <amado.alves@gmail.com> writes:

>> Characters no longer exist as a thing as one can even be represented as
>> multiple utf-32 code points.
>
> You're alluding to combining characters?

Fun & games on macOS[1]:

> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
>
> The reason for this apparently-bizarre message is that macOS takes the
> composed form (lowercase a acute) and converts it under the hood to
> what HFS+ insists on, the fully decomposed form (lowercase a,
> combining acute); thus the names are actually different even though
> they _look_ the same.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
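
The difference between the two forms is visible at the byte level. The sketch below is a made-up illustration, not taken from the bug report. It builds both spellings of "á" as UTF-8 Strings and compares them.

```ada
with Ada.Text_IO;

procedure Forms_Demo is
   --  Composed: U+00E1 LATIN SMALL LETTER A WITH ACUTE, UTF-8 C3 A1
   Composed : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A1#);

   --  Decomposed: 'a' followed by U+0301 COMBINING ACUTE ACCENT,
   --  UTF-8 61 CC 81
   Decomposed : constant String :=
     'a' & Character'Val (16#CC#) & Character'Val (16#81#);
begin
   --  Same appearance on screen, different byte strings
   Ada.Text_IO.Put_Line (Boolean'Image (Composed = Decomposed));
   Ada.Text_IO.Put_Line
     (Natural'Image (Composed'Length) & Natural'Image (Decomposed'Length));
end Forms_Demo;
```

The comparison yields FALSE, and the lengths are 2 and 3 bytes respectively.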

Re: How to read in a (long) UTF-8 file, incrementally?

<f0d17e38-58c7-4914-ab9c-8632cecc8215n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6298&group=comp.lang.ada#6298

by: Marius Amado-Alves - Tue, 16 Nov 2021 11:55 UTC

I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)

Re: How to read in a (long) UTF-8 file, incrementally?

<sn08jf$pkq$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6299&group=comp.lang.ada#6299

by: Dmitry A. Kazakov - Tue, 16 Nov 2021 12:36 UTC

On 2021-11-16 12:55, Marius Amado-Alves wrote:
> I'm worried. I need the concept of character, for proper text processing.

Simply ignore or reject decomposed characters.

> For example, I want to reference characters in a text file by their position.

That is no problem either. There are two alternatives:

1. Fixed font representation. Reduce everything to normal glyphs, use
string position corresponding to the beginning of a UTF-8 sequence.

2. Proportional font. Use a graphical user interface like GTK. The GTK
text buffer has a data type (iterator) to indicate a place in the
buffer, e.g. when a selection happens. These iterators are consistent
with the glyphs as rendered on the screen and you can convert between
them and string position.
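
For the first alternative, mapping a code point number to the string position where its UTF-8 sequence begins can be sketched as follows. The function name Byte_Index and the convention of returning 0 for "not found" are assumptions made for this example. Every byte that is not a continuation byte (10xxxxxx) starts a new code point.

```ada
with Ada.Text_IO;

procedure Position_Demo is

   function Is_Continuation (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Byte index in Source where the N-th code point (counted from 1)
   --  begins, or 0 if Source holds fewer than N code points
   function Byte_Index (Source : String; N : Positive) return Natural is
      Seen : Natural := 0;
   begin
      for I in Source'Range loop
         if not Is_Continuation (Source (I)) then
            Seen := Seen + 1;
            if Seen = N then
               return I;
            end if;
         end if;
      end loop;
      return 0;
   end Byte_Index;

   --  'a', U+00B1 (C2 B1), 'b': three code points in four bytes
   Text : constant String :=
     'a' & Character'Val (16#C2#) & Character'Val (16#B1#) & 'b';
begin
   Ada.Text_IO.Put_Line (Natural'Image (Byte_Index (Text, 3)));
end Position_Demo;
```

The third code point, 'b', starts at byte index 4.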

> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

No, Unicode disagrees, e.g. É can be composed from E and acute accent.
But it is advised just to ignore all this nonsense and consider:

code point = character

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Re: How to read in a (long) UTF-8 file, incrementally?

<88a83bf1-f1af-4252-bad1-cf86c3fa2eaen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6300&group=comp.lang.ada#6300

by: Marius Amado-Alves - Tue, 16 Nov 2021 13:52 UTC

> Simply ignore or reject decomposed characters.

Brilliant!

> 1. Fixed font representation. Reduce everything to normal glyphs, use
> string position corresponding to the beginning of a UTF-8 sequence.

I am indeed resorting to byte position in UTF-8 files as the character position. Treating UTF-8 entities as the strings that they are :-)

(Not dealing with fonts nor graphics yet, just plain text.)

Thanks a lot.

Re: How to read in a (long) UTF-8 file, incrementally?

<sn0ijs$7v2$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=6301&group=comp.lang.ada#6301

by: Luke A. Guest - Tue, 16 Nov 2021 15:25 UTC

On 16/11/2021 11:55, Marius Amado-Alves wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
>
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)
>

You can't. The concept of character is dead; the new concept is
grapheme clusters.

Re: How to read in a (long) UTF-8 file, incrementally?

<0a3065fd-1d17-416d-b640-427aca3a090bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6302&group=comp.lang.ada#6302

by: Vadim Godunko - Tue, 16 Nov 2021 17:38 UTC

On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:
> I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
>
> (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can use VSS and Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of the grapheme cluster in the string (as well as its offset in UTF-8/UTF-16 code units).

However, the concept of grapheme clusters doesn't handle special cases like tabulation stops; TAB is just a single grapheme cluster.

Re: How to read in a (long) UTF-8 file, incrementally?

<sn1401$ubi$1@franka.jacob-sparre.dk>

https://www.novabbs.com/devel/article-flat.php?id=6303&group=comp.lang.ada#6303

by: Randy Brukardt - Tue, 16 Nov 2021 20:23 UTC

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:sn08jf$pkq$1@gioia.aioe.org...
> On 2021-11-16 12:55, Marius Amado-Alves wrote:
>> I'm worried. I need the concept of character, for proper text processing.
>
> Simply ignore or reject decomposed characters.

Unicode calls that "requiring Normalization Form C". ("Form D" is all
decomposed characters.) You'll note that what Ada compilers do with text not
in Normalization Form C is implementation-defined; in particular, a compiler
could reject such text.
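
A very rough sketch of "rejecting decomposed text" is shown below. It is not a real Normalization Form C check: it only looks for code points in the Combining Diacritical Marks block (U+0300 .. U+036F), and the function name Has_Combining_Mark is an assumption made for this example. A proper check would be driven by the Unicode data files.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Reject_Demo is
   use Ada.Strings.UTF_Encoding;

   --  True if Item contains a code point from U+0300 .. U+036F.
   --  This covers only one block of combining marks.
   function Has_Combining_Mark (Item : UTF_8_String) return Boolean is
      Decoded : constant Wide_Wide_String := Wide_Wide_Strings.Decode (Item);
   begin
      for C of Decoded loop
         if Wide_Wide_Character'Pos (C) in 16#0300# .. 16#036F# then
            return True;
         end if;
      end loop;
      return False;
   end Has_Combining_Mark;

   --  U+00E1 (composed) and 'a' + U+0301 (decomposed) as UTF-8 bytes
   Composed   : constant UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A1#);
   Decomposed : constant UTF_8_String :=
     'a' & Character'Val (16#CC#) & Character'Val (16#81#);
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Has_Combining_Mark (Composed)));
   Ada.Text_IO.Put_Line (Boolean'Image (Has_Combining_Mark (Decomposed)));
end Reject_Demo;
```

The composed spelling passes the check (FALSE) and the decomposed one is flagged (TRUE).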

My understanding is that various Internet standards also require
Normalization Form C. For instance, web pages are supposed to always be in
that format. Whether browsers actually enforce that is unknown (they should
enforce a lot of stuff about web pages, but generally just try to muddle
through, which causes all kinds of security issues).

Randy.
