Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Minor idea for indirect target predictor
Date: Thu, 1 Jul 2021 15:26:05 -0500
Organization: A noiseless patient Spider
Lines: 269
Message-ID: <sbl8hi$q3f$1@dont-email.me>
References: <bbbfd05b-e065-4d1a-85c9-4afdc0905722n@googlegroups.com>
<17f5bdce-25a9-4d68-8eca-c1554947b143n@googlegroups.com>
<sbe511$src$1@gioia.aioe.org> <sbfg9g$371$1@dont-email.me>
<sbfs7g$o15$1@dont-email.me> <sbg6bf$rrm$1@dont-email.me>
<sbhng6$17b$1@gioia.aioe.org> <sbhofu$o28$1@newsreader4.netcologne.de>
<sbhukl$4a3$1@dont-email.me> <sbi11k$m95$1@dont-email.me>
<sbjush$2gp$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 1 Jul 2021 20:28:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e44588a2f6852be57873317e5ff4e19a";
logging-data="26735"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+WCNYXsfGreIZcxaPUBvBU"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:1eXC+4r6mSM07R1QhQIjqWeLR5M=
In-Reply-To: <sbjush$2gp$1@dont-email.me>
Content-Language: en-US

On 7/1/2021 3:37 AM, Marcus wrote:
> On 2021-06-30, BGB wrote:
>> On 6/30/2021 9:21 AM, David Brown wrote:
>>> On 30/06/2021 14:36, Thomas Koenig wrote:
>>>> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>>>>> Stephen Fuld wrote:
>>>>>> On 6/29/2021 12:27 PM, Ivan Godard wrote:
>>>>>> I am not a compiler guy, so forgive my ignorance.  What would a
>>>>>> compiler
>>>>>> do in such a situation?  Would it first test for something like if
>>>>>> char
>>>>>>   >= "a" and <= "z", etc.?
>>>>>>
>>>>>> Just thinking about how I might do it, I think I would have a 256
>>>>>> byte
>>>>>> array, indexed by the character value, with each entry having an
>>>>>> indicator value, say 1 for lower case letters, 2 for upper case
>>>>>> letters,
>>>>>> 3 for numbers, 4 for white space characters, etc.  Looking up the
>>>>>> value
>>>>>> shouldn't be costly as most of the table will shortly be in D-cache.
>>>>>> Then use the indicator value to index a jump or call table (which I
>>>>>> guess is fairly predictable as most characters are lower case
>>>>>> letters.
>>>>>>
>>>>>> But what do I know?  :-(
>>>>>
>>>>> About the same as the standard C lib?
>>>>>
>>>>> I.e. the classic character classification macros use just such a
>>>>> table.
>>>>>
>>>>> As I noted previously, it tends to break down a bit with utf8, but you
>>>>> can at least handle all 7-bit ascii very quickly.
>>>
>>> It breaks down a /lot/ with UTF-8 - the traditional <ctype.h>
>>> classifications such as "isdigit" or "tolower" collapse pretty quickly
>>> in the face of non-ASCII character sets.  Locales can help a bit, but
>>> you don't have to get very exotic before the answer is "it's
>>> complicated".
>>>
>>
>> If one is feeding raw byte or char values to these functions, and
>> using UTF-8 or similar, then as soon as one gets outside ASCII range
>> the answer is basically useless.
>>
>> Simple case is to ignore everything outside the ASCII range.
>>
>> Secondary option is to treat everything outside the ASCII range as if
>> it were Codepage-1252 or Latin-1.
>>
>>
>>
>> For wider characters, one would likely need some other functions, like
>> those from 'wctype.h', but then it depends on locale rather than "just
>> assume it is UTF-16 or something".
>>
>
> There's a clear difference between string encoding (ASCII, Latin-1,
> CP-1252, UTF-8, UTF-16, ...) and character encoding. The latter can
> usually be safely represented as a 32-bit integer representing a
> single Unicode code point.
>
> In my opinion, any function dealing with a single character at a time
> should use Unicode code points as input/output (rather than locale
> dependent encodings that is just a big source of bugs).
>

The behavior of some parts of the C library assumes the locale, for
example the multibyte string conversion functions, ...

Likewise, the "ctype.h" functions generally also assume locale-dependent
behavior. The simple case, though, is to just hard-wire these to raw
ASCII or CP1252 or similar.

Or, say:
"C" / "POSIX" / "EN-US":
  The 'ctype' functions assume plain ASCII;
  Most other stuff can assume CP1252.
Other (non-UTF-8):
  Assume CP1252.
UTF-8 (".UTF8" / "EN-US.UTF8" / ...):
  Most functions assume UTF-8;
  The ctype functions fall back to plain ASCII.
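The ASCII fallback for the ctype functions is trivial to hard-wire; a
minimal sketch (function names here are made up, not from any actual
implementation):

```c
#include <assert.h>

/* Hypothetical sketch: ctype-style classifiers hard-wired to plain ASCII,
 * ignoring the locale entirely. Any value >= 0x80 (e.g. UTF-8 lead or
 * continuation bytes, CP1252 high characters) simply classifies as "no". */
static int ascii_isdigit(int c) { return c >= '0' && c <= '9'; }
static int ascii_isupper(int c) { return c >= 'A' && c <= 'Z'; }
static int ascii_islower(int c) { return c >= 'a' && c <= 'z'; }
static int ascii_isalpha(int c) { return ascii_isupper(c) || ascii_islower(c); }
```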

I went looking into some of this recently, trying to figure out how
character encoding and locale were supposed to interact. Apparently,
rather than doing something "sensible", the compilers were mostly just
winging it back in the codepage days, and the closest interpretation to
"sensible" is to treat the normal string type as a raw collection of
bytes with an implicit NUL terminator on the end.

But in modern days we have gone from mostly codepage source files to
mostly UTF-8 source files, sometimes with a BOM, and some editors (such
as Visual Studio) spit out UTF-16 files if they are used for editing and
any non-ASCII characters are present.

The compiler can normalize everything to UTF-8 internally, but this
leaves some ambiguity about what happens with string literals, which
sort of results in the funky handling I ended up with (just loosely
crushing everything down into byte range in the default case).

Alternately, one would need to set a "default codepage" for the compiler
to assume for its string conversions, though this still leaves some
ambiguity. We probably still need to deal with the "code which likes to
encode binary data in strings" issue, but could potentially warn about
conversions for any codepoints which can't be represented in the
currently selected codepage. The selection would likely still be fairly
limited, and more exotic encodings would remain unlikely.

Though, according to the MSVC documentation, for example, some multibyte
string functions depend on the locale, whereas others are hard-wired to
assume UTF-8.

In my case, it ends up similar, though diverging in some cases.

I ended up doing some work on the locale stuff, and have basically ended
up with two major locales: "C"/"POSIX"/"EN-US"/... basically just assume
ASCII+CP1252 or similar; "C.UTF8"/".UTF8"/"EN-US.UTF8"/... assume UTF-8.

The C library I am using ignores pretty much everything else about
locale settings, and I will probably keep up this pattern. For general
use by multi-language programs, the original C locale system seems a bit
broken.
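With only two major behaviors, the dispatch can be keyed on nothing more
than the locale name's suffix; a sketch under that assumption (the
helper name is made up):

```c
#include <string.h>

/* Hypothetical helper: treat any locale name ending in ".UTF8" or
 * ".UTF-8" as the UTF-8 locale; every other name ("C", "POSIX",
 * "EN-US", ...) falls back to the ASCII+CP1252 behavior. */
static int locale_is_utf8(const char *name)
{
    const char *dot = strrchr(name, '.');
    if (!dot)
        return 0;
    return strcmp(dot, ".UTF8") == 0 || strcmp(dot, ".UTF-8") == 0;
}
```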

> For instance in the GLFW text input API, all input events carry an
> unsigned int representing the Unicode code point for a single character:
>
> https://www.glfw.org/docs/latest/input_guide.html#input_char
>
> ...whereas functions that deal with text strings use UTF-8 encoding,
> like glfwSetWindowTitle():
>
> https://www.glfw.org/docs/latest/window_guide.html#window_title
>
> My personal preference for C/C++ applications is to use UTF-8 strings
> everywhere internally, and whenever an external API (e.g. the Win32 API)
> is used - do the necessary string translations in conjunction with the
> API calls. On POSIX systems you usually do not have to do any
> translations since most APIs accept UTF-8 strings.
>
> I successfully used this design philosophy for BuildCache, for instance.
>

As can be noted, my implementations of most filesystem API calls assume
UTF-8 by default.

One can't really make UTF-8 or similar the default encoding for string
literals though, mostly because there is a lot of software which tries
to do things like stick raw binary data into string literals, and will
break in catastrophic ways if this behavior is not preserved. This
includes the ability to encode strings with embedded NUL characters
without the literal being implicitly truncated by the compiler, ...

I suspect this may be part of why newer versions of C added things like:
char *s;
s = u8"Whatever\U0001F346\U0001F602"; //explicitly UTF-8

Though, one can debate whether it is better to encode these directly, or
split them up into surrogate pairs encoded via UTF-8, ...

Likewise:
wchar_t *ws;
ws=L"\U0001F346\U0001F602";
Could end up as two codepoints (UTF-32), or four (UTF-16).

Whereas char16_t/char32_t are more explicit:
char16_t *ws1;
char32_t *ws2;
ws1=u"\U0001F346\U0001F602"; //four
ws2=U"\U0001F346\U0001F602"; //two

Started looking into it (partly based on the MSVC/MSDN descriptions),
and apparently, for "proper" C string semantics in plain (raw) string
literals, I may actually end up needing to double-encode them:
The string literal uses UTF-8 to encode values in the range 00..FF;
Codepoints from the C source (or escaped via '\uXXXX') first need to be
encoded as a series of UTF-8 bytes, with each of those bytes themselves
encoded as above.

It looks like 'L' and 'u' literals would still need to be encoded in a
more traditional M-UTF-8 (encoding 0000..FFFF directly, with
10000..10FFFF encoded via surrogate pairs, themselves encoded via UTF-8).
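That "surrogate pairs, then UTF-8" scheme is essentially what the CESU-8
report describes (and what Java's Modified UTF-8 does, minus the NUL
special case). A sketch of encoding a single code point that way (helper
names are made up):

```c
#include <stdint.h>

/* Encode one BMP code point (<= U+FFFF) as 1..3 UTF-8 bytes;
 * returns the number of bytes written. */
static int utf8_put_bmp(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* CESU-8 style: code points above U+FFFF are split into a UTF-16
 * surrogate pair, and each surrogate is then UTF-8 encoded (3 bytes
 * each, so 6 bytes total instead of standard UTF-8's 4). */
static int cesu8_put(uint32_t cp, unsigned char *out)
{
    if (cp <= 0xFFFF)
        return utf8_put_bmp(cp, out);
    cp -= 0x10000;
    int n = utf8_put_bmp(0xD800 | (cp >> 10), out);
    return n + utf8_put_bmp(0xDC00 | (cp & 0x3FF), out + n);
}
```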

Likewise, 'U' literals would represent the whole codepoint range
directly via UTF-8. Similarly, the valid range for '\x' character
escapes also seems to depend on the string literal type.

The descriptions are slightly confusing as to the behavioral specifics
of 'u8' string literals, though. Most likely I could leave them as
single-encoded UTF-8 and otherwise handle them like 'U' string literals
in the compiler (the least change from the current behavior; I will also
"go out on a limb" and assume people probably aren't going to try to
stick raw binary data into a 'u8' string literal and expect it to come
out the other side intact).

I still need the M-UTF-8 trick of the overlong NUL, because the compiler
needs to deal with "Foo\0Yolo\0\0" and "Foo\u0000Bar\u0000" and similar
without the string literal implicitly truncating itself, ...
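The overlong-NUL trick here is the same one Java's Modified UTF-8 uses:
U+0000 is encoded as the (normally invalid) two-byte sequence C0 80, so
an embedded NUL never produces a 0x00 byte and NUL-terminated handling
keeps working. A sketch of just that escaping step (function name made
up; a real encoder would also have to handle bytes >= 0x80):

```c
#include <stddef.h>

/* Encode n input bytes, turning each 0x00 into the overlong pair
 * C0 80 (M-UTF-8 style) so the output contains no embedded zero
 * bytes; other bytes pass through unchanged here for brevity.
 * Returns the encoded length, not counting the real terminator. */
static size_t mutf8_escape_nul(const unsigned char *in, size_t n,
                               unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0) { out[o++] = 0xC0; out[o++] = 0x80; }
        else            { out[o++] = in[i]; }
    }
    out[o] = 0;  /* the one real terminating NUL */
    return o;
}
```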

Well, and of course, old-style UCS-2 string literals represented as
"F\0o\0o\0 \0B\0a\0r\0\0\0" rather than L"Foo Bar", but this is more
just a legacy of old Win32 code and similar.

As for why '\UXXXXXXXX' has 8 hex digits, vs 5, I am not entirely sure.

This does mean though that the tokenizer will need to keep track of what
type of string literal it is dealing with, which possibly also means
that encoding-conversion for string literals would also be handled in
the tokenizer, rather than in the backend.

...

Some of the behavioral descriptions would make more sense if one assumes
the compiler uses length-prefixed raw binary string literals rather than
building everything on top of M-UTF-8 or similar, though...

>> Or, at least assuming the locale actually does something (simpler
>> answer is to hard wire it).
>>
>> In BGBCC, the patterns I followed are:
>>    Narrow strings (in C) are typically Codepage-1252 by default;
>>      Except when they are UTF-8.
>>    Wide strings are typically UTF-16.
>>
>> I went with the Windows pattern of using UTF-16 for wide-strings and
>> wchar_t, rather than the Linux pattern of using UTF-32, partly because
>> UTF-32 is impractically wasteful, and rarely useful. Similarly, one
>> still doesn't get "one character, one codepoint" even with UTF-32 when
>> it comes to things like emojis or similar, so alas...
>>
>>
>>>>
>>>> And if you don't sanitize your input, you can get "interesting"
>>>> results, such as when the value is negative for a signed char
>>>> or when testing if 99999 is printable.
>>>>
>>>> There may be a few CVEs along those lines.
>>>>
>>>
>>> The C standard library functions here all take "int" parameters, with
>>> the results defined for EOF or a value in the range of unsigned char.
>>> Giving them anything else is undefined behaviour - it is the
>>> programmer's fault, and the programmer's problem.
>>>
>>> However, range testing the value is so simple and obvious that I would
>>> not expect any C library implementation to have problems here.
>>>
>>
>> Yep.
>
