novaBBS - comp.lang.c++ - Re: Ain't that beautiful / 2

On Wed, 2024-04-10 at 14:21 +0200, Bonita Montero wrote:
> Unfortunately C++20's from_chars doesn't support wide characters.
> So I implemented my own from_chars, called parse_double, generically
> as a template which can handle any ASCII-based character set whose
> characters are integers.
> Another requirement for me was that the code should support
> zero-terminated strings as well as flat memory ranges with a beginning
> and an end. This is done with a function object which determines
> whether an iterator points to a terminating character or address. The
> default type for this function object is decltype(parse_never_ends),
> which results in any character that is invalid for the text value
> acting as a termination character. As the zero is always an invalid
> character for a floating-point value, parsing doesn't end because the
> end function object reports an end but because parsing can't find
> further characters.
> And there's an overload of parse_double which is given a start and
> an end iterator; this overload internally has its own end function
> object which compares the current reading position against the
> end iterator.
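
A minimal sketch of the end-predicate idea described above, not the quoted implementation: count_digits and its two wrappers are hypothetical names chosen for illustration. The same scanning loop serves a zero-terminated string and a bounded range; only the predicate differs.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the quoted code: counts leading decimal
// digits and stops as soon as the caller-supplied predicate reports an end.
template<typename Iterator, typename End>
std::size_t count_digits( Iterator it, End end )
{
    std::size_t n = 0;
    // test the predicate before dereferencing, so a bounded range is never overrun
    for( ; !end( it ) && *it >= '0' && *it <= '9'; ++it )
        ++n;
    return n;
}

// Zero-terminated input: the predicate never fires; scanning stops at '\0'
// only because '\0' is not a digit.
inline std::size_t count_digits_cstr( wchar_t const *str )
{
    return count_digits( str, []( wchar_t const * ) { return false; } );
}

// Bounded input: the predicate compares the position against the range end.
inline std::size_t count_digits_range( wchar_t const *first, wchar_t const *last )
{
    return count_digits( first, [last]( wchar_t const *it ) { return it == last; } );
}
```

With these definitions count_digits_cstr( L"123abc" ) yields 3, and count_digits_range( buf, buf + 2 ) never reads beyond the first two characters of buf.
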
> My code scans the digits beyond the decimal point first and stores
> them in a thread_local vector of doubles whose capacity only grows
> across multiple calls of my function for optimal performance. To
> maximize performance I'm using my own union of doubles to suppress
> default-initialization of the vector's elements. The digits are
> appended until their value becomes zero or there are no further digits.
> The values in the suffix vector are added in reverse order. If I'd
> add them in forward order the precision would be less since there
> would be mantissa digits dropped right from the final mantissa.
> That's why there's the vector of doubles.
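
The effect of summation order can be shown with a small, contrived sketch. The function names and the sample terms are assumptions made for illustration, and the result assumes IEEE-754 doubles with round-to-nearest and no extended intermediate precision: each tiny term is lost when added to 1.0, but the tiny terms survive when they are combined first.

```cpp
#include <cmath>
#include <vector>

// Adds the terms front to back (largest first, as they appear in a digit string).
inline double sum_forward( std::vector<double> const &terms )
{
    double sum = 0.0;
    for( auto it = terms.begin(); it != terms.end(); ++it )
        sum += *it;
    return sum;
}

// Adds the terms back to front (smallest first).
inline double sum_reverse( std::vector<double> const &terms )
{
    double sum = 0.0;
    for( auto it = terms.rbegin(); it != terms.rend(); ++it )
        sum += *it;
    return sum;
}

// Sample data: one large term followed by two terms of half an ulp of 1.0.
inline std::vector<double> sample_terms()
{
    double const tiny = std::ldexp( 1.0, -53 ); // 2^-53
    return { 1.0, tiny, tiny };
}
```

Here sum_forward( sample_terms() ) stays at 1.0, while sum_reverse( sample_terms() ) reaches 1.0 + 2^-52, the exact sum.
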
> Each digit's value is multiplied by a 10 ^ N value. This value is
> calculated incrementally by successive / 10.0 or * 10.0 operations.
> These successive calculations may lead to less precision than if
> this value is calculated for each digit with binary exponentiation.
> So there's a precise mode in my code which is activated through
> the first template parameter of my function, which defaults to false.
> With that each digit's value is calculated with binary exponentiation
> of 10 ^ N, but this also gives less performance. With my test
> code this gives up to four bits of additional precision.
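
The two ways of obtaining 10 ^ N can be sketched as follows. This is a simplified illustration with hypothetical names, not the quoted pow10: it scans the exponent from the lowest bit and squares on the fly, whereas the quoted code uses a precomputed table and starts at the highest bit. The cascaded form performs |N| multiplications, each of which may round; the binary form needs only about log2(|N|) squarings plus one multiplication per set bit.

```cpp
#include <cstdint>

// Cascaded: |n| successive multiplications by 10.0.
inline double pow10_cascaded( int n )
{
    double result = 1.0;
    for( int i = 0, m = n >= 0 ? n : -n; i < m; ++i )
        result *= 10.0;
    // negative exponents are handled by taking the reciprocal
    return n >= 0 ? result : 1.0 / result;
}

// Binary exponentiation (square and multiply).
inline double pow10_binary( int n )
{
    std::uint64_t bits = n >= 0 ? std::uint64_t( n ) : std::uint64_t( -std::int64_t( n ) );
    double result = 1.0, square = 10.0; // square holds 10 ^ (2 ^ k) in round k
    for( ; bits; bits >>= 1 )
    {
        // multiply in the current power if bit k of |n| is set
        if( bits & 1 )
            result *= square;
        square *= square;
    }
    return n >= 0 ? result : 1.0 / result;
}
```

For small exponents, where every intermediate power of ten is exactly representable in a double (up to 10 ^ 22), both functions return identical values; differences can only appear for larger magnitudes.
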
>
> So here's the code:
>
> // standard headers used by the code below
> #include <array>
> #include <atomic>
> #include <bit>
> #include <concepts>
> #include <cstddef>
> #include <cstdint>
> #include <iterator>
> #include <limits>
> #include <memory>
> #include <mutex>
> #include <span>
> #include <system_error>
> #include <type_traits>
> #include <vector>
>
> template<std::random_access_iterator Iterator>
> requires std::integral<std::iter_value_t<Iterator>>
> struct parse_result
> {
> std::errc ec;
> Iterator next;
> };
>
> // parse ends at first invalid character, at least at '\0'
> auto parse_never_ends = []( auto ) { return false; };
>
> template<bool Precise = false, std::random_access_iterator Iterator,
> typename End = decltype(parse_never_ends)>
> requires std::integral<std::iter_value_t<Iterator>>
> && requires( End end, Iterator it ) { { end( it ) } ->
> std::convertible_to<bool>; }
> parse_result<Iterator> parse_double( Iterator str, double &result, End
> end = End() )
> {
> using namespace std;
> static_assert(sizeof(double) == sizeof(uint64_t) &&
> numeric_limits<double>::is_iec559,
> "double must be IEEE-754 double precision");
> // mask to a double's exponent
> constexpr uint64_t EXP_MASK = 0x7FFull << 52;
> // calculate 10 ^ exp in double
> auto pow10 = []( int64_t exp ) -> double
> {
> // table for binary exponentiation with 10 ^ (2 ^ N)
> static array<double, 64> tenPows;
> // table initialized ?
> if( static atomic_bool once( false ); !once.load( memory_order_acquire ) )
> {
> // weakly no: test locked again
> static mutex onceMtx;
> lock_guard lock( onceMtx );
> if( !once.load( memory_order_relaxed ) )
> {
> // no: calculate table
> for( double p10x2xN = 10.0; double &pow : tenPows )
> pow = p10x2xN,
> p10x2xN *= p10x2xN;
> // set initialized flag with release semantics
> once.store( true, memory_order_release );
> }
> }
> // begin with 1.0 since x ^ 0 = 1
> double result = 1.0;
> // unsigned exponent
> uint64_t uExp = exp >= 0 ? exp : -exp;
> // highest set bit of exponent
> size_t bit = 63 - countl_zero( uExp );
> // bit mask to highest set bit
> uint64_t mask = 1ull << bit;
> // loop as long as there are bits in unsigned exponent
> for( ; uExp; uExp &= ~mask, mask >>= 1, --bit )
> // bit set ?
> if( uExp & mask )
> {
> // yes: multiply result by 10 ^ (2 ^ bit)
> result *= tenPows[bit];
> // overflow ?
> if( (bit_cast<uint64_t>( result ) & EXP_MASK) == EXP_MASK )
> // yes: result wouldn't change further; stop
> break;
> }
> // return 1 / result if exponent is negative
> return exp >= 0 ? result : 1.0 / result;
> };
> Iterator scn = str;
> // case-insensitive compare of a string of arbitrary character width with a C-string
> auto xstricmp = [&]( Iterator str, char const *second ) -> bool
> {
> // unsigned character-type
> using uchar_t = make_unsigned_t<iter_value_t<Iterator>>;
> auto toLower = []( uchar_t c ) -> uchar_t
> {
> // map 'A'..'Z' to 'a'..'z' so the input matches the lower-case literals
> return c + (c >= 'A' && c <= 'Z' ? 'a' - 'A' : 0);
> };
> for( ; ; ++str, ++second )
> if( !*second ) [[unlikely]]
> return true;
> else if( end( str ) ) [[unlikely]]
> return false;
> else if( toLower( *str ) != (unsigned char)*second ) [[unlikely]]
> return false;
> };
> // at end ?
> if( end( scn ) )
> // yes: err
> return { errc::invalid_argument, scn };
> // double's binary representation sign
> uint64_t binSign = 0;
> // positive sign ?
> if( *scn == '+' ) [[unlikely]]
> // at end ?
> if( end( ++scn ) ) [[unlikely]]
> // yes: err
> return { errc::invalid_argument, str };
> else;
> // negative sign ?
> else if( *scn == '-' )
> {
> // yes: remember sign
> binSign = 1ull << 63;
> // at end ?
> if( end( ++scn ) )
> // yes: err
> return { errc::invalid_argument, str };
> }
> // apply binSign to a double
> auto applySign = [&]( double d )
> {
> return bit_cast<double>( binSign | bit_cast<uint64_t>( d ) );
> };
> // NaN ?
> if( xstricmp( scn, "nan" ) ) [[unlikely]]
> {
> // yes: apply sign to NaN
> result = applySign( numeric_limits<double>::quiet_NaN() );
> return { errc(), scn + 3 };
> }
> // SNaN ?
> if( xstricmp( scn, "snan" ) ) [[unlikely]]
> {
> // yes: apply sign to NaN
> result = applySign( numeric_limits<double>::signaling_NaN() );
> return { errc(), scn + 4 };
> }
> // Inf
> if( xstricmp( scn, "inf" ) ) [[unlikely]]
> {
> // yes: apply sign to Inf
> result = applySign( numeric_limits<double>::infinity() );
> return { errc(), scn + 3 };
> }
> // begin of prefix
> Iterator prefixBegin = scn;
> while( *scn >= '0' && *scn <= '9' && !end( ++scn ) );
> Iterator
> // end of prefix
> prefixEnd = scn,
> // begin and end of suffix initially empty
> suffixBegin = scn,
> suffixEnd = scn;
> // has decimal point for suffix ?
> if( !end( scn ) && *scn == '.' )
> {
> // suffix begin
> suffixBegin = ++scn;
> for( ; !end( scn ) && *scn >= '0' && *scn <= '9'; ++scn );
> // suffix end
> suffixEnd = scn;
> }
> // prefix and suffix empty ?
> if( prefixBegin == prefixEnd && suffixBegin == suffixEnd ) [[unlikely]]
> // yes: err
> return { errc::invalid_argument, str };
> // exponent initially zero
> int64_t exp = 0;
> // has 'e' for exponent ?
> if( !end( scn ) && (*scn == 'e' || *scn == 'E') )
> // yes: scan exponent
> if( auto [ec, next] = parse_int( ++scn, exp, end ); ec == errc() )
> [[likely]]
> // succeeded: remember end of exponent
> scn = next;
> else
> // failed: 'e' without actual exponent
> return { ec, scn };
> // number of suffix digits
> size_t nSuffixes;
> if( exp >= 0 ) [[likely]]
> // suffix is within suffix or right from suffix
> if( suffixEnd - suffixBegin - exp > 0 ) [[likely]]
> // suffix is within suffix
> nSuffixes = suffixEnd - suffixBegin - (ptrdiff_t)exp;
> else
> // there are no suffixes
> nSuffixes = 0;
> else
> if( prefixEnd - prefixBegin + exp >= 0 ) [[likely]]
> // suffix is within prefix
> nSuffixes = suffixEnd - suffixBegin - (ptrdiff_t)exp;
> else
> // there are no prefixes, all digits are suffixes
> nSuffixes = suffixEnd - suffixBegin + (prefixEnd - prefixBegin);
> // have non-default initialized doubles to save CPU-time
> union ndi_dbl { double d; ndi_dbl() {} };
> // thread-local vector with suffixes
> thread_local vector<ndi_dbl> ndiSuffixDbls;
> // resize suffixes vector, capacity will stick to the maximum
> ndiSuffixDbls.resize( nSuffixes );
> // have range checking with suffixes on debugging
> span suffixDbls( &to_address( ndiSuffixDbls.begin() )->d, &to_address(
> ndiSuffixDbls.end() )->d );
> // iterator after last suffix
> auto suffixDblsEnd = suffixDbls.begin();
> double digMul;
> int64_t nextExp;
> auto suffix = [&]( Iterator first, Iterator end )
> {
> while( first != end ) [[likely]]
> {
> // if we're having maximum precision calculate digMul with pow10 for
> // every iteration
> if constexpr( Precise )
> digMul = pow10( nextExp-- );
> // pow10-value of digit becomes zero ?
> if( !bit_cast<uint64_t>( digMul ) )
> // yes: no further suffix digits to calculate
> return false;
> // append suffix double
> *suffixDblsEnd++ = (int)(*first++ - '0') * digMul;
> // if we're having less precision calculate digMul cascaded
> if constexpr( !Precise )
> digMul /= 10.0;
> }
> // further suffix digits to calculate
> return true;
> };
> // flag that signals that there are suffix digits beyond the suffix in prefix
> bool furtherSuffix;
> if( exp >= 0 ) [[likely]]
> // there's no suffix in prefix
> nextExp = -1,
> digMul = 1.0 / 10.0,
> furtherSuffix = true;
> else
> {
> // there's suffix in prefix
> Iterator suffixInPrefixBegin;
> if( prefixEnd - prefixBegin + exp >= 0 )
> // suffix begins within prefix
> suffixInPrefixBegin = prefixEnd + (ptrdiff_t)exp,
> nextExp = -1,
> digMul = 1.0 / 10.0;
> else
> {
> // suffix begins before prefix
> suffixInPrefixBegin = prefixBegin;
> nextExp = (ptrdiff_t)exp + (prefixEnd - prefixBegin) - 1;
> if constexpr( !Precise )
> digMul = pow10( nextExp );
> }
> furtherSuffix = suffix( suffixInPrefixBegin, prefixEnd );
> }
> if( furtherSuffix && exp < suffixEnd - suffixBegin )
> // there's suffix in suffix
> if( exp <= 0 )
> // (remaining) suffix begins at suffix begin
> suffix( suffixBegin, suffixEnd );
> else
> // suffix begins at exp in suffix
> suffix( suffixBegin + (ptrdiff_t)exp, suffixEnd );
> result = 0.0;
> // add suffixes from the tail to the beginning
> for( ; suffixDblsEnd != suffixDbls.begin(); result += *--suffixDblsEnd );
> // add prefix digits from end reverse to first
> auto prefix = [&]( Iterator end, Iterator first )
> {
> while( end != first ) [[likely]]
> {
> // if we're having maximum precision calculate digMul with pow10 for
> // every iteration
> if constexpr( Precise )
> digMul = pow10( nextExp++ );
> // pow10-value of digit becomes infinite ?
> if( (bit_cast<uint64_t>( digMul ) & EXP_MASK) == EXP_MASK ) [[unlikely]]
> {
> // yes: infinite result, no further prefix digits to calculate
> result = numeric_limits<double>::infinity();
> return false;
> }
> // add digit to result
> result += (int)(*--end - '0') * digMul;
> // if we're having less precision calculate digMul cascaded
> if constexpr( !Precise )
> digMul *= 10.0;
> }
> return true;
> };
> // flag that signals that prefix digits are finite so far, i.e. not Inf
> bool prefixFinite = true;
> if( !exp ) [[likely]]
> // prefix ends at suffix
> nextExp = 0,
> digMul = 1.0;
> else if( exp > 0 ) [[likely]]
> {
> // there's prefix in suffix
> Iterator prefixInSuffixEnd;
> if( exp <= suffixEnd - suffixBegin )
> // prefix ends within suffix
> prefixInSuffixEnd = suffixBegin + (ptrdiff_t)exp,
> nextExp = 0,
> digMul = 1.0;
> else
> {
> // prefix ends after suffix end
> prefixInSuffixEnd = suffixEnd;
> nextExp = exp - (suffixEnd - suffixBegin);
> if constexpr( !Precise )
> digMul = pow10( nextExp );
> }
> prefixFinite = prefix( prefixInSuffixEnd, suffixBegin );
> }
> else if( exp < 0 )
> {
> // prefix ends before suffix
> nextExp = -exp;
> if constexpr( !Precise )
> digMul = pow10( -exp );
> }
> if( prefixFinite && prefixEnd - prefixBegin + exp > 0 ) [[likely]]
> // prefix has prefix
> if( exp >= 0 ) [[likely]]
> // there's full prefix in prefix
> prefixFinite = prefix( prefixEnd, prefixBegin );
> else
> // remaining prefix is within prefix
> prefixFinite = prefix( prefixEnd + (ptrdiff_t)exp, prefixBegin );
> if( !prefixFinite ) [[unlikely]]
> {
> // result is Inf or NaN
> // if we had (digit = 0) * (digMul = Inf) == NaN:
> // make result +/-Inf
> result = bit_cast<double>( binSign | EXP_MASK );
> return { errc::result_out_of_range, scn };
> }
> result = applySign( result );
> return { errc(), scn };
> }
>
> template<bool Precise = false, std::random_access_iterator Iterator>
> requires std::integral<std::iter_value_t<Iterator>>
> parse_result<Iterator> parse_double( Iterator str, Iterator end, double
> &result )
> {
> return parse_double<Precise>( str, result, [&]( Iterator it ) { return
> it == end; } );
> }
>

I see several design/implementation issues.

Refer to Wy::strnum, which predates std::from_chars by 10 years.

---------------------------------------
NAME
strnum - scan and convert string to a number

SYNOPSIS
Except for POD types and C structures, all types are declared in namespace Wy.

#include <Wy.stdlib.h>

template<typename ValueType>
Errno strnum(
ValueType& value,
const char* nptr,
const char*& endptr,
const int& radix=10,
NumInterpFlag cflags=Interp_C_Notation ) ;

template<typename ValueType>
Errno strnum(
ValueType& value,
const char* nptr,
const char*& endptr,
int& radix,
NumInterpFlag cflags=Interp_C_Notation ) ;

template<typename ValueType>
Errno strnum(
ValueType& value,
const char* nptr,
const int& radix=10,
NumInterpFlag cflags=Interp_C_Notation ) ;

template<typename ValueType>
Errno strnum(
ValueType& value,
const char* nptr,
int& radix,
NumInterpFlag cflags=Interp_C_Notation ) ;

DESCRIPTION
The strnum function templates accept two kinds of number strings: a C
string (zero-terminated), or a range given by the pointers [begin,end).
For example:
const char nstr[]="34567";
int n;
const char *beg=nstr, *end=nstr+sizeof(nstr)-1;

strnum(n,"12345"); // C string, radix=10, format=C
strnum(n,"F2345",16,No_Flag);// C string, radix=16, format=exact

strnum(n,beg,end); // [beg,end), radix=10, format=C
strnum(n,beg,end,16,No_Flag);// [beg,end), radix=16, format=exact

strnum scans the string pointed to by [nptr,endptr) as if the number is in
radix representation, and stores the result in value. If endptr is
not specified, nptr is assumed to point to a zero-terminated string. If
specified, endptr is read and then modified to point one past the last
character recognized.
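
For comparison only, the same "range in, end position out" contract can be sketched with the standard std::from_chars. The wrapper scan_int and the struct scan_result are hypothetical names for illustration; they belong neither to Wy nor to the standard library, and std::from_chars differs from strnum in its details (for example it skips no whitespace and accepts no '+' sign or radix prefix).

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical result type: converted value, number of characters
// recognized, and whether a number could be converted at all.
struct scan_result
{
    int value;
    std::size_t consumed;
    bool ok;
};

// Hypothetical wrapper around std::from_chars for the range [first,last).
inline scan_result scan_int( std::string_view text, int radix = 10 )
{
    int value = 0;
    char const *first = text.data(), *last = first + text.size();
    // ptr points one past the last character recognized
    auto [ptr, ec] = std::from_chars( first, last, value, radix );
    return { value, std::size_t( ptr - first ), ec == std::errc() };
}
```

scan_int( "123abc" ) converts 123 and reports three characters consumed, and scan_int( "F2345", 16 ) converts the whole string as hexadecimal.
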

radix should be in the range [2-36] or 0. If Interp_C_Radix is set and
radix is 0, the radix in effect depends on the prefix of the string and
is written back if radix is a non-const reference type. If the radix
prefix is not recognized, radix 10 is assumed.

cflags is a bit-OR'd value of the enum NumInterpFlag members, controlling
the string-to-number conversion.

No_Flag ............ No flag is set.
All_Flag ........... All flags are set.
Skip_Leading_Space . Skip leading space characters as determined by
isspace(3wy)
Interp_Sign ........ Interpret sign symbol (i.e. '+' '-')
Interp_C_Radix ..... Interpret C radix prefix
Skip_Space_After_Sign Skip space characters as determined by
isspace(3wy). No effect if Interp_Sign is not
set.
Interp_C_Exponent .. Interpret C exponent suffix
Interp_C_NanInf .... Interpret string "nan" or "[+-]inf",
disregarding case.
Skip_Less_Significant Skip long and less significant digit characters
that would cause ERANGE otherwise.

If ValueType is an integral type, the accepted string is in the form:

::= 0(xX) --> hexadecimal (radix=16)
::= 0 --> octal (radix=8)
<b1>::= <b2> --> space characters (isspace(3wy))
<ddd> --> non-empty digit string (_charnum(3wy))

<b1> is interpreted iff Skip_Leading_Space is set.
[-+] is interpreted iff Interp_Sign is set.
<b2> is interpreted iff Skip_Space_After_Sign and
Interp_Sign are set.
 is interpreted iff Interp_C_Radix is set.

Flags Skip_Less_Significant, Interp_C_Exponent, Interp_C_NanInf
have no effect for conversion of integral types.

If ValueType is a floating-point type, the accepted string is in the
form:

::= 0(xX) --> hexadecimal (radix=16)
<b1>::= <b2> --> space characters (isspace(3wy))
<ddd>::=<ggg>::=<kkk> --> non-empty digit string (_charnum(3wy))

<Exp> is interpreted iff Interp_C_Exponent is set.

If Interp_C_NanInf is set, string for nan,inf is interpreted.
If Skip_Less_Significant is set, less significant digits can be
recognized and ignored.

The rest of the flags behave the same as for integral types.

RETURN
Ok Succeed

EINVAL radix is not in range [2-36], or is 0 while
Interp_C_Radix is not set.

ENOENT No leading sub-string can be converted to a number

EBADMSG An unrecognized character is encountered

ERANGE Result would be unrepresentable or the interpretation
gives up because of long string input.

EFAULT nptr is NULL.

The function template tries to recognize as many characters as possible.

When returning Ok or EBADMSG, value is the converted value of the
returned string [nptr,endptr).

SPECIALIZATION
The following lists types that are specialized by this library (type
char is not implemented, cast it to signed char or unsigned char):

signed char, signed short, signed int, signed long, signed long long
unsigned char, unsigned short, unsigned int, unsigned long,
unsigned long long, VLInt<T>.

Supported radix is in range [0,2-36].
For all replies, the integral value is the accurate conversion of the
returned string [nptr,endptr) if endptr is specified and does not indicate
an empty range.

float, double, long double

Supported radix is in range [0,2-36].
If the string indicates a big number, value may be set to
[-]huge_value<ValueType>() (equal to HUGE_VAL, HUGE_VALF or HUGE_VALL).

[Implementation Note] Conversion accuracy may be limited due to using
C/C++ math functions to compose the result.

Wy.Timespec

Supported radix is [0,2,4,8,10,16,32].
String representation in radix 10 can be accurately converted.

SEE ALSO
Wy Wy.Timespec Wy.String Wy.ByteFlow Wy.VLInt

NOTE
Project is in development. https://sourceforge.net/projects/cscall
