Rocksolid Light



devel / comp.lang.c / Idle: C library features wish-list.

Subject                                           Author
* Idle: C library features wish-list.             BGB
`* Re: Idle: C library features wish-list.        Tim Rentsch
 +* Re: Idle: C library features wish-list.       BGB
 |+* Re: Idle: C library features wish-list.      Kaz Kylheku
 ||+* Re: Idle: C library features wish-list.     Scott Lurndal
 |||`* Re: Idle: C library features wish-list.    BGB
 ||| +* Re: Idle: C library features wish-list.   Scott Lurndal
 ||| |`* Re: Idle: C library features wish-list.  BGB
 ||| | `- Re: Idle: C library features wish-list. BGB
 ||| `* Re: Idle: C library features wish-list.   Kaz Kylheku
 |||  `- Re: Idle: C library features wish-list.  BGB
 ||`- Re: Idle: C library features wish-list.     BGB
 |`- Re: Idle: C library features wish-list.      Tim Rentsch
 `- Re: Idle: C library features wish-list.       Kenny McCormack

Idle: C library features wish-list.

<th7ne1$140s1$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23385&group=comp.lang.c++#23385
Newsgroups: comp.lang.c
 by: BGB - Fri, 30 Sep 2022 21:34 UTC

There are some things that come up often that it might be "useful" if
they could be supported in a more portable way.

Some of these exist as extensions in various libraries (and some of
this is specific to my own project, which may also happen to include its
own C compiler and runtime).

Maybe add a disclaimer that none of this is intended for general
portability outside of a specific (albeit very common) class of target
architectures.

Actual function names are subject to change; this is the "general idea" mostly.

I don't expect any of this to amount to much.

Getting/setting values of defined size and endianness given a pointer:
uint32_t _mget_uint32le(void *ptr);
void _mset_uint32le(void *ptr, uint32_t val);
...
uint64_t _mget_uint64be(void *ptr);
void _mset_uint64be(void *ptr, uint64_t val);
...
Say, the pointer may be unaligned and the endianness is explicit. It
is the responsibility of the compiler or runtime to make it work in
whatever way is most efficient for that target.
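A portable sketch of a couple of these accessors (names taken from the above; the byte-wise form is correct for any host endianness and any alignment, and a target-aware compiler could lower it to a plain load/store plus an optional byte swap; the const qualifiers are an addition):

```c
#include <stdint.h>

/* Byte-wise access: works for unaligned pointers on any host endianness. */
uint32_t _mget_uint32le(const void *ptr) {
    const unsigned char *p = ptr;
    return (uint32_t)p[0]         | ((uint32_t)p[1] << 8) |
          ((uint32_t)p[2] << 16)  | ((uint32_t)p[3] << 24);
}

void _mset_uint32le(void *ptr, uint32_t val) {
    unsigned char *p = ptr;
    p[0] = (unsigned char)val;          p[1] = (unsigned char)(val >> 8);
    p[2] = (unsigned char)(val >> 16);  p[3] = (unsigned char)(val >> 24);
}

uint64_t _mget_uint64be(const void *ptr) {
    const unsigned char *p = ptr;
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)     /* most significant byte first */
        v = (v << 8) | p[i];
    return v;
}
```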

Malloc/heap related:
size_t _msize(void *ptr);
Get the size of the pointed to heap object.
Returns 0 if the pointer does not point to a valid heap object.
Need not match (exactly) the size given to malloc.
void *_mgetbase(void *ptr);
Get the base-address of a previously malloc'ed pointer.
Returns NULL if the pointer does not point at a malloc'ed object.
void *_malloc_cat(size_t size, int mode);
Allocate an object which is usable according to the mode flags.
MALLOC_MCAT_RW //Read/Write, Default
MALLOC_MCAT_RWX //Read/Write/Execute
MALLOC_MCAT_ZTAG //Object needs to support tags and zone.
...
(Also partially overlaps with mmap's prot/map flags)

_mgetbase would return the pointer originally returned by malloc or
_malloc_cat or similar, if given a pointer into an object on the heap.
If given the base pointer for a malloc'ed object, it will return this
pointer.

The _malloc_cat call allocates by "category", and this category may not
be changed after the fact (since the categories may be in different
parts of the address space or similar; say because they involve
different mmap's).

Something like "realloc()" would preserve the relevant parts of the
category when copying an object.

By default, memory returned by malloc and similar would be assumed to be
non-executable, with the ability to allocate executable memory being a
special case. Specifics of "actually using" RWX memory will depend
on the target (usually assumes writing target-specific machine code into
a buffer and some means of flushing the instruction cache and similar).

The _msize call would not be expected to preserve the exact size from
the original malloc call:
Don't necessarily want to store the exact size, but an approximate size
will be known;
It typically makes sense to "quantize" or "bucket" the allocation sizes,
so the stored object will typically be larger than the size it was
originally allocated as (both due to alignment padding and also because
this can help reduce heap fragmentation).

Say, for example, one malloc's 1392 bytes and the underlying object can
hold 1536 or similar, etc.
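One possible bucketing scheme, purely as a sketch (the policy would be up to the allocator; `_msize_quantize` is an invented name, and this particular rule, stepping by 1/8th of the next power of two, happens to reproduce the 1392 -> 1536 example):

```c
#include <stddef.h>

/* Illustrative size-class rounding: small sizes go to 16-byte multiples,
   larger ones round up to a step of 1/8th of the next power of two. */
size_t _msize_quantize(size_t sz) {
    size_t step;
    if (sz <= 256)
        return (sz + 15) & ~(size_t)15;
    step = 1;
    while (step * 8 < sz)          /* find step = 2^k with sz <= 8*step */
        step <<= 1;
    return (sz + step - 1) & ~(step - 1);
}
```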

But, maybe add to this:
uint16_t _mgettag(void *ptr);
void _msettag(void *ptr, uint16_t ttag);
These get/set a user-defined tag value.
This would be something like an object type-tag or similar.
Default tag is 0.
Tag may always return as 0 if object does not support tags.

Zone, possible:
uint16_t _mgetzone(void *ptr);
void _msetzone(void *ptr, uint16_t ztag);
Get/Set a 'zone' tag.
void _mfreezone(uint16_t ztag, uint16_t zmask);
Free all objects on the heap where ((obj->ztag&zmask)==ztag).
Trying to free (ztag==0) is undefined (may abort/crash).

Default zone tag is 0.
Tag may always return as 0 if object does not support zone tags.

These do imply that the malloc implementation has a per-object header
with enough information needed to encode the size and relevant tags for
the object. For small objects, this may be undesirable.
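A toy model of what such a per-object header and _mfreezone() might look like (purely illustrative, nothing like a production allocator: every allocation carries a list-linked header with its zone tag, and _mfreezone walks the list; `zmalloc` and `zcount` are invented helper names):

```c
#include <stdint.h>
#include <stdlib.h>

/* Each allocation gets a header carrying the zone tag; _mfreezone walks
   the allocation list and frees every match. Toy sketch only. */
typedef struct ZHdr {
    struct ZHdr *next, *prev;
    uint16_t ztag;
} ZHdr;

static ZHdr zlist = { &zlist, &zlist, 0 };   /* circular list head */

void *zmalloc(size_t sz) {
    ZHdr *h = malloc(sizeof *h + sz);
    if (!h) return NULL;
    h->ztag = 0;                              /* default zone tag is 0 */
    h->next = zlist.next; h->prev = &zlist;
    zlist.next->prev = h; zlist.next = h;
    return h + 1;
}

void     _msetzone(void *p, uint16_t z) { ((ZHdr *)p - 1)->ztag = z; }
uint16_t _mgetzone(void *p)             { return ((ZHdr *)p - 1)->ztag; }

void _mfreezone(uint16_t ztag, uint16_t zmask) {
    ZHdr *h = zlist.next;
    while (h != &zlist) {
        ZHdr *nx = h->next;
        if ((h->ztag & zmask) == ztag) {      /* bulk-free on tag match */
            h->prev->next = h->next;
            h->next->prev = h->prev;
            free(h);
        }
        h = nx;
    }
}

size_t zcount(void) {                         /* live allocations */
    size_t n = 0;
    for (ZHdr *h = zlist.next; h != &zlist; h = h->next) n++;
    return n;
}
```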

One may need to request support explicitly. For objects allocated with
bare malloc, it would be undefined whether or not these tags exist.

The zone system may be itself used either as a crude garbage collector,
or used to implement the "sweep" stage of a garbage collector.

However, the assumption here is that the implementation does not itself
necessarily provide a garbage collector (at least, not enabled by
default, and will not apply to normal use of malloc/free).

An implementation need not have these "actually work", so the minimum
"required" behavior would be that the tag functions return 0, and
_mfreezone() does nothing.

Otherwise, "_mfreezone()" is potentially very slow, and is assumed to be
used infrequently (since it does basically walk the whole heap and
bulk-frees anything which matches the requested tag pattern).

While one could argue for other things per-object, like a
reference-count or "color" bits, one could also potentially make a case
for having the program itself use the zone-tag bits for doing this, say:
(15:8): Zone Level or Zone-Node
( 7:2): Reference Count
( 1:0): Object Color

As for 16-bit tags (vs 32-bit):
The 16-bit tags should mostly be sufficient for most of these use-cases,
whereas 8 bits may not be sufficient;
Support for these tags would need to be paid for by nearly every object
on the heap, so cost is a concern.

Yeah, granted, these ones are pretty non-standard.
Trying to avoid getting too much into the specifics of how this stuff
would be implemented.

There are some features which can be useful, such as the ability to
register finalizer callbacks on top of the zone system:

int _mzone_add_finalizer(
uint16_t ttag, uint16_t tmask,
uint16_t ztag, uint16_t zmask,
void (*func)(void *ptr));

Calls func whenever _mfreezone frees an object and:
(((obj->ttag&tmask)==ttag) && ((obj->ztag&zmask)==ztag))

Note that the implementation "may" combine these fields internally into
a single 32-bit field or similar. Here, "_mfreezone" will call
finalizers, but "free()" will not do so (and if the program wants to
NULL out user-pointers or similar, it may do so via the finalizer callback).

Note that a finalizer will only necessarily be called for an object if
the finalizer was registered prior to the object's tags being set
(otherwise, undefined). Which finalizer is called if multiple finalizers
can match a given bit-mask pattern is also undefined (though, could be
evaluated separately for both ztag and ttag, in which case potentially
two finalizers could be called for the same object).
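A minimal sketch of the registry and the masked-match rule (the fixed capacity, the first-match-wins policy, and the `zfin_run` / `demo_fin` names are all invented for illustration; first-match is just one permissible resolution of the undefined multiple-match case above):

```c
#include <stdint.h>
#include <stddef.h>

/* A finalizer entry fires when both masked tag compares match. */
typedef struct {
    uint16_t ttag, tmask, ztag, zmask;
    void (*func)(void *ptr);
} ZFin;

static ZFin zfins[64];
static size_t zfin_cnt;

int _mzone_add_finalizer(uint16_t ttag, uint16_t tmask,
                         uint16_t ztag, uint16_t zmask,
                         void (*func)(void *ptr))
{
    if (zfin_cnt >= 64) return -1;
    zfins[zfin_cnt++] = (ZFin){ ttag, tmask, ztag, zmask, func };
    return 0;
}

/* Called by a hypothetical _mfreezone just before freeing an object. */
void zfin_run(void *obj, uint16_t obj_ttag, uint16_t obj_ztag)
{
    for (size_t i = 0; i < zfin_cnt; i++) {
        if (((obj_ttag & zfins[i].tmask) == zfins[i].ttag) &&
            ((obj_ztag & zfins[i].zmask) == zfins[i].zmask ? 0 : 0,
             (obj_ztag & zfins[i].zmask) == zfins[i].ztag)) {
            zfins[i].func(obj);
            return;                 /* first match wins in this sketch */
        }
    }
}

static int zfin_fired;
static void demo_fin(void *ptr) { (void)ptr; zfin_fired++; }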

If the implementation provides its own tagged type-system and GC
facilities, then it is also possible that the implementation uses some
range of the tag spaces for itself, say:
0000..3FFF: Reserved for runtime.
4000..7FFF: Available for application use.
8000..FFFF: TBD

Note that the finalizer should not be seen as equivalent to an object
destructor. Both may exist, but would be handled in different ways and
by different mechanisms.

While this also does not directly provide for "this object being freed
also sets other pointers to this object to NULL" semantics, this
functionality can be faked by creating a finalizer which sets the
pointers to NULL (and otherwise, this sort of thing is obscure enough in
practice to not make it worth wasting space in the object headers for
sake of supporting this).

One other optional requirement is that pointers returned by "malloc" and
friends are able to be used (and freed), from any library in the running
program instance, and across thread boundaries, etc. This sort of thing
is an annoyance in MSVC, I would prefer if "malloc from anywhere and
free from anywhere else" be required de-facto (without extra hassle).

Well, by extension would prefer if stdio and "FILE *" also did this,
stuff like passing an open FILE* descriptor from one DLL to another, and
then trying to perform IO on it from that other DLL, and having the C
library promptly grenade itself, is pointless and annoying.

Usual workaround is to provide a wrapper interface to redirect all of
the IO back to a common location (along with of "extra stuff" if one
wants using "stdio.h" stuff to also be thread safe); would prefer it if
this stuff working could be assumed to work as a default.

Granted, yes, Linux and friends already have the behavior I am looking
for here (mostly would just prefer it more if this were universal).

Copy/compare:

void _memlzcpy(void *dst, void *src, size_t sz);
Copy the buffer as-if it were implemented as:
unsigned char *cs, *ct, *cse;
ct=dst; cs=src; cse=cs+sz;
while(cs<cse)
*ct++=*cs++;
However, unlike the naive loop, hopefully in a way that is not so
painfully slow:
Non-overlap: Behaves like memcpy;
Backwards (dst<src): Behaves like memmove;
Forwards overlap (dst>src): Fill with a repeating byte pattern.
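A byte-exact reference model of the above (`memlzcpy_ref` is an invented name; deliberately naive, a correctness baseline rather than the fast path — the three cases all fall out of the single forward byte loop):

```c
#include <string.h>

/* Forward byte copy; overlap behavior is exactly that of the naive loop. */
static void memlzcpy_ref(void *dst, const void *src, size_t sz) {
    unsigned char *ct = dst;
    const unsigned char *cs = src, *cse = cs + sz;
    while (cs < cse)
        *ct++ = *cs++;
}
```

For example, `memlzcpy_ref(buf + 1, buf, 4)` on "abcdefgh" yields "aaaaafgh": each write feeds the next read, so the one-byte-delta forward overlap degenerates into a fill.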


[article truncated]
Re: Idle: C library features wish-list.

<86y1txinka.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=23412&group=comp.lang.c++#23412
Newsgroups: comp.lang.c
 by: Tim Rentsch - Mon, 3 Oct 2022 03:38 UTC

BGB <cr88192@gmail.com> writes:

> There are some things that come up often that it might be "useful" if
> they could be supported in a more portable way.
>
> [ ... ]
>
> Any thoughts?...

None of these is suitable for inclusion in the ISO C standard.

Re: Idle: C library features wish-list.

<thdum3$23pod$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23413&group=comp.lang.c++#23413
Newsgroups: comp.lang.c
 by: BGB - Mon, 3 Oct 2022 06:14 UTC

On 10/2/2022 10:38 PM, Tim Rentsch wrote:
> BGB <cr88192@gmail.com> writes:
>
>> There are some things that come up often that it might be "useful" if
>> they could be supported in a more portable way.
>>
>> [ ... ]
>>
>> Any thoughts?...
>
> None of these is suitable for inclusion in the ISO C standard.

Possibly.

As noted, this was more an "idle wish list", based mostly on stuff that
comes up a lot in my experience, not really a proposal that this stuff
be added (as-is) to the C standard.

A few of them, such as _msize(), exist in MSVCRT; _msize() is functionally
equivalent to malloc_usable_size() in glibc.

Also was using the "_whatever()" naming convention, because this seems
to be the typical naming convention for library extensions (vs
"__whatever" being more typical for compiler-specific keywords).

This is stuff that comes up a lot, and almost every non-trivial program
needs to implement a lot of this itself (often multiple times, if one
uses libraries and every library provides its own implementation).

Like, say, for example, what if the C library had not provided
"memcpy()" and similar, and nearly every application was left to roll
their own, often doing so poorly.

Also, C11 threads exist, so one never really knows what sorts of stuff
they might throw in...

Most of this does exist in the C library used by my compiler (itself
a highly modified fork of PDPCLIB, *1), but as noted, this does limit
its scope (since for code portability one is limited mostly to the least
common denominator between the targets in question).

Well, there are also a bunch of C language extensions, but I decided to
leave these out in this case (same sort of issue), and some amount exist
partly as quirky side effect of trying to squeeze performance out of a
50 MHz CPU running on top of an FPGA (mostly on an Artix-7 based board).
Well, among other funkiness.

Well, and also C isn't the only language I am running here (and the
compiler and runtime exist with some stuff partly intended for use by
the other languages I am compiling with my compiler).

*1: I have ended up rewriting a fair chunk of the C library as some of
the code was "kinda awful" (but, it is more of a "ship of Theseus"
thing). Had a few times considered possibly doing a full rewrite, mostly
motivated by architectural reasons, but had not fully done so yet partly
due to inertia.

...

Re: Idle: C library features wish-list.

<thea1e$28v41$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=23415&group=comp.lang.c++#23415
Newsgroups: comp.lang.c
 by: Kenny McCormack - Mon, 3 Oct 2022 09:29 UTC

In article <86y1txinka.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>BGB <cr88192@gmail.com> writes:
>
>> There are some things that come up often that it might be "useful" if
>> they could be supported in a more portable way.
>>
>> [ ... ]
>>
>> Any thoughts?...
>
>None of these is suitable for inclusion in the ISO C standard.

Wow. What a charming, helpful, forward-looking response.

--
Hindsight is (supposed to be) 2020.

Trumpers, don't make the same mistake twice.
Don't shoot yourself in the feet - and everywhere else - again!.

Re: Idle: C library features wish-list.

<20221003130059.385@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=23419&group=comp.lang.c++#23419
Newsgroups: comp.lang.c
 by: Kaz Kylheku - Mon, 3 Oct 2022 20:40 UTC

On 2022-10-03, BGB <cr88192@gmail.com> wrote:
> On 10/2/2022 10:38 PM, Tim Rentsch wrote:
>> BGB <cr88192@gmail.com> writes:
>>
>>> There are some things that come up often that it might be "useful" if
>>> they could be supported in a more portable way.
>>>
>>> [ ... ]
>>>
>>> Any thoughts?...
>>
>> None of these is suitable for inclusion in the ISO C standard.
>
> Possibly.
>
>
> As noted, this was more an "idle wish list", based mostly on stuff that
> comes up a lot in my experience, not really a proposal that this stuff
> be added (as-is) to the C standard.

I don't know why you would even wish to have most of that stuff in the
standard.

The standard would be objectively worse, even for you, whenever
you're working on anything but the one program where you need any
of it.

> A few of them, such as _msize(), exist in MSVCRT, and is functionally
> equivalent to malloc_usable_size() in GLIBC.

_msize doesn't return the size that was passed to malloc; it returns
some rounded up size. Still that can be useful.

Code which manages a buffer that grows when it becomes full tracks the
allocated size separately from the actual filled size. With this
function, you don't have to waste space storing the allocated
size and keeping it up-to-date: you just retrieve it. Moreover,
you use the full underlying size without any waste.

The function would have to be specified such that if you malloc(42),
and then malsize(ptr) yields 64, it becomes legitimate for you
to make use of all of bytes 0 to 63.

Moreover, it would have to be specified that if malsize(ptr) is called
and returns some value, then it must always return a value at least as
large for ptr, regardless of any memory allocations or deallocations
that take place. The memory indicated by that size must really belong to the
allocated object.
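glibc does expose exactly this as malloc_usable_size(), declared in <malloc.h> (glibc-specific, not portable C); a growable buffer in the style described might look like the following sketch (`Buf` and `buf_push` are invented names):

```c
#include <malloc.h>    /* glibc: size_t malloc_usable_size(void *ptr); */
#include <stdlib.h>

/* Growable byte buffer that stores no capacity field: the capacity is
   recovered from malloc_usable_size(), per the rationale above. */
typedef struct {
    unsigned char *data;
    size_t len;
} Buf;

int buf_push(Buf *b, unsigned char byte) {
    if (b->data == NULL || b->len == malloc_usable_size(b->data)) {
        size_t want = b->len ? b->len * 2 : 16;
        unsigned char *p = realloc(b->data, want);
        if (!p) return -1;
        b->data = p;               /* usable size is now >= want > len */
    }
    b->data[b->len++] = byte;
    return 0;
}
```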

> Like, say, for example, what if the C library had not provided
> "memcpy()" and similar, and nearly every application was left to roll
> their own, often doing so poorly.

Sure, but how many need a memcpy that allows overlap, but if the
overlap is in the wrong direction, it then repeats a byte?

A memcpy that allows overlap, if the second operand has a higher
address than the first, would be mildly useful. However,
if we say that the second address must be higher, that can be
satisfied by it being higher only by a byte.

The motivation for that function is that a simple loop can perform the
copy, which sweeps over both operands in order of increasing address.
However, it can only work reliably if the transfer unit's width
is no larger than the displacement between the two buffers.
So in the case of a one byte difference, the loop must transfer
a byte at a time.

In cases when the address delta can't be deduced at compile time,
that function would have to switch on the delta size, and say
handle the 1, 2, 4 and 8 byte cases specially. Plus handle the
alignment cases and all that.

It's not clear that it would end up winning very much over memmove.
Programmers who want the most performance out of memcpy just
make it non-overlapping.
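The width constraint can be seen concretely: with a one-byte delta, copying in a 4-byte unit no longer matches the byte-at-a-time result, because the unit reads all four source bytes before any of them has been overwritten (a temporary stands in for a wide register here; both helper names are invented):

```c
#include <string.h>

/* Forward byte-at-a-time copy. */
static void copy_bytewise(unsigned char *dst, const unsigned char *src,
                          size_t n) {
    while (n--) *dst++ = *src++;
}

/* One 4-byte transfer unit: reads all 4 bytes before any write. */
static void copy_one_wide_unit(unsigned char *dst, const unsigned char *src) {
    unsigned char tmp[4];
    memcpy(tmp, src, 4);
    memcpy(dst, tmp, 4);
}
```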

Versions of memcpy and memmove which allow the application to
specify the alignment (whereby the application ensures that
the promised alignment is true) would be useful:

/* array copy, array move */

/* non-overlapping operands.
both operand pointers aligned to elem_size, else UB. */

arrcpy(dest, src, elem_cnt, elem_size)

/* Possibly overlapping operands.
both operand pointers aligned to elem_size, else UB. */

arrmove(dest, src, elem_cnt, elem_size)

Copy operations that don't have to handle run-time alignment cases, and
odd leftover sizes, could likely be implemented faster.

The elem_size expression is often a constant expression, in which cases
the compiler can rewrite the call to use a function which handles that
transfer unit size (or multiples), without worrying about alignment or
partial transfer units at the end.

ISO C (since 99) has something like this, for wchar_t: wmemcpy
and wmemmove. The above functions would just generalize that.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Re: Idle: C library features wish-list.

<OcI_K.166108$w35c.86764@fx47.iad>

https://www.novabbs.com/devel/article-flat.php?id=23420&group=comp.lang.c++#23420
Newsgroups: comp.lang.c
 by: Scott Lurndal - Mon, 3 Oct 2022 21:10 UTC

Kaz Kylheku <864-117-4973@kylheku.com> writes:
>On 2022-10-03, BGB <cr88192@gmail.com> wrote:

>> Like, say, for example, what if the C library had not provided
>> "memcpy()" and similar, and nearly every application was left to roll
>> their own, often doing so poorly.
>
>Sure, but how many need a memcpy that allows overlap, but if the
>overlap is in the wrong direction, it then repeats a byte?
>
>A memcpy that allows overlap, if the second operand has a higher
>address than the first, would be mildly useful. However,
>if we say that the second address must be higher, that can be
>satisfied by it being higher only by a byte.

Why would someone want to use memcpy for an overlapping
move when memmove is available to handle the overlap cases?

If one programs in C, one should RTFM, and the FM says

"The memory areas must not overlap"

As an aside, the latest extensions to the ARMv8 architecture
include instructions to implement memset and memcpy in a
hardware-efficient manner.

DAGS DDI0487I_a_a-profile_architecture_reference_manual.pdf
and look at the 30 CPY* instructions (and they still call it RISC :-),
or the SETP/SETM/SETE instructions.

The key is that the instructions allow the processor to move data
in the most effective unit-size (e.g. a cache line) rather than
byte-at-a-time (or the C library mem* assembler functions using
64-bit or 128-bit accesses in a loop). It's still a loop, but
the processor determines how much is moved in each iteration
rather than the programmer.

Re: Idle: C library features wish-list.

<thfqof$2b9cf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23421&group=comp.lang.c++#23421
Newsgroups: comp.lang.c
 by: BGB - Mon, 3 Oct 2022 23:19 UTC

On 10/3/2022 3:40 PM, Kaz Kylheku wrote:
> On 2022-10-03, BGB <cr88192@gmail.com> wrote:
>> On 10/2/2022 10:38 PM, Tim Rentsch wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>
>>>> There are some things that come up often that it might be "useful" if
>>>> they could be supported in a more portable way.
>>>>
>>>> [ ... ]
>>>>
>>>> Any thoughts?...
>>>
>>> None of these is suitable for inclusion in the ISO C standard.
>>
>> Possibly.
>>
>>
>> As noted, this was more an "idle wish list", based mostly on stuff that
>> comes up a lot in my experience, not really a proposal that this stuff
>> be added (as-is) to the C standard.
>
> I don't know why you would even wish to have most of that stuff in the
> standard.
>
> The standard would be objectively worse, even for you, whenever
> you're working on anything but the one program where you need any
> of it.
>

I didn't originally say anything about wanting to add any of this to the
C standard in the first place...

I would just prefer if it could be "more portable", which could be
achieved easily enough in a "de-facto" way.

>> A few of them, such as _msize(), exist in MSVCRT, and is functionally
>> equivalent to malloc_usable_size() in GLIBC.
>
> _msize doesn't return the size that was passed to malloc; it returns
> some rounded up size. Still that can be useful.
>

Yes, I mentioned this in the OP.

The proposed behavior *was* that it would return the padded-up size.

> Code which manages a buffer that grows when it becomes full
> tracks the allocated size separately from the actual filled size. With this
> function, you don't have to waste space storing the allocated
> size and keeping it up-to-date: you just retrieve it. Moreover,
> you use the full underlying size without any waste.
>
> The function would have to be specified such that if you malloc(42),
> and then malsize(ptr) yields 64, it becomes legitimate for you
> to make use of all of bytes 0 to 63.
>
> Moreover, it would have to be specified that if malsize(ptr) is called
> and returns some value, then it must always return a value at least as
> large for ptr, regardless of any memory allocations or deallocations
> that take place. The memory indicated by that size must really belong to the
> allocated object.
>

All this was already implicit in the original idea.

I wasn't claiming:
p=malloc(42);
sz=_msize(p);
Should have sz==42, merely sz>=42 ...

Usually, because the allocator does tend to pad things up internally,
and also we don't usually want to preserve the exact size of the
allocation in the first place (it is not usually needed, and storing it
has a non-zero cost).

>> Like, say, for example, what if the C library had not provided
>> "memcpy()" and similar, and nearly every application was left to roll
>> their own, often doing so poorly.
>
> Sure, but how many need a memcpy that allows overlap, but if the
> overlap is in the wrong direction, it then repeats a byte?
>

Yes.

In this case:
_memlzcpy(dst+1, dst, 256);
Would be semantically equivalent to:
memset(dst+1, *dst, 256);

If the delta is 2 bytes, it will repeat those 2 bytes, or 3 bytes will
repeat a 3 byte pattern, etc.

> A memcpy that allows overlap, if the second operand has a higher
> address than the first, would be mildly useful. However,
> if we say that the second address must be higher, that can be
> satisfied by it being higher only by a byte.
>
> The motivation for that function is that a simple loop can perform the
> copy, which sweeps over both operands in order of increasing address.
> However, it can only work reliably if the transfer unit's width
> is no larger than the displacement between the two buffers.
> So in the case of a one byte difference, the loop must transfer
> a byte at a time.
>

Not necessarily: in a typical implementation, it can be turned into a
pattern-fill register, which is then written to memory in a single
larger block.

We don't want to fall back to a "one byte at a time" copy in this case,
because this is slow; but it is necessary to have the same output
*as-if* it had been a "byte at a time" copy operation.

> In cases when the address delta can't be deduced at compile time,
> that function would have to switch on the delta size, and say
> handle the 1, 2, 4 and 8 byte cases specially. Plus handle the
> alignment cases and all that.
>

Yes.

Typically, it needs to specially handle all of the 1-15 byte cases, with
16+ byte cases able to fall back to the normal SIMD-based copy, and
another (slightly faster) SIMD loop usually at 32 or 64 bytes.

Non-power-of-2 sizes (3/5/7/...) get a little more complicated, but
would still need to be handled.

Usual options for this are one of:
Pattern fill has a non-power-of-2 stepping, using misaligned memory stores;
Multiple pattern fills are generated, with the fill alternating between
fill patterns based on a modulo.

The latter case is more limited in scope (doesn't scale very well), so
the former is typically what is used (say, each store is 128 bits, but
the destination pointer is advanced by 13 or 15 bytes or similar each time).

Despite typically needing to pay a penalty for misaligned SIMD stores,
this tends to work out faster on-average than the other option (as well
as being a lot simpler in terms of the required "big blobs of ASM").
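A scalar sketch of that stepping trick, with 16-byte memcpy stores standing in for SIMD (`lz_pattern_fill` is an invented name; assumes the delta-byte pattern already sits at src = dst - delta and 1 <= delta <= 16; each "wide store" writes 16 bytes but the pointer advances by the largest multiple of delta that fits in 16, so the overlapping bytes rewrite the same values):

```c
#include <string.h>

/* Expand the delta-byte pattern at src (= dst - delta) over dst[0..n). */
static void lz_pattern_fill(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    size_t delta = (size_t)(dst - src);   /* 1..16 */
    unsigned char pat[16];
    size_t step, i;

    for (i = 0; i < 16; i++)              /* replicate pattern into a block */
        pat[i] = src[i % delta];
    step = (16 / delta) * delta;          /* e.g. delta=3 -> 15, 13 -> 13 */

    while (n >= 16) {                     /* wide stores, stride = step */
        memcpy(dst, pat, 16);
        dst += step;
        n   -= step;
    }
    for (i = 0; n > 0; n--) {             /* tail; phase is 0 mod delta */
        *dst++ = pat[i];
        if (++i == delta) i = 0;
    }
}
```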

> It's not clear that it would end up winning very much over memmove.
> Programmers who want the most performance out of memcpy just
> make it non-overlapping.
>

memmove has the wrong semantics for cases where one actually needs the
preceding behavior.

The point of "_memlzcpy()" is partly because:
In some cases, one needs these particular semantics for overlapping
copies (so paying these costs is unavoidable);
One doesn't want to make normal "memcpy()" slower by asking it to
detect/handle scenarios that are N/A to most normal uses of memcpy.

So, one ends up with programs needing to implement their own version,
with it often either being slow or turning into an ugly mess of
platform-specific code.

As implied by the name, one of the major cases where this comes up tends
to be things like LZ77 decompressors (such as: LZ4, Deflate, etc).

For some decompressors (such as LZ4, or my own RP2 format), copying
matches around in the "sliding window" tends to be the majority of the
clock-cycle budget for these tasks.

> Versions of memcpy and memmove which allow the application to
> specify the alignment (whereby the application ensures that
> the promised alignment is true) would be useful:
>
> /* array copy, array move */
>
> /* non-overlapping operands.
> both operand pointers aligned to elem_size, else UB. */
>
> arrcpy(dest, src, elem_cnt, elem_size)
>
> /* Possibly overlapping operands.
> both operand pointers aligned to elem_size, else UB. */
>
> arrmove(dest, src, elem_cnt, elem_size)
>
> Copy operations that don't have to handle run-time alignment cases, and
> odd leftover sizes, could likely be implemented faster.
>

Yes, granted.

A possible option is specifying both the size and alignment...

Well, and/or doing the more naive solution and providing functions for
each (power of 2) combination of size and alignment up to a certain range.

_arrmcpy_16x8(dst, src, cnt); //16-byte items with 8-byte alignment
_arrmcpy_64x16(dst, src, cnt); //64-byte items with 16-byte alignment
...

> The elem_size expression is often a constant expression, in which cases
> the compiler can rewrite the call to use a function which handles that
> transfer unit size (or multiples), without worrying about alignment or
> partial transfer units at the end.
>
> ISO C (since 99) has something like this, for wchar_t: wmemcpy
> and wmemmove. The above functions would just generalize that.
>

Granted.

My compiler also does some similar stuff internally for things like
struct copying, since it statically knows the size and alignment of the
struct or array.

Otherwise, "memcpy()" is also specialized in some cases as well, since
the compiler can "see" the types and alignments of the passed in
pointers, and if the copy size is constant, and so may special case some
of this (only producing a "true" memcpy call as a fallback case). In
some other cases, it might turn it into bare loads and stores.

Likewise, "memset()" may also get similar treatment.

Also can note that for a lot of this, I am dealing with a 64-bit VLIW
architecture clocked at 50 MHz, where "little things" like this can have
a fairly drastic impact on performance.


[article truncated]
Re: Idle: C library features wish-list.

<thfthv$2bf07$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23422&group=comp.lang.c++#23422
Newsgroups: comp.lang.c
 by: BGB - Tue, 4 Oct 2022 00:07 UTC

On 10/3/2022 4:10 PM, Scott Lurndal wrote:
> Kaz Kylheku <864-117-4973@kylheku.com> writes:
>> On 2022-10-03, BGB <cr88192@gmail.com> wrote:
>
>>> Like, say, for example, what if the C library had not provided
>>> "memcpy()" and similar, and nearly every application was left to roll
>>> their own, often doing so poorly.
>>
>> Sure, but how many need a memcpy that allows overlap, but if the
>> overlap is in the wrong direction, it then repeats a byte?
>>
>> A memcpy that allows overlap, if the second operand has a higher
>> address than the first, would be mildly useful. However,
>> if we say that the second address must be higher, that can be
>> satisfied by it being higher only by a byte.
>
> Why would someone want to use memcpy for an overlapping
> move when memmove is available to handle the overlap cases?
>

Because memmove will not have the needed semantics in some cases.

In this case, memmove will always keep the original buffer contents
intact. For some algorithms, this is not the needed behavior.

In some cases, one "actually needs" a copy operation that will turn the
output into a repeating pattern of bytes whenever one forward-copies a
chunk of memory over the top of itself.

Also it can be used as a way to implement a "multi byte memset", say for
example, if one wants a fast way to flood-fill a chunk of memory with
0xDEADBEEF or similar, ...
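Spelled out with the naive byte-loop semantics (`fill_u32` is an invented name: the 4-byte value is written once, then a delta-4 overlapped forward copy repeats it; the resulting byte order follows however the 4 bytes of the value were stored in memory):

```c
#include <stdint.h>
#include <string.h>

/* Fill buf[0..n) with a repeating 4-byte value via an overlapped
   forward copy with delta 4 (the naive loop is the semantic model). */
static void fill_u32(unsigned char *buf, size_t n, uint32_t v)
{
    size_t i, seed = n < 4 ? n : 4;
    memcpy(buf, &v, seed);               /* write the pattern once */
    for (i = seed; i < n; i++)           /* _memlzcpy(buf+4, buf, n-4) */
        buf[i] = buf[i - 4];
}
```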

> If one programs in C, one should RTFM, and the FM says
>
> "The memory areas must not overlap"
>

This is the whole point of proposing a "_memlzcpy()" function, in that
it would define this behavior in a particular way for a particular function.

If "memcpy()" or "memmove()" had the needed semantics for this, there
would be no need to define such an extension in the first place.

One could implement it, in theory, as:
void *_memlzcpy(void *dst, void *src, size_t n)
{
    unsigned char *cs, *ct, *cse;
    cs = src; ct = dst; cse = cs + n;
    while (cs < cse)
        *ct++ = *cs++;
    return ct;
}

Issue is mostly that this implementation would be unacceptably slow for
some of these use-cases.

On the target I added this as an extension on, implementing the loop
this way is nearly two orders of magnitude slower than doing it using
128-bit load/store operations in 32-byte chunks. So, the "actual"
version will need to use a chunked memory fill in this case (effectively
more like "memset()").

By the time one makes it "not slow", it is an awkward mess.

> As an aside, the latest extensions to the ARMv8 architecture
> include instructions to implement memset and memcpy in a
> hardware-efficient manner.
>
> DAGS DDI0487I_a_a-profile_architecture_reference_manual.pdf
> and look at the 30 CPY* instructions (and they still call it RISC :-),
> or the SETP/SETM/SETE instructions.
>
> The key is that the instructions allow the processor to move data
> in the most effective unit-size (e.g. a cache line) rather than
> byte-at-a-time (or the C library mem* assembler functions using
> 64-bit or 128-bit accesses in a loop). It's still a loop, but
> the processor determines how much is moved in each iteration
> rather than the programmer.
>

Can note that in this case, I wasn't necessarily talking about either
x86-64 or ARM...

Granted, the same basic functions could be implemented on both x86 and
ARM without too much issue.

Re: Idle: C library features wish-list.

<QLW_K.732364$BKL8.167437@fx15.iad>

https://www.novabbs.com/devel/article-flat.php?id=23425&group=comp.lang.c++#23425

Newsgroups: comp.lang.c
 by: Scott Lurndal - Tue, 4 Oct 2022 13:43 UTC

BGB <cr88192@gmail.com> writes:
>On 10/3/2022 4:10 PM, Scott Lurndal wrote:
>> Kaz Kylheku <864-117-4973@kylheku.com> writes:
>>> On 2022-10-03, BGB <cr88192@gmail.com> wrote:
>>

>> Why would someone want to use memcpy for an overlapping
>> move when memmove is available to handle the overlap cases?
>>
>
>Because memmove will not have the needed semantics in some cases.
>
>In this case, memmove will always keep the original buffer contents
>intact. For some algorithms, this is not the needed behavior.
>
>
>In some cases, one "actually needs" a copy operation that will turn the
>output into a repeating pattern of bytes whenever one forward-copies a
>chunk of memory over the top of itself.

What's that, perhaps 0.0001% of the use cases? Roll your own is probably
best for that.

>
>Also it can be used as a way to implement a "multi byte memset", say for
>example, if one wants a fast way to flood-fill a chunk of memory with
>0xDEADBEEF or similar, ...

A simple loop will likely be as fast as any library function.

Historically, the Burroughs B4900 mainframe MVA (Move Alpha) instruction could do both
of the above overlapping move operations if necessary.

Re: Idle: C library features wish-list.

<20221004011719.833@kylheku.com>

https://www.novabbs.com/devel/article-flat.php?id=23427&group=comp.lang.c++#23427

Newsgroups: comp.lang.c
 by: Kaz Kylheku - Tue, 4 Oct 2022 16:56 UTC

On 2022-10-04, BGB <cr88192@gmail.com> wrote:
> In some cases, one "actually needs" a copy operation that will turn the
> output into a repeating pattern of bytes whenever one forward-copies a
> chunk of memory over the top of itself.

You mention this in the context of LZ77 deflate, but I don't see
any such thing in zlib sources. There is a zmemcpy which
in some cases is just a macro for memcpy.

Is Adler missing some clue or something?

> Also it can be used as a way to implement a "multi byte memset", say for
> example, if one wants a fast way to flood-fill a chunk of memory with
> 0xDEADBEEF or similar, ...

A multi-byte memset that is reading from the area where it is writing
seems inefficient, particularly if the memory is cache-cold,
since it will actually be sucking the memory into the processor's
caches only to turn around and blast it out again.

You really want an actual multi-byte memset for that use case.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Re: Idle: C library features wish-list.

<thi4ku$2o2aa$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23428&group=comp.lang.c++#23428

Newsgroups: comp.lang.c
 by: BGB - Tue, 4 Oct 2022 20:20 UTC

On 10/4/2022 8:43 AM, Scott Lurndal wrote:
> BGB <cr88192@gmail.com> writes:
>> On 10/3/2022 4:10 PM, Scott Lurndal wrote:
>>> Kaz Kylheku <864-117-4973@kylheku.com> writes:
>>>> On 2022-10-03, BGB <cr88192@gmail.com> wrote:
>>>
>
>>> Why would someone want to use memcpy for an overlapping
>>> move when memmove is available to handle the overlap cases?
>>>
>>
>> Because memmove will not have the needed semantics in some cases.
>>
>> In this case, memmove will always keep the original buffer contents
>> intact. For some algorithms, this is not the needed behavior.
>>
>>
>> In some cases, one "actually needs" a copy operation that will turn the
>> output into a repeating pattern of bytes whenever one forward-copies a
>> chunk of memory over the top of itself.
>
> What's that, perhaps 0.0001% of the use cases? Roll your own is probably
> best for that.
>

It is niche, granted.

Still, it happened commonly enough that I ended up adding it to my C
library as an extension.

Though, the larger "platform level" API also includes things like LZ
compression and decompression functions (used for various purposes).

>>
>> Also it can be used as a way to implement a "multi byte memset", say for
>> example, if one wants a fast way to flood-fill a chunk of memory with
>> 0xDEADBEEF or similar, ...
>
> A simple loop will likely be as fast as any library function.
>
> Historically, the Burroughs B4900 mainframe MVA (Move Alpha) instruction could do both
> of the above overlapping move operations if necessary.
>

Most of this will be in the context of my ISA (BJX2) at 50 MHz.

Say:
    uint32_t *pi;
    pi=dest;
    for(i=0; i<n; i++)
        pi[i]=0xDEADBEEFU;

Will fill at around 25 MB/sec (with the loop spinning at roughly 8 clock
cycles per iteration).

Note: Compiler in this case is basically incapable of unrolling loops.
It can bundle stuff into VLIW bundles, but there is very little to
bundle in these sorts of loops (one needs a bunch of non-dependent ALU
ops and other stuff going on to effectively make much use of VLIW).

With 32 bits per iteration, it can't even reach full DRAM speed.

Also, the CPU is strictly in-order, so a loop with 8 cycles of latency
will always take (at least) 8 cycles to run.

This is part of why naive byte-copy loops are so slow; there, one is
looking at around 11 cycles per byte.

But, they could also write:
    _mset_uint64(dest, 0xDEADBEEFDEADBEEFULL);
    _memlzcpy(dest+8, dest, (n-2)<<2);

And get basically "max speed".

Depends slightly on things like alignment, copy-size, and where it fits
in the cache hierarchy (fastest cases being around 290 MB/sec, for fills
within the L1 cache; this case would drop to around 180 MB/s for the L2
cache, and around 54 MB/s for external DRAM).

Though, even as slow as DRAM is, it is still faster than it would be to
fill the memory 32 bits at a time in a loop.

RAM is pretty slow in this case: Roughly 100 MB/s for unidirectional
load/store, and 56 MB/s for Swap (conjoined Load+Store). For "memset()"
style tasks, the L2 cache primarily does Swap operations.

Where, in this case, a 64-bit store would give an 8-byte alignment,
which allows using 128-bit memory operations (a 4-byte alignment would
drop it to around 160 MB/s due to needing to drop back to 64-bit
operations).

Where in this case, the ISA is limited to only one memory Load/Store per
clock-cycle (1 throughput, 3 latency, for both 64 and 128 bit Load/Store
ops).

Whereas, say:
    uint64_t *pli, *plie;
    pli=dest; n1=n>>1; plie=pli+n1;
    while(pli<plie)
    {
        pli[0]=0xDEADBEEFDEADBEEFULL;
        pli[1]=0xDEADBEEFDEADBEEFULL;
        pli[2]=0xDEADBEEFDEADBEEFULL;
        pli[3]=0xDEADBEEFDEADBEEFULL;
        pli+=4;
    }

Will reach roughly 160 MB/s, but now "what if n was not an even multiple
of 8 elements?", etc...
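The "not an even multiple" issue can be handled with a scalar cleanup
tail. A portable sketch (fill_u32 is a made-up name, and it assumes
dest is 8-byte aligned, as it would be coming from a 16-byte-aligned
malloc):

```c
#include <stddef.h>
#include <stdint.h>

/* Unrolled 64-bit fill of n 32-bit elements, with scalar cleanup for
 * the leftovers. Both 32-bit halves of 'fill' hold the same value,
 * so the result is independent of host endianness. */
void fill_u32(uint32_t *dest, uint32_t val, size_t n)
{
    uint64_t fill = ((uint64_t)val << 32) | val;
    uint64_t *pli = (uint64_t *)dest;   /* assumes 8-byte alignment */
    size_t n1 = n >> 1;                 /* whole 64-bit stores */

    while (n1 >= 4) {                   /* bulk: 32 bytes per iteration */
        pli[0] = fill; pli[1] = fill;
        pli[2] = fill; pli[3] = fill;
        pli += 4; n1 -= 4;
    }
    while (n1--)                        /* leftover 64-bit stores */
        *pli++ = fill;
    if (n & 1)                          /* odd trailing 32-bit element */
        *(uint32_t *)pli = val;
}
```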

If larger than the L1 size, it will drop down to L2 and then DRAM speeds.

Granted, one could modify the loop further:
    uint64_t *pli, *plie;
    uint64_t fill;

    fill=0xDEADBEEFDEADBEEFULL;
    pli=dest; n1=n>>1; plie=pli+n1;
    while(pli<plie)
    {
        pli[0]=fill; pli[1]=fill;
        pli[2]=fill; pli[3]=fill;
        pli[4]=fill; pli[5]=fill;
        pli[6]=fill; pli[7]=fill;
        pli+=8;
    }

Roughly 14 cycles per iteration, 64 bytes per iteration, so ~ 228 MB/s.

But, what else could one do:
    uint128_t *pli, *plie;
    uint128_t fill;

    fill=0xDEADBEEFDEADBEEFDEADBEEFDEADBEEFUI128;
    pli=dest; n1=n>>2; plie=pli+n1;
    while(pli<plie)
    {
        pli[0]=fill; pli[1]=fill;
        pli[2]=fill; pli[3]=fill;
        pli[4]=fill; pli[5]=fill;
        pli[6]=fill; pli[7]=fill;
        pli+=8;
    }

Now pushes 128 bytes per loop iteration, and can make some use of
128-bit memory store operations (and can hit the 290 MB/s limit).

Obvious problem: Using 128-bit types is no longer standard C.

Also, on this target, this will break if the destination is not properly
aligned, and while you could write:
__unaligned uint128_t *pli, *plie;

This falls back to using smaller (64-bit) stores internally (for
technical reasons, the ISA in question imposes a 64-bit alignment for
128-bit Load/Store operations).

Theoretical hard-limit for 128-bit stores is 800MB/s at 50MHz, but this
can't really be achieved in practice (the closest I have gotten in
practice is around 470MB/s, when storing at 512 bytes per loop iteration).

For most uses, a memory fill loop that moves 64 bytes at a time and hits
a limit of ~ 300 MB/s tends to be a little more practical (fast enough
to saturate DRAM and L2, so "good enough").

On my desktop PC (Zen+, 3.7 GHz), can generally get:
~ 3.8 GB/s, for large fills (4MB).
~ 7.4 GB/s, for medium fills (128K).
~ 12 GB/s, for small fills (8K).

The relative impact of element size or loop structure for the memory
stores seems to be a lot smaller on this machine (likely because of OoO
and similar).

Though, naive byte copy loops are still relatively slow, even on x86-64.

LZ decompressors are harder; it is usually hard to get them much past
around 2 GB/s or so for typical data (for an LZ4 style compression format).

For anything with a Huffman stage, hard to get much past around 600-800
MB/s or so (single threaded).

Hiding this stuff behind a function is "usually" preferable, and then
the program can hopefully avoid a big mess along the lines of, say:

#ifdef _MSC_VER
... MSVC stuff
#endif

#ifdef __GNUC__
... GCC stuff
#endif

#ifdef _M_X64 //MSVC
...
#endif

#ifdef _M_X86 //MSVC
...
#endif

#ifdef __i386__ //GCC or Clang
...
#endif

#ifdef __x86_64__ //GCC or Clang
...
#endif

#ifdef _M_ARM //ARM + MSVC
...
#endif

#ifdef __arm__ //ARM + GCC
...
#endif

#ifdef __BJX2__ //my ISA, assumes it is BGBCC
...
#endif

...

Stuff like this is preferably kept to a minimum, as it tends to turn
programs into a big hairy mess.

One usual option is to put a lot of the "common hair" into files that
one copy/pastes from one project to another, but this still kinda sucks.

And, what usually goes in these blocks:
Wrappers for loading/storing values from pointers;
Specialized memory copy stuff;
...

Basically, some amount of the stuff I proposed in the OP.
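For instance, the pointer load/store wrappers from the OP can be written
in portable C; byte-wise access pins down the little-endian layout
regardless of host endianness or alignment, and mainstream compilers
usually collapse this to a single load/store on LE targets (a sketch,
reusing the names proposed in the OP, not a reference implementation):

```c
#include <stdint.h>

/* Portable sketch of _mget_uint32le / _mset_uint32le. Works for any
 * host endianness and any pointer alignment, with no aliasing issues. */
uint32_t _mget_uint32le(const void *ptr)
{
    const unsigned char *p = ptr;
    return  (uint32_t)p[0]        | ((uint32_t)p[1] <<  8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

void _mset_uint32le(void *ptr, uint32_t val)
{
    unsigned char *p = ptr;
    p[0] = (unsigned char)(val);
    p[1] = (unsigned char)(val >>  8);
    p[2] = (unsigned char)(val >> 16);
    p[3] = (unsigned char)(val >> 24);
}
```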

The malloc stuff is its own category, mostly because of annoyances where
baseline malloc isn't really sufficient, and so the program tends to
need to implement its own memory allocator.

Re: Idle: C library features wish-list.

<thifob$2p2ns$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23429&group=comp.lang.c++#23429

Newsgroups: comp.lang.c
 by: BGB - Tue, 4 Oct 2022 23:30 UTC

On 10/4/2022 11:56 AM, Kaz Kylheku wrote:
> On 2022-10-04, BGB <cr88192@gmail.com> wrote:
>> In some cases, one "actually needs" a copy operation that will turn the
>> output into a repeating pattern of bytes whenever one forward-copies a
>> chunk of memory over the top of itself.
>
> You mention this in the context of LZ77 deflate, but I don't see
> any such thing in zlib sources. There is a zmemcpy which
> in some cases is just a macro for memcpy.
>
> Is Adler missing some clue or something?
>

No zlib here, rather a custom implementation.

One needs Deflate mostly for things like ZIP and PNG, but one doesn't
necessarily need to use zlib to do so (nor does one need libpng for PNGs).

Well, and in this case, also not running on a conventional OS.
No Linux or Windows in this case, rather "TestKern", which is sort of like:
  Uses a lot of file-formats and other stuff borrowed from Windows
    (PE/COFF, RIFF based formats, ...);
  Uses an API design more modeled after POSIX;
  General "architecture" has more in common with MS-DOS at this stage
    (memory protection is borderline non-existent; virtual memory more or
    less works, but all of the program instances still currently run in a
    single shared virtual address space).

Also, like DOS (and unlike Linux or Windows), doesn't have preemptive
multitasking yet. Had written some code for this, but it isn't yet used,
and work is still needed for "processes" to be a thing. Cooperative
multithreading isn't exactly the same.

Porting software to it is a little bit of a hassle (things like
"./configure" aren't going to work when one doesn't even have Bash).

Porting something like Linux or BSD or similar would probably be better,
but porting these to my ISA looks like an uphill battle (and porting the
GNU userland isn't really going to work out well without GCC support for
this architecture, ...).

As for the way _memlzcpy fits in with LZ77, it is typically that matches
are expressed as a backwards distance and a length. Where, if the
distance is less than the length, one gets a repeating pattern.

So:
    ct=_memlzcpy(ct, ct-dist, len);
Can express the typical LZ style match-copy operation.

Say, for decoding an LZ4 style format, one could write:
    ct=dest; cs=src; cse=cs+csize;
    while(cs<cse)
    {
        i=*cs++;
        rl=i>>4; ml=(i&15)+4;
        if(rl==15)
        {
            i=*cs++;
            while(i==255)
                { rl+=i; i=*cs++; }
            rl+=i;
        }
        ct=_memlzcpyf(ct, cs, rl);
        cs+=rl;
        if(cs>=cse)
            break;
        md=_mget_uint16le(cs);
        cs+=2;
        if(ml==19)
        {
            i=*cs++;
            while(i==255)
                { ml+=i; i=*cs++; }
            ml+=i;
        }
        ct=_memlzcpyf(ct, ct-md, ml);
    }

Decided to leave out going into a bunch of stuff related to LZ compressors.

They are used for various purposes; among other things, compression
serves as a way to read data from the SDcard faster (at 12.5 MHz, the
SDcard only manages IO at around 1.5 MB/s in SPI mode).

The main target I am dealing with for this is mostly on my BJX2 ISA,
which is basically a 64-bit 3-wide VLIW, generally runs on FPGA,
generally at 50MHz.

In some areas, it is kinda meh:
Runs Doom at ~ 15-20 fps;
Runs Hexen at ~ 8-10 fps;
Runs ROTT at ~ 10-12 fps;
SW Quake at ~ 2-4 fps;
GLQuake at ~ 5-8 fps;
At present, only gets ~ 74k in Dhrystone (~ 0.84 DMIPS/MHz, *1);
...

But, it does a "surprisingly passable" job at things like software
OpenGL (*2) rasterization (and a lot of "my own stuff" does rendering
using the software rasterized OpenGL; also a custom implementation
optimized for this ISA, with a fair chunk written in ASM).

*1: At Dhrystone, it seems that RISC-V gets better DMIPS/MHz scores.
I suspect some of this is due to GCC being a lot more "clever" than my
compiler (BGBCC).
Arguably, RISC-V is still a much better option in terms of "being
practical for general use".

Though, my core can generally pass timing at higher clock speeds than
the SweRV core, even if the DMIPS/MHz score is worse. Both need roughly
similar-class FPGAs (XC7S50 or XC7A100 or similar), though their
internal architectures are very different. Have also noted that
Dhrystone on simpler 1-wide scalar RISC-V cores seems closer to around
0.6 DMIPS/MHz (rather than ~ 1.4).

Their pipeline in particular is very different: they seem to have the
instruction pipeline and memory load/store as two independent components
(connected via a FIFO interface or similar).

In my case, the L1 caches and pipeline operate in lockstep. So, if an L1
miss happens, the pipeline stalls until the situation is resolved (and
all instructions in a VLIW bundle advance strictly in lockstep).
Extracting usable ILP from a program is left pretty much entirely to the
compiler (and/or the person writing ASM code for it).

*2: It implements the OpenGL API, more or less, but is basically a
software renderer on the backend, and uses affine filtering (with
dynamic tessellation), so it tends to generate output that resembles
the original PlayStation more than a modern GPU.

Software Quake is around 2-4 fps, GLQuake is around 5-8.
Had worked some on trying to port Quake 3 Arena to it, but this fizzled
out, as Q3A is both memory-hungry and very unlikely to be usable.

Partly this is because an OpenGL style rasterizer can make slightly more
effective use of the CPU's VLIW capabilities.

Ironically, I do have a small custom "Minecraft like" 3D engine running
on it (though, staying above 5 fps requires limiting it to a 12 meter
draw distance, which kinda sucks).

Also ironically, still faster than trying to run "actual" Minecraft with
a similar draw distance on a laptop from 2003 (which has 36x higher
clock speed). Though, this laptop is plenty fast enough to run Quake and
similar.

Can do video playback semi-passably. But, it is a balancing act between
the computational cost of the video decoding and keeping the bitrate low
enough that it doesn't get stuck on IO bandwidth (some of the "classic"
codecs like CRAM or RPZA need too much IO bandwidth to get the video
data off the SDcard).

Was generally having best results in this case with hybrid CC/VQ codecs
with an LZ post-compression stage.

Had observed that it is fast enough at JPEG decoding that it is at least
possible an MPEG style decoder could be used (not tested yet). The main
"slow parts" of the JPEG decoding in this case are the Huffman/VLC
decoding and writing stuff to the output framebuffer (things like IDCT
and the YCbCr->RGB transform map "reasonably well" to VLIW).

Though, an MPEG-like codec would likely be limited to 160x100 or
similar, as 320x200 is likely to have more computational cost than the
CPU could deal with at 50 MHz.

With the VQ+LZ approach was generally able to manage 320x200 video.

>> Also it can be used as a way to implement a "multi byte memset", say for
>> example, if one wants a fast way to flood-fill a chunk of memory with
>> 0xDEADBEEF or similar, ...
>
> A multi-byte memset that is reading from the area where it is writing
> seems inefficient, particularly if the memory is cache-cold,
> since it will actually be sucking the memory into the processor's
> caches only to turn around a blast it out again.
>
> You really want an actual multi-byte memset for that use case.
>

That was also a possible consideration. Depends mostly on use case.

The main hassle with a multi-byte memset is that one (potentially) needs
a separate version for each fill size.

But, yeah, something like _memset16, _memset32, or _memset64 could also
address this use-case.

Though, a multibyte memset would make things like alignment a little
easier (in my case, calls like "malloc()" always return memory with a
16-byte alignment).
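A plain-C sketch of such a hypothetical _memset32 (the name is from the
paragraph above, not an existing library function; memcpy of a doubled
64-bit pattern stands in for proper wide stores, and is alignment-safe):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical _memset32: fill n 32-bit elements with val. Pairing
 * the value into a 64-bit pattern lets most compilers emit one 64-bit
 * store per iteration; both halves are equal, so the result does not
 * depend on endianness. */
void *_memset32(void *dst, uint32_t val, size_t n)
{
    uint32_t *p = dst;
    uint64_t pat = ((uint64_t)val << 32) | val;

    while (n >= 2) {
        memcpy(p, &pat, sizeof(pat));   /* unaligned-safe 64-bit store */
        p += 2; n -= 2;
    }
    if (n)                              /* odd trailing element */
        *p = val;
    return dst;
}
```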

...

Re: Idle: C library features wish-list.

<thjh1b$2uln1$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23430&group=comp.lang.c++#23430

Newsgroups: comp.lang.c
 by: BGB - Wed, 5 Oct 2022 08:58 UTC

On 10/4/2022 3:20 PM, BGB wrote:
> On 10/4/2022 8:43 AM, Scott Lurndal wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 10/3/2022 4:10 PM, Scott Lurndal wrote:
>>>> Kaz Kylheku <864-117-4973@kylheku.com> writes:
>>>>> On 2022-10-03, BGB <cr88192@gmail.com> wrote:
>>>>
>>
>>>> Why would someone want to use memcpy for an overlapping
>>>> move when memmove is available to handle the overlap cases?
>>>>
>>>
>>> Because memmove will not have the needed semantics in some cases.
>>>
>>> In this case, memmove will always keep the original buffer contents
>>> intact. For some algorithms, this is not the needed behavior.
>>>
>>>
>>> In some cases, one "actually needs" a copy operation that will turn the
>>> output into a repeating pattern of bytes whenever one forward-copies a
>>> chunk of memory over the top of itself.
>>
>> What's that, perhaps 0.0001% of the use cases?   Roll your own is
>> probably
>> best for that.
>>
>
> It is niche, granted.
>
> Still happened commonly enough to where I ended up adding it to my C
> library as an extension.
>
> Though, the larger "platform level" API also includes things like LZ
> compression and decompression functions (used for various purposes).
>

I noted that the version that existed online was a few months out of
date, so stuck the code for a newer version on pastebin:
https://pastebin.com/6TG2DbEc

While I guess one could object to the wanton abuse of pointers in this
code, the target ISA allows 64-bit (and smaller) pointers to be
misaligned, and my compiler doesn't do "Strict Aliasing" / TBAA by
default (otherwise, some of this wouldn't really fly in a compiler which
does TBAA).

In this case, abusing the pointers is generally the fastest way to get
this stuff shuffled around in memory.

In some other parts of the codebase, there are wrappers around the raw
pointer casts and derefs.

In a few of these cases, a constant-size "memcpy()" could be used, which
BGBCC will optimize into bare loads/stores internally, so it could be
worth considering.

Though, one still needs to be sparing with memcpy, since how effectively
it may be optimized away is a bit hit-or-miss. It will still tend to end
the current basic-block, which in turn may cause the compiler to spill
all the not-statically-assigned variables to the stack even in cases
where it does turn the call into memory ops (sometimes it will spill the
registers just to reload them again so that it can do a memory op,
discard the registers, and then reload them again from the stack after
the "call" has returned).

Though, this is partly because the front-end doesn't necessarily know in
advance whether the backend will treat it like a built-in or emit a
function call (for functions known to be intrinsics, it can use a
different operator and sidestep the spill-and-reload parts implied by a
normal function call).

Bare pointer cast and de-reference at least manages to avoid this
particular issue...
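For comparison, the two styles side by side (a generic sketch with
made-up names; with a compiler that treats small constant memcpy as a
built-in, both typically compile to the same single load):

```c
#include <stdint.h>
#include <string.h>

/* Reading a 64-bit value from an untyped buffer. The cast form relies
 * on no-TBAA and misaligned loads being allowed (as on the target
 * described above); the memcpy form is well-defined standard C. */
uint64_t get_u64_cast(const void *p)
{
    return *(const uint64_t *)p;
}

uint64_t get_u64_memcpy(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```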

In these functions, the "register" keyword isn't used mostly because,
as-is, my compiler is mostly smart enough at this point to figure out
which variables to prioritize (usually), and otherwise "register" serves
as a hint that the compiler should assume that the function is on the
hot path (and using this keyword too casually may have a detrimental
effect on code-density).

Can't expect too many "clever" optimizations though, my compiler isn't
particularly clever about this stuff...

As can be observed, most of this isn't ASM (the actual implementation of
the "memcpy()" function in this case is written in ASM though).

Re: Idle: C library features wish-list.

<86a64nc7vi.fsf@linuxsc.com>

https://www.novabbs.com/devel/article-flat.php?id=23742&group=comp.lang.c++#23742

Newsgroups: comp.lang.c
 by: Tim Rentsch - Sat, 19 Nov 2022 14:54 UTC

BGB <cr88192@gmail.com> writes:

> On 10/2/2022 10:38 PM, Tim Rentsch wrote:
>
>> BGB <cr88192@gmail.com> writes:
>>
>>> There are some things that come up often that it might be "useful" if
>>> they could be supported in a more portable ways.
>>>
>>> [ ... ]
>>>
>>> Any thoughts?...
>>
>> None of these is suitable for inclusion in the ISO C standard.
>
> Possibly.

Definitely.

> This is stuff that comes up a lot, and almost every non-trivial
> program needs to implement a lot of this itself [...]

No, it doesn't. Some programs do. Certainly it is not the case
that almost every non-trivial program does.

> Like, say, for example, what if the C library had not provided
> "memcpy()" and similar, [...]

There are obvious differences between memcpy() and the interfaces
you describe. Those differences explain why memcpy() should be
included in the C standard library, and the described interfaces
should not.

> [...]

Your thoughts are rather scattered. If you want to make a
serious suggestion, you should focus on exactly what it is you
want to suggest, and talk about that, and nothing else.
