alt.os.linux: [OT] really delusional compression ratio => WHAT NOT TO compress

Subject / Author

* [OT] really delusional compression ratio => WHAT NOT TO compress (MarioCPPP)
+* Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Paul)
|+* Re: [OT] really delusional compression ratio => WHAT NOT TO compress (MarioCPPP)
||+- Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Bit Twister)
||`- Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Carlos E.R.)
|`- Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Jim Diamond)
+* Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Dan Purgert)
|`- Re: [OT] really delusional compression ratio => WHAT NOT TO compress (Jim Diamond)
`- Re: [OT] really delusional compression ratio => WHAT NOT TO compress (MarioCPPP)

[OT] really delusional compression ratio => WHAT NOT TO compress

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: alt.os.linux
Date: Fri, 9 Jun 2023 12:20:36 +0200
Message-ID: <u5uudm$1slrc$1@dont-email.me>


I am writing a "per file" backup program in Gambas (btw, if anybody is
interested, I can send them the source code, which has one issue I must
fix *).
Now, it took more or less 24 hours to back up:
compressed: 526282 regular files, 613.5 GB (614.9 GB on disk)
original (most of the large ones already compressed and left as such):
699.8 GB (701.1 GB on disk)

Now I have added a feature to COMPRESS most files but COPY the
pre-compressed ones as such, based on a (user-modifiable) list of
popular compressed extensions:
.7z .7Z .Z .zip .rar .arj .gz .bz2 .xz .lzma .lzh .bz
which is far from complete ... apropos: any hint here is WELCOME!
The very poor compression ratio (slightly less than 88 % of the
original size) is surely due in part to the fact that a lot of stuff,
especially the LARGE files, was already compressed before the backup.
But inspecting the backup with k4DirStat and QDirStat, looking for
large items, I noticed that a lot of MULTIMEDIA files that had been
pointlessly compressed came out the same size.
It was a mistake not to consider that much multimedia content is
already heavily optimized (often lossily), which makes it pointless to
try to squeeze it further (the backup was annoyingly SLOW, since I
selected the BZIP2 algorithm with the MAXIMUM compression option, which
theoretically gives the best results at the expense of execution time
... but with lots of video / audio files that "best" is really
negligible).
So, in your opinion, which multimedia files, by extension (the program
is not smart enough to look inside files and guess entropy; it
considers the extension alone), deserve compression, and which would be
a lot faster to store as such?

E.g. .BMP, .WAV, .AVI seem fit for compression;
.MP3, .MP4, .MKV, .WEBM and a lot more I cannot recall seem to be
already well compressed.
For future use I'd like to add these extensions to the Copy-As-Such
list.
I need some advice: the media extensions are really numerous, and I
don't know which are the most frequent. I don't have a program that
scans the disk and produces a count for each extension (a sketch of
one is below). Adding a file type that occurs just a couple of times
would change nothing, but one that occurs some hundreds of times would!
Tnx for any advice for my "skip list"
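
A minimal shell sketch of such a counter (assuming GNU find/coreutils;
the lowercase fold and the "head -40" cut-off are arbitrary choices):

  # Count files per extension, most frequent first; extensions are
  # lowercased so .JPG and .jpg land in the same bucket.
  $ find . -type f -name '*.*' | sed 's/.*\.//' \
        | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -40
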
***
I am still unable to REFRESH (I have little confidence with the event
loop; I tried a WAIT statement, with unexpected results, then removed
it and left only REFRESH, which is delayed indefinitely until the end
of the duty cycle :\)
I followed its course in the debugger, but this is very annoying for
somebody who wants to run the software in normal mode: the interface
stays frozen for a day (it seems stuck and broken, but internally it
really traps / catches every kind of error and LOGS it to a file, for
checking backup problems manually; it was written with reliability in
mind, not speed). But it is annoying to see no progress bar move, and
then watch them all jump to 100 % at completion :\ I'll maybe try to
use a timer object to force a refresh every X seconds.

--
1) Resist, resist, resist.
2) If everybody pays taxes, taxes are paid by everybody.
MarioCPPP

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: nos...@needed.invalid (Paul)
Newsgroups: alt.os.linux
Date: Fri, 9 Jun 2023 07:45:20 -0400
Message-ID: <u5v3ch$1t7og$1@dont-email.me>
References: <u5uudm$1slrc$1@dont-email.me>

On 6/9/2023 6:20 AM, MarioCPPP wrote:

> So, in your opinion, which multimedia files, by extension (the program
> is not smart enough to look inside files and guess entropy; it
> considers the extension alone), deserve compression, and which would
> be a lot faster to store as such?

<snip>

If you have individually compressed a large file set from your sample
disk, then you likely have all the data you need. You could compare the
file list of the original files against the file list of the
individually compressed files, and then note the pattern of the ones
that do not compress well.

In the following example, DO NOT run this on large files if you can
help it. This is just a way to determine how much compression can be
achieved, without running a compressor as such.

$ sudo apt install ent

$ ent sample.mp4

Entropy = 7.998724 bits per byte. <=== almost incompressible, like /dev/random below

Optimum compression would reduce the size
of this 23730017 byte file by 0 percent.

$ dd if=/dev/zero of=zero.bin bs=1024 count=1024
$ ent zero.bin

Entropy = 0.000000 bits per byte. <=== very very compressible, much redundancy

Optimum compression would reduce the size
of this 1048576 byte file by 100 percent. <=== there is always a little overhead...

$ dd if=/dev/random of=devrandom.bin bs=1024 count=1024
$ ent devrandom.bin

Entropy = 7.999825 bits per byte. <=== crypto-quality source

Optimum compression would reduce the size
of this 1048576 byte file by 0 percent.

$ dd if=/dev/urandom of=devurandom.bin bs=1024 count=1024
$ ent devurandom.bin

Entropy = 7.999793 bits per byte. <=== not a crypto-quality source, tiny diff in entropy

Optimum compression would reduce the size
of this 1048576 byte file by 0 percent.
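
A cheaper per-file test in the same spirit (a sketch, not something ent
does: gzip only the first MiB and see whether it shrinks; the sample
size and the 95 percent threshold are assumptions to tune):

  # If a 1 MiB sample barely shrinks under fast gzip, the whole file is
  # almost certainly not worth compressing; copy it as-is instead.
  f=sample.mp4        # placeholder file name
  orig=$(head -c 1048576 "$f" | wc -c)
  comp=$(head -c 1048576 "$f" | gzip -1 -c | wc -c)
  if [ $((comp * 100)) -ge $((orig * 95)) ]; then
      echo "$f: skip (already compressed)"
  else
      echo "$f: compress"
  fi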

Also note that, of the various compression algorithms, some "speed up"
when they hit really low-entropy material. Arithmetic compressors don't
speed up quite like that. BZ2 is likely an excellent compressor, but on
the other hand, if you compress a file of zeros with it, BZ2 takes more
time than gzip does on the same file.

Backup programs, commercial ones, tend to use "lightweight"
compressors. The developer's opinion is that a short backup time is
more important than the absolutely smallest output. I don't
particularly agree with this, because disk drives cost money. The
degree of compression used, the lightness, is intended to approximately
match the write speed of the destination disk drive. So if the average
backup drive does 100 MB/sec, then the compression method selected is
intended to produce output at that level (100 MB/sec) during a backup.
They do not want the compressor to be the rate-limiting step.

It is inevitable that if you select high compression from an arithmetic
compressor, it will never match the I/O rates of the storage devices.
It will behave like a pig. Customers will be unimpressed.
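
A quick way to see which side is the bottleneck (a sketch; big.tar is a
placeholder for any large input file):

  # Compare compressor throughput against the backup drive's write
  # speed. If the effective MB/s here is well below the drive's
  # ~100 MB/s, the compressor is the rate-limiting step.
  $ time gzip -1  -c big.tar > /dev/null    # lightweight setting
  $ time bzip2 -9 -c big.tar > /dev/null    # maximum-compression setting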

*******

By using multi-core compression (all the CPU cores busy on the
compression), you can improve the performance. An arithmetic compressor
can be memory-bandwidth-bound, which is why you would hope a 5800X3D
(with extra cache) would run faster than a vanilla 5800 CPU. One reason
that does not happen is that the L3 on such a chip is only a bit faster
than main memory. The larger a cache becomes, the greater the access
speed penalty on it, generally. Some caches are also "segmented", and
the "distance" from the core to the cache, the routing, slows it down.
This is why the large cache on an Epyc or a Milan isn't as effective as
you might like.

As an example of a multi-core one, you could experiment with "pigz". I
do not know how well it scales, or exactly how many cores it can use,
but it is an example of a multi-core compression program. On
decompression it is still a single thread of execution, last time I
checked.

https://linux.die.net/man/1/pigz
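
A typical invocation (a sketch; -p sets the pigz thread count, and
nproc just supplies the machine's core count):

  # gzip-compatible output, but the compression is spread across all
  # cores; -k keeps the original file, -9 asks for maximum compression.
  $ pigz -9 -k -p "$(nproc)" big.tar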

But if you were hoping to "set the world on fire" with your programming
skills in this topic, others have been there and "got their T-shirt"
:-) You are not going to beat any other developer at this game on your
first try.

Summary: If you consider disk drives expensive, then BZ2 is the answer.
If you consider disk drives cheap, you can try pigz instead. And yes,
if you want, you can avoid compressing the .mp4 files. Good idea. PDF
files can be compressed a bit, but not much.

Make sure the CPU has a good cooler for jobs like this, as the CPU is
going to get warm. Compressing all the data on a decent-sized hard
drive costs about $1 in electricity, released as heat.

Paul

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: dan...@djph.net (Dan Purgert)
Newsgroups: alt.os.linux
Date: Fri, 9 Jun 2023 12:17:57 -0000 (UTC)
Message-ID: <slrnu8665r.tpe.dan@djph.net>
References: <u5uudm$1slrc$1@dont-email.me>

On 2023-06-09, MarioCPPP wrote:
> [...]
> So, in your opinion, which multimedia files, by extension (the program
> is not smart enough to look inside files and guess entropy; it
> considers the extension alone), deserve compression, and which would
> be a lot faster to store as such?

All of them. They're already compressed. It's probably easier to talk
about what IS viable for compression --> text documents (*.odf, *.txt,
source code, etc.)

--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: alt.os.linux
Date: Sat, 10 Jun 2023 09:04:30 +0200
Message-ID: <u617a1$27vbb$1@dont-email.me>
References: <u5uudm$1slrc$1@dont-email.me>

On 09/06/23 12:20, MarioCPPP wrote:
CUT

Here is a work-in-progress list, updated from recent searches:

.fsa .squashfs .squash .lzw .lz77
.ppm .mp3 .aac .ogg .ogx .vorbis .gif .png .jpeg .jpg .heif
.mpeg .mpg .mp4 .m4v .hevc
.webm .mkv .mk3d .avi .wmv .asf .DivX .mov .qt .FLV .F4V .SWF
.m2p .ps .ts .tsv .m2ts .mts .vob .evo .3gp
.jr .jz ?? jahr archives ?
.apk .msi .deb

Also, are the APT packages .deb, or some other intermediate format that
is cached?

(I also mentioned a typically Windows format, .MSI, since here and
there some installers linger in backups.)

Please share your experience with other, rarer formats (also for
media), and whether you know of media formats (video, audio, images)
that for sure should NOT be compressed ... a sketch of the extension
test itself follows.
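
A minimal sketch of the extension test in POSIX shell (SKIP stands in
for the user-editable list above, lowercased; $file is a placeholder
and is assumed to contain a dot):

  # Decide copy-vs-compress from the extension alone, case-insensitively.
  SKIP="7z zip rar arj gz bz2 xz lzma lzh mp3 mp4 mkv webm avi jpg png gif deb apk msi fsa"
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case " $SKIP " in
      *" $ext "*) echo "$file: copy as-is" ;;
      *)          echo "$file: compress"   ;;
  esac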


--
1) Resist, resist, resist.
2) If everybody pays taxes, taxes are paid by everybody.
MarioCPPP

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: alt.os.linux
Date: Sat, 10 Jun 2023 09:38:02 +0200
Message-ID: <u6199p$286r1$1@dont-email.me>
References: <u5uudm$1slrc$1@dont-email.me> <u5v3ch$1t7og$1@dont-email.me>

On 09/06/23 13:45, Paul wrote:
> On 6/9/2023 6:20 AM, MarioCPPP wrote:
>>
CUT my question

> If you have individually compressed a large file set from your sample
> disk, then you likely have all the data you need. You could compare
> the file list of the original files against the file list of the
> individually compressed files.

No need: ALL of the problems are logged. I found some 14 broken
symlinks (maybe because in the "DIR" command I set FOLLOW SYMLINKS =
FALSE, so as not to spread the scan to other disks) and just one file
whose name part was exactly 255 bytes long, which produced an error
when trying to append a further ".bz2" (four chars).
There seems to be no limit on the whole PATH length, but 255 bytes
applies to every single component of the path (filename included).
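
Those limits can be queried per filesystem (a sketch; /mnt/backup is a
placeholder mount point):

  $ getconf NAME_MAX /mnt/backup   # per-component limit, typically 255
  $ getconf PATH_MAX /mnt/backup   # whole-path limit, typically 4096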

> And then note the pattern of the ones that do not compress well.

Yes, but I am just "sampling", since I cannot examine each of 500 K+
files.
> In the following example, DO NOT run this on large files if you can
> help it. This is just a way to determine how much compression can be
> achieved, without running a compressor as such.
>
>  $ sudo apt install ent
>
>  $ ent sample.mp4
>
>    Entropy = 7.998724 bits per byte.  <=== almost incompressible,
>                                            like /dev/random below
>
>    Optimum compression would reduce the size
>    of this 23730017 byte file by 0 percent.
>
>  $ dd if=/dev/zero of=zero.bin bs=1024 count=1024
>  $ ent zero.bin
>
>    Entropy = 0.000000 bits per byte.  <=== very very compressible,
>                                            much redundancy

Very interesting indeed!
But I am really at a loss estimating how much time (a sunk cost) I
would spend up front calculating entropy, compared with the time I
would save by not compressing the high-entropy files.

>    Optimum compression would reduce the size
>    of this 1048576 byte file by 100 percent.  <=== there is always a
>                                                    little overhead...
>
>  $ dd if=/dev/random of=devrandom.bin bs=1024 count=1024
>  $ ent devrandom.bin
>
>    Entropy = 7.999825 bits per byte.  <=== crypto-quality source
>
>    Optimum compression would reduce the size
>    of this 1048576 byte file by 0 percent.
>
>  $ dd if=/dev/urandom of=devurandom.bin bs=1024 count=1024
>  $ ent devurandom.bin
>
>    Entropy = 7.999793 bits per byte.  <=== not a crypto-quality
>                                            source, tiny diff in entropy
>
>    Optimum compression would reduce the size
>    of this 1048576 byte file by 0 percent.
>
> Also note that, of the various compression algorithms, some "speed
> up" when they hit really low-entropy material. Arithmetic compressors
> don't speed up quite like that. BZ2 is likely an excellent
> compressor, but on the other hand, if you compress a file of zeros
> with it, BZ2 takes more time than gzip does on the same file.

Well, no algorithm is perfect, but I don't know the details well enough
to try to choose a different algorithm for each kind of file.
> Backup programs, commercial ones, tend to use "lightweight"
> compressors.

This is by far not commercial, just for my own use.
The greatest care went into the LOG feature, which helps with outcome
analysis. The log contains "formats" that help locate the different
errors using RegEx.

> The developer's opinion is that a short backup time is more important
> than the absolutely smallest output. I don't particularly agree with
> this, because disk drives cost money.

Sure! I think this too. Also, a larger set of backups can be stored if
they are more compressed.
I am thinking of using two kinds of backup.
One on a "per-folder" basis (like AWIT DBACKUP does), for longer
storage and more sporadic use, which surely achieves better ratios,
especially with a lot of small files; and one on a "per-file" basis,
which performs worse on compression (the overhead is duplicated for
each small file ... luckily, the waste space in each cluster of the
original file can accommodate this overhead from time to time; I mean,
the coarse granularity of allocation space tends to damp this problem),
but which is more usable for finding single items and RETAINS file
names, which is useful for SEARCHES in the FS. The filenames are not
exposed in the per-folder backup from AWIT DBACKUP (so just the
structure is exposed, but it is not searchable).
This way (single files) is the most resilient: any I/O error can affect
just one file.
Maybe the worst drawback is the heavy disk structure, containing
millions of files. Disks so burdened with filenames are slow to scan
and analyse with tools. There is no free meal :\

What I will no longer do is compress on a per-disk basis. Not only does
it produce huge files (which might not be so resilient for error
recovery), but these images are horribly slow to "mount" in the
compressor, and also slow to check at the end.
On the other hand, per-disk archivers (like .FSA from
FileSystemArchiver and images from DAR) are already really good at
compression themselves. I forgot to add .fsa to the exclusion list, and
one such backup was compressed again. Here is the result:
original actual disk size (.fsa): 29529489408 bytes
compressed size (.bz2): 28779950080 bytes
ratio = 97.4617 %
just 2.5 % of gain, simply ridiculous!
It might have taken an hour or so, I guess (the log does not contain
times ... I am considering producing a second log for stats, which
would be a text file not very amenable to editors, due to its size).

> The degree of compression used, the lightness, is intended to
> approximately match the write speed of the destination disk drive. So
> if the average backup drive does 100 MB/sec,

I have a new 8 TB external (USB 3.1) disk, which does NOT reach that
high; it is rated at 95 MB/s reading and 85 MB/s writing.

> then the compression method selected is intended to produce output at
> that level (100 MB/sec) during a backup. They do not want the
> compressor to be the rate-limiting step.

Well, too far-fetched for me :D
> It is inevitable that if you select high compression from an
> arithmetic compressor, it will never match the I/O rates of the
> storage devices. It will behave like a pig. Customers will be
> unimpressed.
>
> *******
>
> By using multi-core compression (all the CPU cores busy on the
> compression), you can improve the performance. An arithmetic
> compressor can be memory-bandwidth-bound, which is why you would hope
> a 5800X3D (with extra cache) would run faster than a vanilla 5800
> CPU. One reason that does not happen is that the L3 on such a chip is
> only a bit faster than main memory. The larger a cache becomes, the
> greater the access speed penalty on it, generally. Some caches are
> also "segmented", and the "distance" from the core to the cache, the
> routing, slows it down. This is why the large cache on an Epyc or a
> Milan isn't as effective as you might like.
>
> As an example of a multi-core one, you could experiment with "pigz".

The Gambas component supports just 3 algorithms:
gzip, bzip2 and another I cannot recall (the fastest, and the worst in
ratio).
I've chosen bzip2. I don't even know whether it has parallel versions
or not. I guess not, since even under load I did not notice any other
running software slowing down, and CPU usage did not peak much.

> I do not know how well it scales, or exactly how many cores it can
> use, but it is an example of a multi-core compression program. On
> decompression it is still a single thread of execution, last time I
> checked.
>
>    https://linux.die.net/man/1/pigz
>
> But if you were hoping to "set the world on fire" with your
> programming skills in this topic, others have been there and "got
> their T-shirt" :-)

No no, not at all. I am just using existing Gambas components to
perform compression, which btw do not support per-folder compression.
I am evaluating using shell / exec to invoke TAR, to add a layer and
then feed the .tar to the compressor.
But this would add nothing useful over AWIT DBACKUP, which is very good
at per-folder backup. It can even update backups (though it's me who is
unable to exploit this feature).


Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: BitTwis...@mouse-potato.com (Bit Twister)
Newsgroups: alt.os.linux
Date: Sat, 10 Jun 2023 04:53:04 -0500
Message-ID: <slrnu88i00.15vf.BitTwister@wb.home.arpa>
References: <u5uudm$1slrc$1@dont-email.me> <u5v3ch$1t7og$1@dont-email.me>
   <u6199p$286r1$1@dont-email.me>

On Sat, 10 Jun 2023 09:38:02 +0200, MarioCPPP wrote:
> On 09/06/23 13:45, Paul wrote:
>> If you have individually compressed a large file set from your
>> sample disk, then you likely have all the data you need. You could
>> compare the file list of the original files against the file list
>> of the individually compressed files.
>
> No need: ALL of the problems are logged. I found some 14 broken
> symlinks

I always report dangling links in any software install/update package.
My install script winds up running

  symlinks -r / | grep dangling

as root, to warn me of any 'amputated' symlinks.

<BIG SNIP>

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: robin_li...@es.invalid (Carlos E.R.)
Newsgroups: alt.os.linux
Date: Sat, 10 Jun 2023 13:19:35 +0200
Message-ID: <7r2eljxtg5.ln2@Telcontar.valinor>
References: <u5uudm$1slrc$1@dont-email.me> <u5v3ch$1t7og$1@dont-email.me>
   <u6199p$286r1$1@dont-email.me>

On 2023-06-10 09:38, MarioCPPP wrote:
> On 09/06/23 13:45, Paul wrote:
>> On 6/9/2023 6:20 AM, MarioCPPP wrote:

>> The degree of compression used,
>> the lightness, is intended to approximately match the write speed of the
>> destination disk drive. So if the average backup drive does 100MB/sec ,
>
> I have a new 8 TB external (USB 3.1) disk, which does NOT reach that
> high; it is rated at 95 MB/s reading and 85 MB/s writing.
>
>> then the compression method selected is intended to create that level
>> of output (100MB/sec) during a backup. They do not want the compressor
>> to be the rate limiting step.
>
> Well, too far-fetched for me :D

That is my criterion too. I adjust the compression level so that the
backup runs at 150 MB/s or thereabouts. If it runs at only 80, the
compression is slowing it down.
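
One way to watch that rate live (a sketch; pv is the pipe-viewer tool,
and backup.tar.bz2 is a placeholder destination):

  # pv prints the live MB/s flowing through the pipe; if it sits near
  # 80 rather than 150, the compressor is the bottleneck.
  $ tar -cf - /home | pv | bzip2 -9 > backup.tar.bz2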

--
Cheers, Carlos.

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: JimDiam...@jdvb.ca (Jim Diamond)
Newsgroups: alt.os.linux
Date: Sun, 11 Jun 2023 19:15:33 -0300
Message-ID: <slrnu8chs7.8tq.JimDiamond@x360.localdomain>
References: <u5uudm$1slrc$1@dont-email.me> <u5v3ch$1t7og$1@dont-email.me>

On 2023-06-09 at 08:45 ADT, Paul <nospam@needed.invalid> wrote:
> On 6/9/2023 6:20 AM, MarioCPPP wrote:
>>
>> I am writing a "per file" backup program in Gambas (btw, if anybody is interested, I can send them the source code, which has one issue I must fix *)
>>
>> Now, it took more or less 24 hours to back up:
>>
>> compressed: 526282 regular files, 613.5 GB (614.9 GB on disk)
>> original (most of the large ones already compressed and left as such): 699.8 GB (701.1 GB on disk)

<snip>

>> Tnx for any advice to my "skip list"

> In the following example, DO NOT run this on large files, if you can help
> it. This is just a way to determine how much compression can be achieved,
> without using a compressor as such.
>
> $ sudo apt install ent
>
> $ ent sample.mp4
>
> Entropy = 7.998724 bits per byte. <=== almost incompressible, like /dev/random below
>
> Optimum compression would reduce the size
> of this 23730017 byte file by 0 percent.
>
> $ dd if=/dev/zero of=zero.bin bs=1024 count=1024
> $ ent zero.bin
>
> Entropy = 0.000000 bits per byte. <=== very very compressible, much redundancy
>
> Optimum compression would reduce the size
> of this 1048576 byte file by 100 percent. <=== there is always a little overhead...
>

I think ent is no doubt interesting for some purposes, but its output makes
some bold and incorrect claims.

For example:

$ yes blartifas | head -1000 > a
$ ent a
Entropy = 3.121928 bits per byte.

Optimum compression would reduce the size
of this 10000 byte file by 60 percent.

Chi square distribution for 10000 samples is 297200.00, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 96.2000 (127.5 = random).
Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
Serial correlation coefficient is -0.129567 (totally uncorrelated = 0.0).

$ xz a
$ ls -ls a.xz
4 -rw-r--r-- 1 zsd users 120 Jun 11 19:07 a.xz

$ yes blartifas | head -1000 |wc
1000 1000 10000

So xz is able to reduce this file not by the 60 percent that ent
predicts (i.e. down to 40% of its original size), but down to 1.2% of
its original size.

I am (wildly) guessing that ent's claim means "if you are using a
byte-wise entropy coder such as Huffman or arithmetic coding, then ..."
rather than what someone reading "Optimum" could take to mean "the best
possible compressor".

Jim

Re: [OT] really delusional compression ratio => WHAT NOT TO compress

From: JimDiam...@jdvb.ca (Jim Diamond)
Newsgroups: alt.os.linux
Date: Sun, 11 Jun 2023 19:23:14 -0300
Message-ID: <slrnu8ciai.8tq.JimDiamond@x360.localdomain>
References: <u5uudm$1slrc$1@dont-email.me> <slrnu8665r.tpe.dan@djph.net>

On 2023-06-09 at 09:17 ADT, Dan Purgert <dan@djph.net> wrote:
> On 2023-06-09, MarioCPPP wrote:
>> [...]
>> So, in your opinion, which multimedia files, by extension (the
>> program is not smart enough to look inside files and guess entropy;
>> it considers the extension alone), deserve compression, and which
>> would be a lot faster to store as such?
>
> All of them. They're already compressed.

Not quite all of them. Someone might have raw files from a digital
camera, or uncompressed audio files (e.g., .wav files), and I've even
seen windoze users happily email around .bmp files.

> It's probably easier to talk about what IS viable for compression -->
> text documents (*.odf, *.txt, source code, etc.)

I avoid word processors whenever possible, but I created a .odt file
with LibreOffice (there doesn't seem to be a .odf choice in my program)
and that file was apparently already compressed. Or at least, xz didn't
make any significant improvement on it.
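
That matches the format: an .odt is a ZIP container, so its contents
are already deflated. Easy to confirm (document.odt is a placeholder
name):

  $ file document.odt      # reports an OpenDocument (ZIP-based) file
  $ unzip -l document.odt  # lists content.xml etc. inside the container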

Jim
