Rocksolid Light


tech / sci.electronics.design / Re: Archive formats

Subject                          Author
* Archive formats                Don Y
+* Re: Archive formats           Don Y
|`- Re: Archive formats          Cydrome Leader
+* Re: Archive formats           Jasen Betts
|+* Re: Archive formats          Don Y
||`* Re: Archive formats         Martin Brown
|| `* Re: Archive formats        Don Y
||  `* Re: Archive formats       Martin Brown
||   `- Re: Archive formats      Don Y
|`* Re: Archive formats          Dave Platt
| `* Re: Archive formats         Don Y
|  +* Re: Archive formats        Dave Platt
|  |`- Re: Archive formats       Don Y
|  `* Re: Archive formats        Martin Brown
|   +* Re: Archive formats       Don Y
|   |`* Re: Archive formats      Martin Brown
|   | `* Re: Archive formats     Don Y
|   |  `- Re: Archive formats    Don Y
|   `- Re: Archive formats       Phil Hobbs
`* Re: Archive formats           Jan Panteltje
 +* Re: Archive formats          Don
 |`- Re: Archive formats         Don Y
 `* Re: Archive formats          Don Y
  `- Re: Archive formats         Jan Panteltje

Archive formats

<so6vaj$q31$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83922&group=sci.electronics.design#83922

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Archive formats
Date: Tue, 30 Nov 2021 21:56:46 -0700
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <so6vaj$q31$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 04:56:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="26721"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Wm1VTna92Wca/60dm2S0W"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:C8i+aCU2hC8tJ8CGJ0k2QGCcFIw=
Content-Language: en-US
X-Mozilla-News-Host: news://news.eternal-september.org:119
 by: Don Y - Wed, 1 Dec 2021 04:56 UTC

I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

What's come to mind includes (I'm not being pedantic, here -- sometimes
using file extensions to represent file formats):

7z
ace
apk
arc
arj
brotli
bzip2
cab
cfs
compress
cpio
cpt
dar
dmg
egg
gzip
jar
lbr
lha
lz4
lzip
lzma
lzop
lzx
mpq
pea
rar
rpm
shar
sit
sitx
sq
sqx
tar
xar
xz
zip
zoo
zopfli
zpaq
zstd

Daunting list, eh? Any others that I have overlooked?

Re: Archive formats

<so72n4$9m0$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83925&group=sci.electronics.design#83925

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Tue, 30 Nov 2021 22:54:39 -0700
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <so72n4$9m0$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 05:54:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="9920"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Mum3bxaGduIjR7izzk9vP"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:E8FFLFL1/WUGS7/r4rN4VtOcqto=
In-Reply-To: <so6vaj$q31$1@dont-email.me>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 05:54 UTC

On 11/30/2021 9:56 PM, Don Y wrote:
> I'm looking for "established" archive formats and/or compression
> formats (the thinking being that an archive can always be subsequently
> compressed).

> Daunting list, eh? Any others that I have overlooked?

Ugh! Skip that. I've apparently missed *dozens* (scores?)... :<

Re: Archive formats

<so764l$r9v$1@gonzo.revmaps.no-ip.org>

https://www.novabbs.com/tech/article-flat.php?id=83931&group=sci.electronics.design#83931

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.szaf.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
From: use...@revmaps.no-ip.org (Jasen Betts)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Organization: JJ's own news server
Message-ID: <so764l$r9v$1@gonzo.revmaps.no-ip.org>
References: <so6vaj$q31$1@dont-email.me>
Injection-Date: Wed, 1 Dec 2021 06:53:09 -0000 (UTC)
Injection-Info: gonzo.revmaps.no-ip.org; posting-host="localhost:127.0.0.1";
logging-data="27967"; mail-complaints-to="usenet@gonzo.revmaps.no-ip.org"
User-Agent: slrn/1.0.3 (Linux)
X-Face: ?)Aw4rXwN5u0~$nqKj`xPz>xHCwgi^q+^?Ri*+R(&uv2=E1Q0Zk(>h!~o2ID@6{uf8s;a
+M[5[U[QT7xFN%^gR"=tuJw%TXXR'Fp~W;(T"1(739R%m0Yyyv*gkGoPA.$b,D.w:z+<'"=-lVT?6
{T?=R^:W5g|E2#EhjKCa+nt":4b}dU7GYB*HBxn&Td$@f%.kl^:7X8rQWd[NTc"P"u6nkisze/Q;8
"9Z{peQF,w)7UjV$c|RO/mQW/NMgWfr5*$-Z%u46"/00mx-,\R'fLPe.)^
Lines: 9
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Wed, 01 Dec 2021 07:00:54 UTC
Date: Wed, 1 Dec 2021 06:53:09 -0000 (UTC)
X-Received-Bytes: 1417
 by: Jasen Betts - Wed, 1 Dec 2021 06:53 UTC

On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
> I'm looking for "established" archive formats and/or compression
> formats (the thinking being that an archive can always be subsequently
> compressed).

If by compressed you mean made smaller, that's obviously false.

--
Jasen.

Re: Archive formats

<so7apt$kun$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83932&group=sci.electronics.design#83932

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 01:12:40 -0700
Organization: A noiseless patient Spider
Lines: 24
Message-ID: <so7apt$kun$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 08:12:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="21463"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19iYJD+0izZP4HnKjpNYP9l"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:4nBDpUSqytdp+PfwHs36RBUjQgU=
In-Reply-To: <so764l$r9v$1@gonzo.revmaps.no-ip.org>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 08:12 UTC

On 11/30/2021 11:53 PM, Jasen Betts wrote:
> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>> I'm looking for "established" archive formats and/or compression
>> formats (the thinking being that an archive can always be subsequently
>> compressed).
>
> If by compressed you mean made smaller, that's obviously false.

No, I mean "original (uncompressed) content being made obscure,
without the *intent* of *hiding* the content". I.e., you can't
(generally) peek into a compressed archive and understand what it
contains, without some assistance from tools.

Given "foo.zip", tell me *anything* about foo? Repeat for
"folder.tgz"? Or, "volume.imz"? (yet, there's nothing that
prevents you from using a tool to examine the contents;
unlike "message.pem")

[And, the *contents* of the archive -- along with the compression
algorithm used -- determine if the result is (physically) smaller
or larger than the original. Compressing (TRULY) random data will
inevitably result in a larger result. Compressing data with
"predictable" patterns will most often result in a space savings
if the chosen compressor knows how to exploit those patterns.]
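
The bracketed point above can be checked empirically. What follows is a minimal sketch, assuming Python's zlib (DEFLATE) as the compressor and a seeded pseudo-random generator as a stand-in for truly random data; the helper names are illustrative and not from the thread.

```python
import random
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Return the size in bytes of data after zlib (DEFLATE) compression."""
    return len(zlib.compress(data, level))


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Deterministic stand-in for random data.

    Assumption: a seeded PRNG is "random enough" that DEFLATE finds
    nothing in its output to exploit.
    """
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    n = 100_000
    samples = (
        ("random", pseudo_random_bytes(n)),
        ("patterned", b"archive formats " * (n // 16)),
    )
    for name, blob in samples:
        print(f"{name:10s} {len(blob):7d} -> {compressed_size(blob):7d} bytes")
```

With this compressor the random sample typically comes out a few bytes larger than the input (framing overhead with nothing to exploit), while the repetitive sample shrinks to a small fraction of its size. Other compressors will give different numbers.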

Re: Archive formats

<so7fnn$ooa$1@gioia.aioe.org>

https://www.novabbs.com/tech/article-flat.php?id=83934&group=sci.electronics.design#83934

Path: i2pn2.org!i2pn.org!aioe.org!fce17Vq++jWAoH1dvt+1NQ.user.46.165.242.75.POSTED!not-for-mail
From: '''newsp...@nonad.co.uk (Martin Brown)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 09:36:55 +0000
Organization: Aioe.org NNTP Server
Message-ID: <so7fnn$ooa$1@gioia.aioe.org>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <so7apt$kun$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="25354"; posting-host="fce17Vq++jWAoH1dvt+1NQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB
 by: Martin Brown - Wed, 1 Dec 2021 09:36 UTC

On 01/12/2021 08:12, Don Y wrote:
> On 11/30/2021 11:53 PM, Jasen Betts wrote:
>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>> I'm looking for "established" archive formats and/or compression
>>> formats (the thinking being that an archive can always be subsequently
>>> compressed).
>>
>> If by compressed you mean made smaller, that's obviously false.
>
> No, I mean "original (uncompressed) content being made obscure,
> without the *intent* of *hiding* the content".  I.e., you can't
> (generally) peek into a compressed archive and understand what it
> contains, without some assistance from tools.

Some archive formats have the directory in a form where you can read it
fairly easily even if it isn't quite in plaintext.
>
> Given "foo.zip", tell me *anything* about foo?  Repeat for
> "folder.tgz"?  Or, "volume.imz"?  (yet, there's nothing that
> prevents you from using a tool to examine the contents;
> unlike "message.pem")

If you want to compare the effectiveness of the different algorithms
then compressing a chunk of web content or a random executable will span
a reasonable range of important use cases.

There was a nice DOS tool called something like ifl that could read the
contents of most common archive formats. Look for it on Simtel.
>
> [And, the *contents* of the archive -- along with the compression
> algorithm used -- determine if the result is (physically) smaller
> or larger than the original.  Compressing (TRULY) random data will
> inevitably result in a larger result.  Compressing data with
> "predictable" patterns will most often result in a space savings
> if the chosen compressor knows how to exploit those patterns.]

Bytewise entropy of the source material will give you a reasonable
independent estimate of how compressible or otherwise it is.

Highly compressed material tends toward ln(256) ~ 5.545
png ~ 5.25
jpg ~ 5.20
exe's ~ 4.4
text ~ 2.0

It can be used to classify unknown data by likely file type.

H = -sum_over_i (p[i] * ln(p[i]))

p[i] = n[i]/N

where n[i] = number of times token i appears
N = sum_over_i (n[i]) = filesize

It will give you a fair guess at whether a given file can still be
compressed by a general compression algorithm. You have to work
incredibly hard to get the last 2% reduction in size.
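
A minimal sketch of the bytewise entropy estimate described above, assuming the "tokens" are single bytes and using natural logarithms so the ceiling is ln(256). The function name is illustrative.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Bytewise Shannon entropy in nats: -sum(p[i] * ln(p[i])).

    p[i] = n[i] / N, where n[i] counts occurrences of byte value i
    and N is the total number of bytes.
    """
    total = len(data)
    if total == 0:
        return 0.0
    counts = Counter(data)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


if __name__ == "__main__":
    import sys

    # Usage: python entropy.py FILE [FILE ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(f"{byte_entropy(fh.read()):6.3f}  {path}")
```

A file whose value sits close to ln(256) ~ 5.545 is unlikely to shrink much further under a general-purpose compressor; a much lower value suggests there is redundancy left to remove.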

--
Regards,
Martin Brown

Re: Archive formats

<so7kdq$ltv$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83935&group=sci.electronics.design#83935

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 03:56:49 -0700
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <so7kdq$ltv$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <so7apt$kun$1@dont-email.me>
<so7fnn$ooa$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 10:56:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="22463"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+wII6her3gca/tl3Y6lXDI"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:92fe9+p9e9grEYeiVb4yPScPshs=
In-Reply-To: <so7fnn$ooa$1@gioia.aioe.org>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 10:56 UTC

On 12/1/2021 2:36 AM, Martin Brown wrote:
> On 01/12/2021 08:12, Don Y wrote:
>> On 11/30/2021 11:53 PM, Jasen Betts wrote:
>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>> I'm looking for "established" archive formats and/or compression
>>>> formats (the thinking being that an archive can always be subsequently
>>>> compressed).
>>>
>>> If by compressed you mean made smaller, that's obviously false.
>>
>> No, I mean "original (uncompressed) content being made obscure,
>> without the *intent* of *hiding* the content". I.e., you can't
>> (generally) peek into a compressed archive and understand what it
>> contains, without some assistance from tools.
>
> Some archive formats have the directory in a form where you can read it fairly
> easily even if it isn't quite in plaintext.

Yes. And, "image" files often allow one to read the "plaintext"
of the contained file -- though often not in a contiguous manner.

>> Given "foo.zip", tell me *anything* about foo? Repeat for
>> "folder.tgz"? Or, "volume.imz"? (yet, there's nothing that
>> prevents you from using a tool to examine the contents;
>> unlike "message.pem")
>
> If you want to compare the effectiveness of the different algorithms then
> compressing a chunk of web content or a random executable will span a
> reasonable range of important use cases.
>
> There was a nice DOS tool called something like ifl that could read the
> contents of most common archive formats. Look for it on Simtel.

Right now, I'm just looking to see how many different ways file
contents are typically "obscured" (without "information hiding"
being an explicit goal)

>> [And, the *contents* of the archive -- along with the compression
>> algorithm used -- determine if the result is (physically) smaller
>> or larger than the original. Compressing (TRULY) random data will
>> inevitably result in a larger result. Compressing data with
>> "predictable" patterns will most often result in a space savings
>> if the chosen compressor knows how to exploit those patterns.]
>
> Bytewise entropy of the source material will give you a reasonable independent
> estimate of how compressible or otherwise it is.
>
> Highly compressed material tends toward ln(256) ~ 5.545
> png ~ 5.25
> jpg ~ 5.20
> exe's ~ 4.4
> text ~ 2.0
>
> It can be used to classify unknown data by likely file type.
>
> H = -sum_over_i (p[i] * ln(p[i]))
>
> p[i] = n[i]/N
>
> where n[i] = number of times token i appears
> N = sum_over_i (n[i]) = filesize
>
> It will give you a fair guess at whether a given file can still be compressed
> by a general compression algorithm. You have to work incredibly hard to get the
> last 2% reduction in size.

I'm not concerned about the compressibility of the file or how
effective a particular tool is at achieving that compression.

Rather, the fact that compressors are commonly applied to
files (and "archives" are files) and, as a result, alter
their representation as a side-effect of their goal.

The only other "regularly applied" tools that alter file
contents typically involve encryption (of varying degrees).

[I can think of no other reason to alter a file's content]

Re: Archive formats

<so7m3r$1cj$1@gioia.aioe.org>

https://www.novabbs.com/tech/article-flat.php?id=83937&group=sci.electronics.design#83937

Path: i2pn2.org!i2pn.org!aioe.org!fce17Vq++jWAoH1dvt+1NQ.user.46.165.242.75.POSTED!not-for-mail
From: '''newsp...@nonad.co.uk (Martin Brown)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 11:25:47 +0000
Organization: Aioe.org NNTP Server
Message-ID: <so7m3r$1cj$1@gioia.aioe.org>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <so7apt$kun$1@dont-email.me>
<so7fnn$ooa$1@gioia.aioe.org> <so7kdq$ltv$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="1427"; posting-host="fce17Vq++jWAoH1dvt+1NQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
X-Notice: Filtered by postfilter v. 0.9.2
Content-Language: en-GB
 by: Martin Brown - Wed, 1 Dec 2021 11:25 UTC

On 01/12/2021 10:56, Don Y wrote:
> On 12/1/2021 2:36 AM, Martin Brown wrote:

>> It will give you a fair guess at whether a given file can still be
>> compressed by a general compression algorithm. You have to work
>> incredibly hard to get the last 2% reduction in size.
>
> I'm not concerned about the compressibility of the file or how
> effective a particular tool is at achieving that compression.
>
> Rather, the fact that compressors are commonly applied to
> files (and "archives" are files) and, as a result, alter
> their representation as a side-effect of their goal.

There are quite a few backup programs that use their own proprietary
encoding and compression, sometimes allowing a tradeoff of speed vs
redundancy vs compression. The one I use historically names its files
with extensions .000 .001 and makes them just under 2^32 bytes each.

Backups are not much use if they easily become write-only, read-never.

Cue April 1st adverts for infinite capacity write only memory...

> The only other "regularly applied" tools that alter file
> contents typically involve encryption (of varying degrees).
>
> [I can think of no other reason to alter a file's content]

Making it more compressible is one such reason: lossy compression always
wins out over lossless unless it is a very peculiar edge case.
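
A minimal sketch of altering content to make it more compressible, assuming a synthetic signal (a slow ramp in the high nibble plus random noise in the low nibble) and zlib as the lossless back end. The signal and the 4-bit quantizer are illustrative choices, not anything from the thread.

```python
import random
import zlib


def make_noisy_ramp(n: int, seed: int = 0) -> bytes:
    """Synthetic samples: slowly rising high nibble plus a random low nibble."""
    rng = random.Random(seed)
    return bytes((((i // 64) % 16) << 4) | rng.getrandbits(4) for i in range(n))


def quantize(data: bytes, keep_bits: int = 4) -> bytes:
    """Lossy step: zero the low (8 - keep_bits) bits of every byte."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(b & mask for b in data)


if __name__ == "__main__":
    raw = make_noisy_ramp(50_000)
    print("lossless only    :", len(zlib.compress(raw, 9)), "bytes")
    print("quantized + zlib :", len(zlib.compress(quantize(raw), 9)), "bytes")
```

Throwing away the noisy low bits leaves long runs of identical bytes, so the same compressor does far better on the quantized copy. The price is that the original low bits cannot be recovered.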

--
Regards,
Martin Brown

Re: Archive formats

<so7n40$8fi$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83938&group=sci.electronics.design#83938

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: pNaonStp...@yahoo.com (Jan Panteltje)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 01 Dec 2021 11:37:33 GMT
Organization: A noiseless patient Spider
Lines: 399
Message-ID: <so7n40$8fi$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 1 Dec 2021 11:42:56 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="bd3bdba9ffe6b072935bc28918439b39";
logging-data="8690"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++tcCJrjbVPNf/RK6hTZ61Iaygm3YKtZ8="
User-Agent: NewsFleX-1.5.7.5 (Linux-2.6.37.6)
Cancel-Lock: sha1:TgQ4l17r8cuBXLjEyvV8z2ERvn8=
X-Newsreader-location: NewsFleX-1.5.7.5 (c) 'LIGHTSPEED' off line news reader for the Linux platform
NewsFleX homepage: http://www.panteltje.com/panteltje/newsflex/ and ftp download ftp://sunsite.unc.edu/pub/linux/system/news/readers/
 by: Jan Panteltje - Wed, 1 Dec 2021 11:37 UTC

On a sunny day (Tue, 30 Nov 2021 21:56:46 -0700) it happened Don Y
<blockedofcourse@foo.invalid> wrote in <so6vaj$q31$1@dont-email.me>:

>I'm looking for "established" archive formats and/or compression
>formats (the thinking being that an archive can always be subsequently
>compressed).
>
>What's come to mind includes (I'm not being pedantic, here -- sometimes
>using file extensions to represent file formats):
>
>7z
>ace
>apk
>arc
>arj
>brotli
>bzip2
>cab
>cfs
>compress
>cpio
>cpt
>dar
>dmg
>egg
>gzip
>jar
>lbr
>lha
>lz4
>lzip
>lzma
>lzop
>lzx
>mpq
>pea
>rar
>rpm
>shar
>sit
>sitx
>sq
>sqx
>tar
>xar
>xz
>zip
>zoo
>zopfli
>zpaq
>zstd
>
>Daunting list, eh? Any others that I have overlooked?

Probably.
I usually use
tar -zcvf my_archive.tgz /xx/yyy/*

The 'v' lists what it does, including filenames.
You could probably use
tar -zcvf my_archive.tgz /xx/yyy/* >my_archive_contents.txt
to get a plain text content file (GNU tar prints the 'v' listing on
stdout when the archive itself goes to a file), and save it with the archive.
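
The same idea (an archive plus a plain-text listing saved beside it) can be sketched with Python's standard tarfile module. This is only an illustration under assumed names; the function and the paths passed to it are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List


def make_tgz_with_manifest(source_dir: str, archive_path: str,
                           manifest_path: str) -> List[str]:
    """Create a gzip-compressed tar of source_dir and write its member
    names, one per line, to a plain-text manifest file."""
    source = Path(source_dir)
    # "w:gz" writes a tar stream through gzip, like `tar -zcf`.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    # Re-open the finished archive so the manifest reflects what was stored.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
    Path(manifest_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

Reading the names back from the finished archive, rather than from the directory walk, means the listing describes what is actually in the file.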

Ultimately YOU decide how you compress / store, encrypt, whatever.
US launch codes are all zeros I've read as in a stress situation those poor guys cannot
remember anything more complicated, so something like Kamalatypezerozerozero.txt would be a good compression example.
Standards.....

This message was muddified by BiBo

ISO format for Blu-ray
contains 3 movies in .ts (transport stream format):

disc number:
991
Thu Aug 23 14:24:40 CEST 2018
BD-R25GB
ext2
Mediarange 4x inkjet printable
LG BH10LS38
Method:
PLEASE STOP ANY RTL_SDR: write data errors observed when that is running!
Make sure you have enough disk space.
dd if=/dev/zero bs=100000000 count=242 > bluray.iso
mke2fs bluray.iso
mount -o loop=/dev/loop0 bluray.iso /mnt/loop
cp ... /mnt/loop/
du /mnt/loop
#umount /dev/loop0
umount /mnt/loop
cd /mnt/sda1/video/satellite
growisofs -speed=4 -dvd-compat -Z /dev/dvd=bluray.iso
dvdimagecmp -a bluray.iso -b /dev/dvd
l /mnt/loop
total 19283944
-rw-r--r-- 1 root root 3906700000 Aug 19 08:38 bond_golden_eye_1995.ts amovie
-rw-r--r-- 1 root root 12913150172 Aug 19 20:15 pirates_of_the_caribbean_dead_mans_chest_2006_HD.ts amovie
-rw-r--r-- 1 root root 2907600000 Aug 20 12:45 bond_spectre_2015.ts amovie

So the .ts contains video in mpeg2 format (so compressed) and several audio channels in mp2 (so also compressed) format.

There seem to be new audio and video compression methods every few years...
If you want a list of many of those, type
ffmpeg -formats
File formats:
D. = Demuxing supported
.E = Muxing supported
--
E 3g2 3GP2 (3GPP2 file format)
E 3gp 3GP (3GPP file format)
D 4xm 4X Technologies
E a64 a64 - video for Commodore 64
D aac raw ADTS AAC (Advanced Audio Coding)
DE ac3 raw AC-3
D act ACT Voice file format
D adf Artworx Data Format
E adts ADTS AAC (Advanced Audio Coding)
DE adx CRI ADX
D aea MD STUDIO audio
D afc AFC
DE aiff Audio IFF
DE alaw PCM A-law
DE alsa ALSA audio output
DE amr 3GPP AMR
D anm Deluxe Paint Animation
D apc CRYO APC
D ape Monkey's Audio
D aqtitle AQTitle subtitles
DE asf ASF (Advanced / Active Streaming Format)
E asf_stream ASF (Advanced / Active Streaming Format)
DE ass SSA (SubStation Alpha) subtitle
DE ast AST (Audio Stream)
DE au Sun AU
DE avi AVI (Audio Video Interleaved)
E avm2 SWF (ShockWave Flash) (AVM2)
D avr AVR (Audio Visual Research)
D avs AVS
D bethsoftvid Bethesda Softworks VID
D bfi Brute Force & Ignorance
D bin Binary text
D bink Bink
DE bit G.729 BIT file format
D bmv Discworld II BMV
D brstm BRSTM (Binary Revolution Stream)
D c93 Interplay C93
DE caf Apple CAF (Core Audio Format)
DE cavsvideo raw Chinese AVS (Audio Video Standard) video
D cdg CD Graphics
D cdxl Commodore CDXL video
D concat Virtual concatenation script
E crc CRC testing
DE daud D-Cinema audio
D dfa Chronomaster DFA
DE dirac raw Dirac
DE dnxhd raw DNxHD (SMPTE VC-3)
D dsicin Delphine Software International CIN
DE dts raw DTS
D dtshd raw DTS-HD
DE dv DV (Digital Video)
D dv1394 DV1394 A/V grab
E dvd MPEG-2 PS (DVD VOB)
D dxa DXA
D ea Electronic Arts Multimedia
D ea_cdata Electronic Arts cdata
DE eac3 raw E-AC-3
D epaf Ensoniq Paris Audio File
DE f32be PCM 32-bit floating-point big-endian
DE f32le PCM 32-bit floating-point little-endian
E f4v F4V Adobe Flash Video
DE f64be PCM 64-bit floating-point big-endian
DE f64le PCM 64-bit floating-point little-endian
D fbdev Linux framebuffer
DE ffm FFM (FFserver live feed)
DE ffmetadata FFmpeg metadata in text
D film_cpk Sega FILM / CPK
DE filmstrip Adobe Filmstrip
DE flac raw FLAC
D flic FLI/FLC/FLX animation
DE flv FLV (Flash Video)
E framecrc framecrc testing
E framemd5 Per-frame MD5 testing
D frm Megalux Frame
DE g722 raw G.722
DE g723_1 raw G.723.1
D g729 G.729 raw format demuxer
DE gif GIF Animation
D gsm raw GSM
DE gxf GXF (General eXchange Format)
DE h261 raw H.261
DE h263 raw H.263
DE h264 raw H.264 video
E hls Apple HTTP Live Streaming
D hls,applehttp Apple HTTP Live Streaming
DE ico Microsoft Windows ICO
D idcin id Cinematic
D idf iCE Draw File
D iff IFF (Interchange File Format)
DE ilbc iLBC storage
DE image2 image2 sequence
DE image2pipe piped image2 sequence
D ingenient raw Ingenient MJPEG
D ipmovie Interplay MVE
E ipod iPod H.264 MP4 (MPEG-4 Part 14)
DE ircam Berkeley/IRCAM/CARL Sound Format
E ismv ISMV/ISMA (Smooth Streaming)
D iss Funcom ISS
D iv8 IndigoVision 8000 video
DE ivf On2 IVF
DE jacosub JACOsub subtitle format
D jv Bitmap Brothers JV
DE latm LOAS/LATM
D lavfi Libavfilter virtual input device
D lmlm4 raw lmlm4
D loas LOAS AudioSyncStream
D lvf LVF
D lxf VR native stream (LXF)
DE m4v raw MPEG-4 video
E matroska Matroska
D matroska,webm Matroska / WebM
E md5 MD5 testing
D mgsts Metal Gear Solid: The Twin Snakes
DE microdvd MicroDVD subtitle format
DE mjpeg raw MJPEG video
E mkvtimestamp_v2 extract pts as timecode v2 format, as defined by mkvtoolnix
DE mlp raw MLP
D mm American Laser Games MM
DE mmf Yamaha SMAF
E mov QuickTime / MOV
D mov,mp4,m4a,3gp,3g2,mj2 QuickTime / MOV
E mp2 MP2 (MPEG audio layer 2)
DE mp3 MP3 (MPEG audio layer 3)
E mp4 MP4 (MPEG-4 Part 14)
D mpc Musepack
D mpc8 Musepack SV8
DE mpeg MPEG-1 Systems / MPEG program stream
E mpeg1video raw MPEG-1 video
E mpeg2video raw MPEG-2 video
DE mpegts MPEG-TS (MPEG-2 Transport Stream)
D mpegtsraw raw MPEG-TS (MPEG-2 Transport Stream)
D mpegvideo raw MPEG video
E mpjpeg MIME multipart JPEG
D mpl2 MPL2 subtitles
D mpsub MPlayer subtitles
D msnwctcp MSN TCP Webcam stream
D mtv MTV
DE mulaw PCM mu-law
D mv Silicon Graphics Movie
D mvi Motion Pixels MVI
DE mxf MXF (Material eXchange Format)
E mxf_d10 MXF (Material eXchange Format) D-10 Mapping
D mxg MxPEG clip
D nc NC camera feed
D nistsphere NIST SPeech HEader REsources
D nsv Nullsoft Streaming Video
E null raw null video
DE nut NUT
D nuv NuppelVideo
DE ogg Ogg
DE oma Sony OpenMG audio
DE oss OSS (Open Sound System) playback
D paf Amazing Studio Packed Animation File
D pjs PJS (Phoenix Japanimation Society) subtitles
D pmp Playstation Portable PMP
E psp PSP MP4 (MPEG-4 Part 14)
D psxstr Sony Playstation STR
D pva TechnoTrend PVA
D pvf PVF (Portable Voice Format)
D qcp QCP
D r3d REDCODE R3D
DE rawvideo raw video
E rcv VC-1 test bitstream
D realtext RealText subtitle format
D rl2 RL2
DE rm RealMedia
DE roq raw id RoQ
D rpl RPL / ARMovie
DE rso Lego Mindstorms RSO
DE rtp RTP output
DE rtsp RTSP output
DE s16be PCM signed 16-bit big-endian
DE s16le PCM signed 16-bit little-endian
DE s24be PCM signed 24-bit big-endian
DE s24le PCM signed 24-bit little-endian
DE s32be PCM signed 32-bit big-endian
DE s32le PCM signed 32-bit little-endian
DE s8 PCM signed 8-bit
D sami SAMI subtitle format
DE sap SAP output
D sbg SBaGen binaural beats script
E sdl SDL output device
D sdp SDP
E segment segment
D shn raw Shorten
D siff Beam Software SIFF
DE smjpeg Loki SDL MJPEG
D smk Smacker
E smoothstreaming Smooth Streaming Muxer
D smush LucasArts Smush
D sol Sierra SOL
DE sox SoX native
DE spdif IEC 61937 (used on S/PDIF - IEC958)
DE srt SubRip subtitle
E stream_segment,ssegment streaming segment muxer
D subviewer SubViewer subtitle format
D subviewer1 SubViewer v1 subtitle format
E svcd MPEG-2 PS (SVCD)
DE swf SWF (ShockWave Flash)
D tak raw TAK
D tedcaptions TED Talks captions
D thp THP
D tiertexseq Tiertex Limited SEQ
D tmv 8088flex TMV
DE truehd raw TrueHD
D tta TTA (True Audio)
D tty Tele-typewriter
D txd Renderware TeXture Dictionary
DE u16be PCM unsigned 16-bit big-endian
DE u16le PCM unsigned 16-bit little-endian
DE u24be PCM unsigned 24-bit big-endian
DE u24le PCM unsigned 24-bit little-endian
DE u32be PCM unsigned 32-bit big-endian
DE u32le PCM unsigned 32-bit little-endian
DE u8 PCM unsigned 8-bit
D vc1 raw VC-1
D vc1test VC-1 test bitstream
E vcd MPEG-1 Systems / MPEG program stream (VCD)
D video4linux2,v4l2 Video4Linux2 device grab
D vivo Vivo
D vmd Sierra VMD
E vob MPEG-2 PS (VOB)
D vobsub VobSub subtitle format
DE voc Creative Voice
D vplayer VPlayer subtitles
D vqf Nippon Telegraph and Telephone Corporation (NTT) TwinVQ
DE w64 Sony Wave64
DE wav WAV / WAVE (Waveform Audio)
D wc3movie Wing Commander III movie
E webm WebM
D webvtt WebVTT subtitle
D wsaud Westwood Studios audio
D wsvqa Westwood Studios VQA
DE wtv Windows Television (WTV)
DE wv WavPack
D xa Maxis XA
D xbin eXtended BINary text (XBIN)
D xmv Microsoft XMV
D xwma Microsoft xWMA
D yop Psygnosis YOP
DE yuv4mpegpipe YUV4MPEG pipe


Re: Archive formats

<20211201a@crcomp.net>

https://www.novabbs.com/tech/article-flat.php?id=83942&group=sci.electronics.design#83942

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: g...@crcomp.net (Don)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 15:02:39 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <20211201a@crcomp.net>
References: <so6vaj$q31$1@dont-email.me> <so7n40$8fi$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 1 Dec 2021 15:02:39 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="1a83133a102a2528aab9c6de6ea02e9b";
logging-data="30557"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+x6fMuxxmObEMb7xX3WVwY"
Cancel-Lock: sha1:EQBN5jezXLGLpENqBhmbqtmsEP8=
 by: Don - Wed, 1 Dec 2021 15:02 UTC

Jan Panteltje wrote:
> Don Y wrote:

<snip>

>>Daunting list, eh? Any others that I have overlooked?
>
> Probably.
> I usually use
> tar -zcvf my_archive.tgz /xx/yyy/*

Some people use tar.gz as the suffix.

Self-extracting Microsoft .EXE files make it easy on the user.

Self-extracting unix shell archives are absolutely elegant:

https://alt.sources.narkive.com/k7MHsAnN/example-code-for-reading-and-writing-data-via-bscan-spartan6-with-urjtag-and-python

Although shell archives use .shar as a suffix by convention, they
actually accommodate any old suffix. Here's how you unpack a shar:

sh filename.shar

There are more archive (and even more compression) suffixes, ranked by
popularity, at the link below:

https://fileinfo.com/filetypes/compressed

Danke,

--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.

Re: Archive formats

<so8cso$c6l$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83967&group=sci.electronics.design#83967

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 10:54:22 -0700
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <so8cso$c6l$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <so7apt$kun$1@dont-email.me>
<so7fnn$ooa$1@gioia.aioe.org> <so7kdq$ltv$1@dont-email.me>
<so7m3r$1cj$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 17:54:32 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="12501"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18bj2Maj1h58XlZs49S7s4G"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:7A6yjgi7dDF0zNd71sAFithgksM=
In-Reply-To: <so7m3r$1cj$1@gioia.aioe.org>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 17:54 UTC

On 12/1/2021 4:25 AM, Martin Brown wrote:
> On 01/12/2021 10:56, Don Y wrote:
>> On 12/1/2021 2:36 AM, Martin Brown wrote:
>
>>> It will give you a fair guess at whether a given file can still be
>>> compressed by a general compression algorithm. You have to work incredibly
>>> hard to get the last 2% reduction in size.
>>
>> I'm not concerned about the compressibility of the file or how
>> effective a particular tool is at achieving that compression.
>>
>> Rather, the fact that compressors are commonly applied to
>> files (and "archives" are files) and, as a result, alter
>> their representation as a side-effect of their goal.
>
> There are quite a few backup programs that use their own proprietary encoding
> and compression, sometimes allowing a tradeoff of speed vs redundancy vs
> compression. The one I use historically names its files with extensions .000,
> .001 and makes them just under 2^32 bytes each.
>
> Backups are not much use if they easily become write-only, read-never.

Agreed. There are also many file extensions that are proprietary
reassignments of standard file formats (e.g., .whatever being
ZIP under a different -- less obvious -- name)

> Cue April 1st adverts for infinite capacity write only memory...

One of my oldest "saved adverts" was for a Signetics WoM.

>> The only other "regularly applied" tools that alter file
>> contents typically involve encryption (of varying degrees).
>>
>> [I can think of no other reason to alter a file's content]
>
> To make it more compressible is one such; lossy compression always wins out
> over lossless unless it is a very peculiar edge case.

Yes, but then you're not encoding the original file -- just an approximation
of it!

If I strip the EXIF tags from a photo, have I changed the file?
(I've certainly made it smaller!).

One can "translate" Inuit to English and get an *approximation* of
what was said. Converting back to Inuit will likely not give you
the same "statement", though.

And, folks don't casually convert text files into ".inu" format
for any particular reason! :>

Re: Archive formats

<so8deo$gtt$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83968&group=sci.electronics.design#83968

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 11:03:57 -0700
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <so8deo$gtt$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me> <so7n40$8fi$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 18:04:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="17341"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/uVf/ZD+QTrDsAeU6YFvBX"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:JH2KgPFWjPhPC8FEbJgRvPH8sGM=
In-Reply-To: <so7n40$8fi$1@dont-email.me>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 18:03 UTC

On 12/1/2021 4:37 AM, Jan Panteltje wrote:
> On a sunny day (Tue, 30 Nov 2021 21:56:46 -0700) it happened Don Y
> <blockedofcourse@foo.invalid> wrote in <so6vaj$q31$1@dont-email.me>:
>
>> I'm looking for "established" archive formats and/or compression
>> formats (the thinking being that an archive can always be subsequently
>> compressed).
>>
>> Daunting list, eh? Any others that I have overlooked?
>
> Probably.
> I usually use
> tar -zcvf my_archive.tgz /xx/yyy/*

'tar cvpf' in my case.

> Ultimately YOU decide how you compress / store, encrypt, whatever.

Or, *someone else* has already made that decision. If the format
is "well known" *and* not protected with a key, you can recover the
original file(s) at a later date.

And, potentially recompress using a different algorithm. You
now have three versions of the same file: the original
compressed form, the recovered file and the newly compressed
form. All are, effectively, the same file.
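
A minimal sketch of that "same file" idea, assuming the payload is
identified by a hash of its *decompressed* bytes. The sample data and
the helper name content_id are made up for illustration; gzip and bz2
from the Python standard library stand in for whatever compressors
were actually used.

```python
import bz2
import gzip
import hashlib

def content_id(payload: bytes) -> str:
    """Identify a file by a hash of its uncompressed content (hypothetical helper)."""
    return hashlib.sha256(payload).hexdigest()

original = b"schematic rev C\n" * 500      # illustrative payload, not real data

as_gzip = gzip.compress(original)          # version 1: original compressed form
recovered = gzip.decompress(as_gzip)       # version 2: recovered file
as_bzip2 = bz2.compress(recovered)         # version 3: newly compressed form

# The stored bytes differ, but the content they carry has one identity.
print(as_gzip == as_bzip2)
print(content_id(recovered) == content_id(bz2.decompress(as_bzip2)))
```

Comparing the compressed blobs directly says nothing useful; only the
recovered content can be compared across formats.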

> ISO format for blueray
> contains 3 movies in .ts (transport stream format):

Hmmm... I'd not considered media formats.

Different CODECs produce different outputs. I'm not sure
you can take "source" and process it through two different
CODECs and still recover the (exact!) *same* source from
each of them -- let alone try to convert from one CODEC
to another.

OTOH, *containers* can arguably act as different "envelopes"
on the same encoded streams. So, converting from one
container to another is a completely reversible process
(barring the presence of additional nonportable metadata)

> mke2fs bluray.iso
> mount -o loop=/dev/loop0 bluray.iso /mnt/loop

I completely missed the "image formats": iso, dd, vmdk, etc.

Your mount(8) example is a perfect example of the point I am making:
once mounted, you effectively have "recovered" the original files
contained in that image. You now have an accessible *copy* of the
files that are contained in that image!

Re: Archive formats

<so8dpm$jhi$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83971&group=sci.electronics.design#83971

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 11:09:46 -0700
Organization: A noiseless patient Spider
Lines: 46
Message-ID: <so8dpm$jhi$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me> <so7n40$8fi$1@dont-email.me>
<20211201a@crcomp.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 18:09:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="20018"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hh9ImJmCrIkfv1wNDE3jT"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:1dPLTqLdliJ4pTTvuYh1hXje29E=
In-Reply-To: <20211201a@crcomp.net>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 18:09 UTC

On 12/1/2021 8:02 AM, Don wrote:
> Jan Panteltje wrote:
>> Don Y wrote:
>
> <snip>
>
>>> Daunting list, eh? Any others that I have overlooked?
>>
>> Probably.
>> I usually use
>> tar -zcvf my_archive.tgz /xx/yyy/*
>
> Some people use tar.gz as the suffix.

And, some people pipe tar to gzip instead of using the -z switch.

> Self-extracting Microsoft .EXE files make it easy on the user.

Hmmmm... another form I'd not considered (though there are several
tools that will build SE executables while encoding the original
content in other forms *within* the executable).

> Self-extracting unix shell archives are absolutely elegant:
>
> https://alt.sources.narkive.com/k7MHsAnN/example-code-for-reading-and-writing-data-via-bscan-spartan6-with-urjtag-and-python
>
> Although shell archives use .shar as a suffix by convention, they
> actually accommodate any old suffix. Here's how you unpack a shar:
>
> sh filename.shar

Yes, the whole notion of file extensions is just a needless complication.

tar -czpf my_archive.tgz /xx/yyy/*
mv my_archive.tgz my_archive.mytriviallydisguisedfiletype

Moral of story: you can't rely on file name/extension to tell you *anything*.
(file(1) is your friend)
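
A rough sketch of what file(1) does for these formats, assuming a
small hand-written table of leading "magic" bytes. The table is
deliberately incomplete, and the names MAGIC, sniff and make_tgz are
illustrative rather than part of any real tool.

```python
import gzip
import io
import tarfile

# Leading "magic" bytes for a few common formats (tiny, incomplete table).
MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
]

def sniff(data: bytes) -> str:
    """Guess a format from the content itself, ignoring any file name."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    # tar has no leading magic; ustar-style headers carry "ustar" at offset 257.
    if data[257:262] == b"ustar":
        return "tar"
    return "unknown"

def make_tgz() -> bytes:
    """Build a small gzip-compressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"hello\n"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

# Pretend this blob was saved as my_archive.mytriviallydisguisedfiletype.
blob = make_tgz()
outer = sniff(blob)                    # the outer layer
inner = sniff(gzip.decompress(blob))   # what sits inside that layer
print(outer, inner)
```

The extension never enters into it: the outer layer is recognised
from its first two bytes and the inner one from the tar header.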

> There's more archive (and even more compression) suffixes, ranked by
> popularity, at the link below:
>
> https://fileinfo.com/filetypes/compressed

Thanks, I've been finding multiple such "lists". Way more types
than I'd initially imagined! <frown>

Re: Archive formats

<trsk7i-dicg1.ln1@coop.radagast.org>

https://www.novabbs.com/tech/article-flat.php?id=83976&group=sci.electronics.design#83976

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!npeer.as286.net!npeer-ng0.as286.net!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx03.iad.POSTED!not-for-mail
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
References: <so6vaj$q31$1@dont-email.me> <so764l$r9v$1@gonzo.revmaps.no-ip.org>
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
From: dpl...@coop.radagast.org (Dave Platt)
Originator: dplatt@coop.radagast.org (Dave Platt)
Message-ID: <trsk7i-dicg1.ln1@coop.radagast.org>
Lines: 61
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Wed, 01 Dec 2021 18:52:19 UTC
Date: Wed, 1 Dec 2021 10:52:13 -0800
X-Received-Bytes: 3427
 by: Dave Platt - Wed, 1 Dec 2021 18:52 UTC

In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>> I'm looking for "established" archive formats and/or compression
>> formats (the thinking being that an archive can always be subsequently
>> compressed).
>
>If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

The proof is based on the pigeonhole principle. If you have an
existing archive whose length is N bits, there are 2^N possible
combinations of bits in that archive. If you claim that you can
always compress such an archive further (and make it smaller), then
you're claiming that the (re-compressed) archive will always be no
larger than N-1 bits in length.

The maximum number of bit combinations available in the compressed
representation is a sum: 2^0 + 2^1 + 2^2 + ... + 2^(N-2) + 2^(N-1),
where the 2^0 term counts the empty (zero-length) output. This sum
is equal to 2^N - 1.

This means that the total number of compressed representations is,
at best, one less than the number of uncompressed representations.
As the Pigeonhole Principle phrases it, you have one less pigeonhole
in your office desk, than you have slips of paper that you need to
put into the pigeonholes.

So, you're left with two possibilities:

(1) The compression algorithm can always map each of the 2^N inputs
onto a specific pigeonhole. But, since you've got one fewer
pigeonholes than inputs, two of the inputs must map to the same
pigeonhole. Two different input archives, compress down to the
same output archive.

When it comes time to decompress, the decompression algorithm
can only produce one output (which presumably will be one of
those two inputs). It can't successfully reconstruct the second
input. If you compress the "unlucky" input, and then decompress,
you get the wrong result. This contradicts (and disproves)
your starting assumption that you can always compress without
loss.

(2) One or more of the 2^N inputs doesn't map into any of the
2^N - 1 pigeonholes. It maps into something longer (N or
more bits long), or it causes the compression algorithm to
crash, hang, explode, or cross the streams and instantly end
all life as we know it.

This contradicts your starting assumption that you can always
compress any input further.

JPEG, MPEG, and similar systems are usually called "compression"
algorithms, but it's clearer to think of them as "lossy encoding".
They get around the pigeonhole principle by being willing to
lose information - the decoded signal is not guaranteed to be
identical to the input signal.
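
The counting argument can be checked by brute force for small N. This
is only an illustrative sketch: the sum counts every bit string shorter
than N bits, the empty string included, and toy_compress is a
deliberately naive made-up scheme that "always shrinks" by dropping
the last bit.

```python
from itertools import product

def count_shorter(n: int) -> int:
    """Number of distinct bit strings of length 0 .. n-1 (empty string included)."""
    return sum(2 ** k for k in range(n))

def toy_compress(bits: str) -> str:
    """Hypothetical 'always smaller' compressor: just drop the last bit."""
    return bits[:-1]

n = 4
inputs = ["".join(p) for p in product("01", repeat=n)]   # all 2^n inputs
outputs = {toy_compress(b) for b in inputs}               # distinct outputs

# More inputs than available shorter strings, so some outputs must collide.
print(len(inputs), count_shorter(n), len(outputs))
```

Any other rule for toy_compress runs into the same wall: there are
fewer strings shorter than N bits than there are N-bit inputs, so at
least two inputs share an output and cannot both be recovered.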

Re: Archive formats

<so8il2$j51$1@gioia.aioe.org>

https://www.novabbs.com/tech/article-flat.php?id=83978&group=sci.electronics.design#83978

Path: i2pn2.org!i2pn.org!aioe.org!9Z4VvSzUop8HXV7/P1aDNw.user.46.165.242.75.POSTED!not-for-mail
From: pNaOnStP...@yahoo.com (Jan Panteltje)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 01 Dec 2021 19:27:59 GMT
Organization: Aioe.org NNTP Server
Message-ID: <so8il2$j51$1@gioia.aioe.org>
References: <so6vaj$q31$1@dont-email.me> <so7n40$8fi$1@dont-email.me> <so8deo$gtt$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="19617"; posting-host="9Z4VvSzUop8HXV7/P1aDNw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: NewsFleX-1.5.7.5 (Linux-2.6.37.6)
X-Newsreader-location: NewsFleX-1.5.7.5 (c) 'LIGHTSPEED' off line news reader for the Linux platform
NewsFleX homepage: http://www.panteltje.com/panteltje/newsflex/ and ftp download ftp://sunsite.unc.edu/pub/linux/system/news/readers/
X-Notice: Filtered by postfilter v. 0.9.2
 by: Jan Panteltje - Wed, 1 Dec 2021 19:27 UTC

On a sunny day (Wed, 1 Dec 2021 11:03:57 -0700) it happened Don Y
<blockedofcourse@foo.invalid> wrote in <so8deo$gtt$1@dont-email.me>:

>On 12/1/2021 4:37 AM, Jan Panteltje wrote:
>> On a sunny day (Tue, 30 Nov 2021 21:56:46 -0700) it happened Don Y
>> <blockedofcourse@foo.invalid> wrote in <so6vaj$q31$1@dont-email.me>:
>>
>>> I'm looking for "established" archive formats and/or compression
>>> formats (the thinking being that an archive can always be subsequently
>>> compressed).
>>>
>>> Daunting list, eh? Any others that I have overlooked?
>>
>> Probably.
>> I usually use
>> tar -zcvf my_archive.tgz /xx/yyy/*
>
>'tar cvpf' in my case.
>
>> Ultimately YOU decide how you compress / store, encrypt, whatever.
>
>Or, *someone else* has already made that decision. If the format
>is "well known" *and* not protected with a key, you can recover the
>original file(s) at a later date.
>
>And, potentially recompress using a different algorithm. You
>now have three versions of the same file: the original
>compressed form, the recovered file and the newly compressed
>form. All are, effectively, the same file.
>
>> ISO format for blueray
>> contains 3 movies in .ts (transport stream format):
>
>Hmmm... I'd not considered media formats.
>
>Different CODECs produce different outputs. I'm not sure
>you can take "source" and process it through two different
>CODECs and still recover the (exact!) *same* source from
>each of them -- let alone try to convert from one CODEC
>to another.

Oh, that is old crypto fun: take pictures of the source binary in hexadecimal,
combined frame by frame into a movie.
Re-encode with a different video codec.
Keep enough bandwidth to keep it readable for a computer.
Likewise for audio: encode / decode with text-to-speech and speech-to-text,
maybe translating the language on the way (one two three -> un deux trois).
You can scramble the pictures too, so the movie shows whatever you like.
There was a discussion some time back in sci.crypt about using fractals;
Google shows several papers on fractal encryption.
Converting from one codec to another is something I do all the time with ffmpeg:
ffmpeg -i q1.avi -i q1.mp2 -f avi -vcodec copy -acodec ac3 -y $1-hd.avi
... ffmpeg -f yuv4mpegpipe -i - -f avi -vcodec libx264 -b 10M -y q1.avi
You may lose detail depending on allocated bandwidth.

>OTOH, *containers* can arguably act as different "envelopes"
>on the same encoded streams. So, converting from one
>container to another is a completely reversible process
>(barring the presence of additional nonportable metadata)
>
>> mke2fs bluray.iso
>> mount -o loop=/dev/loop0 bluray.iso /mnt/loop
>
>I completely missed the "image formats": iso, dd, vmdk, etc.
>
>Your mount(8) example is a perfect example of the point I am making:
>once mounted, you effectively have "recovered" the original files
>contained in that image. You now have an accessible *copy* of the
>files that are contained in that image!

Yes

Re: Archive formats

<so8t57$6cf$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=83984&group=sci.electronics.design#83984

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 15:31:56 -0700
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <so8t57$6cf$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 1 Dec 2021 22:32:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fe429f17aebec9e405d52c8653a02edf";
logging-data="6543"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/37nOg7HJFQu5zhpmd4Wov"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:cBasDyMQJ8iD1084Ucr9sF846rI=
In-Reply-To: <trsk7i-dicg1.ln1@coop.radagast.org>
Content-Language: en-US
 by: Don Y - Wed, 1 Dec 2021 22:31 UTC

On 12/1/2021 11:52 AM, Dave Platt wrote:
> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
> Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>> I'm looking for "established" archive formats and/or compression
>>> formats (the thinking being that an archive can always be subsequently
>>> compressed).
>>
>> If by compressed you mean made smaller, that's obviously false.
>
> If we interpret "compressed" to mean "compressed without information
> loss", Jasen is correct. This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

I, for example, have JUST designed a compressor that compresses
all occurrences of the string "No, you are assuming there is no
other (implicit) source of information that the compressor can
rely upon." into the hex constant 0xFE.

As such, the first paragraph in my reply, here, can be compressed
to a single byte! The remaining characters in this message are
not affected by my compressor. So, the message ends up SMALLER
as a result of the elided characters in that first paragraph.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text. (If it did, I'd have to encode *it* in some
other manner)
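
A minimal sketch of such a one-phrase compressor, assuming the input
is 7-bit ASCII and that the phrase appears on a single line (in the
message itself it is wrapped across lines). The names PHRASE, TOKEN,
compress and decompress are illustrative only.

```python
# The one "magic string" this compressor knows about (joined onto one line).
PHRASE = (b"No, you are assuming there is no other (implicit) source of "
          b"information that the compressor can rely upon.")
TOKEN = b"\xfe"   # cannot occur in 7-bit ASCII text

def compress(text: bytes) -> bytes:
    """Replace every occurrence of PHRASE with the single byte 0xFE."""
    if TOKEN in text:
        # Assumption violated: a literal 0xFE would have to be encoded some other way.
        raise ValueError("input is not plain ASCII; 0xFE would need escaping")
    return text.replace(PHRASE, TOKEN)

def decompress(data: bytes) -> bytes:
    """Expand every 0xFE back into PHRASE."""
    return data.replace(TOKEN, PHRASE)

message = PHRASE + b"\n\nI, for example, have JUST designed a compressor.\n"
packed = compress(message)
saved = len(message) - len(packed)
print(len(message), len(packed), saved)
```

Everything outside the phrase passes through untouched, so the saving
is exactly the length of the phrase minus the one token byte.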

[Inapplicable "proof" elided]

Re: Archive formats

<mqll7i-jmgg1.ln1@coop.radagast.org>

https://www.novabbs.com/tech/article-flat.php?id=84001&group=sci.electronics.design#84001

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx98.iad.POSTED!not-for-mail
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
References: <so6vaj$q31$1@dont-email.me> <so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org> <so8t57$6cf$1@dont-email.me>
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
From: dpl...@coop.radagast.org (Dave Platt)
Originator: dplatt@coop.radagast.org (Dave Platt)
Message-ID: <mqll7i-jmgg1.ln1@coop.radagast.org>
Lines: 69
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Thu, 02 Dec 2021 01:58:19 UTC
Date: Wed, 1 Dec 2021 17:58:14 -0800
X-Received-Bytes: 4037
 by: Dave Platt - Thu, 2 Dec 2021 01:58 UTC

In article <so8t57$6cf$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:
>On 12/1/2021 11:52 AM, Dave Platt wrote:
>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>> Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>> I'm looking for "established" archive formats and/or compression
>>>> formats (the thinking being that an archive can always be subsequently
>>>> compressed).
>>>
>>> If by compressed you mean made smaller, that's obviously false.
>>
>> If we interpret "compressed" to mean "compressed without information
>> loss", Jasen is correct. This can't be done.
>
>No, you are assuming there is no other (implicit) source of
>information that the compressor can rely upon.
>
>I, for example, have JUST designed a compressor that compresses
>all occurrences of the string "No, you are assuming there is no
>other (implicit) source of information that the compressor can
>rely upon." into the hex constant 0xFE.
>
>As such, the first paragraph in my reply, here, can be compressed
>to a single byte! The remaining characters in this message are
>not affected by my compressor. So, the message ends up SMALLER
>as a result of the elided characters in that first paragraph.

Sure - you can always design a compressor which works very well indeed
for certain classes of input. If you "cherry-pick" the allowable
inputs, you can get extremely high coding gain.

What you can't do is design any _single_ compressor which is
guaranteed to always compress any _arbitrary_ input (arbitrary sets of
bits) to fewer bits in total (and here you have to include any magic
"implicit" bits your algorithm may be depending on, such as "which
special compressor was used?" in the output-file header).

>My compressor obviously relies on the fact that 0xFE does not
>occur in ascii text. (If it did, I'd have to encode *it* in some
>other manner)

Yup. The common way in telecom protocols is to "escape" such
special codes, so you'd send escape-0xFE to represent a single
0xFE in the file. Of course, that means that you've just increased
the size of the file rather than decreased it.
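
A small sketch of that escaping idea, assuming 0xFD is chosen as the
escape byte (the choice of 0xFD, and all names here, are illustrative
assumptions). A literal 0xFE or 0xFD in the input is sent as a
two-byte pair, so binary input that contains them grows.

```python
PHRASE = (b"No, you are assuming there is no other (implicit) source of "
          b"information that the compressor can rely upon.")
TOKEN = 0xFE   # stands for PHRASE in the compressed stream
ESC = 0xFD     # assumed escape byte: the next byte is a literal

def escape(chunk: bytes) -> bytes:
    """Prefix any literal TOKEN or ESC byte with ESC."""
    out = bytearray()
    for b in chunk:
        if b in (ESC, TOKEN):
            out.append(ESC)
        out.append(b)
    return bytes(out)

def compress(data: bytes) -> bytes:
    """Escape the literal pieces, then join them with TOKEN where PHRASE stood."""
    return bytes([TOKEN]).join(escape(part) for part in data.split(PHRASE))

def decompress(data: bytes) -> bytes:
    """Scan left to right; ESC means 'take the next byte as-is'."""
    out = bytearray()
    stream = iter(data)
    for b in stream:
        if b == ESC:
            out.append(next(stream))
        elif b == TOKEN:
            out.extend(PHRASE)
        else:
            out.append(b)
    return bytes(out)

binary = bytes([0xFE, 0xFD, 0x41])      # contains both special bytes
grown = compress(binary)
print(len(binary), len(grown))
```

The scheme now accepts any input, at the price that some inputs come
out longer than they went in.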

You're doing a double cherry-pick here, by pre-defining the
magic input string you'll compress so well, and by declaring the
existence of a compressed-representation token for it which is not
allowed to appear in the input. That combination gives you extremely
high coding gain... for this one magic input string.

It gives you bupkis for any input which doesn't contain that magic
string, though. You get _zero_ compression there.

Fixed-dictionary-based compression schemes (which is essentially what
you are proposing here) can give extremely high coding gain (compression)
as long as most of the input is "words" in the fixed "vocabulary", and
as long as those "words" are significantly longer than the tokens you use
to replace or number them. That amounts to saying "the input must have
a relatively low entropy"... the input isn't just random (or random-like)
collections of bits.

Shannon's source coding theorem is applicable here... it sets a pretty
hard limit on how far you can compress any given input (given the
statistics of the input data) before information loss becomes virtually
certain.
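
As a rough illustration of "the statistics of the input data", the
sketch below estimates the zero-order (per-byte) entropy of two
inputs. This is an assumption-laden simplification: it treats bytes as
independent, so it ignores longer-range structure that a real
compressor can also exploit. The sample inputs are made up.

```python
import math
import os
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical zero-order entropy: -sum(p * log2(p)) over byte frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog " * 200   # low entropy
noise = os.urandom(len(text))                                   # random-like

for name, data in (("text", text), ("noise", noise)):
    print(name,
          round(entropy_bits_per_byte(data), 2), "bits/byte,",
          len(data), "->", len(zlib.compress(data)), "bytes with zlib")
```

The random-like input sits close to 8 bits per byte and does not
shrink; the repetitive text has far lower entropy and compresses well.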

Re: Archive formats

<so9jj7$ek6$1@reader1.panix.com>

https://www.novabbs.com/tech/article-flat.php?id=84014&group=sci.electronics.design#84014

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!panix!.POSTED.panix2.panix.com!not-for-mail
From: prese...@MUNGEpanix.com (Cydrome Leader)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 04:55:03 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <so9jj7$ek6$1@reader1.panix.com>
References: <so6vaj$q31$1@dont-email.me> <so72n4$9m0$1@dont-email.me>
Injection-Date: Thu, 2 Dec 2021 04:55:03 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="panix2.panix.com:166.84.1.2";
logging-data="14982"; mail-complaints-to="abuse@panix.com"
User-Agent: tin/2.6.0-20210823 ("Coleburn") (NetBSD/9.2 (amd64))
 by: Cydrome Leader - Thu, 2 Dec 2021 04:55 UTC

Don Y <blockedofcourse@foo.invalid> wrote:
> On 11/30/2021 9:56 PM, Don Y wrote:
>> I'm looking for "established" archive formats and/or compression
>> formats (the thinking being that an archive can always be subsequently
>> compressed).
>
>> Daunting list, eh? Any others that I have overlooked?
>
> Ugh! Skip that. I've apparently missed *dozens* (scores?)... :<

You also duplicated quite a few formats in that original list.

Re: Archive formats

<so9lee$9v0$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=84017&group=sci.electronics.design#84017

Path: i2pn2.org!rocksolid2!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Wed, 1 Dec 2021 22:26:28 -0700
Organization: A noiseless patient Spider
Lines: 128
Message-ID: <so9lee$9v0$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <mqll7i-jmgg1.ln1@coop.radagast.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 2 Dec 2021 05:26:39 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="97338051e9042b5541a3e6f2ec31e7f9";
logging-data="10208"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cnumMsM9WYFsQ6t45Pb9s"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:ms0gG3ngj8omYw8QGNucaQxL2JQ=
In-Reply-To: <mqll7i-jmgg1.ln1@coop.radagast.org>
Content-Language: en-US
 by: Don Y - Thu, 2 Dec 2021 05:26 UTC

On 12/1/2021 6:58 PM, Dave Platt wrote:
> In article <so8t57$6cf$1@dont-email.me>,
> Don Y <blockedofcourse@foo.invalid> wrote:
>> On 12/1/2021 11:52 AM, Dave Platt wrote:
>>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>>> Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>>> I'm looking for "established" archive formats and/or compression
>>>>> formats (the thinking being that an archive can always be subsequently
>>>>> compressed).
>>>>
>>>> If by compressed you mean made smaller, that's obviously false.
>>>
>>> If we interpret "compressed" to mean "compressed without information
>>> loss", Jasen is correct. This can't be done.
>>
>> No, you are assuming there is no other (implicit) source of
>> information that the compressor can rely upon.
>>
>> I, for example, have JUST designed a compressor that compresses
>> all occurrences of the string "No, you are assuming there is no
>> other (implicit) source of information that the compressor can
>> rely upon." into the hex constant 0xFE.
>>
>> As such, the first paragraph in my reply, here, can be compressed
>> to a single byte! The remaining characters in this message are
>> not affected by my compressor. So, the message ends up SMALLER
>> as a result of the elided characters in that first paragraph.
>
> Sure - you can always design a compressor which works very well indeed
> for certain classes of input. If you "cherry-pick" the allowable
> inputs, you can get extremely high coding gain.

But you don't have to "cherry pick"! "Data" typically already has
"known characteristics" that compressors can exploit.

JPEG exploits the idea that the human eye won't "notice" certain
loss of detail in photos. MP3 makes similar assumptions wrt
audio. ASCII text is already 14% larger than required (as every
byte has a high-order bit that is KNOWN to be '0'). English
prose can make other assumptions regarding "expectations" of
what follows in a given sequence of words. Speech can be
encoded in a few hundred *bits* per second. etc.
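
The ASCII figure is simple arithmetic: 8 stored bits for 7 bits of
information is 8/7, or about 14% more than needed. A minimal sketch of
dropping the known-zero high bit, assuming pure 7-bit input (pack7 and
unpack7 are made-up helper names):

```python
def pack7(text: bytes) -> bytes:
    """Pack 7-bit ASCII into a dense bit stream, dropping each high-order 0 bit."""
    if any(b > 127 for b in text):
        raise ValueError("not 7-bit ASCII")
    if not text:
        return b""
    bits = "".join(format(b, "07b") for b in text)
    bits += "0" * (-len(bits) % 8)              # pad to a whole number of bytes
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

def unpack7(packed: bytes, count: int) -> bytes:
    """Recover `count` characters; the count must be stored alongside the data."""
    bits = "".join(format(b, "08b") for b in packed)
    return bytes(int(bits[i * 7:(i + 1) * 7], 2) for i in range(count))

sample = b"ABCDEFGH"                            # 8 characters
packed = pack7(sample)
print(len(sample), len(packed), len(sample) / len(packed))
```

Eight characters fit in seven bytes, and the original is recovered
exactly, because the discarded bit carried no information.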

> What you can't do, is design any _single_ compressor which is
> guaranteed to always compress any _arbitrary_ input (arbitrary sets of
> bits), to less bits in total (and here you have to include any magic
> "implicit" bits your algorithm may be depending on, such as "which
> special compressor was used?" in the output-file header).

And you'll notice there isn't a *single* compressor available!
And, as I said, you can't compress truly random data (and expect
it to get smaller).

But, anyone can choose to apply any compressor to any file.
There's nothing that prevents this from being done.
So, you can find a RAR archive of a set of ZIP files, each
compressing an ISO archive, etc. The ratio of "compressed"
file size to original file size can exceed 1.0. But,
despite that, one can still apply the proper sequence of
DEcompressors to retrieve the original "input".
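
A sketch of that layering, using zlib, bz2 and lzma from the Python
standard library as stand-ins for RAR, ZIP and friends (an assumption
made purely for illustration). The input is random, so each layer only
adds overhead.

```python
import bz2
import lzma
import os
import zlib

# Each layer is a (compress, decompress) pair, applied in this order.
LAYERS = [
    (zlib.compress, zlib.decompress),
    (bz2.compress, bz2.decompress),
    (lzma.compress, lzma.decompress),
]

def wrap(data: bytes) -> bytes:
    """Apply every compressor in turn, whether or not it helps."""
    for compress, _ in LAYERS:
        data = compress(data)
    return data

def unwrap(data: bytes) -> bytes:
    """Apply the matching decompressors in the reverse order."""
    for _, decompress in reversed(LAYERS):
        data = decompress(data)
    return data

original = os.urandom(4096)        # random input: nothing to exploit
wrapped = wrap(original)
ratio = len(wrapped) / len(original)
print(round(ratio, 3), unwrap(wrapped) == original)
```

The ratio comes out above 1.0, yet recovery is exact as long as the
sequence of layers is known.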

My concern over archives and the sorts of "manipulations"
that can be applied to them (the most common of which is
compression) is solely in how it affects recovery of the
"original" content.

>> My compressor obviously relies on the fact that 0xFE does not
>> occur in ascii text. (If it did, I'd have to encode *it* in some
>> other manner)
>
> Yup. The common way in telecom protocols is to "escape" such
> special codes, so you'd send escape-0xFE to represent a single
> 0xFE in the file. Of course, that means that you've just increased
> the size of the file rather than decreased it.
>
> You're doing a double cherry-pick here, by pre-defining the
> magic input string you'll compress so well, and by declaring the
> existence of a compressed-representation token for it which is not
> allowed to appear in the input. That combination gives you extremely
> high coding gain... for this one magic input string.

But there are many "magic strings" that appear in day to day encounters.
And other "special conditions" that a compressor (and the user
who applies that compressor) can exploit.

Facsimiles tend to contain lots of white space. That can be compressed
as can the runs of "black" (for B&W FAXs). Instead of many megabytes
to represent a single sheet image, you can reduce it to kilobytes.
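
A minimal run-length sketch for one scan line, assuming "W" and "B"
stand for white and black pixels. Real fax coding goes further and
Huffman-codes the run lengths; the names and the sample line here are
illustrative only.

```python
from itertools import groupby

def rle_encode(pixels: str) -> list:
    """Collapse each run of identical pixels into a (value, length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(runs: list) -> str:
    """Expand (value, length) pairs back into the pixel string."""
    return "".join(value * count for value, count in runs)

# One 1728-pixel line: mostly white paper with a short black stroke.
line = "W" * 1200 + "B" * 40 + "W" * 488
runs = rle_encode(line)
print(len(line), "pixels ->", len(runs), "runs:", runs)
```

A line with long runs collapses to a handful of pairs; a line that
alternates pixel by pixel would come out larger than it went in.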

In your vernacular, it wouldn't give you "bupkis" if you tried to
apply it to a color image. So, you (wisely) wouldn't use that
algorithm in that case.

An "unused" sector on a disk can be represented with a single bit.
(or, RLE the number of such consecutive empty sectors to exploit
the fact that deleted files occupy contiguous space on a volume).
So, you've represented 4,096 bits with *one*.

Granted, after "compression", we can't recover the contents of those
"unused" sectors. But, we typically don't want to. We will trade
that ability for this higher compression rate.

> It gives you bupkis for any input which doesn't contain that magic
> string, though. You get _zero_ compression there.

So, you design a compressor that exploits the *patterns* that are
present in that other "input".

> Fixed-dictionary-based compression schemes (which is essentially what
> you are proposing here) can give extremely high coding gain (compression)
> as long as most of the input is "words" in the fixed "vocabulary", and
> as long as those "words" are significantly longer than the tokens you use
> to replace or number them. That amounts to saying "the input must have
> a relatively low entropy"... the input isn't just random (or random-like)
> collections of bits.
>
> Shannon's source coding theorem is applicable here... it sets a pretty
> hard limit on how far you can compress any given input (given the
> statistics of the input data) before information loss becomes virtually
> certain.

Yes, but that applies to unconstrained data, where the compressor has no
*additional* knowledge of the content that it can exploit. Few people
encounter such "incompressible" (raw) data. Hence the appeal and value
of compressors (if they had little/no use, there wouldn't be so many
of them!)

Re: Archive formats

<soa3c4$1ek5$1@gioia.aioe.org>

https://www.novabbs.com/tech/article-flat.php?id=84024&group=sci.electronics.design#84024

Path: i2pn2.org!i2pn.org!aioe.org!fce17Vq++jWAoH1dvt+1NQ.user.46.165.242.75.POSTED!not-for-mail
From: '''newsp...@nonad.co.uk (Martin Brown)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 09:24:19 +0000
Organization: Aioe.org NNTP Server
Message-ID: <soa3c4$1ek5$1@gioia.aioe.org>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="47749"; posting-host="fce17Vq++jWAoH1dvt+1NQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2
 by: Martin Brown - Thu, 2 Dec 2021 09:24 UTC

On 01/12/2021 22:31, Don Y wrote:
> On 12/1/2021 11:52 AM, Dave Platt wrote:
>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>> Jasen Betts  <usenet@revmaps.no-ip.org> wrote:
>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>> I'm looking for "established" archive formats and/or compression
>>>> formats (the thinking being that an archive can always be subsequently
>>>> compressed).
>>>
>>> If by compressed you mean made smaller, that's obviously false.
>>
>> If we interpret "compressed" to mean "compressed without information
>> loss", Jasen is correct.  This can't be done.
>
> No, you are assuming there is no other (implicit) source of
> information that the compressor can rely upon.

He is stating a well known and general result.

One that sometimes catches people out. We had offline compression for
bulk data over phone lines that could break some telecom modems' realtime
compression back in the day: an internal buffer overflow, because the data
expanded quite a bit when their simplistic "compression" algorithm tried
to process it in realtime. I created a document called fullfile (if it is
still around) which epitomised the maximally incompressible file.
There were already a test sample of ASCII text and an empty file
(which essentially tests the baud rate of the modems at each end).

> I, for example, have JUST designed a compressor that compresses
> all occurrences of the string "No, you are assuming there is no
> other (implicit) source of information that the compressor can
> rely upon." into the hex constant 0xFE.
>
> As such, the first paragraph in my reply, here, can be compressed
> to a single byte!  The remaining characters in this message are
> not affected by my compressor.  So, the message ends up SMALLER
> as a result of the elided characters in that first paragraph.
>
> My compressor obviously relies on the fact that 0xFE does not
> occur in ascii text.  (If it did, I'd have to encode *it* in some
> other manner)
>
> [Unapplicable "proof" elided]

His general point is true though.

Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of additional
compression too - most algorithms don't even try.

PNG is one of the better lossless image formats and gets ~ln(190).
ZIP on larger files gets very close indeed, ~ln(255.7).
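
For readers who want to reproduce figures of this kind, a minimal sketch
follows. It assumes the "~ln(N)" numbers are Shannon entropies of the byte
histogram measured in nats, which is an interpretation rather than something
stated in the post; the function names are made up for the example.

```python
import math
from collections import Counter

def byte_entropy_nats(data):
    """Shannon entropy of the byte histogram, in nats (natural log)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log(c / n) for c in Counter(data).values())

def effective_alphabet(data):
    """exp(H): the N for which the measured entropy equals ln(N)."""
    return math.exp(byte_entropy_nats(data))

uniform = bytes(range(256))   # every byte value exactly once: the ln(256) ceiling
constant = b"a" * 1000        # a single symbol: zero entropy
```

Reading a file with `open(path, "rb").read()` and passing it to
`effective_alphabet` gives a number directly comparable to the 190 and 255.7
quoted above, under the stated assumption.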

--
Regards,
Martin Brown

Re: Archive formats

<soa51d$va1$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=84027&group=sci.electronics.design#84027

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 02:52:34 -0700
Organization: A noiseless patient Spider
Lines: 120
Message-ID: <soa51d$va1$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <soa3c4$1ek5$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 2 Dec 2021 09:52:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="97338051e9042b5541a3e6f2ec31e7f9";
logging-data="32065"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/MYZqy7j6wZb60UcdcsXlw"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:1Jw3LLZetNeiiBMYiF3Udo6HttE=
In-Reply-To: <soa3c4$1ek5$1@gioia.aioe.org>
Content-Language: en-US
 by: Don Y - Thu, 2 Dec 2021 09:52 UTC

On 12/2/2021 2:24 AM, Martin Brown wrote:
> On 01/12/2021 22:31, Don Y wrote:
>> On 12/1/2021 11:52 AM, Dave Platt wrote:
>>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>>> Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>>> I'm looking for "established" archive formats and/or compression
>>>>> formats (the thinking being that an archive can always be subsequently
>>>>> compressed).
>>>>
>>>> If by compressed you mean made smaller, that's obviously false.
>>>
>>> If we interpret "compressed" to mean "compressed without information
>>> loss", Jasen is correct. This can't be done.
>>
>> No, you are assuming there is no other (implicit) source of
>> information that the compressor can rely upon.
>
> He is stating a well known and general result.

That only applies in the general case. The fact that most compressors
achieve *some* compression means the general case is RARE in the wild;
typically encountered when someone tries to compress already compressed
content.

> One that sometimes catches people out. We had offline compression for bulk data
> over phone line that could break some telecom modems realtime compression back
> in the day. Internal buffer overflow because the data expanded quite a bit when
> their simplistic "compression" algorithm tried to process it in realtime. If it
> is still around I created a document called fullfile which epitomised the
> maximally incompressible file. There were already a test file sample of ASCII
> text and an empty file (which essentially tests the baud rate of the modems at
> each end).
>
>> I, for example, have JUST designed a compressor that compresses
>> all occurrences of the string "No, you are assuming there is no
>> other (implicit) source of information that the compressor can
>> rely upon." into the hex constant 0xFE.
>>
>> As such, the first paragraph in my reply, here, can be compressed
>> to a single byte! The remaining characters in this message are
>> not affected by my compressor. So, the message ends up SMALLER
>> as a result of the elided characters in that first paragraph.
>>
>> My compressor obviously relies on the fact that 0xFE does not
>> occur in ascii text. (If it did, I'd have to encode *it* in some
>> other manner)
>>
>> [Unapplicable "proof" elided]
>
> His general point is true though.

It isn't important to the issues I'm addressing.

*If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
ENCOUNTER THE (compressed) FILE(s). Any increase or decrease in file
size will already have been "baked in". There is no value to my being
able to "lecture" the content creator that his compression actually
INCREASED the size of his content. (caps are for emphasis, not shouting).

[Compression also affords other features that are absent in its absence.
In particular, most compressors include checksums -- either implied
or explicit -- that further act to vouch for the integrity of the
content. Can you tell me if "foo.txt" is corrupted? What about
"foo.zip"?]

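The "foo.zip" half of that question can be answered mechanically, because
each ZIP member carries a CRC-32. A small sketch using Python's standard
`zipfile` module follows; the archive is built in memory and the corruption
is simulated, so the file name and payload are only illustrative.

```python
import io
import zipfile

def first_bad_member(raw):
    """Return the name of the first member failing its CRC-32 check, else None."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.testzip()

payload = b"hello archive " * 10
buf = io.BytesIO()
# ZIP_STORED keeps the payload verbatim so one byte can be located and flipped.
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("foo.txt", payload)
good = buf.getvalue()

# Simulate corruption: invert the first byte of the stored payload.
pos = good.index(payload)
bad = good[:pos] + bytes([good[pos] ^ 0xFF]) + good[pos + 1:]

ok_result = first_bad_member(good)    # None: every CRC matches
bad_result = first_bad_member(bad)    # name of the damaged member
```

A bare "foo.txt" offers no equivalent check unless a checksum was recorded
somewhere else.
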
*My* concern is being able to recover the original file(s). REGARDLESS
OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.

A user can use an off-the-shelf archiver to "bundle" multiple files
into a single "archive file". So, I need to be able to "unbundle"
them, regardless of the archiver he happened to choose -- hence my
interest in "archive formats".

A user can often opt to "compress" that resulting archive (or, the archive
program may offer that as an option applied while the archive is built).
(Or, an individual file without "bundling")

So, in order to unbundle the archive (or recover the singleton), I need
to be able to UNcompress it. Hence my interest in compressors.

A user *could* opt to encrypt the contents. If so, I won't even attempt
to access the original files. I have no desire to expend resource
"guessing" secrets!

He can also opt to apply some other (wacky, home-baked) encoding or compression
scheme. (E.g., when sending executables through mail, I routinely change the
file extension to "xex" and prepend some gibberish at the front of the file
to obscure its signature -- because some mail scanners will attempt to
decompress compressed files to "protect" the recipients; otherwise, wrapping
it in a ZIP would suffice.) If so, I won't even attempt to access the
original file(s).

One can argue that a user might do some other "silly" transform (ROT13?)
so I could cover those bases with (equally silly) inversions. I want to
identify the sorts of *likely* "processes" to which some (other!) user
could have subjected a file's (or group of files') content and be able
to reverse them.

[I recently encountered some dictionaries that were poorly disguised ZIP
archives]

If the user *chose* to encode his content in BNPF, then I want to be able
to *decode* that content. (as long as I don't have to "guess secrets"
or try to reverse engineer some wacky coding/packing scheme)

It's a relatively simple problem to solve -- once you've identified the
range of *common* archivers/encoders/compressors that might be used!
(e.g., SIT is/was common on Macs)

> Unless there is some other redundant structure in the file you cannot compress
> a file where the bytewise entropy is ln(256) or nearly so.
>
> You also have to work much harder to get that very last 1% of additional
> compression too - most algorithms don't even try.
>
> PNG is one of the better lossless image ones and gets ~ln(190)
> ZIP on a larger files gets very close indeed ~ln(255.7)

Re: Archive formats

<soa878$1l7m$1@gioia.aioe.org>

https://www.novabbs.com/tech/article-flat.php?id=84030&group=sci.electronics.design#84030

Path: i2pn2.org!i2pn.org!aioe.org!fce17Vq++jWAoH1dvt+1NQ.user.46.165.242.75.POSTED!not-for-mail
From: '''newsp...@nonad.co.uk (Martin Brown)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 10:47:04 +0000
Organization: Aioe.org NNTP Server
Message-ID: <soa878$1l7m$1@gioia.aioe.org>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <soa3c4$1ek5$1@gioia.aioe.org>
<soa51d$va1$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="54518"; posting-host="fce17Vq++jWAoH1dvt+1NQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.3.2
Content-Language: en-GB
X-Notice: Filtered by postfilter v. 0.9.2
 by: Martin Brown - Thu, 2 Dec 2021 10:47 UTC

On 02/12/2021 09:52, Don Y wrote:
> On 12/2/2021 2:24 AM, Martin Brown wrote:
>> On 01/12/2021 22:31, Don Y wrote:
>>> On 12/1/2021 11:52 AM, Dave Platt wrote:
>>>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>>>> Jasen Betts  <usenet@revmaps.no-ip.org> wrote:
>>>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>>>> I'm looking for "established" archive formats and/or compression
>>>>>> formats (the thinking being that an archive can always be
>>>>>> subsequently
>>>>>> compressed).
>>>>>
>>>>> If by compressed you mean made smaller, that's obviously false.
>>>>
>>>> If we interpret "compressed" to mean "compressed without information
>>>> loss", Jasen is correct.  This can't be done.
>>>
>>> No, you are assuming there is no other (implicit) source of
>>> information that the compressor can rely upon.
>>
>> He is stating a well known and general result.
>
> That only applies in the general case.  The fact that most compressors
> achieve *some* compression means the general case is RARE in the wild;
> typically encountered when someone tries to compress already compressed
> content.

That used to be true, but most content these days, apart from HTML and
flat text files, is already compressed with something like a crude ZIP.
Files from MS Office 2007 onwards are thinly disguised ZIP files.

>>> My compressor obviously relies on the fact that 0xFE does not
>>> occur in ascii text.  (If it did, I'd have to encode *it* in some
>>> other manner)
>>>
>>> [Unapplicable "proof" elided]
>>
>> His general point is true though.
>
> It isn't important to the issues I'm addressing.

It isn't clear to me what issue you are trying to address. Recognising
the header info for each compression method and then using it will get
you a long way but ISTR it has already been done. A derivative of the
IFL DOS utility programme that used to do that for example.

AV programs that can see into (some) archive files also use this method.

> *If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
> ENCOUNTER THE (compressed) FILE(s).  Any increase or decrease in file
> size will already have been "baked in".  There is no value to my being
> able to "lecture" the content creator that his compression actually
> INCREASED the size of his content.  (caps are for emphasis, not shouting).

But if you have control of what is being done you can detect files where
the method you are using cannot make any improvements and just copy them.
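
A minimal sketch of that "compress, else just copy" policy, using zlib from
the Python standard library. The one-byte method tag and the names `STORED`
and `DEFLATED` are assumptions for the example, not any particular archive
format.

```python
import os
import zlib

STORED, DEFLATED = 0, 1  # hypothetical one-byte method tags

def pack(data):
    """Compress with zlib, but fall back to a verbatim copy if that is no smaller."""
    squeezed = zlib.compress(data, 9)
    if len(squeezed) < len(data):
        return bytes([DEFLATED]) + squeezed
    return bytes([STORED]) + data

def unpack(blob):
    """Invert pack() by honouring the method tag in the first byte."""
    method, body = blob[0], blob[1:]
    return zlib.decompress(body) if method == DEFLATED else body

text = b"white space " * 200   # highly repetitive, so it compresses well
noise = os.urandom(4096)       # random bytes, effectively incompressible
packed_text = pack(text)
packed_noise = pack(noise)
```

The worst case then costs exactly one extra byte, instead of whatever
expansion the compressor would otherwise produce.
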

> [Compression also affords other features that are absent in its absence.
> In particular, most compressors include checksums -- either implied
> or explicit -- that further act to vouch for the integrity of the
> content.  Can you tell me if "foo.txt" is corrupted?  What about
> "foo.zip"?]
>
> *My* concern is being able to recover the original file(s).  REGARDLESS
> OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.

You definitely want to use the byte entropy to classify them then. That
will tell you fairly quickly whether or not a given block of disk is
JPEG, HTML or EXE without doing much work at all.
>
> A user can use an off-the-shelf archiver to "bundle" multiple files
> into a single "archive file".  So, I need to be able to "unbundle"
> them, regardless of the archiver he happened to choose -- hence my
> interest in "archive formats".

There were utilities of this sort back in the 1990s; why reinvent the
wheel? It is harder now since there are even more rival formats.

> A user can often opt to "compress" that resulting archive (or, the archive
> program may offer that as an option applied while the archive is built).
> (Or, an individual file without "bundling")
>
> So, in order to unbundle the archive (or recover the singleton), I need
> to be able to UNcompress it.  Hence my interest in compressors.
>
> A user *could* opt to encrypt the contents.  If so, I won't even attempt
> to access the original files.  I have no desire to expend resource
> "guessing" secrets!
>
> He can also opt to apply some other (wacky, home-baked) encoding or
> compression
> scheme (e.g., when sending executables through mail, I routinely change the
> file extenstion to "xex" and prepend some gibberish at the front of the
> file
> to obscure its signature -- because some mail scanners will attempt to
> decompress compressed files to "protect" the recipients, otherwise wrapping
> it in a ZIP would suffice).  If so, I won't even attempt to access the
> original file(s).

That was a part of what I used to do to numeric data files way back
exploiting the fact that spreadsheets treat a blank cell as 0. Then
compress by RLE and then hit that with ZIP. It got quite close to what a
modern compression algorithm can do.

> One can argue that a user might do some other "silly" transform (ROT13?)
> so I could cover those bases with (equally silly) inversions.  I want to
> identify the sorts of *likely* "processes" to which some (other!) user
> could have subjected a file's (or group of files') content and be able
> to reverse them.

Transforms that only alter the value of the symbols but not their
frequency should make no difference at all to the compressibility with
the entropy based variable bit length encoding methods being used today.
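
This is easy to check for ROT13 in particular. The sketch below (names and
sample text are illustrative) treats ROT13 as a byte-for-byte substitution
and compares the byte-histogram entropy before and after.

```python
import math
from collections import Counter

def entropy_nats(data):
    """Shannon entropy of the byte histogram, in nats."""
    n = len(data)
    return -sum(c / n * math.log(c / n) for c in Counter(data).values())

# ROT13 expressed as a one-to-one byte substitution table.
_lower = b"abcdefghijklmnopqrstuvwxyz"
_upper = _lower.upper()
ROT13 = bytes.maketrans(
    _lower + _upper,
    _lower[13:] + _lower[:13] + _upper[13:] + _upper[:13],
)

text = b"The quick brown fox jumps over the lazy dog"
scrambled = text.translate(ROT13)

# The symbols change, but the multiset of symbol frequencies does not.
same_histogram = (
    sorted(Counter(text).values()) == sorted(Counter(scrambled).values())
)
```

Because the substitution is one-to-one, only the labels on the histogram
bins move, so any coder driven purely by symbol frequencies sees the same
statistics.
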
>
> [I recently encountered some dictionaries that were poorly disguised ZIP
> archives]
>
> If the user *chose* to encode his content in BNPF, then I want to be able
> to *decode* that content.  (as long as I don't have to "guess secrets"
> or try to reverse engineer some wacky coding/packing scheme)
>
> Its a relatively simple problem to solve --once you've identified the
> range of *common* archivers/encoders/compressors that might be used!
> (e.g., SIT is/was common on Macs)

Trying every one is a bit too brute force for my taste. Trying the ones
that might be appropriate for the dataset would be my preference.

>> Unless there is some other redundant structure in the file you cannot
>> compress a file where the bytewise entropy is ln(256) or nearly so.
>>
>> You also have to work much harder to get that very last 1% of
>> additional compression too - most algorithms don't even try.
>>
>> PNG is one of the better lossless image ones and gets ~ln(190)
>> ZIP on a larger files gets very close indeed ~ln(255.7)

I think you will find this gets complicated and slow.
I do something not unlike what you are proposing to recognise fragments
of damaged JPEG files and then splice a plausible header on the front to
take a look at what it looks like. Enough systems use default Huffman
tables that there is a fair chance of getting a piece of an image back.

Most modern lossless compressors optimise their symbol tree heavily and
so you have no such crib to deal with a general archive file. The
impossible ones are the backups using proprietary unpublished algorithms
- I can tell from byte entropy that they are pretty good though.

BTW do you have time to run that (now rather large) benchmark through
the Intel C/C++ compiler? I have just got it to compile with Clang for
the M1 and am hoping to run the benchmarks on Friday. My new target
Intel CPU to test is the 12600K, which looks like it could be a real winner.
(i.e., slam it through the compiler and send me back the error msgs)

I'm sure there will be some since every port of notionally "Portable"
software to a new compiler uncovers coding defects (or compiler
defects). Clang for instance doesn't honour I64 length in printf.

--
Regards,
Martin Brown

Re: Archive formats

<97728224-8422-0ca2-745f-ca3681fc01ca@electrooptical.net>

https://www.novabbs.com/tech/article-flat.php?id=84038&group=sci.electronics.design#84038

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: pcdhSpam...@electrooptical.net (Phil Hobbs)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 09:49:32 -0500
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <97728224-8422-0ca2-745f-ca3681fc01ca@electrooptical.net>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <soa3c4$1ek5$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="3dbf367a7594c1d10f6a7ffb3c88e41c";
logging-data="21514"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19D26K5Q7e4ICG5zhqQV2j6"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
Thunderbird/60.0
Cancel-Lock: sha1:KuzWQ6TstTETE9vp5px0zJXrhOA=
In-Reply-To: <soa3c4$1ek5$1@gioia.aioe.org>
 by: Phil Hobbs - Thu, 2 Dec 2021 14:49 UTC

Martin Brown wrote:
> On 01/12/2021 22:31, Don Y wrote:
>> On 12/1/2021 11:52 AM, Dave Platt wrote:
>>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>>> Jasen Betts  <usenet@revmaps.no-ip.org> wrote:
>>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>>> I'm looking for "established" archive formats and/or compression
>>>>> formats (the thinking being that an archive can always be subsequently
>>>>> compressed).
>>>>
>>>> If by compressed you mean made smaller, that's obviously false.
>>>
>>> If we interpret "compressed" to mean "compressed without information
>>> loss", Jasen is correct.  This can't be done.
>>
>> No, you are assuming there is no other (implicit) source of
>> information that the compressor can rely upon.
>
> He is stating a well known and general result.
>
> One that sometimes catches people out. We had offline compression for
> bulk data over phone line that could break some telecom modems realtime
> compression back in the day. Internal buffer overflow because the data
> expanded quite a bit when their simplistic "compression" algorithm tried
> to process it in realtime. If it is still around I created a document
> called fullfile which epitomised the maximally incompressible file.
> There were already a test file sample of ASCII text and an empty file
> (which essentially tests the baud rate of the modems at each end).
>
>> I, for example, have JUST designed a compressor that compresses
>> all occurrences of the string "No, you are assuming there is no
>> other (implicit) source of information that the compressor can
>> rely upon." into the hex constant 0xFE.
>>
>> As such, the first paragraph in my reply, here, can be compressed
>> to a single byte!  The remaining characters in this message are
>> not affected by my compressor.  So, the message ends up SMALLER
>> as a result of the elided characters in that first paragraph.
>>
>> My compressor obviously relies on the fact that 0xFE does not
>> occur in ascii text.  (If it did, I'd have to encode *it* in some
>> other manner)
>>
>> [Unapplicable "proof" elided]
>
> His general point is true though.
>
> Unless there is some other redundant structure in the file you cannot
> compress a file where the bytewise entropy is ln(256) or nearly so.
>
> You also have to work much harder to get that very last 1% of additional
> compression too - most algorithms don't even try.
>
> PNG is one of the better lossless image ones and gets ~ln(190)
> ZIP on a larger files gets very close indeed ~ln(255.7)
>

A piece of ancient programming wisdom appears relevant:

"As everybody knows, all programs have bugs, and all programs can be
made smaller.

Therefore all programs can be reduced to a single incorrect instruction."

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com

Re: Archive formats

<soaosi$9d7$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=84039&group=sci.electronics.design#84039

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Thu, 2 Dec 2021 08:31:19 -0700
Organization: A noiseless patient Spider
Lines: 286
Message-ID: <soaosi$9d7$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <soa3c4$1ek5$1@gioia.aioe.org>
<soa51d$va1$1@dont-email.me> <soa878$1l7m$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 2 Dec 2021 15:31:30 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="97338051e9042b5541a3e6f2ec31e7f9";
logging-data="9639"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18UrMuojH0XwI6M5grx2ORh"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:4UOsuuaw4LaxppYGcuqYNvmQyBc=
In-Reply-To: <soa878$1l7m$1@gioia.aioe.org>
Content-Language: en-US
 by: Don Y - Thu, 2 Dec 2021 15:31 UTC

On 12/2/2021 3:47 AM, Martin Brown wrote:
> On 02/12/2021 09:52, Don Y wrote:
>> On 12/2/2021 2:24 AM, Martin Brown wrote:
>>> On 01/12/2021 22:31, Don Y wrote:
>>>> On 12/1/2021 11:52 AM, Dave Platt wrote:
>>>>> In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
>>>>> Jasen Betts <usenet@revmaps.no-ip.org> wrote:
>>>>>> On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
>>>>>>> I'm looking for "established" archive formats and/or compression
>>>>>>> formats (the thinking being that an archive can always be subsequently
>>>>>>> compressed).
>>>>>>
>>>>>> If by compressed you mean made smaller, that's obviously false.
>>>>>
>>>>> If we interpret "compressed" to mean "compressed without information
>>>>> loss", Jasen is correct. This can't be done.
>>>>
>>>> No, you are assuming there is no other (implicit) source of
>>>> information that the compressor can rely upon.
>>>
>>> He is stating a well known and general result.
>>
>> That only applies in the general case. The fact that most compressors
>> achieve *some* compression means the general case is RARE in the wild;
>> typically encountered when someone tries to compress already compressed
>> content.
>
> That used to be true but most content these days apart from HTML and flat text
> files is already compressed with something like a crude ZIP. MS Office 2007
> onwards files are thinly disguised ZIP files.

Again, it doesn't matter to *my* use. If someone chooses to ZIP a RAR of a UUE
of a BZ2 and then repeat the entire chain of compressors a *second* time,
resulting in a godawful mess, there's nothing that I can do to have prevented
that from being done.

*My* interest is being able to unravel those different layers -- regardless
of *which* compressors and archivers had been applied -- to get at the "juicy
nougat" inside.
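
A sketch of that unravelling, limited to the three stream compressors the
Python standard library can undo (gzip, bzip2, xz) and identified by their
leading magic bytes. The function name, the layer table and the depth limit
are assumptions for illustration; archive containers, UUE, RAR and so on
would need further handlers.

```python
import bz2
import gzip
import lzma

# (label, magic prefix, decompressor) for formats the standard library handles.
LAYERS = [
    ("gzip", b"\x1f\x8b", gzip.decompress),
    ("bzip2", b"BZh", bz2.decompress),
    ("xz", b"\xfd7zXZ\x00", lzma.decompress),
]

def peel(data, max_depth=16):
    """Strip recognised compression layers; return (payload, labels removed)."""
    removed = []
    for _ in range(max_depth):          # bound the work on deeply nested input
        for label, magic, decompress in LAYERS:
            if data.startswith(magic):
                data = decompress(data)
                removed.append(label)
                break
        else:
            break                       # no known signature left: stop peeling
    return data, removed

original = b"juicy nougat"
wrapped = gzip.compress(bz2.compress(lzma.compress(original)))
payload, layers = peel(wrapped)
```

One caveat: innermost content that merely happens to begin with one of
these prefixes would be misread as another layer, so a real tool should
treat a decompression error as "stop here" rather than as a failure.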

>>>> My compressor obviously relies on the fact that 0xFE does not
>>>> occur in ascii text. (If it did, I'd have to encode *it* in some
>>>> other manner)
>>>>
>>>> [Unapplicable "proof" elided]
>>>
>>> His general point is true though.
>>
>> It isn't important to the issues I'm addressing.
>
> It isn't clear to me what issue you are trying to address. Recognising the
> header info for each compression method and the using it will get you a long
> way but ISTR it has already been done. A derivative of the IFL DOS utility
> programme that used to do that for example.

I want to know the "extent" of the problem before posing a solution.
I have many compressors, decompressors, archivers, dearchivers, etc.
already. Many have been written to try to address multiple "forms"
of these actions.

But, none (IMO) have actually addressed *all*. And, I don't *need*
a single-executable-solution; I need to know *which* executable to
apply based on the compression and archiving format "detected"
in the file in question. In the classic UN*X fashion, I can
build a new tool from a *set* of existing tools -- instead of the
Windows solution of rewriting (re-bugging?) all of those existing
tools into a new version of same. The advantage being that I can
add whatever new/exotic format comes along instead of waiting for
someone to build a new (bug free!) executable.
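
That tool-per-format arrangement could be sketched as a small registry that
maps a detected format to the command line of an existing program. The
registry and function names are hypothetical; the commands shown (tar, unzip,
7z, gzip, bzip2) are the usual ones but are only assumed to be installed.

```python
# Hypothetical registry: format name -> function building the argv for an
# existing command-line tool. Supporting a new format means adding one entry.
EXTRACTORS = {
    "tar":   lambda src, dest: ["tar", "-xf", src, "-C", dest],
    "zip":   lambda src, dest: ["unzip", src, "-d", dest],
    "7z":    lambda src, dest: ["7z", "x", src, "-o" + dest],
    # The stream decompressors write to stdout; the caller redirects it.
    "gzip":  lambda src, dest: ["gzip", "-dc", src],
    "bzip2": lambda src, dest: ["bzip2", "-dc", src],
}

def command_for(fmt, src, dest):
    """Return the argv list for the detected format, or None if unsupported."""
    build = EXTRACTORS.get(fmt)
    return build(src, dest) if build else None

zip_cmd = command_for("zip", "a.zip", "out")
unknown_cmd = command_for("sit", "a.sit", "out")   # no handler registered yet
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; none
of the tools themselves has to be rewritten.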

> AV programs that can see into (some) archive files also use this method.
>
>> *If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
>> ENCOUNTER THE (compressed) FILE(s). Any increase or decrease in file
>> size will already have been "baked in". There is no value to my being
>> able to "lecture" the content creator that his compression actually
>> INCREASED the size of his content. (caps are for emphasis, not shouting).
>
> But if you have control of what is being done you can detect files where the
> method you are using cannot make any improvements and just copy it.

Again, I'm not going to alter the original archive/compressed file/etc.
I'm going to leave it however it was. *But* apply whatever inverse
transforms are needed to examine its internal contents in the form
that those contents were originally intended to take.

E.g., if someone builds a tarball of a bunch of RAW images and then compresses
with StuffIt, I'm going to preserve that mess -- but extract copies of those
images for my own use. There's no point in my "repackaging" them; they
are already "packaged" AND I HAVE THE TOOLS TO UNPACK THEM, again, if I
so desire.

>> [Compression also affords other features that are absent in its absence.
>> In particular, most compressors include checksums -- either implied
>> or explicit -- that further act to vouch for the integrity of the
>> content. Can you tell me if "foo.txt" is corrupted? What about
>> "foo.zip"?]
>>
>> *My* concern is being able to recover the original file(s). REGARDLESS
>> OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.
>
> You definitely want to use the byte entropy to classify them then. That will
> tell you fairly quickly whether or not a given block of disk is JPEG, HTML or
> EXE without doing much work at all.

Or SIT, DMG, UUE, BZ2, ...

My plan was to use the file extension as a *hint* and then file(1) (or similar)
to verify signature(s) before processing with whatever tool (and set of command
line switches) is required -- assuming the tool may require "further direction"
than just being thrown at the file.

7z is fairly comprehensive. But, I'm not sure it would recognize "legacy"
tarballs. (this is also an argument against re-encoding the files; I'd
be annoyed if a Sun patch archive wouldn't be deployable on a Slowaris box
because I opted to reencode it in a "denser" form!)

>> A user can use an off-the-shelf archiver to "bundle" multiple files
>> into a single "archive file". So, I need to be able to "unbundle"
>> them, regardless of the archiver he happened to choose -- hence my
>> interest in "archive formats".
>
> There were utilities of this sort back in the 1990's why reinvent the wheel? It
> is harder now since there are even more rival formats.

If there's such a beast, then it will clearly enumerate EVERY such format,
right? So, all I'd need to do is look at its spec sheet to answer the
question posed by this post...

If it is foolish enough to rely on file names (extensions), then it's
already likely doomed -- as anyone can name any file anything they
want!

>> A user can often opt to "compress" that resulting archive (or, the archive
>> program may offer that as an option applied while the archive is built).
>> (Or, an individual file without "bundling")
>>
>> So, in order to unbundle the archive (or recover the singleton), I need
>> to be able to UNcompress it. Hence my interest in compressors.
>>
>> A user *could* opt to encrypt the contents. If so, I won't even attempt
>> to access the original files. I have no desire to expend resource
>> "guessing" secrets!
>>
>> He can also opt to apply some other (wacky, home-baked) encoding or compression
>> scheme (e.g., when sending executables through mail, I routinely change the
>> file extenstion to "xex" and prepend some gibberish at the front of the file
>> to obscure its signature -- because some mail scanners will attempt to
>> decompress compressed files to "protect" the recipients, otherwise wrapping
>> it in a ZIP would suffice). If so, I won't even attempt to access the
>> original file(s).
>
> That was a part of what I used to do to numeric data files way back exploiting
> the fact that spreadsheets treat a blank cell as 0. Then compress by RLE and
> then hit that with ZIP. It got quite close to what a modern compression
> algorithm can do.
>
>> One can argue that a user might do some other "silly" transform (ROT13?)
>> so I could cover those bases with (equally silly) inversions. I want to
>> identify the sorts of *likely* "processes" to which some (other!) user
>> could have subjected a file's (or group of files') content and be able
>> to reverse them.
>
> Transforms that only alter the value of the symbols but not their frequency
> should make no difference at all to the compressibility with the entropy based
> variable bit length encoding methods being used today.
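
The point about value-only transforms can be checked numerically. The sketch below (Python; the function name `order0_entropy` and the sample text are assumptions chosen for illustration) computes the zeroth-order Shannon entropy of a byte string before and after ROT13. Because ROT13 is a one-to-one substitution, the symbol histogram is merely relabelled and the entropy is unchanged. This covers only the symbol-frequency part of the argument; it says nothing by itself about the string-matching stage of a real compressor.

```python
import codecs
import math
from collections import Counter


def order0_entropy(data: bytes) -> float:
    """Zeroth-order Shannon entropy in bits per symbol."""
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


text = "the quick brown fox jumps over the lazy dog " * 20
rotated = codecs.encode(text, "rot_13")  # letters relabelled, layout untouched

h_plain = order0_entropy(text.encode("ascii"))
h_rot = order0_entropy(rotated.encode("ascii"))

print(f"plain: {h_plain:.4f} bits/symbol, rot13: {h_rot:.4f} bits/symbol")
```

The two figures printed should match, since the multiset of symbol counts is identical in both strings.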


Re: Archive formats

<sodlcs$lbb$1@dont-email.me>


https://www.novabbs.com/tech/article-flat.php?id=84078&group=sci.electronics.design#84078

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: blockedo...@foo.invalid (Don Y)
Newsgroups: sci.electronics.design
Subject: Re: Archive formats
Date: Fri, 3 Dec 2021 10:50:05 -0700
Organization: A noiseless patient Spider
Lines: 48
Message-ID: <sodlcs$lbb$1@dont-email.me>
References: <so6vaj$q31$1@dont-email.me>
<so764l$r9v$1@gonzo.revmaps.no-ip.org> <trsk7i-dicg1.ln1@coop.radagast.org>
<so8t57$6cf$1@dont-email.me> <soa3c4$1ek5$1@gioia.aioe.org>
<soa51d$va1$1@dont-email.me> <soa878$1l7m$1@gioia.aioe.org>
<soaosi$9d7$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 3 Dec 2021 17:50:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b112f5396e68294c73299288eb35ffee";
logging-data="21867"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19dpkInYiqBkhf6FGjuLY4y"
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
Thunderbird/52.1.1
Cancel-Lock: sha1:ROK3T2zLY7R+bcUAkBi8i5r7n9g=
In-Reply-To: <soaosi$9d7$1@dont-email.me>
Content-Language: en-US
 by: Don Y - Fri, 3 Dec 2021 17:50 UTC

On 12/2/2021 8:31 AM, Don Y wrote:

>> BTW do you have time to run that (now rather large benchmark) through the
>> Intel C/C++ compiler? I have just got it to compile with Clang for the M1 and
>> am hoping to run the benchmarks on Friday. My new target Intel CPU to test is
>> 12600K which looks like it could be a real winner.
>> (ie slam it through the compiler and send me back the error msgs)
>
> I'm overwhelmed, currently. I made a commitment to release six designs by
> year end (so folks could slip them into production after the holidays)
> but ended up "playing" for a few weeks (fun, but it comes with a cost).
> I'm hoping to get the second of the six done this week. Which means it
> will be a real squeeze to get all six done, given the presence of the
> holiday (despite earlier play time, I'm not keen on working THROUGH
> the holiday to meet those commitments)

> I can try sometime after new years (assuming no problems creep up
> with my product releases once other folks get their hands on them)
> I'll try to make a point to verify the compiler is installed and
> running (I've been working from a different workstation, recently,
> and haven't had need of the tools on "that" workstation)

OK. I was planning on baking last night but got my booster and
made the mistake of falling asleep immediately after. :< As
such, I didn't get a chance to "work the muscle" (which has
always helped me avoid any injection site soreness). The
prospect of kneading 10 pounds of bread dough didn't seem very
appealing....

So, I spent the night clearing the (physical!) crap off that
workstation (with only ~50 sq ft of bench space -- and maybe
two or three of those truly "free" -- anything that is not
actively being used tends to attract clutter!)

Found the compiler. But, I'd apparently uninstalled VS at
some point (likely preparing to install an upgrade in its
place -- I don't trust "overwrite installs").

Reinstalled that (which is a bitch -- 20GB?? -- especially
if doing so "offline".)

Ran a few test cases and *think* it is operational and configured
for "Intel64". So, we can try your code, if still interested.

[I likely won't be able to bake anything "of effort" tonight as I'm
still sore. Maybe some Benne Wafers... they're easy! (though I
see SWMBO has left her EMPTY biscotti on the kitchen counter as
a not-so-subtle hint...)]
