Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  nodelist  faq  login

It is easier to change the specification to fit the program than vice versa.


computers / comp.compression / One Way To Output LZ77 Codes

SubjectAuthor
o One Way To Output LZ77 CodesGerald Tamayo

1
Subject: One Way To Output LZ77 Codes
From: Gerald Tamayo
Newsgroups: comp.compression
Date: Wed, 12 Feb 2020 16:39 UTC
X-Received: by 2002:a37:9d03:: with SMTP id g3mr842550qke.39.1581525541288;
Wed, 12 Feb 2020 08:39:01 -0800 (PST)
X-Received: by 2002:aed:34c1:: with SMTP id x59mr20631993qtd.236.1581525541119;
Wed, 12 Feb 2020 08:39:01 -0800 (PST)
Path: i2pn2.org!i2pn.org!aioe.org!peer02.am4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.compression
Date: Wed, 12 Feb 2020 08:39:00 -0800 (PST)
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=112.211.33.185; posting-account=x4y9KQoAAAAgtc4BPWDOKB7Ls5RAV5pf
NNTP-Posting-Host: 112.211.33.185
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b120c971-9570-48b0-bf6a-1b4ee836bc90@googlegroups.com>
Subject: One Way To Output LZ77 Codes
From: com...@gmail.com (Gerald Tamayo)
Injection-Date: Wed, 12 Feb 2020 16:39:01 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4439
X-Received-Body-CRC: 2148254953
View all headers
Reduced Length LZ (RLLZ)
Let me share here my one way to output LZ77 codes, which may improve LZ77/LZSS or LZW.

LZ77:

LZ77 coding transmits <offset, length> codes. Many times in the compression process, the same matching string is encoded with an <offset> and the same <length> code. We can avoid transmitting the <length> code for many strings as well as not outputting a bit to identify a literal (as in LZSS) by the following:

Instead, read the whole input or block  and gather duplicated or similar strings and encode <the whole string> (e.g., <length of the string (n>=2)> plus the actual string of characters) only this one time, <the number of the same matching strings (i.e., number of following offset codes)>, and its succeeding occurrences in the input block by transmitting only the said <offset> codes. (Or you can use an escape code to end the offset codes, perhaps BLOCK_SIZE.) Do this for all duplicated strings. This is actually a "Reduced Length LZ (RLLZ)".

The literals are outputted last *without bit flags* since they fall into the block buffers not covered or "not activated" by encoded strings. So just one array of literals can be outputted maybe at the end of the file, or, since block-based coding is more practical for shorter offset codes, at the end of the appropriately-sized block of <offsets>.

During decoding, the strings are written first in the output or write buffer using the offsets, and the literals "fill in" the unwritten positions or "holes" in the write buffer. This makes very compact LZ.


                   ***

No need to output bit flags for literals or the number of following consecutive literals in some algorithms.

Then the completely filled write buffer is written to file.

That is, e.g. after gathering all distinct repeated strings on a block,

1) no transmission of <length> code for the next occurrences of the same string (but, in the simplest way, you have to output the number of succeeding strings);

2) transmitting the literals last (in the output buffer) means no need to transmit *bit flags* for literals (and matches). This is the novel or surprising idea here: deferred literals output;

3) if you know (LZT) LZ-Tamayo (2008) algorithm where it was demonstrated complete exclusion of the <length> code, this might as well be "LZ-Tamayo2". Should improve LZ77/LZSS/LZW based compressors. Decoding is also straight-forward.

Sorry that only now after a decade of LZT (2008 ) i am releasing this. I stopped coding compression in 2010 or 2011. (Note: it seems there is already "LZT" before i named my algorithm. LZ-Tischer.)
***

I have similar ideas as "rep-codes" in 2000s. My compression ideas are from the late 90s when i became interested again in data compression.

The ones i call here "holes" they call "gaps", even in LZW. But the idea of deferring transmission of literals for an output buffer avoids bit flags for both literals and matches which is still used in some explanations of ROLZ. Other algorithms still need to output the number of following literals, or literal_length. Deferred, meaning this is not an "online" algorithm.

-- Gerald R. Tamayo

(Reposted/edited from Sept. 2018 post)



1
rocksolid light 0.7.2
clearneti2ptor