Rocksolid Light

Welcome to novaBBS (click a section below)

mail  files  register  nodelist  faq  login

Don't compare floating point numbers solely for equality.


computers / comp.compression / Another way to store a canonical Huffman table

SubjectAuthor
* Another way to store a canonical Huffman tablenews
`- Re: Another way to store a canonical Huffman tableHarry Potter

1
Subject: Another way to store a canonical Huffman table
From: new...@zzo38computer.org.invalid
Newsgroups: comp.compression
Organization: Aioe.org NNTP Server
Date: Fri, 7 Feb 2020 21:56 UTC
Path: i2pn2.org!i2pn.org!news.swapon.de!aioe.org!.POSTED.T3p5Iplo8Qapd43srEM3cA.user.gioia.aioe.org!not-for-mail
From: new...@zzo38computer.org.invalid
Newsgroups: comp.compression
Subject: Another way to store a canonical Huffman table
Date: Fri, 07 Feb 2020 21:56:32 +0000
Organization: Aioe.org NNTP Server
Lines: 80
Message-ID: <1581103940.bystand@zzo38computer.org>
NNTP-Posting-Host: T3p5Iplo8Qapd43srEM3cA.user.gioia.aioe.org
Mime-Version: 1.0
X-Complaints-To: abuse@aioe.org
User-Agent: bystand/0.7.3
X-Notice: Filtered by postfilter v. 0.9.2
View all headers
I thought of the way to store a canonical Huffman table: First start one
bit telling if the largest code length is odd or even. And then store the
number of codes of each length using truncated binary. And then store the
values in truncated binary, alternating lowest and highest values, in order
that the window is made narrow after each one.

For the storage of the values, consider that values already specified are
no longer possible for longer codes, so they can be skipped when making the
window of possible values. Also consider that, because the number of codes
of a length is known, you might know that the next value (which is the
lowest, or maybe highest, used value with this code length) cannot be the
few highest (or maybe lowest) values in the window, because then it will
not leave enough room for any other values for codes of this length, so
this reduces the window further. Sometimes the window will be reduced to
a single value in this way, meaning it can be coded in zero bits.

For the list of how many codes per length, note that to determine the
maximum, start up to 1 for codes of zero length, and then double the number
of unused codes for the number of codes of the next length. The bit at the
beginning specifies if the largest code length is odd or even, so if you
know this isn't the largest code length then the maximum can temporarily be
decreased by one.

For example:
  3 -- space a e
  4 -- f h i m n s t
  5 -- l o p r u x

Assume these are 8-bit ASCII codes, so the range of values is 0 to 255.

The maximum code length is five. Count codes per length:
  0: 0/1 *
  1: 0/2
  2: 0/4 *
  3: 3/8
  4: 7/10 *
  5: 6/6

The ones with asterisks are one less maximum. So the encoding will be:
  1..0.00.011.1101.111 (fourteen bits)

Now we have to encode the values that the codes represent. The values to
be encoded, and the ranges that they are encoded in, are:
  Length 3:
    (0-255 / 0-253) 32
    (33-255 / 34-255) 101
    (33-100) 97
  Length 4:
    (0-255 except 32, 97, 101 / 0-249) 102
    (103-255 / 108-255) 116
    (103-115 / 103-111) 104
    (105-115 / 108-115) 115
    (105-114 / 105-112) 105
    (106-114 / 107-114) 110
    (106-109) 109
  Length 5:
    (0-255 except numbers used above / 0-250) 108
    (111-255 except 115, 116 / 115-255) 120
    (111-119 except 115, 116 / 111-114) 111
    (112-119 except 115, 116 / 114-119) 117
    (112-114 / 112-113) 112
    (113-114) 114

Each range in parentheses first gives the range not considering the number
of codes of this length, and then the second range after the slash uses the
consideration of how many codes of this length to shorten the window (with
the same exceptions listed before the slash).

The encoding into binary is left as an exercise for the reader.

Further improvements may also be possible, which I have not considered.
(Possibly variable base coding, nCr coding, etc. Those might be more
complicated to implement though; I am unsure.)

I have not compared this working with other schemes, such as the way used
in DEFLATE. Do you know how well it work compared with the other way? Are
there other schemes that I might not know of?

--
Note: I am not always able to read/post messages during Monday-Friday.


Subject: Re: Another way to store a canonical Huffman table
From: Harry Potter
Newsgroups: comp.compression
Date: Thu, 26 Mar 2020 22:04 UTC
References: 1
X-Received: by 2002:a37:b886:: with SMTP id i128mr11260835qkf.410.1585260245648;
Thu, 26 Mar 2020 15:04:05 -0700 (PDT)
X-Received: by 2002:aed:33a6:: with SMTP id v35mr11026973qtd.328.1585260245493;
Thu, 26 Mar 2020 15:04:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder7.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.compression
Date: Thu, 26 Mar 2020 15:04:05 -0700 (PDT)
In-Reply-To: <1581103940.bystand@zzo38computer.org>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=71.190.145.58; posting-account=xRocggoAAACFej4w6sQauoZjUP9yroE5
NNTP-Posting-Host: 71.190.145.58
References: <1581103940.bystand@zzo38computer.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ca607944-becf-4574-b369-cf189c2b1ba7@googlegroups.com>
Subject: Re: Another way to store a canonical Huffman table
From: rose.jos...@yahoo.com (Harry Potter)
Injection-Date: Thu, 26 Mar 2020 22:04:05 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
View all headers
I have my own technique to shorten Canonical Huffman codes.  It follows:

*  Start with the 16-code ranges to be included.  Basically, 16 bits are used, where each bit contains whether a given 16-value range will be excluded from the header.  I prepend a bit to determine whether any range will be excluded.  This can save 15 bits.

*  Write the least and most bits per entry.  The most only needs to take into account the range from least to 16.  Arithmetic coding can be used.  I have a way of shortening values that are *not* in a range of a power of 2 but am not currently going to reveal it.

*  Write a bit to indicate whether a 0-value is used.

*  Write the most-used value.

*  Scan the Huffman table:
   *   Write a bit to determine whether the current value is the same as the previous.
   *   If not, write a bit to determine whether it is the most-often-used value.
   *   If not, write the value minus the least size plus whether the zero-length value is used, or, if a zero-value, 0, and skip the previous value and the most-occurring value.

Try this out, and tell me what you think.

BTW, I find that skipping blocks compressible with LZ helps with the size of the header.

BTW, I'd better study your idea nd see if I can add it to my technique.


1
rocksolid light 0.7.2
clearneti2ptor