
Subject                                             Author
* Duplicate EPUBs, finding and removing?            J. J. Lodder
`* Re: Duplicate EPUBs, finding and removing?       J. J. Lodder
 `* Re: Duplicate EPUBs, finding and removing?      Chris Ridd
  `* Re: Duplicate EPUBs, finding and removing?     J. J. Lodder
   +- Re: Duplicate EPUBs, finding and removing?    Mark
   `* Re: Duplicate EPUBs, finding and removing?    Chris Ridd
    `* Re: Duplicate EPUBs, finding and removing?   J. J. Lodder
     `- Re: Duplicate EPUBs, finding and removing?  Chris Ridd

Duplicate EPUBs, finding and removing?

<1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>

https://www.novabbs.com/aus+uk/article-flat.php?id=16997&group=uk.comp.sys.mac#16997

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@de-ster.demon.nl (J. J. Lodder)
Newsgroups: uk.comp.sys.mac
Subject: Duplicate EPUBs, finding and removing?
Date: Mon, 17 Jul 2023 21:30:09 +0200
Organization: De Ster
Lines: 33
Message-ID: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>
Reply-To: jjlax32@xs4all.nl (J. J. Lodder)
Injection-Info: dont-email.me; posting-host="39a2dc993262294f2e5807c73dabf857";
logging-data="1411428"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/X4fDgaloYVv/PPgHsk38jqq2nGMQyVQo="
User-Agent: MacSOUP/2.8.5 (ea919cf118) (Mac OS 10.12.6)
Cancel-Lock: sha1:KqFksqjBDYh0SVoUQwFNQqgB/8I=
 by: J. J. Lodder - Mon, 17 Jul 2023 19:30 UTC

Anyone who has more than a few EPUBs will have noticed
that these tend to proliferate, with many bitwise non-identical copies
all having effectively the same content.

Finding bitwise identical files is easy, with a suitable utility,
even if the name and/or extension has been changed.

The problem with EPUBs is that they come in slightly different versions,
which are effectively equal in content,
but which differ in a few bytes.
The cause is usually that someone has imported the file into Calibre,
and exported it again.
(this may change the .opf file and/or the cover .jpg)

It is of course possible to detect this manually,
for each pair of files.
(convert to .zip, expand, delete duplicate files in the expanded
folders, and see what remains)
This is very labour-intensive, and usually not worth the trouble.
(unless the EPUBs are really huge)

Question: does anyone know of an effective way
to detect -essentially identical- EPUBs?
That is, .epub files with the same readable content,
that differ only in some trivial invisible files?

All 'duplicate finders' that I have seen, including Calibre,
only pretend that they can do this,
but in reality they cannot,

Jan

Re: Duplicate EPUBs, finding and removing?

<1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>

https://www.novabbs.com/aus+uk/article-flat.php?id=17070&group=uk.comp.sys.mac#17070

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@de-ster.demon.nl (J. J. Lodder)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Wed, 19 Jul 2023 16:04:10 +0200
Organization: De Ster
Lines: 50
Message-ID: <1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>
Reply-To: jjlax32@xs4all.nl (J. J. Lodder)
Injection-Info: dont-email.me; posting-host="7c77f659e48574786a7cb79a4fafefc1";
logging-data="2295880"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19IT48HwLHKw1gcgbkxoRRm6HqtV4oVD+E="
User-Agent: MacSOUP/2.8.5 (ea919cf118) (Mac OS 10.12.6)
Cancel-Lock: sha1:OoFJo+NeDiB8sZOvzEW3QqdcToo=
 by: J. J. Lodder - Wed, 19 Jul 2023 14:04 UTC

J. J. Lodder <nospam@de-ster.demon.nl> wrote:

As expected, no answers.
It would seem that there just isn't an effective way
to detect and eliminate essentially duplicate .epubs,
except laboriously, one by one, and by hand.
The best one can do is just delete similar-sized ones with the same
title, and hope for the best.

This certainly isn't the best of designs,

Jan

PS The .cbr and .cbz formats suffer from the same problem.
Merely using a different compressor on the same folder
will already result in different, but essentially identical files.
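
The compressor effect is easy to reproduce. The following Python sketch is not from the thread; the file names and page contents are made up for illustration. It packs the same two files twice, once stored and once deflated, and compares the results. The archives differ bytewise, while the CRC-32 values that zip records for the uncompressed members are the same.

```python
import io
import zipfile


def make_archive(files, compression):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def member_checksums(archive_bytes):
    """Map each member name to the CRC-32 of its uncompressed data."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


# Hypothetical comic pages; any repetitive payload shows the effect.
pages = {"page01.txt": b"panel " * 500, "page02.txt": b"caption " * 500}

stored = make_archive(pages, zipfile.ZIP_STORED)
deflated = make_archive(pages, zipfile.ZIP_DEFLATED)

# The two archives are different files as far as a bitwise comparison goes...
print("archives identical:", stored == deflated)
# ...yet every member carries the same checksum in both.
print("contents identical:", member_checksums(stored) == member_checksums(deflated))
```

Comparing what is inside the archive, rather than the archive itself, is therefore what an "essentially identical" check would need to do.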

> Anyone who has more than a few EPUBs will have noticed
> that these tend to proliferate, with many bitwise non-identical copies
> all having effectively the same content.
>
> Finding bitwise identical files is easy, with a suitable utility,
> even if the name and/or extension has been changed.
>
> The problem with EPUBs is that they come in slightly different versions,
> which are effectively equal in the way of content,
> but which are differing in a few bytes.
> The cause is usually that someone has imported the file into Calibre,
> and exported it again.
> (this may change the .opf file and/or the cover .jpg)
>
> It is of course possible to detect this manually,
> for each pair of files.
> (convert to .zip, expand, delete duplicate files in the expanded
> folders, and see what remains)
> This is very labour-intensive, and usually not worth the trouble.
> (unless the EPUBs are really huge)
>
> Question: does anyone know of an effective way
> to detect -essentially identical- EPUBs?
> That is, .epub files with the same readable content,
> that differ only in some trivial invisible files?
>
> All 'duplicate finders' that I have seen, including Calibre,
> only pretend that they can do this,
> but in reality they cannot,
>
> Jan

Re: Duplicate EPUBs, finding and removing?

<u994tq$2807l$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=17071&group=uk.comp.sys.mac#17071

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chrisr...@mac.com (Chris Ridd)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Wed, 19 Jul 2023 18:01:45 +0100
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <u994tq$2807l$1@dont-email.me>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>
<1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 19 Jul 2023 17:01:46 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="5b094fbbc67e0a88982acc6c2d8b5f50";
logging-data="2359541"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cHOqzgFLQdlSzx7xP+ljsB9zDzYmFxT8="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.13.0
Cancel-Lock: sha1:THyVmV6/QDoOqDpJ7ex9HaK+AB0=
In-Reply-To: <1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>
 by: Chris Ridd - Wed, 19 Jul 2023 17:01 UTC

On 19/07/2023 15:04, J. J. Lodder wrote:
> J. J. Lodder <nospam@de-ster.demon.nl> wrote:
>
> As expected, no answers.
> It would seem that there just isn't an effective way
> to detect and eliminate essentially duplicate .epubs,

What makes two epubs "essentially identical"?

What if one used <div>s for each paragraph and the other <p>s? What
about different CSS?

It isn't an easy problem to solve.

That's why things like Calibre and BookFusion exist, to keep your
library in one place and relatively well organized.

--
Chris

Re: Duplicate EPUBs, finding and removing?

<1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>

https://www.novabbs.com/aus+uk/article-flat.php?id=17073&group=uk.comp.sys.mac#17073

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@de-ster.demon.nl (J. J. Lodder)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Wed, 19 Jul 2023 21:26:43 +0200
Organization: De Ster
Lines: 44
Message-ID: <1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl> <1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl> <u994tq$2807l$1@dont-email.me>
Reply-To: jjlax32@xs4all.nl (J. J. Lodder)
Injection-Info: dont-email.me; posting-host="7c77f659e48574786a7cb79a4fafefc1";
logging-data="2410972"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Y9t28vU6nWiScDMQilIjowC27vT06knk="
User-Agent: MacSOUP/2.8.5 (ea919cf118) (Mac OS 10.12.6)
Cancel-Lock: sha1:PMf3P9Xfxs6efnqC7e0mfW7oKfw=
 by: J. J. Lodder - Wed, 19 Jul 2023 19:26 UTC

Chris Ridd <chrisridd@mac.com> wrote:

> On 19/07/2023 15:04, J. J. Lodder wrote:
> > J. J. Lodder <nospam@de-ster.demon.nl> wrote:
> >
> > As expected, no answers.
> > It would seem that there just isn't an effective way
> > to detect and eliminate essentially duplicate .epubs,
>
> What makes two epubs "essentially identical"?

Having all files that contribute to what you actually see
when you read the book bitwise identical.
(so the actual text files and the jpgs)

The differences that remain are usually only some bits in the .opf file.
(which is merely an invisible html description with metadata)
Or the difference is merely caused by using a different zip utility
on the same folder.
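
That definition can be turned into a rough fingerprint. The Python sketch below is one possible reading of it, not an existing tool: it hashes the uncompressed data of every member of the archive except those assumed to be invisible metadata. The ignore rules (`.opf` files and everything under `META-INF/`) and the name `content_fingerprint` are assumptions chosen for illustration.

```python
import hashlib
import zipfile

# Assumption: members matching these rules never affect what the reader sees.
# The rules are illustrative and are not taken from any specification.
IGNORED_SUFFIXES = (".opf",)
IGNORED_PREFIXES = ("META-INF/",)


def content_fingerprint(epub):
    """Hash the uncompressed data of every member that is not ignored.

    `epub` may be a path or a binary file object. The compression method
    plays no part, because members are hashed after decompression.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(epub) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):  # directory entry, carries no data
                continue
            if name.lower().endswith(IGNORED_SUFFIXES):
                continue
            if name.startswith(IGNORED_PREFIXES):
                continue
            member_hash = hashlib.sha256(archive.read(name)).hexdigest()
            # Member names are part of the fingerprint, so a book whose
            # internal files were renamed counts as a different book.
            digest.update(f"{name}\0{member_hash}\n".encode("utf-8"))
    return digest.hexdigest()
```

Under these assumptions, two files would count as essentially identical when `content_fingerprint("a.epub") == content_fingerprint("b.epub")` (hypothetical file names).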

> What if one used <div>s for each paragraph and the other <p>s? What
> about different CSS?
>
> It isn't an easy problem to solve.
>
> That's why things like Calibre and BookFusion exist, to keep your
> library in one place and relatively well organized.

Yes, but organising a library isn't the problem.
The problem is having several, perhaps many, different versions
of essentially the same book.
You would like to eliminate those duplicates,
(without too much work)
just like ordinary duplicate finders do for bitwise identical files.

The problems are in fact -caused- mostly by Calibre.
Importing into Calibre and exporting again usually produces
a slightly different, but essentially identical file.
(depending on the settings)
Calibre may also add invisible files of its own,

Jan

Re: Duplicate EPUBs, finding and removing?

<u99fg8$29v0v$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=17076&group=uk.comp.sys.mac#17076

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: captain....@gmail.com (Mark)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Wed, 19 Jul 2023 21:02:16 +0100
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <u99fg8$29v0v$1@dont-email.me>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl> <1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl> <u994tq$2807l$1@dont-email.me> <1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: dont-email.me; posting-host="4702efcabf567409352f9915bc34137e";
logging-data="2423839"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18z13pB1Y6DT2+4b9/t0JCJoH35rG4iuog="
User-Agent: Unison/2.2
Cancel-Lock: sha1:RP0miQfu+CjyAxij6Ufz2i/0Hvg=
 by: Mark - Wed, 19 Jul 2023 20:02 UTC

On 2023-07-19 19:26:43 +0000, J. J. Lodder said:

> Chris Ridd <chrisridd@mac.com> wrote:
>
>> On 19/07/2023 15:04, J. J. Lodder wrote:
>>> J. J. Lodder <nospam@de-ster.demon.nl> wrote:
>>>
>>> As expected, no answers.
>>> It would seem that there just isn't an effective way
>>> to detect and eliminate essentially duplicate .epubs,
>>
>> What makes two epubs "essentially identical"?
>
> Having all files that contribute to what you actually see
> when you read the book bitwise identical.
> (so the actual text files and the jpgs)
>
> The difference that remains are usually only some bits in the .opf file.
> (which is merely an invisible html description with metadata)
> Or the difference is merely caused by using a different zip utility
> on the same folder.
>
>> What if one used <div>s for each paragraph and the other <p>s? What
>> about different CSS?
>>
>> It isn't an easy problem to solve.
>>
>> That's why things like Calibre and BookFusion exist, to keep your
>> library in one place and relatively well organized.
>
> Yes, but organising a library isn't the problem.
> The problem is having several, perhaps many, different versions
> of essentially the same book.
> You would like to eliminate those duplicates,
> (without too much work)
> just like ordinary duplicate finders do for bitwise identical files.
>
> The problems are in fact -caused- mostly by Calibre.
> Importing into and exporting from again usually produces
> a slightly different, but essentially identical file.
> (depending on the settings)
> Calibre may also add invisible files of its own,
>
> Jan

You could try here <https://www.mobileread.com/forums/forumdisplay.php?f=110>
--
Cheers ... Mark

Re: Duplicate EPUBs, finding and removing?

<u9bpih$2q2js$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=17081&group=uk.comp.sys.mac#17081

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chrisr...@mac.com (Chris Ridd)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Thu, 20 Jul 2023 18:06:25 +0100
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <u9bpih$2q2js$1@dont-email.me>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>
<1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>
<u994tq$2807l$1@dont-email.me>
<1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 20 Jul 2023 17:06:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7fdca5614dcb53f827b3ded2cfbd11cf";
logging-data="2951804"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ib1CGfJDsm3eIDT2ROGUsRPVeBUpXB24="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.13.0
Cancel-Lock: sha1:TvHgcINkTSZtCzdX2zVU7CEYxhE=
In-Reply-To: <1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>
 by: Chris Ridd - Thu, 20 Jul 2023 17:06 UTC

On 19/07/2023 20:26, J. J. Lodder wrote:
> Chris Ridd <chrisridd@mac.com> wrote:
>
>> On 19/07/2023 15:04, J. J. Lodder wrote:
>>> J. J. Lodder <nospam@de-ster.demon.nl> wrote:
>>>
>>> As expected, no answers.
>>> It would seem that there just isn't an effective way
>>> to detect and eliminate essentially duplicate .epubs,
>>
>> What makes two epubs "essentially identical"?
>
> Having all files that contribute to what you actually see
> when you read the book bitwise identical.
> (so the actual text files and the jpgs)
>
> The difference that remains are usually only some bits in the .opf file.
> (which is merely an invisible html description with metadata)

No, the OPF is very definitely not HTML.

> Or the difference is merely caused by using a different zip utility
> on the same folder.

Yes, one solution here is to unzip both files to separate directories
and to use something like `diff -r --brief book1 book2`.
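
The same check can be done without unpacking to disk, since an EPUB is a zip archive. A minimal Python sketch, assuming both files open cleanly with the standard `zipfile` module; `brief_diff` is a hypothetical helper name, and its report lines only imitate the spirit of `diff -r --brief`, not its exact output.

```python
import zipfile


def read_members(epub):
    """Return {member name: uncompressed bytes} for one EPUB (a zip archive)."""
    with zipfile.ZipFile(epub) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


def brief_diff(epub_a, epub_b):
    """List members that differ or that exist on one side only."""
    left, right = read_members(epub_a), read_members(epub_b)
    report = []
    for name in sorted(left.keys() | right.keys()):
        if name not in right:
            report.append(f"only in first: {name}")
        elif name not in left:
            report.append(f"only in second: {name}")
        elif left[name] != right[name]:
            report.append(f"differ: {name}")
    return report
```

Calling `brief_diff("book1.epub", "book2.epub")` returns an empty list when every member matches. Each book is read fully into memory, which is simple but may be unsuitable for very large files.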

> Yes, but organising a library isn't the problem.
> The problem is having several, perhaps many, different versions
> of essentially the same book.
> You would like to eliminate those duplicates,
> (without too much work)
> just like ordinary duplicate finders do for bitwise identical files.

The problem is however much lessened by not having books scattered
around your disk. At least if they're all in the same place, it is
easier to find duplicates.

Personally, duplicate books are not a problem. I'm not obtaining books
in bulk or anything, so that probably makes it easier to avoid dupes.

--
Chris

Re: Duplicate EPUBs, finding and removing?

<1qe6nfa.17pzxd96uqrbgN%nospam@de-ster.demon.nl>

https://www.novabbs.com/aus+uk/article-flat.php?id=17083&group=uk.comp.sys.mac#17083

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@de-ster.demon.nl (J. J. Lodder)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Thu, 20 Jul 2023 21:10:12 +0200
Organization: De Ster
Lines: 116
Message-ID: <1qe6nfa.17pzxd96uqrbgN%nospam@de-ster.demon.nl>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl> <1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl> <u994tq$2807l$1@dont-email.me> <1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl> <u9bpih$2q2js$1@dont-email.me>
Reply-To: jjlax32@xs4all.nl (J. J. Lodder)
Injection-Info: dont-email.me; posting-host="5c971b362225ffc0bc3288f18ce7e2a9";
logging-data="2994334"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18HkpGR7Xg6DUu4COMEHA1GisD0wz+HjYw="
User-Agent: MacSOUP/2.8.5 (ea919cf118) (Mac OS 10.12.6)
Cancel-Lock: sha1:Vr2t4OLAicTsAU8/rsYxDGXXjlY=
 by: J. J. Lodder - Thu, 20 Jul 2023 19:10 UTC

Chris Ridd <chrisridd@mac.com> wrote:

> On 19/07/2023 20:26, J. J. Lodder wrote:
> > Chris Ridd <chrisridd@mac.com> wrote:
> >
> >> On 19/07/2023 15:04, J. J. Lodder wrote:
> >>> J. J. Lodder <nospam@de-ster.demon.nl> wrote:
> >>>
> >>> As expected, no answers.
> >>> It would seem that there just isn't an effective way
> >>> to detect and eliminate essentially duplicate .epubs,
> >>
> >> What makes two epubs "essentially identical"?
> >
> > Having all files that contribute to what you actually see
> > when you read the book bitwise identical.
> > (so the actual text files and the jpgs)
> >
> > The difference that remains are usually only some bits in the .opf file.
> > (which is merely an invisible html description with metadata)
>
> No, the OPF is very definitely not HTML.

Perhaps, but if this [1] isn't some kind of html, what is it?

> > Or the difference is merely caused by using a different zip utility
> > on the same folder.
>
> Yes, one solution here is to unzip both files to separate directories
> and to use something like `diff -r --brief book1 book2`.

Yes. (supposing that the epub files are not bitwise duplicates)
The results may vary from all files different
to no different files at all.
Usually there are a few trivial small files that are different.
(often the .opf file)

For example, I have seen machine-generated conversions from .pdf files.
They are excessively ugly inside, and quite different from true epubs
that were created as epubs.

> > Yes, but organising a library isn't the problem.
> > The problem is having several, perhaps many, different versions
> > of essentially the same book.
> > You would like to eliminate those duplicates,
> > (without too much work)
> > just like ordinary duplicate finders do for bitwise identical files.
>
> The problem is however much lessened by not having books scattered
> around your disk. At least if they're all in the same place, it is
> easier to find duplicates.

Yes, but how do you find which really are duplicates?

> Personally, duplicate books are not a problem. I'm not obtaining books
> in bulk or anything, so that probably makes it easier to avoid dupes.

Sure, it is not really a problem.
Storage costs essentially nothing.
Epub books are perhaps a megabyte on average,
so storage cost is measured in millicents.
Even very big epubs are rarely more than 100 MB.
It is just that they clutter up directories and lists.

Hence the question is for a utility that will do it automatically,
with just a few mouseclicks for a lot of files.
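
No existing utility is named in the thread, but the idea can be sketched in a few lines of Python. This is a hedged outline, not a finished tool: it keys each `.epub` on the names, sizes and CRC-32 values recorded in the zip directory, skips members assumed to be metadata (`.opf` files and `META-INF/`, which is an assumption), and reports files that share a key. Matching CRC-32 values are a strong hint rather than proof, so the sketch only lists candidates and deletes nothing.

```python
import sys
import zipfile
from collections import defaultdict
from pathlib import Path


def is_ignored(name):
    """Assumption: these members are invisible metadata, not book content."""
    return name.lower().endswith(".opf") or name.startswith("META-INF/")


def signature(epub_path):
    """Build a comparison key from the zip directory alone (no decompression)."""
    with zipfile.ZipFile(epub_path) as archive:
        return frozenset(
            (info.filename, info.file_size, info.CRC)
            for info in archive.infolist()
            if not info.is_dir() and not is_ignored(info.filename)
        )


def find_duplicate_groups(folder):
    """Return lists of .epub paths under `folder` that share a signature."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.epub")):
        try:
            groups[signature(path)].append(path)
        except zipfile.BadZipFile:
            # Not a readable zip archive; report it and move on.
            print(f"skipping unreadable file: {path}", file=sys.stderr)
    return [paths for paths in groups.values() if len(paths) > 1]
```

A call such as `find_duplicate_groups("Books")` (hypothetical folder name) returns one list per set of candidate duplicates. Which copy to keep remains a manual decision.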

Jan

--
[1] A typical sample:

<manifest>
<item href="nav.xhtml" id="nav" media-type="application/xhtml+xml"
properties="nav"/>
<item href="toc.ncx" id="toc" media-type="application/x-dtbncx+xml"/>
<item href="xhtml/cover.xhtml" id="html-cover-page"
media-type="application/xhtml+xml"/>
<item href="xhtml/title.xhtml" id="tit"
media-type="application/xhtml+xml"/>
<item href="xhtml/mini_toc.xhtml" id="mini_toc"
media-type="application/xhtml+xml"/>
<item href="xhtml/copyrightnotice.xhtml" id="copyrightnotice"
media-type="application/xhtml+xml"/>
<item href="xhtml/dedication.xhtml" id="dedication"
media-type="application/xhtml+xml"/>
<item href="xhtml/foreword.xhtml" id="foreword"
media-type="application/xhtml+xml"/>
<item href="xhtml/part1.xhtml" id="part1"
media-type="application/xhtml+xml"/>
<item href="xhtml/chapter1.xhtml" id="chapter1"
media-type="application/xhtml+xml"/>
<item href="xhtml/chapter2.xhtml" id="chapter2"
media-type="application/xhtml+xml"/>
<item href="xhtml/part2.xhtml" id="part2"
media-type="application/xhtml+xml"/>
[-]
media-type="application/xhtml+xml"/>
<item href="xhtml/acknowledgments.xhtml" id="acknowledgments"
media-type="application/xhtml+xml"/>
<item href="xhtml/abouttheauthor.xhtml" id="abouttheauthor"
media-type="application/xhtml+xml"/>
<item href="xhtml/newsletter.xhtml" id="newsletter"
media-type="application/xhtml+xml"/>
<item href="xhtml/contents.xhtml" id="contents"
media-type="application/xhtml+xml"/>
<item href="xhtml/copyright.xhtml" id="copyright"
media-type="application/xhtml+xml"/>
<item href="images/9781250876577.jpg" id="cover-image"
media-type="image/jpeg" properties="cover-image"/>
<item href="images/author.jpg" id="image-002" media-type="image/jpeg"/>
<item href="images/title.jpg" id="image-003" media-type="image/jpeg"/>
<item href="styles/stylesheet.css" id="css" media-type="text/css"/>
<item href="images/NewsletterSignup.jpg" id="image-0002"
media-type="image/jpeg"/>
</manifest>

Re: Duplicate EPUBs, finding and removing?

<u9d401$34f45$1@dont-email.me>

https://www.novabbs.com/aus+uk/article-flat.php?id=17084&group=uk.comp.sys.mac#17084

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: chrisr...@mac.com (Chris Ridd)
Newsgroups: uk.comp.sys.mac
Subject: Re: Duplicate EPUBs, finding and removing?
Date: Fri, 21 Jul 2023 06:10:24 +0100
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <u9d401$34f45$1@dont-email.me>
References: <1qe16kw.w6x1yv766ybtN%nospam@de-ster.demon.nl>
<1qe4hdv.4j5cqexbazt5N%nospam@de-ster.demon.nl>
<u994tq$2807l$1@dont-email.me>
<1qe4uyb.1jwouip63jiwN%nospam@de-ster.demon.nl>
<u9bpih$2q2js$1@dont-email.me>
<1qe6nfa.17pzxd96uqrbgN%nospam@de-ster.demon.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 21 Jul 2023 05:10:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="83fa5d15476b9c80aa3e3ce621b2710a";
logging-data="3292293"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19RKMxWSCRp985hYw8fzKbwHMkltWON7FQ="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
Gecko/20100101 Thunderbird/102.13.0
Cancel-Lock: sha1:RRsdinEqsZAfiMB4d47GxcUyk/E=
In-Reply-To: <1qe6nfa.17pzxd96uqrbgN%nospam@de-ster.demon.nl>
 by: Chris Ridd - Fri, 21 Jul 2023 05:10 UTC

On 20/07/2023 20:10, J. J. Lodder wrote:
> Chris Ridd <chrisridd@mac.com> wrote:
>
>> On 19/07/2023 20:26, J. J. Lodder wrote:
>>> Chris Ridd <chrisridd@mac.com> wrote:
>>>
>>>> On 19/07/2023 15:04, J. J. Lodder wrote:
>>>>> J. J. Lodder <nospam@de-ster.demon.nl> wrote:
>>>>>
>>>>> As expected, no answers.
>>>>> It would seem that there just isn't an effective way
>>>>> to detect and eliminate essentially duplicate .epubs,
>>>>
>>>> What makes two epubs "essentially identical"?
>>>
>>> Having all files that contribute to what you actually see
>>> when you read the book bitwise identical.
>>> (so the actual text files and the jpgs)
>>>
>>> The difference that remains are usually only some bits in the .opf file.
>>> (which is merely an invisible html description with metadata)
>>
>> No, the OPF is very definitely not HTML.
>
> Perhaps, but if this [1] isn't some kind of html, what is it?

XML. See the current specification for it:
https://www.w3.org/TR/epub-33/#sec-package-doc
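
Because the package document is XML, a stock XML parser reads it. A small Python sketch, assuming the manifest sits inside a `<package>` element in the `http://www.idpf.org/2007/opf` namespace, as the specification describes. `SAMPLE_OPF` is an abbreviated, made-up document in the shape of the sample quoted earlier in the thread, and `manifest_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace of the EPUB package document.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Abbreviated, made-up package document for illustration only.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item href="nav.xhtml" id="nav" media-type="application/xhtml+xml" properties="nav"/>
    <item href="xhtml/chapter1.xhtml" id="chapter1" media-type="application/xhtml+xml"/>
    <item href="images/title.jpg" id="image-003" media-type="image/jpeg"/>
    <item href="styles/stylesheet.css" id="css" media-type="text/css"/>
  </manifest>
</package>
"""


def manifest_items(opf_text):
    """Return (href, media-type) for every <item> in the manifest."""
    root = ET.fromstring(opf_text)
    return [
        (item.get("href"), item.get("media-type"))
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    ]


for href, media_type in manifest_items(SAMPLE_OPF):
    print(f"{media_type:28} {href}")
```

The manifest is the list of files that make up the book, which is why it is a natural starting point for deciding which members count as content.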

>> Personally, duplicate books are not a problem. I'm not obtaining books
>> in bulk or anything, so that probably makes it easier to avoid dupes.

BTW, I meant for me they're not a problem as I don't purchase books twice.

> Sure, it is not really a problem.
> Storage costs essentially nothing.

Correct, the space required is irrelevant. You just don't want to read a
book twice accidentally - or you want to make sure you have the cleanest
version of the book.

--
Chris
