novaBBS - rec.photo.digital - Re: Convert an imagebook to a textbook (perhaps using OCR?)

Convert an imagebook to a textbook (perhaps using OCR?)

<ucbj2f$92df$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=14436&group=rec.photo.digital#14436

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: RudolphR...@nospam.net (Rudolph Rhein)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Convert an imagebook to a textbook (perhaps using OCR?)
Date: Sat, 26 Aug 2023 04:05:18 +0300
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <ucbj2f$92df$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 26 Aug 2023 01:04:15 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="1b394b889a8fad79903b42b083cd85ce";
logging-data="297391"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/7+Me2sMwI6cpfFNOugTaLvzB5kGTwO2w="
User-Agent: 40tude_Dialog/2.0.15.41 (Beta 38)
Cancel-Lock: sha1:j73SmgY3xOgWR3POr2us71XZPHY=

by: Rudolph Rhein - Sat, 26 Aug 2023 01:05 UTC

My sister's Great Books selection for next month is Noel Coward's comedy
"Private Lives" from the 1930s. She's almost blind from complications.

She is not technical and has only an iPad and an iPhone, while I have
Android & Windows, so she asked me to help her with iOS text-to-speech.

She sent me the link to the PDF because text-to-speech won't read it out.
<https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>

Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
it) but just a set of scanned images of the book (with no actual text).

I tried converting that PDF with Calibre on Windows to an EPUB format,
but the EPUB was nothing more than a set of the same images in a file.

What's a good way for me to convert that "imagebook" (whatever you call it)
to a "textbook" so that I can send it to her to use TTS on her iPad?

Re: Convert an imagebook to a textbook (perhaps using OCR?)

<eli$2308252156@qaz.wtf>

https://www.novabbs.com/tech/article-flat.php?id=14438&group=rec.photo.digital#14438

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!panix!.POSTED.panix5.panix.com!qz!not-for-mail
From: *...@eli.users.panix.com (Eli the Bearded)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Re: Convert an imagebook to a textbook (perhaps using OCR?)
Date: Sat, 26 Aug 2023 02:00:45 -0000 (UTC)
Organization: Some absurd concept
Message-ID: <eli$2308252156@qaz.wtf>
References: <ucbj2f$92df$1@dont-email.me>
Injection-Date: Sat, 26 Aug 2023 02:00:45 -0000 (UTC)
Injection-Info: reader2.panix.com; posting-host="panix5.panix.com:166.84.1.5";
logging-data="22385"; mail-complaints-to="abuse@panix.com"
User-Agent: Vectrex rn 2.1 (beta)
X-Liz: It's actually happened, the entire Internet is a massive game of Redcode
X-Motto: "Erosion of rights never seems to reverse itself." -- kenny@panix
X-US-Congress: Moronic Fucks.
X-Attribution: EtB
XFrom: is a real address
Encrypted: double rot-13

by: Eli the Bearded - Sat, 26 Aug 2023 02:00 UTC

Follow-ups set to comp.text.pdf.

In rec.photo.digital, Rudolph Rhein <RudolphRhein@nospam.net> wrote:
> My sister's next-month Great Books is Noel Coward's play comedy named
> "Private Lives" from the 1930s. She's almost blind from complications.
....
> <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
> Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
> it) but just a set of scanned images of the book (with no actual text).

It's archive.org. They have documents in multiple formats already.

https://archive.org/details/in.ernet.dli.2015.210130

DOWNLOAD OPTIONS
* ABBYY GZ download
* DAISY download For print-disabled users
* EPUB download
* FULL TEXT download
* ITEM TILE download
* KINDLE download
* PDF download
* PDF WITH TEXT download
* SINGLE PAGE PROCESSED JP2 ZIP

Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
typical OCR errors in it.
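
For anyone who would rather script the FULL TEXT download than click
through the page, here is a minimal Python sketch. It assumes the plain
text file follows archive.org's usual "<name>_djvu.txt" naming under
/download/<item>/; that file name and both helper functions are
illustrative assumptions, not anything confirmed for this item.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Item identifier taken from the details URL in this thread.
ITEM_ID = "in.ernet.dli.2015.210130"
# Assumed file name, following archive.org's usual "_djvu.txt" convention.
TEXT_NAME = "2015.210130.Private-Lives_djvu.txt"


def full_text_url(item_id: str, file_name: str) -> str:
    """Build a direct download URL for one file of an archive.org item."""
    return f"https://archive.org/download/{quote(item_id)}/{quote(file_name)}"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download the file and decode it, replacing any undecodable bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_text(full_text_url(ITEM_ID, TEXT_NAME)) should return the
OCRed play as one string, OCR errors included, ready to save as a .txt
file for the iPad.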

Elijah
------
does not know what all of the formats are

Re: Convert an imagebook to a textbook (perhaps using OCR?)

<ucbri3$e2jo$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=14439&group=rec.photo.digital#14439

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@needed.invalid (Paul)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Re: Convert an imagebook to a textbook (perhaps using OCR?)
Date: Fri, 25 Aug 2023 23:29:06 -0400
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <ucbri3$e2jo$1@dont-email.me>
References: <ucbj2f$92df$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 26 Aug 2023 03:29:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="0beb5b91b207cae615ad4e69fb97fc3b";
logging-data="461432"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/q6XqdEVTByMYpB9BIHLQUBZ5nE/u2NpU="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:TILsR9CsOlXUyYNW1jrH/LltqWg=
Content-Language: en-US
In-Reply-To: <ucbj2f$92df$1@dont-email.me>

by: Paul - Sat, 26 Aug 2023 03:29 UTC

On 8/25/2023 9:05 PM, Rudolph Rhein wrote:
> My sister's next-month Great Books is Noel Coward's play comedy named
> "Private Lives" from the 1930s. She's almost blind from complications.
>
> She is not technical and she only has an iPad and an iPhone but I have
> Android & Windows so she asked me to help her with IOS text to speech.
>
> She sent me the link to the PDF because it won't text-to-speech read out.
> <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
>
> Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
> it) but just a set of scanned images of the book (with no actual text).
>
> I tried converting that PDF with Calibre on Windows to an EPUB format,
> but the EPUB was nothing more than a set of the same images in a file.
>
> What's a good way for me to convert that "imagebook" (whatever you call it)
> to a "textbook" so that I can send it to her to use TTS on her iPad?
>

Noel Coward is a genius.

He picked the perfect font, to prevent OCR :-)

An italic font with rough edges. The scanning team did a great job, but maybe
they should have tried OCR first, before cleanup.

*******

https://archive.org/stream/in.ernet.dli.2015.210130/2015.210130.Private-Lives_djvu.txt <=== try TTS on this

( https://archive.org/details/in.ernet.dli.2015.210130 )

Ocr ABBYY FineReader 11.0
Ppi 600 <=== Didn't look like 600 to me...

Each scanned page is 2800 x 4000 pixels, so whether 600 is
true or not depends on the size of the printed page.
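
As a quick check of that arithmetic, here is a small Python sketch. At
600 ppi, a 2800 x 4000 pixel scan implies a page of about 4.67 x 6.67
inches (roughly 119 x 169 mm). The function names are illustrative, and
the 5-inch page width mentioned afterwards is a made-up example value,
not a measurement of this book.

```python
from typing import Tuple


def implied_page_size_inches(width_px: int, height_px: int,
                             ppi: float) -> Tuple[float, float]:
    """Page size in inches that a scan of this pixel size implies at `ppi`."""
    return width_px / ppi, height_px / ppi


def implied_ppi(width_px: int, page_width_in: float) -> float:
    """Resolution implied by a scan width and a known printed page width."""
    return width_px / page_width_in
```

implied_page_size_inches(2800, 4000, 600) gives the page size above. If
the printed page were instead 5 inches wide (hypothetical),
implied_ppi(2800, 5.0) would put the scan at 560 ppi.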

Windows apparently has an OCR library. Fat lot of good that does me.

https://blogs.windows.com/windowsdeveloper/2016/02/08/optical-character-recognition-ocr-for-windows-10/

If you watch how the OCR in the old Acrobat Distiller package
used to work, first it does layout analysis. It recognizes text columns
in a three-column layout. Then, it selects lines of text (pixmap sections)
and does OCR on them, and it associates the text with the column.

The Microsoft OCR library, at a guess, does not do layout analysis. It
takes whatever pixmap section you feed it and makes a line of text
(with little or no punctuation or layout info). That would explain why
the sample image they fed it had only one line of text in it: with a
single line, the output looks the same whether or not a layout engine
is present. If the image had just two lines of text, you would see
what its capabilities actually are.
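
To make "layout analysis, then line selection" concrete, here is a toy
Python sketch using projection profiles on a binary bitmap (1 = ink,
0 = background). This is a textbook simplification and an assumption on
my part; it is not how Acrobat or the Microsoft library actually works,
and all names here are illustrative.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ended at a blank gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def split_columns(page: Bitmap) -> List[Tuple[int, int]]:
    """Layout step: find text columns as x-ranges separated by blank gutters."""
    width = len(page[0]) if page else 0
    column_profile = [sum(row[x] for row in page) for x in range(width)]
    return find_runs(column_profile)


def split_lines(page: Bitmap, column: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Line step: within one column, find text lines as y-ranges."""
    left, right = column
    row_profile = [sum(row[left:right]) for row in page]
    return find_runs(row_profile)
```

Each (column, line) pair marks one pixmap section to hand to the
recognizer, which is how the recognized text stays associated with its
column. Feeding a whole multi-column page to a line recognizer skips
both steps.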

I could easily feed the sample through some package running
Tesseract, but we all know how that will turn out.

Paul

Re: Convert an imagebook to a textbook (perhaps using OCR?)

<ucc6gd$fjkg$1@dont-email.me>

https://www.novabbs.com/tech/article-flat.php?id=14440&group=rec.photo.digital#14440

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: RudolphR...@nospam.net (Rudolph Rhein)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Re: Convert an imagebook to a textbook (perhaps using OCR?)
Date: Sat, 26 Aug 2023 09:37:00 +0300
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <ucc6gd$fjkg$1@dont-email.me>
References: <ucbj2f$92df$1@dont-email.me> <eli$2308252156@qaz.wtf>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 26 Aug 2023 06:35:59 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="1b394b889a8fad79903b42b083cd85ce";
logging-data="511632"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zhi+BUpRBroZAoUsM98yQBbACW4Xaikg="
User-Agent: 40tude_Dialog/2.0.15.41 (Beta 38)
Cancel-Lock: sha1:t/ipe2LVRBRY1ZgmoV1BFnqQilc=

by: Rudolph Rhein - Sat, 26 Aug 2023 06:37 UTC

Eli the Bearded <*@eli.users.panix.com> wrote:

> It's archive.org. They have documents in multiple formats already.

How the heck did you know that?

> https://archive.org/details/in.ernet.dli.2015.210130

That's a much better link (to send to the other Great Bookers!).

> DOWNLOAD OPTIONS
> * ABBYY GZ download
> * DAISY download For print-disabled users
> * EPUB download
> * FULL TEXT download

Even though I was aiming for a PDF, "full text" seems to be the most
natural input for a text-to-speech program, wouldn't you think?

> * ITEM TILE download
> * KINDLE download
> * PDF download
> * PDF WITH TEXT download
> * SINGLE PAGE PROCESSED JP2 ZIP

Usually I'm comfortable starting with an EPUB or Kindle for conversion.
But what's the difference between "PDF" and "PDF with text" anyway?

> Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
> typical OCR errors in it.

How do you know that?
Are you saying the EPUB/Kindle are the most faithful then?

> Elijah
> ------
> does not know what all of the formats are

Kindle:
<https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.mobi>

EPUB:
<https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.epub>

I opened that EPUB file in the Windows Calibre program.
It had a mixture of mostly text, but some scanned pages.

The disclaimer at the beginning said:
"This book was produced in EPUB format by the Internet
Archive. The book pages were scanned and converted to EPUB
format automatically. This process relies on optical
character recognition, and is somewhat susceptible to
errors. The book may not offer the correct reading
sequence, and there may be weird characters, nonwords, and incorrect
guesses at structure. Some page numbers and headers or footers may remain
from the scanned page. The process which identifies images might have found
stray marks on the page which are not actually images from the book. The
hidden page numbering which may be available to your ereader corresponds to
the numbered pages in the print edition, but is not an exact match; page
numbers will increment at the same rate as the corresponding print edition,
but we may have started numbering before the print book's visible page
numbers. The Internet Archive is working to improve the scanning process
and resulting books, but in the meantime, we hope that this book will be
useful to you."

Using Calibre, I converted that 271KB EPUB into a 625KB PDF file instead.
Unlike before, the font is a normal font now, and it seems to be PDF text.

I think, thanks to you, that the mission was accomplished.
But I'll only know later when her iPad reads that PDF out as text.
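
For the record, the same conversion can probably be scripted with
Calibre's command-line tool, ebook-convert, which takes an input file
and an output file and picks the formats from the extensions. A minimal
Python sketch, assuming Calibre is installed and ebook-convert is on the
PATH; the helper names and file names are illustrative:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(src: Path, dst: Path) -> List[str]:
    """Command line for Calibre's converter; formats follow the extensions."""
    return ["ebook-convert", str(src), str(dst)]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion, failing early if Calibre's CLI is not available."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError(
            "ebook-convert not found on PATH; is Calibre installed?")
    subprocess.run(build_convert_command(src, dst), check=True)
```

convert(Path("Private-Lives.epub"), Path("Private-Lives.pdf")) would
mirror what I did in the GUI. Using a .txt output name instead should
give plain text, which may suit text-to-speech even better.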

Re: Convert an imagebook to a textbook (perhaps using OCR?)

<MPG.3f53bf2b155d7b4f990185@news.individual.net>

https://www.novabbs.com/tech/article-flat.php?id=14441&group=rec.photo.digital#14441

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: the_stan...@fastmail.fm (Stan Brown)
Newsgroups: alt.comp.os.windows-10,comp.text.pdf,rec.photo.digital
Subject: Re: Convert an imagebook to a textbook (perhaps using OCR?)
Date: Sat, 26 Aug 2023 08:50:35 -0700
Organization: Oak Road Systems
Lines: 12
Message-ID: <MPG.3f53bf2b155d7b4f990185@news.individual.net>
References: <ucbj2f$92df$1@dont-email.me> <eli$2308252156@qaz.wtf> <ucc6gd$fjkg$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Trace: individual.net s4o5IZOflkV8sn2DEIuGMgQq2fK82+VLMkGUivNg9jB2gM0TDT
Cancel-Lock: sha1:lq3Tsv5y0mIkdRDhEu/Mv61YjUg= sha256:Wma4DzORgSkGipR+dJ9NgvOfZoR0qtwGM2Q7qi+oHe8=
User-Agent: MicroPlanet-Gravity/3.0.11 (GRC)

by: Stan Brown - Sat, 26 Aug 2023 15:50 UTC

