novaBBS - comp.os.linux.misc - Re: OCRing old mainframe listings

OCRing old mainframe listings

<3qvRJ.39829$z688.30424@fx35.iad>

https://www.novabbs.com/computers/article-flat.php?id=7068&group=comp.os.linux.misc#7068

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx35.iad.POSTED!not-for-mail
Newsgroups: comp.os.linux.misc
From: cgi...@kltpzyxm.invalid (Charlie Gibbs)
Subject: OCRing old mainframe listings
User-Agent: slrn/1.0.3 (Linux)
Lines: 15
Message-ID: <3qvRJ.39829$z688.30424@fx35.iad>
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Wed, 23 Feb 2022 18:54:55 UTC
Date: Wed, 23 Feb 2022 18:54:55 GMT
X-Received-Bytes: 1171

by: Charlie Gibbs - Wed, 23 Feb 2022 18:54 UTC

I have a lot of listings from my mainframe days that I would love
to scan and convert to text files. (Even better would be to find
a 9-track drive that could read my tapes, but that might be harder.)
I've tried scanning a small listing to PDF, and then played with
ocrmypdf and gocr, but got mostly garbage out. Does anyone have
experience with this sort of thing who might be able to give me
some pointers?

aTdHvAaNnKcSe...

--
/~\  Charlie Gibbs                  |  Microsoft is a dictatorship.
\ /  <cgibbs@kltpzyxm.invalid>      |  Apple is a cult.
 X   I'm really at ac.dekanfrus     |  Linux is anarchy.
/ \  if you read it the right way.  |  Pick your poison.

Re: OCRing old mainframe listings

<sv61lu$g1b$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=7069&group=comp.os.linux.misc#7069

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ric...@example.invalid (Rich)
Newsgroups: comp.os.linux.misc
Subject: Re: OCRing old mainframe listings
Date: Wed, 23 Feb 2022 19:24:46 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <sv61lu$g1b$1@dont-email.me>
References: <3qvRJ.39829$z688.30424@fx35.iad>
Injection-Date: Wed, 23 Feb 2022 19:24:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d348d337ae2d8cc4060ed1638d092564"; logging-data="16427"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rD6rQxZ46D8+/aI845K3I"
User-Agent: tin/2.0.1-20111224 ("Achenvoir") (UNIX) (Linux/3.10.17 (x86_64))
Cancel-Lock: sha1:Za8cZ46jVjnmiP0WvezU59QkJvE=

by: Rich - Wed, 23 Feb 2022 19:24 UTC

Charlie Gibbs <cgibbs@kltpzyxm.invalid> wrote:
> I have a lot of listings from my mainframe days that I would love
> to scan and convert to text files. (Even better would be to find
> a 9-track drive that could read my tapes, but that might be harder.)
> I've tried scanning a small listing to PDF, and then played with
> ocrmypdf and gocr, but got mostly garbage out. Does anyone have
> experience with this sort of thing who might be able to give me
> some pointers?

1) Try Tesseract -- https://en.wikipedia.org/wiki/Tesseract_(software)
2) Scan at a fairly high res (300 or 600 dpi)

Sometimes thresholding to 2-color black and white also works well --
but you'll have to test on samples to see.
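
As a rough illustration of the thresholding step, here is a minimal sketch in plain Python of Otsu's method, one common way to choose a black/white cut-off automatically. It is not necessarily what Tesseract or any scanner software does internally. The function names `otsu_threshold` and `binarize` are made up for this example, and the image is assumed to be a list of rows of 8-bit grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark ink from light paper.

    `pixels` is an iterable of 8-bit grey values. Otsu's method picks the
    level that maximises the variance between the two resulting classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate level
    sum_dark = 0      # sum of grey values at or below the candidate level
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue              # no dark class yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                 # no light class left
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image, threshold=None):
    """Map a 2-D list of grey values to 0 (ink) and 255 (paper)."""
    if threshold is None:
        threshold = otsu_threshold(p for row in image for p in row)
    return [[0 if p <= threshold else 255 for p in row] for row in image]
```

On a real scan the pixel data would first have to be loaded with an imaging library. Whether a global threshold like this helps or hurts on faded line-printer output is exactly the kind of thing to test on samples.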

The wikipedia page for Tesseract has some more suggestions:

Tesseract's output will have very poor quality if the input images
are not preprocessed to suit it: Images (especially screenshots)
must be scaled up such that the text x-height is at least 20
pixels,[13] any rotation or skew must be corrected or no text will
be recognized, low-frequency changes in brightness must be high-pass
filtered, or Tesseract's binarization stage will destroy much of the
page, and dark borders must be manually removed, or they will be
misinterpreted as characters.[14]
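
Two of the steps in the quoted passage can be sketched in plain Python: working out how much to enlarge a scan so the x-height reaches 20 pixels, and removing slow brightness changes by subtracting each pixel's local mean (a simple high-pass filter). This is only an illustration under stated assumptions: the names `upscale_factor` and `flatten_background`, the square averaging window, and the default radius of 15 pixels are choices made for this example, not anything Tesseract prescribes.

```python
import math


def upscale_factor(x_height_px, target_px=20):
    """Smallest integer factor that brings a measured x-height up to target_px."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_px / x_height_px))


def flatten_background(image, radius=15):
    """High-pass filter: subtract each pixel's local mean brightness.

    `image` is a 2-D list of 8-bit grey values. The local mean is taken over
    a square of side 2*radius+1, clipped at the page edges. The result is
    re-centred on mid grey (128) and clamped to 0..255.
    """
    height, width = len(image), len(image[0])
    # Summed-area table with a zero border:
    # sat[y][x] holds the sum of all pixels above and to the left of (y, x).
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    out = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            local_sum = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            value = image[y][x] - local_sum / area + 128
            out_row.append(min(255, max(0, round(value))))
        out.append(out_row)
    return out
```

The radius should be large compared with the character size, so that the filter removes page-wide shading rather than the letters themselves. Pure Python loops like these are slow on full-page scans, so an imaging library would be the practical choice for real work.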

Re: OCRing old mainframe listings

<eli$2202231427@qaz.wtf>

https://www.novabbs.com/computers/article-flat.php?id=7070&group=comp.os.linux.misc#7070

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!panix!.POSTED.panix5.panix.com!qz!not-for-mail
From: *...@eli.users.panix.com (Eli the Bearded)
Newsgroups: comp.os.linux.misc
Subject: Re: OCRing old mainframe listings
Date: Wed, 23 Feb 2022 19:27:24 -0000 (UTC)
Organization: Some absurd concept
Message-ID: <eli$2202231427@qaz.wtf>
References: <3qvRJ.39829$z688.30424@fx35.iad>
Injection-Date: Wed, 23 Feb 2022 19:27:24 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="panix5.panix.com:166.84.1.5"; logging-data="1978"; mail-complaints-to="abuse@panix.com"
User-Agent: Vectrex rn 2.1 (beta)
X-Liz: It's actually happened, the entire Internet is a massive game of Redcode
X-Motto: "Erosion of rights never seems to reverse itself." -- kenny@panix
X-US-Congress: Moronic Fucks.
X-Attribution: EtB
XFrom: is a real address
Encrypted: double rot-13

by: Eli the Bearded - Wed, 23 Feb 2022 19:27 UTC

In comp.os.linux.misc, Charlie Gibbs <cgibbs@kltpzyxm.invalid> wrote:
> I have a lot of listings from my mainframe days that I would love to
> scan and convert to text files. (Even better would be to find a
> 9-track drive that could read my tapes, but that might be harder.)
> I've tried scanning a small listing to PDF, and then played with
> ocrmypdf and gocr, but got mostly garbage out. Does anyone have
> experience with this sort of thing who might be able to give me some
> pointers?

AEK, who has http://www.bitsavers.org/, does a ton of old document
scanning and describes his tooling on his website. He uses Acrobat for
the OCR step, but describes it as the most time-consuming part.

Check the archive.org blog, too; in the past they described their
methods there, and they probably still do.

AEK may know where you can get old tapes converted; he has some
connection with a computer history museum.

Elijah
------
the issues with OCR make one despair for real AI vision

Re: OCRing old mainframe listings

<j7nga2Fmds0U1@mid.individual.net>

https://www.novabbs.com/computers/article-flat.php?id=7071&group=comp.os.linux.misc#7071

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: jpstew...@personalprojects.net (John-Paul Stewart)
Newsgroups: comp.os.linux.misc
Subject: Re: OCRing old mainframe listings
Date: Wed, 23 Feb 2022 14:30:09 -0500
Lines: 16
Message-ID: <j7nga2Fmds0U1@mid.individual.net>
References: <3qvRJ.39829$z688.30424@fx35.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Trace: individual.net XuDKElpyeupBthBsRdVyXwUgE6bhjrqDrOXRO5yCspGK/Fy2qF
Cancel-Lock: sha1:7Huz2YZhCoG3BH1K/e60CFrJiS8=
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.0
Content-Language: en-CA
In-Reply-To: <3qvRJ.39829$z688.30424@fx35.iad>

by: John-Paul Stewart - Wed, 23 Feb 2022 19:30 UTC

On 2022-02-23 13:54, Charlie Gibbs wrote:
> I have a lot of listings from my mainframe days that I would love
> to scan and convert to text files. (Even better would be to find
> a 9-track drive that could read my tapes, but that might be harder.)
> I've tried scanning a small listing to PDF, and then played with
> ocrmypdf and gocr, but got mostly garbage out. Does anyone have
> experience with this sort of thing who might be able to give me
> some pointers?

Try asking on the cctalk mailing list from http://www.classiccmp.org/

There have been recent discussions on the list (check the archives on
their website) about doing OCR on old code listings. There are also
people on that list with working 9-track tape drives that could either
read the tapes for you or help you set up a drive of your own to do it
yourself.

Re: OCRing old mainframe listings

<sv693v$a3n$1@gioia.aioe.org>

https://www.novabbs.com/computers/article-flat.php?id=7072&group=comp.os.linux.misc#7072

Path: i2pn2.org!i2pn.org!aioe.org!Y0Pn+7Jz/2BhYbunvnvmQQ.user.46.165.242.75.POSTED!not-for-mail
From: not...@telling.you.invalid (Computer Nerd Kev)
Newsgroups: comp.os.linux.misc
Subject: Re: OCRing old mainframe listings
Date: Wed, 23 Feb 2022 21:31:43 -0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <sv693v$a3n$1@gioia.aioe.org>
References: <3qvRJ.39829$z688.30424@fx35.iad>
Injection-Info: gioia.aioe.org; logging-data="10359"; posting-host="Y0Pn+7Jz/2BhYbunvnvmQQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: tin/2.0.1-20111224 ("Achenvoir") (UNIX) (Linux/2.4.31 (i586))
X-Notice: Filtered by postfilter v. 0.9.2

by: Computer Nerd Kev - Wed, 23 Feb 2022 21:31 UTC

If your plan is to share the scans online, then one option is to
simply upload them to the Internet Archive, who automatically OCR
uploaded documents. If you upload a PDF they will convert that
into a new OCRed PDF, as well as plain text. Or you can also upload
a .zip full of numbered images (001.png, 002.png, 003.png...) and
they'll convert that as well (they had instructions somewhere with
the exact details of how to upload such scans).
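
A minimal sketch of packing scanned pages into such a zip with Python's standard `zipfile` module follows. The 001.png, 002.png naming is taken from the description above. The function name `pack_scans` is hypothetical, and the Internet Archive's actual requirements (archive name, accepted image formats) are not verified here, so check their instructions before uploading.

```python
import zipfile
from pathlib import Path


def pack_scans(image_paths, zip_path):
    """Store page images in a zip under zero-padded sequential names.

    `image_paths` must already be in reading order. Each file keeps its
    extension (lower-cased) and is renamed to 001, 002, 003, ...
    Returns the list of names written to the archive.
    """
    image_paths = list(image_paths)
    # At least three digits, more if there are 1000 or more pages.
    digits = max(3, len(str(len(image_paths))))
    names = []
    # ZIP_STORED: PNG and JPEG data is already compressed, so skip recompression.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for number, path in enumerate(image_paths, start=1):
            path = Path(path)
            name = f"{number:0{digits}d}{path.suffix.lower()}"
            archive.write(path, arcname=name)
            names.append(name)
    return names
```

Passing an explicitly sorted list of page files, for example `sorted(Path("scans").glob("*.png"))` for a hypothetical `scans` directory, keeps the page order independent of whatever order the file system returns.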

Their processing seems to be very good, and may be better than can
be achieved with free software. Obviously you'll still need to make
sure that the resolution is high enough (I usually use 300dpi), and
that the pages are completely flat against the scanner (if
possible).

--
__ __
#_ < |\| |< _#

Re: OCRing old mainframe listings

<WIQRJ.19020$979a.6049@fx14.iad>

https://www.novabbs.com/computers/article-flat.php?id=7074&group=comp.os.linux.misc#7074

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx14.iad.POSTED!not-for-mail
Newsgroups: comp.os.linux.misc
From: cgi...@kltpzyxm.invalid (Charlie Gibbs)
Subject: Re: OCRing old mainframe listings
References: <3qvRJ.39829$z688.30424@fx35.iad> <eli$2202231427@qaz.wtf>
User-Agent: slrn/1.0.3 (Linux)
Lines: 35
Message-ID: <WIQRJ.19020$979a.6049@fx14.iad>
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Thu, 24 Feb 2022 19:08:38 UTC
Date: Thu, 24 Feb 2022 19:08:38 GMT
X-Received-Bytes: 2185

by: Charlie Gibbs - Thu, 24 Feb 2022 19:08 UTC

On 2022-02-23, Eli the Bearded <*@eli.users.panix.com> wrote:

> In comp.os.linux.misc, Charlie Gibbs <cgibbs@kltpzyxm.invalid> wrote:
>
>> I have a lot of listings from my mainframe days that I would love to
>> scan and convert to text files. (Even better would be to find a
>> 9-track drive that could read my tapes, but that might be harder.)
>> I've tried scanning a small listing to PDF, and then played with
>> ocrmypdf and gocr, but got mostly garbage out. Does anyone have
>> experience with this sort of thing who might be able to give me some
>> pointers?
>
> AEK, who has http://www.bitsavers.org/, does a ton of old document
> scanning and describes his tooling on his website. He uses Acrobat for
> the OCR step, but describes it as the most time-consuming part.

I've scanned a lot of old mainframe manuals and uploaded them to Bitsavers.
AEK seems to know a lot about PDFs - he compressed my uploads to a third
their size without noticeable loss of quality. Also, he apparently OCRed
them because the reprocessed files have searchable text; pdftotext can
extract it.
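
For anyone wanting to script that extraction, here is a small sketch that wraps poppler's `pdftotext` from Python. The `-layout` option asks pdftotext to keep the physical layout of the page as best it can, which matters for column-aligned listings. The helper names `pdftotext_command` and `extract_text` are invented for this example, and it assumes the PDF already carries an OCR text layer.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for poppler's pdftotext."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to keep columns aligned as printed
    command += [str(pdf_path), str(txt_path)]
    return command


def extract_text(pdf_path, txt_path):
    """Run pdftotext if it is installed.

    Returns True on success and False if pdftotext is not on the PATH.
    Raises subprocess.CalledProcessError if pdftotext itself fails.
    """
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return True
```

Comparing the output with and without `-layout` on one sample page is a quick way to see which mode suits a given listing.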

> Check the archive.org blog, too; in the past they described their
> methods there, and they probably still do.
>
> AEK may know where you can get old tapes converted; he has some
> connection with a computer history museum.

Thanks, I'll look into that.

The world is coming to an end ... SAVE YOUR BUFFERS!!!
