[linux-elitists] [kragen@pobox.com: Current state of free-software OCR: not good]

Eugen Leitl eugen@leitl.org
Thu Oct 20 01:49:41 PDT 2005

----- Forwarded message from kragen@pobox.com -----

From: kragen@pobox.com
Date: Thu, 20 Oct 2005 03:37:01 -0400 (EDT)
To: kragen-tol@canonical.org
Subject: Current state of free-software OCR: not good

I downloaded Walter Farquhar Hook's 1842 Church Dictionary
<http://www.archive.org/details/ChurchDictionary> from the Internet
Archive and tried OCRing some text from it, using free software.  I
didn't have a lot of success, but success looks tantalizingly close.

I used DjView to extract the first page that has actual text on it.


gOCR renders the first four lines of the sample book, as output by
DjView, more or less as follows:

    __E stronges_ __ecommendation o_ the _olIo_-
    @g _orb cons@ts @@ the statemen_ {_f @s be@gg
    _oR the __ost paRt, me,Rely a Comp@at@n; a__d tb@
    _eneraI ac___o_ledgment rendel_s @ unnecessarY

It actually reads:

    THE strongest recommendation of the follow-
    ing Work consists in the statement of its being,
    for the most part, merely a Compilation; and this
    general acknowledgment renders it unnecessary

A second try, using the command line

    gocr -C '- abcdefghijklmnopqrstuvwxyz,;ABCDEFGHIJKLMNOPQRSTUVWXYZ.' ChurchDictionary0004.pbm

yielded, after 100 seconds of CPU time, the following results:

    __E stronges_ __ecommendation o_ the _olIo__
    \code(011d)ng _orb consìsts ìn the statemen_ i_f ìts beîngg
    _oR the __ost paRt, me,Rely a Compìlatìon; a__d tbìs
    _eneraI ac___o_ledgment rendel_s ít unnecessarY


Ocrad (I don't remember which version this was; probably 0.12, but
definitely not 0.13) took only 19 seconds and produced the following
results:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d _his
    gen_ral achl_o_ledg_oent rende__s ié unnecessary

Upon being told that it was trying to recognize ASCII (-c ascii), it
produced the same results except that it read "it" correctly in the
last line:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d _his
    gen_ral achl_o_ledg_oent rende__s it unnecessary

That's nearly good enough to be corrected with a dictionary.  

gOCR did better on the second "the", "in", "statement", "part",
"merely", "compilation", "and", "this", and "acknowledgment" --- 9 of
the 29 words, while Ocrad did better on the first "the", "strongest",
"of", "Work", "consists", "its", "being,", "for", "the", "this", and
"unnecessary", 12 of the 29.


It took me a long time to find ClaraOCR, because the web site has been
stolen.  I couldn't figure out how to get ClaraOCR to produce OCR
output at first; the answer is to train it, iterating the training
until you get an acceptable output.  This is a slow process, and
doesn't produce good results even after considerable effort (the
recognizer is kind of dumb, and there's apparently no way to correct
the segmentation into symbols), but it seems that it can ultimately
produce better results than the alternative methods.

I was eventually able to train it to get the first few lines mostly correct:

     T HE strongest r ecommendation of the follow-
     ing Work consists in t he statement of its being,
     for the most part, merely a Compi, an[104]d th
     g eneral acknowledgment renders it unnecessary

The next four lines looked like this:

     [216]o meri[227]ro[204] [229]he [231]a[219]io[201]is so[207][208]ce[210][194] ro[214] [215]h
     i[202]h it as
     ee[245] [258]o[246] r ed [262][269]ra[264]ts ia[235][265]. [266]e[253] o teii [240]ade
     almost [293]or o[296] [297]or roiri so[305]e of o[281]i[283] eates[278]
     r[320]irie[331][318], a[341]d t e [352]o[354] i er as ee[322] sorrre[350]i[336][319][351][337] ce[325]  

They actually read as follows:

    to mention the various sources from which it has
    been compiled.  Extracts have been often made
    almost word for word from some of our greatest
    Divines, and the compiler has been sometimes cen-

Slightly better segmentation and a dictionary would help considerably
here.  "meri.ro." would probably be "men.io." with better
segmentation, and /usr/share/dict/words has only one possibility for
that; likewise ".a.io.is" occurs only as "raviolis", and would
probably be ".a.ious" with better segmentation --- yielding
possibilities "carious", "sanious", and "various".  My frequency
analysis of the British National Corpus
<http://pobox.com/~kragen/sw/wordlist> found "various" 15503 times,
"carious" 7 times, and "sanious" and "raviolis" less than 5 times, so
it should be pretty easy to pick the right one.
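
A minimal sketch of that lookup, in Python.  The frequency table here is a
hypothetical excerpt: the counts for "various" and "carious" are the ones
quoted above, while the counts of 1 are placeholders for "less than 5".
The function name and the table are illustrative, not part of any of the
OCR programs discussed.

```python
import re

# Hypothetical excerpt of a word-frequency list.  15503 and 7 are the
# counts quoted in the text; the 1s are placeholders for "less than 5".
FREQUENCIES = {
    "various": 15503,
    "carious": 7,
    "sanious": 1,
    "raviolis": 1,
}

def candidates(pattern, frequencies):
    """Return the words matching an OCR pattern, most frequent first.

    In `pattern`, "." stands for one unrecognized character, as in the
    ".a.ious" notation used in the text; other characters are literal.
    """
    regex = re.compile(
        "".join("." if c == "." else re.escape(c) for c in pattern)
    )
    matches = [w for w in frequencies if regex.fullmatch(w)]
    return sorted(matches, key=lambda w: frequencies[w], reverse=True)

print(candidates(".a.ious", FREQUENCIES))   # ['various', 'carious', 'sanious']
print(candidates(".a.io.is", FREQUENCIES))  # ['raviolis']
```

Taking the first element of the returned list picks the most frequent
reading.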

This is after 140 cycles of recognition, which took nearly an hour.

ClaraOCR also includes some code for cooperative web-based OCR, but I
haven't tried it yet.

ClaraOCR has a user interface that makes its OCR process dramatically
more transparent than gOCR's and Ocrad's, so I feel better about it
than the above miserable performance would lead one to expect.


ocre wouldn't work on the ASCII PBM, claiming it wasn't a PBM, only
the binary PBM; eventually it popped up a bunch of windows to get help
with letters it was having trouble with.  It seemed to be having
trouble with a lot of letters, so I eventually gave up.  


DjVu isn't in the same category as ocre, ClaraOCR, gOCR, and Ocrad,
but it's important.  The DjVuLibre tools provide powerful
free-software compression algorithms for handling page images,
particularly bilevel page images, and DjView is a far better document
reader than xpdf, ghostview, gv, or xdvi, because it's usually
instantly responsive, supports copy and paste, and supports text
search.

DjVu does not presently have a good free encoder for scanned data.
Presumably that will change once it becomes important.

In the meantime, DjVu will be important for several reasons:
- the Internet Archive is releasing a large volume of public-domain
  page images as part of its Million Books Project in DjVu format,
  but presently has no OCR data for them.
- DjVu files can take advantage of OCR output for copying and
  full-text search.  The free-software "djvused" command can add OCR
  output to existing DjVu files without re-encoding the images.
- DjVu files may be a useful format for acquiring training and
  evaluation data for OCR programs, because unlike other widely-used
  file format, they contain both the raw (possibly compressed) page
  images, and "ground truth" OCR output.  Consequently the adoption of
  DjVu will make OCR training and evaluation data much easier to come
  by than in the past.

Directions for improvement

Better UI: ClaraOCR has spent the largest amount of work on its user
interface, but despite all that, it's far from obvious how to get any
output at all, and entering the correct transliteration for a single
five-letter word requires five mouse clicks or five arrow-key presses;
and classifying a "symbol" that its segmenter has discovered as
"noise" requires many more mouse clicks.

Most of the improvements suggested below could also be used to
dramatically speed up the training process.

Dictionaries: all 26 of the distinct words in the text segment I
tested with occur in my word list mentioned above; the least frequent
are "ing", with 123 occurrences, and "acknowledgment" with 96 (the
modern spelling "acknowledgement" has 554).  (The total of all
frequencies therein is 90080933, a little over 90 million, but that
doesn't include the words that occurred fewer than five times.)  This
suggests that a little bit of fixing up based on language-specific
frequencies could help a lot.

Combination: it is at least possible that, for example, gOCR and Ocrad
together could produce better results than either alone, since each
excelled the other on certain words.
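
One possible way to combine two recognizers, sketched in Python under
several assumptions: the two outputs have already been aligned word by
word, a vocabulary is available, and string similarity from the standard
difflib module is an acceptable stand-in for a real confidence measure.
Neither gOCR nor Ocrad provides this; the function names and the tiny
vocabulary are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(word, reading):
    """Similarity in [0, 1] between a vocabulary word and an OCR reading."""
    return SequenceMatcher(None, word, reading).ratio()

def combine(reading_a, reading_b, vocabulary):
    """Pick the vocabulary word that best fits both OCR readings together."""
    return max(
        vocabulary,
        key=lambda w: similarity(w, reading_a) + similarity(w, reading_b),
    )

# Hypothetical stand-in for a real word list.
VOCABULARY = ["the", "this", "his", "thus"]

# gOCR read "this" as "tb@" and Ocrad read it as "_his" in the samples above.
print(combine("tb@", "_his", VOCABULARY))  # this
```

With this small vocabulary, Ocrad's reading alone is closest to "his" and
gOCR's alone is closest to "the"; only the summed score selects "this".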

Ambiguous output: if the text output is to be used to reformat the
scanned text (for example, for columns of a different width, for a
device with less storage space, or for a text reader) it is obviously
essential to choose the single most likely reading of the text; but if
OCR is being used merely to make a set of images searchable,
"f?ro(m|iri)" is a perfectly good transliteration of "from", which
ClaraOCR rendered above as "roiri".
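
A small Python sketch of searching such ambiguous output, assuming each
scanned word is stored as a regular expression over its possible
readings.  The index below is hypothetical; only "f?ro(m|iri)" comes from
the text, and the pattern for "some" is made up in the same style.

```python
import re

# Hypothetical index: one ambiguous transliteration per scanned word,
# written as a regular expression over its possible readings.
AMBIGUOUS_WORDS = ["almost", "word", "for", "word", "f?ro(m|iri)", "so(m|iri)e"]

def find_word(query, ambiguous_words):
    """Return the positions whose ambiguous transliteration admits `query`."""
    return [
        i for i, pattern in enumerate(ambiguous_words)
        if re.fullmatch(pattern, query)
    ]

print(find_word("from", AMBIGUOUS_WORDS))  # [4]
print(find_word("word", AMBIGUOUS_WORDS))  # [1, 3]
```

The cost of this approach is false positives: a search for "roiri" would
also hit position 4, which is acceptable for search but not for
reformatting the text.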

Better algorithms: the text I was OCRing is eminently readable to
human eyes, despite being slightly askew and printed with somewhat
worn and dented type.  It's absurd that gOCR could only correctly
recognize "a" and a couple of instances of "the" in the original text;
that Ocrad got only 11 of the 29 words correct, even without a
dictionary; and that after an hour of training of ClaraOCR, it had a
similar success rate to (untrained!) gOCR on the next four lines.

Better evaluation data: a standard OCR corpus for evaluating and
training software would probably help a lot.  There's the ISRI OCRtk
OCR Performance Toolkit, which contains "a large and diverse corpus of
280 scanned page images with corresponding ground-truth text," but
it's not clear whether it's free software.
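
Given ground-truth text, even a crude score makes recognizers comparable.
A minimal Python sketch, assuming the OCR output and the ground truth
split into the same number of whitespace-separated words (true of the
sample lines above, but not of OCR output in general, where an alignment
step would be needed).  This is an illustration, not the ISRI toolkit's
method.

```python
def word_accuracy(ocr_text, truth_text):
    """Return (correct, total), comparing words position by position.

    Raises ValueError when the word counts differ, since this sketch
    does no alignment.
    """
    ocr_words = ocr_text.split()
    truth_words = truth_text.split()
    if len(ocr_words) != len(truth_words):
        raise ValueError("word counts differ; alignment needed")
    correct = sum(o == t for o, t in zip(ocr_words, truth_words))
    return correct, len(truth_words)

# Second line of the sample: Ocrad's output and the actual text.
ocrad = "ing Work consists iA the staLe_ent l_f its being,"
truth = "ing Work consists in the statement of its being,"
print(word_accuracy(ocrad, truth))  # (6, 9)
```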

----- End forwarded message -----
Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07100, 11.36820            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE