The OCR system within the Extended DMS powers the following functionality:

  1. Automatically save incoming paper correspondence to a case by receiving the scanned pages from the scanning machine
  2. Make all scanned documents searchable by the keywords they contain

Default OCR Engine

The OCR functionality is enabled by default, using the Tesseract OCR engine (maintained by Google). Tesseract 3 is included in the Nuxeo container and is used for OCR by default. This default setup is only recommended for low document volumes and reasonably low throughput of image-based PDF documents.

Alternatives to default options

Commercial OCR Applications

It is also possible to use any OCR engine that can be called from the command line (e.g. via ssh). OCR software vendors charge fees for these products; however, commercial engines typically produce the most accurate OCR results. Please contact Patrix if you would like to set up an alternative commercial OCR system as part of your DMS installation.

For the OCR tools to be used, they must be able to convert PDF or images to text from the command line.  
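As an illustration of this requirement, here is a minimal Python sketch of what "callable from the command line" can look like in practice. It assumes the engine follows the Tesseract calling convention (`tesseract <image> stdout -l <lang>`, which prints the recognised text to standard output). The helper names `build_ocr_command` and `ocr_image` are hypothetical and are not part of the DMS; a commercial engine would need its own argument list.

```python
import shutil
import subprocess


def build_ocr_command(image_path, engine="tesseract", lang="eng"):
    """Build the argument list for OCRing one image to standard output.

    Assumes a Tesseract-style CLI; other engines will need different arguments.
    """
    return [engine, str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, engine="tesseract", lang="eng", timeout=120):
    """Run the OCR engine on one image and return the recognised text."""
    # Fail early with a clear message if the engine is not installed.
    if shutil.which(engine) is None:
        raise FileNotFoundError(f"OCR engine not found on PATH: {engine}")
    result = subprocess.run(
        build_ocr_command(image_path, engine, lang),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output as text
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout,      # do not hang on a damaged scan
    )
    return result.stdout
```

A call such as `ocr_image("scan-page-1.png")` (the file name is only an example) would return the recognised text as a string, provided the engine is installed and the file exists. Any engine that can be wrapped in this way should be usable as an alternative.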

Here are some options that are available:

  • OCRKit (~$75, for OS X) - Integration as a standalone OCR appliance is possible: see here
  • ABBYY OCR for Linux (version 11 CLI, pricing varies according to volume OCRed)

See a comparison of OCR offerings at Wikipedia.

 
