This page describes how to set up an independent OCR appliance for use with the Extended DMS environment. The DMS accesses the OCR appliance and issues commands so that OCR is performed there. The advantage is that the DMS appliance's resources (CPU, RAM) are not consumed by OCR processing. The instructions below describe the setup of a dockerized OCR engine in a separate VM.

Components and setup required:

  1. The DMS appliance must export /<storagepath>/nuxeo/data/ and /<storagepath>/nuxeo/tmp/ via NFS
  2. Provide a core Linux VM ("OCR appliance VM")
    1. Docker must be installed in the OCR appliance VM
    2. The OCR appliance VM must allow remote administration via ssh from the DMS appliance's console, using public key authentication, for a user with administrative rights (<<adminuser>>)
    3. The OCR appliance VM must mount the DMS NFS shares to /var/lib/nuxeo/data/ and /opt/nuxeo/server/tmp/ respectively, and these mounts must be set up to mount automatically and stay mounted
  3. In the DMS appliance, place the script below under the name "tesseract" in /<storagepath>/nuxeo/scripts/ and make it executable (chmod +x). This script contains the commands that drive the OCR process for each file.
  4. Create an ssh key in the DMS appliance using the "ssh-keygen" command, then copy the public key (id_rsa.pub) and paste it into the <<adminuserhome>>/.ssh/authorized_keys file of the OCR appliance
  5. Log in from the DMS appliance to the OCR appliance using ssh. After a successful login, move the DMS appliance's key files "id_rsa", "id_rsa.pub" and "known_hosts" from ~/.ssh/ to /<storagepath>/nuxeo/ssh/
  6. Set key "ocr.engine.name" in PAT_DMS_SETTINGS table of Patricia db to "tesseract"
  7. Add the commands listed under "Scripts" below to the "commands.conf" file of the auto-deploy client specific repository (also found in ~/deploy/config/)
  8. Start a re-deploy of the DMS using the ~/deploy/deploy_script/deploy.sh command as outlined here: Auto-deploy script
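
Steps 1 and 2.3 can be sketched as configuration lines. The helper functions below only print candidate lines for /etc/exports (DMS appliance) and /etc/fstab (OCR appliance VM); they do not change any file. The function names, the example host names and all export and mount options are assumptions for illustration and need to be adapted to the local environment.

```shell
#!/bin/bash
# Sketch only: prints candidate config lines; it does not modify any system file.

# Lines for /etc/exports on the DMS appliance.
# $1 = storage path (stands in for /<storagepath>), $2 = OCR appliance host (hypothetical)
nfs_export_lines() {
    for dir in data tmp; do
        # Export options are an assumption; adjust them to local policy.
        printf '%s/nuxeo/%s %s(rw,sync,no_subtree_check)\n' "$1" "$dir" "$2"
    done
}

# Lines for /etc/fstab on the OCR appliance VM.
# $1 = DMS appliance host (hypothetical), $2 = storage path on the DMS appliance
fstab_lines() {
    # Mount options are an assumption; _netdev delays mounting until the network is up.
    printf '%s:%s/nuxeo/data /var/lib/nuxeo/data nfs defaults,_netdev 0 0\n' "$1" "$2"
    printf '%s:%s/nuxeo/tmp /opt/nuxeo/server/tmp nfs defaults,_netdev 0 0\n' "$1" "$2"
}
```

For example, `nfs_export_lines /srv/storage ocr.example.com` prints one export line for the data directory and one for the tmp directory, which can be reviewed before being added to /etc/exports by hand.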

Scripts:

  • What follows is the "tesseract" script:
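#!/bin/bash
# Called by the DMS with $1 = output file and $2 = input file.
echo "variables $1 $2"
# Run the dockerized OCR engine on the OCR appliance; both NFS-mounted directories are passed into the container.
ssh <<adminuser>>@<<ocr.appliance.name>> "docker run -v /var/lib/nuxeo/data/:/var/lib/nuxeo/data -v /opt/nuxeo/server/tmp/:/opt/nuxeo/server/tmp/ --rm practiceinsight/dms_tesseract:1.0 $1 $2"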

$1 is the path and filename of the output file and $2 is the path and filename of the input file; both are handed over when the DMS calls the tesseract script. The command in the tesseract script must make the OCR engine read the input file (pointed to by $2) and write to the output file (pointed to by $1).

Make sure you replace <<adminuser>> with the user name of an administrative user in the OCR appliance, and <<ocr.appliance.name>> with the proper FQDN or IP address.
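
The script above passes $1 and $2 unquoted into the remote command, so a path containing spaces or shell metacharacters would be split by the remote shell. The following is a minimal sketch of one way to guard against that, assuming the same image and mount points as the script above. The helper names shell_quote and build_remote_cmd are hypothetical and not part of the DMS or the OCR image.

```shell
#!/bin/bash
# Sketch only: builds the remote docker command with each path quoted.
# Image name and mount points are copied from the "tesseract" script.
OCR_IMAGE="practiceinsight/dms_tesseract:1.0"

# Wrap a string in single quotes, escaping any embedded single quote.
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# $1 = output file, $2 = input file (same order as the wrapper receives them).
build_remote_cmd() {
    if [ "$#" -ne 2 ]; then
        echo "usage: build_remote_cmd <output-file> <input-file>" >&2
        return 1
    fi
    printf 'docker run -v /var/lib/nuxeo/data/:/var/lib/nuxeo/data -v /opt/nuxeo/server/tmp/:/opt/nuxeo/server/tmp/ --rm %s %s %s\n' \
        "$OCR_IMAGE" "$(shell_quote "$1")" "$(shell_quote "$2")"
}
```

With these helpers in the wrapper, the ssh line could become `ssh <<adminuser>>@<<ocr.appliance.name>> "$(build_remote_cmd "$1" "$2")"`, which leaves the behaviour unchanged for ordinary paths.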

  • Make sure that, in the "commands.conf" file of the auto-deploy client specific repository, the following commands are added to the nuxeo container definition (under section "elif [ ${1} = "NUXEO" ] then") so that they are added to the container run command:
  add_volume "/<storagepath>/nuxeo/ssh" "/home/nuxeo/.ssh"
  add_volume "/<storagepath>/nuxeo/scripts/tesseract" "/usr/local/bin/tesseract"
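
Before starting the re-deploy, it may help to confirm that the files these two volumes refer to actually exist on the DMS appliance. The check below is a sketch under the assumption that the key files and the script were placed as described in the steps above; the function name check_ocr_prereqs is hypothetical.

```shell
#!/bin/bash
# Sketch only: verifies the files referenced by the two add_volume lines.
# $1 = storage path (stands in for /<storagepath>)
check_ocr_prereqs() {
    base="$1/nuxeo"
    missing=0
    # Key material moved to the ssh directory in step 5.
    for f in ssh/id_rsa ssh/id_rsa.pub ssh/known_hosts; do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $base/$f"
            missing=1
        fi
    done
    # Wrapper script from step 3; it must be executable.
    if [ ! -x "$base/scripts/tesseract" ]; then
        echo "missing or not executable: $base/scripts/tesseract"
        missing=1
    fi
    return "$missing"
}
```

The function prints one line per problem and returns a non-zero status if anything is missing, so it can be chained in front of the deploy command with `&&`.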