...
Given its advantages, the instructions below describe how to set up an OCR engine in a separate VM.
Components and setup required:
- DMS appliance needs /<storagepath>/nuxeo/data/ and /<storagepath>/nuxeo/tmp/ as NFS exports
- Provide the OCR engine in a VM ("OCR appliance")
- Docker must be installed on VM
- OCR appliance must be set up to allow remote administration using ssh from the DMS appliance's console by public key authentication for a user with administrative rights (<<adminuser>>)
- OCR appliance must mount the DMS NFS shares (/<storagepath>/nuxeo/data/ and /<storagepath>/nuxeo/tmp/) to /var/lib/nuxeo/data/ and /opt/nuxeo/server/tmp/ respectively, and make sure these mounts are auto-mounted (kept alive)
- In the DMS appliance, place a script "tesseract" in /<storagepath>/nuxeo/scripts/ and make it executable (chmod +x). This script contains the commands to drive the OCR process for each file.
- Create an ssh key in the DMS appliance using the "ssh-keygen" command, copy the public key (id_rsa.pub) and paste it into the <<adminuserhome>>/.ssh/authorized_keys file of the OCR appliance
- Log in from the DMS appliance to the OCR appliance using ssh. After successful login, move the DMS appliance's key files "id_rsa", "id_rsa.pub" and "known_hosts" from ~/.ssh/ to /<storagepath>/nuxeo/ssh/
- Set key "ocr.engine.name" in PAT_DMS_SETTINGS table of Patricia db to "tesseract"
- Add the change commands below to the "commands.conf" file of the auto-deploy client-specific repository (also found in ~/deploy/config/)
- Start a re-deploy of the DMS using the ~/deploy/deploy_script/deploy.sh command as outlined here: Auto-deploy script
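Before starting the re-deploy, it can help to verify the file-based prerequisites from the list above. The following is a minimal sketch, not part of the auto-deploy tooling: the function names check_ocr_prereqs and check_nfs_mounts are hypothetical, and the storage path has to be passed in by the caller. The first function is meant for the DMS appliance, the second for the OCR appliance.

```shell
#!/bin/bash
# Sketch only: both helper names are hypothetical, not part of the deploy scripts.

# DMS appliance: check the wrapper script and the moved ssh key files.
# $1 = the real value of /<storagepath>
check_ocr_prereqs() {
    local base="$1/nuxeo" f missing=0

    # The "tesseract" script must exist and be executable (chmod +x).
    if [ ! -x "$base/scripts/tesseract" ]; then
        echo "missing or not executable: $base/scripts/tesseract"
        missing=1
    fi

    # The key files moved from ~/.ssh/ must be present.
    for f in id_rsa id_rsa.pub known_hosts; do
        if [ ! -f "$base/ssh/$f" ]; then
            echo "missing: $base/ssh/$f"
            missing=1
        fi
    done

    return "$missing"
}

# OCR appliance: check that both target directories are NFS mounts.
# $1 = mount table to read (assumed to be in /proc/mounts format)
check_nfs_mounts() {
    local mounts_file="${1:-/proc/mounts}" mp missing=0

    for mp in /var/lib/nuxeo/data /opt/nuxeo/server/tmp; do
        # Field 2 is the mount point, field 3 the file system type (nfs, nfs4).
        if ! awk -v mp="$mp" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "$mounts_file"; then
            echo "not mounted via NFS: $mp"
            missing=1
        fi
    done

    return "$missing"
}
```

Both functions print what is missing and return a non-zero status if anything is. If the shares are mounted on demand by an automounter, check_nfs_mounts may report a share as missing until it has been accessed once.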
Scripts:
- Make sure you replace <<adminuser>> with the correct user name of an administrative user on the OCR appliance, and <<ocr.appliance.name>> with its proper FQDN or IP address:
...
Code Block |
---|
#!/bin/bash
echo "variables $1 $2"
ssh <<adminuser>>@<<ocr.appliance.name>> "nice -n 10 docker run -v /var/lib/nuxeo/data/:/var/lib/nuxeo/data -v /opt/nuxeo/server/tmp/:/opt/nuxeo/server/tmp/ --rm jitesoft/tesseract-ocr $1 $2" |
$1 is the path and filename of the output file, and $2 is the path and filename of the input file; both are handed over when the DMS calls the tesseract script. The command in the tesseract script must make the OCR engine read the input file (pointed to by $2) and write to the output file (pointed to by $1).
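The script above passes $1 and $2 to the remote shell unquoted, so a path containing spaces would be split into several arguments. A slightly more defensive variant could look like the sketch below. It is an illustration under assumptions, not the deployed script: OCR_USER and OCR_HOST are hypothetical variables standing in for <<adminuser>> and <<ocr.appliance.name>>, and both arguments are passed on in the same order as in the script above.

```shell
#!/bin/bash
# Sketch of a more defensive "tesseract" wrapper.
# OCR_USER and OCR_HOST are hypothetical stand-ins for <<adminuser>> and
# <<ocr.appliance.name>>.

# Wrap one argument in single quotes for the remote shell;
# every embedded single quote becomes '\''.
quote_arg() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Print the command line that is executed on the OCR appliance.
build_remote_command() {
    printf '%s\n' "nice -n 10 docker run -v /var/lib/nuxeo/data/:/var/lib/nuxeo/data -v /opt/nuxeo/server/tmp/:/opt/nuxeo/server/tmp/ --rm jitesoft/tesseract-ocr $(quote_arg "$1") $(quote_arg "$2")"
}

# Check the argument count, then run the command over ssh.
run_ocr() {
    if [ "$#" -ne 2 ]; then
        echo "usage: tesseract <arg1> <arg2>" >&2
        return 2
    fi
    ssh "${OCR_USER:?set OCR_USER}@${OCR_HOST:?set OCR_HOST}" "$(build_remote_command "$1" "$2")"
}
```

To use it as the wrapper, the script would end with a call such as run_ocr "$@" after OCR_USER and OCR_HOST have been set.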
- Make sure that, in the "commands.conf" file of the auto-deploy client-specific repository, the following commands are added to the nuxeo container definition (under the section "elif [ ${1} = "NUXEO" ] then") so that they are added to the container run command:
Code Block |
---|
add_volume "/<storage_path>/nuxeo/ssh" "/home/nuxeo/.ssh"
add_volume "/<storage_path>/nuxeo/scripts/tesseract" "/usr/local/bin/tesseract" |
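To illustrate the intended effect of these two lines, the sketch below uses a stand-in add_volume that simply collects docker "-v" flags. This is an assumption for illustration only: the real add_volume is defined by the auto-deploy scripts and may work differently.

```shell
#!/bin/bash
# Illustration only: a stand-in for add_volume, NOT the auto-deploy implementation.
VOLUME_ARGS=""

add_volume() {
    # $1 = path on the DMS appliance, $2 = path inside the nuxeo container
    VOLUME_ARGS="${VOLUME_ARGS:+$VOLUME_ARGS }-v $1:$2"
}

add_volume "/<storage_path>/nuxeo/ssh" "/home/nuxeo/.ssh"
add_volume "/<storage_path>/nuxeo/scripts/tesseract" "/usr/local/bin/tesseract"

# Show the flags that would end up in the container run command.
echo "$VOLUME_ARGS"
```

With mounts like these, the container would see the moved key files under /home/nuxeo/.ssh and the wrapper script as /usr/local/bin/tesseract.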