This page describes how to set up Google Cloud based OCR for use with the Extended DMS environment. The DMS issues commands so that documents to be OCR'ed are uploaded to Google Cloud, OCR is performed there, and the result is returned. The advantage is that the DMS appliance's resources (CPU, RAM) are not consumed by OCR processing, and that the OCR process benefits from the high accuracy and potential speed of the Google Cloud Vision technology.

Note that the Google Cloud service is not free of charge and requires an active Google Cloud account. See: https://cloud.google.com/vision/pricing

Also note that the performance of this OCR method is largely determined by your internet connection speed, as all PDF documents must be uploaded to your Google Cloud Storage bucket. In addition, depending on the setup, this OCR method may consume significant RAM if multiple parallel OCR processes are configured.

The instructions below describe the setup of the DMS host machine. Using this OCR method will not significantly affect the performance of the DMS service: the scripts involved only make HTTPS calls to the Google Cloud API and relay documents and return information; they do not perform any processing of the documents themselves.

...

Configure the Google Cloud account: enable the Google Vision API there and create a Google Cloud Storage bucket for the temporary files created during processing
Configure the Google Cloud SDK on the DMS host machine: (i) see https://cloud.google.com/sdk/install and note that if you installed your DMS from the base OVA image provided by Patrix/Practice Insight, your operating system is CentOS; (ii) further see https://cloud.google.com/sdk/docs/initializing and https://cloud.google.com/sdk/docs/authorizing.
Prepare service account key in JSON format for Google Cloud here: https://console.cloud.google.com/apis/credentials/serviceaccountkey
Create 2 permanent environment variables, for example by adding the following 2 lines to /etc/environment (replace <$gcloud_storage_bucket_name> and <$path_to_json> with their respective values):
Code Block
language bash
GCLOUD_OCR_BUCKET=<$gcloud_storage_bucket_name>
GOOGLE_APPLICATION_CREDENTIALS=<$path_to_json>
Install Java (OpenJDK 8) on the DMS host machine by running:
Code Block
language bash
yum install java-1.8.0-openjdk
Download pi-google-ocr.jar (https://www.pace-ip.com/edms/downloads/components/pi-google-ocr.jar) and copy it to /opt/pi-google-ocr.jar on the DMS host machine (this is the path the pi-google-ocr script below reads)
Place the below script with name "pi-google-ocr" in /usr/bin/ of the DMS host machine and make it executable (chmod +x). This script contains the commands to drive the OCR process for each file.
Place the below script with name "piocr" in /<storagepath>/nuxeo/scripts/ and make it executable (chmod +x). This script contains the commands to access the pi-google-ocr script from inside the DMS nuxeo container.
Create an ssh key on the DMS host machine using the "ssh-keygen" command, copy the public key (id_rsa.pub) and paste it into the <<adminuserhome>>/.ssh/authorized_keys file of the DMS host machine (i.e. the same box)
DMS host needs /<storagepath>/nuxeo/data mounted to /var/lib/nuxeo/data and /<storagepath>/nuxeo/tmp mounted to /opt/nuxeo/server/tmp (see below section "Deploy Script changes")
Log in from the DMS host machine to itself using ssh. After successful login, move the DMS host machine's key files "id_rsa", "id_rsa.pub" and "known_hosts" from ~/.ssh/ to /<storagepath>/nuxeo/ssh/
Set the ownership of the files in the /<storagepath>/nuxeo/ssh/ folder to 1000:1000 (e.g. with chown)
Set key "ocr.engine.name" in PAT_DMS_SETTINGS table of Patricia db to "piocr"
Add the commands listed under "Deploy script changes" below to the "commands.conf" file of the auto-deploy client specific repository (also found in ~/deploy/config/)
Start re-deploy of DMS using ~/deploy/deploy_script/deploy.sh command as outlined here: Auto-deploy script
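
Before starting the re-deploy, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch and not part of the DMS tooling: the function name check_ocr_setup and its single argument (the value you use for <storagepath>) are assumptions introduced here for illustration. The paths it checks are the ones named in the steps above, with the jar expected at the location the pi-google-ocr script reads.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not shipped with the DMS).
# Prints one "MISSING ..." line per unmet prerequisite and
# returns non-zero if anything is missing.
check_ocr_setup() {
    local storage_path="$1"   # the value used for <storagepath>
    local missing=0 var tool key

    # Environment variables expected from /etc/environment
    for var in GCLOUD_OCR_BUCKET GOOGLE_APPLICATION_CREDENTIALS; do
        if [ -z "${!var:-}" ]; then
            echo "MISSING env: $var"
            missing=1
        fi
    done

    # If the credentials variable is set, the key file must be readable
    if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] && [ ! -r "${GOOGLE_APPLICATION_CREDENTIALS}" ]; then
        echo "MISSING file: ${GOOGLE_APPLICATION_CREDENTIALS}"
        missing=1
    fi

    # Command-line tools used by the two scripts
    for tool in java gsutil ssh; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "MISSING tool: $tool"
            missing=1
        fi
    done

    # Jar must exist; both scripts must exist and be executable
    [ -f /opt/pi-google-ocr.jar ] || { echo "MISSING file: /opt/pi-google-ocr.jar"; missing=1; }
    [ -x /usr/bin/pi-google-ocr ] || { echo "MISSING script: /usr/bin/pi-google-ocr"; missing=1; }
    [ -x "${storage_path}/nuxeo/scripts/piocr" ] || { echo "MISSING script: ${storage_path}/nuxeo/scripts/piocr"; missing=1; }

    # ssh key files moved into the storage path in the steps above
    for key in id_rsa id_rsa.pub known_hosts; do
        [ -f "${storage_path}/nuxeo/ssh/${key}" ] || { echo "MISSING file: ${storage_path}/nuxeo/ssh/${key}"; missing=1; }
    done

    return "$missing"
}
```

Calling check_ocr_setup with your storage path on the DMS host machine prints nothing and returns 0 when every check passes. It does not verify that the ssh login itself works or that the bucket is reachable.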

Scripts:

...

Code Block

language	bash
title	pi-google-ocr

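#!/usr/bin/env bash
# Usage: pi-google-ocr <output_file> <input_file>
# Upload the input document to the Cloud Storage bucket.
gsutil cp "$2" "gs://${GCLOUD_OCR_BUCKET}"
input_filename=$(basename "$2")
output_filename=$(basename "$1")
# Run the OCR jar on the uploaded document and capture what it prints on stdout.
json=$(java -jar /opt/pi-google-ocr.jar "gs://${GCLOUD_OCR_BUCKET}/${input_filename}" "gs://${GCLOUD_OCR_BUCKET}/${output_filename}")
echo "$json" > "$1"
# Remove the temporary objects from the bucket.
gsutil rm "gs://${GCLOUD_OCR_BUCKET}/${input_filename}"
gsutil rm "gs://${GCLOUD_OCR_BUCKET}/${output_filename}output-1-to-1.json*"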
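
To make the naming of the temporary bucket objects explicit, the sketch below derives the same gs:// URIs the script works with, without contacting Google Cloud. The function name ocr_bucket_uris and the bucket and file names in the usage note are hypothetical; the URI patterns are taken from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bucket URIs that pi-google-ocr uses for one document.
# Arguments: $1 = bucket name, $2 = output file, $3 = input file.
ocr_bucket_uris() {
    local bucket="$1"
    local output_name input_name
    output_name=$(basename "$2")
    input_name=$(basename "$3")

    # Object created by "gsutil cp" and removed again at the end
    echo "upload gs://${bucket}/${input_name}"
    # Destination prefix handed to the jar as its second argument
    echo "result-prefix gs://${bucket}/${output_name}"
    # Pattern removed by the final "gsutil rm"
    echo "cleanup gs://${bucket}/${output_name}output-1-to-1.json*"
}
```

For example, ocr_bucket_uris my-bucket /tmp/out/doc1.json /tmp/in/doc1.pdf lists gs://my-bucket/doc1.pdf as the uploaded object. Because only the base name of each path is used, two documents with the same file name processed in parallel would share the same object names in the bucket.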

piocr script

Code Block

language	bash
title	piocr

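#!/bin/bash
# $1 = output file, $2 = input file (both passed by the DMS)
echo "variables $1 $2"
# Make sure the (possibly empty) ssh config exists, then run pi-google-ocr on the DMS host machine via ssh.
touch ~/.ssh_config && ssh -F ~/.ssh_config <<adminuser>>@<<dms.host.name>> "pi-google-ocr '$1' '$2'"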

$1 is the path and filename of the output file, $2 is the path and filename of the input file that are handed over when the DMS calls the piocr script. The command in piocr script must be such that the OCR engine reads the input file (pointed to by $2) and writes to the output file (pointed to by $1).
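
The calling convention can be tried without Google Cloud by substituting a stand-in engine. The sketch below is purely illustrative: stub_ocr is a hypothetical function, and the text it writes is a placeholder, not the format produced by pi-google-ocr. It only demonstrates the contract that $2 is read and $1 is written.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an OCR engine, using the same argument order as piocr:
#   $1 = output file to write, $2 = input file to read.
stub_ocr() {
    local output_file="$1"
    local input_file="$2"
    local size

    # Refuse to run if the input document cannot be read.
    if [ ! -r "$input_file" ]; then
        echo "cannot read input file: $input_file" >&2
        return 1
    fi

    # Write a placeholder result (name and byte count of the input) to the output file.
    size=$(wc -c < "$input_file" | tr -d ' ')
    echo "placeholder OCR result for $(basename "$input_file") (${size} bytes)" > "$output_file"
}
```

Note the order: the output path comes first and the input path second, which is the reverse of what tools such as cp expect.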

Make sure you replace <<adminuser>> with the correct user name of an administrative user of the DMS host VM, and <<dms.host.name>> with the FQDN or IP address of the DMS host VM.
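
If the piocr script is kept as a template, the two placeholders can be filled in mechanically. This is a minimal sketch: render_piocr is a hypothetical helper name, and the user and host values in the usage note are made-up examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill in the <<adminuser>> and <<dms.host.name>> placeholders
# of a piocr template file and print the result to stdout.
# Arguments: $1 = admin user name, $2 = FQDN or IP address, $3 = template file.
render_piocr() {
    local admin_user="$1"
    local dms_host="$2"
    local template="$3"

    # '|' is the sed delimiter, so the values must not contain '|' or '&'.
    # The dots in the host placeholder are escaped so that they match literally.
    sed -e "s|<<adminuser>>|${admin_user}|g" \
        -e "s|<<dms\.host\.name>>|${dms_host}|g" \
        "$template"
}
```

For example, render_piocr admin dms.example.com piocr.template > piocr writes a script in which the ssh target reads admin@dms.example.com.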

Deploy script changes:

Make sure that in the "commands.conf" file of the auto-deploy client specific repository, the following commands are added to the nuxeo container definition (under section `"elif [${1} = "NUXEO"] then"`) so that they are added to the container run command:

Code Block
add_volume "/<storage_path>/nuxeo/ssh" "/home/nuxeo/.ssh"
add_volume "/<storage_path>/nuxeo/scripts/piocr" "/usr/local/bin/piocr"
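
The auto-deploy script's own definition of add_volume is not shown on this page. As a rough illustration only, assuming it collects Docker-style -v host:container bind mounts for the container run command, a stand-in could look like the following. The VOLUME_ARGS array and the function body are assumptions made for this sketch, not the actual deploy code.

```shell
#!/usr/bin/env bash
# Illustrative stand-in only; the real add_volume is defined by the auto-deploy script.
# Assumption: each call contributes one "-v host_path:container_path" bind mount.
VOLUME_ARGS=()

add_volume() {
    local host_path="$1"
    local container_path="$2"
    VOLUME_ARGS+=("-v" "${host_path}:${container_path}")
}

# The two commands from the code block above, with the placeholder left unexpanded.
add_volume "/<storage_path>/nuxeo/ssh" "/home/nuxeo/.ssh"
add_volume "/<storage_path>/nuxeo/scripts/piocr" "/usr/local/bin/piocr"

# Show the flags that would be appended to the container run command.
printf '%s\n' "${VOLUME_ARGS[*]}"
```

Under that assumption, the first mount makes the key files in the ssh folder available as /home/nuxeo/.ssh inside the container, and the second makes the piocr script callable as /usr/local/bin/piocr, which matches the "piocr" value set for "ocr.engine.name".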
