This page describes how to set up Google Cloud based OCR for use with the Extended DMS environment. The DMS issues commands so that documents to be OCR'ed are uploaded to Google Cloud, OCR is performed there, and the result is returned. The advantage is that the DMS appliance's resources (CPU, RAM) are not consumed by OCR processing, and that OCR benefits from the high accuracy and potential speed of the Google Cloud Vision technology.

Note that the Google Cloud service is not free of charge and requires an active Google Cloud account. See: https://cloud.google.com/vision/pricing

The instructions below describe the setup of the DMS host machine. Using this OCR method will not significantly affect the performance of the DMS service: the scripts involved only make HTTPS calls to the Google Cloud API to relay documents and return information; they do not perform any processing of the documents themselves.

...

  1. Configure the Google Cloud account: enable the Google Vision API there and create a Google Cloud Storage bucket for the temporary files created during processing
  2. Configure the Google Cloud SDK on the DMS host machine: (i) see https://cloud.google.com/sdk/install and note that if you installed your DMS from the base OVA image provided by Patrix/Practice Insight, your operating system is CentOS. (ii) Further see https://cloud.google.com/sdk/docs/initializing and https://cloud.google.com/sdk/docs/authorizing.
  3. Prepare service account key in JSON format for Google Cloud here: https://console.cloud.google.com/apis/credentials/serviceaccountkey
  4. Create two permanent environment variables, for example by adding the following two lines to /etc/environment. Replace <$gcloud_storage_bucket_name> and <$path_to_json> with their respective values::

      GCLOUD_OCR_BUCKET=<$gcloud_storage_bucket_name>
      GOOGLE_APPLICATION_CREDENTIALS=<$path_to_json>

  5. Download pi-gcloud-ocr.jar (link will be provided after successful testing) and save it as /opt/pi-gcloud-ocr.jar

  6. DMS host needs /<storagepath>/nuxeo/data linked to /var/lib/nuxeo/data and /<storagepath>/nuxeo/tmp linked to /opt/nuxeo/server/tmp

  7. Place the script below, named "pi-google-ocr", in /usr/bin/ and make it executable (chmod +x). This script contains the commands that drive the OCR process for each file.
  8. Place the script below, named "piocr", in /<storagepath>/nuxeo/scripts/ and make it executable (chmod +x). This script contains the commands to access the pi-google-ocr script from inside the nuxeo container.
  9. Create an SSH key on the DMS appliance using the "ssh-keygen" command, then copy the public key (id_rsa.pub) and paste it into the <<adminuserhome>>/.ssh/authorized_keys file of the DMS host VM
  10. Log in from the DMS appliance to itself using ssh. After a successful login, move the DMS appliance's key files "id_rsa", "id_rsa.pub" and "known_hosts" from ~/.ssh/ to /<storagepath>/nuxeo/ssh/
  11. Set the key "ocr.engine.name" in the PAT_DMS_SETTINGS table of the Patricia DB to "piocr"
  12. Add the change commands below to the "commands.conf" file of the auto-deploy client-specific repository (also found in ~/deploy/config/)
  13. Start a re-deploy of the DMS using the ~/deploy/deploy_script/deploy.sh command, as outlined here: Auto-deploy script
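
Before starting the re-deploy, a quick pre-flight check on the DMS host can catch a missing variable or file from the steps above. The following is only a minimal sketch and not one of the scripts shipped with the DMS: the function name check_ocr_setup and its messages are made up for illustration, and it assumes the default locations from steps 5 and 7 unless other paths are passed as arguments.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check (hypothetical helper, not part of the DMS scripts).
# Verifies the environment variables from step 4, the jar from step 5
# and the driver script from step 7.

check_ocr_setup() {
    # $1: path to the jar (optional), $2: path to the driver script (optional)
    local jar="${1:-/opt/pi-gcloud-ocr.jar}"
    local driver="${2:-/usr/bin/pi-google-ocr}"
    local failures=0

    # Step 4: bucket name must be set
    if [ -z "${GCLOUD_OCR_BUCKET:-}" ]; then
        echo "MISSING: GCLOUD_OCR_BUCKET is not set"
        failures=$((failures + 1))
    fi

    # Step 4: credentials variable must point to a readable key file
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "MISSING: GOOGLE_APPLICATION_CREDENTIALS is not set"
        failures=$((failures + 1))
    elif [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "MISSING: key file $GOOGLE_APPLICATION_CREDENTIALS is not readable"
        failures=$((failures + 1))
    fi

    # Step 5: the jar must exist
    if [ ! -f "$jar" ]; then
        echo "MISSING: $jar not found"
        failures=$((failures + 1))
    fi

    # Step 7: the driver script must exist and be executable
    if [ ! -f "$driver" ] || [ ! -x "$driver" ]; then
        echo "MISSING: $driver is not an executable file"
        failures=$((failures + 1))
    fi

    if [ "$failures" -eq 0 ]; then
        echo "OK: OCR host prerequisites look complete"
    fi
    # Exit status is the number of missing prerequisites (0 means all present)
    return "$failures"
}
```

Run check_ocr_setup in a fresh login shell so that the values from /etc/environment have been loaded. An exit status of 0 means all four checks passed; otherwise each missing item is listed on its own line.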

...