The new importer extends the old concept with a staging path: all import documents are first imported into the staging path and then, in a second step, moved to their final destinations based on the mapping file. This is much faster than the usual import process (file import to staging is about 5-10 times faster) because it keeps one open transaction for the whole import instead of opening a transaction per file. It also requires no file duplication, as the import sweeper then moves the files to their correct destination paths.
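As a rough illustration of why one open transaction is cheaper than one transaction per file, the following sketch uses Python's built-in sqlite3 module as a stand-in. It is an analogy only: it does not touch Nuxeo, and the table name `docs` and all function names are made up for this example.

```python
import sqlite3


def new_db():
    """Create a throwaway in-memory database with a single table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (name TEXT)")
    return conn


def import_per_file(conn, names):
    """Old style: commit one transaction per document."""
    commits = 0
    for name in names:
        conn.execute("INSERT INTO docs (name) VALUES (?)", (name,))
        conn.commit()  # one commit per file
        commits += 1
    return commits


def import_in_one_transaction(conn, names):
    """New style: one transaction around the whole batch."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO docs (name) VALUES (?)", [(n,) for n in names]
        )
    return 1
```

Both functions store the same rows; the difference is the number of commits, which is typically where the per-file approach loses time.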
Step by Step
To test the new importer, proceed as follows:
- Set the version for nuxeo and casebrowser in containers.conf to at least “1.9.8.2.5”, and make sure that at least 120% of additional storage space is available for the import.
- Pull the latest version of the import script (and of the deploy script, which is always a good idea).
- Stop all DMS Docker containers with "docker stop service cb nuxeo postgres elastic"
- Set dbo.pat_dms_settings file.import.completed to FALSE
- Configure import_script/config/import.conf. Adapt import_script/importer/config.py only if needed; it refers to internal paths that must be static, and the standard settings usually fit.
- Note: If you changed the default Nuxeo password, you need to change it within import_script/importer/config.py too.
- Start the import script with the new parameter ‘fs’ (“./import.sh fs”). This step imports all files into a staging folder in Nuxeo (/default-domain/Workspaces/Patricia/Import/).
- Wait until the import has finished.
- Create the mapping .txt file in the /storage/nuxeo/data/import_mappings/ folder. This import_mappings folder and its content must be owned by user “1000:1000” ("chown -R 1000:1000 /storage/nuxeo/data/import_mappings"). The delimiter for the mapping file is the pipe (“|”).
- Set dbo.pat_dms_settings file.import.completed to TRUE. Set is.import.office.hours to FALSE. (Keep in mind that the reload of pat_dms_settings can take up to 10 minutes.)
- Redeploy the system into production mode with the deploy script as usual. A new import sweeper will move the data from /default-domain/Workspaces/Patricia/Import/ to the correct paths according to the mapping file. You can monitor the progress at INFO level in /storage/logs/nuxeo/server.log. Users can use the system in production mode while the import sweeper is running; however, the system will of course need some resources to move the data around. The import sweeper has a relatively low footprint, though.
- Once the mapping job is done, the file 'mapping.txt' in /storage/nuxeo/data/import_mappings/ is renamed to 'mapping.txt.completed'.
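Some of the checks in the steps above can be scripted. The following is a minimal sketch, not part of the import script: all function names are hypothetical, and the storage check assumes that "120%" means free space of at least 1.2 times the size of the data to import.

```python
from pathlib import Path

MIN_VERSION = "1.9.8.2.5"  # minimum version named in the steps above


def version_tuple(version):
    """Turn '1.9.8.2.5' into (1, 9, 8, 2, 5) so parts compare numerically."""
    return tuple(int(part) for part in version.split("."))


def version_ok(configured, minimum=MIN_VERSION):
    """True if the configured version is at least the minimum."""
    return version_tuple(configured) >= version_tuple(minimum)


def storage_ok(import_bytes, free_bytes, factor=1.2):
    """ASSUMPTION: '120%' is read as 1.2 x the size of the import data."""
    return free_bytes >= import_bytes * factor


def mapping_done(mapping_dir, name="mapping.txt"):
    """True once mapping.txt has been renamed to mapping.txt.completed."""
    folder = Path(mapping_dir)
    completed = folder / (name + ".completed")
    return completed.exists() and not (folder / name).exists()
```

For example, `mapping_done("/storage/nuxeo/data/import_mappings")` can be polled after the redeploy, and the free space for `storage_ok` can be read with `shutil.disk_usage(path).free`. Version strings with non-numeric parts would raise a `ValueError` in this sketch.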
A word of caution regarding sweepers in general: running the sweeper process during production time is possible but not recommended. The normal sweepers can run in parallel with the import sweeper (and while users are working). This is, however, truly demanding in terms of CPU, memory, and storage I/O, so we highly recommend following the usual controlled sweeping process:
(a) Turn off all sweepers and set do.not.modify.case.onupdate to TRUE, as well as do.not.ocr.onupdate to TRUE.
(b) When the import sweeper is done, turn on the email sweeper and wait until it has finished.
(c) Then turn on the metadata sweeper,
(d) the full textsync sweeper,
(e) the preview sweeper and, finally,
(f) set do.not.modify.case.onupdate to FALSE again and turn on text extraction.
We strongly recommend waiting at least until step (d) has concluded before running another import.
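The order of steps (a) to (f) can also be kept as a small checklist structure, for instance in an operator script. This is only a sketch: the names `SWEEP_STEPS`, `remaining_steps` and `next_import_advisable` are hypothetical, and nothing here talks to the system; it just encodes the sequence and the "wait until (d)" recommendation.

```python
# Steps (a)-(f) of the controlled sweeping process, in order.
SWEEP_STEPS = [
    ("a", "turn off all sweepers; set do.not.modify.case.onupdate "
          "and do.not.ocr.onupdate to TRUE"),
    ("b", "once the import sweeper is done: turn on the email sweeper "
          "and wait until it has finished"),
    ("c", "turn on the metadata sweeper"),
    ("d", "turn on the full textsync sweeper"),
    ("e", "turn on preview"),
    ("f", "set do.not.modify.case.onupdate to FALSE and "
          "turn on text extraction"),
]
ORDER = [label for label, _ in SWEEP_STEPS]


def remaining_steps(last_completed=None):
    """Steps still to do after `last_completed` (None: nothing done yet)."""
    start = 0 if last_completed is None else ORDER.index(last_completed) + 1
    return SWEEP_STEPS[start:]


def next_import_advisable(last_completed):
    """Another import is advised only once step (d) has concluded."""
    if last_completed is None:
        return False
    return ORDER.index(last_completed) >= ORDER.index("d")
```

Calling `remaining_steps("c")` returns steps (d) to (f), and `next_import_advisable("c")` is `False` because step (d) has not yet concluded.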