Application for managing document import with OCR and the Tesseract library
Type: XAR
Category: Application
Developed by: Clément Aubin
Active Installs: 3
Rating: 1 vote
License: GNU Lesser General Public License 2.1
Installable with the Extension Manager

Description

This application imports graphical documents containing text (images or PDFs) into wiki pages. The import is performed through Optical Character Recognition (OCR) with the Tesseract library.

Administration

The Tesseract OCR Application comes with two administration sub-categories, which can be reached by clicking on the "OCR" category in the wiki administration.

Configuration

The Tesseract configuration is meant to override some of the default application parameters. Although a UI is available through the wiki administration, you can also use configuration keys in xwiki.properties to define your own custom parameters.

Parameters set through the XWiki administration UI take precedence over the parameters defined in xwiki.properties, if any.
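
The precedence rule above can be sketched as a small lookup. This is only an illustration of the described behavior, not the extension's code: the function name `effective_value` is hypothetical, and treating an empty UI value as "not set" is an assumption.

```python
# Documented defaults for a few of the configuration keys.
DEFAULTS = {
    "ocr.tesseract.dataStoreUpdateInterval": "864000",
    "ocr.tesseract.defaultLanguage": "eng",
    "ocr.tesseract.allowAutoDownload": "true",
}


def effective_value(key, ui_values, properties_values, defaults=DEFAULTS):
    """Return the value for `key`, checking the administration UI first,
    then xwiki.properties, then the built-in defaults.

    Assumption: a missing or empty value means "not set" at that level.
    """
    for source in (ui_values, properties_values, defaults):
        value = source.get(key)
        if value not in (None, ""):
            return value
    return None
```

For example, with `fra` set in the UI and `deu` set in xwiki.properties, the lookup returns `fra`; with nothing set in either place, it falls back to the default `eng`.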

Configuration options

Parameter name | Configuration string in xwiki.properties | Description | Default value
Data path | ocr.tesseract.dataPath | A path to store the Tesseract training files. Note that the training files are actually stored in <DATA_PATH>/tessdata. | XWiki permanent directory
Training files URL | ocr.tesseract.trainingFilesURL | The REST endpoint used to fetch new Tesseract training files. | https://api.github.com/repos/tesseract-ocr/tessdata/contents?ref=3.04.00
Data Store update interval | ocr.tesseract.dataStoreUpdateInterval | The number of seconds that should pass before the training files data store needs to be updated. | 864000 (10 days)
Default language | ocr.tesseract.defaultLanguage | Will be used in 1.2. The default language to use when importing a data file. | English (eng)
Allow auto download | ocr.tesseract.allowAutoDownload | Will be used in 1.2. Defines whether a data file should be automatically downloaded when an import needs it. | true
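
As an illustration, the keys listed above could be set in xwiki.properties as follows. This is a sketch: the data path is a placeholder, and writing the default language as its Tesseract code (eng) is an assumption based on the "English (eng)" default. The other values are the documented defaults.

```properties
# Placeholder path (assumption). Training files end up in <DATA_PATH>/tessdata.
ocr.tesseract.dataPath=/var/lib/xwiki/data

# Documented default REST endpoint for fetching training files.
ocr.tesseract.trainingFilesURL=https://api.github.com/repos/tesseract-ocr/tessdata/contents?ref=3.04.00

# 864000 seconds = 10 days.
ocr.tesseract.dataStoreUpdateInterval=864000

# Assumption: the language is given as its Tesseract code.
ocr.tesseract.defaultLanguage=eng
ocr.tesseract.allowAutoDownload=true
```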

Data Store

The Tesseract data store lists the Tesseract training files stored on the XWiki server and the training files available for download. These training files are used to improve the quality of character recognition when a document is imported. It is therefore important to keep an up-to-date list of available training data files that wiki users can use to import their documents.

Initialization & Update

After the extension is installed, the data store has to be initialized: it creates its own folder, fetches a list of remotely available training files, and checks whether any training files are already available locally. To initialize the store, go to the "Tesseract - Data Store" subsection in the XWiki administration and click on "Update now". You don't need to stay on the data store page during the update process.

By default, the application triggers an update of its data store every time the server is restarted. You can also trigger an update manually from the same administration page. We recommend updating the data store from time to time to check whether new training data files are available for download.
To get the list of available training data files, the application performs REST API calls against a GitHub repository provided by the Tesseract team that contains ready-made data files. You can use another repository or service as long as its REST API matches the GitHub repository contents v3 API.
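
To give an idea of what a compatible service has to return, here is a minimal sketch of reading a GitHub contents v3 listing. It is not the extension's implementation. The function names are hypothetical, and selecting entries by the `.traineddata` suffix is an assumption about which files count as training files. The `name`, `type` and `download_url` fields are part of the GitHub contents API response.

```python
import json
import urllib.request

# Documented default value of ocr.tesseract.trainingFilesURL.
DEFAULT_URL = "https://api.github.com/repos/tesseract-ocr/tessdata/contents?ref=3.04.00"
SUFFIX = ".traineddata"


def parse_training_files(listing):
    """Map language code -> download URL for each *.traineddata file
    in a contents listing (a list of JSON objects)."""
    return {
        entry["name"][: -len(SUFFIX)]: entry["download_url"]
        for entry in listing
        if entry.get("type") == "file" and entry["name"].endswith(SUFFIX)
    }


def fetch_training_files(url=DEFAULT_URL):
    """Fetch the listing over HTTP and parse it (requires network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github.v3+json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_training_files(json.load(response))
```

Any alternative service would need to serve a JSON array whose entries expose at least these fields for a sketch like this to work against it.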

Managing training data files

Once the data store is correctly initialized, a list of training files available for download should be displayed in the UI. You can then download any available training file by clicking on the "Download" button next to it. Like the data store update, this process is asynchronous and will not stop if you leave the administration page. Once a training data file is downloaded, the document import wizard offers the language provided by this training file as a choice for the language of the document to import.

To remove a file from the data store, click on the "Remove" button next to that file on the data store administration page.
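
The effect of these operations on disk can be pictured with a short sketch. It assumes that each language is stored as `<DATA_PATH>/tessdata/<lang>.traineddata`, which follows Tesseract's usual file naming but is not confirmed by this page. The function names are hypothetical and this is not the extension's code.

```python
from pathlib import Path


def tessdata_dir(data_path):
    """Training files live in <DATA_PATH>/tessdata."""
    return Path(data_path) / "tessdata"


def installed_languages(data_path):
    """List the language codes that have a local training file."""
    return sorted(p.stem for p in tessdata_dir(data_path).glob("*.traineddata"))


def remove_language(data_path, lang):
    """Delete the training file for `lang`; return True if a file was removed."""
    target = tessdata_dir(data_path) / f"{lang}.traineddata"
    if target.exists():
        target.unlink()
        return True
    return False
```

Under this assumption, the languages offered by the import wizard correspond to what `installed_languages` returns, and "Remove" corresponds to `remove_language`.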

General Usage

Importing a new document in your wiki using Tesseract is meant to be as simple as possible. Once the application is installed, a new entry in the wiki application panel should be created linking to the Tesseract importation wizard.

On this page, you can upload the document to import (the importer currently supports most common image formats and PDF documents), choose the language of the document (see Managing training data files) and choose the path of the page to create on the wiki.

Once all this information is submitted, a job performs the import and notifies you on the same page when it is finished.

Report a bug or contribute to the project

If you find a bug while working with this extension, feel free to report it on jira.xwiki.org.

If you wish to contribute to the project, you can either pick an issue in Jira's open issues list or help with the project translations. We provide three translation packs, available on l10n.xwiki.org.

Prerequisites & Installation Instructions

We recommend using the Extension Manager to install this extension. To check whether this extension can be installed with the Extension Manager, make sure that the text "Installable with the Extension Manager" is displayed at the top right of this page. Note that installing Extensions while offline is currently not supported, and you would need to use a complex manual method.

You can also use the following manual method, which is useful if this extension cannot be installed with the Extension Manager or if you're using an old version of XWiki that doesn't have the Extension Manager:

  1. Log in to the wiki as a user with Administration rights
  2. Go to the Administration page and select the Import category
  3. Follow the on-screen instructions to upload the downloaded XAR
  4. Click on the uploaded XAR and follow the instructions
  5. You'll also need to install all dependent Extensions that are not already installed in your wiki

Dependencies

Dependencies for this extension (org.xwiki.contrib:application-ocr-tesseract-ui 1.1):

  • org.xwiki.contrib:application-ocr-tesseract-api 1.1
  • org.xwiki.contrib:application-ocr-tesseract-data 1.1
  • org.xwiki.contrib:application-ocr-tesseract-default 1.1
  • org.xwiki.contrib:application-ocr-tesseract-filter 1.1
  • org.xwiki.contrib:application-ocr-tesseract-script 1.1
  • org.xwiki.contrib:application-ocr-ui 1.1
Created by Clément Aubin on 2018/02/17 10:20
    
