Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata

nmishra5
Level 1

I am trying to use the Python notebook template "image processing for text extraction" for a custom requirement. I followed the steps in the Tesseract-OCR plugin documentation, set the notebook to use the plugin's code env, and restarted the kernel before running my code, as I normally do. I get the following TesseractError:

(1, 'Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Could someone please help with this?

ZachM
Dataiker

Hi @nmishra5,

This error indicates that Tesseract wasn't able to find the data file for English.

Could you please verify if the file "/usr/share/tesseract/4/tessdata/eng.traineddata" exists?

If the file doesn't exist, you'll need to install it. For more information, see the "Specific languages" section in the README: https://github.com/dataiku/dss-plugin-tesseract-ocr/tree/v1.0.2#specific-languages
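If it's easier, you can run the check from the notebook itself. This is only a minimal sketch: the helper name find_traineddata is made up for illustration, and it assumes the notebook kernel runs on the same machine (or container) where Tesseract is installed.

```python
from pathlib import Path


def find_traineddata(lang="eng", tessdata_dir="/usr/share/tesseract/4/tessdata"):
    """Return the path to <lang>.traineddata if it exists, else None.

    The default directory is the one reported in the error message;
    pass a different directory if your install uses another location.
    """
    candidate = Path(tessdata_dir) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


path = find_traineddata()
print(path if path else "eng.traineddata not found")
```

If this prints "eng.traineddata not found", the language data is missing from that directory and needs to be installed.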

 

Thanks,

Zach

nmishra5
Level 1
Author

Hi ZachM,

Thanks for your reply. 

I checked, and the file "/usr/share/tesseract/4/tessdata/eng.traineddata" is currently missing.

Could you please share the command to install it?

 

Nirbhay

ZachM
Dataiker

If you're using a RHEL-based distro, such as CentOS or AlmaLinux, you can install it using the following command:

yum install tesseract-langpack-eng

 

If you're using a Debian-based distro, such as Ubuntu, you can install it using the following command:

apt install tesseract-ocr-eng
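
After installing, you may want to confirm that Tesseract now sees the language. The sketch below calls the tesseract CLI with its --list-langs option. The helper names parse_langs and installed_langs are hypothetical, and the header line it skips is an assumption based on what Tesseract 4 typically prints, so adjust it if your output looks different.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumes a header line starting with "List of available languages",
    followed by one language code per line.
    """
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of available languages")
    ]


def installed_langs():
    """Return the installed languages, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    # Some versions print the list to stderr rather than stdout.
    return parse_langs(result.stdout or result.stderr)


langs = installed_langs()
if langs is None:
    print("tesseract executable not found on PATH")
else:
    print("eng installed" if "eng" in langs else "eng still missing", langs)
```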

 

If you're using a different distro or are unsure, could you please let me know which distro (including the version) you're using? For example, CentOS 7.
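
If you're not sure how to find the distro, one option is to read /etc/os-release from the notebook. This is a rough sketch that assumes the file exists (most current Linux distros ship it) and that the kernel runs on the DSS host. The function names are made up for illustration.

```python
def parse_os_release(text):
    """Parse KEY=value lines from os-release content into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def describe_distro(path="/etc/os-release"):
    """Return a short distro description, or None if the file can't be read."""
    try:
        with open(path, encoding="utf-8") as f:
            info = parse_os_release(f.read())
    except OSError:
        return None
    fallback = f"{info.get('NAME', 'unknown')} {info.get('VERSION_ID', '')}".strip()
    return info.get("PRETTY_NAME") or fallback


print(describe_distro() or "could not read /etc/os-release")
```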
