Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata
I am trying to use the Python notebook template "image processing for text extraction" for a custom requirement. I followed the steps in the Tesseract-OCR plugin documentation, set the notebook to use the plugin's code env, and restarted the kernel before running my code, as I normally do. I get the following TesseractError:
(1, 'Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
Could someone please help with this?
Answers
-
Hi @nmishra5,
This error indicates that Tesseract wasn't able to find the data file for English.
Could you please verify if the file "/usr/share/tesseract/4/tessdata/eng.traineddata" exists?
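One way to check this from the notebook itself is a short Python snippet. This is only a sketch: the directory is copied from the error message, and the helper name `traineddata_status` is made up for illustration, not part of the plugin or of pytesseract.

```python
import os
from pathlib import Path

# Directory taken from the error message; adjust it if your install differs.
DEFAULT_TESSDATA_DIR = "/usr/share/tesseract/4/tessdata"


def traineddata_status(lang="eng", tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return (path, exists) for the .traineddata file of one language.

    Hypothetical helper for illustration only.
    """
    path = Path(tessdata_dir) / f"{lang}.traineddata"
    return path, path.is_file()


path, exists = traineddata_status("eng")
print(f"{path}: {'found' if exists else 'MISSING'}")

# The error message also mentions TESSDATA_PREFIX, so show whether it is set.
print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX", "<not set>"))
```

If the file is reported as missing, the language pack is not installed in that directory.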
If the file doesn't exist, you'll need to install it. For more information, see the "Specific languages" section in the README: https://github.com/dataiku/dss-plugin-tesseract-ocr/tree/v1.0.2#specific-languages
Thanks,
Zach
-
nmishra5
Hi ZachM,
Thanks for your reply.
I checked, and the file "/usr/share/tesseract/4/tessdata/eng.traineddata" is currently missing.
Could you please share the command to install it?
Nirbhay
-
If you're using a RHEL-based distro, such as CentOS or AlmaLinux, you can install it using the following command:
yum install tesseract-langpack-eng
If you're using a Debian-based distro, such as Ubuntu, you can install it using the following command:
apt install tesseract-ocr-eng
If you're using a different distro or are unsure, could you please let me know which distro and version you're using? For example, CentOS 7.
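If you're not sure which distro the DSS server runs, a rough way to find out from the notebook is to read /etc/os-release. This sketch assumes that file exists, which is true on most modern Linux distros but not guaranteed; `parse_os_release` is a hypothetical helper name.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse KEY=value lines (os-release style) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


os_release = Path("/etc/os-release")
if os_release.is_file():
    info = parse_os_release(os_release.read_text())
    # ID / ID_LIKE hint at the family: e.g. "rhel" or "debian".
    print("Distro: ", info.get("PRETTY_NAME", "unknown"))
    print("ID:     ", info.get("ID", "unknown"))
    print("ID_LIKE:", info.get("ID_LIKE", "unknown"))
else:
    print("/etc/os-release not found; please check the distro another way.")
```

An ID or ID_LIKE containing "rhel" or "fedora" points to the yum command, while "debian" or "ubuntu" points to the apt command.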