I am trying to use the Python notebook template "image processing for text extraction" for a custom requirement. I followed the steps in the Tesseract-OCR plugin documentation and set the plugin code env in the notebook. Before running my code, I restarted the kernel, as I normally do, to make sure everything works properly. I get the following TesseractError:
(1, 'Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
Could someone please help with this?
This error indicates that Tesseract wasn't able to find the data file for English.
Could you please verify if the file "/usr/share/tesseract/4/tessdata/eng.traineddata" exists?
If the file doesn't exist, you'll need to install it. For more information, see the "Specific languages" section in the README: https://github.com/dataiku/dss-plugin-tesseract-ocr/tree/v1.0.2#specific-languages
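One way to check this from the notebook itself is sketched below. It is only a minimal example: the path is copied from the error message above, the helper name `check_traineddata` is made up for illustration, and it assumes the notebook kernel runs on the same machine (or container) where Tesseract is installed.

```python
import os

# Path copied from the error message; adjust it if your error shows a different one.
TRAINEDDATA_PATH = "/usr/share/tesseract/4/tessdata/eng.traineddata"


def check_traineddata(path=TRAINEDDATA_PATH):
    """Return True if the language data file exists and is readable.

    Hypothetical helper for illustration; it is not part of the plugin.
    """
    return os.path.isfile(path) and os.access(path, os.R_OK)


# Show what Tesseract is told to use, if the variable is set at all.
print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
print(TRAINEDDATA_PATH, "found:", check_traineddata())
```

If the last line reports `found: False`, the language data is missing (or not readable by the user running the kernel) and needs to be installed.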
Thanks for your reply.
I checked, and the file "/usr/share/tesseract/4/tessdata/eng.traineddata" is currently missing.
Could you please help me with the command to install it?
If you're using a RHEL-based distro, such as CentOS or AlmaLinux, you can install it using the following command:
yum install tesseract-langpack-eng
If you're using a Debian-based distro, such as Ubuntu, you can install it using the following command:
apt install tesseract-ocr-eng
If you're using a different distro or are unsure, could you please let me know what distro (including the version) that you're using? For example, CentOS 7.
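If you are unsure which distro you have, the sketch below reads /etc/os-release, which most current Linux distributions provide, and prints the name and version. The helper `parse_os_release` is made up for illustration, and the result describes the machine where the notebook kernel runs, which may not be the DSS host if the kernel is containerized.

```python
import os


def parse_os_release(text):
    """Parse KEY=value lines (the /etc/os-release format) into a dict.

    Hypothetical helper for illustration only.
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        info[key.strip()] = value.strip().strip("\"'")
    return info


OS_RELEASE_PATH = "/etc/os-release"

if os.path.exists(OS_RELEASE_PATH):
    with open(OS_RELEASE_PATH) as f:
        info = parse_os_release(f.read())
    print("Distro:", info.get("NAME"), "Version:", info.get("VERSION_ID"))
else:
    print(OS_RELEASE_PATH, "not found; please check the distro another way.")
```

A "Distro:" line naming CentOS, AlmaLinux, or another RHEL-based system points to the yum command above, while Ubuntu or Debian points to the apt command.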