Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata

nmishra5
nmishra5 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 Partner

I am trying to use the Python notebook template "image processing for text extraction" for a custom requirement. I followed the steps in the Tesseract-OCR plugin documentation, set the notebook to use the plugin's code env, and restarted the kernel before running my code, as I normally do. I get the following TesseractError:

(1, 'Error opening data file /usr/share/tesseract/4/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Could someone please help with this?

Answers

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker

    Hi @nmishra5,

    This error indicates that Tesseract wasn't able to find the data file for English.

    Could you please verify if the file "/usr/share/tesseract/4/tessdata/eng.traineddata" exists?
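
    One way to run this check from the same notebook is sketched below. The helper name find_traineddata and the constant DEFAULT_TESSDATA_DIR are hypothetical and not part of the plugin; the directory is simply the one taken from the error message, so adjust it if your installation differs.

```python
import os
from pathlib import Path

# Directory taken from the error message in this thread (an assumption for other setups).
DEFAULT_TESSDATA_DIR = "/usr/share/tesseract/4/tessdata"


def find_traineddata(lang="eng", tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return the path to <lang>.traineddata if it exists as a file, else None."""
    candidate = Path(tessdata_dir) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


# The error message also mentions TESSDATA_PREFIX, so show its current value.
print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX", "(not set)"))

found = find_traineddata()
print(found if found else "eng.traineddata not found in " + DEFAULT_TESSDATA_DIR)
```

    If the last line reports that the file was not found, the language data most likely still needs to be installed.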

    If the file doesn't exist, you'll need to install it. For more information, see the "Specific languages" section in the README: https://github.com/dataiku/dss-plugin-tesseract-ocr/tree/v1.0.2#specific-languages

    Thanks,

    Zach

  • nmishra5
    nmishra5 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 Partner

    Hi ZachM,

    Thanks for your reply.

    I checked, and the file "/usr/share/tesseract/4/tessdata/eng.traineddata" is currently missing.

    Could you please share the command to install it?

    Nirbhay

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker
    edited July 17

    If you're using a RHEL-based distro, such as CentOS or AlmaLinux, you can install it using the following command:

    yum install tesseract-langpack-eng

    If you're using a Debian-based distro, such as Ubuntu, you can install it using the following command:

    apt install tesseract-ocr-eng

    If you're using a different distro or are unsure, could you please let me know which distro (including the version) you're using? For example, CentOS 7.
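
    If you are unsure which distro the DSS host runs, a rough sketch like the one below can read /etc/os-release and suggest one of the two commands above. The function names parse_os_release and suggest_install_command are hypothetical, the ID lists are only an approximation of the two distro families, and very old releases may not ship /etc/os-release at all.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def suggest_install_command(info):
    """Map the distro family to one of the install commands from this thread."""
    ids = {info.get("ID", "")} | set(info.get("ID_LIKE", "").split())
    if ids & {"rhel", "centos", "fedora", "almalinux"}:
        return "yum install tesseract-langpack-eng"
    if ids & {"debian", "ubuntu"}:
        return "apt install tesseract-ocr-eng"
    return None  # Unknown family: report the distro and version instead.


os_release = Path("/etc/os-release")
if os_release.is_file():
    info = parse_os_release(os_release.read_text())
    print(info.get("PRETTY_NAME", "unknown distro"))
    print(suggest_install_command(info) or "No suggestion; please share the distro and version.")
```

    The sketch only prints a suggestion. The install command itself still has to be run on the host, typically with root privileges.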
