Using Tesseract to read PDF

mhussain79
Level 1

Hi all,

 

I have a Python script that uses the Tesseract engine to extract text from scanned PDF files. I have already tried the Tesseract OCR plugin, but the results aren't what I am looking for. The script I wrote on my laptop works fine. However, when I run the same code on the Dataiku server, I get this error.

Both the Python script and the Dataiku notebook error are attached here.

 

Please let me know how to fix this issue.

Thanks

2 Replies
JuanE
Dataiker

The error message is quite clear: you need to install Tesseract version 3.05 or newer on the DSS server so that the pytesseract library can work properly. There are more details in the library documentation:

https://github.com/madmaze/pytesseract#installation
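
To confirm what the server actually has, something like the following could be run in a DSS Python notebook. It is only a minimal sketch using the standard library: it looks for a `tesseract` binary on the PATH and parses the version it reports. The helper names `parse_tesseract_version` and `find_tesseract` are made up for illustration, and the `(3, 5)` threshold is simply the "3.05 or newer" requirement mentioned above.

```python
import re
import shutil
import subprocess

# Assumption: "3.05 or newer" is compared as (major, minor) = (3, 5).
MIN_VERSION = (3, 5)


def parse_tesseract_version(output):
    """Return (major, minor) parsed from `tesseract --version` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def find_tesseract():
    """Return (path, version) for the tesseract binary on PATH, or (None, None)."""
    path = shutil.which("tesseract")
    if path is None:
        return None, None
    # Merge stderr into stdout: some releases print the version banner to stderr.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return path, parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    path, version = find_tesseract()
    if path is None:
        print("No tesseract binary found on PATH; it needs to be installed on the server.")
    elif version is None:
        print("Found", path, "but could not parse its version.")
    elif version < MIN_VERSION:
        print("Found", path, "with version", version, "which is older than 3.05.")
    else:
        print("Found", path, "with version", version)
```

If the binary is installed but not on the PATH, the pytesseract README describes pointing the library at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.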

shivangisingh88
Level 1

Hi mhussain79,
Could you please provide the Python code for it?
