
using tesseract to read pdf


Hi all,


I have a Python script that uses the Tesseract engine to extract text from scanned PDF files. I have already tried the Tesseract OCR plugin, but the results aren't what I am looking for. The Python script that I wrote on my laptop works fine. However, when I run the same code on the Dataiku server, I get this error.

Both the Python script and the Dataiku notebook error are attached here.


Please let me know how to fix this issue.


1 Reply

The error message is quite clear. You need to install Tesseract version 3.05 or newer on the DSS server so that the pytesseract library can work properly. There are more details in the library documentation:
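To confirm whether the Tesseract binary is actually visible from the DSS code environment, something like the following could be run in a notebook on the server. This is only a minimal sketch using the Python standard library; the helper names `check_tesseract` and `parse_tesseract_version` are made up for illustration, and it assumes the executable is called `tesseract` and that the first line of its `--version` output looks like `tesseract 4.1.1`.

```python
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return the version token from the first line of `tesseract --version` output."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    parts = lines[0].split()
    # Assumption: the first line looks like "tesseract 4.1.1".
    return parts[1] if len(parts) >= 2 else None


def check_tesseract(binary="tesseract"):
    """Return (path, version), or (None, None) if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None, None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some releases print the version banner to stderr rather than stdout.
    return path, parse_tesseract_version(result.stdout or result.stderr)


if __name__ == "__main__":
    path, version = check_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; it needs to be installed on the server.")
    else:
        print(f"Found Tesseract at {path}, reported version: {version}")
```

If the binary is installed but not on the PATH seen by the notebook, pytesseract also lets you point to it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.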
