
How to read all words in pdf

LaurentS
Level 3

Hello,

I am looking for a way to run NLP on PDF files.

I know there is a tutorial on importing a PDF into a project, but it only covers extracting tables from the PDF.

My objective is to extract all the sentences from the PDF, including the content of tables.

I understand that some Python code will be necessary, and I am fine with that.

I would be grateful if someone could tell me how to read all the words and sentences of natural-language text in a PDF file.
 
Kindest regards
HarizoR
Developer Advocate

Hi Laurent,

If you want to extract actual text from PDF files within DSS, you can use the Tesseract plugin. It is based on the Tesseract Engine and allows you to perform OCR on a variety of input formats. 

Note that the plugin only works if Tesseract is installed on the machine hosting your DSS instance; this is a mandatory prerequisite.

Best,

Harizo

LaurentS
Level 3
Author

Hi and thanks for your help. 

I have tried this plugin, but I am a bit disappointed by the results.

Anyway, thanks a lot for suggesting it. I'll try another solution.

Kindest regards. 

Mahmoud
Level 1

Hi Laurent,

Did you manage to find another solution for this? I'd love to get some more insight!

Thanks in advance,

LaurentS
Level 3
Author

Hi, I managed it by using another library, pdftotext (import pdftotext).

So, in the end, I used Python code rather than the proposed plugin.

If you are curious about it, please refer to the following:

pip install pdftotext
https://pypi.org/project/pdftotext/#description
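
For readers who want to see what this looks like in practice, here is a minimal sketch, not necessarily the exact code used in this thread. It reads every page with pdftotext.PDF and then splits the text into sentences. The function names read_pdf_pages and split_sentences are hypothetical, and the regex-based splitter is a deliberately naive assumption; a proper NLP tokenizer would handle abbreviations and decimals better.

```python
import re


def read_pdf_pages(path):
    """Return the text of each page of the PDF at `path` as a list of strings."""
    # Imported lazily so split_sentences() stays usable without the package.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        # A pdftotext.PDF object is iterable; each item is one page of text.
        return [page for page in pdf]


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    return re.split(r"(?<=[.!?])\s+", flat)
```

A call such as split_sentences("\n\n".join(read_pdf_pages("report.pdf"))), where report.pdf is a placeholder file name, would then give one list of sentences for the whole document. Text inside tables comes out as plain lines, so it may need extra cleaning.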

Hope it will help ^^

Kindest regards

Sunshine
Level 1

Hi @LaurentS, I tried installing pdftotext in a Dataiku code env, but it ends with the error below.

How did you install pdftotext in a Dataiku code env? Any help would be appreciated.

 

pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
 #include <poppler/cpp/poppler-document.h>
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pdftotext

 

HarizoR
Developer Advocate

Hi Sunshine,

According to the error message, you are missing a system dependency: the compiler cannot find the Poppler C++ header poppler/cpp/poppler-document.h. Please double-check that you have fulfilled all the prerequisites listed in the library's documentation before installing your package.
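
As a quick way to check this on the machine hosting DSS, the sketch below asks pkg-config whether the Poppler C++ development files are visible. It rests on assumptions: that pkg-config is installed, and that the development package registers itself under the module name poppler-cpp (on Debian or Ubuntu this is usually provided by libpoppler-cpp-dev, but package names vary by distribution). The function name and its which/run parameters are hypothetical; the parameters only exist to make the check easy to test.

```python
import shutil
import subprocess


def poppler_cpp_available(which=shutil.which, run=subprocess.run):
    """Return True or False depending on whether pkg-config finds poppler-cpp.

    Returns None when pkg-config itself is not installed, since the check
    cannot be made in that case.
    """
    if which("pkg-config") is None:
        return None
    # `pkg-config --exists <module>` exits with status 0 when the module is found.
    result = run(["pkg-config", "--exists", "poppler-cpp"])
    return result.returncode == 0
```

If the function returns False or None, installing the system packages listed in the pdftotext documentation and then rebuilding the code env is the likely next step.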

Best,

Harizo