How to read all words in pdf

LaurentS
Level 3

Hello,

I am looking for a way to run NLP on PDF files.

I know that there is a tutorial on importing a PDF into a project, but that tutorial covers extracting tables from a PDF.

My objective is to extract the sentences from the PDF, including the text inside tables.

I understand that some Python code would be necessary, and I am OK with this.

I would be grateful if someone could tell me how to read all the words and sentences of natural-language text in a PDF file.
 
Kindest regards
1 Solution
HarizoR
Dataiker

Hi Laurent,

If you want to extract actual text from PDF files within DSS, you can use the Tesseract plugin. It is based on the Tesseract Engine and allows you to perform OCR on a variety of input formats. 

Note that for the plugin to work properly, Tesseract must be installed on the machine hosting your DSS instance.

Best,

Harizo
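
For readers who would rather call Tesseract from a Python recipe than go through the plugin, here is a minimal sketch. It is not the plugin's own code. It assumes the third-party packages pdf2image (which needs Poppler) and pytesseract (which needs the Tesseract binary) are available in the code environment, and the function names `ocr_pdf`, `split_sentences` and `extract_words` are made up for illustration. The sentence splitter is deliberately naive and will break on abbreviations such as "e.g.".

```python
import re


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Rasterize each page of a PDF and run Tesseract OCR on it.

    Assumption: pdf2image and pytesseract are installed, along with the
    Poppler and Tesseract binaries they wrap.
    """
    # Imported lazily so the text helpers below work without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return "\n".join(
        pytesseract.image_to_string(page, lang=lang) for page in pages
    )


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    # OCR output wraps lines arbitrarily, so collapse all whitespace first.
    flat = " ".join(text.split())
    if not flat:
        return []
    return re.split(r"(?<=[.!?])\s+", flat)


def extract_words(text):
    """Return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)
```

With a hypothetical file `report.pdf`, `split_sentences(ocr_pdf("report.pdf"))` would give the list of sentences and `extract_words(ocr_pdf("report.pdf"))` the list of words. If the PDF already carries a text layer, OCR is likely to be slower and less accurate than reading that layer directly.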

LaurentS
Level 3
Author

Hi, and thanks for your help.

I have used this plugin and I am a bit disappointed by the results.

Anyway, thanks a lot for suggesting this plugin. I'll try another solution.

Kindest regards. 
