How to read all words in a PDF
Hello,
I am looking for a way to do NLP on PDF files.
I know that there is a tutorial on importing a PDF into a project, but that tutorial is about extracting tables from a PDF.
Best Answer
-
Hi Laurent,
If you want to extract actual text from PDF files within DSS, you can use the Tesseract plugin. It is based on the Tesseract Engine and allows you to perform OCR on a variety of input formats.
Note that for the plugin to work properly, having Tesseract installed on the machine hosting your DSS instance is a mandatory pre-requisite.
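A quick way to confirm that pre-requisite is to check whether the tesseract binary is visible on the PATH. The snippet below is only a sketch: the helper name tesseract_version is made up for illustration, and it has to run on the machine hosting DSS (for example from a Python notebook or recipe) for the result to mean anything.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if the binary is not on PATH.

    Hypothetical helper for illustration; it is not part of the plugin.
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some releases print the version on stderr rather than stdout, so check both.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    else:
        print("found:", version)
```

If this reports that tesseract was not found, the plugin will most likely fail for the same reason.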
Best,
Harizo
Answers
-
LaurentS Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 21 ✭✭✭✭
Hi and thanks for your help.
I have used this plugin, but I am a bit disappointed by the results.
Anyway, thanks a lot for suggesting this plugin. I'll try another solution.
Kindest regards.
-
Hi Laurent,
Did you manage to find another solution for this? I'd love to get some more insight!
Thanks in advance,
-
LaurentS
Hi, I managed by importing other libraries, in particular pdftotext.
So, in the end, I used Python code, not the proposed plugin.
If you are curious about it, please refer to the following:
pip install pdftotext
https://pypi.org/project/pdftotext/#description
Hope it helps ^^
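For anyone landing here later, this is roughly what the Python side can look like. It is a minimal sketch, not the exact code I used: pdftotext.PDF and joining the pages with "\n\n" follow the library's README, while the helper names pdf_to_text and words, the file name report.pdf, and the simple regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def pdf_to_text(path):
    """Return the text of every page of the PDF at `path`, joined by blank lines."""
    # Imported here so the rest of the module still loads if pdftotext is not installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
    # A pdftotext.PDF object behaves like a sequence of page strings.
    return "\n\n".join(pdf)


def words(text):
    """Lowercase the text and split it into word tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    # "report.pdf" is a hypothetical file name; replace it with a real path.
    tokens = words(pdf_to_text("report.pdf"))
    print(Counter(tokens).most_common(10))
```

The list returned by words() can then be fed to whatever NLP step comes next (stop-word removal, counting, vectorizing, and so on).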
Kindest regards
-
Hi @LaurentS,
I tried installing pdftotext in a Dataiku code env, but it ends with the error below. How did you install pdftotext in a Dataiku code env? Any help?
pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
 #include <poppler/cpp/poppler-document.h>
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pdftotext
-
Hi Sunshine,
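According to the error message, you are missing a system dependency: the compiler cannot find poppler/cpp/poppler-document.h, which is one of the Poppler C++ development headers that pdftotext is built against. Please double-check that all the pre-requisites listed in the library's documentation are installed on the machine before installing the package in your code env.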
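If it helps, here is a small sketch for checking that dependency from Python before retrying the install. It assumes pkg-config is available and that the Poppler development package registers itself under the module name poppler-cpp; the function name poppler_cpp_available is made up for illustration. As far as I know, the pdftotext README names libpoppler-cpp-dev on Debian/Ubuntu and poppler-cpp-devel on Red Hat-like systems as the packages that provide these headers.

```python
import shutil
import subprocess


def poppler_cpp_available():
    """Return True or False depending on whether pkg-config finds poppler-cpp.

    Returns None when pkg-config itself is not installed, since the check
    cannot be made in that case. Hypothetical helper for illustration.
    """
    if shutil.which("pkg-config") is None:
        return None
    # `pkg-config --exists NAME` exits with status 0 only when NAME.pc is found.
    result = subprocess.run(["pkg-config", "--exists", "poppler-cpp"])
    return result.returncode == 0


if __name__ == "__main__":
    status = poppler_cpp_available()
    if status is None:
        print("pkg-config is not installed, cannot check for poppler-cpp")
    elif status:
        print("poppler-cpp development files found")
    else:
        print("poppler-cpp development files are missing")
```

The check has to run on the machine where the code env is built, since that is where pip compiles the package.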
Best,
Harizo