How to read all the words in a PDF

LaurentS
LaurentS Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 21 ✭✭✭✭

Hello,

I am looking for a way to run NLP on PDF files.

I know there is a tutorial on importing a PDF into a project, but it only covers extracting tables from the PDF.

My objective is to extract all the sentences from the PDF, including the text in tables.
I understand that some Python code will be necessary, and I am fine with that.
I would be grateful if someone could tell me how to read all the words and sentences of natural-language text in a PDF file.
Kindest regards

Best Answer

  • HarizoR
    HarizoR Dataiker, Alpha Tester, Registered Posts: 138 Dataiker
    Answer ✓

    Hi Laurent,

    If you want to extract actual text from PDF files within DSS, you can use the Tesseract plugin. It is based on the Tesseract Engine and allows you to perform OCR on a variety of input formats.

    Note that for the plugin to work properly, Tesseract must be installed on the machine hosting your DSS instance.
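
As a quick way to verify that prerequisite, here is a minimal sketch that looks for the Tesseract executable on the PATH and reports its version. It assumes the binary is named `tesseract` and that the script runs on the DSS host under the same user as DSS; the helper names `find_tesseract` and `tesseract_version` are made up for this illustration and are not part of the plugin.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line printed by `<binary> --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version banner goes to stdout or to stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


# Assumption: the default binary name "tesseract" is the one used on your host.
print(tesseract_version() or "Tesseract not found on PATH")
```

If this prints "Tesseract not found on PATH", the plugin is unlikely to work until Tesseract is installed on that machine.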

    Best,

    Harizo

Answers

  • LaurentS
    LaurentS Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 21 ✭✭✭✭

    Hi and thanks for your help.

    I have used this plugin, but I am a bit disappointed by the results.

    Anyway, thanks a lot for suggesting this plugin. I'll try another solution.

    Kindest regards.

  • Mahmoud
    Mahmoud Partner, Registered Posts: 1 Partner

    Hi Laurent,

    Did you manage to find another solution for this? I'd love to get some more insight!

    Thanks in advance,

  • LaurentS
    LaurentS Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 21 ✭✭✭✭

    Hi, I managed it by using another library, namely pdftotext.

    So, in the end, I used Python code rather than the suggested plugin.

    If you are curious, please refer to the following:

    pip install pdftotext
    https://pypi.org/project/pdftotext/#description
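
For readers who want a starting point, below is a minimal sketch of how the library can be combined with a simple sentence splitter. It is not the exact code used in this thread. `pdftotext.PDF` is the class documented on the PyPI page above; the helpers `read_pdf_pages`, `split_sentences` and `pdf_to_sentences`, and the regex-based splitting rule, are assumptions made for illustration. A real NLP pipeline would probably use a proper sentence tokenizer instead.

```python
import re


def read_pdf_pages(pdf_path):
    """Return the text of each page of the PDF as a list of strings."""
    # Third-party import kept local so split_sentences works without pdftotext.
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return [page for page in pdf]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def pdf_to_sentences(pdf_path):
    """Read every page of the PDF and return one flat list of sentences."""
    pages = read_pdf_pages(pdf_path)
    return split_sentences("\n\n".join(pages))
```

Calling `pdf_to_sentences("my_file.pdf")` (a hypothetical path) returns a list of strings that can then be written to a dataset. Note that the splitter will mis-handle abbreviations such as "e.g." and that text inside tables comes out in reading order, not as structured cells.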

    Hope it will help ^^

    Kindest regards

  • Sunshine
    Sunshine Registered Posts: 1 ✭✭✭
    edited July 17

    Hi @LaurentS, I have tried installing pdftotext in a Dataiku code env, but it ends with the error below.

    How did you install pdftotext in a Dataiku code env? Any help?

    pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
     #include <poppler/cpp/poppler-document.h>
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
    ERROR: Failed building wheel for pdftotext

  • HarizoR
    HarizoR Dataiker, Alpha Tester, Registered Posts: 138 Dataiker

    Hi Sunshine,

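    According to the error message, you are missing a system dependency: the compiler cannot find poppler/cpp/poppler-document.h, which belongs to the Poppler C++ development files. Please double-check that you have fulfilled all the prerequisites listed in the library's documentation before installing the package.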
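
To check this before retrying the build, here is a small sketch that asks `pkg-config` whether the Poppler C++ development files are visible on the machine. It assumes that `pkg-config` is installed and that the module is registered under the name `poppler-cpp`; the helper `pkg_config_has` is a made-up name for this example. On Debian or Ubuntu the development files usually come from the `libpoppler-cpp-dev` package, but package names differ between distributions, so treat that as a hint and not as a guarantee.

```python
import shutil
import subprocess


def pkg_config_has(module="poppler-cpp"):
    """Return True or False depending on whether pkg-config finds the module.

    Returns None when pkg-config itself is not installed, because in that
    case nothing can be concluded about the module.
    """
    if shutil.which("pkg-config") is None:
        return None
    result = subprocess.run(["pkg-config", "--exists", module])
    return result.returncode == 0


status = pkg_config_has()
if status is None:
    print("pkg-config is not installed; cannot check for poppler-cpp")
elif status:
    print("poppler-cpp development files found")
else:
    print("poppler-cpp development files are missing")
```

The check has to run on the machine where the code env is built, since that is where pip compiles the package.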

    Best,

    Harizo
