Read text from PDF and PowerPoint

User
User Registered Posts: 5 ✭✭✭

Hello,

I need to extract text from different kinds of files (pdf, ppt, but also docx).
I first tried with pdf files, but:
- The Tesseract plugin does not really work
- Tabula only extracts tables, so I do not get the full text
- I tried pdfplumber (the module I usually use for pdf), but I get the following error:

UnsupportedOperation: seek

How can I handle this? I expect I will get the same kind of error with other document types.

Thank you in advance for your help.
Best regards

Answers

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Hi @User,

    In this post @HarizoR mentioned a prerequisite for Tesseract to work properly, and other solutions for reading PDFs are presented there as well.

    Hope this helps!

    Jurre

  • User
    User Registered Posts: 5 ✭✭✭

    Hello @Jurre,

    Thank you for your help!
    Indeed it works, but since I will need to extract text from different formats (PDF, but also PowerPoint, Word, etc.), I would like to understand how to manipulate "folder" objects directly with ordinary Python code, since a dedicated plugin doesn't always exist!
    Best

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @User

    I’ve been experimenting with the Tika parser. It has a fairly extensive list of supported file types, but it is also a fairly heavy install. I’ve been able to get it to work with Dataiku DSS.

    https://pypi.org/project/tika/

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Hi @User,

    What got me started were the "advanced code" courses in the Academy, for example this one about managed folders. Another great resource is the forum itself; a lot of very informative posts with code examples can easily be found by entering some basic keywords in the search function (click the magnifying glass at the top right of your screen).

    If the number of input formats is limited, one option might be to convert them to PDF first and then process them; LibreOffice can do that. Another option for the docx/xlsx/pptx formats is to unzip them, since those files are zip archives of XML. @tgb417's suggestion looks promising as well (and very helpful, as it might be the solution to a challenge I'm currently facing with reading some exotic formats). Thanks, Tom!
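
    The unzip idea for docx/pptx can be sketched with the Python standard library alone. This is a minimal sketch, not a replacement for a full parser: it assumes simple documents, reads only the main body of a .docx (`word/document.xml`) and the slide files of a .pptx (`ppt/slides/slideN.xml`), and orders slides by file number, which may differ from the presentation order. The function names `docx_text` and `pptx_text` are hypothetical helpers made up for this illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML (docx) and DrawingML (pptx text runs)
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def docx_text(fileobj) -> str:
    """Return the body text of a .docx, one line per paragraph."""
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every run.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def pptx_text(fileobj) -> str:
    """Return the text of a .pptx, one line per text paragraph."""
    slide_name = re.compile(r"ppt/slides/slide(\d+)\.xml")
    lines = []
    with zipfile.ZipFile(fileobj) as archive:
        # Sort numerically so that slide10 comes after slide2.
        slides = sorted(
            (name for name in archive.namelist() if slide_name.fullmatch(name)),
            key=lambda name: int(slide_name.fullmatch(name).group(1)),
        )
        for name in slides:
            root = ET.fromstring(archive.read(name))
            for para in root.iter(A_NS + "p"):
                lines.append("".join(t.text or "" for t in para.iter(A_NS + "t")))
    return "\n".join(lines)
```

    Both functions accept a path or a seekable file-like object, because `zipfile.ZipFile` needs to seek; bytes read from a managed folder can be wrapped in `io.BytesIO` first.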

  • User
    User Registered Posts: 5 ✭✭✭

    Hello Tom,

    Thank you very much for the advice, I'm going to try this parser.
    I was hoping there was a single way to read from a folder and then work with all the usual Python libraries, but it seems it does not work like that in Dataiku.
    But thanks again
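
    On the `UnsupportedOperation: seek` error from the first post: one plausible cause (an assumption, the thread does not confirm it) is that the file was handed to pdfplumber as a forward-only download stream, while PDF parsers need to seek. A minimal sketch of the usual workaround is to read the stream fully into memory and pass a seekable `io.BytesIO` instead. `ForwardOnlyStream` below is a hypothetical stand-in for such a stream, used only so the example runs on its own; `to_seekable` is likewise a made-up helper name.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a download stream that cannot seek."""

    def __init__(self, payload: bytes):
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller's buffer.
        data = self._inner.read(len(buffer))
        buffer[: len(data)] = data
        return len(data)


def to_seekable(stream) -> io.BytesIO:
    """Read the whole stream into memory and return a seekable buffer."""
    return io.BytesIO(stream.read())


# Seeking on the raw stream fails with the same kind of error as in the thread.
raw = ForwardOnlyStream(b"%PDF-1.4 example bytes")
try:
    raw.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True

# After buffering, seeking works and the bytes are unchanged.
buffer = to_seekable(ForwardOnlyStream(b"%PDF-1.4 example bytes"))
buffer.seek(0)
```

    In a DSS Python recipe the stream would typically come from `dataiku.Folder("my_folder").get_download_stream(path)` (the folder name here is hypothetical), and the resulting buffer can be passed to `pdfplumber.open(buffer)` or to any other library that accepts a file-like object. This loads the whole file into memory, so it suits ordinary documents rather than very large files.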

  • devipram
    devipram Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1 Partner

    Hi,

    I also use the Tika parser to convert PDF to text in Dataiku. It had been working well for months, but it suddenly started failing with the error "Unable to start Tika server" when pointing to a downloaded tika-server.jar on the Linux server that hosts Dataiku DSS.

    Java 8 is also installed on that server, but I still get the same error.

    Could you help me with the setup between Tika, Java, and Dataiku DSS?

    Thank you
