
Read text from pdf and power point

User
Level 2

Hello, 

I need to extract text from different kinds of files (pdf, ppt, and also docx).
I first tried with pdf files, but:
- The Tesseract plugin does not really work
- Tabula only extracts tables, so I do not get the full text
- I tried pdfplumber (the module I usually use for pdf), but I get the following error:

UnsupportedOperation: seek

How can I handle this? I expect to get the same kind of error with other document types.

Thank you in advance for your help.
Best regards

6 Replies
Jurre
Neuron

Hi @User ,

In this post, @HarizoR mentioned a prerequisite for Tesseract to work properly, and other solutions for reading PDFs are presented there as well.

Hope this helps!

Jurre 

 

User
Level 2
Author

Hello @Jurre , 

Thank you for your help! 
Indeed it works, but since I will need to extract text from different kinds of files (pdf, but also PowerPoint, Word, etc.), I would like to understand how to manipulate "folder" objects directly with ordinary Python code, since a dedicated plugin doesn't always exist!
Best

tgb417
Neuron

@User 

I’ve been experimenting with the Tika parser. It supports a fairly extensive list of file types, but it is also a fairly heavy install. I’ve been able to get it to work with Dataiku DSS.

https://pypi.org/project/tika

--Tom
User
Level 2
Author

Hello Tom, 

Thank you very much for the advice; I'm going to try this parser.
I was hoping there was a single way to access a folder and then work with all the usual Python libraries, but it seems Dataiku does not work like that.
But thanks again 🙂 

devipram
Level 1

Hi,

I also use the Tika parser to convert pdf to text in Dataiku. It had been working well for months, but it suddenly gives me the error "Unable to start Tika server" when pointing to a downloaded tika-server.jar on the Linux server that hosts Dataiku DSS.

Java 8 is also installed on that server, but I still get the same error.

Could you help me with the setup between Tika, Java, and Dataiku DSS?

Thank you

Jurre
Neuron

Hi @User , 

What got me started were the "advanced code" courses in the academy, for example this one about managed folders. Another great resource is the forum itself; a lot of very informative posts with code examples can easily be found by entering some basic keywords in the search function (click the magnifying glass at the top right of your screen).
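
For what it's worth, the `UnsupportedOperation: seek` error from the original question is what Python raises when a library tries to rewind a stream that can only be read forward, which is likely the case for the stream you get when reading a file from a managed folder. PDF readers need to seek, so a common workaround is to copy the bytes into an `io.BytesIO` buffer first. Below is a minimal sketch of that idea, not a verified DSS recipe: `to_seekable` and `ForwardOnlyStream` are names made up for illustration, and the commented-out part assumes a managed folder called `my_folder` and pdfplumber installed in the code environment.

```python
import io


def to_seekable(stream):
    """Copy a forward-only binary stream into an in-memory, seekable buffer."""
    return io.BytesIO(stream.read())


class ForwardOnlyStream(io.RawIOBase):
    """Illustrative stand-in for a stream that can be read but not rewound."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Fill the caller's buffer; returns 0 once the data is exhausted.
        return self._inner.readinto(buffer)


# Hypothetical use in a DSS Python recipe (the folder name is an assumption):
#
#   import dataiku
#   import pdfplumber
#
#   folder = dataiku.Folder("my_folder")
#   for path in folder.list_paths_in_partition():
#       if path.lower().endswith(".pdf"):
#           with folder.get_download_stream(path) as stream:
#               with pdfplumber.open(to_seekable(stream)) as pdf:
#                   text = "\n".join(page.extract_text() or "" for page in pdf.pages)
```

The whole file ends up in memory, so this suits ordinary documents better than very large ones. The same buffer trick should apply to any library that accepts a file-like object.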

If the number of input formats is limited, one option might be to convert them to pdf first and then process them. LibreOffice can do that. Alternatively, the docx/xlsx/pptx formats are zip archives, so a tool such as unzip can open them. @tgb417's suggestion looks promising as well (and very helpful, as it might be the solution to a challenge I'm currently facing with reading some exotic formats). Thanks, Tom!
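
To illustrate the unzip idea for docx: a .docx file is a zip archive whose main text lives in `word/document.xml`, so the standard library is enough for a rough extraction. This is only a sketch under simple assumptions. The function name `docx_text` is made up for illustration, it reads the document body only (no headers, footers, or footnotes), and it needs a path or a seekable file object as input.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` is a file path or a seekable binary file object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; join the text of each run.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

pptx and xlsx follow the same zip-plus-XML pattern, but with different internal paths and element names, so each format would need its own variant.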