Read text from pdf and power point

User · February 2022

Hello,

I need to extract text from different kinds of files (PDF, PPT, and also DOCX).
I first tried with PDF files, but:
- The Tesseract plugin does not really work
- Tabula only extracts tables, so I do not get the full text
- I tried pdfplumber (the module I usually use for PDFs), but I get the following error:

UnsupportedOperation: seek

How can I handle this? I expect I will run into the same kind of error with other document types.

Thank you in advance for your help.
Best regards

Jurre · February 2022

Hi @User,

In this post, @HarizoR mentioned a prerequisite for Tesseract to work properly, and other solutions for reading PDFs are presented there as well.

Hope this helps!

Jurre

User · February 2022

Hello @Jurre,

Thank you for your help!
Indeed, it works. But since I will need to extract text from different kinds of documents (PDF, but also PowerPoint, Word, etc.), I would like to understand how to manipulate "folder" objects directly with ordinary Python code, since a dedicated plugin does not always exist!
Best

tgb417 · February 2022

@User

I’ve been experimenting with the Tika parser. It has a fairly extensive list of supported file types, but it is also a fairly heavy install. I’ve been able to get it to work with Dataiku DSS.

https://pypi.org/project/tika

Jurre · February 2022

Hi @User,

What got me started were the "advanced code" courses in the Academy, for example this one about managed folders. Another great resource is the forum itself: a lot of very informative posts with code examples can easily be found with some basic keywords in the search functionality (click the magnifying glass at the top right of your screen).
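
On the `UnsupportedOperation: seek` error reported earlier in the thread: it usually means the library was handed a stream that cannot seek, and a download stream from a managed folder may well be one. A minimal sketch of a workaround is to read the whole stream into an in-memory buffer first. The helper name `to_seekable` and the `NonSeekable` stand-in class are hypothetical and only there to make the example runnable outside DSS.

```python
import io


def to_seekable(stream):
    """Read a (possibly non-seekable) binary stream fully into memory.

    Returns an io.BytesIO, which supports seek() and tell().
    """
    return io.BytesIO(stream.read())


class NonSeekable(io.RawIOBase):
    """Hypothetical stand-in for a download stream that cannot seek."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Delegate the actual reading to the wrapped in-memory buffer.
        return self._inner.readinto(b)


raw = NonSeekable(b"%PDF-1.4 demo bytes")
buffer = to_seekable(raw)   # now safe to hand to a library that seeks
buffer.seek(0)
header = buffer.read(4)
```

In DSS, the stream would presumably come from something like `dataiku.Folder(...).get_download_stream(path)`, and the resulting buffer could then be passed to `pdfplumber.open(...)`, which accepts file-like objects. Note that this loads the whole file into memory, so it is best suited to documents of moderate size.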

If the number of input formats is limited, one option might be to convert them to PDF first and then process them. LibreOffice can do that. Alternatively, the docx/xlsx/pptx formats are zip archives, so you can unzip them and read the contents. @tgb417's suggestion looks promising as well (and very helpful, as it might be the solution to a challenge I'm currently facing with reading some exotic formats). Thanks, Tom!
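
To illustrate the unzip route for docx: the file is a zip archive whose main text lives in `word/document.xml`. Below is a rough sketch using only the standard library. The function names `docx_text` and `make_demo_docx` are hypothetical, and the demo archive is a stripped-down stand-in, not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the text-bearing elements in a .docx
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(file_obj):
    """Return the text of a .docx, one line per paragraph (w:p)."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t nodes.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_demo_docx(lines):
    """Build a tiny in-memory archive with just word/document.xml (demo only)."""
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(line) for line in lines
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>' + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


text = docx_text(make_demo_docx(["Hello", "World"]))
```

`zipfile.ZipFile` needs a path or a seekable file object. The same idea should carry over to pptx, where the slide text sits in `ppt/slides/slideN.xml` under DrawingML `a:t` elements, though layout details such as tables or text boxes would need extra handling.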

User · February 2022

Hello Tom,

Thank you very much for the advice. I'm going to try this parser.
I was hoping there was a single way to transform a folder and then work with all the usual Python libraries, but it seems it does not work like that in Dataiku.
But thanks again.

devipram · May 2022

Hi,

I also use the Tika parser to convert PDF to text in Dataiku. It had been working well for months, but it suddenly gives me the error "Unable to start Tika server" when pointing to a downloaded tika-server.jar on the Linux server that hosts Dataiku DSS.

Java 8 is also installed on that server, but I still get the same error.

Could you help me with the setup between Tika, Java, and Dataiku DSS?

Thank you
