Read text from PDF and PowerPoint files
Hello,
I need to extract text from different kinds of files (pdf, ppt and also docx).
I first tried with pdf files, but:
- The Tesseract plugin does not really work
- Tabula only extracts tables, so I do not get the full text
- I tried using pdfplumber (the module I usually use for pdf), but I get the following error:
UnsupportedOperation: seek
How can I handle this? I think I will run into the same kind of errors with other document types.
Thank you in advance for your help.
Best regards
Answers
-
Jurre
-
Hello @Jurre,
Thank you for your help!
It does work, but since I will need to extract text from different formats (pdf but also PowerPoint, Word, etc.), I would like to understand how to manipulate "folder" objects directly with ordinary Python code, since a dedicated plugin doesn't always exist!
Best
-
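On the `UnsupportedOperation: seek` error: the failing code is not shown, so this is an assumption, but a likely cause is that the file was passed to pdfplumber as a download stream from a managed folder. Such streams can only be read forwards, while PDF parsers need to jump around in the file. A minimal sketch of the usual workaround is to copy the stream into an in-memory buffer first. `to_seekable` and `ForwardOnlyStream` are illustrative names, and the folder name and file path in the comments are placeholders.

```python
import io


def to_seekable(stream, chunk_size=64 * 1024):
    """Copy a forward-only binary stream into an in-memory, seekable buffer."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the consumer starts reading at the beginning
    return buffer


class ForwardOnlyStream(io.RawIOBase):
    """Stand-in for a download stream that cannot seek (demonstration only)."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


raw = ForwardOnlyStream(b"%PDF-1.4 demo bytes")

# Seeking on the raw stream fails in the same way as in the question.
try:
    raw.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True

# After buffering, any library that expects a regular file object can use it.
seekable = to_seekable(raw)

# In a DSS Python recipe this would look roughly like the following
# (not run here; "my_folder" and "/report.pdf" are placeholders):
#
#   import dataiku, pdfplumber
#   folder = dataiku.Folder("my_folder")
#   with folder.get_download_stream("/report.pdf") as stream:
#       with pdfplumber.open(to_seekable(stream)) as pdf:
#           text = "\n".join(page.extract_text() or "" for page in pdf.pages)
```

The same buffer can be handed to other libraries that accept file-like objects, which is why this pattern tends to carry over to other document types. It does hold the whole file in memory, so it is best suited to documents of moderate size.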
tgb417
I’ve been experimenting with the Tika parser. It supports a fairly extensive list of file types, but it is also a fairly heavy install. I’ve been able to get it to work with Dataiku DSS.
-
Jurre
Hi @User,
What got me started were the "advanced code" courses in the academy, for example this one about managed folders. Another great resource is the forum itself; a lot of very informative posts with code examples can easily be found with some basic keywords in the search function (click the magnifying glass at the top right of your screen).
If the number of input formats is limited, an option might be to convert them to pdf first and then process them. LibreOffice can do that. Another option, for the docx/xlsx/pptx formats, is to unzip them and read their contents directly.
@tgb417's suggestion looks promising as well (and very helpful, as it might be the solution for a challenge I'm currently facing with reading some exotic formats). Thanks Tom!
-
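The unzip idea for docx and pptx can be sketched with the standard library only, since both formats are zip archives of XML parts. This is a rough illustration, not a full parser: the function names are made up for the example, it ignores headers, footers, notes and tables-specific layout, and slides are read in file-number order, which usually but not always matches the presentation order.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used for body text in Word and PowerPoint files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def _paragraphs(xml_bytes, ns):
    """Return the non-empty paragraphs found in one XML part."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for para in root.iter(ns + "p"):
        text = "".join(node.text or "" for node in para.iter(ns + "t"))
        if text:
            lines.append(text)
    return lines


def docx_text(file_obj):
    """Extract body text from a .docx given a seekable file object or path."""
    with zipfile.ZipFile(file_obj) as archive:
        return "\n".join(_paragraphs(archive.read("word/document.xml"), W_NS))


def pptx_text(file_obj):
    """Extract slide text from a .pptx given a seekable file object or path."""
    with zipfile.ZipFile(file_obj) as archive:
        slides = [
            name for name in archive.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
        ]
        # Sort numerically so slide10 comes after slide2.
        slides.sort(key=lambda name: int(re.search(r"\d+", name).group()))
        lines = []
        for name in slides:
            lines.extend(_paragraphs(archive.read(name), A_NS))
        return "\n".join(lines)


# Build a tiny docx-like archive in memory to show the call.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "word/document.xml",
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>",
    )
demo.seek(0)
extracted = docx_text(demo)
```

Note that `zipfile.ZipFile` needs a seekable file object, so a file coming from a forward-only stream has to be read into memory (for example into an `io.BytesIO`) before it is passed to these functions.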
Hello Tom,
Thank you very much for the advice; I'm going to try this parser.
I was hoping there was a single way to turn a folder into something all the usual Python libraries could work with, but it seems it does not work like that in Dataiku.
But thanks again
-
devipram
Hi,
I also use the Tika parser for converting pdf to txt in Dataiku. It had been working well for months, but it suddenly gives me the error "Unable to start Tika server" when pointing to a downloaded tika-server.jar on the Linux server that hosts Dataiku DSS.
Java 8 is also installed on that server, but I still get the same error.
Could you help me with the setup between Tika, Java and Dataiku DSS?
Thank you