I need to extract text from different kinds of files (PDF and PPT, but also DOCX).
I first tried with PDF files, but:
- The Tesseract plugin does not really work.
- Tabula only extracts tables, so I do not get the full text.
- I tried using pdfplumber (the module I usually use for PDFs), but I get the following error:
How can I handle this? I think I will get the same kind of errors with other document types.
Thank you in advance for your help.
Hello @Jurre ,
Thank you for your help!
Indeed, it works, but since I will need to extract text from different formats (PDF, but also PowerPoint, Word, etc.), I would like to understand how to manipulate "folder" objects directly with ordinary Python code, since a dedicated plugin doesn't always exist!
Thank you very much for the advice, I'm going to try this parser.
I was hoping there was a single way to transform a folder and work with all the usual Python libraries, but it seems it does not work like that in Dataiku.
But thanks again 🙂
I also use the Tika parser to convert PDF to text in Dataiku. It had been working well for months, but it suddenly gives me the error "Unable to start Tika server" when pointing to the downloaded tika-server.jar on the Linux server that hosts Dataiku DSS.
Java 8 is also installed on that server, but I still get the same error.
Could you help me with the setup between Tika, Java, and Dataiku DSS?
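In case it helps narrow this down, here is a minimal sketch of a pre-flight check that could be run from a Python recipe or notebook on the DSS server. It only verifies that a Java executable is visible on the PATH of the process and that the jar file exists and is readable; it does not start Tika. The function name `check_tika_prereqs` is made up for illustration. As far as I know, tika-python can be pointed at a local jar and a specific Java binary through the `TIKA_SERVER_JAR` and `TIKA_JAVA` environment variables, but please check that against the tika-python README for your version.

```python
import os
import shutil


def check_tika_prereqs(jar_path, java_cmd="java"):
    """Report whether the basic prerequisites for a local Tika server look OK.

    jar_path and java_cmd are whatever your setup uses; nothing here is
    specific to Dataiku or to a particular Tika version.
    """
    return {
        # Is the Java executable visible to *this* process (same user, same PATH)?
        "java_on_path": shutil.which(java_cmd) is not None,
        # Does the jar exist as a regular file at the given location?
        "jar_exists": os.path.isfile(jar_path),
        # Can the current user actually read it?
        "jar_readable": os.access(jar_path, os.R_OK),
    }
```

If `java_on_path` comes back `False` inside DSS while `java -version` works in your own shell, the DSS process is probably running with a different user or PATH than your login session.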
Hi @User ,
What got me started were the "advanced code" courses in the Academy, for example this one about managed folders. Another great resource is the forum itself; a lot of very informative posts with code examples can easily be found with a few basic keywords in the search function (click the magnifying glass at the top right of your screen).
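To give an idea of what working with a managed folder from plain Python can look like, here is a minimal sketch. It assumes a folder object that exposes `list_paths_in_partition()` and `get_download_stream(path)`, which is what `dataiku.Folder` provides; the helper names `iter_folder_files` and `extract_all` and the `extractors` mapping are made up for illustration.

```python
import io


def iter_folder_files(folder):
    """Yield (path, raw_bytes) for every file in a managed folder."""
    for path in folder.list_paths_in_partition():
        # get_download_stream returns a file-like object; read it fully
        with folder.get_download_stream(path) as stream:
            yield path, stream.read()


def extract_all(folder, extractors):
    """Run the extractor registered for each file extension.

    extractors maps a lowercase extension (e.g. "pdf") to a function that
    takes a binary file-like object and returns text. Files with an
    unregistered extension are skipped.
    """
    texts = {}
    for path, data in iter_folder_files(folder):
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
        extractor = extractors.get(ext)
        if extractor is not None:
            # Wrap the bytes so libraries expecting a file object can use them
            texts[path] = extractor(io.BytesIO(data))
    return texts
```

Inside DSS this would be called with something like `extract_all(dataiku.Folder("documents"), {...})`, where `"documents"` is a placeholder folder name. The point of going through a stream is that the same code works whether the folder lives on the local filesystem or on remote storage. Since pdfplumber's `open` accepts a file-like object as well as a path, a PDF extractor can be registered in the mapping in the same way.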
If the number of input formats is limited, one option might be to convert them to PDF first and then process them. LibreOffice can do that; alternatively, the docx/xlsx/pptx formats are zip archives, so unzip can get at their contents directly. @tgb417's suggestion looks promising as well (and very helpful, as it might be the solution to a challenge I'm currently facing with reading some exotic formats). Thanks Tom!
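As an illustration of the unzip route for docx, here is a minimal sketch that uses only the standard library. It relies on the fact that a .docx file is a zip archive whose main body is stored in `word/document.xml`; the function name `docx_to_text` is made up, and headers, footers, and footnotes are ignored in this simplified version.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the paragraph (w:p) and text (w:t) elements
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(fileobj):
    """Return the body text of a .docx, one line per paragraph.

    fileobj can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(fileobj) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of all its w:t nodes
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because it accepts a file-like object, it can be fed the bytes read from a managed folder stream wrapped in `io.BytesIO`. For anything beyond plain body text, a dedicated library such as python-docx is probably the safer choice.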