How to extract text from doc files?
Rakesh
Dataiku DSS Core Designer, Registered Posts: 3 ✭✭
I have a few .doc files in my managed folder, and I want to extract the text from them using a Python recipe.
Please guide me on how I can achieve this.
Or is there a way to convert the .doc files into .docx files programmatically and then extract the text from the converted files?
Thank you in advance.
Operating system used: Windows
Answers
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
Word documents are typically unstructured data, so there are no built-in capabilities to read them. You can use a Python library such as the one linked below, but you will need custom code to extract the data:
https://python-docx.readthedocs.io/en/latest/user/documents.html
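To give an idea of the custom code involved, here is a minimal sketch that uses only the Python standard library instead of python-docx. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml. The function name `extract_docx_text` is made up for illustration, and the sketch only covers body paragraphs (headers, footers and footnotes live in other parts of the archive).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: 'data' is the raw content of the .docx file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back together.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

In a Python recipe, the bytes would probably come from the managed folder, for example by reading the stream returned by `dataiku.Folder("my_folder").get_download_stream(path)`, where "my_folder" is a placeholder name. Note that this only works for .docx: legacy binary .doc files are not ZIP archives, and as far as I know python-docx does not read them either, so they would need to be converted first (for instance with LibreOffice in headless mode, assuming it is available on the DSS server).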