Text extraction with plugins
Hi,
We are working on a POC in which we would like to extract a specific piece of information from sentences in text data, for example the number of hours worked by employees.
For example:
Person 1 says : Hi, I have worked for 40 hrs last week.
Person 2 says : Hi, I was on leave for 2 days and so I have worked for 24 hours.
So from the text input I would like to get 40 hrs and 24 hours as output so that I can aggregate the total number of hours worked by them.
Can you give us an idea of how to fetch the exact content irrespective of the sentence format used? Please also let us know whether we can achieve this with NLP plugins, or whether there is another way.
Answers
tgb417
@pranauv, DSS has a number extractor that is part of the visual recipes.
However, for your example, it appears that you need more than just number extraction. It appears that you need some understanding of time units, days, hours... or even some understanding of language.
I'm wondering if there is a Python Library or R Package that is designed to extract time values from free text.
I found datefinder on GitHub for dates. This article "2 PACKAGES FOR EXTRACTING DATES FROM A STRING OF TEXT IN PYTHON" looks interesting. Dataiku can use code recipes to integrate snippets of Python and R code into flows.
However, that may not be right for your use case. You may actually want a time (duration) finder and, given your second example, some NLP understanding.
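As a starting point, here is a minimal sketch of a "time finder" using only Python's standard re module. It assumes the hours are always written as digits followed by a unit such as "h", "hr", "hrs", "hour" or "hours"; the names HOURS_PATTERN and extract_hours are made up for illustration and are not part of DSS or any plugin. Something like this could be dropped into a Python code recipe.

```python
import re

# Assumption: durations are written as digits (optionally with a decimal part)
# followed by an hour unit: "h", "hr", "hrs", "hour" or "hours".
HOURS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:hours?|hrs?|h)\b", re.IGNORECASE)

def extract_hours(text):
    """Return every hour amount found in text as a list of floats."""
    return [float(value) for value in HOURS_PATTERN.findall(text)]

# The two example sentences from the question.
messages = [
    "Hi, I have worked for 40 hrs last week.",
    "Hi, I was on leave for 2 days and so I have worked for 24 hours.",
]

hours_per_message = [extract_hours(m) for m in messages]
total_hours = sum(sum(hours) for hours in hours_per_message)

print(hours_per_message)  # [[40.0], [24.0]]
print(total_hours)        # 64.0
```

Because the pattern ties each number to an hours unit, the "2 days" in the second sentence is ignored. It will not handle spelled-out numbers ("forty hours") or sentences where the hours have to be inferred from context, which is where an NLP approach would be needed.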
I'm wondering if there are any NLP folks who know enough about spaCy and other ML tools to comment.