extracting defined skills from documents - NLP

I have a question about natural language processing. I would like to automatically identify and visualize the skills (communication, marketing, statistics, etc.) mentioned in a document.

I already have a defined list of 10 skills, and I have the text extracted from the documents. Each document is written with 1 to 3 skills in mind. I would like to extract the relevant skills from each document, based on my defined list.

Which methods would be best suited to this goal (bag of words, word embeddings, etc.)?

1 Reply

Hello,

If I understand your question correctly, you want to extract specific words from a column containing natural-language text. If so, you can use the Extract with regular expression processor in a Prepare recipe. For example, in a column containing home descriptions, I may want to extract the words Coronavirus, basement and Beautiful. I can use the pattern ((?:Coronavirus|basement|Beautiful)) to find those words and create a new column from the matches:

[Screenshot: Prepare recipe output with the new column of extracted words]
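For readers who want to try the same alternation idea outside the Prepare recipe, here is a minimal sketch in plain Python using the standard re module. It is not Dataiku's API, only an illustration of the pattern. The skill list is hypothetical (the question names only a few of the ten skills), and the word boundaries and case-insensitive flag are assumptions added here, not part of the pattern above.

```python
import re

# Hypothetical skill list: the question mentions ten skills but names only these.
SKILLS = ["communication", "marketing", "statistics"]

# One alternation group, in the spirit of ((?:Coronavirus|basement|Beautiful)).
# Assumptions added for this sketch: re.escape protects special characters,
# \b limits matches to whole words, and IGNORECASE ignores capitalization.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(skill) for skill in SKILLS) + r")\b",
    flags=re.IGNORECASE,
)


def extract_skills(text):
    """Return the listed skills found in text, lowercased, in order of first appearance."""
    found = []
    for match in PATTERN.finditer(text):
        skill = match.group(1).lower()
        if skill not in found:
            found.append(skill)
    return found


if __name__ == "__main__":
    sample = "Strong communication and Statistics background; statistics tutor."
    print(extract_skills(sample))
```

Note that this only finds skills that appear literally in the text; a document written with a skill in mind but never naming it would return nothing.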

If you have version 9.0.0 or above, this process is even easier with the Smart Pattern Builder. Instead of writing your own regular expression, you can simply highlight the words in the text that you want to extract.

As for bag of words and word embeddings, these are ways of representing text as numeric features and are not used for text extraction. We would need to know more about your use case to give further guidance.

I hope this is helpful!


