Removing 209 specific words/strings/special characters from a string column

Solved!
SuzanneD
Level 1

I am preparing a column containing 'Comments/narrative' (string) for a word cloud data set. I have a list of 209 specific words I'd like to remove from the column ('A', 'AN', 'THE', etc.).

I'd rather not use the 'find and replace' recipe, for obvious reasons.

Can anyone recommend a more efficient solution? Thank you!

1 Solution
MiguelangelC
Dataiker

Hi,

From the example words it seems the focus is on the so-called 'stop-words'. If so, an option would be to use a Prepare recipe and apply the 'Simplify text' processor. Enable the 'Clear stop words' checkbox. There is a tutorial about this in the documentation: https://knowledge.dataiku.com/latest/ml-analytics/nlp/concept-text-data-cleaning.html

If you want to build your own specific solution, there are some Python packages that can be of use, such as NLTK or Gensim.
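As a rough illustration of the "build your own" route, here is a minimal sketch that uses only the Python standard library rather than NLTK or Gensim. It assumes the 209 entries are available as a plain Python list; `WORDS_TO_REMOVE`, `build_pattern`, and `remove_words` are hypothetical names chosen for this example, and the three words shown stand in for the full list.

```python
import re

# Hypothetical stand-in for the custom removal list; the real one would hold all 209 entries.
WORDS_TO_REMOVE = ["A", "AN", "THE"]


def build_pattern(words):
    """Compile one case-insensitive regex that matches any entry as a whole token."""
    # Escape each entry so special characters are treated literally,
    # and try longer entries first so multi-word strings win over shorter ones.
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    # Lookarounds are used instead of \b so that entries made of special
    # characters still match when they are surrounded by whitespace.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def remove_words(text, pattern):
    """Remove every match from text and collapse the leftover whitespace."""
    stripped = pattern.sub(" ", text)
    return " ".join(stripped.split())


pattern = build_pattern(WORDS_TO_REMOVE)
print(remove_words("The cat sat on a mat", pattern))
print(remove_words("An anthem for the theatre", pattern))
```

Compiling a single pattern means each comment is scanned once instead of 209 times. If the column is loaded as a pandas Series (for example in a Python recipe), something like `df["Comments"].apply(lambda t: remove_words(t, pattern))` should apply it row by row, where `Comments` is an assumed column name and missing values would need handling first.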

SuzanneD
Level 1
Author

Thank you! I will give this a try and post back the result. I appreciate the response.
