Removing (209) specific words/strings/special characters from a string column

Suzanne · July 2023

I am preparing a column containing 'Comments/narrative' (string) for a word cloud data set. I have a list of 209 specific words I'd like to remove from the column ('A', 'AN', 'THE', etc.).

I'd rather not use the 'find and replace' recipe, for obvious reasons.

Can anyone recommend a more efficient solution? Thank you!

Miguel Angel · July 2023

Hi,

From the example words it seems the focus is on the so-called 'stop-words'. If so, an option would be to use a Prepare recipe and apply the 'Simplify text' processor. Enable the 'Clear stop words' checkbox. There is a tutorial about this in the documentation: https://knowledge.dataiku.com/latest/ml-analytics/nlp/concept-text-data-cleaning.html

If you want to build your own specific solution, there are some Python packages that can be of use, such as NLTK or Gensim.
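As a rough illustration of the "build your own" route, here is a minimal sketch that uses only Python's standard `re` module rather than NLTK or Gensim. The list `WORDS_TO_REMOVE`, the helper names and the sample comments are all hypothetical and are not taken from the original post. The sketch assumes every entry starts and ends with a letter or digit, so word boundaries apply; entries made of special characters would need separate handling.

```python
import re

# Hypothetical custom list; in practice it would hold all 209 entries.
WORDS_TO_REMOVE = ["A", "AN", "THE"]


def build_pattern(words):
    """Compile one case-insensitive regex that matches any listed word as a whole word."""
    # Longest entries first, so a longer entry wins over a shorter one sharing its prefix.
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def remove_words(text, pattern):
    """Drop every match from text and collapse the leftover whitespace."""
    if text is None:  # keep empty cells empty
        return None
    stripped = pattern.sub(" ", text)
    return " ".join(stripped.split())


PATTERN = build_pattern(WORDS_TO_REMOVE)

# Made-up sample values standing in for the 'Comments/narrative' column.
comments = ["The bumper has a dent", "An antenna was bent", None]
cleaned = [remove_words(c, PATTERN) for c in comments]
print(cleaned)
```

In a Python recipe the same function could probably be applied to a pandas column with something like `df["Comments"].map(lambda c: remove_words(c, PATTERN))`, where the column name is again an assumption. Compiling a single pattern up front avoids looping over all 209 words for every row.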

Suzanne · July 2023

Thank you! I will give this a try and post back the result. I appreciate the response.

jp1 · October 2023

@MiguelangelC
Can I look at the predefined stop-word list? I want to extract it. Could you please suggest how I can do that?

Suzanne · October 2023

Hi jp1, my words are specific to automobile damage, in addition to the 'stop words' in the DSS Prepare recipe.

jp1 · October 2023

I understand, but I want to get the list of stop words that the Dataiku Prepare recipe uses. That's what I was asking in my last comment.

jp1 · October 2023

If it is possible to look at that stop-word list, could you please suggest how?

Suzanne · October 2023

So sorry, I thought you were asking me :)
