Custom stopwords
cris
Registered Posts: 5 ✭✭✭✭
Dear all,
I would like to add custom stop words for a text variable. If I export the model into a DSS notebook, I can't see the code related to the removal of stop words either. Maybe I'm wrong?
Any suggestions/instructions would be really appreciated.
Cris
Answers
-
Hi,
To add custom stopwords, you can either add a find-and-replace step in the script part of your analysis, or write a custom vectorizer as an option in the models > feature handling screen. Let me know if you need any further guidance on these two options.
[UPDATE]
For the second option (custom vectorizer), you can activate it in the feature handling screen of the model.
The scikit-learn documentation has very useful examples of such custom vectorizers. For instance, it provides a CountVectorizer class with a stop_words argument. See: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
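As a rough illustration of the stop_words argument, here is a minimal sketch run outside Dataiku. The word list and the two sample documents are made up for the example; only CountVectorizer and its stop_words parameter come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical custom stop words and toy documents, for illustration only.
custom_stop_words = ["dear", "thanks", "regards"]
docs = [
    "Dear all, thanks for the reply",
    "Thanks and regards, the model works",
]

# Words listed in stop_words are dropped from the vocabulary.
# Text is lowercased by default, so list the stop words in lowercase.
vectorizer = CountVectorizer(stop_words=custom_stop_words)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each remaining token to its column index.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

The remaining vocabulary no longer contains the custom stop words, while ordinary words such as "the" are kept because they were not in the list.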
Cheers,
Alexandre
-
Thanks Alex for your reply, and yes, I would like to know more about the second option, "write a custom vectoriser" in the feature handling section. I'm wondering how to write and check the code within Dataiku, and how to pass the features to the code. Is it possible to have some references on how to manage variables and dataframes within the custom code? An example would be really useful as well :-)
Thanks in advance!
C.
-
I added these to my answer. Hope it helps, Alex
-
Thanks Alex for your reply. The way to reach the goal is clear, but it is not clear what the input parameter of the CountVectorizer is in Dataiku. Could you please suggest some references where I can learn about it?
Thanks again
All my best
C.
-
Hi, in the custom preprocessing screen of Dataiku, we expect the user to assign a "Vectorizer" instance to the "processor" object. This vectorizer instance would come from scikit-learn. I suggest reading https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer for reference.
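A minimal sketch of what the custom preprocessing code could look like, assuming the screen only needs a scikit-learn vectorizer assigned to the variable named processor, as described above. The extra stop words are hypothetical, and extending scikit-learn's built-in English list is just one possible choice.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific words to ignore, in lowercase.
extra_stop_words = {"dataiku", "dss"}

# Combine scikit-learn's built-in English stop words with the custom ones.
# stop_words is passed as a plain list here.
all_stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))

# The vectorizer instance is assigned to "processor".
processor = CountVectorizer(stop_words=all_stop_words)
```

To check the behaviour before pasting it into Dataiku, the same object can be tried in a notebook with processor.fit_transform(...) on a few sample strings, then inspecting processor.vocabulary_.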