
Custom stopwords

Level 1
Dear all,

I would like to add custom stop words for a text variable. If I export the model to a DSS notebook, I can't see the code related to the removal of stop words either. Maybe I'm wrong?

Any suggestions/instructions would be really appreciated.



Cris
5 Replies
Dataiker

Hi,



To add custom stop words, you can either add a find-and-replace step in the Script part of your analysis, or write a custom vectorizer as an option in the Models > Feature handling screen. Let me know if you need any further guidance on these two options.



[UPDATE] 



For the second option (custom vectorizer), you can activate it in the feature handling screen of the model.





The scikit-learn documentation has very useful examples of such vectorizers. For instance, it provides a CountVectorizer class with a stop_words argument. See: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
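To illustrate the stop_words argument mentioned above, here is a minimal sketch in plain scikit-learn, outside of Dataiku. The sample sentences and the stop word list are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents and custom stop words, for illustration only.
docs = ["the product arrived late", "the product works well"]
custom_stop_words = ["the", "product"]

# stop_words accepts a list of tokens to drop before counting.
vectorizer = CountVectorizer(stop_words=custom_stop_words)
counts = vectorizer.fit_transform(docs)

# The learned vocabulary no longer contains the custom stop words.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['arrived', 'late', 'well', 'works']
```

Note that the default tokenizer lowercases the text, so the stop words should be given in lowercase as well.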



Cheers,



Alexandre

Level 1
Author
Thanks Alex for your reply, and yes, I would like to know more about the second option, "write a custom vectorizer" in the feature handling section. I'm wondering how to write and check the code within Dataiku, and how to pass the features to the code. Is it possible to have some references on how to manage variables and dataframes within the custom code? An example would be really useful as well 🙂

Thanks in advance!

C.
Dataiker
I added these to my answer. Hope it helps, Alex
Level 1
Author
Thanks Alex for your reply. The way to reach the goal is clear, but it is not clear to me what the input of the CountVectorizer is in Dataiku. Could you suggest some references where I can learn about it?

Thanks again
All my best

C.
Dataiker
Hi, in the custom preprocessing screen of Dataiku, we expect the user to assign a "Vectorizer" instance to the "processor" object. This vectorizer instance would be from scikit-learn. I suggest reading https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer for reference.
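As a rough sketch of what the custom preprocessing code could look like, the snippet below assigns a scikit-learn CountVectorizer to a variable named processor, as described above. The extra stop words are hypothetical, and combining them with scikit-learn's built-in English list is just one possible choice. How Dataiku feeds the text column to this object is not covered in this thread, so treat that part as an assumption to check against the Dataiku documentation.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific stop words (lowercase, single tokens).
custom_stop_words = ["dataiku", "dss"]

# Combine scikit-learn's built-in English list with the custom words.
all_stop_words = sorted(ENGLISH_STOP_WORDS.union(custom_stop_words))

# Per the reply above, the vectorizer instance is assigned to "processor".
processor = CountVectorizer(stop_words=all_stop_words)
```

Before pasting it into Dataiku, the vectorizer can be checked in a notebook by calling processor.fit_transform on a small list of strings and inspecting processor.vocabulary_.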