Encoding for categorical variables

cwentz

I would like to know your preferences for data preparation for modeling. Do you bring in a dataset and encode all of the categorical variables yourself before you analyze and model, or do you leave the dataset as is and let DSS handle the encoding?

Are there specific use cases where one is better than the other? 

Thank you for your responses.

-Christy

tgb417

@cwentz 

I'd like to invite others to reply as well.

Here are my $0.02. In my personal practice, I tend to create dynamic connections to source datasets wherever possible, using SQL, REST, or some other connection mechanism. This lets me refresh my dataset automatically rather than manually re-importing data every time the source changes.

I will then do schema work and the basic cleanup that I think can be reused across all of my analyses. This includes things like parsing dates. However, I tend not to throw out any records at this point. The hope is that this gives me one dataset that is the whole truth and nothing but the truth.
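For anyone doing this cleanup step in code rather than in DSS, here is a minimal Python sketch of parsing dates without dropping records. The column name order_date, the two date formats, and the helper names are assumptions made up for illustration; rows whose date cannot be parsed are kept with an empty parsed value instead of being discarded.

```python
from datetime import datetime

# Assumed input formats, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Add a parsed date column to every row; no row is ever removed."""
    return [
        dict(row, order_date_parsed=parse_date(row.get("order_date")))
        for row in rows
    ]


# Hypothetical sample data.
raw_rows = [
    {"id": 1, "order_date": "2021-03-15"},
    {"id": 2, "order_date": "03/16/2021"},
    {"id": 3, "order_date": "not a date"},
]
cleaned = clean_rows(raw_rows)
```

Keeping the original order_date column next to the parsed one means any later, analysis-specific step can still decide how to treat the unparseable rows.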

From there I might add a step that creates features I cannot create through Visual Analysis, or I'll enrich the data from other sources, like weather, housing prices, and the like. Then I'll tend to create training and validation sets and bring the training dataset into a Visual Analysis.
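For anyone creating the training and validation sets in code, here is a minimal Python sketch of a random split. The function name, the 80/20 ratio, and the fixed seed are assumptions for illustration and are not DSS settings.

```python
import random


def split_train_validation(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical sample data.
rows = [{"id": i, "city": "Boston" if i % 2 else "Albany"} for i in range(10)]
train, validation = split_train_validation(rows)
```

A fixed seed keeps the same records in each set when the step is rerun on unchanged data, which makes model comparisons easier to trust.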

Here I'll use the script to make changes that are specific to the analysis or model in question. For things like one-hot encoding, missing value imputation, sampling, and the like, I'll use the Model Design pages unless I have a good reason to do it elsewhere.
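If there is a good reason to do the encoding in code instead, here is a minimal Python sketch of one-hot encoding with most-frequent-value imputation. It is a hand-rolled illustration, not what DSS runs internally, and the function names and sample values are assumptions. The fill value and the category list are learned from the training data only and then reused on the validation data, so a category never seen in training encodes as all zeros.

```python
from collections import Counter


def fit_one_hot(train_values):
    """Learn the fill value (most frequent category) and the category list."""
    observed = [v for v in train_values if v is not None]
    if not observed:
        raise ValueError("need at least one non-missing training value")
    fill_value = Counter(observed).most_common(1)[0][0]
    categories = sorted(set(observed))
    return fill_value, categories


def transform_one_hot(values, fill_value, categories):
    """Impute missing values, then emit one 0/1 column per known category."""
    encoded = []
    for value in values:
        value = fill_value if value is None else value
        encoded.append([1 if value == category else 0 for category in categories])
    return encoded


# Hypothetical sample data; None marks a missing value.
train_colors = ["red", "blue", None, "red"]
validation_colors = ["blue", "green", None]

fill_value, categories = fit_one_hot(train_colors)
train_encoded = transform_one_hot(train_colors, fill_value, categories)
validation_encoded = transform_one_hot(validation_colors, fill_value, categories)
```

Fitting on the training set alone avoids leaking information from the validation set into the model's inputs.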

I hope that is helpful. I'm sure that others have a number of ideas about how they proceed, and I'd enjoy hearing what folks are doing, particularly if your go-to starting point is coding in Python or R.

--Tom