
NaN values after split dataset

Solved!
psagrera
Level 1

I'm working with a dataset that uses the first records (10K) as its sampling method. At some point, using a visual recipe, I remove NaN values from the columns that show empty values in the Analysis option. I then check that there are no longer any empty values.

I then split the dataset (apparently without NaN) into 4 datasets, and some of the new datasets show NaN values.

Questions: 

1) Why might this be happening?

2) How can I remove NaN from the whole dataset and not only from the sample (if the removal actually only applies to the sample)?

 

Thanks

1 Solution
tim-wright
Neuron

@psagrera. Are the empty values in the split datasets coming from the columns you used to filter empty values? I suspect you may have filtered on NaN in columns (A, B), for example, and then downstream after the split found NaN in column C. Please correct me if this is wrong.

The problem here would be that the visual analysis only profiles your sample. If there are nulls that are not in your sample you will NOT see them and this issue can pop up. Two simple options I see:

  1. Use this prep step to remove rows with your visual recipe. If you know beforehand which columns will have NaN, you can specify just those. If you do not (sounds like you do not), you can specify to check all columns.
  2. If your dataset fits in memory and you would prefer, you can change the sample so that it includes your entire dataset. Then, when you analyze the columns, you needn't worry about your sample possibly missing a NaN. Note: if your dataset changes and some columns that do not currently have NaN get them in the future, the same issue you are experiencing could arise.

The analysis steps done on a sample are the same set of steps that are performed on the whole dataset. If, however, you are making decisions about which steps to apply based on an analysis of that sample, those decisions can lead to unexpected results due to the sampling effect. In those situations, it is best to make sure your analysis uses the whole dataset (or a sample large enough that you are confident in it).
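To make the sampling effect concrete, here is a minimal sketch in plain Python. It is not Dataiku's API: the rows, the column names, and the helpers `is_missing`, `columns_with_missing`, and `drop_rows_with_missing` are all made up for illustration. It shows how a "first records" sample can look clean while later rows still contain missing values, which is why the removal has to check every column on every row.

```python
import math

def is_missing(value):
    """Treat None, empty strings, and float NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def columns_with_missing(rows):
    """Return the set of column names that have at least one missing value."""
    return {col for row in rows for col, value in row.items() if is_missing(value)}

def drop_rows_with_missing(rows):
    """Keep only the rows in which no column is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]

# Hypothetical dataset: the first 3 rows are clean, the later rows are not.
rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 4.0, "b": 5.0, "c": 6.0},
    {"a": 7.0, "b": 8.0, "c": 9.0},
    {"a": 10.0, "b": float("nan"), "c": 12.0},
    {"a": 13.0, "b": 14.0, "c": None},
]

sample = rows[:3]  # stands in for the "first records" sample

missing_in_sample = columns_with_missing(sample)  # empty set: the sample looks clean
missing_in_full = columns_with_missing(rows)      # columns "b" and "c" have gaps

cleaned = drop_rows_with_missing(rows)            # checks all columns on all rows
```

Profiling only `sample` would suggest nothing needs removing, while the full data has gaps in two columns.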

I'd recommend option #1. If 

4 Replies
psagrera
Level 1
Author

Thank you very much for your detailed explanation. I'll use option number 1. I guess that if I want to use a more sophisticated method of dealing with null/NaN (i.e. remove some rows and fill others by means of interpolation, etc.), it has to be done via a code recipe, right?

tim-wright
Neuron

@psagrera If you know which columns you want to impute values for you can also do that in a visual prepare recipe without code.

You can use the "Impute with Calculated Value" step. This will allow you to select a single column, multiple columns, all columns, or only columns matching a certain pattern. The missing values for those columns can be replaced with the Mean, Median, or Mode values. If you want to impute the mean for some columns, the median for others, and the mode for others, you would need to use multiple steps. Each step should execute one imputation scheme on the relevant columns only.
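For anyone who wants to see the "one scheme per group of columns" idea spelled out, here is a rough plain-Python sketch. It only illustrates the logic and is not how the visual step is implemented; the column names, the `schemes` mapping, and the helper `impute_column` are assumptions made up for the example.

```python
import math
import statistics

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_column(values, strategy):
    """Replace missing entries with the mean, median, or mode of the present ones."""
    present = [v for v in values if not is_missing(v)]
    compute = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill_value = compute(present)
    return [fill_value if is_missing(v) else v for v in values]

# Hypothetical columns, each assigned its own imputation scheme.
columns = {
    "age": [20.0, None, 40.0],
    "income": [10.0, 20.0, float("nan"), 90.0],
    "city": ["Paris", "Paris", None, "Lyon"],
}
schemes = {"age": "mean", "income": "median", "city": "mode"}

# One pass per column, mirroring "one step per imputation scheme".
imputed = {name: impute_column(values, schemes[name]) for name, values in columns.items()}
```

Here `age` is filled with the mean of its present values, `income` with the median, and `city` with the most frequent value.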

Alternatively, if you are planning on using an ML recipe downstream, you will probably want to do the feature imputation within the context of your ML experiments. Specifically, you don't want to impute record-level values from the full dataset before you do your train/test split, because that leaks information from the test set into the training set. The result could be an overestimate of your models' ability to generalize to unseen data. Within the Design tab of a visual analysis, you can select "Feature Handling". From there you can choose independently for each variable how missing values should be treated (drop records, impute them with the mean, etc.).
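The leakage point can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions (a single hypothetical feature, a fixed split position, and the helpers `fit_mean` and `fill`), not what the visual ML tooling does internally: the fill value is computed from the training rows only and then reused on the test rows.

```python
import statistics

def fit_mean(values):
    """Compute the mean of the non-missing values."""
    return statistics.mean(v for v in values if v is not None)

def fill(values, fill_value):
    """Replace missing entries with a previously computed fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical feature with gaps; the last two rows will become the test set.
feature = [1.0, None, 3.0, 5.0, None, 100.0]

# Split first, then impute.
train, test = feature[:4], feature[4:]

train_mean = fit_mean(train)             # learned from training rows only
train_imputed = fill(train, train_mean)
test_imputed = fill(test, train_mean)    # test rows reuse the training statistic

# For contrast: a mean computed before splitting also "sees" the test value 100.0.
leaky_mean = fit_mean(feature)
```

The gap between `train_mean` and `leaky_mean` is the information that would have leaked from the test set if imputation had been done before the split.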

If you want to use some more advanced model-based imputation schemes, you will have to code those (as far as I know).
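As a small example of something you would code yourself, here is a sketch of the linear interpolation mentioned in the question. It is plain Python with a made-up helper `interpolate_linear` and a made-up series, and it assumes the rows are ordered and evenly spaced. In a real code recipe you would more likely rely on a dataframe library than on a hand-written loop.

```python
def interpolate_linear(values):
    """Fill interior None gaps linearly between the nearest known neighbors.

    Leading and trailing gaps have no neighbor on one side, so they stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

# Hypothetical ordered series with a leading gap, an interior gap, and a trailing gap.
series = [None, 1.0, None, None, 4.0, 6.0, None]
filled = interpolate_linear(series)
```

The rows that remain `None` after this pass (the leading and trailing gaps) are the ones you would then drop, which matches the "remove some rows and fill others" plan.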

 

Hope that helps. Let me know if you have any issues.

 

psagrera
Level 1
Author

Thanks for your clear explanation. My plan is to use an ML recipe downstream, so I will impute there. BTW, I think it's not possible, but just in case: using the ML visual analysis, is it possible to select more than one target in a regression problem?

Cheers
