Filter columns using NaN

Solved!
phuongphan
Level 2
Hi,

Is there a function in the processor library that can filter out and remove columns with more than 85% NaN values?

Say I have a dataset with several thousand columns, most of which contain a lot of NaN. How can I remove those columns automatically in recipes?

Thank you very much!

1 Solution
ATsao
Dataiker

Hi,

You can filter out rows in DSS (either removing or clearing them) where these columns contain NaN or null values. However, if you want to remove the column itself based on this condition, your best bet is to write your own code recipe to handle this logic. For example, you could read the input dataset into a dataframe in Python or R, iterate through the columns to calculate the percentage of NaNs in each one, drop the columns that exceed your threshold, and write the resulting dataframe to your output dataset.
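
As a rough illustration of the approach described above, here is a minimal pandas sketch. The function name `drop_sparse_columns`, the parameter `max_nan_ratio`, and the small sample dataframe are all made up for this example. In a DSS Python recipe the dataframe would typically come from `dataiku.Dataset(...).get_dataframe()` and be written back with `write_with_schema(...)`; those calls appear only as comments (with hypothetical dataset names) so the snippet runs outside DSS.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_nan_ratio=0.85):
    """Return df without the columns whose share of NaN exceeds max_nan_ratio."""
    # isna() flags NaN/None cells; the column-wise mean is the fraction missing.
    nan_ratio = df.isna().mean()
    keep = nan_ratio[nan_ratio <= max_nan_ratio].index
    return df.loc[:, keep]


# In a DSS recipe you would likely read the input instead (hypothetical name):
#   import dataiku
#   df = dataiku.Dataset("my_input").get_dataframe()
# Small stand-in dataframe so the sketch is runnable on its own:
df = pd.DataFrame({
    "a": [float(i) for i in range(10)],   # 0% NaN  -> kept
    "b": [np.nan] * 9 + [1.0],            # 90% NaN -> dropped
    "c": [np.nan] * 8 + [1.0, 2.0],       # 80% NaN -> kept
})

cleaned = drop_sparse_columns(df, max_nan_ratio=0.85)
print(list(cleaned.columns))  # ['a', 'c']

# ...and write the result to the output dataset (hypothetical name):
#   dataiku.Dataset("my_output").write_with_schema(cleaned)
```

Computing the ratio for all columns at once with `df.isna().mean()` avoids an explicit loop, which may matter with several thousand columns. Whether the threshold should be strict (`>`) or inclusive (`>=`) is a choice to adjust to your needs.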

I hope that this helps!

Best,
Andrew

4 Replies
Liev
Dataiker Alumni

Hi @phuongphan 

In your prepare recipe, you can switch to "Columns View" (top right)

From this view you should be able to display the percentage of empty or non-empty records for each column (screenshot attached). Then, on the left side, you can select several columns at a time and delete them in one go.

I hope this helps!

 

phuongphan
Level 2
Author

Hi @Liev, very nice way. Thank you for the screenshot. However, with thousands of columns, it is still very hard to handle all of them at once, so I had to create code recipes.

phuongphan
Level 2
Author

Thank you @ATsao for your reply. Yes, I finally had to create code recipes to remove all columns with a high percentage of NaN and to create additional features.