Delete entirely empty columns at import

user1278910
Level 1
Delete entirely empty columns at import

Hello - I am importing multiple sources of Excel data over several years, same reporting format (field names, field types) and seeing quite a few entirely empty columns ~ being picked up for combining into a single flat table. This results in random columns being placed in the initial combined data set and triggering the prepare recipe (at post combining) to fail, e.g., trimming. Does anyone have pointers to prevent this? Thanks. 

0 Kudos
8 Replies
SarinaS
Dataiker

Hi @user1278910,

The best way to handle this will probably depend quite a bit on what your input Excel files look like and what the issue in your current output dataset looks like. From your description, it sounds like even though the input files are similar, they may not have exactly the same schema, or some of the files may have an additional column or two that leads to shifting of some of the data unexpectedly. Is that understanding correct? In order to determine the simplest way to reconcile / remove the empty columns, it would be most useful to see a couple of sample rows of two of the mismatched input files, along with a screenshot highlighting the issue in the output dataset in order to determine the best way to resolve. 

Thanks,
Sarina

0 Kudos
Jurre
Level 5

Hi @user1278910 ,

In addition to @SarinaS  : these empty columns, do they hold data in the original files ? Is that  data anything special (like non-standard textcharacters for example greek or cyrillic) ?

cheers

Jurre

shosho88
Level 2

I'm starting to learn Dataiku

I have about a same issue. I imported a dataset (excel file) with many empty column and want to delete it all in one time. Do you have a method ?

 

Thanks for your help ๐Ÿ™‚

 

0 Kudos
Jurre
Level 5

Hi and Welcome @shosho88 ,

To delete many cols at once the column view comes in handy, the 'hamburger' button on the right of your screen is for that (see attached screenshot #1). Now check the checkboxes of cols you want to delete, or select all and de-select everything you want to keep. In 'actions' (screenshot #2) you'll find "Delete" which deletes the selected columns. 

Cheers!

shosho88
Level 2

Thanks @Jurre 

I did this but i Have a problem I want to filter only the column empty to select all once

I took a screenshot

Thanks again for your help ๐Ÿ™‚

 

0 Kudos
AlexT
Dataiker

Hi @shosho88 

If you have a high number of empty columns.

You could also use a Python recipe and dropna in pandas to remove all columns that are completely empty. 

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
input_dataset = dataiku.Dataset("dataset_name")


df = input_dataset.get_dataframe().dropna(axis=1, how='all')


# Write recipe outputs
testing = dataiku.Dataset("testing")
testing.write_with_schema(df)

 

Jurre
Level 5

Thanx for the help @AlexT  ! 

@shosho88  a bit of backgroundinfo to Alex' solution (code examples and what it does)

0 Kudos
shosho88
Level 2

Perfect! thanks guys @Jurre and @AlexT