Delete entirely empty columns at import

user1278910 · January 2022

Hello - I am importing multiple sources of Excel data spanning several years, all in the same reporting format (field names, field types), and I am seeing quite a few entirely empty columns being picked up when the files are combined into a single flat table. This results in random columns appearing in the initial combined dataset and causes steps in the prepare recipe that runs after combining (e.g., trimming) to fail. Does anyone have pointers to prevent this? Thanks.

Sarina · March 2022

Hi @user1278910,

The best way to handle this will probably depend quite a bit on what your input Excel files look like and what the issue in your current output dataset looks like. From your description, it sounds like even though the input files are similar, they may not have exactly the same schema, or some of the files may have an additional column or two that leads to shifting of some of the data unexpectedly. Is that understanding correct? In order to determine the simplest way to reconcile / remove the empty columns, it would be most useful to see a couple of sample rows of two of the mismatched input files, along with a screenshot highlighting the issue in the output dataset in order to determine the best way to resolve.

Thanks,
Sarina

Jurre · March 2022

Hi @user1278910,

In addition to @SarinaS's questions: do these empty columns hold data in the original files? Is that data anything special (for example, non-standard text characters such as Greek or Cyrillic)?

Cheers,

Jurre

shosho88 · April 2022

I'm starting to learn Dataiku.

I have a similar issue. I imported a dataset (Excel file) with many empty columns and want to delete them all at once. Do you have a method?

Thanks for your help

Jurre · April 2022

Hi and welcome @shosho88,

To delete many columns at once, the column view comes in handy; the 'hamburger' button on the right of your screen opens it (see attached screenshot #1). Check the checkboxes of the columns you want to delete, or select all and de-select everything you want to keep. Under 'Actions' (screenshot #2) you'll find "Delete", which deletes the selected columns.

Cheers!

shosho88 · April 2022

Thanks @Jurre

I did this, but I have a problem: I want to filter for only the empty columns so I can select them all at once.

I took a screenshot

Thanks again for your help

Alexandru · April 2022

Hi @shosho88

If you have a large number of empty columns, you could also use a Python recipe and pandas' dropna to remove all columns that are completely empty.

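# -*- coding: utf-8 -*-
import dataiku

# Read recipe inputs ("dataset_name" stands for the name of your input dataset)
input_dataset = dataiku.Dataset("dataset_name")

# Load the data as a pandas DataFrame and drop every column
# whose values are all missing (axis=1 means columns, how='all' means entirely empty)
df = input_dataset.get_dataframe().dropna(axis=1, how='all')

# Write recipe outputs; the output schema is taken from the cleaned DataFrame
testing = dataiku.Dataset("testing")
testing.write_with_schema(df)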
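
For the original question (several Excel sources combined into one table), one possible approach is to drop the empty columns from each source before combining them. The sketch below uses plain pandas only, so it can be tried outside Dataiku. The helper name drop_empty_columns and the two small sample frames are hypothetical, made up for illustration. It also assumes that whitespace-only text cells should count as empty, which a plain dropna would not do on its own.

```python
import pandas as pd


def _is_blank(value) -> bool:
    """True for missing values and for empty or whitespace-only strings."""
    if isinstance(value, str):
        return value.strip() == ""
    return bool(pd.isna(value))


def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without the columns in which every cell is blank."""
    # Check columns by position so duplicate column names do not cause trouble.
    keep_mask = [
        not df.iloc[:, i].map(_is_blank).all() for i in range(df.shape[1])
    ]
    return df.loc[:, keep_mask]


# Hypothetical stand-ins for two yearly Excel extracts.
report_2020 = pd.DataFrame(
    {"id": [1, 2], "name": ["a", "b"], "Unnamed: 2": [None, None]}
)
report_2021 = pd.DataFrame({"id": [3], "name": ["c"], "col_3": ["  "]})

# Clean each source first, then stack them into one flat table.
combined = pd.concat(
    [drop_empty_columns(report_2020), drop_empty_columns(report_2021)],
    ignore_index=True,
)
print(combined)
```

Cleaning each source before the concat keeps a stray empty column in one file from being added to the combined schema. Note that a source with zero rows would lose all of its columns under this sketch, so such files may need separate handling.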

Jurre · April 2022

Thanks for the help, @AlexT!

@shosho88 a bit of background info on Alex's solution (code examples and what it does)
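
As a small illustration of what the dropna call does, here is a minimal sketch on a made-up DataFrame (the column names are invented for the example). It compares how='all', which removes only columns that are entirely missing, with how='any', which removes every column containing at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data: one complete column, one with a gap, one entirely missing,
# and one holding empty strings (which pandas does not treat as missing).
df = pd.DataFrame(
    {
        "full": [1, 2, 3],
        "partial": [1.0, np.nan, 3.0],
        "empty": [np.nan, np.nan, np.nan],
        "blank": ["", "", ""],
    }
)

# axis=1 works on columns; how="all" drops a column only if every value is missing.
only_all_empty_dropped = df.dropna(axis=1, how="all")

# how="any" is stricter: a single missing value is enough to drop the column.
any_missing_dropped = df.dropna(axis=1, how="any")

print(list(only_all_empty_dropped.columns))  # ['full', 'partial', 'blank']
print(list(any_missing_dropped.columns))     # ['full', 'blank']
```

The "blank" column survives in both cases because empty strings are ordinary values to pandas. If the data reaches the DataFrame with empty strings rather than missing values, those would need to be converted to missing values first.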

shosho88 · April 2022

Perfect! Thanks guys, @Jurre and @AlexT
