Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hello - I am importing multiple sources of Excel data over several years, same reporting format (field names, field types) and seeing quite a few entirely empty columns ~ being picked up for combining into a single flat table. This results in random columns being placed in the initial combined data set and triggering the prepare recipe (at post combining) to fail, e.g., trimming. Does anyone have pointers to prevent this? Thanks.
The best way to handle this will probably depend quite a bit on what your input Excel files look like and what the issue in your current output dataset looks like. From your description, it sounds like even though the input files are similar, they may not have exactly the same schema, or some of the files may have an additional column or two that leads to shifting of some of the data unexpectedly. Is that understanding correct? In order to determine the simplest way to reconcile / remove the empty columns, it would be most useful to see a couple of sample rows of two of the mismatched input files, along with a screenshot highlighting the issue in the output dataset in order to determine the best way to resolve.
Hi and Welcome @shosho88 ,
To delete many cols at once the column view comes in handy, the 'hamburger' button on the right of your screen is for that (see attached screenshot #1). Now check the checkboxes of cols you want to delete, or select all and de-select everything you want to keep. In 'actions' (screenshot #2) you'll find "Delete" which deletes the selected columns.
If you have a high number of empty columns.
You could also use a Python recipe and dropna in pandas to remove all columns that are completely empty.
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs input_dataset = dataiku.Dataset("dataset_name") df = input_dataset.get_dataframe().dropna(axis=1, how='all') # Write recipe outputs testing = dataiku.Dataset("testing") testing.write_with_schema(df)