When reducing the columns in a dataset, identify the columns to keep, not the columns to remove

Manuel · 01-10-2022

Often, a source dataset includes many more columns than you need for the downstream flow. In a Prepare recipe, the temptation is to use a Remove processor to delete the columns that you don’t need to carry forward in the flow.

However, if more columns are later added to the source dataset, you will also need to update the Remove processor to include them. Otherwise, the new columns pass through the recipe and inadvertently end up in the downstream schemas.

The tip is to use the Keep processor instead, positively identifying the columns you want to carry forward.

In this case, when new columns are added to the source dataset, the Prepare recipe does not need to be updated, as the Keep processor automatically ignores any newly added columns, keeping downstream schemas consistent.
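The difference between the two approaches can be sketched in plain Python. This is only an illustration of the logic, not Dataiku's implementation: the helper functions and the column names (`id`, `name`, `email`, `internal_notes`, `ssn`) are hypothetical.

```python
def remove_columns(columns, to_remove):
    """Remove-style selection: drop the named columns, pass everything else through."""
    return [c for c in columns if c not in to_remove]


def keep_columns(columns, to_keep):
    """Keep-style selection: pass only the named columns through."""
    return [c for c in columns if c in to_keep]


# Hypothetical source schema, before and after a new column is added upstream.
original = ["id", "name", "email", "internal_notes"]
extended = original + ["ssn"]

remove_before = remove_columns(original, {"internal_notes"})
remove_after = remove_columns(extended, {"internal_notes"})

keep_before = keep_columns(original, {"id", "name", "email"})
keep_after = keep_columns(extended, {"id", "name", "email"})

# With Remove, the new column leaks into the output schema.
print(remove_after)
# With Keep, the output schema is the same before and after the change.
print(keep_before == keep_after)
```

With the Remove-style list, the output schema changes as soon as the source gains a column; with the Keep-style list, it stays fixed until you deliberately edit the list.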

Now the tricky bit: when you have many columns, how do you easily select the ones to keep? It is counterintuitive, but the easiest way is to create a Remove processor and then convert it into a Keep processor:

1. In your Prepare recipe, change to column view.
2. Select the columns you want to KEEP.
3. In the Actions menu, select Delete (trust me).
4. In the processor, at the bottom of the columns list, switch from Remove to Keep.

“Voilà”, you have a Keep processor that positively identifies the columns you need.
