
When reducing the columns in a dataset, identify the columns to keep, not the columns to remove

Manuel
Dataiker

Often, a source dataset includes many more columns than the downstream flow needs. In a Prepare recipe, the temptation is to use a Remove processor to delete the columns that you don’t need to carry forward.

1.png

 

 

However, if more columns are later added to the source dataset, you will also need to update the Remove processor to include them. Otherwise, the new columns will silently flow into the downstream schemas.

Instead, use the Keep processor, positively identifying the columns you want to carry forward.

2.png


In this case, when new columns are added to the source dataset, the Prepare recipe does not need to be updated, as the Keep processor automatically ignores any newly added columns, keeping downstream schemas consistent.
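To make the difference concrete, here is a minimal plain-Python analogy. It is not Dataiku code and does not use any Dataiku API: the helper functions, column names, and sample rows are all hypothetical, and each row is simply modeled as a dict. It only illustrates how a remove-list and a keep-list react when a new column appears upstream.

```python
# Plain-Python analogy (NOT Dataiku code): each row is a dict of column -> value.
# All function names, column names, and values below are hypothetical.

def remove_columns(rows, to_remove):
    """Remove-style: drop the listed columns, pass everything else through."""
    return [{k: v for k, v in row.items() if k not in to_remove} for row in rows]

def keep_columns(rows, to_keep):
    """Keep-style: carry forward only the listed columns.

    In this sketch a missing kept column raises KeyError; the real
    processor may behave differently.
    """
    return [{k: row[k] for k in to_keep} for row in rows]

def schema(rows):
    """Column names of the first row, in order."""
    return list(rows[0].keys())

# The flow only needs "id" and "amount".
source_v1 = [{"id": 1, "amount": 10.0, "notes": "a"}]
# Later, a new column appears in the source dataset.
source_v2 = [{"id": 1, "amount": 10.0, "notes": "a", "internal_flag": True}]

removed_v1 = schema(remove_columns(source_v1, {"notes"}))
removed_v2 = schema(remove_columns(source_v2, {"notes"}))
kept_v1 = schema(keep_columns(source_v1, ["id", "amount"]))
kept_v2 = schema(keep_columns(source_v2, ["id", "amount"]))

print(removed_v1, removed_v2)  # the remove-list lets the new column through
print(kept_v1, kept_v2)        # the keep-list gives the same schema both times
```

With the remove-list, the output schema grows as soon as the source does; with the keep-list, it stays the same until you deliberately change the list.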

Now the tricky bit: when you have many columns, how do you easily select the ones to keep? It is counterintuitive, but the easiest way is to create a Remove processor and then convert it into a Keep processor:

  1. In your Prepare recipe, change to column view;
  2. Select the columns you want to KEEP;
  3. In the Actions menu, select Delete (trust me);

3.png

 

 

  4. In the processor, at the bottom of the columns list, switch from Remove to Keep.

4.png

 

 

“Voilà”, you have a Keep processor that positively identifies the columns you need.
