The Prepare recipe is one of the most used features within Dataiku, and a favorite for users who prefer to use a visual interface to clean and prepare their data without coding. The Dataiku Online Academy gives a great grounding, but it’s worthwhile spending some time exploring the capabilities as functionality can change with new version releases.
Based on my years of experience developing data and analytics projects, the majority of time in every project is spent understanding and preparing the data, so here are five tips that might save you time in your day-to-day.
1. Column Analyzer
Like most people, the first thing that I do when I get data is to try to understand what I’m looking at. The dataset views (the blue squares in the flow) are a good place to start, but there are a few advantages of doing this in the Prepare recipe, such as quickly cleaning any data issues.
The Column Analyzer is available in both the dataset viewer and the Prepare recipe. This gives you some high-level statistics on each variable. To access, click on the variable name at the top of the column, then select ‘Analyze.’
Unlike in the dataset viewer version, there are also actions that you can choose here. One of the most useful ones is the ability to deal with outliers either by clipping, removing, or clearing using InterQuartile Range. This is a common step in data analytics because outliers can overly influence the model training.
For categorical variables, you can either perform manual binning or use Fuzzy Values Clustering. Fuzzy Values Clustering allows you to quickly group (or normalize) similar text fields, which is very useful for reducing dimensions before modeling or aggregating and charting.
Note that the data shown here is dependent on the Design Sample chosen, and the IQR will translate to a static value, not a formula. This means that as the data changes over time, the cut-off value for outliers will not change. If this is important, you can use the Group recipe to recalculate the values over the entire dataset each time instead.
2. The Columns View and Bulk Actions
One of the things I always enjoyed about coding is the ability to perform actions to several variables at once. When I first started using the Prepare recipe I was a little frustrated at having to select each column and remove it individually, especially when working with large datasets that had lots of low cardinality or empty columns. Then a colleague asked why I didn’t just use the Columns view…
The Columns view is located on the right-hand side of the screen, just near the quick column stats button. The columns view allows you to look across all of your variables at once, filter by variable type, view variable level statistics, and perform bulk actions.
This allows you to do actions like select all character and numeric columns, sort by cardinality, and then group select to remove anything with very low or very high cardinality. Much easier than in the table view and very useful for bulk removing variables that are sparsely populated or contain only a single value.
Like the rest of the Prepare recipe, the actions available to you will change depending on which variable type you have selected. For example, ‘Natural Language’ variables allow bulk tokenization and normalization, so it’s worth playing around with this yourselves.
This is also an easy way for coders to convert all names to a Python-compatible format by selecting all and then using ‘Rename’ > ‘Simplify and lowercase.’ You can even rename them with regular expressions here if that’s your preference!
So if you find yourself wanting to apply the same change across multiple columns, or are looking for a way to quickly slim down a very wide dataset, then columns view is the place for you.
3. Copy and Paste
Copy and paste provide some very cool features in Dataiku, most notably for copying data from datasets into interactive scoring in AutoML models. It’s also great for steps in the Prepare recipe.
With one or more steps in the Prepare recipe selected, right-click and select ‘Copy…’ (or ctrl+c), go to the new Prepare recipe, and right-click and select ‘Paste after selection’ (or ctrl+v).
This becomes very useful for replicating work done in another project or even within the same project. For example, when replicating steps between training and scoring datasets, or some transformations you did in the Lab that you want to bring into the flow. You can even do things such as browse projects online and copy from projects you see, such as those in the Dataiku gallery, or send the json to a coworker by pasting the step into an email or Slack message.
4. Highlighting and Regular Expressions
Extracting information from columns is really useful, particularly when dealing with longer strings. An easy way to see the options available is to just highlight a piece of text in the interface. This opens a menu that shows options like ‘Split,’ ‘Replace,’ and ‘Keep rows containing.’ For more complex cases, regular expressions (regex) are incredibly powerful.
Regexes provide a standard format for identifying patterns in strings. For example, if we had some log messages and just wanted to extract a particular pattern such as “ERROR number 1” or “WARNING number 4.” We could translate this as a series of capital letters, some lowercase letters and spaces, and ending with a digit, which would become a regex similar to “[A-Z]+.*\d+.”
When I was delivering projects, a lot of time was spent writing regexes for cleansing. People who are new to them can find them challenging (just look at all the websites dedicated to this topic), but after Dataiku 10, it’s a little easier.
Introducing the ‘smart pattern builder,’ a handy regular expression helper!
Highlight the pattern you want from the test and select ‘Extract text like…’ (or go to ‘New Step’ > ‘Strings’ > ‘Extract with Regular Expression’), pick your column, and select ‘Find with Smart Pattern.’
Users just have to highlight the word or words they want and watch the regular expression build itself, along with other options. The more columns that match the expression, the higher the ‘Match rate,’ shown below in green.
For more complex regexes, such as selecting multiple groups, you can still use the input field in the recipe, but this is a great way to get started. You can also test your own regex here.
Dataiku contains many options for reshaping, but we are often asked how to replicate the “melt” pandas function. In Dataiku, it is called ‘Folding.’ This is the reverse of a pivot; it takes rows with values in multiple columns and switches them into individual rows.
This is very useful for converting reported data into something easier to use for a model. In this particular example the ‘wide’ summary view on the left is better for displaying information to end users or on a dashboard. The melted ‘long’ view on the right is required for running time series analysis or forecasting.
Now that you’re all familiar with regular expressions, Dataiku also gives you the option to ‘Fold multiple columns by pattern’ using the ‘smart pattern builder’. This allows you to quickly select multiple column names based on a pattern rather than manually adding them, a great time saver.
Thanks for reading. There are a lot more features and functions that we didn’t cover, so it’s a good idea to have a look through the list every now and then to refresh your memory. If you have any favorites, we would love to hear about them in the comments below!