Prepare recipe API

The prepare recipe is one of the most complex and useful aspects of Dataiku. But currently, the only way to create a prepare recipe programmatically is to define it using JSON. This is a lot to manage for more complex recipes, especially since the developer must manually build recipes in the UI just to discover which options are valid in the JSON document.

An API could make interacting with these recipes substantially easier, especially when it comes to debugging and knowing which options are valid with which processors.
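
For reference, here is a minimal sketch of the current JSON-driven workflow, assuming recent versions of the dataikuapi client (method and parameter names may differ across DSS releases, and the step structure shown is illustrative):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
    project = client.get_project("MY_PROJECT")

    # Create a prepare recipe from an input dataset to a new managed output
    builder = project.new_recipe("prepare")
    builder.with_input("raw_orders")
    builder.with_new_output("orders_prepared", "filesystem_managed")
    recipe = builder.create()

    # Each step is a raw JSON object; valid "type" and "params" values must
    # currently be discovered by building recipes in the UI and inspecting them
    settings = recipe.get_settings()
    settings.raw_steps.append({
        "type": "ColumnRenamer",
        "params": {"renamings": [{"from": "cust_nm", "to": "customer_name"}]},
        "preview": False,
        "disabled": False,
    })
    settings.save()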

It looks like other Dataiku users are also looking for this functionality.

Some features that would be really cool to have:

  • Define a prepare recipe as a list of objects, each instantiating its processor's class (see the sketch after this list).
    • This would allow recipes to be traversed easily in code without losing access to each processor object's methods
  • Documented and accessible parameters for each processor
  • Recipe validation: for any given prepare recipe, execute validation and indicate whether any steps caused errors
    • Really valuable when generating many prepare recipes at once.
  • For any step in the recipe, check the current state of the dataset - column names, types, and meanings. Edit meanings and types.
    • Useful for programmatic validation
  • Preview a dataset sample up to any step in the recipe
    • Useful for programmatic validation - if I need to generate thousands of prepare recipes, there may be outlier datasets to which my assumptions about the correct transformation steps don't apply. Getting back a dataframe that I can run custom, use-case-specific validation on would be really helpful.
  • Validation for the Dataiku formula language; template strings / bind parameters for the formula language (pass in dynamic identifiers for objects) with code-injection protection - perhaps taking inspiration from AQL's model
  • Pass mappings for processors such as rename-columns and find-and-replace as native data structures (e.g., dictionaries) rather than as TSV strings.
  • Access shaker processors independently of a prepare recipe - integrate them directly into code recipes to expand transformation capabilities.
  • Apply all this to visual analysis scripts too
  • Expose whether any given processor is compatible with a given engine, and allow custom SQL to replace incompatible processors when running on SQL engines (just as is currently supported in the UI).
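
To make the ask concrete, here is one possible shape for such an API. Everything below is hypothetical - none of these classes or methods exist in dataikuapi today - it simply illustrates the object-based step list, dict-based mappings, per-step validation, and schema/preview inspection described above:

    # Hypothetical API sketch - these classes and methods do not exist today
    from dataikuapi.shaker import PrepareRecipe, ColumnRenamer, FindReplace, TypeSetter

    recipe = PrepareRecipe(project, input_dataset="raw_orders",
                           output_dataset="orders_prepared")

    recipe.steps = [
        # Mappings passed as dicts instead of TSV strings
        ColumnRenamer({"cust_nm": "customer_name", "ord_dt": "order_date"}),
        FindReplace(column="status", mapping={"A": "Active", "I": "Inactive"}),
        TypeSetter(column="order_date", type="date", meaning="Date"),
    ]

    # Per-step validation instead of failures at run time
    report = recipe.validate()
    for step, result in zip(recipe.steps, report):
        if not result.ok:
            print(f"{type(step).__name__} failed: {result.message}")

    # Inspect schema state and preview a sample after any step
    schema = recipe.schema_after(step=1)             # column names, types, meanings
    sample = recipe.preview_after(step=2, rows=100)  # returns a pandas DataFrame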

An example use-case:

I have a data warehouse I'd like to create semantic views for. After completing the necessary joins and stacks, the data needs to be transformed so it can be consumed by human users. This workflow involves renaming columns, changing column types and meanings, transforming strings, and mapping enums. Every recipe is slightly different depending on the configuration of the source dataset. For more than 7,700 datasets, I need to create prepare recipes that perform and validate the necessary transformations, producing user-friendly datasets on the other side. These can then be exposed for integration into other projects where the data will be used. Currently, this can be done with a Python recipe that iterates through the datasets and generates the prepare recipes. However, the process is rather fragile, since for every dataset a complex JSON object defining all the processors in its prepare recipe must be constructed. With an API, developing and maintaining scripts that programmatically define these prepare recipes would be much easier.
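
Today that loop looks roughly like the sketch below, again assuming the dataikuapi client (get_transform_plan is a hypothetical helper standing in for the per-dataset logic, and method names may differ across DSS versions):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
    project = client.get_project("WAREHOUSE")

    for dataset in project.list_datasets():
        name = dataset["name"]
        # get_transform_plan is a hypothetical helper encapsulating the
        # per-dataset renames, type changes, and enum mappings
        plan = get_transform_plan(name)

        builder = project.new_recipe("prepare")
        builder.with_input(name)
        builder.with_new_output(name + "_semantic", "filesystem_managed")
        recipe = builder.create()

        # Every step is a hand-built JSON object; a typo in "type" or
        # "params" only surfaces when the recipe is opened or run
        settings = recipe.get_settings()
        settings.raw_steps.append({
            "type": "ColumnRenamer",
            "params": {"renamings": [
                {"from": src, "to": dst} for src, dst in plan["renames"].items()
            ]},
            "preview": False,
            "disabled": False,
        })
        settings.save()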
