Exporting pre-processing in custom deployment
Hi,
I am using Dataiku to train a model and have included several 'pre-processing' recipes (Join tables, etc.) before model training with Keras. After training, I wish to export the model in Keras format and deploy it in my existing deployment architecture elsewhere. I would like to ask two questions.
1. Am I limited to exporting a pretrained model in Dataiku's predefined formats (e.g. PMML, JAR), or can I export it to any framework I choose (e.g. Keras, PyTorch)?
2. I also need to make sure the same 'pre-processing' steps are deployed in the right order on my existing deployment architecture, preferably as Python code. Is this possible?
Thank you.
Regards,
Jax
Best Answer
-
Hi,
Dataiku provides direct export capabilities that let you very simply obtain a working standalone model in PMML or JAR format. However, these export capabilities are not currently available for deep learning models trained with Keras (they are available for most scikit-learn models, XGBoost models and most MLlib models).
It is important to note that DSS does not "lock in" your models. When you train a model with Keras through the DSS visual ML capabilities, your Keras model is serialized in the standard H5 format, so even though DSS does not natively provide an "export" option for it, you can access the serialized Keras model directly.
For a model deployed in the Flow (i.e. a saved model), you'll find this model in DATA_DIR/saved_models/PROJECT/MODEL_ID/versions/VERSION_ID
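For illustration, loading the serialized network from that folder could look something like the sketch below. The data dir, project key, model id and version id here are placeholders, and the exact file name of the H5 file inside the version folder may differ between DSS versions, so the sketch searches for it rather than hard-coding a name:

```python
import glob
import os

from tensorflow import keras

# Hypothetical paths/ids -- substitute your own DATA_DIR, project key,
# saved model id and version id.
version_dir = os.path.join(
    "/path/to/DATA_DIR", "saved_models", "MYPROJECT", "MYMODEL", "versions", "v1"
)

# Search for the serialized Keras network instead of assuming its file name.
h5_paths = glob.glob(os.path.join(version_dir, "**", "*.h5"), recursive=True)
model = keras.models.load_model(h5_paths[0])
model.summary()
```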
It is however important to note that the model itself is fitted on the output of what Dataiku calls "preprocessing", i.e. feature handling. This notably includes dummification of categorical variables and rescaling of numerical variables. The model is trained (as are all Keras and scikit-learn models) on a purely numerical matrix with no missing values, which is the output of preprocessing. These operations are not performed using scikit-learn or Keras capabilities, as they are more specialized, and Dataiku does not currently have specific export capabilities for them. Each transformation is, however, described by open JSON files in the same folder.
If your model takes only non-rescaled numerical features as input, your initial feature space and the preprocessed feature space will be the same, so you can use your .h5 file directly.
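If you want to see what those transformations are, you can simply inspect the JSON files sitting next to the model. This is only an exploratory sketch: the folder path is a placeholder, and the file names and JSON schemas are DSS-internal rather than a stable API:

```python
import glob
import json
import os

# Hypothetical version folder path, as above.
version_dir = "/path/to/DATA_DIR/saved_models/MYPROJECT/MYMODEL/versions/v1"

for path in sorted(glob.glob(os.path.join(version_dir, "**", "*.json"), recursive=True)):
    with open(path) as f:
        descriptor = json.load(f)
    # Print each file name and its top-level keys to see which transformation
    # (dummification, rescaling, ...) it describes.
    print(os.path.relpath(path, version_dir), "->", list(descriptor)[:10])
```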
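In that simple case, applying the exported model outside DSS is just a matter of loading the H5 file and calling it on your own numerical data (the path and feature values below are purely hypothetical; the column order must match the one used at training time):

```python
import numpy as np
from tensorflow import keras

# Hypothetical H5 path found in the saved model's version folder (see above).
model = keras.models.load_model("/path/to/version_dir/model.h5")

# Rows must contain the same numerical columns, in the same order, as the
# features the model was trained on in DSS.
X = np.array([
    [3.2, 17.0, 0.0],
    [1.1, 42.0, 1.0],
])
print(model.predict(X))
```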
Finally, regarding the other Flow recipes (Join, etc.): most of these can be exported to SQL, Hive, Impala or SparkSQL code (depending on which engine they run on). They cannot be exported to Python code.
The "Prepare" recipe cannot always be exported to code, because it does not work by generating code but by leveraging business-value-added transformations. Some cases of Prepare recipes can be exported to SQL or SparkSQL.
Answers
-
Thanks Clement.
1. I gather that Dataiku's preprocessing is implicit, and these transformations are described by the JSON files. Would it make sense for me to write a plugin/recipe to read in these transformations and perform the preprocessing by referencing the relevant dataset?
2. If I were to use the Automation Node in deployment, would these still be issues?
-
Hi,
Indeed, one of Dataiku's main value propositions is to allow you to deploy these models to its Automation Node without having to export and recode any of the model code, or the upstream preprocessing steps. Furthermore, remaining within Dataiku throughout the model deployment lifecycle will allow you to benefit from the centralized security, governance, and model management features (e.g. model retraining and redeployment, model versioning, etc...) all of which can be automated to the degree that suits your needs.
Of course, we understand that each organization has its own standards and practices for production deployment. Our customers have been successful in integrating Dataiku into these frameworks, even in highly-regulated industries, while still benefiting from Dataiku's full suite of features. We'd be happy to work with you to help you see what that would look like in your particular case.