Dataiku functionality and its features

SB9 · 05-05-2020

Hello Dataiku,

I have been exploring Dataiku for a couple of weeks now, and my team and I are very impressed with its features and capabilities.

Could you please help me with the following questions about Dataiku's features and functionality?

1. When an ML model is exposed as a REST API for real-time scoring, how does Dataiku handle data preprocessing on the real-time data? That is, how does it apply the same processing steps used on the training data to incoming records before making predictions?

2. Is the data flow as represented in Dataiku extensible and platform-agnostic? Can it be integrated with data lineage tools like Ranger/Atlas?

3. Can data masking/anonymization/tokenization be performed on the data by the data governance feature offered by Dataiku?

4. Can Dataiku deploy a custom-built model (using any framework, or some selected frameworks) as a REST API?

5. My understanding is that operations like data ingestion (moving data from a source to a destination) and data transformation (changing file formats, joining data from various sources) can be performed using Dataiku. But is this one of Dataiku's primary functions? Is it optimal to use Dataiku for such operations?

Your explanations/answers are highly appreciated.

Thank You.

GCase · 05-19-2020

Sorry for the delay on these.

1. To apply the same feature processing steps at scoring time as you applied in your flow when you built the model, you would need to do one of the following:

* Consolidate all of those steps into the "Steps" component of the AutoML capability (see scoring.png).

* Pre-process the data ahead of time and use an Enrichment (dataset lookup) as part of your API.

* Write custom code and deploy it as an arbitrary Python or R code endpoint.
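As a rough illustration of the third option, the sketch below learns the preprocessing parameters once from training data, saves them, and reuses them inside the scoring function so that live records go through exactly the same steps. It is plain Python rather than Dataiku's API: the function names (`fit_preprocessing`, `apply_preprocessing`, `score`), the column names, and the toy linear model are all hypothetical and only stand in for whatever your custom endpoint would contain.

```python
import json
import statistics


def fit_preprocessing(rows, numeric_cols):
    """Learn per-column mean and standard deviation from the training rows."""
    params = {}
    for col in numeric_cols:
        values = [row[col] for row in rows if row.get(col) is not None]
        params[col] = {
            "mean": statistics.fmean(values),
            # Fall back to 1.0 so a constant column does not divide by zero.
            "std": statistics.pstdev(values) or 1.0,
        }
    return params


def apply_preprocessing(record, params):
    """Impute missing values with the training mean, then standardize."""
    features = {}
    for col, p in params.items():
        value = record.get(col)
        if value is None:
            value = p["mean"]
        features[col] = (value - p["mean"]) / p["std"]
    return features


def score(record, params, weights, bias):
    """Hypothetical scoring function: preprocess, then apply a linear model."""
    features = apply_preprocessing(record, params)
    return bias + sum(weights[col] * features[col] for col in params)


# Training time: fit the parameters and persist them next to the model.
training_rows = [
    {"age": 20, "income": 1000.0},
    {"age": 40, "income": 3000.0},
]
saved_params = json.dumps(fit_preprocessing(training_rows, ["age", "income"]))

# Scoring time: reload the saved parameters and reuse them on a live record.
loaded_params = json.loads(saved_params)
weights = {"age": 2.0, "income": 1.0}  # assumed toy coefficients
prediction = score({"age": 40, "income": 3000.0}, loaded_params, weights, bias=0.5)
```

The point of the design is that nothing is recomputed from incoming data: the scoring path only reads parameters that were fixed at training time.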

2. DSS has its own lineage tooling in the form of the Catalog. You can also read that lineage out through the REST API: list the datasets, recipes, models, and plugins, and rebuild the flow lineage from them. Our most complete external catalog integration to date is with Alation. Any other integrations would involve Dataiku services or your own internal development.
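To give a sense of what rebuilding the lineage could look like, here is a minimal sketch that turns a list of recipe definitions into dataset-to-dataset edges and walks them upstream. The recipe list is hard-coded; in practice it would come from the REST API. The dictionary shape (`inputs` and `outputs` holding roles with `items` and `ref` keys) is an assumption made for this example, so check it against what your DSS version actually returns. The dataset and recipe names are made up.

```python
from collections import defaultdict


def recipe_edges(recipe):
    """Yield (input dataset, recipe name, output dataset) for one recipe."""
    def refs(section):
        # Assumed shape: {"role": {"items": [{"ref": "dataset_name"}, ...]}}
        return [item["ref"]
                for role in section.values()
                for item in role.get("items", [])]

    for src in refs(recipe.get("inputs", {})):
        for dst in refs(recipe.get("outputs", {})):
            yield src, recipe["name"], dst


def build_lineage(recipes):
    """Map each dataset to the set of datasets built directly from it."""
    downstream = defaultdict(set)
    for recipe in recipes:
        for src, _, dst in recipe_edges(recipe):
            downstream[src].add(dst)
    return downstream


def upstream_of(dataset, recipes):
    """Return every dataset that feeds into `dataset`, directly or not."""
    parents = defaultdict(set)
    for recipe in recipes:
        for src, _, dst in recipe_edges(recipe):
            parents[dst].add(src)
    seen, stack = set(), [dataset]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical recipe definitions standing in for an API response.
recipes = [
    {"name": "compute_joined",
     "inputs": {"main": {"items": [{"ref": "customers"}, {"ref": "orders"}]}},
     "outputs": {"main": {"items": [{"ref": "joined"}]}}},
    {"name": "compute_scored",
     "inputs": {"main": {"items": [{"ref": "joined"}]}},
     "outputs": {"main": {"items": [{"ref": "scored"}]}}},
]

lineage = build_lineage(recipes)
sources_of_scored = upstream_of("scored", recipes)
```

The resulting edges are what you would then push to an external lineage tool in whatever format it expects.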

3. This would require an in-depth discussion. DSS has some capabilities in this area, but for any industry standard (PCI, HIPAA, etc.), we would likely look to include a partner tool for assistance.

4. DSS can deploy the AutoML prediction models, custom Python and R models, and arbitrary Python or R code with any combinations of libraries. Please look through our documentation on API Endpoints to get a better sense of what we can accomplish. Happy to take follow-ups. https://doc.dataiku.com/dss/latest/apinode/endpoints.html

5. Yes. One of the core tenets of DSS is being an end-to-end solution for data science and self-service analytics. Data ingestion and transformation are things we excel at inside DSS. The caveat I would make is that we are focused more on the analyst and data science communities and their particular problems. While we have teams that use DSS for more EDW loading scenarios (DSS can make use of the same Spark compute clusters used by dedicated ETL tools), it is not a core function of DSS, and customers would likely be better served by the likes of Informatica, DataStage, and others.

As a side note, we would typically answer these questions as part of a follow-up meeting, an RFP, or an evaluation. If you are interested in exploring further, please use the Contact button and someone will get in touch with you.

Best Regards,
Grant
