I'm looking for a way to integrate Dataiku with a standalone data catalog tool such as DataHub. The motivation is that some of the initial data loading and transformation happens inside the DWH, through an orchestration tool like Airflow and a transformation tool like dbt. This produces the initial datasets that are then used inside Dataiku.
Inside Dataiku we can track lineage with, for example, the Thread plugin. Given properly described attributes, this plugin builds a catalog with lineage across projects, tags, descriptions and so on. However, it cannot see what happens before the data reaches Dataiku.
On the other hand, third-party data catalog tools can work with dbt and Airflow and can query the DWH itself, but they cannot see what happens to datasets inside Dataiku.
Hence the question: is there a way to integrate Dataiku's lineage into third-party data catalog / data observability platforms? DataHub is one example, but other options are fine too. Has anyone tried something similar?
Operating system used: CentOS 7
Can you clarify what the expectation would be in the Data Catalog?
The challenge with lineage is that most tools rely on SQL to understand what happens where. With Dataiku, you can code your own steps in the flow, so lineage in that case depends on the code written by the developer.
Understanding the expectation would help understand the need.
This is exactly it! I want to see lineage from the input dataset all the way to the end. Vanilla catalog tools use either dbt graphs or plain SQL to understand connections between datasets, but because we use Dataiku, those connections are lost from the catalog tool's perspective.
So the question is: is there a way to export DAGs from Dataiku in a form that any catalog tool (DataHub or similar) could understand and merge with the lineage from dbt?
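To make the ask concrete, here is a minimal sketch of what I have in mind. It is not a working integration. It assumes the flow can be dumped as a list of recipes with their declared input and output datasets, and that the Dataiku datasets reading from the DWH can be identified by the same table names dbt uses. The recipe list, the dbt edges, and every name below are hypothetical. Note that this only gives dataset-level lineage, which should hold for code recipes too, since it relies on declared inputs and outputs rather than on parsing the code.

```python
import json
from collections import defaultdict

# Hypothetical dump of a flow: each recipe declares its inputs and outputs.
# The structure and all names are assumptions made for illustration.
DATAIKU_RECIPES = [
    {"name": "compute_orders_clean", "type": "python",
     "inputs": ["dwh.orders"], "outputs": ["orders_clean"]},
    {"name": "compute_orders_report", "type": "join",
     "inputs": ["orders_clean", "dwh.customers"], "outputs": ["orders_report"]},
]

# Hypothetical upstream edges (parent, child) as a catalog might derive from dbt.
DBT_EDGES = [("raw.orders", "dwh.orders"), ("raw.customers", "dwh.customers")]


def recipe_edges(recipes):
    """Turn each recipe into dataset-level (source, target, recipe) edges."""
    edges = []
    for recipe in recipes:
        for src in recipe["inputs"]:
            for dst in recipe["outputs"]:
                edges.append((src, dst, recipe["name"]))
    return edges


def upstream_of(dataset, edges):
    """Return every dataset that `dataset` depends on, directly or not."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    seen, stack = set(), [dataset]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Merge both sources into one edge list, keyed on shared table names.
dataiku_edges = recipe_edges(DATAIKU_RECIPES)
merged = DBT_EDGES + [(src, dst) for src, dst, _ in dataiku_edges]

# A neutral JSON payload that an ingestion script for a catalog could consume.
payload = [{"source": s, "target": t, "via": r} for s, t, r in dataiku_edges]

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
    print(sorted(upstream_of("orders_report", merged)))
```

With edges in this shape, the catalog side would only need a small ingestion script to push them next to the dbt lineage. The part I don't know how to do is getting the recipe list out of Dataiku in the first place.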