External data catalog integration

zloe
Level 3

Hi everyone,

I'm looking for a way to integrate Dataiku with a standalone data catalog tool, for example DataHub. The need arises because some of the initial data loading and transformation happens inside the DWH, driven by an orchestration tool such as Airflow and a transformation tool such as dbt. This produces the initial datasets that are then used inside Dataiku.

Inside Dataiku we can track lineage with, for example, the Thread plugin. With properly described attributes, this plugin builds a catalog with lineage across projects, tags, descriptions, and so on. However, it cannot see what happens before the data reaches Dataiku.

On the other hand, third-party data catalog tools can work with dbt and Airflow and can query the DWH itself, but they cannot see what happens to datasets inside Dataiku.

Hence the question: is there a way to integrate Dataiku's lineage into third-party data catalog / data observability platforms? DataHub is one example, but other options would work too. Has anyone tried something similar?


Operating system used: CentOS 7

4 Replies
Jean-guillaumeA
Dataiker

Hi, 

Can you clarify what you would expect to see in the Data Catalog?

The challenge with lineage is that most tools rely on SQL to understand what happens where. With Dataiku, you can code your own steps in the flow, so in that case the lineage depends on the code written by the developer.

Knowing what you expect would help us understand the need.

Regards

Jean-Guillaume

zloe
Level 3
Author

Hi,

This is exactly it! I want to see lineage from the input dataset all the way to the end. Vanilla catalog tools use either dbt graphs or plain SQL to understand the connections between datasets, but because we use Dataiku, those connections are lost from the catalog tool's perspective.

So the question is: is there a way to export DAGs from Dataiku in a form that any catalog tool (DataHub or similar) could understand and merge with the lineage from dbt?

 

Thanks!

rrb
Level 1

Hello, how are you?

This is an excellent question. Has there been any feedback?

Thanks!

zloe
Level 3
Author

Hi, there has been no update. Internally, however, we partnered with Ataccama, and they use Dataiku's API to parse projects and then map them to other sources to create complete lineage (table level, not column level) in the catalog.

I would assume that, with enough resources and perseverance, the same thing can be achieved with any existing catalog. It is probably easier with open-source solutions, but then you would be the one developing and supporting it.
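
If you go the open-source route with something like DataHub, the part that makes the lineage join up with dbt is, as far as I understand it, using the same dataset identifier for a DWH table on both sides. Below is a minimal sketch of that mapping, assuming you already have (upstream, downstream) pairs of Dataiku dataset names and a hand-maintained table_map from Dataiku datasets to the DWH tables behind them. The function names, the "dataiku" platform label and the "snowflake" platform in the usage example are placeholders, not something DataHub or Dataiku defines for you. Only the URN layout follows DataHub's dataset URN convention.

```python
from typing import Dict, Iterable, List, Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's urn:li:dataset:(platform,name,env) layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def to_catalog_lineage(
    edges: Iterable[Tuple[str, str]],
    table_map: Dict[str, str],
    dwh_platform: str,
) -> Dict[str, List[str]]:
    """Group upstream URNs under each downstream URN.

    Datasets listed in table_map get the URN of their DWH table, so they
    coincide with the nodes a dbt ingestion would create. Everything else
    is filed under a hypothetical "dataiku" platform.
    """

    def urn_for(ref: str) -> str:
        table = table_map.get(ref)
        if table is not None:
            return dataset_urn(dwh_platform, table)
        return dataset_urn("dataiku", ref)  # placeholder platform name

    lineage: Dict[str, List[str]] = {}
    for upstream, downstream in edges:
        upstream_urns = lineage.setdefault(urn_for(downstream), [])
        urn = urn_for(upstream)
        if urn not in upstream_urns:  # skip duplicate edges
            upstream_urns.append(urn)
    return lineage


if __name__ == "__main__":
    example = to_catalog_lineage(
        edges=[("SALES.orders", "SALES.orders_enriched")],
        table_map={"SALES.orders": "analytics.public.orders"},
        dwh_platform="snowflake",  # example only; use whatever your DWH is
    )
    for downstream, upstreams in example.items():
        print(downstream, "<-", upstreams)
```

The resulting dict still has to be pushed to the catalog with whatever client it offers. The mapping from Dataiku datasets to physical tables is the part you would have to maintain, or derive from each dataset's connection settings.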