External data catalog integration

Niyazi · January 2023

Hi everyone,

I'm looking for a way to integrate Dataiku with a standalone data catalog tool such as DataHub. The need arises because some initial data loading and transformation happens inside the DWH, through an orchestration tool like Airflow and a transformation tool like dbt. This creates the initial datasets that are then used inside Dataiku.

Inside Dataiku we can track lineage with, for example, the Thread plugin. Given properly described attributes, this plugin builds a catalog with lineage across projects, tags, descriptions and so on. However, it cannot see what happens before data is added to Dataiku.

On the other hand, third-party data catalog tools can work with dbt and Airflow and can query the DWH itself, but they cannot see what happens to datasets inside Dataiku.

Hence the question: is there a way to integrate Dataiku's lineage into third-party data catalog / data observability platforms? DataHub is one example, but other options are available. Has anyone tried doing something similar?

Operating system used: CentOS 7

Jean-Guillaume · January 2023

Hi,

Can you clarify what you would expect to see in the Data Catalog?

The challenge with lineage is that most of the tools are leveraging SQL to understand what happens where. With Dataiku, you can code your own step of the flow so lineage in this case must be specific to the code written by the developer.

Understanding the expectation would help understand the need.

Regards

Jean-Guillaume

Niyazi · January 2023

Hi,

This is exactly it! I want to see lineage from the input dataset all the way to the end. Vanilla catalog tools use either dbt graphs or plain SQL to understand the connections between datasets, but because we use Dataiku, those connections are lost from the catalog tool's perspective.

So the question is: is there a way to export DAGs from Dataiku in a form that any catalog tool (DataHub or similar) could understand and merge with the lineage from dbt?

Thanks!

rrb · January 2024

Hello, how are you?

This is an excellent question. Has there been any feedback?

Thanks!

Niyazi · January 2024

Hi, there has been no update. Internally, however, we partnered with Ataccama, and they use Dataiku's API to parse projects and then map the results to our other sources to create a complete lineage (table level, not column level) in the catalog.

I would assume that, with enough resources and perseverance, the same thing can be achieved with any existing catalog. It is probably easier with open-source solutions, but then you'd be the one to develop and support it.
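
For anyone attempting this themselves, here is a minimal sketch of the table-level idea, not the Ataccama implementation. It assumes each recipe definition is a dict whose `inputs` and `outputs` hold roles containing `items` with a `ref` to a dataset, which is the shape I believe the `dataikuapi` package returns from a project's `list_recipes()`; please verify that against your DSS version. The project key, dataset names, `table_of` mapping and dbt edges below are all made up for illustration.

```python
from collections import defaultdict


def refs(role_map):
    """Collect dataset refs from a recipe's inputs or outputs mapping."""
    return [
        item["ref"]
        for role in (role_map or {}).values()
        for item in role.get("items", [])
    ]


def recipe_edges(project_key, recipes):
    """Turn recipe definitions into (upstream, downstream, recipe) edges."""
    def qualify(ref):
        # Assumption: refs to datasets from another project already
        # carry that project's key as a "KEY." prefix.
        return ref if "." in ref else f"{project_key}.{ref}"

    edges = []
    for recipe in recipes:
        for src in refs(recipe.get("inputs")):
            for dst in refs(recipe.get("outputs")):
                edges.append((qualify(src), qualify(dst), recipe["name"]))
    return edges


def upstream(edges, target):
    """Return every node that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for src, dst, _ in edges:
        parents[dst].add(src)
    seen, stack = set(), [target]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical recipes in the assumed shape.
sample_recipes = [
    {"name": "compute_orders_clean", "type": "python",
     "inputs": {"main": {"items": [{"ref": "orders_raw"}]}},
     "outputs": {"main": {"items": [{"ref": "orders_clean"}]}}},
    {"name": "join_orders_customers", "type": "join",
     "inputs": {"main": {"items": [{"ref": "orders_clean"},
                                   {"ref": "CRM.customers"}]}},
     "outputs": {"main": {"items": [{"ref": "orders_enriched"}]}}},
]
dataiku_edges = recipe_edges("SALES", sample_recipes)

# Hypothetical mapping from Dataiku datasets to the warehouse tables behind
# them, plus hypothetical edges taken from the dbt graph.
table_of = {"SALES.orders_raw": "dwh.staging.orders"}
dbt_edges = [("dwh.raw.orders", "dwh.staging.orders", "dbt:stg_orders")]

merged = dbt_edges + [
    (table_of.get(src, src), table_of.get(dst, dst), recipe)
    for src, dst, recipe in dataiku_edges
]
print(sorted(upstream(merged, "SALES.orders_enriched")))
```

The hard part in practice is the `table_of` mapping: lineage only joins up if every Dataiku dataset that sits on a warehouse table is resolved to the same name the catalog uses for that table. Pushing the merged edges into a specific catalog is a separate step that depends on that catalog's ingestion API.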

hari · September 2024

Hello, I'm not sure if this topic is still active, but is the partnership with Ataccama still ongoing? I haven't found any documentation or web articles about it. We're interested in a data cataloging tool and I'd like to know if there have been any updates since January 2023.

crodey · April 3

@Jean-Guillaume

> The challenge with lineage is that most of the tools are leveraging SQL to understand what happens where. With Dataiku, you can code your own step of the flow so lineage in this case must be specific to the code written by the developer.

Well yes. I believe the challenge is that at many organizations, Dataiku is only one data transformation framework of many. It would be convenient to bring lineage of Dataiku datasets into an external tool more focused on lineage and observability, like DataHub.

Anyone currently working on this?
