External data catalog integration

Niyazi
Niyazi Registered Posts: 12 ✭✭✭✭

Hi everyone,

I'm looking for a way to integrate Dataiku with a standalone data catalog tool such as DataHub. In our setup, the initial data load and transformation happen inside the DWH, driven by an orchestration tool like Airflow and a transformation tool like dbt. This produces the initial datasets that are then used inside Dataiku.

Inside Dataiku we can track lineage with, for example, the Thread plugin. Given properly described attributes, this plugin builds a catalog with lineage across projects, tags, descriptions and so on. However, it cannot see what happens before the data is added to Dataiku.

On the other hand, third-party data catalog tools can work with dbt and Airflow and can query the DWH itself, but they cannot see what happens to datasets inside Dataiku.

Hence the question: is there a way to integrate Dataiku's lineage into third-party data catalog / data observability platforms? DataHub would be one example, but other options are available. Has anyone tried something similar?


Operating system used: CentOS 7


Answers

  • Jean-guillaumeA
    Jean-guillaumeA Dataiker, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 2 Dataiker

    Hi,

    Can you clarify what you would expect to see in the Data Catalog?

    The challenge with lineage is that most tools rely on SQL to understand what happens where. With Dataiku, you can code your own steps in the flow, so lineage in that case depends on the code written by the developer.

    Understanding your expectations would help us understand the need.

    Regards

    Jean-Guillaume

  • Niyazi
    Niyazi Registered Posts: 12 ✭✭✭✭

    Hi,

    This is exactly it! I want to see lineage from the input dataset all the way to the end. Vanilla catalog tools use either dbt graphs or plain SQL to understand the connections between datasets, but because we use Dataiku, those connections are lost from the catalog tool's perspective.

    So the question is: is there a way to export DAGs from Dataiku in a form that any catalog tool (DataHub or similar) could understand and merge with the lineage from dbt?

    Thanks!

  • rrb
    rrb Registered Posts: 1

    Hello, how are you?

    This is an excellent question. Has there been any feedback?

    Thanks!

  • Niyazi
    Niyazi Registered Posts: 12 ✭✭✭✭

    Hi, there has been no update. Internally, though, we partnered with Ataccama, and they use Dataiku's API to parse projects and then map them to other sources to create a complete lineage (table level, not column level) in the catalog.
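
As a rough illustration of that "parse projects, then map" idea (not Ataccama's actual implementation, which is not described here), the sketch below turns recipe definitions into table-level edges. It assumes the recipe definitions have already been fetched through Dataiku's public API and look like `{"name": ..., "inputs": {role: {"items": [{"ref": ...}]}}, "outputs": {...}}`, and that datasets from other projects are referenced as `PROJECT.name`. Both the dict shape and the helper names (`recipe_edges`, `to_tables`, `table_of`) are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# (upstream, downstream, recipe that connects them)
Edge = Tuple[str, str, str]


def _refs(role_map: dict) -> List[str]:
    """Flatten {"role": {"items": [{"ref": ...}]}} into a list of refs."""
    return [item["ref"] for role in role_map.values() for item in role.get("items", [])]


def qualify(project_key: str, ref: str) -> str:
    """Prefix local refs with the project key (assumes foreign refs contain a dot)."""
    return ref if "." in ref else f"{project_key}.{ref}"


def recipe_edges(project_key: str, recipes: Iterable[dict]) -> List[Edge]:
    """Emit one edge per (input, output) pair of every recipe."""
    edges: List[Edge] = []
    for recipe in recipes:
        inputs = _refs(recipe.get("inputs", {}))
        outputs = _refs(recipe.get("outputs", {}))
        for src in inputs:
            for dst in outputs:
                edges.append(
                    (qualify(project_key, src), qualify(project_key, dst), recipe["name"])
                )
    return edges


def to_tables(edges: Iterable[Edge], table_of: Dict[str, str]) -> List[Edge]:
    """Swap dataset names for warehouse table names where a mapping is known."""
    return [(table_of.get(s, s), table_of.get(d, d), r) for s, d, r in edges]


# Hypothetical recipe definitions for a project with key "SALES".
recipes = [
    {
        "name": "compute_orders_clean",
        "inputs": {"main": {"items": [{"ref": "orders_raw"}]}},
        "outputs": {"main": {"items": [{"ref": "orders_clean"}]}},
    },
    {
        "name": "join_orders_customers",
        "inputs": {"main": {"items": [{"ref": "orders_clean"}, {"ref": "CRM.customers"}]}},
        "outputs": {"main": {"items": [{"ref": "orders_enriched"}]}},
    },
]
edges = recipe_edges("SALES", recipes)

# Hypothetical mapping from a Dataiku dataset to the DWH table behind it.
table_of = {"SALES.orders_raw": "dwh.analytics.orders"}
table_edges = to_tables(edges, table_of)
```

The mapping step is what lets a catalog join this graph to the dbt side: once a Dataiku input dataset is renamed to the warehouse table it reads from, it shares a node with whatever the catalog already knows about that table.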

    I would assume that, with enough resources and perseverance, the same thing can be achieved with any existing catalog. It is probably easier with open-source solutions, but then you'd be the one to develop and support it.
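
For anyone building this themselves, the merging part can be sketched independently of any particular catalog. The example below assumes lineage from each system is available as plain `(upstream, downstream)` pairs using the same table names on both sides; the table names and the helpers `merge_lineage` and `upstream_of` are made up for illustration and are not part of dbt, Dataiku or DataHub.

```python
from collections import defaultdict, deque


def merge_lineage(*edge_lists):
    """Build a child -> set(parents) map from several lists of (upstream, downstream)."""
    parents = defaultdict(set)
    for edges in edge_lists:
        for upstream, downstream in edges:
            parents[downstream].add(upstream)
    return parents


def upstream_of(node, parents):
    """Return every ancestor of node (breadth-first, safe against cycles)."""
    seen, queue = set(), deque([node])
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical edges: one list from the dbt graph, one from Dataiku projects.
dbt_edges = [("raw.orders", "analytics.orders")]
dataiku_edges = [
    ("analytics.orders", "SALES.orders_clean"),
    ("SALES.orders_clean", "SALES.orders_enriched"),
]

lineage = merge_lineage(dbt_edges, dataiku_edges)
ancestors = upstream_of("SALES.orders_enriched", lineage)
```

Here `ancestors` contains the Dataiku dataset, the dbt model and the raw table, which is the end-to-end, table-level view discussed in this thread. The two graphs only connect if both sides name the shared table identically, so most of the real effort tends to go into that naming.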
