Managed-datasets Metadata Synchronization Across Multiple DSS Instances

Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 419 Neuron

Use Case

As an organization, we use three distinct DSS instances to manage our data analytics and ML workflows:

  1. Self-Service and Data Products Consumption Instance: For end-users to consume data products, and work independently by having access to curated data.
  2. Design and Development Instance: For designing and developing data products that we intend to operationalize.
  3. DSS Automation Instance: For putting critical data products into production.

A significant portion of our critical data products consists of datasets created through ETL processes that adhere to specific data models. These datasets are shared across the self-service and design instances. However, a critical challenge we face is the lack of metadata synchronization with the production node: the datasets generated in the automation instance are stored and shared in a data warehouse, but their metadata is inaccessible to the self-service and design instances.

Feature Request

We request the introduction of a feature that enables metadata synchronization (and lineage within Dataiku projects) between datasets shared across distinct DSS instances.

Proposed Method

A feasible method for implementing this feature could involve centralizing the data catalogues in a single instance. This approach is similar to the functionality currently available in the Govern and Deployer Dataiku instances.

Benefits

  1. Consistent Metadata: Ensures that all instances have access to the most up-to-date and consistent metadata, improving data governance and accuracy.
  2. Enhanced Collaboration: Facilitates better collaboration between teams working on different instances by providing a unified view of dataset metadata.
  3. Improved Efficiency: Reduces the need for manual metadata management and synchronization, leading to more efficient data operations.

Conclusion

Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. This feature would streamline our workflows and improve overall productivity and data governance. Furthermore, it would be useful to any other organization that handles some of its ETL processes in Dataiku.


Comments

  • Registered Posts: 1

    I see it's been a while since this was discussed, but I'm curious—has anyone found a good way to automate metadata sync across multiple DSS instances? I imagine custom APIs or some scripting might work, but I'd love to hear if anyone has a setup that works well in practice. Also, how do you handle conflicts when metadata updates happen in different instances at the same time?

  • Registered Posts: 2 ✭✭

    You could try using an external database to store and sync metadata across instances. A central metadata repository with scheduled sync jobs might help keep everything consistent and up to date.
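    The central repository suggested above could be sketched as follows. This is a minimal illustration, not a Dataiku feature: the table layout, the `PROJECT.dataset` key format, and the function names `open_repository`, `publish`, and `fetch` are all assumptions made for the example. It also shows one simple answer to the earlier question about conflicting updates: the record with the newest timestamp wins.

```python
import sqlite3

# Hypothetical schema for a shared metadata store; not a Dataiku structure.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dataset_metadata (
    dataset_key     TEXT PRIMARY KEY,  -- assumed format: "PROJECT.dataset"
    source_instance TEXT NOT NULL,     -- which DSS instance published the record
    description     TEXT,
    updated_at      TEXT NOT NULL      -- ISO-8601 UTC timestamp, same format everywhere
)
"""


def open_repository(path=":memory:"):
    """Open (and initialise if needed) the central metadata store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def publish(conn, dataset_key, source_instance, description, updated_at):
    """Store a record unless the repository already holds a newer or equal one.

    Returns True if the record was written, False if it was rejected as stale.
    Timestamps are compared as strings, which is only valid when every
    instance writes the same ISO-8601 UTC format.
    """
    row = conn.execute(
        "SELECT updated_at FROM dataset_metadata WHERE dataset_key = ?",
        (dataset_key,),
    ).fetchone()
    if row is not None and row[0] >= updated_at:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO dataset_metadata VALUES (?, ?, ?, ?)",
        (dataset_key, source_instance, description, updated_at),
    )
    conn.commit()
    return True


def fetch(conn, dataset_key):
    """Return the stored record as a dict, or None if the key is unknown."""
    row = conn.execute(
        "SELECT source_instance, description, updated_at "
        "FROM dataset_metadata WHERE dataset_key = ?",
        (dataset_key,),
    ).fetchone()
    if row is None:
        return None
    return {"source_instance": row[0], "description": row[1], "updated_at": row[2]}
```

    In practice each instance would run a scheduled job that calls `publish` for its own datasets and `fetch` for the ones it imports. A shared PostgreSQL database would be a more realistic backend than SQLite; SQLite is used here only so the sketch runs on its own.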

  • Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 419 Neuron

    Yes, but that would be like using an external data catalog solution, unless you are talking about the metadata that Dataiku uses internally, so that the Data Catalogs for each DSS instance remain the same and the lineage is also shown in the Dataiku interface.

    In that sense, @Stemmkoli's curiosity goes in the right direction. If the Dataiku API somehow allowed us to write scripts that sync the Dataiku metadata between instances, that would be great.

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,389 Neuron

    Wouldn't a project deployment sync your datasets' metadata? I don't see the need to synchronize it outside of a deployment. Or am I missing something here?

  • Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 419 Neuron
    edited April 3

    OK, I see that this is something missing from the product idea description. The idea is that in the self-service instance, the user can interact with the data products made in the automation node, using the same new lineage features that Dataiku has been introducing.

    This is still not clear enough (sorry, the language barrier is interfering with my ideas), so let me give another example. If I go to the automation node (where many ETLs are run, and schemas and tables are created), I can create a Data Collection using the Dataiku capabilities. The Data Collection in the automation node will have all the information we have added to the tables, like comments in Dataiku, stewards, last build, etc. Also, inspecting the lineage with the Dataiku interface works out of the box.

    However, the users in the self-service node are not allowed to work in the automation node. So we need to build a new Data Collection in the self-service node's data catalog to try to match what is visible in the automation node, but we lose all the lineage features, the Dataiku comments, the stewards, the last build date, the data quality checks, etc.

    If we ran the ETLs in the self-service node instead of the automation node, we wouldn't have this problem, but that would defeat the purpose of having an automation node.

    Please let me know if it is still unclear.

  • Registered Posts: 2 ✭✭
    edited 4:52AM

    I ran into a similar issue before, and one thing that helped was setting up a regular metadata sync using APIs. If you're trying to maintain consistency across multiple DSS instances, automating the sync with a scheduled job can reduce mismatches. It's kind of like how top DAM solutions for photographers maintain metadata integrity across different platforms, ensuring files stay organized no matter where they're accessed.
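    A scheduled API sync of the kind described above might look like the sketch below. It assumes the two client objects behave like `dataikuapi.DSSClient` (with `get_project`, `get_dataset`, `get_metadata`, and `set_metadata`), and that the dataset exists under the same project key and name on both instances, which may not hold when tables are re-imported from a warehouse. The function name `sync_dataset_metadata` and the choice of fields are hypothetical. It copies only fields such as tags and custom fields; it does not transfer lineage, build history, or data quality results.

```python
import copy


def sync_dataset_metadata(source_client, target_client, project_key,
                          dataset_names, fields=("tags", "custom")):
    """Copy selected metadata fields from the source instance to the target.

    Both clients are assumed to expose the dataikuapi-style calls
    get_project(), get_dataset(), get_metadata() and set_metadata().
    Returns the names of the datasets whose target metadata was changed.
    """
    source_project = source_client.get_project(project_key)
    target_project = target_client.get_project(project_key)
    changed = []
    for name in dataset_names:
        source_md = source_project.get_dataset(name).get_metadata()
        target_dataset = target_project.get_dataset(name)
        target_md = target_dataset.get_metadata()

        # Start from the target's metadata so unrelated fields are preserved.
        updated = dict(target_md)
        for field in fields:
            if field in source_md:
                updated[field] = copy.deepcopy(source_md[field])

        # Write only when something differs, to avoid needless updates.
        if updated != target_md:
            target_dataset.set_metadata(updated)
            changed.append(name)
    return changed
```

    Against real instances, the clients would be created with something like `dataikuapi.DSSClient(host, api_key)` and the function run from a scheduled scenario or cron job; the host and key values are deployment-specific.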

  • Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 419 Neuron

    Thank you, everyone, for your comments. My description of the product idea was pretty confusing, because I was not specific about which metadata I meant.

    I was not talking about the metadata that can be stored in the target databases (like table and column descriptions), but about Dataiku's own metadata associated with each dataset, which provides the information for the DSS interfaces available under "Data Catalog", "Data Collection", and "Column Lineage".

    Here is a link to a video where I try to show what I'm talking about:

    Notice how the Dataiku metadata in the "Data Collection" I create in the Automation node (the last build, dataset comments, data quality, and lineage) is lost when I then create a "Data Collection" in the self-service node.

    Of course, this is not a bug: I'm connecting to the new datasets by importing the prepared tables from our PostgreSQL data warehouse. But I don't see how I can (easily) communicate or synchronize the Dataiku metadata about the parent flow, last build time, data quality criteria, etc.
