Managed-datasets Metadata Synchronization Across Multiple DSS Instances

Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 417 Neuron

Use Case

As an organization, we use three distinct DSS instances to manage our data analytics and ML workflows:

  1. Self-Service and Data Products Consumption Instance: For end users to consume data products and work independently with access to curated data.
  2. Design and Development Instance: For designing and developing data products that we intend to operationalize.
  3. DSS Automation Instance: For putting critical data products into production.

A significant portion of our critical data products consists of datasets created through ETL processes that adhere to specific data models. These datasets are shared across the self-service and design instances. However, a critical challenge we face is the lack of metadata synchronization with the production node. The datasets are generated in the automation instance and are stored and shared in a data warehouse, but their metadata is inaccessible to the self-service and design instances.

Feature Request

We request the introduction of a feature that enables metadata synchronization (and lineage within Dataiku projects) between datasets shared across distinct DSS instances.

Proposed Method

A feasible method for implementing this feature could involve centralizing the data catalogues in a single instance. This approach is similar to the functionality currently available in the Govern and Deployer Dataiku instances.

Benefits

  1. Consistent Metadata: Ensures that all instances have access to the most up-to-date and consistent metadata, improving data governance and accuracy.
  2. Enhanced Collaboration: Facilitates better collaboration between teams working on different instances by providing a unified view of dataset metadata.
  3. Improved Efficiency: Reduces the need for manual metadata management and synchronization, leading to more efficient data operations.

Conclusion

Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. This feature would streamline our workflows and improve overall productivity and data governance. Furthermore, it would be useful to any other organization that handles some of its ETL processes in Dataiku.

Comments

  • Registered Posts: 1

    I see it's been a while since this was discussed, but I'm curious—has anyone found a good way to automate metadata sync across multiple DSS instances? I imagine custom APIs or some scripting might work, but I'd love to hear if anyone has a setup that works well in practice. Also, how do you handle conflicts when metadata updates happen in different instances at the same time?

  • Registered Posts: 1

    You could try using an external database to store and sync metadata across instances. A central metadata repository with scheduled sync jobs might help keep everything consistent and up to date.
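
    One way to picture the scheduled sync job, and the conflict question raised above, is a last-writer-wins merge into the central repository. This is only a sketch under assumptions: the record layout (`description`, `updated_at`), the dataset keys, and the function name `merge_into_repository` are all hypothetical, and the repository is a plain dictionary standing in for the external database.

```python
def merge_into_repository(repository, instance_name, local_records):
    """Merge one instance's metadata records into a central repository.

    Hypothetical layout: both arguments map a dataset key to a dict that
    carries an 'updated_at' timestamp. A record replaces the stored one
    only when it is strictly newer (last writer wins).
    Returns the list of dataset keys that were updated.
    """
    applied = []
    for dataset_key, record in local_records.items():
        current = repository.get(dataset_key)
        if current is None or record["updated_at"] > current["updated_at"]:
            # Remember which instance supplied the winning version.
            repository[dataset_key] = {**record, "source_instance": instance_name}
            applied.append(dataset_key)
    return applied


# Example run with made-up records from two instances.
central = {}
merge_into_repository(
    central, "design",
    {"DWH.orders": {"description": "Orders v1", "updated_at": 100}},
)
merge_into_repository(
    central, "automation",
    {"DWH.orders": {"description": "Orders v2", "updated_at": 200}},
)
```

    Last-writer-wins is the simplest policy, not necessarily the right one: it silently discards the older edit, so a real setup might prefer to treat one instance (for example the automation node) as the source of truth.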

  • Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 417 Neuron

    Yes, but that would be like using an external data catalog solution, unless you are talking about the metadata that Dataiku uses internally, so that the Data Catalogs of each DSS instance stay the same and the lineage is also shown in the Dataiku interface.

    In that sense, @Stemmkoli's curiosity goes in the right direction. If the Dataiku API somehow allowed us to write scripts that sync the Dataiku metadata between instances, that would be great.
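
    For what it's worth, dataset handles in the public Python client (`dataikuapi`) expose `get_metadata()` and `set_metadata()`, so a partial script is possible today. The sketch below is an assumption-laden illustration, not a supported sync mechanism: it assumes the metadata dictionary carries `tags` and `custom` keys, the function name `sync_dataset_metadata` is made up, and it does nothing for lineage or the Data Catalog.

```python
def sync_dataset_metadata(source_dataset, target_dataset, dry_run=False):
    """Copy tags and custom key/values from one dataset handle to another.

    Both arguments are expected to behave like dataikuapi dataset handles,
    i.e. to offer get_metadata() and set_metadata(dict).
    Returns True when the target differed from the source.
    """
    source_meta = source_dataset.get_metadata()
    target_meta = target_dataset.get_metadata()
    changed = False
    # Assumed keys; other parts of the target metadata are left untouched.
    for key in ("tags", "custom"):
        if key in source_meta and target_meta.get(key) != source_meta[key]:
            target_meta[key] = source_meta[key]
            changed = True
    if changed and not dry_run:
        target_dataset.set_metadata(target_meta)
    return changed


# Hypothetical wiring (hosts, API keys, project and dataset names are placeholders):
#   import dataikuapi
#   automation = dataikuapi.DSSClient("https://automation.example.com", "API_KEY_1")
#   design = dataikuapi.DSSClient("https://design.example.com", "API_KEY_2")
#   sync_dataset_metadata(
#       automation.get_project("MYPROJECT").get_dataset("orders"),
#       design.get_project("MYPROJECT").get_dataset("orders"),
#   )
```

    Running it with `dry_run=True` first reports whether anything would change without writing to the target instance.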

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,371 Neuron

    Wouldn't a project deployment sync your dataset metadata? I don't see the need to synchronize it outside of a deployment. Or am I missing something here?
