Managed-datasets Metadata Synchronization Across Multiple DSS Instances

Ignacio_Toledo

Use Case

Our organization uses three distinct DSS instances to manage its data analytics and ML workflows:

  1. Self-Service and Data Products Consumption Instance: For end-users to consume data products and work independently with access to curated data.
  2. Design and Development Instance: For designing and developing data products that we intend to operationalize.
  3. DSS Automation Instance: For putting critical data products into production.

A significant portion of our critical data products consists of datasets created through ETL processes that adhere to specific data models. These datasets are built in the automation instance, stored in a data warehouse, and shared with the self-service and design instances. However, their metadata is not synchronized from the production node: the metadata of datasets generated in the automation instance is inaccessible to the self-service and design instances, even though the underlying data is shared.

Feature Request

We request a feature that synchronizes metadata (and lineage within Dataiku projects) for datasets shared across distinct DSS instances.

Proposed Method

One feasible way to implement this feature would be to centralize the data catalogues in a single instance. This is similar to how the Govern and Deployer nodes already act as a central point for several DSS instances.
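Until something like this exists natively, the idea can be approximated with a script. The following is only a minimal sketch of what "synchronizing" could mean for a single dataset, not a description of any existing DSS feature. It assumes dataset handles that expose get_metadata() and set_metadata() (as the DSSDataset objects of the dataikuapi Python client do), and it assumes the metadata dictionary carries "tags" and "custom" entries. The helper names merge_metadata and sync_dataset_metadata are hypothetical.

```python
import copy
from typing import Any, Dict, Iterable


def merge_metadata(
    source: Dict[str, Any],
    target: Dict[str, Any],
    fields: Iterable[str] = ("tags", "custom"),  # assumed field names
) -> Dict[str, Any]:
    """Return a copy of `target` with the selected fields taken from `source`.

    Tags are merged (target tags first, duplicates dropped); any other
    selected field is overwritten with the source value. Fields that are
    not selected are left untouched.
    """
    merged = copy.deepcopy(target)
    for field in fields:
        if field not in source:
            continue
        if field == "tags":
            combined = list(merged.get("tags", [])) + list(source["tags"])
            # dict.fromkeys drops duplicates while preserving order
            merged["tags"] = list(dict.fromkeys(combined))
        else:
            merged[field] = copy.deepcopy(source[field])
    return merged


def sync_dataset_metadata(source_ds, target_ds, fields=("tags", "custom")):
    """Copy selected metadata from the source dataset handle to the target one.

    Both handles only need get_metadata() and set_metadata(); this is the
    interface assumed for datasets on the two instances.
    """
    merged = merge_metadata(source_ds.get_metadata(), target_ds.get_metadata(), fields)
    target_ds.set_metadata(merged)
    return merged
```

In practice the two handles could come from two dataikuapi.DSSClient objects, one per instance, via get_project(...).get_dataset(...), with the automation instance as the source and the design or self-service instance as the target. A scheduled scenario running such a script is a workaround at best: it does not carry lineage, which is part of why a centralized catalogue would be preferable.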

Benefits

  1. Consistent Metadata: Ensures that all instances have access to the most up-to-date and consistent metadata, improving data governance and accuracy.
  2. Enhanced Collaboration: Facilitates better collaboration between teams working on different instances by providing a unified view of dataset metadata.
  3. Improved Efficiency: Reduces the need for manual metadata management and synchronization, leading to more efficient data operations.

Conclusion

Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. It would streamline our workflows and improve overall productivity and data governance. Furthermore, this feature would be useful to any other organization that handles some of its ETL processes in Dataiku.
