Added on June 11, 2024 2:34PM
As an organization, we utilize three distinct DSS instances (design, automation, and self-service) to manage our data analytics and ML workflows.
A significant portion of our critical data products consists of datasets created through ETL processes that adhere to specific data models. These datasets are shared across the self-service and design instances. However, a critical challenge we face is the lack of metadata synchronization with the production node. The metadata of datasets generated in the automation instance (datasets which are stored and shared in a data warehouse) is inaccessible to the self-service and design instances.
We request the introduction of a feature that enables metadata synchronization (and lineage within Dataiku projects) between datasets shared across distinct DSS instances.
A feasible method for implementing this feature could involve centralizing the data catalogues in a single instance. This approach is similar to the functionality currently available in the Govern and Deployer Dataiku instances.
Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. This feature would streamline our workflows and improve overall productivity and data governance. Furthermore, it would be useful to any other organization handling some of their ETL processes in Dataiku.
I see it's been a while since this was discussed, but I'm curious—has anyone found a good way to automate metadata sync across multiple DSS instances? I imagine custom APIs or some scripting might work, but I'd love to hear if anyone has a setup that works well in practice. Also, how do you handle conflicts when metadata updates happen in different instances at the same time?
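On the conflict question, one simple policy is last-writer-wins: each instance records an update timestamp alongside its metadata, and the newer version is kept. A minimal sketch of that idea; the `updated_at` field name is illustrative, not anything Dataiku provides:

```python
# Last-writer-wins conflict resolution for metadata dicts coming from
# two different DSS instances. Assumes each dict carries an
# "updated_at" epoch timestamp (an illustrative convention, not a
# built-in Dataiku field).

def resolve_conflict(a, b):
    """Return the metadata dict with the newer 'updated_at' timestamp."""
    return a if a.get("updated_at", 0) >= b.get("updated_at", 0) else b
```

Last-writer-wins is crude (a stale instance can silently overwrite a field-level change), but it is deterministic and easy to reason about; anything finer-grained would need per-field merging.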
You could try using an external database to store and sync metadata across instances. A central metadata repository with scheduled sync jobs might help keep everything consistent and up to date.
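To make the central-repository idea concrete, here is a minimal sketch where each instance pushes its dataset metadata as JSON rows keyed by (project, dataset), and other instances pull them. SQLite stands in for the shared external database; the schema and field names are assumptions for illustration:

```python
import json
import sqlite3
import time

# Sketch of a central metadata repository shared by several DSS
# instances. Each instance periodically pushes dataset metadata
# (tags, descriptions, last-build time, ...) as a JSON row; the most
# recent push for a given (project, dataset) wins.

def init_repo(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dataset_metadata (
               instance TEXT, project TEXT, dataset TEXT,
               metadata TEXT, updated_at REAL,
               PRIMARY KEY (project, dataset))"""
    )

def push_metadata(conn, instance, project, dataset, metadata):
    # INSERT OR REPLACE: the latest push overwrites the previous row.
    conn.execute(
        "INSERT OR REPLACE INTO dataset_metadata VALUES (?, ?, ?, ?, ?)",
        (instance, project, dataset, json.dumps(metadata), time.time()),
    )

def pull_metadata(conn, project, dataset):
    row = conn.execute(
        "SELECT metadata FROM dataset_metadata WHERE project = ? AND dataset = ?",
        (project, dataset),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

In practice the automation node would run the push on a schedule, and the self-service and design nodes would pull on demand; a real deployment would use the existing data warehouse rather than SQLite.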
Yes, but that would be like using an external data catalog solution. Unless you are talking about the metadata that Dataiku uses internally, so the Data Catalogs for each DSS instance remain the same, and the lineage is also shown in the Dataiku interface.
In that sense, @Stemmkoli's curiosity goes in the right direction. If the Dataiku API somehow allowed creating scripts to sync the Dataiku metadata between instances, that would be great.
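The public Python client (`dataikuapi`) does expose dataset metadata, so a partial sync is scriptable today. A hedged sketch, assuming matching project/dataset names on both instances; the hosts and API keys are placeholders, and note that `get_metadata()`/`set_metadata()` cover tags, checklists, and custom fields, but not the computed lineage shown in the interface:

```python
# Sketch of cross-instance metadata sync with Dataiku's public Python
# client (the dataiku-api-client package). Hosts, keys, and the choice
# of fields to copy are assumptions; lineage itself is computed by each
# instance and cannot be pushed this way.

def merge_dataset_metadata(source_meta, target_meta):
    """Copy catalog-relevant fields from the source metadata dict onto
    a copy of the target metadata dict."""
    merged = dict(target_meta)
    for key in ("tags", "custom", "checklists"):
        if key in source_meta:
            merged[key] = source_meta[key]
    return merged

def sync_dataset(src_host, src_key, dst_host, dst_key, project, dataset):
    import dataikuapi  # requires: pip install dataiku-api-client
    src = (dataikuapi.DSSClient(src_host, src_key)
           .get_project(project).get_dataset(dataset))
    dst = (dataikuapi.DSSClient(dst_host, dst_key)
           .get_project(project).get_dataset(dataset))
    dst.set_metadata(merge_dataset_metadata(src.get_metadata(),
                                            dst.get_metadata()))
```

This would cover tags and descriptions but not the Flow-level lineage, which is exactly the gap the product idea is about.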
Wouldn't a project deployment sync your datasets' metadata? I don't see the extra need to synchronize this without a deployment. Or am I missing something here?
OK, I see that this is something missing from the product idea description. The idea is that in the self-service instance, the user can interact with the data products made in the automation node, using the same new lineage features that Dataiku has been introducing.
This is still not clear enough (sorry, the language barrier is interfering with my ideas), so let me give another example. If I go to the automation node (where many ETLs run, and schemas and tables are created), I can create a Data Collection using Dataiku's capabilities. The Data Collection in the automation node will have all the information we have added to the tables, such as Dataiku comments, stewards, last build, etc. Also, inspecting the lineage in the Dataiku interface works out of the box.
However, the users in the self-service node are not allowed to work in the automation node. So, we need to build a new Data Collection in the self-service node's data catalog to try to match what is visible in the automation node, but we lose all the lineage features, the Dataiku comments, the stewards, the last build date, the data quality checks, etc.
If we ran the ETLs in the self-service node instead of the automation node, we wouldn't have this problem, but that would defeat the purpose of having an automation node.
Please let me know if it is still unclear.
I ran into a similar issue before, and one thing that helped was setting up a regular metadata sync using APIs. If you're trying to maintain consistency across multiple DSS instances, automating the sync with a scheduled job can reduce mismatches. It's kind of like how top DAM solutions for photographers maintain metadata integrity across different platforms, ensuring files stay organized no matter where they're accessed.
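To sketch the "scheduled job" part: a small staleness check decides which datasets are due for a re-sync, and the actual sync call is plugged in as a function. In practice this would run from cron or a DSS scenario; the one-hour interval and the dict-of-timestamps shape are assumptions:

```python
import time

# Sketch of a scheduled metadata sync driver. `datasets` maps dataset
# name -> epoch timestamp of its last sync (None if never synced);
# `sync_one` is whatever function actually pushes the metadata.
# The interval is an illustrative choice, not a Dataiku default.

SYNC_INTERVAL_S = 3600  # re-sync at most once per hour

def due_for_sync(last_synced_at, now=None, interval=SYNC_INTERVAL_S):
    now = time.time() if now is None else now
    return last_synced_at is None or (now - last_synced_at) >= interval

def run_pending(datasets, sync_one, now=None):
    """Call sync_one(name) for each dataset whose metadata is stale,
    and return the list of dataset names that were synced."""
    synced = []
    for name, last_synced_at in datasets.items():
        if due_for_sync(last_synced_at, now=now):
            sync_one(name)
            synced.append(name)
    return synced
```

Keeping the scheduling logic separate from the sync call makes it easy to swap in the API-based sync discussed above, or to test the driver without touching any instance.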
Thank you, everyone, for your comments. I definitely gave a confusing description of the product idea by not being specific about which metadata I meant.
I was not talking about the metadata that can be stored in the target databases (like table and column descriptions), but about Dataiku's own metadata associated with each dataset, which provides the information for the DSS interfaces available under "Data Catalog", "Data Collection", and "Column Lineage".
Here is a link to a video where I try to show what I'm talking about:
Notice how the "Data Collection" I create in the automation node, which includes Dataiku metadata such as the last build, dataset comments, data quality, and lineage, is lost when I then create a "Data Collection" in the self-service node.
Of course, this is not a bug; I'm connecting to the new datasets by importing the prepared tables from our PostgreSQL data warehouse. But I don't see how I can easily communicate or synchronize the Dataiku metadata about the parent flow, last build time, data quality criteria, etc.