Added on June 11, 2024 2:34PM
As an organization, we use three distinct DSS instances to manage our data analytics and ML workflows: a design instance, an automation (production) instance, and a self-service instance.
A significant portion of our critical data products consists of datasets created through ETL processes that follow specific data models. These datasets are shared across the self-service and design instances. However, a critical challenge we face is the lack of metadata synchronization with the production node: the metadata of datasets generated in the automation instance (datasets that are stored and shared in a data warehouse) is inaccessible to the self-service and design instances.
We request the introduction of a feature that enables metadata synchronization (and lineage within Dataiku projects) between datasets shared across distinct DSS instances.
A feasible way to implement this feature would be to centralize the data catalogs in a single instance, similar to the functionality currently available in the Govern and Deployer Dataiku instances.
Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. This feature would streamline our workflows and improve overall productivity and data governance. Furthermore, it would be useful to any other organization handling part of their ETL processes in Dataiku.
I see it's been a while since this was discussed, but I'm curious—has anyone found a good way to automate metadata sync across multiple DSS instances? I imagine custom APIs or some scripting might work, but I'd love to hear if anyone has a setup that works well in practice. Also, how do you handle conflicts when metadata updates happen in different instances at the same time?
You could try using an external database to store and sync metadata across instances. A central metadata repository with scheduled sync jobs might help keep everything consistent and up to date.
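To make that concrete, here is a minimal sketch of such a central repository, assuming a simple last-write-wins policy to handle the concurrent-update question raised earlier in the thread: each instance pushes its dataset metadata with a timestamp, and the newest version wins. All names here (`MetadataRepository`, the table and record fields) are illustrative, not part of any Dataiku API, and a real setup would point at a shared warehouse table rather than a local SQLite file.

```python
import json
import sqlite3
import time


class MetadataRepository:
    """Minimal central store for dataset metadata shared by several DSS
    instances. Illustrative only: swap SQLite for your shared warehouse."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS dataset_metadata (
                   dataset_id TEXT PRIMARY KEY,   -- e.g. "PROJECT.dataset"
                   metadata   TEXT NOT NULL,      -- JSON blob (tags, description, ...)
                   updated_at REAL NOT NULL       -- epoch seconds of the last push
               )"""
        )

    def push(self, dataset_id, metadata, updated_at=None):
        """Upsert metadata, keeping the newest version (last-write-wins).

        Returns True if the repository was updated, False if it already
        held a version at least as recent."""
        ts = time.time() if updated_at is None else updated_at
        row = self.conn.execute(
            "SELECT updated_at FROM dataset_metadata WHERE dataset_id = ?",
            (dataset_id,),
        ).fetchone()
        if row is not None and row[0] >= ts:
            return False
        self.conn.execute(
            "INSERT INTO dataset_metadata (dataset_id, metadata, updated_at) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(dataset_id) DO UPDATE SET "
            "metadata = excluded.metadata, updated_at = excluded.updated_at",
            (dataset_id, json.dumps(metadata), ts),
        )
        self.conn.commit()
        return True

    def pull(self, dataset_id):
        """Return the latest metadata for a dataset, or None if unknown."""
        row = self.conn.execute(
            "SELECT metadata FROM dataset_metadata WHERE dataset_id = ?",
            (dataset_id,),
        ).fetchone()
        return None if row is None else json.loads(row[0])
```

A scheduled job on each instance could then push its local metadata and pull everyone else's. Last-write-wins is the simplest conflict policy, but note it silently discards the older of two concurrent edits.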
Yes, but that would amount to using an external data catalog solution. Unless you are talking about the metadata that Dataiku uses internally, so that the Data Catalogs of each DSS instance stay consistent and the lineage is also shown in the Dataiku interface.
In that sense, @Stemmkoli's curiosity goes in the right direction. If the Dataiku API allowed scripts to sync the Dataiku metadata between instances, that would be great.
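The public `dataikuapi` client does expose enough to script a basic copy of dataset metadata (tags, custom fields, checklists) between two instances, via `DSSDataset.get_metadata()` / `set_metadata()`. Below is a minimal sketch under that assumption; the hostnames, API keys, and project/dataset names are placeholders, and it does not cover cross-instance lineage, which the API does not sync for you.

```python
def sync_dataset_metadata(src_dataset, dst_dataset):
    """Copy the metadata dict from one dataset handle to another.

    Works with any pair of objects exposing get_metadata()/set_metadata(),
    e.g. dataikuapi DSSDataset handles on two different instances."""
    metadata = src_dataset.get_metadata()
    dst_dataset.set_metadata(metadata)
    return metadata


def main():
    # Hypothetical connection details: replace with your real hosts and keys.
    import dataikuapi

    design = dataikuapi.DSSClient("https://design.example.com", "DESIGN_API_KEY")
    automation = dataikuapi.DSSClient("https://automation.example.com", "AUTOMATION_API_KEY")

    src = automation.get_project("SALES_ETL").get_dataset("customers")
    dst = design.get_project("SALES_ETL").get_dataset("customers")

    # One-way sync: automation (production) metadata wins on the design node.
    sync_dataset_metadata(src, dst)
```

Run from a scheduled scenario or cron job, this gives a one-way, production-wins sync; a two-way setup would additionally need some conflict policy for the concurrent-update case raised earlier in the thread.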
Wouldn't a project deployment sync your datasets' metadata? I don't see the extra need to synchronize this without a deployment. Or am I missing something here?