Data Set Name Alias

Problem: Data set names aren't user-friendly in hindsight

Example: My data set is named GreatData. I run a prep step and the output defaults to "GreatData_prepared", which is fine, but later I decide this is the "final data set" that should be used by others, and I'd like a more intuitive name. I understand changing data set names is not recommended.

Solution: Could we have alias names for data sets? Then I could create an alias for this data set called "User_Demographics", "Final_GreatData", "Dashboard_GreatData", etc., and the flow could offer an alias-name view.

12 Comments

Hi @VMaus, I actually rename datasets regularly and haven't had a problem yet. I do need to change references manually in the associated SQL Script and Python recipes, but after I do that all seems fine. I agree that the first name I select often isn't what I ultimately want, and the renaming is worth it to me because more accurate, descriptive names make it easier for me and others to understand the flow later.
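
Concretely, the manual fix-up in a code recipe is just swapping the dataset name wherever it appears. Here's a minimal sketch of what that looks like in a Python recipe, using the names from the original post ("Final_GreatData_scored" is a hypothetical output dataset, added only for illustration):

```python
import dataiku

# Before the rename, the recipe read its input like this:
# df = dataiku.Dataset("GreatData_prepared").get_dataframe()

# After renaming the dataset in the Flow, every code recipe that still
# references the old name has to be edited by hand to use the new one:
df = dataiku.Dataset("Final_GreatData").get_dataframe()

# ... transformation logic unchanged ...

# Output references need the same treatment if the output was renamed too.
# ("Final_GreatData_scored" is a hypothetical output dataset name.)
out = dataiku.Dataset("Final_GreatData_scored")
out.write_with_schema(df)
```

The SQL Script case is similar in spirit: any reference tied to the old dataset name in the script needs the same manual update.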

Ideally renaming datasets would be a fully supported operation.

Marlan

VMaus
Level 2

Thanks for that feedback, @Marlan. I've been too afraid to change the names and have ended up adding descriptions to help with this.

@Marlan,

Years ago, when I first started using DSS (somewhere around V4 or V5), I made Swiss cheese out of a project by renaming a dataset and could never recover it. Since then I've avoided changing names.

I agree: default names never capture what the data set is actually about, and by the time you move the project to production or turn it over to someone else, the dataset names never really make any sense.

I have been known to walk through parts of projects connecting new datasets with better names to existing steps, running the step, and then connecting the next visual step to the newly created, better-named dataset. However, that is painful in all kinds of ways.

Based on your comments, and with 4 or 5 versions' worth of bugs having been cleaned up in DSS, I might try renaming datasets again.

Making it possible to refactor DSS project element names in all areas of the system would be a great help for reusability and discoverability.

--Tom

Hi @tgb417, note that the context in which I've renamed datasets has always been with associated SQL Script or Python recipes. Renaming might be more risky with visual recipes. Just wanted to share that caveat. 

Marlan 

@Marlan,

Noted.  

--Tom

AshleyW
Dataiker

Status changed to: In the Backlog

Thanks for your idea, @VMaus. Your idea meets the criteria for submission, we'll reach out should we require more information.

If you're reading this post and think that being able to easily rename datasets would be a great capability to add to DSS, be sure to kudos the original post! Feel free to leave a comment in the discussion about how this capability would help you or your team.

Take care,
Ashley

VMaus
Level 2

Thanks @AshleyW!

I frequently overflow the length limit for table names in my database. While I really appreciate Dataiku's design decision to semantically name tables, which is definitely better for non-Dataiku users in our data environment than just naming them with a hash, I'm working today with a dataset named 

MASTSCHD_1_copy_by_LINE_NUM_stacked_by_LINE_NUM_joined_filtered_by_model_joined_by_load_min_min_joined_prepared

I think an alias like this would be useful, especially if it could be inherited by downstream datasets in lieu of the underlying name. That said, ultimately a safe rename would be even better for my use case. I've usually been able to get away with renaming a dataset before it has references, but I've also run into issues depending on where the dataset is used.

But Dataiku has really good reference tracking: in every aspect of the UI I can think of, there's already a dedicated field listing the input datasets, and the rename feature already searches these automatically and rewires everything, with the exception so far of code recipes. I wonder if a safe rename feature could make that search exhaustive, allowing us to rename any dataset no matter how many references there are. Downstream datasets could also be renamed automatically, making it quick to clean up large projects. Ideally, even the underlying tables, which currently are not renamed by the rename feature, would be renamed on the next build. Even broken flows (which, though an anti-pattern, I've occasionally needed), where I've imported a managed dataset generated elsewhere in my project directly as though it were unmanaged, could have their references automatically updated by a more powerful renaming feature.

Also, an alias feature could be really useful for condensed references downstream: it'd be nice to set an alias upstream so the full name of a dataset is descriptive, but then refer to it in an abbreviated way downstream, especially in the names of other datasets, similar to the pattern that's already common with SQL aliasing.
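
In the meantime, the closest thing to an alias I can manage inside a code recipe is a local mapping from a short, readable key to the long managed dataset name. A minimal sketch (the short key and the ALIASES mapping are hypothetical and exist only inside the recipe; the Flow and the database still see the underlying name):

```python
import dataiku

# Hypothetical local "aliases": short keys mapped to the real (very long)
# managed dataset names. Nothing outside this recipe knows about them.
ALIASES = {
    "mastschd_final": "MASTSCHD_1_copy_by_LINE_NUM_stacked_by_LINE_NUM_joined_filtered_by_model_joined_by_load_min_min_joined_prepared",
}

def read_alias(alias):
    """Resolve a short alias to its dataset and load it as a DataFrame."""
    return dataiku.Dataset(ALIASES[alias]).get_dataframe()

df = read_alias("mastschd_final")
```

That shortens references in code, but it does nothing for the names of downstream datasets in the Flow or the underlying tables, which is why a real alias feature (or a safe rename) would still be the better answer.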

AshleyW
Dataiker

Status changed to: Developing

Congrats - we're adding this to our roadmap! While timelines are always tricky, we'll let you know how it's progressing as updates are available.

If you've kudoed the post or added some comments about your particular use case, we may reach out to get some feedback.

Take care,
Ashley

VMaus
Level 2

Great news! Thanks @AshleyW!
