[Beginner] How to visually deal with a lot of datasets?

bricejoosten · ‎12-03-2023

Hello there,

I'm new to Dataiku (running the container version with Docker), so apologies if the answer is already in the documentation; I searched but couldn't find it.

I'm doing basic data analysis and AI-based prediction on political elections, using my government's public data. I have around 20 datasets, and each one needs its own preparation steps at the very beginning of the flow. Most of them share the same subtasks (for now; there will be even more in the future).

How can I make the flow look clean and readable? I've learned about flow zones, but I'm still not sure whether I'm using them correctly. This is what it currently looks like:

Is there a way to arrange them on some kind of rectangular grid so that they take up less space on screen?
The end goal is to present my data flow cleanly in some slides, so it has to stay as readable as possible, like in this picture: https://doc.dataiku.com/dss/latest/_images/zones-view-with.png

Thanks a lot for any help, and sorry again if this topic duplicates an existing one. If so, I can delete it or close it with a message redirecting to the original topic.

Operating system used: Windows 10

Turribeach · ‎12-03-2023

As you have found out, flow zones simplify the flow to some extent, but they only go so far. What Dataiku is really missing is nested flow zones, as suggested by this product idea (feel free to up-vote it). Until then, you don't have many options. One thing you could do is move your complex input section to another project and share its output dataset with the main project. This removes the complexity from your main flow, at the cost of having to manage more than one project.

bricejoosten · ‎12-03-2023

Unfortunately, I think nested zones are pertinent when you need subflows, but in my case the datasets are the smallest indivisible units of the project, so subflows are a different topic from how elements are visually stacked on screen.

Ideally, I would be able to move elements manually on the grid wherever I want, with a feature that automatically re-routes the connections so the result stays readable and harmonious.

As for putting the source datasets in another project, I don't think that is what projects are intended for. It would only pretend to clean up the project's flow without actually doing so.

Is there really no other legitimate way to do this, using the features for their intended purpose?

Turribeach · ‎12-03-2023

I don't think there is another way, but your flow design might offer a way out. I see you have 22 uploaded files which get synced to a File System dataset and then condensed into 2 output datasets by two Python recipes. Do these 22 files share the same structure? If so, why not put them in a single managed folder to begin with?

bricejoosten · ‎12-03-2023

I've thought about this, but the only way to process data stored as a folder of files is with code (Python, R, etc.). Since I have to do some cleaning, normalization and enrichment before merging everything together, that left me with two options:
- use Python for everything, which makes all the other kinds of processing in Dataiku useless; I don't want that, because I want a visual diagram of each step in my flow;
- create one Python recipe per subtask, which is as absurd as the first option: the flow would be made only of Python steps, so it would not clearly show the distinct steps.

Either way, I couldn't take advantage of Dataiku's visual flow system. However, as you can see below in the answers, I may have found a better solution.

bricejoosten · ‎12-03-2023

Self-answer as an update :

I don't know why I didn't think of this earlier, but this may be the cleanest approach. I have 22 raw input datasets that must each be prepared individually and then merged together, so I can split them into 4 zones: 3 zones that each prepare 6 datasets, and a last one that prepares 4.

These zones cover the data cleaning part and should produce 4 output datasets, which will be visually much better than 22, especially once I add enrichment recipes afterwards.
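The grouping described above can be sketched in plain Python. This is only an illustration of the 6 + 6 + 6 + 4 split; the dataset names, the zone names and the `plan_cleaning_zones` helper are hypothetical, and nothing here calls Dataiku itself.

```python
def plan_cleaning_zones(dataset_names, zone_size=6):
    """Split dataset names into consecutive groups of at most zone_size."""
    return {
        f"cleaning_{start // zone_size + 1}": dataset_names[start:start + zone_size]
        for start in range(0, len(dataset_names), zone_size)
    }


# 22 hypothetical raw dataset names standing in for the real inputs.
raw_datasets = [f"raw_{n:02d}" for n in range(1, 23)]

zones = plan_cleaning_zones(raw_datasets)
for zone_name, members in zones.items():
    print(zone_name, len(members))
```

With 22 inputs and a zone size of 6, this yields three groups of 6 and one of 4, each of which would feed a single cleaned output dataset.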

And this is where I realize how handy subflows would be: in the end, I would like the project to have four big zones (cleaning, normalization, enrichment, and exploitation with AI), so that, for instance, the cleaning zone would contain all of these visually inconvenient zones.

I think I've solved my problem about as well as the currently available features allow. Don't hesitate to tell me if you see flaws in my reasoning or if this approach could cause problems in the long term, or to suggest alternatives. Thanks for the help.

Turribeach · ‎12-03-2023

If the files have the same structure, you can use the Files in Folder dataset to load and merge them. There is also a nice feature that adds the file name and row ID to every row you load, for full data traceability:

https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214
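For readers who want to see what this kind of merge amounts to, here is a minimal plain-Python sketch, assuming CSV files that all share one header. It is not how the Files in Folder dataset is implemented and it does not use any Dataiku API; the `merge_csv_files` helper, the `source_file` and `row_id` column names, and the sample files are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(folder, pattern="*.csv"):
    """Stack same-structure CSV files, tagging each row with its origin."""
    merged = []
    expected_header = None
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if expected_header is None:
                expected_header = reader.fieldnames
            elif reader.fieldnames != expected_header:
                raise ValueError(f"{path.name} has a different header")
            # row_id restarts at 1 for every file, for traceability.
            for row_id, row in enumerate(reader, start=1):
                merged.append({"source_file": path.name, "row_id": row_id, **row})
    return merged


# Demo on two tiny hypothetical files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "region_a.csv").write_text("candidate,votes\nA,10\nB,7\n", encoding="utf-8")
    Path(tmp, "region_b.csv").write_text("candidate,votes\nA,3\n", encoding="utf-8")
    rows = merge_csv_files(tmp)

for row in rows:
    print(row)
```

Each merged row keeps the name of the file it came from and its position in that file, which is the traceability idea mentioned above. Values stay as strings because the `csv` module does not infer types.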
