
Recipe to create multiple datasets from a single dataset

MNOP

I have a dataset called "MasterData" in the flow. 

I want to subset this data based on the "Country" column, name each subset after the country (e.g. "MasterDataAustralia"), and save it in a flow zone built exclusively for that country.
I have more than 60 countries in the master data, and new countries may be added in the future. The flow should continue to work when a new country is added.

What is the best way to handle such scenarios in Dataiku?


Operating system used: Windows

Turribeach

The Split recipe can split data into different datasets based on a column value, but it's really designed to split data into train/test datasets for machine learning. Why would you want to split the data into 60+ datasets, and what would that achieve in the flow? The Dataiku flow doesn't really cater for dynamic dataset creation. What exactly are you trying to achieve? Describe your goal, not how you think you can achieve it; there might be a better way of getting there.

MNOP
Author

Thanks for the reply.
We are in the process of developing machine-learning models for each market and country. We initially considered using partitioned models but discovered that we couldn't incorporate model-specific features. As a result, we have decided to divide our master data into separate country/market datasets and zones, and then develop models based on this data. We are open to any suggestions for improving this process.


If you are going to have different models and use visual recipes, you are going to have different flow branches at least, if not different flows/projects altogether. Another alternative would be to do everything in Python code recipes, MLflow models and folders, but this will make it much harder for people to understand what's going on in your project and diminishes the returns of a tool like Dataiku.

So in general terms you can't have dynamic datasets like you ask for. And I actually think your request is misguided, because it's probably not critical to your solution. Having said that, if the processing and flow zone for each country will be built exactly the same way, you may want to look at Dataiku Applications or Dataiku Application-as-recipe. They will allow you to reuse code across the different flow zones/projects and give you a consistent way of processing the country data. You will still need to build these flow zones/projects separately, though. This task could be automated using the Dataiku Python API if you think the return justifies the investment; a sketch of that idea is below.
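For illustration, here is a minimal sketch of that kind of automation with the Dataiku Python API. It assumes it runs inside DSS (e.g. a project notebook or a scenario Python step), only creates one flow zone per country found in "MasterData", and uses an illustrative zone naming convention; creating the per-country dataset and the recipe that fills it is left as a comment:

```python
import dataiku

# Minimal sketch, assuming this runs inside DSS in the project that holds MasterData
client = dataiku.api_client()
project = client.get_default_project()

# Distinct countries currently present in the master dataset
countries = (
    dataiku.Dataset("MasterData")
    .get_dataframe(columns=["Country"])["Country"]
    .dropna()
    .unique()
)

flow = project.get_flow()
existing_zones = {zone.name for zone in flow.list_zones()}

for country in countries:
    zone_name = f"Country_{country}"  # illustrative naming convention, not from the thread
    if zone_name not in existing_zones:
        flow.create_zone(zone_name)
    # Building the "MasterData<Country>" dataset and the recipe that populates it
    # (e.g. a Sync or Python recipe) would also be done here via the API.
```

Running something like this from a scheduled scenario would also cover the case where a new country appears in the data.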

MNOP
Author

@Turribeach  

"If you are going to have different models and use visual recipes, you will likely have different flow branches, if not entirely separate flows or projects.

Can you elaborate on this? What do you mean by 'flow branches'?

How can we handle this situation more efficiently and elegantly in Dataiku? The complexity of our modeling comes from the number of models. If we create a flow for each model, the flow will become a giant web, making it difficult to navigate and use."


I thought I'd chime in here as I have some experience with this type of problem... I think the ideas outlined by @Turribeach are your choices. I have tried all three approaches and can comment from the perspective of these experiences. 

If you could make the partitioned model approach work, that certainly would be easiest. Why not include all features that might be used by any model? The algorithms will choose the features that work for each model. I don't use partitioning much, so when I do it's kind of painful, but it still seems like the easiest and cleanest option if it's workable.

I've written an all Python solution for 95 models. This worked pretty well but of course one loses the benefits of the DSS Visual ML functionality.
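To make that option concrete, here is a minimal sketch of what such a per-country modelling loop can look like. The managed folder name "country_models", the target column "Target", and the choice of algorithm are all illustrative assumptions, and feature preparation is omitted:

```python
import pickle

import dataiku
from sklearn.linear_model import LogisticRegression  # placeholder algorithm

# Sketch only: assumes a managed folder "country_models" exists in the project,
# that "Target" is the label column, and that the remaining features are numeric.
master = dataiku.Dataset("MasterData").get_dataframe()
models_folder = dataiku.Folder("country_models")

for country, df in master.groupby("Country"):
    X = df.drop(columns=["Country", "Target"])
    y = df["Target"]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Persist each fitted model as a pickle in the managed folder
    with models_folder.get_writer(f"{country}.pkl") as writer:
        writer.write(pickle.dumps(model))
```

The scoring side is the mirror image: loop over the countries again, load each pickle, and score the corresponding subset.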

I've also used the API a number of times to build flows programmatically. This would enable you to create and train separate models using Visual ML functionality without having to do it all by hand through the UI. If you haven't used the API much, there would be a decent learning curve, and you might end up writing as much Python code as you would with an all-Python solution. Navigating around the flow would also be somewhat painful.

If it were me, I'd see if I could make the partitioning approach work. If not, I'd probably write a pure Python solution. The API approach is probably only worth it if it were exceptionally important to be able to use Visual ML and related functionality (e.g., evaluation stores - although even there you can use one without a Visual ML model).

Marlan


@MNOP wrote:

Can you elaborate on this? What do you mean by 'flow branches'?

[Screenshot attached: example of a flow with multiple branches]
