
Best Practices/Ideas for Data Flow Scenario

Solved!
m_sch
Level 3

Hello Community,

I would like to ask you for best practices/ideas to handle the following scenario.

I'm receiving files from different source systems every day and need to handle them. Each file includes updated product data from the last 30 days, each with a different data structure. The data needs to be cleaned, prepared, and harmonized into one dataset, so that we can run analyses over the last 30 days on a daily basis. Furthermore, I need to create a database / data warehouse where all this data should be stored for historical analyses.

So on one side we have the current analyses of the last 30 days, and on the other side we have a database that should be filled with this data, where already existing records should be updated as well.

So I would like to discuss possible data flows and options with you.

I look forward to your input.

Many thanks!

1 Solution
emate
Neuron

Hi @m_sch 

In general, the case you are describing is a whole project with a few different issues, and I'm not sure what exactly you would like to know.

But I can see that you need to take a few steps:

A) Unify the structure of each dataset that you want to combine. There are multiple ways to do that, and it's very case-dependent; there is no single piece of advice I can think of. It depends on the problem and the skill set you have, but there is a wide range of solutions when it comes to dealing with different data sources, starting from a simple Prepare recipe and ending with code.
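As an illustration of the unification step, here is a minimal pandas sketch (the source names `system_a`/`system_b` and all column names are made up, not from your actual systems): each source's columns get renamed to one common schema before the frames are stacked.

```python
import pandas as pd

# Hypothetical per-source mappings from source-specific column names
# to one common schema (illustrative names only).
MAPPINGS = {
    "system_a": {"prod_id": "product_id", "upd_date": "updated_on", "qty": "quantity"},
    "system_b": {"ProductNumber": "product_id", "Date": "updated_on", "Amount": "quantity"},
}

def harmonize(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source-specific columns to the common schema and tag the origin."""
    out = df.rename(columns=MAPPINGS[source])
    out["source_system"] = source
    return out[["product_id", "updated_on", "quantity", "source_system"]]

a = pd.DataFrame({"prod_id": [1], "upd_date": ["2021-06-01"], "qty": [5]})
b = pd.DataFrame({"ProductNumber": [2], "Date": ["2021-06-02"], "Amount": [7]})

combined = pd.concat([harmonize(a, "system_a"), harmonize(b, "system_b")],
                     ignore_index=True)
```

In DSS the same mapping could live in per-source Prepare recipes feeding a Stack recipe; a Python recipe like this is just one of the options.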

It also matters whether you have raw files that you upload manually every day (maybe there is a way to automate this), or whether it's a database you are connecting to.

B) For keeping only 30 days of data: I am super lazy and I like simple solutions. In one of my projects I had to keep and use only the last 24 months, so I calculated an index column from Date and Max_date columns that showed me the difference in months between my max date and each row's date, and every month I kept only the rows where index <= 24, or something like that.
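Adapted to your 30-day window, that index-column trick might look like this pandas sketch (the dates and `product_id` values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "updated_on": pd.to_datetime(["2021-06-30", "2021-06-10", "2021-05-01"]),
})

# Age of each row relative to the newest date in the dataset
max_date = df["updated_on"].max()
df["age_days"] = (max_date - df["updated_on"]).dt.days

# Keep only rows from the last 30 days, then drop the helper column
recent = df[df["age_days"] <= 30].drop(columns="age_days")
```

The full, unfiltered dataset can still be appended into the warehouse dataset for the historical analyses, while only `recent` feeds the daily 30-day analyses.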

In your case maybe you should explore "Partitioning"

https://doc.dataiku.com/dss/latest/partitions/index.html
https://www.youtube.com/watch?v=MB0en4HBuV8&ab_channel=Dataiku

Don't know if this helps

Mateusz

3 Replies
m_sch
Level 3
Author

Hello Community,

Sorry, perhaps my question was too general for a discussion.

Maybe I need to be a little more specific for an initial discussion?

I am happy about any input.

If the question is too general, I also appreciate feedback. Maybe this topic can then be closed and handled with specific individual questions from my side.

Many thanks!

m_sch
Level 3
Author

@emate 

Many thanks for your input!

I will also have a look at "Partitioning".

Best regards!
