Best Practices/Ideas for Data Flow Scenario

m_sch
m_sch Registered Posts: 12 ✭✭✭✭

Hello Community,

I would like to ask you for best practices/ideas to handle the following scenario.

I'm receiving files from different source systems every day and need to process them. Each file contains updated product data from the last 30 days, and each source uses a different data structure. The data needs to be cleaned, prepared, and harmonized into one dataset so that we can analyze the last 30 days on a daily basis. Furthermore, I need to create a database / data warehouse into which all this data is loaded for historical analyses.

So on one side we have the current analyses of the last 30 days, and on the other side we have a database that should be filled with this data, where already existing datasets should be updated as well.

So I would like to discuss possible data flows and options with you.

I look forward to your input.

Many thanks!

Best Answer

  • Mateusz
    Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭
    Answer ✓

    Hi @m_sch

    In general, the case you are describing is a whole project with a few different issues, and I'm not sure what exactly you would like to know.

    But I can see that you need to take a few steps:

    A) Unify the structure of each dataset that you want to combine. There are multiple ways to do that and it's very case-dependent, so there is no single piece of advice I could give: it depends on the problem and the skill set you have. But there is a wide range of solutions for dealing with different data sources, starting from a simple Prepare recipe and ending with coding.

    It also matters whether you have raw files that you upload manually every day (maybe there is a way to automate this) or whether it's a DB you are connecting to.
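
    For the coding end of that range, one possible approach is a per-source column mapping onto a shared schema. The sketch below is only an illustration in plain Python: the source names (system_a, system_b), the column names, and the target schema are all hypothetical, and it assumes dates arrive as ISO strings.

```python
from datetime import date

# Hypothetical mappings from each source's column names to the shared schema.
COLUMN_MAPS = {
    "system_a": {"ProdID": "product_id", "Date": "date", "Price": "price"},
    "system_b": {"sku": "product_id", "updated_on": "date", "unit_price": "price"},
}


def harmonize(rows, source):
    """Rename source-specific columns and normalize types for one source."""
    mapping = COLUMN_MAPS[source]
    result = []
    for row in rows:
        # Keep only the mapped columns, under their shared names.
        record = {target: row[original] for original, target in mapping.items()}
        record["date"] = date.fromisoformat(record["date"])  # assumes ISO dates
        record["price"] = float(record["price"])
        record["source"] = source  # remember where each row came from
        result.append(record)
    return result


def combine(rows_by_source):
    """Harmonize every source and stack the results into one list of rows."""
    combined = []
    for source, rows in rows_by_source.items():
        combined.extend(harmonize(rows, source))
    return combined
```

    Each new source system then only needs one more entry in the mapping, as long as its columns can be renamed one-to-one.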

    B) For keeping only 30 days of data: I am super lazy and I like simple solutions. In one of my projects I had to keep and use only the last 24 months each month, so I calculated an index column from the Date and Max_date columns that showed the difference in months between my max date and each row's date. Every month I kept only the rows where index <= 24, or something like that.
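
    The same idea, adapted to a 30-day window, could look roughly like the sketch below. It is not the code from that project: the function name, the "date" key, and the choice to count the newest day as day 0 are assumptions made for illustration, and each row is assumed to be a dict holding a datetime.date.

```python
from datetime import date


def keep_last_n_days(rows, n_days=30, date_key="date"):
    """Keep rows whose date lies within n_days of the newest date in the data."""
    if not rows:
        return []
    max_date = max(row[date_key] for row in rows)
    # The "index" is the age in days relative to the newest row;
    # ages 0 .. n_days - 1 are kept, so the window spans n_days calendar days.
    return [row for row in rows if (max_date - row[date_key]).days < n_days]
```

    Anchoring the window on the newest date in the data, rather than on today's date, keeps the result stable if a file arrives late.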

    In your case maybe you should explore "Partitioning"

    https://doc.dataiku.com/dss/latest/partitions/index.html
    https://www.youtube.com/watch?v=MB0en4HBuV8&ab_channel=Dataiku

    Don't know if this helps

    Mateusz

Answers

  • m_sch
    m_sch Registered Posts: 12 ✭✭✭✭

    Hello Community,

    sorry, perhaps my question was too general for a discussion.

    Maybe I need to be a little more specific for an initial discussion?

    I am happy about any input.

    If the question is too general, I also appreciate feedback. Maybe this topic can then be closed and handled with specific individual questions from my side.

    Many thanks!

  • m_sch
    m_sch Registered Posts: 12 ✭✭✭✭

    @emate

    Many thanks for your input!

    I will also have a look at "Partitioning".

    Best regards!
