Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
This question is about finding a suitable Dataiku recipe for neatly combining recent daily billing data to structurally similar monthly data with the aim of avoiding data duplication. Provided details below.
We are dealing with two billing datasets - monthly and daily billing data. These are normally complimentary, but under some situations overlaps could occur. For example: Monthly data for January 2022 to July 2022; daily data for 07/01/2022 to 08/22/2022. These two dataset are manually extracted source and avoiding overlaps is not attainable in near future.
Currently, we are stacking monthly and aggregated daily data which is producing duplicate numbers for overlapping date range. Could you please suggest a suitable solution for dealing with this data duplication issue?
Operating system used: macOS Monterey
Welcome to the Dataiku Community. We are so glad to have you among us.
I think that you get it, it would be best to avoid duplicates at the system that is sourcing your data and get that system to only provide the new data or direct access to the current full dataset. However, I sense at this time that level of data governance may not currently be achievable. Certainly worth seeing if you can directly query that source system rather than getting overlapping updates that you have to piece back together. If you have some social clout you could bring to the situation like a boss or customer status. These things might be worth trying.
If you have to go it on your own. Do you have a consistent "Key" like an order number that is stable over time? For example, an order number might work in a system where orders are put into the system once and never changed. However, it has been my experience that this is very very often not the case. My experience is that orders change all the time, from cancelations, to changed products and pricing.
Can you get transactional data from the source system. Information about every item added, deleted, changed in orders. One can often use this type of information to piece together both the current state of the orders. But often with time series analysis and other kinds of data modeling, being able to piece together both current states of orders and past states of orders which can be useful.
If you are left with fixed reports that you have to piece the data together from you will really have challenges. Not necessarily because of something missing from Dataiku DSS but the data does not tell you it's current status. Particularly when orders are deleted or changed. Doing something with a list of order that you think are active on each date and taking the most recent update you have is the kind of thing you are left with.
Others please jump in here if you have any good ideas for @s_ray
@s_ray if you are able to share some example data (without revealing any sensitive information about customers) those insights might help other be of more help.
To remove the duplicate data, you can use the Group recipe, indicating the unique keys and deciding the aggregations for the other columns (first value, max value, etc.).
Here is a video describe the recipe.
Please tell me if this helps.