I have a dataset [attached example] and I want to create 2 tables out of this;
1st - sum(sales) by market and exclude duplicated rows, so I want to end up with Sales for each market in specific date rage (Data column) but exclude duplicated - I have them because 1 product can be in more than 1 group
So first table would look like:
market sales data
MRCC I 180 43831... etc
Then second table I would like to look like above one, but add as a 'dictionary' aditionall column with uniqe product name within Market and Date, so for MRCC I it would look like:
market product sales
MRCC I ELI 180
MRCC I SUT 180
Can I achive this with only visual recipes? Or I need some SQL part of code? if yes could you give me a hint on a syntax 'structure' So i can replicate this in my case?
I was trying to use group by, but I couldn't remove those duplicated product rows.
Thanks in advance for any kind of help,
In the attached PDF (containing screenshots) and the paragraphs below, I've outlined a couple things you can try in DSS that may answer your questions.
For the first table, you can use the Group recipe (as you mentioned) using “market” and “data” as the Group Keys and “sum” sales in the Per Field Aggregations window. To remove duplicates in the Group recipe, you can use the Post-filter to keep only Distinct rows. Additionally, in Pre-filter you can remove duplicates before doing the groupby.
For the second table, you can use a Window recipe. Select “market” and “data” as the partitioning columns (the fields you will use to group the dataset), and “sum” sales in Aggregations. Finally, remove duplicates in Post-filter. Window recipes are fantastic and are quite useful for many things. This page has more info if you're interested: https://academy.dataiku.com/latest/tutorial/visual-recipes/window.html
Hope that helps!
Thanks a lot for your input, I will definitely try this. Your outputs looks almost as they should, only there is still duplicated sum of sales in both of them, for example, MRCC I in both tables should be 180. But I hope with some fuctions you mentioned I will be able to handle this.
Hi @emate ,
Ah, yes there are duplicate Market--Sales pairs in the dataset. You may have found a way to fix it already, but here are some thoughts:
In this case, removing duplicates in Pre-filter in the Group Recipe won't work because it looks for duplicates across entire rows. While Market and Sales is the same in multiple rows, the Group field makes these rows distinct.
You could drop the Group field in a prepare recipe first. Then, remove duplicates using Pre-filter in the Group Recipe to get 180 for MRCC I. Alternatively, you can also remove duplicates in the Prepare Recipe. Using Filter in Pre-filter might also work for you depending on your use case.
I'm almost done with the first table, first I removed duplicates (based on product, sales, , data, market), then I am using your group recipe and it giving me correct result for sales per market, but my 'date' column is missing, and im not sure why... is there a way to keep it? I will need it later to join this table with my orginal data.
**** Ok, nevermind, I got confused, I should group it by market, sales and DATE... sorry for the confusion, i'm in the process of transition between tools I used to for last 6 years.. 🙂