Changing data during a join

Solved!
Amarilla
Level 2
Good morning all!

I am a student using Dataiku, and I am stuck on one point.

I made a join between two datasets, but the values in one column appear to have changed.

The column holds review sources. Before the join it contains 5,975 Qualitelis, 3,862 Booking and 118 Trip; after the join it shows 8,838 Booking and 1,162 Tripodvisor.

Am I making a mistake? Thanks in advance :D

Andrey
Dataiker Alumni

Hello, 

If I understood the question correctly, you can go to the dataset's Settings -> Schema tab and redetect the schema from the changed data. Then you can use the schema propagation tool in the Flow to apply the new schema downstream. If you want to join the data differently, you can also change the join recipe settings.

 

Andrey Avtomonov
R&D Engineer @ Dataiku
Amarilla
Level 2
Author

Thanks for your feedback @Andrey! In the settings, I can't find the option to redetect the schema from the changed data 😞
Andrey
Dataiker Alumni
 

It's the "Check now" button under the "Schema" tab

Screenshot 2021-03-18 at 14.22.04.png

Amarilla
Level 2
Author

It tells me: "The schema and the data are consistent."

Unfortunately, that didn't solve my problem.
Andrey
Dataiker Alumni

It looks like I didn't understand the question. What did you mean by "but the data of a column has been changed"?

Did the data in one of the datasets change? Did the structure of that data change (e.g. the schema is now different)?

 

Amarilla
Level 2
Author

In my first dataset I have a column whose values break down as follows: 5,975 rows with "Qualitelis", 3,862 with "Booking" and 118 with "Trip".

At the output of my join, the same column contains 8,838 "Reservation" and 1,162 "Tripodvisor".

I hope that is clearer 🙂
Andrey
Dataiker Alumni

Could you please send a screenshot of the contents of both datasets, and also of the join recipe settings (the tab with all the join conditions), so I can see exactly how they are being joined?

Amarilla
Level 2
Author

Here is a view of my two datasets (some columns are not shown because there are a lot of them).

Capture d’écran 2021-03-18 à 17.11.37.png
Capture d’écran 2021-03-18 à 17.12.15.png

The problem I am running into is in the "SourceAvis" column.

Capture d’écran 2021-03-18 à 17.12.26.png

This is my data before the join:

Capture d’écran 2021-03-18 à 17.12.35.png

And here it is right after the join:

Capture d’écran 2021-03-18 à 17.13.03.png

I don't know if this gives you enough information about the join:

Capture d’écran 2021-03-18 à 17.14.05.png

I had already encountered this same problem with a Window recipe.
Andrey
Dataiker Alumni

Hi, sorry for the late response. Did you try running the analysis on the whole column data and not just on a sample?

You can choose the whole data in the dropdown that currently says "Sample":

Andrey_0-1616435290742.png
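As a rough illustration of why sample and whole-data counts can disagree, here is a minimal Python sketch (plain Python, not Dataiku code). The rows and `SAMPLE_SIZE` are hypothetical and chosen only to show the effect: when counts are computed on the first rows only, a value that appears later in the dataset can be missing entirely. Note also that the post-join counts reported earlier in the thread add up to exactly 10,000, which would be consistent with counts computed on a fixed-size sample; the thread does not state the sample size, so treat that as an assumption.

```python
from collections import Counter

# Hypothetical column values, ordered so that the first rows are not
# representative of the whole column.
source_avis = ["Booking"] * 6 + ["Tripodvisor"] * 2 + ["Qualitelis"] * 12

SAMPLE_SIZE = 8  # hypothetical stand-in for the size of the analysed sample

# Counts computed on the first SAMPLE_SIZE rows only.
sample_counts = Counter(source_avis[:SAMPLE_SIZE])

# Counts computed on every row.
full_counts = Counter(source_avis)

print("Counts on the sample:    ", dict(sample_counts))
print("Counts on the whole data:", dict(full_counts))
```

In this toy case "Qualitelis" is absent from the sample counts even though it is the most frequent value overall, which is the same kind of discrepancy described in the question.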

 

Amarilla
Level 2
Author

Hello, now it's my turn to apologize for the late response. Indeed, with the whole data we find the initial values!

Capture d’écran 2021-03-25 à 11.39.22.png
Capture d’écran 2021-03-25 à 11.39.54.png

However, given the number of rows, I think my join is not correct.

Thank you very much :D
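One common reason for a join returning more rows than expected is duplicate key values on one side. The thread does not show the join type or the key columns, so the sketch below is only an illustration under stated assumptions: it assumes a left join on a single key, and the key values are hypothetical. It counts how often each key appears on the right side and derives how many rows the left join would produce.

```python
from collections import Counter

# Hypothetical join keys, made up for illustration.
left_keys = [1, 2, 3, 4]
right_keys = [1, 1, 2, 5]  # key 1 appears twice on the right side

right_counts = Counter(right_keys)

# In a left join, each left row yields one output row per matching right row,
# or a single row (with empty right-side columns) when there is no match.
expected_rows = sum(max(right_counts[k], 1) for k in left_keys)

# Keys that appear more than once on the right multiply the left rows.
duplicated = {k: n for k, n in right_counts.items() if n > 1}

print("Left rows:", len(left_keys))
print("Expected rows after left join:", expected_rows)
print("Duplicated keys on the right:", duplicated)
```

If the expected row count is higher than the number of left rows, the duplicated keys are the first place to look.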
