Changing data during a join

Amarilla
Amarilla Registered Posts: 11 ✭✭✭✭

Good morning all!

I am a student using Dataiku, and I am stuck on one point.

I joined two datasets, but the values in one column changed.

The column holds review sources. Before the join it contains 5,975 Qualitelis, 3,862 Booking, and 118 Trip; at the end of the join it contains 8,838 Booking and 1,162 Tripodvisor.

Am I making a mistake? Thanks in advance :D

Best Answer

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭
    Answer ✓

    Hi, sorry for the late response. Did you try running the analysis on the whole column data and not just on a sample?

    You can choose the whole data in the dropdown that currently says "Sample":

    Andrey_0-1616435290742.png
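
To illustrate why a sample can mislead here: the post-join counts in the question (8,838 + 1,162) add up to exactly 10,000, which would be consistent with the analysis running on a fixed-size sample rather than on every row. Below is a minimal pandas sketch of that effect, not Dataiku's own code. The data, the row order, and the sample size are invented for the illustration, and it assumes the sample is simply the first N rows.

```python
import pandas as pd

# Hypothetical data: the real dataset and its row order are not shown in the thread.
full = pd.DataFrame({
    "SourceAvis": ["Booking"] * 6 + ["Tripodvisor"] * 2 + ["Qualitelis"] * 4 + ["Trip"] * 1
})

SAMPLE_SIZE = 8  # assumed stand-in for a "first N rows" sample

# Counts computed on the sample only see the rows that happen to come first.
sample_counts = full.head(SAMPLE_SIZE)["SourceAvis"].value_counts().to_dict()

# Counts computed on the whole data see every value.
full_counts = full["SourceAvis"].value_counts().to_dict()

print("sample:", sample_counts)
print("whole :", full_counts)

# Values present in the data but invisible in the sample.
missing_in_sample = sorted(set(full_counts) - set(sample_counts))
print("missing from sample:", missing_in_sample)
```

If the output of a join is ordered differently from its input, the first rows of each can show very different value distributions even though no value was actually changed.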

Answers

  • Andrey

    Hello,

    If I understood the question correctly, you can go to the dataset settings -> schema and redetect the schema from the changed data. Then you can use the schema propagation tool on the flow to apply the new schema downstream. If needed, you can also change the join recipe settings to join the data differently.

  • Amarilla

    Thanks for your feedback, @Andrey! In the settings, I can't find the option to redetect the schema from the changed data.

  • Andrey

    It's the "Check now" button under the "Schema" tab

    Screenshot 2021-03-18 at 14.22.04.png

  • Amarilla

    It tells me this: "The schema and the data are consistent."

    Unfortunately, that didn't solve my problem.

  • Andrey

    Looks like I didn't understand the question. What did you mean by "but the data of a column has been changed"?

    Is it the data in one of the datasets that got changed? Did the structure of that data change (e.g. the schema got different)?

  • Amarilla

    In my first dataset I have a column whose values are distributed as follows: 5,975 rows with "Qualitelis", 3,862 with "Booking", and 118 with "Trip".

    At the output of my join, the same column contains 8,838 "Reservation" and 1,162 "Tripodvisor".

    I hope that is clearer.

  • Andrey

    Could you please send a screenshot of the contents of both datasets and of the join recipe settings (the tab with all the join conditions), so we can see exactly how they're being joined?

  • Amarilla

    Here is part of my two datasets (some columns are missing because there are a lot of them).

    Capture d’écran 2021-03-18 à 17.11.37.png
    Capture d’écran 2021-03-18 à 17.12.15.png

    The problem I am encountering is in the column "SourceAvis".

    Capture d’écran 2021-03-18 à 17.12.26.png

    This is my data before the join

    Capture d’écran 2021-03-18 à 17.12.35.png

    And here it is just after the join

    Capture d’écran 2021-03-18 à 17.13.03.png

    I don't know if you have enough information about the join

    Capture d’écran 2021-03-18 à 17.14.05.png

    I had already run into this same problem with a Window recipe.

  • Amarilla

    Hello, it's my turn to apologize for the late response. Indeed, with the whole data we get the initial values back!

    Capture d’écran 2021-03-25 à 11.39.22.png
    Capture d’écran 2021-03-25 à 11.39.54.png

    However, given the number of rows, I think my join is not correct.

    Thank you very much :D
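
One common reason a join returns more rows than expected is a join key that is not unique on one side. The thread does not show the actual join conditions, so the following is only a hedged pandas sketch with invented tables: `hotel_id` and `city` are hypothetical names, and the check is done in pandas rather than in the Dataiku join recipe itself.

```python
import pandas as pd

# Hypothetical tables; "hotel_id" is an invented join key, not taken from the thread.
reviews = pd.DataFrame({
    "hotel_id": [1, 1, 2, 3],
    "SourceAvis": ["Qualitelis", "Booking", "Booking", "Trip"],
})
hotels = pd.DataFrame({
    "hotel_id": [1, 2, 2],  # key 2 appears twice on the right side
    "city": ["Paris", "Lyon", "Lyon"],
})

# A left join repeats each left row once per matching right row.
joined = reviews.merge(hotels, on="hotel_id", how="left")
print(len(reviews), "rows before,", len(joined), "rows after")

# List the right-side keys that occur more than once.
dup_keys = (
    hotels.loc[hotels["hotel_id"].duplicated(keep=False), "hotel_id"]
    .unique()
    .tolist()
)
print("duplicated keys on the right side:", dup_keys)

# Ask pandas to enforce that right-side keys are unique.
try:
    reviews.merge(hotels, on="hotel_id", how="left", validate="many_to_one")
    right_keys_unique = True
except pd.errors.MergeError:
    right_keys_unique = False
print("right keys unique:", right_keys_unique)
```

Comparing the row count before and after the join, and checking the key for duplicates on each side, is usually enough to tell whether the join conditions need an extra column or the right-hand dataset needs deduplicating first.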
