How to concat multiple columns by index?

Options
gavin
gavin Registered Posts: 6

I have 2 dataset with different column names but same length.

Dataset 1 have column A,B, length is 10.

Dataset 2 have column C,D, length is 10.

I want to get Dataset 3 by concating Dataset1&2, Its column name are A,B,C,D and length is 10.

I have tried join and stack, but I can't get what I expect.


Operating system used: Windows

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Answer ✓
    Options

    There are different ways of uploading files to Dataiku. I presume you have simply dropped the files in the flow in which case Dataiku will put the files in a server's file system connection and load them in a dataset with an arrow pointing up, indicating they have been uploaded. There is however a better way of loading files in Dataiku. First go to the Dataiku Administration => Connections settings and create a new server's file system connection pointing the path to a folder of your choice in your system. Then go your flow click on the Dataset button and select Folder. Create a new folder in the new server's file system connection. Once the new folder is created and you are opening it from the flow you can add files to it. Either drag and drop from Windows Explorer (multiple files selected are allowed) or click on the Add File button. Having now added the files to the folder you can add the Files in Folder dataset as per my advice on the linked solution above.

    This is a much better solution because you can now easily update the files in the physical folder in your file system and have Dataiku reload the data without having to modify the flow. Whereas with the manual upload you did you will need manually update each upload dataset to refresh the data.

    Finally as I said before you can easily add a row number to your datasets using the Window visual recipe. Just set the order to whatever column you want to order on in the Window definitions section of the recipe and then click on Row Number on the Aggregations section. Run the recipe and you will have a dataset with row numbers. Then use the Join recipe to join via row number and set the Join type to Full Outer. If this doesn't work post screen shots of your flow and how you are setting your recipes and what you get an output which is not what you expect.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Options

    Can you please give some sample dummy data records for your datasets and what the output should look like? It sounds to me that you want to join the datasets not stack them. But they don’t seem to have any common columns so it’s unclear to me how you expect Dataiku to do the join.

  • gavin
    gavin Registered Posts: 6
    Options

    Both Dataset1 and Dataset2 are csv files.

    Dataset1:

    AB
    1a
    2b
    3c
    4a
    5b
    6c
    7a
    8b
    9c
    10a

    Dataset2:

    CD
    ffg1
    sg3
    fg5
    sgsf7
    fgs9
    sfg2
    sg4
    fg6
    sg8
    sfg10

    Dataset3 (what I expect)

    ABCD
    1affg1
    2bsg3
    3cfg5
    4asgsf7
    5bfgs9
    6csfg2
    7asg4
    8bfg6
    9csg8
    10asfg10

    I want to join Dataset1 and Dataset2 by index,

    but it seems that they should have some relations.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Options

    Do you want a visual or code recipe? This is very easy to do in a Python code recipe because Dataiku will load the CSV files into a Pandas data frame which has an index column.

  • gavin
    gavin Registered Posts: 6
    Options

    I would like a visual recipe. I know that it's easy in python, but I want to try completely no-code usage.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Options

    I would use the Files in Folder dataset to load the CSV files adding the Row ID as shown in this article I posted. Then simply use a Join recipe, expose all columns in the output and join by the Row ID column using an Outer join so you get rows from both sides (in case the files don't have the same number of records).

  • gavin
    gavin Registered Posts: 6
    Options

    It is a good idea to add row ID by loading folder then join.

    However, it seems cannot contain same row.

    Moreover, if I get the Dataset1 and Dataset2 by other recipes rather than csv file in folders, it will be a little complicated (need to save and load again).

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Options

    "However, it seems cannot contain same row."

    >> I don't really understand what you mean by this. Explain the steps, process and errors that you are seeing.

    "Moreover, if I get the Dataset1 and Dataset2 by other recipes rather than csv file in folders, it will be a little complicated (need to save and load again)."

    >> I can only advise on solutions as you present the problems. You can't expect a solution to apply to a different problem. You said you had CSVs that's why I proposed using Files in Folder. If you have a dataset stored on other technologies like SQL you can use the rank() SQL function to number the rows. This can also be done on the Window recipe.

  • gavin
    gavin Registered Posts: 6
    Options

    1. I mistake something. It works.

    2.Sorry for my imprecise descriptions. I am a newer of this system so that I don't know all of these components. I use "upload file" to import csvfile Dataset1 and Dataset2. After using "prepare recipe", I get Dataset1_processed, Dataset2_processed. They have no relate columns but I want to concatenate them by index (just like previous description).

    I want to use no-code way(only by clicking) to achieve it.

    Maybe it doesn't have this function yet (or need some complicated operations.)

    Many thanks for your answer!

  • gavin
    gavin Registered Posts: 6
    Options

    Thank you very much.

    I finally find the solution what I want.

    I have learn a lot from you.

    Thanks again.

Setup Info
    Tags
      Help me…