How to concat multiple columns by index?
I have 2 dataset with different column names but same length.
Dataset 1 have column A,B, length is 10.
Dataset 2 have column C,D, length is 10.
I want to get Dataset 3 by concating Dataset1&2, Its column name are A,B,C,D and length is 10.
I have tried join and stack, but I can't get what I expect.
Operating system used: Windows
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
There are different ways of uploading files to Dataiku. I presume you have simply dropped the files in the flow in which case Dataiku will put the files in a server's file system connection and load them in a dataset with an arrow pointing up, indicating they have been uploaded. There is however a better way of loading files in Dataiku. First go to the Dataiku Administration => Connections settings and create a new server's file system connection pointing the path to a folder of your choice in your system. Then go your flow click on the Dataset button and select Folder. Create a new folder in the new server's file system connection. Once the new folder is created and you are opening it from the flow you can add files to it. Either drag and drop from Windows Explorer (multiple files selected are allowed) or click on the Add File button. Having now added the files to the folder you can add the Files in Folder dataset as per my advice on the linked solution above.
This is a much better solution because you can now easily update the files in the physical folder in your file system and have Dataiku reload the data without having to modify the flow. Whereas with the manual upload you did you will need manually update each upload dataset to refresh the data.
Finally as I said before you can easily add a row number to your datasets using the Window visual recipe. Just set the order to whatever column you want to order on in the Window definitions section of the recipe and then click on Row Number on the Aggregations section. Run the recipe and you will have a dataset with row numbers. Then use the Join recipe to join via row number and set the Join type to Full Outer. If this doesn't work post screen shots of your flow and how you are setting your recipes and what you get an output which is not what you expect.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
Can you please give some sample dummy data records for your datasets and what the output should look like? It sounds to me that you want to join the datasets not stack them. But they don’t seem to have any common columns so it’s unclear to me how you expect Dataiku to do the join.
-
Both Dataset1 and Dataset2 are csv files.
Dataset1:
A B 1 a 2 b 3 c 4 a 5 b 6 c 7 a 8 b 9 c 10 a Dataset2:
C D ffg 1 sg 3 fg 5 sgsf 7 fgs 9 sfg 2 sg 4 fg 6 sg 8 sfg 10 Dataset3 (what I expect)
A B C D 1 a ffg 1 2 b sg 3 3 c fg 5 4 a sgsf 7 5 b fgs 9 6 c sfg 2 7 a sg 4 8 b fg 6 9 c sg 8 10 a sfg 10 I want to join Dataset1 and Dataset2 by index,
but it seems that they should have some relations.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
Do you want a visual or code recipe? This is very easy to do in a Python code recipe because Dataiku will load the CSV files into a Pandas data frame which has an index column.
-
I would like a visual recipe. I know that it's easy in python, but I want to try completely no-code usage.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
I would use the Files in Folder dataset to load the CSV files adding the Row ID as shown in this article I posted. Then simply use a Join recipe, expose all columns in the output and join by the Row ID column using an Outer join so you get rows from both sides (in case the files don't have the same number of records).
-
It is a good idea to add row ID by loading folder then join.
However, it seems cannot contain same row.
Moreover, if I get the Dataset1 and Dataset2 by other recipes rather than csv file in folders, it will be a little complicated (need to save and load again).
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
"However, it seems cannot contain same row."
>> I don't really understand what you mean by this. Explain the steps, process and errors that you are seeing.
"Moreover, if I get the Dataset1 and Dataset2 by other recipes rather than csv file in folders, it will be a little complicated (need to save and load again)."
>> I can only advise on solutions as you present the problems. You can't expect a solution to apply to a different problem. You said you had CSVs that's why I proposed using Files in Folder. If you have a dataset stored on other technologies like SQL you can use the rank() SQL function to number the rows. This can also be done on the Window recipe.
-
1. I mistake something. It works.
2.Sorry for my imprecise descriptions. I am a newer of this system so that I don't know all of these components. I use "upload file" to import csvfile Dataset1 and Dataset2. After using "prepare recipe", I get Dataset1_processed, Dataset2_processed. They have no relate columns but I want to concatenate them by index (just like previous description).
I want to use no-code way(only by clicking) to achieve it.
Maybe it doesn't have this function yet (or need some complicated operations.)
Many thanks for your answer!
-
Thank you very much.
I finally find the solution what I want.
I have learn a lot from you.
Thanks again.