I have Full outer join - best way to merge data with similar columns

aw30 · November 2020

Hi - I have 2 data sets I would like to join together.

1) has forecast hours for a time slice

2) has actual hours for a time slice

If I perform a left join then the actual hours get appended to each line that the hours exist (good). If I perform an outer join then I additionally see actual hours that no forecast exists (good).

What is not good is that in order to create a dataset that has the same columns for all rows I have to select the columns from the right side in the join and rename them for them to be selected. Then for each column I have to do a check if there isn't data from the left side, use the right side column.

Is there an easier way to do this because my actual dataset has 10+ columns?

Example

Left side columns

time_slice
cost_center
forecast_hrs
business_unit

Right side columns

time_slice
cost_center
actual_hrs
business_unit

After the join I have columns:

time_slice
cost_center
forecast_hrs
business_unit
time_slice2
cost_center2
actual_hrs
business_unit2

I then add a prepare recipe and for each column use an if statement => if(isBlank(time_slice),time_slice2,time_slice)

Thank you for any ideas on simplifying this!

Ignacio_Toledo · November 2020

Hi @aw30
. So, just to be sure, after the last step you would like to have this table:

time_slice
cost_center
business_unit
forecast_hrs
actual_hrs

Is that correct?

Also, the time_slice data, what format does it have?

With those 2 pieces of information I think I would be able to help you.

aw30 · November 2020

Hi @Ignacio_Toledo,

Yes I would like to have the resulting dataset have only 5 columns. It doesn't matter the format of time_slice as it is just an example but the data type is the same in both data sets.

Best,

Anne

Ignacio_Toledo · November 2020

Hi @aw30
. Ok, I'll work in an example, and I'll assume that the 'time_slice' column just have a format/type that is common in both data sets, just like the cost center and business unit.

However, if the time slice was actually a date with a time, and let's say you should have in general 1 entry for every day or every hour, we could solve the issue using a resampling method, that would generate the missing time_slices with empty forecast or actual hours, making easier the outer join latter. (If that is the case let me know).

I'll get back to you when I have the example ready.

Cheers!

Ignacio_Toledo · November 2020

Hi @aw30
. After some trials, in the case that you have a 'time_slice' column that doesn't contain a valid timestamp (or data plus time), I wasn't able to found a procedure different to what you described (join recipe + prepare recipe), or more efficient at least (with less steps).

However, if is ok with you to use a python recipe, in that case I have a python script that can do the job in one step. Let me know if you would be interested on that.

(Better yet, if you could attach two dataset samples to use as inputs, I would be able to test other approachs too.. but I know that is not an option if your data is sensitive or not public)

aw30 · November 2020

Hi @Ignacio_Toledo
,

Thank you so much for looking into this! I would be interested in seeing your solution and you could change the field from a time_slice to a string or integer as that could be handled separately. I do have sensitive data so cannot upload but if you want to share your python script I can test it and see how it works.

Thank you again!

Ignacio_Toledo · November 2020

Hi @aw30
. Here is project attached with a python recipe. The pandas method `pd.merge` does exactly what you want when you do an outer join!

Let me know if you need more information to adapt the project to your data.

Cheers! I.

aw30 · November 2020

Thank you for sharing and I'll mark this as accepted solution once I have run it.

aw30 · November 2020

Hi - I started looking at this but am unsure how to import the project into my DSS as I have not done this before. Can you point me to the documentation on how to do this if you are able? Thanks again for all the help!

Ignacio_Toledo · November 2020

Hi @aw30
. No problem! Here is a short video showing how to import a project:

https://youtu.be/yZjUoZdaF2k

In my case, you will see an error in a moment: that is because I already have a project with the same name. But it was easy to solve by just giving a new name.

Hope this helps!

I.

aw30 · November 2020

Hi - thank you for the video but I am given errors upon importing but I did find the actual python recipe in your zip file. The fields I am joining are different and there are some differences in some of the fields I want to select so I first renamed the joined fields to be the same.

I then ran the script and the output joined the datasets but appended _x to the column name of the first dataset field that were the same and _y to the column name of the second dataset for columns not in the join criteria. For instance res_dept became res_dept_x and res_dept_y. I would still then have to do a prep recipe that merges the values together which is what I was hoping not to do.

I am not going to mark this as a solution but the python script may be helpful for others and really appreciate the time in helping me on this. If I come with with a total solution I will update this as well.

Thanks again!

Ignacio_Toledo · November 2020

Hi @aw30
. Ok, it might be happening that since the datasets in my project were connected to a local postgresql database, you are getting an error related to that, so here is a new version of the Project. If you get a warning related to missing plugins, just ignore them safetly.

The python recipe is here:

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs: Change the names of the inputs datasets here
work_actual_data_sql = dataiku.Dataset("work_actual_data_sql_copy")
work_actual_data_sql_df = work_actual_data_sql.get_dataframe()
work_planned_data_sql = dataiku.Dataset("work_planned_data_sql_copy")
work_planned_data_sql_df = work_planned_data_sql.get_dataframe()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
work_actual_data_sql_df.head()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
work_planned_data_sql_df.head()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
work_actual_data_sql_df.columns

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# The join columns will change for your use case of course
joinned_table_df = pd.merge(work_actual_data_sql_df,
work_planned_data_sql_df,
on=['WOWO_CODE', 'WOWO_EQUIPMENT', 'WOWO_JOB', 'WOWO_JOB_TYPE'], ## join columns
how='outer')

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
joinned_table_df.WOWO_CODE.nunique()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
joinned_table = dataiku.Dataset("joinned_table")
joinned_table.write_with_schema(joinned_table_df)

Hope this helps, because the good thing about the python script is that you don't need an extra step!

I.

Ignacio_Toledo · November 2020

sorry @aw30
, I miss the point about the _x and _y suffixes. Do you have the names of the columns of your two datasets? So I can help you with the change.

Best!

I.

aw30 · November 2020

Hi - I included one above res_dept. You can just choose whatever names you want but the problem is that they are the same name but not part of the join criteria so are getting _x and _y appended to them.

Thanks for the help!

Ignacio_Toledo · November 2020

Sorry @aw30
, I forgot the answer was in the description!

So, if you use the merge method from pandas, and we name your two dataframes as left_df and right_df, your code should need to look like this (the most important part is the parameter on):

pd.merge(
    left_df,
    right_df,
    on=['time_slice', 'cost_center', 'business_unit'],
    how='outer'
)

Provided that the input are the two dataframes you mentioned in the question, the output df should have the columns ['time_slice', 'cost_center', 'forecast_hrs', 'actual_hrs', 'business_unit'] , and no '_x' or '_y' suffixes.

The output, will have the either some None in 'actual_hours' and some in 'business_unit'.

Let me know if that works.

I have Full outer join - best way to merge data with similar columns

Answers

Categories

Setup Info

Tags