Join 2 datasets matching a date into an interval
 
            
                
                    UserBird                
                
                    Dataiker, Alpha Tester Posts: 535 Dataiker                
            
                        
            
                    Hello
I have to datasets that I want to join. The first dataset as a DATE column, but the second dataset as a BEGIN_DATE and END_DATE column.
I want to join rows from dataset2 into dataset1 analysis.
Is there a way to join the 2 datasets directly or a way to enrich dataset2 and add rows matching BEGIN_DATE en END_DATE?
Best regards
Geoff.
                        I have to datasets that I want to join. The first dataset as a DATE column, but the second dataset as a BEGIN_DATE and END_DATE column.
I want to join rows from dataset2 into dataset1 analysis.
Is there a way to join the 2 datasets directly or a way to enrich dataset2 and add rows matching BEGIN_DATE en END_DATE?
Best regards
Geoff.
            Tagged:
            
        
            Answers
- 
            I succeeded in doing the "join" (actually it is not, but for my data it is). The goal is to tag each record from DATASET1 for being/not being in a specific time period. Time periods are defined in DATASET2. I did it using Python. For those interested, here is the script (As i'm not a Python expert, more elegant code can be proposed): # -*- coding: utf-8 -*-
 import dataiku
 import pandas as pd, numpy as np
 from dataiku import pandasutils as pdu
 # Input datasets
 DATASET1 = dataiku.Dataset("DATASET1")
 DATASET1_df = histo_MAG_JF_ZV.get_dataframe()
 DATASET2 = dataiku.Dataset("DATASET2")
 DATASET2_df = soldes_prepared.get_dataframe()
 # Output datasets
 DATASET1_outuput = dataiku.Dataset("DATASET1_outuput")
 # Pour chaque ligne du tableau de Soldes
 for d in DATASET2.iter_rows():
 # FOR DEBUG
 #print d
 begin_date = d['begin']
 end_date = d['end']
 cond1 = (DATASET1_df['Date'] >= begin_date)
 cond2 = (DATASET1_df['Date'] <= end_date)
 matching_records = DATASET1_df[cond1][cond2]
 # Add begin/end date to output dataset
 matching_records['begin_date'] = begin_date
 matching_records['end_date'] = end_date
 DATASET1_outuput.write_with_schema(matching_records)Next I left join DATASET1 with DATASET1_output to recover non mathing records, as the following analysis needs both matching and non matching records. Geoff