
Writing to a HDFS dataset without overwriting, from python


Hi!

I have built a Dataiku webapp with a Python backend. It works very well.

Now I would like to log some of the user interactions to a Dataiku HDFS dataset.

Code like the following works, but it overwrites the dataset:

import dataiku
import pandas as pd

# Variant 1: direct write
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
dataset.write_dataframe(output)

# Variant 2: explicit writer
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
writer = dataset.get_writer()
try:
    writer.write_dataframe(output)
finally:
    writer.close()

 

Is there a way to append data instead of overwriting it?

Thanks!

1 Reply
Author

I was able to get it done by taking advantage of the HiveExecutor.

In general, my implementation looks as follows:

 

import dataiku
from dataiku import HiveExecutor

# The INSERT query to run; the Hive table name follows the
# dss_<projectName>_<datasetName> convention
myQuery = """
INSERT INTO databaseName.dss_projectName_datasetName
SELECT 'myStringValue', myNumericValue
"""

# Insert records into the existing dataset
myDataikuDataset = dataiku.Dataset("datasetName")
executor = HiveExecutor(dataset=myDataikuDataset)
resultdf = executor.query_to_df(myQuery)  # the result dataframe isn't used

# Synchronize the Hive metastore to make the changes visible
client = dataiku.api_client()
project = client.get_project('projectName')
myDataikuDataset2 = project.get_dataset('datasetName')
myDataikuDataset2.synchronize_hive_metastore()
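As an alternative, if the interaction log stays small enough to fit in memory, a plain read-concatenate-write round trip also behaves like an append. This is a sketch rather than something from the thread; the Dataiku calls (shown as comments) and the dataset name `my_dataset` are assumptions, and only the pandas append logic is shown live:

```python
import pandas as pd

# In the webapp backend, the current contents would come from DSS (assumed):
#   existing = dataiku.Dataset("my_dataset").get_dataframe()
existing = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})  # stand-in for the current dataset

# New interaction rows to log
new_rows = pd.DataFrame({'col1': [5], 'col2': [6]})

# Append in memory, then write the combined frame back; the write still
# overwrites, but the result is equivalent to an append:
combined = pd.concat([existing, new_rows], ignore_index=True)
#   dataiku.Dataset("my_dataset").write_dataframe(combined)
```

This avoids Hive entirely, at the cost of re-reading and re-writing the whole dataset on every write, so it only makes sense for small logs.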

 
