Writing to an HDFS dataset without overwriting, from Python
Hi!
I have built a Dataiku webapp with a Python backend, and it works very well.
Right now, I would like to log some of the user interactions into a Dataiku HDFS dataset.
Code like the following works, but it overwrites the dataset:
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
dataset.write_dataframe(output)
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
writer = dataset.get_writer()
try:
    writer.write_dataframe(output)
finally:
    writer.close()
Is there a way to append data instead of overwriting it?
Thanks!
Best Answer
I was able to get it done by taking advantage of the HiveExecutor.
In general, my implementation looks as follows:
import dataiku
from dataiku.core.sql import HiveExecutor

# Insert records into the existing dataset
myDataikuDataset = dataiku.Dataset("datasetName")
executor = HiveExecutor(dataset=myDataikuDataset)
resultdf = executor.query_to_df(myQuery)  # the resulting DataFrame isn't used

# Synchronize the Hive metastore to make the changes visible to Hive
client = dataiku.api_client()
project = client.get_project('projectName')
myDataikuDataset2 = project.get_dataset('datasetName')
myDataikuDataset2.synchronize_hive_metastore()

# The query in myQuery should have the following structure:
# INSERT INTO databaseName.dss_projectName_datasetName
# SELECT 'myStringValue', myNumericValue
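Just to illustrate (this is not part of the answer above): in a webapp backend, myQuery could be assembled per interaction roughly as follows. The table name, the columns and the log_interaction helper are hypothetical placeholders and have to match the Hive table backing your dataset:

import dataiku
from dataiku.core.sql import HiveExecutor

def log_interaction(user_name, click_count):
    # Hypothetical helper: append one log row through Hive.
    # No escaping is done here, so don't feed it untrusted strings as-is.
    my_query = "INSERT INTO databaseName.dss_projectName_datasetName SELECT '%s', %d" % (user_name, click_count)
    dataset = dataiku.Dataset("datasetName")
    executor = HiveExecutor(dataset=dataset)
    executor.query_to_df(my_query)  # the returned DataFrame is ignored; the INSERT is the point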
Answers
Hi,
In classical Hadoop, which DSS uses, HDFS datasets can't be appended to. The Hadoop libs consider a dataset's directory as either non-existent or complete (the trick being that while the data is being produced, the files actually sit in a hidden subfolder and are only moved into place at the end).
For the use case you describe, you can:
- use a managed folder instead of a dataset, and use the API to write CSV files into it (or whatever format you prefer and can produce). The folder can then be read as a dataset using a "Files in folder" dataset. For example, a click of a button could run this in the Python backend:
import dataiku
import pandas as pd
from flask import request
from io import BytesIO
from datetime import datetime

@app.route('/push-one')
def push_one():
    f = dataiku.Folder('f')
    df = pd.DataFrame({"foo": [1, 2], "bar": ["x", "y"]})
    buf = BytesIO()
    df.to_csv(buf, header=True, sep=",", index=False)
    file_name = "added_%s.csv" % datetime.now().strftime("%Y-%m-%d-%H-%M-%S.%f")
    f.upload_data(file_name, buf.getvalue())
    return 'done'
- partition the dataset and write a new partition each time (a rough sketch follows below)
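Not from the answer itself, but to illustrate the partition option: assuming the dataset is partitioned by a single day-level time dimension, and with placeholder names, the backend could do something along these lines:

import dataiku
import pandas as pd
from datetime import datetime

def append_as_partition(rows):
    # rows: a list of dicts collected from user interactions (hypothetical)
    dataset = dataiku.Dataset("my_dataset")
    # Target the partition for the current day; setting the write partition
    # is possible from a webapp backend (it is only forbidden inside recipes)
    dataset.set_write_partition(datetime.now().strftime("%Y-%m-%d"))
    # Write the rows into that partition (the dataset's schema must already exist)
    dataset.write_dataframe(pd.DataFrame(rows))

Keep in mind that writing the same partition twice replaces its contents, so the partition granularity (or some buffering before writing) has to match how often the backend flushes.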
Both options will produce lots of small files, so it's probably a good idea to find a way to merge the data to be written in the webapp (if doable).
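If merging in the webapp is doable, one simple pattern is to buffer rows in memory and only flush a batch to the managed folder once enough have accumulated. A minimal sketch along the lines of the folder example above (buffer size, folder id and file naming are arbitrary choices, and anything still in the buffer is lost if the backend restarts):

import dataiku
import pandas as pd
from datetime import datetime

BUFFER = []          # pending interaction rows, kept in the backend's memory
FLUSH_EVERY = 100    # flush once this many rows have accumulated (arbitrary)

def log_row(row):
    BUFFER.append(row)
    if len(BUFFER) >= FLUSH_EVERY:
        flush_buffer()

def flush_buffer():
    if not BUFFER:
        return
    csv_bytes = pd.DataFrame(BUFFER).to_csv(index=False).encode("utf-8")
    file_name = "added_%s.csv" % datetime.now().strftime("%Y-%m-%d-%H-%M-%S.%f")
    dataiku.Folder('f').upload_data(file_name, csv_bytes)
    del BUFFER[:]    # clear the buffer once the batch has been uploaded

This produces one CSV per batch instead of one per click.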