Writing to an HDFS dataset without overwriting, from Python

AndresSilva · ‎07-28-2020

Hi!

I have built a Dataiku WebApp with a Python backend. It works very well.

I would now like to log some of the user interactions to a Dataiku HDFS dataset.

Code like the following works, but it overwrites the dataset.

import dataiku
import pandas as pd

# Variant 1: write the dataframe directly
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
dataset.write_dataframe(output)

# Variant 2: go through a writer
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
writer = dataset.get_writer()
try:
    writer.write_dataframe(output)
finally:
    writer.close()

Is there a way to append data instead of overwriting it?

Thanks!

AndresSilva · ‎08-05-2020

I was able to get it done by taking advantage of the HiveExecutor.

In general, my implementation looks as follows:

import dataiku
from dataiku import HiveExecutor

# The query in myQuery should have the following structure:
myQuery = """
INSERT INTO databaseName.dss_ProyectName_DatasetName
SELECT 'myStringValue', myNumericValue
"""

# Insert records into the existing dataset
myDataikuDataset = dataiku.Dataset("datasetName")
executor = HiveExecutor(dataset=myDataikuDataset)
resultdf = executor.query_to_df(myQuery)  # resultdf isn't used

# Synchronize to the Hive metastore to make the changes visible to Hive
client = dataiku.api_client()
project = client.get_project('proyectName')
myDataikuDataset2 = project.get_dataset('datasetName')
myDataikuDataset2.synchronize_hive_metastore()
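
Since the values to log come from user interactions, the INSERT statement usually has to be assembled at runtime. Below is a minimal sketch of one way to do that, assuming one row per call. `hive_literal` and `build_insert` are hypothetical helper names, not part of the Dataiku API, and the escaping only covers strings, numbers, booleans and None:

```python
def hive_literal(value):
    """Render a Python value as a Hive literal (strings, numbers, booleans, None)."""
    if value is None:
        return "NULL"
    # bool must be tested before int, because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Hive string literals use C-style escapes: escape backslashes first, then quotes
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'%s'" % text


def build_insert(table, values):
    """Build an INSERT INTO ... SELECT statement that appends a single row."""
    columns = ", ".join(hive_literal(v) for v in values)
    return "INSERT INTO %s SELECT %s" % (table, columns)
```

For example, `build_insert("databaseName.dss_ProyectName_DatasetName", ["myStringValue", 42])` produces a statement with the structure shown above, which can then be passed to `executor.query_to_df`.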

fchataigner2 · ‎09-08-2020

Hi,

In classical Hadoop, which DSS uses, HDFS datasets can't be appended to. The Hadoop libraries treat a directory as either nonexistent or complete: while the data is being produced, the files actually live in a hidden subfolder and are moved into place at the end.
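
The "write to a hidden location, then move" behaviour can be illustrated on a local filesystem. This is only an analogy under simplifying assumptions, not Hadoop's actual committer code; `write_then_publish` and the `._staging_` prefix are made up for the illustration:

```python
import os
import shutil
import tempfile


def write_then_publish(final_dir, files):
    """Write `files` (name -> bytes) so that final_dir only appears once complete."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Hidden staging directory next to the destination, so the rename
    # stays on the same filesystem
    staging = tempfile.mkdtemp(prefix="._staging_", dir=parent)
    try:
        for name, payload in files.items():
            with open(os.path.join(staging, name), "wb") as handle:
                handle.write(payload)
        # Readers see either no directory at all or the finished one
        os.rename(staging, final_dir)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

With this model there is no natural place for an "append": the destination is produced as a whole, which is why a second write replaces the data rather than adding to it.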

For the use case you describe, you can:

- use a managed folder instead of a dataset, and use the API to write CSV files into it (or whatever format you prefer and can produce). The folder can then be read as a dataset using a "Files in folder" dataset. For example, a button click could run this in the Python backend:

import dataiku
import pandas as pd
from datetime import datetime

# `app` is the Flask application that DSS provides to the webapp's Python backend
@app.route('/push-one')
def push_one():
    f = dataiku.Folder('f')
    df = pd.DataFrame({"foo":[1, 2], "bar":["x", "y"]})

    # Serialize to CSV in memory, then upload the bytes as a new, uniquely named file
    data = df.to_csv(header=True, sep=",", index=False).encode("utf-8")
    file_name = "added_%s.csv" % datetime.now().strftime("%Y-%m-%d-%H-%M-%S.%f")
    f.upload_data(file_name, data)
    return 'done'

- partition the dataset and write a new partition each time
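
The partition option could look roughly like the sketch below. It assumes that "my_dataset" is partitioned on a single discrete dimension and that a timestamp is an acceptable partition identifier; `new_partition_id` and `write_as_new_partition` are hypothetical names chosen for the example:

```python
from datetime import datetime


def new_partition_id(now=None):
    """Build a unique identifier for one write (hypothetical naming scheme)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%dT%H%M%S%f")


def write_as_new_partition(df, dataset_name="my_dataset"):
    """Write df into a fresh partition, leaving existing partitions untouched."""
    # Imported here so that new_partition_id stays usable outside DSS
    import dataiku

    dataset = dataiku.Dataset(dataset_name)
    # Choosing the write partition is possible in webapps and notebooks;
    # in recipes the Flow decides which partition is written
    dataset.set_write_partition(new_partition_id())
    dataset.write_dataframe(df)
```

Each call then overwrites only its own, brand-new partition, so earlier records are kept.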

Both options will produce lots of small files, so it's probably a good idea to find a way to batch the data in the webapp before writing it (if doable).
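
One possible way to batch writes in the backend is to buffer rows in memory and write them out only once enough have accumulated. The sketch below is an assumption about how this could be done, not a DSS feature; `EventBuffer`, `sink` and `max_rows` are illustrative names, and rows still in memory are lost if the backend stops before `flush` is called:

```python
import threading


class EventBuffer:
    """Collect rows in memory and hand them to `sink` in batches."""

    def __init__(self, sink, max_rows=100):
        self._sink = sink              # callable that receives a list of row dicts
        self._max_rows = max_rows
        self._rows = []
        self._lock = threading.Lock()  # requests may be handled concurrently

    def add(self, row):
        """Buffer one row; write the batch out once max_rows is reached."""
        with self._lock:
            self._rows.append(row)
            if len(self._rows) < self._max_rows:
                return
            batch, self._rows = self._rows, []
        self._sink(batch)

    def flush(self):
        """Write out whatever is buffered, for example on a timer or at shutdown."""
        with self._lock:
            batch, self._rows = self._rows, []
        if batch:
            self._sink(batch)
```

Here `sink` would be a function that turns the batch into a DataFrame and writes one file or one partition, as in the two options above.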
