Writing to an HDFS dataset without overwriting, from Python

AndresSilva
AndresSilva Registered Posts: 9 ✭✭✭✭

Hi!

I have built a Dataiku WebApp with python as a backend. It works very well.

Right now, I would like to log some of the user interactions, into a Dataiku HDFS Dataset.

Code like the following works, but it overwrites the dataset:

import dataiku
import pandas as pd

# Variant 1: write the dataframe directly to the dataset
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
dataset.write_dataframe(output)

# Variant 2: write through an explicit writer
dataset = dataiku.Dataset("my_dataset")
d = {'col1': [1, 2], 'col2': [3, 4]}
output = pd.DataFrame(data=d)
writer = dataset.get_writer()
try:
    writer.write_dataframe(output)
finally:
    writer.close()

Is there a way to append data instead of overwriting it?

Thanks!

Best Answer

  • AndresSilva
    AndresSilva Registered Posts: 9 ✭✭✭✭
    edited July 2024 Answer ✓

    I was able to get it done by taking advantage of the HiveExecutor.

    In general, my implementation looks as follows:

    import dataiku
    from dataiku.core.sql import HiveExecutor

    # The INSERT query to run; the database and Hive table names are placeholders
    myQuery = """
    INSERT INTO databaseName.dss_ProjectName_DatasetName
    SELECT 'myStringValue', myNumericValue
    """

    # Insert records into the existing dataset
    myDataikuDataset = dataiku.Dataset("datasetName")
    executor = HiveExecutor(dataset=myDataikuDataset)
    resultdf = executor.query_to_df(myQuery)  # the resulting DataFrame isn't used

    # Synchronize the Hive metastore to make the changes visible to Hive
    client = dataiku.api_client()
    project = client.get_project('projectName')
    myDataikuDataset2 = project.get_dataset('datasetName')
    myDataikuDataset2.synchronize_hive_metastore()

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    edited July 2024

    Hi,

    In classical Hadoop, which DSS uses, HDFS datasets can't be appended to. The Hadoop libraries consider a directory as either not existing or ready (the trick being that while the data is being produced, the files actually sit in a hidden subfolder and are moved into place at the end).

    For the use case you describe, you can:

    - use a managed folder instead of a dataset, and use the API to write CSV files into it (or whatever format you prefer and can produce). The folder can then be read as a dataset using a "Files in folder" dataset. For example, a click of a button could run this in the Python backend:

    import dataiku
    import pandas as pd
    from flask import request
    from io import BytesIO
    from datetime import datetime
    
    @app.route('/push-one')
    def push_one():
        f = dataiku.Folder('f')
        df = pd.DataFrame({"foo":[1, 2], "bar":["x", "y"]})
    
        buf = BytesIO()
        df.to_csv(buf, header=True, sep=",", index=False)
        file_name = "added_%s.csv" % datetime.now().strftime("%Y-%m-%d-%H-%M-%S.%f")
        f.upload_data(file_name, buf.getvalue())
        return 'done'

    - partition the dataset and write a new partition each time (see the sketch after the note below)

    Both options will produce lots of small files, so it's probably a good idea to batch the data in the webapp and merge it before writing (if doable).
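
    For the partitioning route, here is a minimal sketch of what the backend could do. It assumes the HDFS dataset has already been configured as partitioned with a discrete partitioning dimension, so each call can target a fresh partition id; the dataset name, route name and id scheme are placeholders.

    import dataiku
    import pandas as pd
    from datetime import datetime
    
    @app.route('/push-partition')
    def push_partition():
        # "my_partitioned_dataset" is a placeholder for a partitioned HDFS dataset
        dataset = dataiku.Dataset("my_partitioned_dataset")
    
        # Use a timestamp as the partition identifier so every call writes a new partition;
        # re-writing an existing partition id would replace only that partition
        partition_id = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        dataset.set_write_partition(partition_id)
    
        # Write the rows collected from the webapp into that partition
        df = pd.DataFrame({"foo": [1, 2], "bar": ["x", "y"]})
        writer = dataset.get_writer()
        try:
            writer.write_dataframe(df)
        finally:
            writer.close()
        return 'done'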
