How to add data to an existing dataset with Python?

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
I have a dataset named weather_data, and I want to add data to it every day.

How can I do this with Python?

Best Answers

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭
    edited July 17 Answer ✓

    Hi,

    I would suggest reading the input dataset in as a Pandas dataframe, handling the append in the dataframe itself, and then writing the resulting dataframe (in overwrite mode) into your output dataset.

    For example, something like:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    inter = dataiku.Dataset("inter")
    input_df = inter.get_dataframe()
    
    # Create dataframe containing row you want to append
    append_row = {'my_column': ['foobar']}
    append_df = pd.DataFrame(data=append_row)
    
    # Append row to input dataframe
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    output_df = pd.concat([input_df, append_df], ignore_index=True)
    
    # Write recipe outputs
    inter_temp = dataiku.Dataset("inter_temp")
    inter_temp.write_with_schema(output_df)

    I hope that this helps! I would also suggest checking out the following Pandas documentation, which provides more examples and details about DataFrame.append (note that it was deprecated in pandas 1.4 and removed in 2.0, in favor of pd.concat):
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

    Best,

    Andrew

  • Xx-KaAzZ-xX
    Xx-KaAzZ-xX Registered Posts: 2 ✭✭✭✭
    edited July 17 Answer ✓

    Hi ATsao,

    Thanks for your reply; it could help me in the future. I found a way without a dataframe. Posting it here for anyone who wants an example of write_row_dict, as I did yesterday.

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku.core.sql import SQLExecutor2
    from dataiku import pandasutils as pdu
    
    # Read the input dataset's schema and copy it to the temporary dataset
    input_dataset = dataiku.Dataset("dataset")
    schema = input_dataset.read_schema()
    output_dataset = dataiku.Dataset("dataset_temp")
    output_dataset.write_schema(schema)
    
    ## Then open a writer to append rows
    writer = output_dataset.get_writer()
    try:
        foobar = "foobar"
        values = {
            "colonne1": foobar,
            "colonne2": foobar,
            "colonne3": 1,
            "colonne4": foobar
        }
        ## write_row_dict only accepts a dictionary of values
        writer.write_row_dict(values)
    finally:
        # Close the writer in all cases, not only on error
        writer.close()
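    As a side note, the dataiku writer can also be used as a context manager, which guarantees it is closed even if writing fails. A minimal sketch under the same assumed dataset and column names (worth checking against the DSS docs for your version):

```python
import dataiku

output_dataset = dataiku.Dataset("dataset_temp")

# The writer closes itself when the with-block exits,
# even if an exception is raised while writing.
with output_dataset.get_writer() as writer:
    writer.write_row_dict({
        "colonne1": "foobar",
        "colonne2": "foobar",
        "colonne3": 1,
        "colonne4": "foobar",
    })
```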

Answers

  • larispardo
    larispardo Registered Posts: 28 ✭✭✭✭✭
    Of course, now that I read it after a good sleep I see what you mean, thanks dev. It would depend on how you have your new data; I guess the easiest way would be with pandas' append function. I believe this page will help you: https://stackoverflow.com/questions/14988480/pandas-version-of-rbind
  • kenjil
    kenjil Dataiker, Alpha Tester, Product Ideas Manager Posts: 19 Dataiker
    You may want to use

    - the append mode on the output dataset. This setting is available in the Input/Output tab of the recipe => the data will be appended to the output dataset. Note that this mode is only available on recipes whose output dataset uses an infrastructure that allows appending (e.g. it is not possible with HDFS)

    - partitioning on the output dataset. Each day, you write your weather data into that day's partition of the dataset. This mode works regardless of the connection. See https://doc.dataiku.com/dss/latest/partitions/index.html for more details.
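    To illustrate the partitioned approach, here is a minimal sketch of writing today's data into a day partition from code running outside a recipe (e.g. a scenario step). The dataset name, column names, and partition format are assumptions, not taken from this thread; set_write_partition is not allowed inside recipes, where the Flow controls the target partition:

```python
import dataiku
import pandas as pd
from datetime import date

# Hypothetical daily weather dataset, partitioned by day (YYYY-MM-DD)
today = date.today().strftime("%Y-%m-%d")

weather = dataiku.Dataset("weather_data")
# Select today's partition as the write target
weather.set_write_partition(today)

# Illustrative rows; in practice this would be the day's new data
df = pd.DataFrame({"city": ["Paris"], "temp_c": [12.5]})
weather.write_with_schema(df)
```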
  • Xx-KaAzZ-xX
    Xx-KaAzZ-xX Registered Posts: 2 ✭✭✭✭
    edited July 17

    Hi,

    That's one of the only topics I found, and I have the same problem as @UserBird.

    I would like to add one row to an existing dataset with a python recipe.

    I'm looking for examples on the Internet and I can't find any... This is what I would like to do:

    input_dataset = dataiku.Dataset("inter")
    output_dataset = dataiku.Dataset("inter_temp")
    foobar="foobar"
    
    output_dataset.iter_rows(columns='my_column', values=foobar)
    ##Or something else but it should be very easy and I can't find a way...

    If anyone has an answer, it would be greatly appreciated!

    Have a good day.

  • Sprint_Chase
    Sprint_Chase Registered Posts: 1 ✭✭✭

    Sure, the marked answer is correct.

    But R language is also used for Data Science.

    When it comes to appending data frames, the rbind() and cbind() functions come to mind because they can concatenate data frames vertically and horizontally, respectively. In this example, we will see how to use the rbind() function to append data frames.

    To append data frames in R, use the rbind() function. rbind() is a built-in R function that can combine several vectors, matrices, and/or data frames by rows.
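    Since the rest of this thread uses Python, here is the pandas analogue of R's rbind, pd.concat, as a quick sketch (frame and column names are made up for illustration):

```python
import pandas as pd

# Two frames with the same columns, like rbind's inputs in R
a = pd.DataFrame({"city": ["Paris"], "temp_c": [12.5]})
b = pd.DataFrame({"city": ["Lyon"], "temp_c": [14.0]})

# Row-wise concatenation; ignore_index renumbers the rows 0..n-1
combined = pd.concat([a, b], ignore_index=True)
# combined["city"] now holds ["Paris", "Lyon"]
```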

  • Alka
    Alka Registered Posts: 2

    I've been smashing my head on this subject for a few days.

    I'm processing a flow of data which I dispatch into partitions; some of my Python code runs from scenarios so that it can properly switch between reading and writing partitions on the go.

    The data is stored on an Azure Blob Storage, CSV-like.

    When I have to write additional data to a partition, I can't find a way to do it as an append by simply adding a file to the partition.

    For example, I'm also running a continuous Kafka sync recipe, which does exactly what I want, since I can list the partitions and get:

    (screenshot of the partition file listing omitted)

    By contrast, my Python script in a scenario only generates one file every time, so I have to reload all the data from the partition into memory and rewrite everything with the additional data.
    Since I'm switching partitions, I cannot use a Python script in a recipe and simply click the "append" button.

    I just want a simple way to tell a writer to put data into a specific new file at the specific partition location; how is that so difficult?

    And no, pandas-like answers are not valid since they are too time-consuming.

    Any help ?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    Can you please start a new thread? The original thread is from 2017 and has already been marked as solved. Thanks
