Creating dataset from Pandas

RadAniba Registered Posts: 6 ✭✭✭✭
edited July 16 in Using Dataiku

Hello nice people

I am pulling data from a REST API in a Jupyter notebook in DSS and doing a lot of processing on the pandas dataframe I am creating.

I would like to save the dataframe as a dataset that I can later explore within the project I am working in.

I am trying something like:

if not results.empty:
    output_data = dataiku.Dataset(instrument_name + "_" + event_name + "_" + timestr)
    output_data.write_dataframe(results)

But I always run into this error:

Unable to fetch schema for PROJ1.participant_screening_20210623-233350: dataset does not exist: PROJ1.participant_screening_20210623-233350

I tried some other alternatives to write the dataframe into a dataset, but DSS seems to look for a schema with the project name (PROJ1)?

Is there an easy way to get the dataframes into DSS datasets?

PS: this is a test instance, I am not using a database for intermediate datasets but writing to disk for testing purposes.

Thanks

Rad

Answers

  • Marlan
    Marlan Neuron Posts: 319
    edited July 17

    Hi @RadAniba,

    Following is an example of how to create a new dataset in Python and then write a dataframe to it.

    import dataiku
    
    dataset_name = 'TEST'
    
    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    # Create a SQL dataset (you can create other types by specifying different parameters for the with_store_into method)
    # Documentation here: https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#programmatic-creation-and-setup-managed-datasets
    # Note that documentation shows project.new_managed_dataset which is incorrect
    builder = project.new_managed_dataset_creation_helper(dataset_name)
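    # "NZ_DSWRK" is the name of a SQL (Netezza) connection on this instance; substitute one of your own connections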
    builder.with_store_into("NZ_DSWRK")
    builder.create() 
    
    # Write dataframe to dataset
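    # 'df' here is the pandas DataFrame you want to save (e.g., your 'results' dataframe)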
    dataiku.Dataset(dataset_name).write_with_schema(df)

    The dataset will show in the UI as not built. You can right-click the dataset and choose "Mark as built" to fix this.

    Note that this example uses both the "external" API (dataikuapi) to create the dataset and the internal API (dataiku) to write the dataframe to it. The differences between the two are covered in more detail in the Dataiku documentation.
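    As a quick illustration (just a sketch of the example above, annotated with which package each object comes from), the split looks like this:

    import dataiku  # "internal" API: reading and writing datasets from notebooks and recipes

    # dataiku.api_client() returns a client from the external/public API (dataikuapi)
    client = dataiku.api_client()                                 # dataikuapi DSSClient
    project = client.get_project(dataiku.default_project_key())   # dataikuapi project handle, used to create the dataset
    dataset = dataiku.Dataset(dataset_name)                       # internal-API handle, used to write the dataframe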

    Hope this helps.

    Marlan

  • RadAniba
    RadAniba Registered Posts: 6 ✭✭✭✭

    Thank you @Marlan, this is really helpful.

    Would it be possible to create a dataset without the SQL method? If I understand correctly, this is based on a Postgres connection to store intermediate files, but what if I wanted to test by writing a dataset to disk instead?

    Rad

  • Marlan
    Marlan Neuron Posts: 319

    Hi @RadAniba,

    You can write to any type of SQL database you have a connection set up for. The example I gave used a connection to a Netezza database. I'm not sure what you mean about storing intermediate tables; we write both intermediate and final output data to Netezza.

    To create a file dataset, use the filesystem folders connection in the "with_store_into" method, e.g., builder.with_store_into('filesystem_folders')

    Note also that you can pass "overwrite=True" to the create method.
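    For example, here is a minimal sketch of the same pattern writing to disk (assuming a filesystem connection named 'filesystem_folders' as above; the connection name on your instance may differ, and 'TEST_FS' is just a placeholder dataset name):

    import dataiku

    dataset_name = 'TEST_FS'

    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    # Create a filesystem-backed managed dataset, overwriting any existing definition
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into('filesystem_folders')
    builder.create(overwrite=True)

    # Write the dataframe (e.g., the 'results' dataframe from your notebook) to the new dataset
    dataiku.Dataset(dataset_name).write_with_schema(results)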

    Marlan

  • RadAniba
    RadAniba Registered Posts: 6 ✭✭✭✭

    Fantastic,

    thank you for clarifying, this helped very much
