Python code to create a new Dataiku dataset

N_JAYANTH Registered Posts: 11 ✭✭✭✭

I would like to create massive Dataiku datasets using the Python interpreter, without creating them manually in the recipe.

Note: the following command works only if I have already created a Dataiku dataset called "myoutputdataset" in my recipe. My problem is to create a new Dataiku dataset without creating it beforehand in my recipe, and to save my pandas dataframe into it:


output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_with_schema(my_dataframe)

Answers

  • Thomas Dataiker Alumni Posts: 19 ✭✭✭✭✭

    Hi,

    "myoutputdataset" and "my_dataframe" are just placeholders that need to be changed with your own names / code.

For instance, the following (complete) recipe has an output DSS dataset called "results" which is filled by a pandas dataframe called "o":



    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd

    # Recipe inputs
    titanic = dataiku.Dataset("titanic")
    df = titanic.get_dataframe()

    # Some Python code
    # ...
    o = df.sort_values('PassengerId')  # DataFrame.sort was removed from pandas; sort_values is the replacement

    # Recipe outputs
    output = dataiku.Dataset("results")
    output.write_with_schema(o)

    Hope this helps.

  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
I think you misunderstood my question. I know that "myoutputdataset" and "my_dataframe" are just placeholders. In your code

    output = dataiku.Dataset("results")

    what is "results". I suppose its a dataiku database, So you have already have a dataiku database named "results". Thats why you are able to write into it. My Question is how do you create the "results" database in dataiku using python code
  • Thomas Dataiker Alumni Posts: 19 ✭✭✭✭✭

    The "results" Dataset is not created by the Python code, but when you create your Recipe first:

  • kenjil Dataiker, Alpha Tester, Product Ideas Manager Posts: 19 Dataiker
    The output dataset of a recipe is created in the recipe creation modal.

In case you really want to massively create datasets, there is a Python API to administer DSS that you can use:
    http://doc.dataiku.com/dss/latest/api/public/index.html
    Note that this API is NOT intended to be used to create the output dataset of a single recipe.
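
    For example, a minimal sketch with that API; the host URL, API key, project key and connection name below are placeholders to replace with your own values:

    import dataikuapi

    # Connect to the DSS instance from the outside
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # Mass-create a batch of managed datasets on a filesystem connection
    for name in ["results_1", "results_2", "results_3"]:
        builder = project.new_managed_dataset(name)
        builder.with_store_into("filesystem_managed")  # connection name is an assumption
        builder.create(overwrite=True)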
  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
    So how do I create massive datasets like "results" without mentioning them in the recipe?
  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
    Yes @kenjil I would like to create massive datasets
  • kenjil Dataiker, Alpha Tester, Product Ideas Manager Posts: 19 Dataiker
This has nothing to do with the size of the dataset but with the number of datasets you want to create. There is no point using that API to create a single dataset, whatever its size.
  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
I want to create a large number of datasets. Is there any method to do this? Please note I have a COMMUNITY EDITION license for DSS.
  • kenjil Dataiker, Alpha Tester, Product Ideas Manager Posts: 19 Dataiker
    I'm sorry. The admin API is not available in DSS Free Edition.
  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
So there is no other way to create a large number of datasets with DSS Free Edition?
  • kenjil Dataiker, Alpha Tester, Product Ideas Manager Posts: 19 Dataiker
Note: if these datasets are linked to existing tables in a SQL connection, you can just mass-create datasets for these tables in the connection settings UI in the DSS administration.
  • N_JAYANTH Registered Posts: 11 ✭✭✭✭
What if my data files are CSV files? Is there a way to convert a large number of CSV files into the same number of Dataiku datasets? @kenjil
  • Pouya-ku Dataiker Alumni Posts: 2 ✭✭✭✭
You can write some Python code that reads your CSV files from a static path and then writes them individually into DSS.
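
    A rough sketch along those lines; the folder path and connection name are placeholders, and this assumes it runs inside DSS (e.g. in a notebook), where dataiku.api_client() is available:

    import glob
    import os
    import re
    import pandas as pd
    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    for path in glob.glob("/path/to/csvs/*.csv"):
        # Dataset names must be simple identifiers, so normalize the file name
        name = re.sub(r"[^A-Za-z0-9_]", "_", os.path.splitext(os.path.basename(path))[0])
        # Create the managed dataset (overwrites an existing one of the same name)
        builder = project.new_managed_dataset(name)
        builder.with_store_into("filesystem_managed")  # connection name is an assumption
        builder.create(overwrite=True)
        # Load the CSV and write it into the new dataset, schema included
        df = pd.read_csv(path)
        dataiku.Dataset(name, ignore_flow=True).write_with_schema(df)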
  • lfleck Registered Posts: 1 ✭✭✭✭
But this is exactly the question, I think. How can one create the "results" dataset using only the Python code inside the recipe? Or, in other words: how can a Python recipe add outputs to itself?
  • gblack686 Partner, Registered Posts: 62 Partner

    @N_JAYANTH
    Any luck in finding a solution?

  • ibn-mohey Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4
    Exception: None: b'dataset does not exist: EGMED.s22'

I know that this error happens because there is no "s22" dataset (the placeholder) yet, but my question is: can I create that placeholder automatically?

  • Brennan Registered Posts: 4 ✭✭✭

    import dataiku

    # EXPOSE CLIENT AND CURRENT PROJECT IN ORDER TO CREATE NEW DATASETS
    client = dataiku.api_client()
    project = client.get_default_project()

    # CREATE NEW DATASET -- RETURN DATAFRAME OF CREATED DATASET
    def createDataset(datasetName, schema_columns=None, data=None, ignoreFlow=True):
        builder = project.new_managed_dataset(datasetName)
        builder.with_store_into("filesystem_folders")
        new_dataset = builder.create(overwrite=True)  # WILL OVERWRITE AN EXISTING DATASET OF THE SAME NAME
        settings = new_dataset.get_settings()
        settings.set_csv_format()

        # Fall back to a two-column default schema when none is supplied
        if schema_columns is None:
            schema_columns = [
                {'name': 'Default Int', 'type': 'int'},
                {'name': 'Default String', 'type': 'string'},
            ]
        for column in schema_columns:
            settings.add_raw_schema_column(column)
        settings.save()
        columnCount = len(schema_columns)

        # Use the writer as a context manager so it is closed even if a write fails
        with dataiku.Dataset(datasetName).get_writer() as writer:
            if data is not None:
                for row in data:
                    # Truncate rows that are wider than the schema
                    writer.write_row_array(row[:columnCount])
            else:
                writer.write_row_array((0, "_"))

        # ignore_flow=True lets a recipe access a dataset that is not declared
        # among its inputs/outputs in the Flow; in a notebook it is not required
        return dataiku.Dataset(datasetName, ignore_flow=ignoreFlow).get_dataframe()


    myData = [
        [1, "blah", "aaaaaaaaaaaaa"],
        [2, "blah blah"],
        [3, "blah blah blah"]
    ]

    myColumns = [
        {'name': 'Integers Here', 'type': 'int'},
        {'name': 'super special column', 'type': 'string'}
    ]

    createDataset("A_Great_Name", myColumns, myData, False)

    https://developer.dataiku.com/latest/api-reference/python/recipes.html#dataikuapi.dss.recipe.DSSRecipeSettings.get_recipe_inputs

    [screenshot: recipe inputs / outputs]
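
    On the earlier question of a recipe adding outputs to itself: following that link, a speculative sketch using the raw recipe settings. The exact JSON structure may differ between DSS versions, "MY_RECIPE" and "results" are placeholders, and the "results" dataset must already exist in the project:

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    recipe = project.get_recipe("MY_RECIPE")
    settings = recipe.get_settings()

    # get_recipe_outputs() exposes the raw outputs structure of the recipe;
    # appending a ref under the "main" role registers an extra output dataset
    outputs = settings.get_recipe_outputs()
    outputs["main"]["items"].append({"ref": "results"})
    settings.save()

    Note that even then, a recipe run already in progress would presumably still need ignore_flow=True to write to the newly attached dataset.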
