Create SQL table for Dataset using python API

Options
Dima
Dima Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4

Using python API, I can create an SQL Dataset, or clear it using the DSSDataset.clear() method, but afterwards I have to manually click the "Create Table Now" in the settings tab of the dataset before using it in recipes. Is there a way to achieve the same effect as clicking the button using the python API?

I checked the dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper methods and haven't found anything. Also I thought about extracting the SQL query that Dataiku uses to create the SQL table from the dataikuapi.dss.dataset.DSSDataset object, but also couldn't find it.

Answers

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 317 Neuron
    Options

    Hi @Dima
    ,

    It would help to understand more about what you are trying to do.

    A managed dataset should have a recipe in front of it that would create and populate the underlying table. No need to create an empty table in advance - well, unless you want to append rows to the table. Is that what you want to do?

    Marlan

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,727 Neuron
    Options

    I agree with @Marlan
    , it's unclear what exactly you are are trying to achieve. Can you please let us know your actual requirement rather than the method you are trying to use to achieve it? There might be better ways of achieving what you want.

  • Dima
    Dima Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4
    Options
    1. My goal is to regularly process some data and write it to an SQL table.
    2. To do it faster, I use multiprocessing in python.
    3. To enable writing to Dataset in separate processes without the dataset being cleared in each process, I use append instead of overwrite option.
    4. But I want to append only during the recipe run, I don't want to accumulate data between runs, so I need to clear the dataset at the start of the recipe.
    5. Since the dataset is partitioned, when I run dataset.clear(partitions=partitions) on a dataset which hasn't created an SQL table yet, I get an error indicating the SQL relation doesn't exist.
    6. Since the dataset and recipe creation is done in python API, I also want to be able to create SQL table, after the Dataiku dataset is created, in python API.

    With help of Dataiku support, I came up with the following code:

    with dataset.get_writer():
    pass
    If I run this after dataset creation, it will also create the SQL table, so my problem is solved.
  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,727 Neuron
    Options

    Out of interest what’s your database technology?

  • Dima
    Dima Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4
    Options

    I'm writing to PostgreSQL 9.4.26 (Greenplum Database 6.24.3) database, if that's what you are asking.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,727 Neuron
    Options

    Yes. That’s what I was after. This all seems like that kludge solution mostly driven by using a database technology that is not built to handle large volumes of data. We recently implemented Databricks in our environment and the performance difference with a traditional relational database is mind blowing. To give an example a Sync recipe of a 1bn (yes billion) rows table with 31 columns takes less than 2 minutes to run. Most complex SQL take less than 30 seconds to run. And the Databricks SQL Warehouse is serverless and has auto scale meaning we don’t need to provision for peak and we can save on compute while we don’t use it. So yes your solution maybe working with an old relational technology which maybe the only one you have available to you. But still this doesn’t make it a good solution nor will it stand the test of time.

  • Dima
    Dima Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4
    Options

    I don't know much about Databricks. Is it a data store and compute service, kinda like Spark? If it is, what are the similarities and differences, pros and cons? Is it a configurable backend for Dataiku?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,727 Neuron
    Options

    Is it a data store and compute service, kinda like Spark? Yes, it is indeed a Spark engine combined with a data store. But the SQL Warehouse is a heavily tunned SparkSQL engine with additional custom technology on top. You also get a traditional Compute Cluster to execute PySpark.

    what are the similarities and differences, pros and cons? That's probably the subject of a pre-sales consultation with Databricks.

    Is it a configurable backend for Dataiku? Yes, natively supported since v11.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,727 Neuron
    Options
  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 317 Neuron
    Options

    Glad to hear that you have a solution. Thanks for sharing it. That's good to know about.

    Marlan

Setup Info
    Tags
      Help me…