Create SQL table for Dataset using python API

Dima · June 2024

Using python API, I can create an SQL Dataset, or clear it using the DSSDataset.clear() method, but afterwards I have to manually click the "Create Table Now" in the settings tab of the dataset before using it in recipes. Is there a way to achieve the same effect as clicking the button using the python API?

I checked the dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper methods and haven't found anything. Also I thought about extracting the SQL query that Dataiku uses to create the SQL table from the dataikuapi.dss.dataset.DSSDataset object, but also couldn't find it.

Marlan · June 2024

Hi @Dima
,

It would help to understand more about what you are trying to do.

A managed dataset should have a recipe in front of it that would create and populate the underlying table. No need to create an empty table in advance - well, unless you want to append rows to the table. Is that what you want to do?

Marlan

Turribeach · June 2024

I agree with @Marlan
, it's unclear what exactly you are are trying to achieve. Can you please let us know your actual requirement rather than the method you are trying to use to achieve it? There might be better ways of achieving what you want.

Dima · June 2024

My goal is to regularly process some data and write it to an SQL table.
To do it faster, I use multiprocessing in python.
To enable writing to Dataset in separate processes without the dataset being cleared in each process, I use append instead of overwrite option.
But I want to append only during the recipe run, I don't want to accumulate data between runs, so I need to clear the dataset at the start of the recipe.
Since the dataset is partitioned, when I run dataset.clear(partitions=partitions) on a dataset which hasn't created an SQL table yet, I get an error indicating the SQL relation doesn't exist.
Since the dataset and recipe creation is done in python API, I also want to be able to create SQL table, after the Dataiku dataset is created, in python API.

With help of Dataiku support, I came up with the following code:

with dataset.get_writer():

pass

If I run this after dataset creation, it will also create the SQL table, so my problem is solved.

Turribeach · June 2024

Out of interest what’s your database technology?

Dima · June 2024

I'm writing to PostgreSQL 9.4.26 (Greenplum Database 6.24.3) database, if that's what you are asking.

Turribeach · June 2024

Yes. That’s what I was after. This all seems like that kludge solution mostly driven by using a database technology that is not built to handle large volumes of data. We recently implemented Databricks in our environment and the performance difference with a traditional relational database is mind blowing. To give an example a Sync recipe of a 1bn (yes billion) rows table with 31 columns takes less than 2 minutes to run. Most complex SQL take less than 30 seconds to run. And the Databricks SQL Warehouse is serverless and has auto scale meaning we don’t need to provision for peak and we can save on compute while we don’t use it. So yes your solution maybe working with an old relational technology which maybe the only one you have available to you. But still this doesn’t make it a good solution nor will it stand the test of time.

Dima · June 2024

I don't know much about Databricks. Is it a data store and compute service, kinda like Spark? If it is, what are the similarities and differences, pros and cons? Is it a configurable backend for Dataiku?

Turribeach · June 2024

Is it a data store and compute service, kinda like Spark? Yes, it is indeed a Spark engine combined with a data store. But the SQL Warehouse is a heavily tunned SparkSQL engine with additional custom technology on top. You also get a traditional Compute Cluster to execute PySpark.

what are the similarities and differences, pros and cons? That's probably the subject of a pre-sales consultation with Databricks.

Is it a configurable backend for Dataiku? Yes, natively supported since v11.

Turribeach · June 2024

https://blog.dataiku.com/databricks-integration

Marlan · June 2024

Glad to hear that you have a solution. Thanks for sharing it. That's good to know about.

Marlan

Create SQL table for Dataset using python API

Answers

Categories

Setup Info

Tags