Using the Python API, I can create an SQL dataset or clear it with the DSSDataset.clear() method, but afterwards I have to manually click "Create Table Now" in the dataset's Settings tab before using it in recipes. Is there a way to achieve the same effect as clicking that button from the Python API?
I checked the dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper methods and haven't found anything. I also thought about extracting the SQL query that Dataiku uses to create the table from the dataikuapi.dss.dataset.DSSDataset object, but couldn't find that either.
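For context, here's roughly what I'm doing today (a sketch; the host, API key, project, and connection names are placeholders):

```python
import dataikuapi

# Placeholders -- substitute your instance URL, API key, project and connection
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Create a managed SQL dataset stored on a PostgreSQL connection
builder = project.new_managed_dataset("my_sql_dataset")
builder.with_store_into("my_postgres_connection")
dataset = builder.create()

# Clearing works fine via the API...
dataset.clear()
# ...but at this point the table doesn't exist yet; in the UI I still have
# to click "Create Table Now" before the dataset is usable in recipes.
```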
Hi @Dima,
It would help to understand more about what you are trying to do.
A managed dataset should have a recipe in front of it that creates and populates the underlying table, so there's no need to create an empty table in advance (see the sketch below) - unless you want to append rows to an existing table. Is that what you want to do?
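For example, a sketch of putting a sync recipe in front of a new managed dataset (all names are placeholders, and the connection must exist on your instance):

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Create a sync recipe whose output is a new managed dataset;
# the recipe's first run creates and populates the underlying table.
builder = project.new_recipe("sync")
builder.with_input("source_dataset")
builder.with_new_output("target_sql_dataset", "my_postgres_connection")
recipe = builder.create()

# Building the output dataset runs the recipe and creates the table
project.get_dataset("target_sql_dataset").build()
```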
Marlan
I agree with @Marlan, it's unclear what exactly you are trying to achieve. Can you please let us know your actual requirement rather than the method you are trying to use to achieve it? There might be better ways of getting what you want.
With the help of Dataiku support, I came up with the following code:
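The snippet below is the gist of it, not the exact code from support: it pulls the dataset's schema and connection through the public API, builds a CREATE TABLE statement from the schema, and runs that DDL over the connection. The DSS-to-PostgreSQL type mapping is a simplified illustration and the names are placeholders:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")
dataset = project.get_dataset("my_sql_dataset")

# Where the dataset lives: its connection and target table name.
# Note: the table name may contain DSS variables such as ${projectKey}
# that you would need to expand first.
params = dataset.get_settings().get_raw()["params"]
connection, table = params["connection"], params["table"]

# The dataset's schema, as defined in DSS
schema = dataset.get_schema()  # {"columns": [{"name": ..., "type": ...}, ...]}

# Simplified DSS-type -> PostgreSQL-type mapping; extend as needed
TYPE_MAP = {"string": "text", "int": "integer", "bigint": "bigint",
            "float": "real", "double": "double precision",
            "boolean": "boolean", "date": "timestamp"}

columns_ddl = ", ".join(
    '"{}" {}'.format(c["name"], TYPE_MAP.get(c["type"], "text"))
    for c in schema["columns"]
)
ddl = 'CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns_ddl)

# Run the DDL over the dataset's connection
query = client.sql_query(ddl, connection=connection)
query.verify()  # raises if the statement failed
```

After this runs, the dataset is usable in recipes without touching the Settings tab.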
Out of interest, what's your database technology?
I'm writing to PostgreSQL 9.4.26 (Greenplum Database 6.24.3) database, if that's what you are asking.
Yes, that's what I was after. This all seems like a kludge solution, mostly driven by using a database technology that is not built to handle large volumes of data. We recently implemented Databricks in our environment, and the performance difference versus a traditional relational database is mind-blowing. To give an example, a Sync recipe on a 1bn (yes, billion) row table with 31 columns takes less than 2 minutes to run, and most complex SQL queries take less than 30 seconds. And the Databricks SQL Warehouse is serverless and auto-scales, meaning we don't need to provision for peak and can save on compute while it's idle. So yes, your solution may work on an older relational technology, which may be the only one available to you, but that still doesn't make it a good solution, nor will it stand the test of time.
I don't know much about Databricks. Is it a data store and compute service, kinda like Spark? If it is, what are the similarities and differences, pros and cons? Is it a configurable backend for Dataiku?
Is it a data store and compute service, kinda like Spark? Yes, it is indeed a Spark engine combined with a data store, but the SQL Warehouse is a heavily tuned Spark SQL engine with additional custom technology on top. You also get a traditional compute cluster to execute PySpark.
what are the similarities and differences, pros and cons? That's probably the subject of a pre-sales consultation with Databricks.
Is it a configurable backend for Dataiku? Yes, natively supported since v11.
Glad to hear that you have a solution. Thanks for sharing it. That's good to know about.
Marlan