2 Features I desperately need added to Dataiku

- Appending to output dataset in python code recipes.
- Currently the only way to do this is a checkbox in the recipe settings labelled "Append instead of overwrite". However, this is limiting and does not handle potential schema changes well (this checkbox has caused significant data loss for my team in the past). I would like a method added to the Dataiku internal API such as mydataset.write_with_schema(mydf, append=True) or mydataset.write_dataframe(mydf, append=True), as sketched after this list. This would allow much more control over appending to output datasets.
- Schema settings for MS SQL datasets that are the outputs of SQL code recipes.
- Currently, schema detection for these datasets seems to happen regardless of which settings I change. There is no good way for me to specify the schema I want my dataset to follow without Dataiku auto-detecting the schema of my query and overriding my specified schema. I would like a toggle to disable that auto-detection.
- More flexibility in auto-detection of the length of NVARCHAR columns (I can only speak to how this works with MS SQL datasets). Currently, the default in Dataiku seems to be to set the character limit of any NVARCHAR column to the maximum length of a string currently in that column. For example, say I have a column named "EmailAddress" in my query, with the longest value being "really_long_name@really_long_company_name.com". Dataiku will set the dtype of the "EmailAddress" column to NVARCHAR(45). But say this query is part of an automated flow or scenario, and eventually we come across an EmailAddress that is 46 characters long. This will hit a schema/data truncation error and break the scenario, causing the automation to fail. This is a huge problem for my company and has caused many crucial automations to fail. I propose an alternative that grants a little more flexibility to NVARCHAR columns: make the default 1.5x or 2x the current maximum string length in the query output. So the EmailAddress column would be of dtype NVARCHAR(68), allowing some headroom when longer strings are encountered in the future.
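
For the appending request, here is a minimal sketch of what the proposed call could look like in a Python recipe. write_with_schema and write_dataframe are the existing writer methods; the append=True argument is the hypothetical addition (it does not exist in the current API) and the dataset name is a placeholder:

```python
import dataiku
import pandas as pd

output = dataiku.Dataset("my_output_dataset")  # placeholder output dataset name
new_rows = pd.DataFrame({"EmailAddress": ["someone@example.com"]})

# Proposed (hypothetical) usage: append in code rather than via the
# "Append instead of overwrite" checkbox in the recipe settings.
output.write_with_schema(new_rows, append=True)
# or, without schema inference:
# output.write_dataframe(new_rows, append=True)
```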
Operating system used: Linux
Answers
-
Turribeach
Hi, for feature requests use the Product Ideas forum.
You have 3 features there not 2 😉
1a. Upgrade to v13.1: DSS will now by default refuse to drop SQL tables for managed datasets when the parent recipe is in append mode; in case of a schema mismatch, the recipe now fails. This behavior can be reverted in the advanced settings of the output dataset. Then set this on your append datasets: Advanced ⇒ SQL schema inconsistency ⇒ Prevent data drop in append mode. This will prevent the dataset from being dropped and the historical data lost, but you will need to sort out the schema changes yourself. Your idea doesn't really work in the real world since not all data technologies allow data to be appended, so use the new 13.1 setting.
2a. SQL Code recipes execute in-database, so Dataiku needs to match the recipe output schema with the output dataset schema; otherwise there could be data type errors. I don't really see the problem here. If you want to change your types, cast the columns in the resulting SQL to whatever type you want and Dataiku will match that in the output. Alternatively, if you want full control of the schema, use a SQL script recipe. See here for the differences.
2b. Instead of using dataset.write_with_schema() you can use dataset.write_dataframe(infer_schema=False) and have full control of the schema using dataset.write_schema() or dataset.write_schema_from_dataframe().
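
For what it's worth, a minimal sketch of that pattern in a Python recipe, assuming an output dataset named my_output_dataset; the maxLength key is my assumption for how to pin the NVARCHAR length on a SQL connection, so check the schema format for your DSS version:

```python
import dataiku
import pandas as pd

output = dataiku.Dataset("my_output_dataset")  # placeholder output dataset name
df = pd.DataFrame({"EmailAddress": ["someone@example.com"]})

# Declare the schema explicitly instead of letting DSS infer it from the dataframe.
# ("maxLength" is assumed here to drive the NVARCHAR length on SQL connections.)
output.write_schema([
    {"name": "EmailAddress", "type": "string", "maxLength": 255},
])

# Write the data without re-inferring (and overriding) the declared schema.
output.write_dataframe(df, infer_schema=False)
```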
-
Hi Turribeach, thanks for your response.
1a. That's awesome! I'll see what we can do to upgrade to that version.
2a. Understood. I guess I'm then wondering what the purpose of the "Schema" tab under the settings of an MS SQL dataset is. Why let me modify the schema at all if my changes will always be overridden by the SQL query recipe that creates the dataset? I get what you are saying about using a SQL script recipe if I want full control over the schema; it's just confusing that the settings tab gives the impression of schema control when it doesn't really do much.
2b. I was thinking of a SQL dataset that is the output of a SQL query recipe, not a Python recipe, so write_dataframe doesn't apply. How would you suggest I solve the issue of the max NVARCHAR length in a SQL query recipe?
-
Turribeach
2a. The schema tab is there because DSS reads data from dozens of technologies, and not all of them are strongly typed like RDBMS databases. It also shows you which DSS types the data has been mapped to, which is obviously important. Finally, you can indeed change the types manually, but if they don't match the actual data types they will be reverted.
2b. You can use the SQL Executor to run custom queries and return the results as a dataframe to be used in Python. Then you can customise the schema. But it will obviously be slower, since DSS will need to bring all the data into memory, so it won't be pushed down to the database like a SQL query recipe would be.
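
A minimal sketch of that workaround, assuming a Python recipe, an MS SQL connection named my_mssql_connection, and an output dataset whose schema you declare yourself as in 2b above (connection, table, and dataset names are placeholders):

```python
import dataiku
from dataiku import SQLExecutor2

# Run the query in the database and pull the result into memory as a dataframe.
executor = SQLExecutor2(connection="my_mssql_connection")
df = executor.query_to_df("SELECT EmailAddress FROM my_schema.my_table")

# The output schema is now under your control (same pattern as 2b above).
output = dataiku.Dataset("my_output_dataset")
output.write_schema([
    {"name": "EmailAddress", "type": "string", "maxLength": 255},  # maxLength assumed
])
output.write_dataframe(df, infer_schema=False)
```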
-
Hi Turribeach, thanks again!
2a. I feel like you aren't hearing my issue here. I'm not asking why the schema tab exists; I'm asking why the fields are editable. Why can you change the types manually if doing so doesn't do anything? Don't you see how that could be confusing for a user?
2b. I feel like you aren't hearing my issue here. I'm not asking for a workaround so that my NVARCHAR values are never cut off. I'm asking whether it's possible to change the default behavior of NVARCHAR schema detection in SQL, since I think many people would agree that setting the limit to the current max string length in the column is not very practical.