Operating system used: Linux
Hi, for feature requests use the Product Ideas forum.
You have 3 features there, not 2 😉
1a. Upgrade to v13.1: by default, DSS will now refuse to drop SQL tables for managed datasets when the parent recipe is in append mode; in case of a schema mismatch, the recipe fails instead. This behavior can be reverted in the advanced settings of the output dataset. Also set this on your append datasets: Advanced ⇒ SQL schema inconsistency ⇒ Prevent data drop in append mode. This prevents the dataset from being dropped and the historical data from being lost, but you will need to sort out the schema changes yourself. Your idea doesn't really work in the real world, since not all data technologies allow data to be appended, so use the new 13.1 setting instead.
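If you want to check which of your append datasets already have this enabled, here is a rough sketch using the public Python API (the dataset name is a placeholder, and I'm deliberately not hard-coding the internal key behind that checkbox, so locate it in the raw settings before changing anything):

```
import dataiku

# Connect to the instance (from inside DSS)
client = dataiku.api_client()
project = client.get_default_project()

# Placeholder name for one of your append-mode output datasets
dataset = project.get_dataset("my_append_output")
settings = dataset.get_settings()
raw = settings.get_raw()

# Print the dataset params and look for the flag behind
# "Prevent data drop in append mode"
print(raw.get("params", {}))

# Once you have confirmed the key, set it and persist the change, e.g.:
# raw["params"]["<key you found above>"] = True
# settings.save()
```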
2a. SQL code recipes execute in-database, so Dataiku needs to match the recipe output schema with the output dataset schema, otherwise there could be data type errors. I don't really see the problem here: if you want to change your types, cast the columns in the resulting SQL to whatever type you want and Dataiku will match that in the output. Alternatively, if you want full control of the schema, use a SQL script recipe. See here for the differences.
2b. Instead of using dataset.write_with_schema(), you can use dataset.write_dataframe(df, infer_schema=False) and take full control of the schema with dataset.write_schema() or dataset.write_schema_from_dataframe().
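As a rough sketch for a Python recipe (dataset names, column names and the maxLength value are just examples you'd adapt; maxLength on a string column is what I'd expect to drive the NVARCHAR length on a SQL output, but do verify that on your connection):

```
import dataiku

# Placeholder dataset names
input_ds = dataiku.Dataset("my_input")
output_ds = dataiku.Dataset("my_output")

df = input_ds.get_dataframe()

# Declare the output schema explicitly instead of letting DSS infer it
output_ds.write_schema([
    {"name": "customer_id", "type": "bigint"},
    {"name": "comment", "type": "string", "maxLength": 500},
])

# Write the rows without re-inferring (and overwriting) the schema just set
output_ds.write_dataframe(df, infer_schema=False)
```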
Hi Turribeach, thanks for your response.
1a. That's awesome! I'll see what we can do to upgrade to that version.
2a. Understood. I guess I'm then wondering what the purpose of the "schema" tab under the settings of an MS SQL dataset is. Why even let me modify the schema if these changes will always be overridden by the SQL query recipe that creates the dataset? I get what you are saying about using a SQL script recipe if I want full control over the schema; it's just confusing that the settings tab gives the impression of schema control when it doesn't really do much.
2b. I was thinking of a SQL dataset that is the output of a SQL query recipe, not a Python recipe, so write_dataframe doesn't apply. How would you suggest I solve the issue of the max NVARCHAR length in a SQL query recipe?
2a. The schema tab is there because DSS reads data from dozens of technologies, and not all of them are strongly typed like RDBMS databases are. It also shows you which DSS types the data has been mapped to, which is obviously very important. Finally, you can indeed change the types manually, but if they don't match the actual data types they will be reverted.
2b. You can use the SQL Executor to run custom queries and return the result as a dataframe to be used in Python. Then you can customise the schema. But it will obviously be slower, as DSS needs to bring all the data into memory, so it won't be pushed down to the database the way a SQL query recipe is.
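Something along these lines (a minimal sketch; the connection name, table, columns and casts are made up for illustration):

```
import dataiku
from dataiku import SQLExecutor2

# Placeholder connection name; point this at your MS SQL connection
executor = SQLExecutor2(connection="my_mssql_connection")

# Cast the columns in the query itself so the dataframe comes back
# with the types/lengths you want (table and columns are examples)
query = """
    SELECT customer_id,
           CAST(comment AS NVARCHAR(500)) AS comment
    FROM my_table
"""
df = executor.query_to_df(query)

# Then write it out with an explicit schema, as in the earlier sketch
output_ds = dataiku.Dataset("my_output")
output_ds.write_schema([
    {"name": "customer_id", "type": "bigint"},
    {"name": "comment", "type": "string", "maxLength": 500},
])
output_ds.write_dataframe(df, infer_schema=False)
```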
Hi Turribeach, thanks again!
2a. I feel like you aren't hearing my issue here. I'm not asking why the schema tab exists, but rather why the fields are editable. Why can you change the types manually if it doesn't do anything? Don't you see how that could be confusing for a user?
2b. I feel like you aren't hearing my issue here. I'm not asking for a workaround so that my NVARCHAR values are never cut off. I'm asking if it's possible to change the default NVARCHAR schema processing behaviour in SQL, since I think many people would agree that setting it to the max string length found in the column is not very practical.