My team ran into an issue with computing metrics for a partitioned dataset. We created a dataset, and under "Explore" there is a "Sample" button, see below.
After pressing "Sample" and waiting about 30 seconds, we decided it was taking too long and pressed "Abort", but the operation never aborted. Instead, it started counting rows in the different partitions and continued to do so. The metrics being computed were visible under "DSS Settings > Running background tasks > Long tasks", see the image below.
We managed to stop it by deleting the user's Snowflake credentials.
Is there a way to stop the partition counting when pressing "Abort" after pressing "Sample"?
We are using DSS version 12.1.2 running on Amazon Linux.
Hi, it is not always possible to cancel the execution of a SQL statement. It depends on the JDBC driver, the database technology, the statement you are running, etc. Sometimes aborting a statement requires a rollback, which means you need to wait for that to happen before the statement is fully aborted. I have not seen this issue with other databases, so I presume it's related to Snowflake only. If you can reproduce the issue, I would suggest you raise it with Dataiku Support, but as I said, this may be an issue outside of DSS' control. I would also be careful with the options you select in your sample screen, in particular the sampling method. If it was hitting every partition, then I presume you selected a sampling method which requires a full scan of the data. Check the sampling method drop-down, as it explains which options require full scans.
In this case there is not a single SQL statement; there are ~2000 individual SQL statements (one for each partition, I think) that are run sequentially, each apparently calculating something for its partition.
So there is no need to cancel any SQL statement; it would be enough to stop issuing new SQL statements once the operation is aborted. Unfortunately, Dataiku keeps issuing new SQL statements (all related to the same operation) for hours after aborting.
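To illustrate the behavior being requested, here is a minimal sketch (not DSS code; the function and names are hypothetical) of how a sequential per-partition count loop can honor an abort signal at the next statement boundary, so no in-flight SQL statement ever needs to be cancelled:

```python
import threading

def count_partitions(partitions, run_query, abort_event):
    """Issue one count query per partition, stopping at the next
    statement boundary once abort_event is set."""
    counts = {}
    for part in partitions:
        if abort_event.is_set():
            break  # stop issuing new statements; no SQL cancel needed
        counts[part] = run_query(
            f"SELECT COUNT(*) FROM t WHERE part = '{part}'"
        )
    return counts

# Simulated usage: the "user" presses Abort while the third query runs
abort = threading.Event()
issued = []

def fake_query(sql):
    issued.append(sql)
    if len(issued) == 3:
        abort.set()  # abort requested mid-run
    return 1

result = count_partitions([f"p{i}" for i in range(2000)], fake_query, abort)
# Only 3 of the 2000 statements are ever issued.
```

The key point is that the abort flag is checked before each new statement, so the loop stops within one statement's runtime rather than continuing through all ~2000 partitions.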
This may be because a user configured or requested the Records count metric to run against the whole dataset / all partitions, which is never a good idea for partitioned datasets (go to Status => Edit tab => Records count to check the current settings).