[Samsung Fire & Marine] Need to improve the performance of Join and Group recipes

Samsung Fire & Marine Insurance has been using a statistical analysis tool called SAS for the past several years.
I am now trying to replace SAS with Dataiku, but there is one major obstacle: the performance of the Join and Group recipes.
In our tests, a join on about 10 million records takes 48 seconds in SAS but 12 minutes and 30 seconds in Dataiku (roughly 16 times slower). For Group recipes, SAS takes 11 seconds, while Dataiku takes 1 minute and 16 seconds (roughly 7 times slower).
Join and Group By operations are performed frequently on every dataset during data analysis, so this performance gap is a major obstacle to switching tools.
In data analysis, analytical performance is of the utmost importance, so I would like the Dataiku R&D department to treat this as a top priority and develop improvements accordingly.
Please reflect this in an upcoming update so that the performance of Dataiku's Join and Group recipes improves.
We hope that Dataiku will consider this matter as important.
We look forward to your positive feedback.
Thank you.
Comments
Turribeach (Neuron, Posts: 2,252)
You are comparing apples to oranges and giving no context. SAS is an in-memory tool, which is why it can perform joins quickly, assuming the data can be loaded into memory. But like all in-memory tools, it will struggle with large data volumes, such as hundreds of millions of rows or hundreds of gigabytes, because it can't easily scale up. The other thing to consider is that you don't say what your timings are based on. Is this using the DSS engine? What's the size of your DSS VM? Are you using any sort of SQL backend? In most cases Dataiku can push the compute down to the data layer, so if you want faster joins you should look at better, newer, scalable data engines like Databricks, Snowflake, BigQuery, etc. These data engines can handle billions of rows and terabytes of data, something SAS can only dream of.
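
To make "pushing the compute down to the data layer" a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for a real SQL backend, the tables and columns (customers, claims, region, amount) are hypothetical, and none of this is Dataiku's API. It contrasts a join plus group-by executed inside the database engine with the same computation done by pulling every row into the client.

```python
import sqlite3


def build_demo_db(n_customers=1000, claims_per_customer=5):
    """Create a small in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
    )
    conn.execute(
        "CREATE TABLE claims ("
        "claim_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    regions = ["north", "south", "east", "west"]
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, regions[i % len(regions)]) for i in range(n_customers)],
    )
    conn.executemany(
        "INSERT INTO claims (customer_id, amount) VALUES (?, ?)",
        [
            (i % n_customers, float(i % 100))
            for i in range(n_customers * claims_per_customer)
        ],
    )
    # An index on the join key lets the engine avoid scanning claims repeatedly.
    conn.execute("CREATE INDEX idx_claims_customer ON claims (customer_id)")
    conn.commit()
    return conn


def totals_pushed_down(conn):
    """Join and group-by run inside the database; only the summary comes back."""
    rows = conn.execute(
        "SELECT c.region, SUM(cl.amount) "
        "FROM claims AS cl "
        "JOIN customers AS c ON c.customer_id = cl.customer_id "
        "GROUP BY c.region"
    ).fetchall()
    return dict(rows)


def totals_client_side(conn):
    """Same result, but every row is fetched and processed in Python."""
    region_of = dict(conn.execute("SELECT customer_id, region FROM customers"))
    totals = {}
    for customer_id, amount in conn.execute(
        "SELECT customer_id, amount FROM claims"
    ):
        region = region_of[customer_id]
        totals[region] = totals.get(region, 0.0) + amount
    return totals
```

Both functions return the same totals per region. The difference is where the work happens: in the first, only one summary row per region leaves the engine, while in the second every record crosses into the client first. With a scalable backend, that is the difference that tends to matter at tens of millions of rows.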