Performance issue using 'Group by' and 'Join' with CSV file

tae hong
tae hong Registered Posts: 1

Hi, I have migrated from SAS to Dataiku, but I am having some performance issues.

Is there a way to solve the performance issues when using Dataiku 'group by' and 'join' recipes on CSV files?

Answers

  • LucOBrien
    LucOBrien Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 20 ✭✭✭

    Can you provide any additional information to help diagnose the issue? Row and column count in your CSV, for example?

    For files like CSVs, Dataiku reads the entire file into memory and then performs the operations from there. It could be a file size issue related to running out of memory, or it could be the result of choosing some complex parameters in the plugins.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,248 Neuron

    You are comparing apples to oranges and giving no context. SAS is an in-memory tool, which is why it can perform joins fast, assuming the data can be loaded into memory. But like all in-memory tools, it will struggle with large data volumes (hundreds of millions of rows, hundreds of gigabytes) because it can't easily scale up.

    The other thing to consider is that you don't really say what you are using aside from a CSV. Is this using the DSS engine? What's the size of your DSS VM? Are you using any sort of SQL backend? Dataiku file system datasets are never going to be faster than in-memory tools. In most cases Dataiku can push the compute down to the data layer, so if you want faster joins you should look at better, newer, scalable data engines like Databricks, Snowflake, BigQuery, etc., which you can easily use from Dataiku. These data engines can handle billions of rows and terabytes of data, something SAS can only dream of. So perhaps the problem is your assumption that Dataiku alone can replace a tool like SAS.

  • John_wilson
    John_wilson Registered Posts: 3 ✭✭

    Hi @tae hong,

    You should sync the data to a database first, then use the join and group by recipes. This is best practice for optimal performance.


    Our company L3 analytics is a Dataiku certified partner. We focus on providing platform support and administration for Dataiku. If you need assistance implementing and supporting the platform, please reach out to us at info@l3-analyticsinc.com / john@l3-analyticsinc.com.
    Thank you.
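
To make the advice in the answers above a bit more concrete (load the CSV data into a database first, then let the database do the join and group by), here is a minimal sketch in plain Python. It is not Dataiku code and does not use the Dataiku API: SQLite stands in for whatever SQL backend you would actually sync to, and the file names, table names, and columns (`orders`, `customers`, `customer_id`, `amount`, `region`) are made up for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv(conn, table, path):
    """Create `table` from a CSV file with a header row and bulk-insert its rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # The reader is consumed row by row, so the file is streamed into the table.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


def join_and_group(orders_csv, customers_csv):
    """Load both CSVs into SQLite, then run the join and group by as one SQL query."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real SQL backend
    load_csv(conn, "orders", orders_csv)
    load_csv(conn, "customers", customers_csv)
    # An index on the join key lets the database avoid repeated full scans.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    rows = conn.execute(
        """
        SELECT c.region,
               COUNT(*) AS n_orders,
               SUM(CAST(o.amount AS REAL)) AS total
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    conn.close()
    return rows


def demo():
    """Write two tiny hypothetical CSV files and run the query on them."""
    with tempfile.TemporaryDirectory() as tmp:
        orders = Path(tmp) / "orders.csv"
        customers = Path(tmp) / "customers.csv"
        orders.write_text("order_id,customer_id,amount\n1,a,10.5\n2,b,4.0\n3,a,2.5\n")
        customers.write_text("customer_id,region\na,EU\nb,US\n")
        return join_and_group(orders, customers)


if __name__ == "__main__":
    print(demo())
```

In Dataiku itself the equivalent would be a Sync recipe into a SQL dataset followed by Join and Group recipes running on that connection; the sketch only shows why that helps: once the rows are in a database, the join and the aggregation are a single query that the database engine plans and executes.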
