I am quite new to Dataiku on AWS, so I have some basic questions. I need to process series of data with 40M to 80M rows (mainly joins and SUM/COUNT aggregations by customer), ending with an ML model. I would like to know the best practice for processing the data: (1) use Redshift SQL (some people say that because Redshift is a column-oriented database this can be slow, but on the other hand the load is managed by the Redshift engine), or (2) use Pandas dataframes backed by S3 or Redshift storage (here it seems that the read/write operations can be slow). My configuration is 32 GB of RAM, 4 vCPUs, and 50 GB of disk. Any advice? Thanks
It is difficult to provide guidance without more details about your exact use case and what you hope to accomplish, as there are potentially many factors to consider. That being said, if all your data sources and datasets will be stored in Redshift, then you may want to consider using the in-database engine for the recipes in your Flow, which will let you leverage Redshift's native in-database processing capabilities for various SQL and data transformation operations. As for which option would be more performant, that is hard to say without running some benchmark tests yourself to compare the two.
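To illustrate why pushing the work into the database often wins for SUM/COUNT-by-customer workloads: only the small aggregate crosses the wire, not the 40-80M raw rows. Here is a minimal, hedged sketch of that pattern; the table and column names are made up, and an in-memory SQLite database stands in for Redshift so the example is self-contained.

```python
import sqlite3

import pandas as pd

# Stand-in for a Redshift connection; table/column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# The database engine does the join/aggregation work; pandas only receives
# the (much smaller) per-customer result set.
agg = pd.read_sql(
    "SELECT customer_id, SUM(amount) AS total, COUNT(*) AS n_orders "
    "FROM orders GROUP BY customer_id",
    conn,
)
print(agg)
```

With the in-database engine, a DSS visual recipe generates SQL of this shape for you, so you get the same effect without writing the query by hand.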
Regarding your second option, one thing to consider is that we do provide a "fast path" to quickly sync data from Redshift to S3 and vice versa. If you find that the Redshift engine is too slow for your needs, you could sync all your initial Redshift input datasets into S3 using this "fast path" at the beginning of your Flow, do all the data cleaning/prep/transformation work on your S3 datasets (using either the DSS engine or the Spark engine, if you have Spark enabled), and then re-sync your final S3 output datasets back to Redshift (again using the "fast path"). More information about this "fast path" can be found in our documentation here:
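If you do end up doing part of the prep in pandas, one caution for your 32 GB machine: avoid materializing all 40-80M rows at once. A common workaround is to aggregate in chunks and then combine the partial results. Below is a hedged sketch of that pattern; the column names are assumptions, and a small in-memory CSV stands in for a large file exported to S3.

```python
import io

import pandas as pd

# Stand-in for a large CSV synced to S3; a real flow would stream from S3.
csv_data = io.StringIO("customer_id,amount\n1,10.0\n1,5.0\n2,7.5\n2,2.5\n")

# Read in fixed-size chunks so memory usage stays bounded regardless of
# total row count (chunksize would be in the millions for real data).
partials = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    partials.append(
        chunk.groupby("customer_id")["amount"].agg(["sum", "count"])
    )

# SUM and COUNT are both additive, so per-chunk partials can simply be
# summed to get the final per-customer totals.
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```

Note this trick only works for additive aggregates like SUM and COUNT; something like a median cannot be combined from chunk partials this way, which is another argument for letting Redshift or Spark handle the heavy lifting.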
I hope that this helps!