Relationship between Dataiku partitioning and S3 partitioning
Hi All,
Suppose I have a dataset in S3 that is partitioned and stored as Parquet. Suppose I read it into Dataiku as a partitioned dataset (or try to sync it from somewhere else). What is the relationship between the two partitioning schemes? If I use the same partition key in Dataiku as in S3, will Dataiku recognize that and avoid repartitioning the dataset?
I recognize that I am conflating two possibly unrelated concepts (partitioning in S3 vs partitioning in Dataiku). For context, I am working with terabytes of data that are already partitioned in S3, and I want to avoid having to repartition them via Dataiku.
Thanks,
Yash
Answers
Turribeach
I think the question you are asking is not the correct one. As you have identified, partitioning in Dataiku is a completely different thing from partitioning in a database or an object store, so there is usually no correlation between the two. In fact, in most cases users will use Dataiku partitioning without any actual partitioning in the storage layer.
What you should really be asking is whether adding Dataiku partitioning adds value to your project. For that, I suggest you read the documentation to understand how partitioning works in Dataiku and when it is worth adding:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/index.html
As a rule of thumb, Dataiku partitioning is useful when you want to compute the partitions independently because recalculating the whole dataset is either too slow or not practical.
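To make that rule of thumb concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API, and the names (`group_by_partition`, `recompute`, `total_amount`, the `day` key) are hypothetical and chosen only for illustration. It shows the general idea of recomputing a single partition while leaving the results for the others untouched.

```python
from collections import defaultdict


def group_by_partition(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def recompute(partitions, results, changed, compute):
    """Return a copy of `results` with only the `changed` partitions recomputed."""
    updated = dict(results)
    for partition_id in changed:
        updated[partition_id] = compute(partitions[partition_id])
    return updated


def total_amount(rows):
    """Example per-partition computation: sum the `amount` field."""
    return sum(row["amount"] for row in rows)


rows = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": 5},
    {"day": "2024-01-02", "amount": 7},
]

# Initial build: every partition is computed once.
partitions = group_by_partition(rows, "day")
results = {pid: total_amount(part) for pid, part in partitions.items()}

# New data arrives for one day only, so only that partition is recomputed.
partitions["2024-01-02"].append({"day": "2024-01-02", "amount": 3})
results = recompute(partitions, results, ["2024-01-02"], total_amount)
```

If a full rebuild of the dataset is already fast enough, this kind of per-partition bookkeeping probably adds complexity without much benefit.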