


Hi there,

I have a problem aggregating a big data file stored on S3.

The data is stored like this: /2022-12-21T00/B00/part-00004-d89b41ad-3d2b-4350-9880-c5f1dfbdbea6.c000.csv.gz. The T00 stands for the hour, and B00 is a group that always contains the same subjects.

What I am trying to achieve is to aggregate the B00 group by day. I have tried to partition in different ways, but Python keeps crashing because the data is too big to fit in memory.

I have tried to partition like this: /%Y-%M-%DT.* and like this: /%Y-%M-%DT%H/%{dimension_2}/.*, but it is still very slow. I use iter_dataframes in Python to iterate over the partitions.
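
For context, here is roughly what my Python loop looks like (a simplified sketch; the dataset and column names are placeholders for my real schema):

    import dataiku
    import pandas as pd

    ds = dataiku.Dataset("my_s3_dataset")  # placeholder name

    partials = []
    # Aggregate each chunk separately so the full data never sits in memory at once
    for chunk in ds.iter_dataframes(chunksize=200000):
        partials.append(chunk.groupby(["day", "group"], as_index=False)["value"].sum())

    # Combine the per-chunk partial sums into the final daily aggregate
    result = pd.concat(partials).groupby(["day", "group"], as_index=False)["value"].sum()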

Please advise.

Best regards,
Michael

2 Replies

Have you tried leveraging Spark in this case? If the dataset is large, Spark can handle the volume with Spark partitions (not DSS partitions) and built-in parallelism.

You can use SparkSQL, PySpark, or a Group By recipe with the Spark engine.
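
As a rough sketch, a PySpark recipe could look like this (the dataset names and the day/group/value columns below are placeholders, assuming the day is already a column — adjust to your schema):

    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input as a Spark dataframe (dataset name is a placeholder)
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("s3_raw_data"))

    # Aggregate per day and group; Spark distributes the work across its partitions
    agg = df.groupBy("day", "group").agg(F.sum("value").alias("value_sum"))

    # Write the result back (output dataset name is a placeholder)
    dkuspark.write_with_schema(dataiku.Dataset("daily_aggregates"), agg)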



Hello @hille499,

In case you have to manage your partitioning by day without Spark:

Why not create a second partition dimension, with a period longer than a day higher up in the path?

Considering you have:


T00 stands for the hour

B00 is a group, i.e. a discrete variable

So create main partitions by year or month above, to divide the weight, or do it with discrete variables for the hours/groups below:

%{yourdiscretvar}/%Y-%M-%D/%{T00/B00}/.*
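
With such a scheme you can then read one slice at a time from Python, something like this (a sketch; the dataset name is a placeholder and the exact partition spec depends on your dimension order):

    import dataiku

    ds = dataiku.Dataset("my_partitioned_dataset")  # placeholder name
    # Restrict the read to a single partition; DSS joins the values of
    # multiple dimensions with "|" in the partition identifier
    ds.add_read_partitions("2022-12|groupA")
    df = ds.get_dataframe()  # now bounded to one month/group slice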


