Hi there,

I have a problem aggregating large data files stored on S3.

The data is stored like this: /2022-12-21T00/B00/part-00004-d89b41ad-3d2b-4350-9880-c5f1dfbdbea6.c000.csv.gz, where T00 stands for the hour and B00 is a group that always contains the same subjects.

What I am trying to achieve is to aggregate the B00 group by day. I have tried partitioning in different ways, but Python keeps crashing because the data is too big to fit in memory.

I have tried partitioning this way: /%Y-%M-%DT.* and like so: %Y-%M-%DT%H/%{dimension_2}/*, but it is still very slow. I use iter_dataframes in Python to iterate over the partitions.

Please advise.

Best regards, Michael

Have you tried leveraging Spark in this case? If the dataset is large, Spark can split the work across its own partitions (Spark partitions, not DSS partitions) and process them with built-in parallelism.

You can use SparkSQL, PySpark, or a Group By recipe with the Spark engine.
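
For the PySpark route, a rough sketch could look like the one below. It is only an illustration under assumptions: the root location, the CSV header, and the column names "subject" and "value" are hypothetical and not taken from the original post, and inside DSS you would normally obtain the DataFrame through the recipe's own inputs rather than a raw path.

```python
def daily_group_glob(root, day, group):
    """Build a glob matching every hourly folder of one day for one group."""
    # Example: <root>/2022-12-21T*/B00/*.csv.gz covers all hours of that day for B00.
    return f"{root.rstrip('/')}/{day}T*/{group}/*.csv.gz"


def aggregate_day(spark, root, day, group, out_path):
    """Aggregate one day of one group with Spark and write the result as Parquet."""
    # Imported lazily so the path helper above can be used without PySpark installed.
    from pyspark.sql import functions as F

    # Assumes the CSV files have a header row (hypothetical).
    df = spark.read.csv(daily_group_glob(root, day, group), header=True, inferSchema=True)
    # "subject" and "value" are hypothetical column names; replace with your own.
    daily = df.groupBy("subject").agg(F.sum("value").alias("total"))
    daily.write.mode("overwrite").parquet(out_path)
```

The point of the glob is that Spark reads all hourly folders of a day in one pass and distributes the rows itself, so nothing has to fit in the memory of a single Python process.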


Hello @hille499,

In case you have to manage your partitions by day without Spark:

Why not add a second partitioning dimension covering a period longer than a day, placed above the day in the path?

Considering you have:


T00 stands for the hour

B00 is a group, i.e. a discrete variable

You could create main partitions by year or month at the top of the path to divide the load, or do it with discrete dimensions for the hours/groups, as below:

%{yourdiscretevar}/%Y-%M-%D/%{T00/B00}/.*
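
If Spark is not an option, the aggregation itself can also be done incrementally, so that only the running totals stay in memory instead of the whole day of data. Below is a minimal sketch using only the Python standard library. It assumes the files are reachable as local paths, that each CSV has a header, and that the columns are called "subject" and "value"; all of these are hypothetical and need to be adapted to your data.

```python
import csv
import gzip
import re
from collections import defaultdict

# Mirrors the layout <day>T<hour>/<group>/<file>, e.g. 2022-12-21T00/B00/part-....csv.gz
PATH_RE = re.compile(
    r"(?P<day>\d{4}-\d{2}-\d{2})T(?P<hour>\d{2})/(?P<group>[^/]+)/[^/]+$"
)


def day_and_group(path):
    """Return (day, group) parsed from a file path, or None if it does not match."""
    m = PATH_RE.search(path)
    return (m.group("day"), m.group("group")) if m else None


def aggregate_files(paths, key_col="subject", value_col="value"):
    """Stream gzipped CSV files row by row and sum value_col per (day, group, key)."""
    totals = defaultdict(float)
    for path in paths:
        parsed = day_and_group(path)
        if parsed is None:
            continue  # skip files outside the expected folder layout
        day, group = parsed
        with gzip.open(path, "rt", newline="") as fh:
            for row in csv.DictReader(fh):
                totals[(day, group, row[key_col])] += float(row[value_col])
    return dict(totals)
```

Memory use here grows with the number of distinct (day, group, subject) keys rather than with the number of rows. The same idea applies to iter_dataframes: aggregate each chunk as it arrives and combine the partial results, instead of collecting the chunks first.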



