Partitioning
Hi there,
I have a problem aggregating a large dataset stored on S3.
The data is laid out like this: /2022-12-21T00/B00/part-00004-d89b41ad-3d2b-4350-9880-c5f1dfbdbea6.c000.csv.gz, where T00 stands for the hour and B00 is a group that always contains the same subjects.
What I am trying to achieve is to aggregate the B00 group by day. I have tried to partition in different ways, but I keep crashing in Python because the data is too big to fit in memory.
I have tried to partition like this: /%Y-%M-%DT.* and like this: /%Y-%M-%DT%H/%{dimension_2}/*, but it is still very slow. I use iter_dataframes in Python to iterate over the partitions.
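For reference, this is roughly what my loop looks like (a simplified sketch; the dataset name, column names, and aggregation are placeholders rather than my real project):

import dataiku
import pandas as pd

# Placeholder dataset name; in the real project this points at the partitioned S3 dataset.
dataset = dataiku.Dataset("raw_events")

chunks = []
for chunk in dataset.iter_dataframes(chunksize=200000):
    chunks.append(chunk)

# Concatenating everything into one frame is where memory runs out on big days.
full = pd.concat(chunks, ignore_index=True)
daily = full.groupby(["day", "subject_id"], as_index=False)["value"].sum()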
Please advise.
Best regards,
Michael
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,237 Dataiker
Hi,
Have you tried leveraging Spark in this case? If the dataset is large, Spark can handle it with its own partitioning (Spark partitions, not DSS partitions) and built-in parallelism.
You can use either SparkSQL, PySpark, or a Group By recipe with the Spark engine.
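For example, a PySpark recipe could look roughly like this (a minimal sketch; the dataset names, column names, and the aggregation itself are placeholders to adapt to your data):

import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Placeholder input dataset name
raw = dataiku.Dataset("raw_events")
df = dkuspark.get_dataframe(sqlContext, raw)

# Derive the day from the hourly timestamp and aggregate per day and subject.
# "event_time", "subject_id" and "value" are placeholder column names.
daily = (df
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day", "subject_id")
         .agg(F.sum("value").alias("value_sum")))

# Placeholder output dataset name
out = dataiku.Dataset("daily_aggregates")
dkuspark.write_with_schema(out, daily)

Spark handles the parallelism and spilling to disk, so the data never has to fit in a single Python process.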
Thanks
-
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 82 ✭✭✭✭✭
Hello @hille499,
In case you have to manage your partitioning by day without Spark:
Why not create partitions with a second dimension covering a longer period than a day, higher up in the path?
Considering you have:
/2022-12-21T00/B00/
where T00 stands for the hour
and B00 is a group (a discrete variable),
you could create the main partitions by year or month higher in the path to split the volume, or use discrete dimensions for the hours/groups below, for example:
%{your_discrete_var}/%Y-%M-%D/%{T00/B00}/.*
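If you stay in pure Python, you can also process one day partition at a time and aggregate chunk by chunk, so only small partial results stay in memory. A minimal sketch, assuming a dataset partitioned by day and placeholder dataset/column names:

import dataiku
import pandas as pd

ds = dataiku.Dataset("raw_events")        # placeholder dataset name
ds.add_read_partitions("2022-12-21")      # restrict reading to a single day partition

partials = []
for chunk in ds.iter_dataframes(chunksize=200000):
    # Aggregate each chunk immediately so the full day never sits in memory.
    partials.append(chunk.groupby("subject_id", as_index=False)["value"].sum())

# Combine the partial aggregates into the final daily result.
daily = (pd.concat(partials, ignore_index=True)
           .groupby("subject_id", as_index=False)["value"].sum())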