Partitioning
Hi there,
I have a problem aggregating a large dataset stored on S3.
The data is laid out like this: /2022-12-21T00/B00/part-00004-d89b41ad-3d2b-4350-9880-c5f1dfbdbea6.c000.csv.gz, where T00 stands for the hour and B00 is a group that always contains the same subjects.
What I am trying to achieve is to aggregate the B00 group by day. I have tried to partition in different ways, but I keep crashing in Python because the data is too big to fit in memory.
I have tried to partition like this: /%Y-%M-%DT.* and like this: /%Y-%M-%DT%H/%{dimension_2}/*, but it is still very slow. I use iter_dataframes in Python to iterate over the partitions.
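For reference, this is roughly what my loop looks like (a simplified sketch; the dataset name, column names, and aggregation are placeholders rather than my real project):

import dataiku
import pandas as pd

# Placeholder dataset name; in the real project this points at the partitioned S3 dataset.
dataset = dataiku.Dataset("raw_events")

chunks = []
for chunk in dataset.iter_dataframes(chunksize=200000):
    chunks.append(chunk)

# Concatenating everything into one frame is where memory runs out on big days.
full = pd.concat(chunks, ignore_index=True)
daily = full.groupby(["day", "subject_id"], as_index=False)["value"].sum()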
Please advise.
Best regards,
Michael
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,237 Dataiker
Hi,
Have you tried leveraging Spark in this case? If the dataset is large, Spark can handle it with its own partitioning (Spark partitions, not DSS partitions) and built-in parallelism.
You can use either SparkSQL, PySpark, or a Group By recipe with the Spark engine.
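For example, a PySpark recipe could look roughly like this (a minimal sketch; the dataset names, column names, and the aggregation itself are placeholders to adapt to your data):

import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Placeholder input dataset name
raw = dataiku.Dataset("raw_events")
df = dkuspark.get_dataframe(sqlContext, raw)

# Derive the day from the hourly timestamp and aggregate per day and subject.
# "event_time", "subject_id" and "value" are placeholder column names.
daily = (df
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day", "subject_id")
         .agg(F.sum("value").alias("value_sum")))

# Placeholder output dataset name
out = dataiku.Dataset("daily_aggregates")
dkuspark.write_with_schema(out, daily)

Spark handles the parallelism and spilling to disk, so the data never has to fit in a single Python process.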
Thanks
-
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 82 ✭✭✭✭✭
Hello @hille499,
In case you have to manage your partitioning by day without Spark:
Why not create partitions with a second dimension covering a longer period than a day, higher up in the path?
Considering you have:
/2022-12-21T00/B00/
where T00 stands for the hour
and B00 is a group (a discrete variable),
you could create the main partitions by year or month higher in the path to split the volume, or use discrete dimensions for the hours/groups below, for example:
%{your_discrete_var}/%Y-%M-%D/%{T00/B00}/.*
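If you stay in pure Python, you can also process one day partition at a time and aggregate chunk by chunk, so only small partial results stay in memory. A minimal sketch, assuming a dataset partitioned by day and placeholder dataset/column names:

import dataiku
import pandas as pd

ds = dataiku.Dataset("raw_events")        # placeholder dataset name
ds.add_read_partitions("2022-12-21")      # restrict reading to a single day partition

partials = []
for chunk in ds.iter_dataframes(chunksize=200000):
    # Aggregate each chunk immediately so the full day never sits in memory.
    partials.append(chunk.groupby("subject_id", as_index=False)["value"].sum())

# Combine the partial aggregates into the final daily result.
daily = (pd.concat(partials, ignore_index=True)
           .groupby("subject_id", as_index=False)["value"].sum())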