Partitioning

hille499, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered, Posts: 1

Hi there,

I have a problem aggregating a big data file stored on S3.

The data is stored like this: /2022-12-21T00/B00/part-00004-d89b41ad-3d2b-4350-9880-c5f1dfbdbea6.c000.csv.gz, where T00 stands for the hour and B00 is a group that always contains the same subjects.

What I am trying to achieve is to aggregate the B00 group by day. I have tried to partition in different ways, but I keep crashing in Python because the data is too big to fit in memory.

I have tried to partition like this: /%Y-%M-%DT.* and also like this: %Y-%M-%DT%H/%{dimension_2}/.*, but it is still very slow. I use iter_dataframes in Python to iterate over the partitions, roughly as in the sketch below.
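
Roughly, that loop looks like this (the dataset name and aggregation columns are placeholders):

    import dataiku
    import pandas as pd

    ds = dataiku.Dataset("s3_hourly")               # placeholder dataset name
    frames = []
    for chunk in ds.iter_dataframes(chunksize=100000):
        frames.append(chunk)                        # every chunk is kept in memory
    df = pd.concat(frames)                          # this is where memory runs out
    daily = df.groupby(["day", "group"]).sum()      # placeholder aggregation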

Please advise.

Best regards,
Michael

Answers

  • Alexandru, Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered, Posts: 1,209

    Hi,
    Have you tried leveraging Spark in this case? If the dataset is large, Spark can handle it with Spark partitions (not DSS partitions) and built-in parallelism.

    You can use SparkSQL, PySpark, or the Group By recipe with the Spark engine, for example along the lines of the sketch below.
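
    For illustration, a minimal PySpark recipe sketch; the dataset names and columns (hour_col, group_col, value_col) are placeholders to adapt to your schema:

        import dataiku
        import dataiku.spark as dkuspark
        import pyspark.sql.functions as F
        from pyspark.sql import SparkSession, SQLContext

        spark = SparkSession.builder.getOrCreate()
        sqlContext = SQLContext(spark.sparkContext)

        # read the input dataset as a Spark DataFrame
        df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("s3_hourly"))

        # derive the day from the hourly timestamp and aggregate per group
        daily = (df
                 .withColumn("day", F.substring("hour_col", 1, 10))  # "2022-12-21T00" -> "2022-12-21"
                 .groupBy("day", "group_col")
                 .agg(F.sum("value_col").alias("total")))

        # write the aggregated result to the output dataset
        dkuspark.write_with_schema(dataiku.Dataset("daily_agg"), daily)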


    Thanks

  • Grixis, Partner Applicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered, Posts: 56

    Hello @hille499,

    In case you have to manage your partitions by day without Spark:

    Why not create a partitioning scheme with a second dimension, putting a period longer than a day higher up in the path?

    Considering you have:

    /2022-12-21T00/B00/

    where T00 stands for the hour and B00 is a group (a discrete variable),

    you could create main partitions by year or month above to divide the weight, or handle the hours/groups with discrete dimensions below:

    %{your_discrete_var}/%Y-%M-%D/%{T00/B00}/.*
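
    With a day-level scheme like that, a Python recipe can then pull one partition at a time instead of the whole dataset. A minimal sketch, assuming placeholder dataset names and the dataiku API's list_partitions / add_read_partitions:

        import dataiku

        # iterate over the day/group partitions one by one
        for part in dataiku.Dataset("s3_hourly").list_partitions():
            ds = dataiku.Dataset("s3_hourly", ignore_flow=True)
            ds.add_read_partitions(part)   # restrict this read to a single partition
            for chunk in ds.iter_dataframes(chunksize=100000):
                # aggregate each chunk here; only one partition is in memory at a time
                pass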
