Activate Partition for S3

sj0071992
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

Hi,

I am processing some log files stored at my S3 location "/audit/compute-resource-usage/compute-resource-usage/Year/Month/file.gz".

These log files are generated every 1-3 minutes, and I want to process only the latest log file received at the S3 location.

To do that, I activated partitioning with the pattern "/%Y/%M/.*", but I am not able to list any partitions in the preview.

Could you please help in resolving this?

Thanks in advance

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker

    Hi,

    Just to confirm: are your dataset files stored under a path of the form 2021/01?

    The dataset in S3 would be S3://audit/compute-resource-usage/compute-resource-usage

    The partitioning pattern would then be /%Y/%M/.*, which means the structure has to be 2021/01 and not 2021/Jan or 2021/January.

    This gives monthly partitioning: even though new files are added every couple of minutes, a rebuild would have to cover the whole month.
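
    As a rough illustration of why 2021/10 is detected while 2021/Oct is not, here is a minimal Python sketch that reads the pattern as a regular expression. It is not how DSS implements partition detection. The helper names (pattern_to_regex, partition_of) are made up for this example, and it assumes %Y means a 4-digit year, %M a 2-digit month, and that a monthly partition is identified as YYYY-MM.

```python
import re

# Assumed meaning of the time tokens: %Y = 4-digit year, %M = 2-digit month.
TOKEN_REGEX = {
    "%Y": r"(?P<year>\d{4})",
    "%M": r"(?P<month>\d{2})",
}


def pattern_to_regex(pattern):
    """Translate a pattern such as '/%Y/%M/.*' into a compiled regex (illustrative only)."""
    parts = re.split(r"(%Y|%M|\.\*)", pattern)
    pieces = []
    for part in parts:
        if part in TOKEN_REGEX:
            pieces.append(TOKEN_REGEX[part])
        elif part == ".*":
            pieces.append(".*")  # wildcard: anything may follow
        else:
            pieces.append(re.escape(part))  # literal text such as '/'
    return re.compile("^" + "".join(pieces) + "$")


def partition_of(path, pattern="/%Y/%M/.*"):
    """Return 'YYYY-MM' when the path (relative to the dataset root) matches, else None."""
    match = pattern_to_regex(pattern).match(path)
    if match is None:
        return None
    return f"{match.group('year')}-{match.group('month')}"


if __name__ == "__main__":
    for path in ["/2021/10/file.gz", "/2021/Oct/file.gz", "/2021/October/file.gz"]:
        print(path, "->", partition_of(path))
```

    The paths passed in are relative to the dataset root, so the root itself (.../compute-resource-usage/compute-resource-usage) is not part of the pattern.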

    Could you elaborate a bit on what you are looking to do once you are able to read the files within the partition?

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

    Hi Alex,

    You are correct, my S3 path is like "/audit/compute-resource-usage/compute-resource-usage/2021/10/file.gz".

    As you mentioned, it would rebuild the whole month. That is fine for me for now, but even with the same partitioning pattern I am not able to list any partitions in the preview.

    Once the files are read through the partition, a Python recipe enforces schema consistency across all the files, and after that we run some aggregations.

    We are building some manual KPIs, such as which project uses the most CPU and which SQL recipe takes the most time.

    I hope it is clear what I am looking for.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker

    Hi,

    It should work in your case. When you create an S3 dataset with that path and list the files, do you see any files listed?

    I've tried to replicate this on my end; this is what I see, along with the corresponding partitions being detected:

    [Screenshot 2021-11-10 at 10.40.08.png] [Screenshot 2021-11-10 at 10.39.56.png]
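
    If no files are listed in DSS, it can help to check outside DSS whether the same credentials can see any objects under that path. The sketch below is only a suggestion and assumes boto3 is installed; list_keys is a hypothetical helper, and "my-bucket" and the prefix in the usage comment are placeholders to replace with the real bucket and path. Note that S3 keys normally do not start with a leading "/".

```python
def list_keys(s3_client, bucket, prefix, limit=20):
    """Return up to `limit` object keys found under `prefix` in `bucket`."""
    keys = []
    # list_objects_v2 returns results in pages, so iterate with a paginator.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            if len(keys) >= limit:
                return keys
    return keys


# Example usage (placeholder bucket and prefix, adjust to your setup):
#
#   import boto3
#   client = boto3.client("s3")
#   prefix = "audit/compute-resource-usage/compute-resource-usage/2021/10/"
#   for key in list_keys(client, "my-bucket", prefix):
#       print(key)
```

    An empty result here would point to the path or the permissions of the connection rather than to the partitioning pattern.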

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

    Hi Alex,

    It seems it's working in your case.

    1. "When you create an S3 dataset with that path and list the files, do you see any files listed?" No, I am not able to list any files.

    2. My file format is file.gz and in your case it's file.tar.gz. Will that make any difference?

    Thanks in advance
