Activate Partition for S3
Hi,
I am processing some log files stored in my S3 location "/audit/compute-resource-usage/compute-resource-usage/Year/Month/file.gz".
These log files are generated every 1-3 minutes, and I want to process only the latest log file received at the S3 location.
To do that, I activated partitioning with the pattern "/%Y/%M/.*", but I am not able to list any partitions in the preview.
Could you please help me resolve this?
Thanks in advance
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi,
Just to confirm: are your dataset files stored under a path of the form 2021/01?
The dataset in S3 would be S3://audit/compute-resource-usage/compute-resource-usage
So the partitioning pattern would be /%Y/%M/.*, which means the structure has to be 2021/01 and not 2021/Jan or 2021/January.
With this pattern the partitioning is monthly: even if new files are added every couple of minutes, a rebuild would have to cover the whole month.
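To illustrate how such a pattern maps file paths to monthly partitions, here is a minimal Python sketch. It is not how DSS implements partition detection. The helper names `pattern_to_regex` and `partition_of` are hypothetical, and the translation of %Y to a 4-digit year and %M to a 2-digit month is an assumption based on the description above.

```python
import re

# Assumed translation of the time tokens into regex groups:
# %Y -> 4-digit year, %M -> 2-digit month (numeric, not "Jan" or "January").
TOKEN_REGEX = {"%Y": r"(?P<year>\d{4})", "%M": r"(?P<month>\d{2})"}


def pattern_to_regex(pattern):
    """Turn a pattern such as '/%Y/%M/.*' into a compiled regex (hypothetical helper)."""
    regex = pattern
    for token, group in TOKEN_REGEX.items():
        regex = regex.replace(token, group)
    return re.compile(regex)


def partition_of(path, pattern="/%Y/%M/.*"):
    """Return a 'YYYY-MM' partition id for a path relative to the dataset root, or None."""
    match = pattern_to_regex(pattern).fullmatch(path)
    if match is None:
        return None
    return f"{match['year']}-{match['month']}"


# Paths are relative to the dataset root, not to the bucket.
for path in ["/2021/10/file.gz", "/2021/Jan/file.gz", "/2021/1/file.gz"]:
    print(path, "->", partition_of(path))
```

In this sketch every file under /2021/10/ lands in the single partition 2021-10, which is why a rebuild covers the whole month regardless of how often new files arrive.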
Could you elaborate a bit on what you are looking to do once you are able to read the files within the partition?
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
You are correct, my S3 path looks like "/audit/compute-resource-usage/compute-resource-usage/2021/10/file.gz".
As you mentioned, it would rebuild for the whole month. That is fine for me for now, but even with that pattern I am not able to list any partitions in the preview.
Once the files are read from the partition, a Python recipe enforces schema consistency across all the files, and after that we have some aggregations.
We are building some manual KPIs, such as which project is using the most CPU, which SQL recipe is taking the most time, etc.
I hope it is clear what I am looking for.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi,
It should work in your case. When you create an S3 dataset with that path and list the files, do you see any files listed?
I tried to replicate this on my end, and the corresponding partitions are being detected.
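As a way to check outside DSS whether any objects exist under the dataset path, here is a minimal sketch using boto3's `list_objects_v2` paginator. The helper name `list_keys` is hypothetical. The bucket name and prefix in the usage comment are assumptions read off the S3 path quoted earlier in the thread, and they may not match the actual bucket layout.

```python
def list_keys(s3_client, bucket, prefix, limit=20):
    """Return up to `limit` object keys found under `prefix` in `bucket`.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            if len(keys) >= limit:
                return keys
    return keys


# Usage (requires boto3 and valid AWS credentials; names below are assumptions):
#   import boto3
#   keys = list_keys(boto3.client("s3"), "audit",
#                    "compute-resource-usage/compute-resource-usage/")
#   print(keys or "no objects found under this prefix")
```

If this returns an empty list, the problem is likely with the bucket, prefix, or credentials rather than with the partitioning pattern.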
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
It seems it's working in your case.
1. "Do you see any files listed when you create an S3 dataset with the path?" No, I am not able to list any files.
2. My file name is file.gz, whereas in your case it's file.tar.gz. Will that make any difference?
Thanks in advance