
Activate Partition for S3

sj0071992
Level 2

Hi,

I am processing some log files stored at my S3 location "/audit/compute-resource-usage/compute-resource-usage/Year/Month/file.gz".

These log files are generated every 1-3 minutes, and I want to process only the latest log file received at the S3 location.

To do that, I activated partitioning with the pattern "/%Y/%M/.*", but I am not able to list any partitions in the preview.

Could you please help me resolve this?

Thanks in advance

4 Replies
AlexT
Dataiker

Hi,

Just to confirm: are your dataset files stored under a path of the form 2021/01?

The dataset path in S3 would be s3://audit/compute-resource-usage/compute-resource-usage

So the partitioning pattern would be /%Y/%M/.* which means the structure has to be 2021/01, and not 2021/Jan or 2021/January.
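To make the point about 2021/01 versus 2021/Jan concrete, here is a minimal Python sketch of how such a pattern can be read. This is not how DSS implements pattern matching internally; `pattern_to_regex` and `match_partition` are hypothetical helper names, and the sketch assumes that %Y stands for a four-digit year, %M for a two-digit month, and that paths are given relative to the dataset path.

```python
import re

# Illustration only: NOT the DSS implementation.
# Assumption: %Y = four-digit year, %M = two-digit month.
TOKENS = {
    "%Y": r"(?P<year>\d{4})",
    "%M": r"(?P<month>\d{2})",
}


def pattern_to_regex(pattern):
    """Turn a pattern such as '/%Y/%M/.*' into a compiled regex."""
    regex = pattern
    for token, group in TOKENS.items():
        regex = regex.replace(token, group)
    return re.compile("^" + regex + "$")


def match_partition(pattern, relative_path):
    """Return 'YYYY-MM' if the path fits the pattern, otherwise None."""
    m = pattern_to_regex(pattern).match(relative_path)
    if m is None:
        return None
    return m.group("year") + "-" + m.group("month")


print(match_partition("/%Y/%M/.*", "/2021/10/file.gz"))   # numeric month: matches
print(match_partition("/%Y/%M/.*", "/2021/Jan/file.gz"))  # month name: no match
```

Under these assumptions, a path with a month name never produces a partition, which is one way a preview can end up empty.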

With this pattern the partitioning is monthly. Even though new files are added every couple of minutes, a rebuild would have to process the whole month.
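As a rough illustration of why a rebuild covers the whole month, the sketch below groups file paths by their year and month folders. The file names are made up, `group_by_month` is a hypothetical helper, and the "YYYY-MM" key is only used here as a label for the monthly partition.

```python
from collections import defaultdict


def group_by_month(relative_paths):
    """Group file names by the year/month folders they sit under."""
    groups = defaultdict(list)
    for path in relative_paths:
        parts = path.strip("/").split("/")
        if len(parts) < 3:
            continue  # not of the form year/month/file
        year, month = parts[0], parts[1]
        groups[f"{year}-{month}"].append(parts[-1])
    return dict(groups)


# Hypothetical file names, relative to the dataset path.
files = [
    "/2021/10/log-0001.gz",
    "/2021/10/log-0002.gz",
    "/2021/11/log-0001.gz",
]
groups = group_by_month(files)
for partition, names in groups.items():
    print(partition, names)
```

Every file that lands in the same month folder ends up in the same group, so a newly arrived file cannot be rebuilt on its own with a year/month pattern.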

Could you elaborate a bit on what you are looking to do once you are able to read the files within the partition?

sj0071992
Level 2
Author

Hi Alex,

You are correct, my S3 path is like /audit/compute-resource-usage/compute-resource-usage/2021/10/file.gz

As you mentioned, it would rebuild the whole month. That is fine for me for now, but with the same partitioning pattern I am still not able to list any partitions in the preview.

Once the partitioned files are read, a Python recipe enforces schema consistency across all the files, and after that we run some aggregations.

We are building some manual KPIs, such as which project uses the most CPU, which SQL recipe takes the most time, etc.

I hope that makes clear what I am looking for.

AlexT
Dataiker

Hi,

It should work in your case. When you create an S3 dataset with that path and list the files, do you see any files listed?

I've tried to replicate this on my end. This is what I see, along with the corresponding partitions being detected:

[Screenshots: "Screenshot 2021-11-10 at 10.40.08.png", "Screenshot 2021-11-10 at 10.39.56.png"]

sj0071992
Level 2
Author

Hi Alex,

It seems it's working in your case.

1. "When you create an S3 dataset with that path and list the files, do you see any files listed?" - No, I am not able to list any files.

2. My file format is file.gz and in your case it's file.tar.gz. Will that make any difference?

Thanks in Advance
