Identify lines based on partition variable
JBR
Registered Posts: 6 ✭✭✭✭
Hi,
I'm creating datasets based on files in an S3 bucket.
The files in the bucket are in a single folder, but have several name patterns, such as "blue_01012017.csv", "red_02012017.csv", etc.
Using partitioning, I have defined "blue", "red", etc. as a partition variable called "source". This information is not included in the data itself.
What I want to do is either:
- directly split my dataset based on that "source" value
- or include a "source" column in my dataset that would have the appropriate value for each line, based on the file it came from, so I can split it later based on that value.
I can't seem to find a way to do this, can you help?
Thanks a lot in advance,
Julien
Answers
-
It is indeed not currently possible to retrieve the source partition as a value inside the data.
You can, however, achieve the split with multiple sync recipes, each selecting a single input partition through partition dependencies.
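If the files can also be read outside DSS, the second option from the question (a "source" column derived from the file name) can be sketched in plain Python. This is only an illustration, not a DSS feature: the file name pattern, the function names, and the sample CSV content below are assumptions based on the examples "blue_01012017.csv" and "red_02012017.csv".

```python
import csv
import io
import re

# Assumed naming scheme: "<source>_<8 digits>.csv", e.g. "blue_01012017.csv".
FILENAME_PATTERN = re.compile(r"^(?P<source>[A-Za-z]+)_(?P<date>\d{8})\.csv$")


def source_from_filename(filename):
    """Return the source prefix of a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    return match.group("source") if match else None


def add_source_column(filename, csv_text):
    """Parse CSV text and return its rows as dicts with an extra "source" key."""
    source = source_from_filename(filename)
    if source is None:
        raise ValueError(f"unexpected file name: {filename}")
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every row of a given file gets the same source value.
    return [{**row, "source": source} for row in reader]


# Hypothetical file content, used only to show the shape of the result.
rows = add_source_column("blue_01012017.csv", "id,value\n1,10\n2,20\n")
```

Once each row carries its "source" value, splitting later is a simple filter or group-by on that column.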
-
Thanks Clément, it does the trick perfectly!