Identify lines based on partition variable

Hi,

I'm creating datasets based on files in an S3 bucket.

The files in the bucket are in a single folder, but have several name patterns, such as "blue_01012017.csv", "red_02012017.csv", etc.

Using partitioning, I have defined "blue", "red", etc. as a partition variable called "source". This information is not included in the data itself.

What I want to do is either:

- directly split my dataset based on that "source" value

- or include a "source" column in my dataset that would have the appropriate value for each line, based on the file it came from, so I can split it later based on that value.

I can't seem to find a way to do this, can you help?

Thanks a lot in advance,

Julien

Answers
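
One possible approach, sketched outside of any particular tool: since the "source" value is encoded in each file name, it can be recovered with a pattern match and written into a new column, after which splitting is a simple group-by. The snippet below is only an illustration in plain Python. The file name pattern (`<source>_<8 digits>.csv`) is inferred from the two examples in the question, and the helper names `source_from_filename`, `tag_rows` and `split_by_source` are hypothetical, not part of any product API.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred from "blue_01012017.csv" and "red_02012017.csv":
# a letters-only source name, an underscore, eight digits, then ".csv".
FILENAME_PATTERN = re.compile(r"^(?P<source>[A-Za-z]+)_\d{8}\.csv$")


def source_from_filename(filename):
    """Return the 'source' part of a file name such as 'blue_01012017.csv'."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return match.group("source")


def tag_rows(rows, filename):
    """Return copies of the rows (dicts) with a 'source' column added."""
    source = source_from_filename(filename)
    return [{**row, "source": source} for row in rows]


def split_by_source(rows):
    """Group rows that already carry a 'source' column by that value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["source"]].append(row)
    return dict(groups)
```

For example, tagging the rows read from "blue_01012017.csv" and "red_02012017.csv" and then calling `split_by_source` on the combined list would give one list of rows per source. A file name that does not fit the assumed pattern raises a `ValueError` rather than being silently assigned to the wrong source.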
