S3 Dataset - Unable to read column header

Solved!
Jas
Level 2
Hi - 

I need to use a dataset (CSV) from Project A, which is created as the output of one of the recipes. However, when I try to use this dataset in Project B, I can read the data but the headers are missing.

I have gone through the settings but could not resolve it. In Project A, I can see the headers.

5 Replies
VitaliyD
Dataiker

Hi @Jas ,

When DSS creates managed datasets, headers are not written by default. To add headers, go to the settings of the recipe's output dataset in Project A, select the "Parse next line as column headers" option, and rebuild the dataset. Please refer to the screenshot below:

Screenshot 2021-07-15 at 09.45.46.png
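
For anyone who would rather script this than click through the UI, here is a minimal sketch of the same change applied to a dataset's raw settings. It assumes the CSV options sit under `formatParams` and that the checkbox maps to a boolean `parseHeaderRow` key; that key name, the sample values, and the helper `enable_header_row` are assumptions for illustration, so please check them against your DSS version.

```python
def enable_header_row(raw_settings):
    """Turn on header parsing in a raw dataset settings dict (modified in place).

    Assumption: the UI option "Parse next line as column headers" is stored
    as formatParams["parseHeaderRow"].
    """
    format_params = raw_settings.setdefault("formatParams", {})
    format_params["parseHeaderRow"] = True
    return raw_settings


# Stand-in for the raw settings of a managed CSV dataset (hypothetical values).
raw = {
    "formatType": "csv",
    "formatParams": {"separator": ",", "parseHeaderRow": False},
}
enable_header_row(raw)
print(raw["formatParams"]["parseHeaderRow"])
```

With the `dataikuapi` client, the dict would typically come from `dataset.get_settings().get_raw()` and be persisted with `settings.save()`; the dataset still has to be rebuilt afterwards for the header line to be written.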

Please note: adding headers is not recommended except for "final outputs" of the Flow, because it prevents a fast path in Spark. The recommended way to read Project A datasets in Project B is to use shared datasets. For more information on how to expose objects from one project to another, please check our documentation here: https://doc.dataiku.com/dss/latest/security/exposed-objects.html#exposing-objects-between-projects
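
To illustrate why the headers probably look "missing" when the files are read directly: the managed CSV holds only data rows, while the column names live in the dataset schema. The sketch below uses only the Python standard library; the file contents and column names are made up for the example.

```python
import csv
import io

# Hypothetical file contents, written without a header row.
headerless_csv = "1,alice\n2,bob\n"
# Hypothetical column names, standing in for the schema kept outside the file.
schema_columns = ["id", "name"]


def read_rows(text, columns=None):
    """Parse CSV text; when columns is None, the first line is taken as the header."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns)
    return list(reader)


# Without the schema, the first data row is mistaken for the header.
guessed = read_rows(headerless_csv)
# With the schema supplied, every row is kept and gets the right column names.
named = read_rows(headerless_csv, schema_columns)
print(len(guessed), len(named))
```

A shared dataset avoids this mismatch because Project B reuses Project A's dataset definition, schema included, instead of re-reading the raw files.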

Jas
Level 2
Author

@VitaliyD Thanks. I was missing the rebuilding step. Is there a global setting that I can apply, so that this remains checked? 

VitaliyD
Dataiker

@Jas I had updated my answer; not sure if you saw that before you posted your question. As I mentioned, this is not the recommended way to use a Project A dataset in Project B, hence there is no global setting that can change this.
We would advise you to revisit your use case and, if possible, use shared datasets instead of the "Parse next line as column headers" option.

Jas
Level 2
Author

@VitaliyD - Thanks. I will be deploying these workflows to the Automation node. Do I need to set the same permissions on the Automation node?

VitaliyD
Dataiker

@Jas - No, you won't need to share the dataset again in the Automation node as long as updated bundles for the projects are deployed to the Automation node.