Columns not displayed correctly from parquet file

Solved!
Carl
Level 3
Columns not displayed correctly from parquet file

Hi,

I've uploaded a parquet file from AWS S3 and the columns are not correctly displayed.

I've uploaded same parquet file locally on my computer and I share 2 screenshots, one correctly uploaded with the shape of the dataframe, variables content (Output Parquet Local Computer) and the second screenshot with the preview from Dataiku (Output Parquet from AWS S3)

The "conversion" from parquet into CSV seems not working.


Operating system used: Mac OS

0 Kudos
1 Solution
Carl
Level 3
Author

Hi,

It seems Dataiku can't read files on non HDFS datasets.

"Tried format parquet but configuration is not OK: Could not read file using format: Parquet is not supported on non-HDFS datasets"

"Dommage...."

 

View solution in original post

0 Kudos
7 Replies
fchataigner2
Dataiker

Hi,

you need to change the format to Parquet in the Format / Preview tab of your screenshot. DSS wasn't able to correctly detect that the data is parquet.

Note that parquet has a lot of ties to Hadoop libraries, and in particular it needs to work on folders rather than on individual files. Which means that you should have the tain.parquet file in some folder on S3 and select the folder for the dataset path.

0 Kudos
Carl
Level 3
Author

That's what I did, I change the format but I get another error

No such file or directory: s3a://p8west/AMEX/train_data.parquet/train_data.parquet

 Tried format but configuration is not OK: Connection failed: undefined

So I've tested creating a folder in S3 as described in error message  s3a://p8west/AMEX/train_data.parquet/train_data.parquet 

but then I get 

 No such file or directory:  s3a://p8west/AMEX/train_data.parquet/train_data.parquet/train_data.parquet 

If I select a folder instead of a file, the process never ends

 

 

0 Kudos
fchataigner2
Dataiker

if the full path to the parquet file is s3a://p8west/AMEX/train_data.parquet/train_data.parquet , then the dataset path should be s3a://p8west/AMEX/train_data.parquet/ . The name of the folder (ending in .parquet) may be misleading.

0 Kudos
Carl
Level 3
Author

That's what I did.

If I just enter the path, it nevers finnish loading the file

0 Kudos
fchataigner2
Dataiker

never loading a parquet file stored on S3 is not really expected. You'd need to generate a diagnostic of the instance in Administration > Maintenance > Diagnostic tool and create a support ticket with that on support.dataiku.com

0 Kudos
Carl
Level 3
Author

Hi,

It seems Dataiku can't read files on non HDFS datasets.

"Tried format parquet but configuration is not OK: Could not read file using format: Parquet is not supported on non-HDFS datasets"

"Dommage...."

 

0 Kudos
fchataigner2
Dataiker

as said before, Parquet has strong ties to Hadoop libraries, and only by using python libs that reimplement a good chunk of parquet can you avoid that fact. You need to put the parquet files on a HDFS connection, or HDFS-able one like S3/Azure blob/GCS (in which case the HDFS interface field of the DSS connection needs to be set)

0 Kudos