Parquet format table redetected as CSV

MarcioCoelho
Level 2

Hey,

I've been running into an issue where, after creating a dataset stored in Parquet with a PySpark recipe, the dataset is redetected as CSV, with a very different schema.

Here's the dataset before pressing redetect format:

[Image: Original dataset]

And after pressing redetect format, it goes from 18 to 75 columns:

[Image: After using redetect format]

And the new columns make no sense:

[Image: New columns that shouldn't exist]

And to confirm the generated parquet files:

[Image: Parquet files]
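
For anyone wanting to double-check the files themselves rather than the listing: Parquet files start and end with the 4-byte marker `PAR1`, so a quick scan can show whether everything in the output folder really is Parquet. This is only a rough sketch; the function names are made up for illustration, and the idea that stray non-Parquet files could influence format detection is an assumption, not something confirmed in this thread.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def looks_like_parquet(path):
    """Return True if the file begins and ends with the Parquet magic bytes."""
    p = Path(path)
    # Too small to hold both the leading and trailing marker
    if p.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with p.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def non_parquet_files(folder):
    """List the names of regular files in `folder` that fail the magic-byte check."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and not looks_like_parquet(p)
    )
```

Calling `non_parquet_files(...)` on a local copy of the dataset folder would list things like Spark's `_SUCCESS` marker or checksum files, which are expected, alongside anything that is genuinely not Parquet.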


I've deleted and recreated the dataset multiple times, but I always get the same result.

I've also checked the pyspark recipe, but it generates the 18 supposed columns, not 75.

Any help would be appreciated, as I'm at a loss on what could be causing this issue.


Best regards,

Márcio Coelho


Operating system used: Windows

2 Replies
JordanB
Dataiker

Hi @MarcioCoelho,

Thanks for writing in! Referencing your first snapshot, it appears that the dataset is originally detected as Parquet. What happens if you don't select "redetect format" and instead select "Check now"?


Alternatively, are you able to change the dataset to Parquet format using the "Type" drop-down menu and select update preview?


If the steps above do not work, would you be able to share an example of the code you're using to create the dataset?

Thanks again,

Jordan

MarcioCoelho
Level 2
Author

Hey @JordanB, thanks for your reply.
We got it working by setting spark.dku.allow.native.parquet.reader.infer to true, as described in https://doc.dataiku.com/dss/latest/connecting/formats/parquet.html.

We suspect that some of the data had an unusual format and was therefore being inferred incorrectly.
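
In case it helps someone else, the setting is just an ordinary Spark key/value property. The sketch below only shows the pair and renders it in the generic `--conf key=value` notation used by Spark; the helper name is hypothetical, and in DSS the property would normally go into the Spark configuration used by the recipe rather than onto a command line, so check the linked documentation for where to set it on your instance.

```python
# Property name taken from this thread; everything else here is illustrative.
PARQUET_INFER_KEY = "spark.dku.allow.native.parquet.reader.infer"


def spark_conf_args(overrides):
    """Render a dict of Spark properties as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(overrides.items()):
        # Spark property values are strings; write booleans in lowercase
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


overrides = {PARQUET_INFER_KEY: True}
print(spark_conf_args(overrides))
```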