Hey,
I've been running into an issue where, after creating a dataset stored in Parquet using a PySpark recipe, the dataset is redetected as CSV with a very different schema.
Here's the dataset before pressing "Redetect format":
[Screenshot: original dataset]
And after pressing "Redetect format", it goes from 18 to 75 columns:
[Screenshot: dataset after redetecting the format]
And the new columns make no sense:
[Screenshot: new columns that shouldn't exist]
And to confirm the generated Parquet files:
[Screenshot: the Parquet files]
I've deleted and recreated the dataset multiple times, but I always get the same result.
I've also checked the PySpark recipe, and it generates the expected 18 columns, not 75.
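For reference, here is a minimal sketch of how the schema can be checked inside a standard DSS PySpark recipe; the dataset names below are placeholders, not the real ones:

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "my_input" / "my_output" are hypothetical dataset names
input_ds = dataiku.Dataset("my_input")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Confirm the schema the recipe actually produces
df.printSchema()
print(len(df.columns))  # 18 here, not 75

output_ds = dataiku.Dataset("my_output")
dkuspark.write_with_schema(output_ds, df)
```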
Any help would be appreciated, as I'm at a loss on what could be causing this issue.
Best regards,
Márcio Coelho
Operating system used: Windows
Hi @MarcioCoelho,
Thanks for writing in! Referencing your first screenshot, it appears that the dataset was originally detected as Parquet. What happens if you don't select "Redetect format" and instead select "Check now"?
Alternatively, are you able to change the dataset to the Parquet format using the "Type" drop-down menu and then select "Update preview"?
If the steps above do not work, would you be able to share an example of the code you're using to create the dataset?
Thanks again,
Jordan
Hey @JordanB, thanks for your reply.
We got it working properly by setting spark.dku.allow.native.parquet.reader.infer to true, as described in https://doc.dataiku.com/dss/latest/connecting/formats/parquet.html.
We suspect that some of the data had an unusual format and was therefore being wrongly inferred.
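In case it helps anyone else: that key is a Spark configuration property, and in DSS it would normally go in the Spark configuration settings rather than in recipe code. Purely as an illustration, this is how such a property is set on a standalone PySpark session (the app name is made up):

```python
from pyspark.sql import SparkSession

# Illustration only: in DSS this key normally belongs in the Spark
# configuration settings, not in recipe code. Shown here as a
# generic Spark property on a standalone session.
spark = (
    SparkSession.builder
    .appName("parquet-infer-example")  # hypothetical app name
    .config("spark.dku.allow.native.parquet.reader.infer", "true")
    .getOrCreate()
)
```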