Reading from partitioned dataset gives doubles despite DISTINCT
Dear,
I run the following query on a partitioned dataset:
SELECT DISTINCT X.*
FROM (
SELECT
sha2(
(
nvl(col_a, '') ||
nvl(col_b, '') ||
nvl(col_c, 0) ||
nvl(col_d, '') ||
nvl(col_e, 0) ||
nvl(col_f, '') ||
nvl(col_g, '') ||
nvl(col_h, '') ||
nvl(col_i, '') ||
nvl(col_j, 0) ||
nvl(col_k, 0) ||
nvl(col_l, '') ||
nvl(col_m, '01/01/1900')
),
256
) AS hash_key,
col_a,
col_b,
col_c,
col_d,
col_e,
col_f,
col_g,
col_h,
col_i,
col_j,
col_k,
col_l,
col_m,
part_key
FROM anonymized_table
WHERE part_key = '${param_1}'
) X;
what happens is I get doubles, and I'm talking perfect double despite the distinct statement.
I might be missing something out here - it's as if the information was read "twice" separately though i cannot see any evidence other than the result
Any tips please ?
source dataset→ s3 parquet
recipe → Spark SQL
param_1 variable is well defined in the project's global variables
Thanks a lot
Dataiku version used: latest self-hosted
Dataiku version used: latest self-hosted
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,665 NeuronThe fact that your dataset is partitioned doesn't really mean it will have unique rows. If you loaded duplicates in your partition that's what you are going to get when you query the dataset. Also note that partitioned datasets do not have any relation with partitioned tables in databases. They are "logical" partitions in the dataiku world which simply allow you to avoid having to reload all dataset data every time so you just load each partition.