Reading from partitioned dataset gives doubles despite DISTINCT

Miloud
Miloud Registered Posts: 1
edited February 11 in Using Dataiku

Dear,

I run the following query on a partitioned dataset:

SELECT DISTINCT X.*
FROM (
SELECT
sha2(
(
nvl(col_a, '') ||
nvl(col_b, '') ||
nvl(col_c, 0) ||
nvl(col_d, '') ||
nvl(col_e, 0) ||
nvl(col_f, '') ||
nvl(col_g, '') ||
nvl(col_h, '') ||
nvl(col_i, '') ||
nvl(col_j, 0) ||
nvl(col_k, 0) ||
nvl(col_l, '') ||
nvl(col_m, '01/01/1900')
),
256
) AS hash_key,

col_a,
col_b,
col_c,
col_d,
col_e,
col_f,
col_g,
col_h,
col_i,
col_j,
col_k,
col_l,
col_m,
part_key

FROM anonymized_table
WHERE part_key = '${param_1}'
) X;

what happens is I get doubles, and I'm talking perfect double despite the distinct statement.

I might be missing something out here - it's as if the information was read "twice" separately though i cannot see any evidence other than the result

Any tips please ?

source dataset→ s3 parquet

recipe → Spark SQL

param_1 variable is well defined in the project's global variables

Thanks a lot

Dataiku version used: latest self-hosted

Dataiku version used: latest self-hosted

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,665 Neuron

    The fact that your dataset is partitioned doesn't really mean it will have unique rows. If you loaded duplicates in your partition that's what you are going to get when you query the dataset. Also note that partitioned datasets do not have any relation with partitioned tables in databases. They are "logical" partitions in the dataiku world which simply allow you to avoid having to reload all dataset data every time so you just load each partition.

Setup Info
    Tags
      Help me…