Reading from partitioned dataset gives doubles despite DISTINCT

Miloud · February 11

Dear,

I run the following query on a partitioned dataset:

SELECT DISTINCT X.*

FROM (

    SELECT

        sha2(

            (

                nvl(col_a, '') ||

                nvl(col_b, '') ||

                nvl(col_c, 0) ||

                nvl(col_d, '') ||

                nvl(col_e, 0) ||

                nvl(col_f, '') ||

                nvl(col_g, '') ||

                nvl(col_h, '') ||

                nvl(col_i, '') ||

                nvl(col_j, 0) ||

                nvl(col_k, 0) ||

                nvl(col_l, '') ||

                nvl(col_m, '01/01/1900')

            ),

            256

        ) AS hash_key,


        col_a,

        col_b,

        col_c,

        col_d,

        col_e,

        col_f,

        col_g,

        col_h,

        col_i,

        col_j,

        col_k,

        col_l,

        col_m,

        part_key


    FROM anonymized_table

    WHERE part_key = '${param_1}'

) X;

what happens is I get doubles, and I'm talking perfect double despite the distinct statement.

I might be missing something out here - it's as if the information was read "twice" separately though i cannot see any evidence other than the result

Any tips please ?

source dataset→ s3 parquet

recipe → Spark SQL

param_1 variable is well defined in the project's global variables

Thanks a lot

Dataiku version used: latest self-hosted

Turribeach · February 15

The fact that your dataset is partitioned doesn't really mean it will have unique rows. If you loaded duplicates in your partition that's what you are going to get when you query the dataset. Also note that partitioned datasets do not have any relation with partitioned tables in databases. They are "logical" partitions in the dataiku world which simply allow you to avoid having to reload all dataset data every time so you just load each partition.

Reading from partitioned dataset gives doubles despite DISTINCT

Answers

Categories

Setup Info

Tags