You now have until September 15th to submit your use case or success story to the 2022 Dataiku Frontrunner Awards!ENTER YOUR SUBMISSION

Partitioning - partition column duplicated when using SQL query with Athena on S3

tanguy
Level 3
Partitioning - partition column duplicated when using SQL query with Athena on S3

Hi,

I have recently stumbled upon what appears to be a bug with a partitioned table.

I wanted to filter my partitions with a simple SQL statement "SELECT * FROM {my_partioned_table} WHERE {my_filter}". This works fine when using a Spark SQL recipe, but fails with a SQL query (which runs with Athena as we work with AWS). In the latter situation, the partitioned column reappears in the output schema. IMO, this must be wrong because :

  1. it prevents the computation of metrics of the output table (duplicate column error)
  2. it prevents further SQL recipes (as the "validation" step fails because of a duplicate column error).

Note : It's a pity that the duplicate column is not explicitly pointed out (this could save the user some time, as investigating the table schema does not show any duplicate column ...).

0 Kudos
1 Reply
MikeG
Dataiker
Dataiker

Hi @tanguy ,

Thanks for posting. I am investigating your report.

I am testing with the `cloudfront_logs` dataset provided in https://docs.aws.amazon.com/athena/latest/ug/getting-started.html and DSS 10.0.5.

After creating the dataset in Athena I add the Athena dataset in DSS (with partitioning activated on column `location`):

Screen Shot 2022-05-11 at 1.33.15 PM.png

Then, in a SQL recipe I run the following query: `SELECT * FROM "mydatabase"."cloudfront_logs" WHERE location = 'IAD2';`:

Screen Shot 2022-05-11 at 1.39.41 PM.png

I am able to successfully compute all default metrics on the output dataset `sql_recipe_output` (I do not encounter the `duplicate column` error you mentioned):

Screen Shot 2022-05-11 at 1.40.16 PM.png

Note: my Flow in DSS looks like this

  • Athena dataset with partitioning activated on location named `cloudfront_logs` ->
  • SQL recipe (`SELECT * FROM "mydatabase"."cloudfront_logs" WHERE location = 'IAD2';`) ->
  • S3 dataset named `sql_recipe_output`

Screen Shot 2022-05-11 at 1.42.33 PM.png---

Next Actions

Can you answer the following questions? I think answers to these will give me the context I need to understand the issue you’re reporting and work on reproducing the issue on my end.

  1. Are you able to reproduce the issue you’re reporting using the Athena Quickstart `cloudfront_logs` dataset? This could be helpful for me to fully understand the issue you’re reporting.
  2. Can you post a screenshot of both duplicate column errors you mentioned (“It prevents the computation of metrics of the output table (duplicate column error)” and “It prevents further SQL recipes (as the “validation” step fails because of a duplicate column error”). Having screenshots may be helpful for context.
  3. Can you post a screenshot of your Flow in DSS that is encountering the issue?
  4. What version of DSS were you using when encountering the reported issue?

Thank you,
Mike

0 Kudos