Discover the winners & finalists of the 2022 Dataiku Frontrunner Awards!READ THEIR USE CASES

Partitioning - partition column duplicated when using SQL query with Athena on S3

tanguy
Level 4
Partitioning - partition column duplicated when using SQL query with Athena on S3

Hi,

I have recently stumbled upon what appears to be a bug with a partitioned table.

I wanted to filter my partitions with a simple SQL statement "SELECT * FROM {my_partioned_table} WHERE {my_filter}". This works fine when using a Spark SQL recipe, but fails with a SQL query (which runs with Athena as we work with AWS). In the latter situation, the partitioned column reappears in the output schema. IMO, this must be wrong because :

  1. it prevents the computation of metrics of the output table (duplicate column error)
  2. it prevents further SQL recipes (as the "validation" step fails because of a duplicate column error).

Note : It's a pity that the duplicate column is not explicitly pointed out (this could save the user some time, as investigating the table schema does not show any duplicate column ...).

2 Replies
MikeG
Dataiker

Hi @tanguy ,

Thanks for posting. I am investigating your report.

I am testing with the `cloudfront_logs` dataset provided in https://docs.aws.amazon.com/athena/latest/ug/getting-started.html and DSS 10.0.5.

After creating the dataset in Athena I add the Athena dataset in DSS (with partitioning activated on column `location`):

Screen Shot 2022-05-11 at 1.33.15 PM.png

Then, in a SQL recipe I run the following query: `SELECT * FROM "mydatabase"."cloudfront_logs" WHERE location = 'IAD2';`:

Screen Shot 2022-05-11 at 1.39.41 PM.png

I am able to successfully compute all default metrics on the output dataset `sql_recipe_output` (I do not encounter the `duplicate column` error you mentioned):

Screen Shot 2022-05-11 at 1.40.16 PM.png

Note: my Flow in DSS looks like this

  • Athena dataset with partitioning activated on location named `cloudfront_logs` ->
  • SQL recipe (`SELECT * FROM "mydatabase"."cloudfront_logs" WHERE location = 'IAD2';`) ->
  • S3 dataset named `sql_recipe_output`

Screen Shot 2022-05-11 at 1.42.33 PM.png---

Next Actions

Can you answer the following questions? I think answers to these will give me the context I need to understand the issue you’re reporting and work on reproducing the issue on my end.

  1. Are you able to reproduce the issue you’re reporting using the Athena Quickstart `cloudfront_logs` dataset? This could be helpful for me to fully understand the issue you’re reporting.
  2. Can you post a screenshot of both duplicate column errors you mentioned (“It prevents the computation of metrics of the output table (duplicate column error)” and “It prevents further SQL recipes (as the “validation” step fails because of a duplicate column error”). Having screenshots may be helpful for context.
  3. Can you post a screenshot of your Flow in DSS that is encountering the issue?
  4. What version of DSS were you using when encountering the reported issue?

Thank you,
Mike

0 Kudos
tanguy
Level 4
Author

Hi Mike,

Thanks for coming back to me (and sorry for the late answer).

Regarding your questions:

"Are you able to reproduce the issue you’re reporting using the Athena Quickstart `cloudfront_logs` dataset?" --> unfortunately not, I do not have sufficient permissions to run the Athena tutorial you refered to (I couldn't create 'mydatabase'; see below screenshot) nor could I directly access to the cloudfront s3 dataset to import it in dataiku.Aws_fail.jpg

However, I forgot to mention that I was working with S3 datasets, not athena datasets:

2.jpgThe reason for that being that, if I am correct, it is not possible to create another athena table in the flow from a previous dataset (my purpose is to transform data in the flow, not just to query data).

So allow me to restart with another demo with S3 datasets (using AWS Athena as the engine for SQL recipes):

3.jpg

The demo data is very simple:

4.jpg 

This demo data is then synchronized on S3 (2nd table in the flow).

I then partition the synchronized demo data on "col_B" using a 2nd sync recipe with the "redispatch" option activated:

5.jpg

Note that I partition the dataset using any partition specification (here "foo/bar", which is kind of confusing). Note also that the partitioned demo dataset (3rd table in the flow) has lost its partitioning column (which is regrettable):

6.jpg

With this partitioned table, I use Athena with an SQL recipe to filter partitions (here on partition '1') :

tanguy_0-1666526596831.jpeg

Everything works fine up to this stage. I obtain the following filtered table in ouput (4th table in the flow) :

tanguy_1-1666526596844.jpeg

Note that the partitioning column reappeared (had I used a Spark SQL recipe, this would not have happened)!

Finally, in the final step, I want to recollect all the data in a non-partitioned format (tentative to build the 5th table in the flow). To do this, I use an SQL recipe with a « SELECT * » statement (no filters applied) :

tanguy_2-1666526596883.jpeg

But this step is blocked by a « duplicate column » validation error :

tanguy_3-1666526596904.jpeg

I hope this clarifies my problem. If I recall properly, I think I was trying to do the last 2 steps in my OP (filtering on the desired partitions + recollecting all the data in a non-partitioned output dataset).

0 Kudos