Hi,
I have recently stumbled upon what appears to be a bug with a partitioned table.
I wanted to filter my partitions with a simple SQL statement: "SELECT * FROM {my_partitioned_table} WHERE {my_filter}". This works fine when using a Spark SQL recipe, but fails with a SQL query recipe (which runs on Athena, as we work with AWS). In the latter case, the partitioning column reappears in the output schema. IMO, this must be wrong because:
Note: it's a pity that the duplicate column is not explicitly pointed out (this could save the user some time, as inspecting the table schema does not show any duplicate column...).
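If my reading of the behaviour is right, the mechanism can be sketched outside of Athena with a tiny SQLite example. Everything here is made up for illustration (the table and column names just mirror the demo; SQLite has no real partitioning), so this only shows the *shape* of the problem, not the Athena behaviour itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for the partitioned table: in Hive-style storage,
# the partition column ("col_B" here) is part of the table schema,
# so SELECT * returns it alongside the data columns.
con.execute("CREATE TABLE my_partitioned_table (col_A TEXT, col_B TEXT)")
con.execute("INSERT INTO my_partitioned_table VALUES ('a', '1'), ('b', '2')")

rows = con.execute(
    "SELECT * FROM my_partitioned_table WHERE col_B = '1'"
).fetchall()
print(rows)  # [('a', '1')] -- the partition column comes back in the result

# If the output writer then appends the partition column again on top of
# that schema, the result is two columns with the same name:
err = None
try:
    con.execute("CREATE TABLE output (col_A TEXT, col_B TEXT, col_B TEXT)")
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # duplicate column name: col_B
```

In SQLite the duplicate is rejected immediately with the offending column named, which is the kind of explicit message the note above is asking for.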
Hi @tanguy ,
Thanks for posting. I am investigating your report.
I am testing with the `cloudfront_logs` dataset provided in https://docs.aws.amazon.com/athena/latest/ug/getting-started.html and DSS 10.0.5.
After creating the dataset in Athena I add the Athena dataset in DSS (with partitioning activated on column `location`):
Then, in a SQL recipe I run the following query: `SELECT * FROM "mydatabase"."cloudfront_logs" WHERE location = 'IAD2';`:
I am able to successfully compute all default metrics on the output dataset `sql_recipe_output` (I do not encounter the `duplicate column` error you mentioned):
Note: my Flow in DSS looks like this
---
Next Actions
Can you answer the following questions? I think the answers will give me the context I need to understand the issue you're reporting and to work on reproducing it on my end.
Thank you,
Mike
Hi Mike,
Thanks for coming back to me (and sorry for the late answer).
Regarding your questions:
"Are you able to reproduce the issue youโre reporting using the Athena Quickstart `cloudfront_logs` dataset?" --> unfortunately not, I do not have sufficient permissions to run the Athena tutorial you refered to (I couldn't create 'mydatabase'; see below screenshot) nor could I directly access to the cloudfront s3 dataset to import it in dataiku.
However, I forgot to mention that I was working with S3 datasets, not Athena datasets:
The reason is that, if I am correct, it is not possible to create another Athena table in the flow from a previous dataset (my purpose is to transform data in the flow, not just to query data).
So allow me to restart with another demo with S3 datasets (using AWS Athena as the engine for SQL recipes):
The demo data is very simple:
This demo data is then synchronized to S3 (2nd table in the flow).
I then partition the synchronized demo data on "col_B" using a 2nd sync recipe with the "redispatch" option activated:
Note that I can partition the dataset using any arbitrary partition specification (here "foo/bar", which is rather confusing). Note also that the partitioned demo dataset (3rd table in the flow) has lost its partitioning column (which is regrettable):
With this partitioned table, I use Athena in a SQL recipe to filter partitions (here on partition '1'):
Everything works fine up to this stage. I obtain the following filtered table in output (4th table in the flow):
Note that the partitioning column reappeared (had I used a Spark SQL recipe, this would not have happened)!
Finally, I want to recollect all the data in a non-partitioned format (attempting to build the 5th table in the flow). To do this, I use a SQL recipe with a "SELECT *" statement (no filter applied):
But this step is blocked by a "duplicate column" validation error:
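For what it's worth, assuming the duplicate really comes from the partition column being both in the stored schema and re-added by the writer, one workaround sketch is to list the data columns explicitly instead of using SELECT *, so the partition column stays out of the projection. Again in SQLite terms (made-up names, illustration only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for the filtered dataset whose stored schema
# already contains the partition column col_B.
con.execute("CREATE TABLE filtered (col_A TEXT, col_B TEXT)")
con.execute("INSERT INTO filtered VALUES ('a', '1'), ('b', '1')")

# Projecting only the data columns keeps col_B out of the query result,
# so a writer that re-adds the partition column would not collide with it.
rows = con.execute("SELECT col_A FROM filtered").fetchall()
print(rows)  # [('a',), ('b',)]
```

This loses the convenience of SELECT * on wide tables, but avoids emitting the partition column twice.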
I hope this clarifies my problem. If I recall correctly, in my OP I was trying to do the last two steps (filtering on the desired partitions + recollecting all the data in a non-partitioned output dataset).