Filter doesn't return any rows when it's supposed to

Solved!
boumezrag
Level 2
Hi everybody,

I'm trying to filter a table of 2.7M rows in order to get a sample.

Here is what I did:

- I created a filter

- I chose: Filter ON

- Keep only rows that satisfy: All the following conditions

- I put in the condition

- For the sampling: I chose Whole data

When I run it, my filter doesn't return any rows when it should.

What is the problem?

Thanks in advance

11 Replies
Alex_Combessie
Dataiker Alumni
Hello,

Could you please give us more details as to the nature and content of your filter? Have you checked that the value you are filtering on is indeed in the whole dataset?

Cheers,

Alex
boumezrag
Level 2
Author
Thank you Alexandre for your reply.
I tried multiple examples and none of them work.
For example: I selected a column named mandt, then tried mandt equals 100 (all the rows have the value 100) -> result: 0 rows.
I tried mandt is different from 100 -> result: 0 rows.
mandt is defined -> result: 0 rows.
And so on...
😞
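
A quick way to rule out the data itself is to apply the same condition outside DSS. A minimal pandas sketch, assuming the dataset has been exported to a hypothetical sample.csv:

```python
import pandas as pd

# Hypothetical export of the dataset; reading mandt as a string avoids
# a silent type mismatch ("100" vs 100) in the comparison below.
df = pd.read_csv("sample.csv", dtype={"mandt": str})

print(len(df))                        # total rows loaded
print(df["mandt"].unique()[:10])      # values that actually exist
print((df["mandt"] == "100").sum())   # rows the filter should return
```

If this direct check finds matching rows while the DSS filter returns none, the problem lies in how DSS sees the column rather than in the data itself.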
Alex_Combessie
Dataiker Alumni
Could you tell us how you are creating your filter? Is it a filter in the sampling definition? A step in a recipe? A filter in the view of the sample?
You can attach some screenshots to your comments to show us what you are trying to achieve.
boumezrag
Level 2
Author

@Alex

Here are the screenshots.

[screenshots of the filter configuration attached]
Alex_Combessie
Dataiker Alumni
Could you please share the job diagnosis after you run the recipe? You can download it on the job's page, under Actions > Download job diagnosis.
boumezrag
Level 2
Author
Sorry, I downloaded the job diagnosis, but I don't know how to share it.
Alex_Combessie
Dataiker Alumni
You can use any file transfer service you want, for instance WeTransfer.
boumezrag
Level 2
Author
Unfortunately, I can't use those websites; they are all blocked at the company where I'm doing my internship.
Alex_Combessie
Dataiker Alumni
Can you send it as an email attachment to my address (alexandre.combessie@dataiku.com)? Side note: if your company has subscribed to Dataiku, you can also contact our official support at https://support.dataiku.com. The website answers.dataiku.com is meant for community support.
Alex_Combessie
Dataiker Alumni

Hello,



Thanks for the diagnosis. After investigation, it seems the issue was caused by a discrepancy between lowercase and uppercase in your original Parquet file versus the Hive table. Your input dataset was generated manually as a Parquet file with the column name "MANDT" (uppercase). It was then imported from Hive into DSS. However, Hive always converts column names to lowercase, so DSS saw the column as "mandt", which is inconsistent with the name stored in the original Parquet file. As of today, we cannot detect this type of case automatically.
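
You can verify the discrepancy yourself by inspecting the schema stored inside the Parquet file. A minimal sketch using pyarrow (the file path is a hypothetical stand-in for your actual file):

```python
import pyarrow.parquet as pq

# Read only the schema, without loading any data (path is hypothetical).
schema = pq.read_schema("your_file.parquet")

# If this prints 'MANDT' while Hive/DSS shows 'mandt', you have the
# lowercase/uppercase mismatch described above.
print(schema.names)
```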



The preferred solution would be to only generate Parquet files with lowercase column names, so that they are compatible with Hive (and Impala as well). 
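
For instance, a minimal sketch of regenerating the file with lowercase column names, assuming pandas with a Parquet engine such as pyarrow is installed (file names are hypothetical):

```python
import pandas as pd

# Load the original file (path is hypothetical).
df = pd.read_parquet("original.parquet")

# Normalize every column name to lowercase so the Parquet schema
# matches what Hive, and therefore DSS, will expose.
df.columns = [c.lower() for c in df.columns]

df.to_parquet("lowercase.parquet", index=False)
```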



If that option is not possible, you may try to change the recipe engine from DSS to Hive. In fact, for large datasets, it is recommended to switch the recipe engine to a Hadoop-related one (Spark, Hive, or Impala). You should gain performance by pushing the computation down to your Hadoop cluster instead of having the data streamed through DSS.



Cheers,



Alex

boumezrag
Level 2
Author
Hello Alex,
Thank you so much for your help, I appreciate it!