
Filter doesn't return any rows when it's supposed to

Level 2
Hi everybody,

I'm trying to filter a table of 2.7M rows in order to get a sample.

Here is what I did:

- I created a filter

- I chose: Filter ON

- Keep only rows that satisfy: All the following conditions

- I put in the condition

- For the sampling: I chose Whole data

When I run it, my filter doesn't return any rows when it's supposed to.

What is the problem?

Thanks in advance
Dataiker
Hello,

Could you please give us more details as to the nature and content of your filter? Have you checked that the value you are filtering on is indeed in the whole dataset?

Cheers,

Alex
Level 2
Author
Thank you Alexandre for your reply,
I tried multiple examples and none of them work.
For example: I selected a column named mandt, then tried mandt equals 100 (all the rows have the value 100) -> result: 0 rows
I tried mandt is different from 100 -> result: 0 rows
mandt is defined -> result: 0 rows
etc.
😞
Dataiker
Could you tell us how you are creating your filter? Is it a filter in the sampling definition? A step in a recipe? A filter in the view of the sample?
You can attach some screenshots to your comments to show us what you are trying to achieve.
Level 2
Author
@Alex

Here are the screenshots.
Dataiker
Could you please share the job diagnosis after you run the recipe? You can download it in the page of the job, under Actions > Download job diagnosis.
Level 2
Author
Sorry, I downloaded the job diagnosis but I don't know how to share it.
Dataiker
You can use any file transfer service you want, for instance WeTransfer.
Level 2
Author
Unfortunately I can't use those websites; they are all blocked at the company where I'm doing my internship.
Dataiker
Can you send it as an email attachment to my address (alexandre.combessie@dataiku.com)? Side-note: if your company has subscribed to Dataiku, you can also contact our official support https://support.dataiku.com. The website answers.dataiku.com is meant for community support.
Dataiker

Hello,



Thanks for the diagnosis. After investigation, the issue appears to be a lowercase/uppercase discrepancy between your original Parquet file and the Hive table. Your input dataset was generated manually as a Parquet file with the column name "MANDT" (uppercase), and was then imported from Hive into DSS. However, Hive always converts column names to lowercase, so DSS saw the column as "mandt", which does not match the name stored in the original Parquet file. As of today we cannot detect this type of case automatically.
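
To see why a case mismatch makes every condition fail, here is a toy sketch (not DSS internals): the rows carry the column name exactly as stored in the Parquet file, while the filter looks it up with the Hive-normalized lowercase name, so a case-sensitive lookup never finds the column:

```python
# Toy rows keyed the way the Parquet file stores the column (uppercase).
rows = [{"MANDT": "100"}, {"MANDT": "100"}, {"MANDT": "100"}]

def filter_rows(rows, column, value):
    # Case-sensitive lookup: a missing column never matches, which is
    # why "equals", "is different from" and "is defined" all return 0 rows.
    return [r for r in rows if r.get(column) == value]

print(len(filter_rows(rows, "mandt", "100")))  # 0 - column name mismatch
print(len(filter_rows(rows, "MANDT", "100")))  # 3 - names agree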



The preferred solution is to generate Parquet files with lowercase column names only, so that they are compatible with Hive (and Impala as well).



If that is not possible, you may try changing the recipe engine from DSS to Hive. In fact, for large datasets it is recommended to use a Hadoop-related engine (Spark, Hive or Impala): you gain performance by pushing the computation down to your Hadoop cluster instead of streaming the data through DSS.



Cheers,

Alex

Level 2
Author
Hello Alex,
Thank you so much for your help, I appreciate it!