
Filter doesn't return any rows when it's supposed to

Level 2
Hi everybody,

I'm trying to filter a table of 2.7M rows in order to get a sample.

Here is what I did:

- I created a filter

- I chose: Filter ON

- Keep only rows that satisfy: All the following conditions

- I put in the condition

- For the sampling: I chose Whole data

When I run it, my filter doesn't return any rows when it's supposed to.

What is the problem?

Thanks in advance
Dataiker
Hello,

Could you please give us more details as to the nature and content of your filter? Have you checked that the value you are filtering on is indeed in the whole dataset?

Cheers,

Alex
Level 2
Author
Thank you Alexandre for your reply,
I tried multiple examples and none of them work.
For example: I selected a column named mandt, then tried mandt equals 100 (all the rows have the value 100) -> result: 0 rows
I tried mandt is different from 100 -> result: 0 rows
mandt is defined -> result: 0 rows
etc.
😞
Dataiker
Could you tell us how you are creating your filter? Is it a filter in the sampling definition? A step in a recipe? A filter in the view of the sample?
You can attach some screenshots to your comments to show us what you are trying to achieve.
Level 2
Author
@Alex

Here are the screenshots.
Dataiker
Could you please share the job diagnosis after you run the recipe? You can download it in the page of the job, under Actions > Download job diagnosis.
Level 2
Author
Sorry, I downloaded the job diagnosis but I don't know how to share it.
Dataiker
You can use any file transfer service you want, for instance WeTransfer.
Level 2
Author
Unfortunately I can't use those websites; they are all blocked at the company where I'm doing my internship.
Dataiker
Can you send it as an email attachment to my address (alexandre.combessie@dataiku.com)? Side-note: if your company has subscribed to Dataiku, you can also contact our official support https://support.dataiku.com. The website answers.dataiku.com is meant for community support.
Dataiker

Hello,



Thanks for the diagnosis. After investigation, the issue appears to be a lowercase/uppercase discrepancy between your original Parquet file and the Hive table. Your input dataset was generated manually as a Parquet file with the column name "MANDT" (uppercase), and was then imported from Hive into DSS. However, Hive always converts column names to lowercase, so DSS saw the column as "mandt", which does not match the name stored in the original Parquet file. As of today we cannot detect this type of case automatically.
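
To see why a case mismatch makes every condition fail, here is a toy sketch (not DSS internals): the rows carry the column name exactly as stored in the Parquet file, while the filter looks it up with the Hive-normalized lowercase name, so a case-sensitive lookup never finds the column:

```python
# Toy rows keyed the way the Parquet file stores the column (uppercase).
rows = [{"MANDT": "100"}, {"MANDT": "100"}, {"MANDT": "100"}]

def filter_rows(rows, column, value):
    # Case-sensitive lookup: a missing column never matches, which is
    # why "equals", "is different from" and "is defined" all return 0 rows.
    return [r for r in rows if r.get(column) == value]

print(len(filter_rows(rows, "mandt", "100")))  # 0 - column name mismatch
print(len(filter_rows(rows, "MANDT", "100")))  # 3 - names agree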



The preferred solution is to generate Parquet files with lowercase column names only, so that they are compatible with Hive (and Impala as well).



If that is not possible, you may try changing the recipe engine from DSS to Hive. In fact, for large datasets it is recommended to use a Hadoop-related engine (Spark, Hive or Impala): you gain performance by pushing the computation down to your Hadoop cluster instead of streaming the data through DSS.



Cheers,

Alex

Level 2
Author
Hello Alex,
Thank you so much for your help, I appreciate it!