unintended data filtering from Prepare Recipe

Solved!
me2
Level 3

Using a simple prepare recipe that does no filtering, I noticed some unintentional data loss/filtering. It is very small, but enough of a loss to skew some results.

Has anyone else noticed this?


Operating system used: Windows

0 Kudos
12 Replies
Turribeach

Unlikely to be the case. Go to the input dataset's Status tab and update/compute the row count. Then go to the output dataset and do the same. Share the results.
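
If it helps, here is a rough sketch of the same check from a DSS Python notebook (the dataset names are placeholders for your own):

# Sketch: compare row counts of the input and output datasets from a
# DSS Python notebook. Dataset names below are placeholders.
import dataiku

def count_rows(dataset_name):
    # iter_rows() streams the dataset, so this also works when the data
    # is too large to load into memory as a DataFrame
    return sum(1 for _ in dataiku.Dataset(dataset_name).iter_rows())

input_count = count_rows("my_input_dataset")
output_count = count_rows("my_output_dataset")
print(f"input:  {input_count:,}")
print(f"output: {output_count:,}")
print(f"delta:  {input_count - output_count:,}")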

0 Kudos
me2
Level 3
Author

That is what I thought at first 😀

Disabling all other steps in the recipe except the ones listed above:

Before: 1,638,836
After: 1,636,971
Delta: 1,865

I then created a query to identify those 1,865 records and there isn't anything unusual about their dates. Those dates parse fine.

I also tried increasing the sample size of the prepare recipe to see if I could spot the drop in the steps, but the % loss is so small that I hit the time-out limits before I see any filtering. I don't know how to read the logs to see what is causing the filtering.
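
For anyone trying the same comparison, here is a rough sketch of how it could be done in a Python notebook (the dataset and key column names are placeholders, and it assumes both datasets fit in memory):

# Sketch: find the rows present in the input but missing from the output.
# Assumes a unique key column and that both datasets fit in memory.
import dataiku

key = "record_id"  # placeholder for your unique key column
df_in = dataiku.Dataset("my_input_dataset").get_dataframe()
df_out = dataiku.Dataset("my_output_dataset").get_dataframe()

# Anti-join: keep the input rows whose key does not appear in the output
missing = df_in[~df_in[key].isin(df_out[key])]
print(len(missing))      # should match the delta (1,865 here)
print(missing.head(20))  # inspect a few of the dropped records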

0 Kudos
me2
Level 3
Author

I created a brand new prepare recipe with only 1 step... parse date.  I picked the recommended parse.

Started records: 1,638,836

After prepare recipe: 1,636,875

Weird, I know.
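
One way to sanity-check whether the parse step itself should drop anything is to try parsing the raw column outside the recipe, e.g. in a notebook (the dataset name, column name, and date format below are placeholders):

# Sketch: count values in the raw date column that fail to parse.
import dataiku
import pandas as pd

# infer_with_pandas=False keeps the raw string values as stored
df = dataiku.Dataset("my_input_dataset").get_dataframe(infer_with_pandas=False)
parsed = pd.to_datetime(df["my_date_column"], format="%Y-%m-%d", errors="coerce")
print("unparseable values:", parsed.isna().sum())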

0 Kudos
me2
Level 3
Author

reposted

0 Kudos
Turribeach

Your 3 steps shouldn't filter data out. Can you confirm that if you disable the steps the row counts are the same? Can you post screenshots of the Status counts from the input and output datasets?

0 Kudos
me2
Level 3
Author

Disabling all other steps in the recipe except the ones listed in the screenshot:

Before: 1,638,836
After: 1,636,971
Delta: 1,865

0 Kudos
tgb417

@me2 

As I was reading this thread, an additional thought came to me. If you have not already done so, I would encourage you to also submit a support ticket on this issue. The support team has access to tools to both diagnose and try to reproduce this kind of situation, and they may be able to provide a more definitive answer to your question. If this turns out to be an actual defect, it would be great to get it reported so that it can be resolved.

If you choose to submit a support ticket and get a useful result, I'd like to invite you to share your findings here for folks who might find this thread in the future.

--Tom
0 Kudos
me2
Level 3
Author

@tgb417 @Turribeach 

Thank you. I think I will submit a ticket to understand why, and I have since solved the problem.

After creating the standalone recipe, I ran it with several engines and the results are interesting.

To start, all input files are stored in 'hive_write' with the 'Parquet' format.

Input records: 1,658,602

Output 1 using engine 'Local stream': output records 1,636,875, duration 02m06s.

Output 2 using 'Hadoop MapReduce (deprecated)': output records 1,656,480, duration 3m22s.

Output 3 using 'Spark': output records 1,658,602, duration 4m13s.

Using 'Local stream' (it was the default) had a 1.3% loss in records but was the quickest.

Using 'Hadoop MapReduce' had a 0.12% loss and was the second quickest.

'Spark' had no record loss but took the longest.

None of the other engines were available for me to use, including 'Hive' due to 'DateParser'.

For this case, I need 0 record loss and the extra 2m is worth the price.  'Spark' for the win!
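
For reference, the loss percentages above come straight from the counts; a quick way to reproduce them:

# Loss per engine relative to the 1,658,602 input records
input_rows = 1_658_602
outputs = [
    ("Local stream", 1_636_875),
    ("Hadoop MapReduce (deprecated)", 1_656_480),
    ("Spark", 1_658_602),
]
for engine, out_rows in outputs:
    lost = input_rows - out_rows
    print(f"{engine}: {lost:,} rows lost ({lost / input_rows:.3%})")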

me2
Level 3
Author

@tgb417 @Turribeach 
Our internal team was able to show me the details of the record loss using the logs. The root cause was the column's data type and how each engine handled that column.

So my options were to 1) lock the data type at each step in the flow that could drop records, which seemed to be primarily when parsing dates, or 2) remove the column at the beginning of the flow.
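
If anyone goes with option 2, here is a minimal sketch of a Python recipe that drops the column right at the start of the flow (the dataset and column names are placeholders):

# Sketch of a Python recipe: read the raw input, drop the problematic
# column, and write a trimmed dataset for the rest of the flow to use.
import dataiku

df = dataiku.Dataset("raw_input").get_dataframe(infer_with_pandas=False)
df = df.drop(columns=["problem_column"])

out = dataiku.Dataset("raw_input_trimmed")
out.write_with_schema(df)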


Setting the data type in the Explore window helped but didn't propagate upstream. The challenge with locking the data type is the time required and the redo loop: having to lock it at each step is time consuming, and since the Explore window only shows a sample, that sample might not surface a data type issue that ends up filtering rows.


You can use the job logs, but be cautious with the "summary": it gives you a sample, not all instances. You need to go through the logs and search for "Unable to write row". That will tell you a row was filtered and which column caused it. When you have 20k+ dropped rows, you get bored of searching, so you run it again... oh, now 10k rows, search a few more. Run it again and again and again, and eventually you get 0 filtering for that engine.
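
Here is a small sketch of the kind of log scan that saves the manual searching (the log file path is a placeholder; point it at the job log you downloaded from the job page, and note the wording of the surrounding line may vary):

# Sketch: tally "Unable to write row" messages in a downloaded job log.
matches = []
with open("job_log.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Unable to write row" in line:
            matches.append(line.strip())

print(f"{len(matches)} filtered-row messages found")
for line in matches[:20]:   # eyeball a few to see which column is blamed
    print(line)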


This is definitely an opportunity for Dataiku to improve and simplify the process.


Thank you all for your help with this. I learned a great deal and I hope you did too.

0 Kudos
tgb417

@me2 

Have you or your team submitted a support ticket based on what you have discovered?

Hopefully they can resolve this before others run into a similar issue.

--Tom
0 Kudos
me2
Level 3
Author

Tom, they did not, because the issue was found to be the data type, and the unintended filtering was captured in the logs.

I do think Dataiku needs to enhance how it deals with data types, and maybe how unintended data filtering is surfaced (it is buried in the logs). Maybe have a page on how to read the job logs to identify common problems?

Thank you.

0 Kudos
tgb417

@me2 

One of the other things I'd invite you to do is add a post on this topic on the product idea pages.  You can find them here.

https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas 

--Tom
0 Kudos