unintended data filtering from Prepare Recipe

me2
me2 Registered Posts: 48 ✭✭✭✭✭

Using a simple prepare recipe that does no filtering, I noticed some unintentional data loss/filtering. Very small but loss that skewed some results.

Has anyone else noticed this?


Operating system used: Windows

Tagged:

Best Answer

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭
    Answer ✓

    @tgb417
    @Turribeach

    Our internal team was able to show me details on the loss of records using the logs. The root cause was the data type and how each engine interacted with that column.

    So my options were 1) lock the data type at each step in the flow that could delete records which seemed to be primarily when parsing dates or 2) remove the column at the beginning of the flow.


    The solution of setting the data type in the explore window helped but didn't propagate upstream. The challenge with locking the data type was time-required and redo loop. Having to lock at each step is time consuming. Since the explore window only shows a sample, that sample might not detect an issue with the data type that results in filtering rows.


    You can use the data logs but caution using the "summary", it will give you a sample but not all instances. You need to go through the logs and search "Unable to write row". That will tell you a row was filtered and the column that caused it. When you have 20k+ rows, you get bored of searching so you run it again... oh now 10k rows, search a few more. Run it again and again and again. Eventually you get 0 filtering for that engine.


    This is definitely an opportunity for Dataiku to improve and simply the process.


    Thank you all for your help with this. I learned a great deal and I hope you did to.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,896 Neuron

    Unlikely to be the case. Go the input dataset, Status tab and update/compute the row count. Then go to the output dataset and do the same. Share the results.

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭
  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭

    That is what I thought at first

    Disabling all other steps in recipe but the ones listed above.

    Before
    1,638,836

    After
    1,636,971

    Delta
    1,865

    I then created a query to identify those 1,865 records and there isn't anything unique with the date. Those dates are parsing.

    I also tried to increase the sample size of the prepare recipe to see if I can see a negative in the steps but the % loss is so small that I time out limits before I see any filtering. I don't know how to read the logs to see what is causing the filtering.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,896 Neuron

    Yopur 3 steps shouldn't filter data out. Can you confirm that if you disable the steps the row counts are the same? Can you post screen shots of the Status counts from the input and output datasets?

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭

    Disabling all other steps in recipe but the ones listed in screenshot.

    Before
    1,638,836

    After
    1,636,971

    Delta
    1,865

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭

    I created a brand new prepare recipe with only 1 step... parse date. I picked the recommended parse.

    Started records

    1,638,836

    After prepare recipe

    1,636,875

    Weird, I know.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @me2

    As I was reading this thread an additional thought came to me. If you have not already done so, I would encourage you to also submit a support ticket on this issue. The support team has access to some tools to both diagnose and try to reproduce this kind of situation. They may also be able to provide a more definitive answer to your question. If this turn out to be an actual defect, it would be great to get this reported so that It can be resolved.

    If you choose to submit a support ticket and get a useful result, I’d like to invite you to share your findings here for folks who might find this thread in the future.

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭

    @tgb417
    @Turribeach

    Thank you, I think I will submit a ticket to understand why and I since solved the problem.

    After creating the stand alone recipe I ran the recipe with several engines and the results are interesting.

    First to start, all input files are stored in 'hive_write' with format 'Parquet'.

    Input records 1,658,602

    Output 1 using engine 'Local Stream' . Out records 1,636,875 . Duration 02m06s

    Output 2 using 'Hadoop MapReduce (depracted)'. Output records 1,656,480. Duration 3m22s.

    Output 3 using 'Spark'. Output records 1,658,602. Duration 4m13s.

    Using the 'local stream' (it was the default) had a loss of 1.3% loss in records but it was quickest.

    Using 'Hadoop...' had a loss of 0.12% and second quickest.

    Spark had no record loss with longest time.

    All other engines were not available for me to use. Including 'Hive' due to 'DateParser'.

    For this case, I need 0 record loss and the extra 2m is worth the price. 'Spark' for the win!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @me2

    Have you or your team submited a support ticket based on what you have discovered.

    Hopefully they can resolve this before others run into a similar issue.

  • me2
    me2 Registered Posts: 48 ✭✭✭✭✭

    Tom, they did not because the issue was found to be data type and unintended filtering was captured in the logs.

    I do think Dataiku needs to enhance how Dataiku deals with data type and maybe how unintended data filtering is shown (buried in logs). Maybe have a page on how to read data logs to identify common problems?

    Thank you.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @me2

    One of the other things I'd invite you to do is add a post on this topic on the product idea pages. You can find them here.

    https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas

Setup Info
    Tags
      Help me…