Deleting oldest entries in dataset with "append instead of overwrite"

deguodedongxi
deguodedongxi Registered Posts: 4

I am collecting data from a datasource over a long period of time. The dataset is pretty big (10+ million rows).

Because I dont want to recreate the dataset every time, I am using "append instead of overwrite" to only add the last day of data.

At the same time I dont want my dataset to get even bigger, I want to remove the data of the oldest date. I cannot find a way to do this with visual recipes. I also tried overwriting the data with a python recipe, but I am not allowed to save data into the same dataset as I am reading from.

Thanks for your help!

Tagged:

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    edited July 17 Answer ✓

    Hi @deguodedongxi
    ,
    You should be able to use a Python recipe with a dummy output, and you need to use ignore_flow=True when defining the dataset for writing

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    crm_history = dataiku.Dataset("crm_history")
    df = crm_history.get_dataframe()
    #replace 1000000 with the number of rows you want to keep
    truncated_df = df.tail(1000000)
    # Write recipe outputs
    crm_history_out = dataiku.Dataset("crm_history", ignore_flow=True)
    crm_history_out.write_with_schema(truncated_df)
    

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @AlexT

    i have a similar use case. My dataset is being stored in an appended .CSV file based managed partitioned data set. Every hour or so I add a partition of new or changed records. My data source is very slow. So what you are showing here might be useful.

    however this leaves me with a few questions:

    • It appears from the code that you are loading the entire dataset into an pandas data frame. I’m concerned about the needed memory. I have 150 thousand records. Next year I’ll 200,000 additional records (total 350 thousand record). At some point I’m concerned about running out of memory
    • you pull a tail of the records. I’m not seeing in this example how you would control which records are being trimmed. Particularly if the data is stored in a SQL database that does not guarantee order usually? Or is there some python pandas or Dataiku library magic I’m missing in this example.
    • is there a way to trim the oldest partitions from a time partitioned managed dataset? Particularly from a scenario.

    thanks for any insights you can share.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    edited July 17

    Hi @tgb417
    ,

    The tail was a quick example with the assumption that the logic on the rows to keep would-be customized, and the dataset would be sorted with ( sort_values())

    Another example, if you want to keep rows with a date column > 1 year

    import pandas as pd
    from datetime import datetime, timedelta
    #convert date if needed
    df['date'] = pd.to_datetime(df['date'])
    one_year_ago = datetime.now() - timedelta(days=365)
    df = df[df['date'] >= one_year_ago]

    If you write to CSV from DSS, you can preserve the order by selecting the option:
    Screenshot 2023-05-17 at 7.31.53 AM.png

    When reading from a SQL dataset, the order is not guaranteed if ORDER BY statement is not included.
    In that case, you may want to use SQLExecutor2 and use ORDER BY.
    https://doc.dataiku.com/dss/latest/python-api/sql.html#executing-queries-dataikuapi-variant

    But if you are already dealing with SQL datasets, it's probably easier to use the SQL Script recipe to truncate the tables :

    TRUNCATE TABLE table_name WHERE date_column <= '2023-05-17';

    If memory usage is a concern, you would need to stream the data using chunked reading and only write back the rows to keep.
    https://doc.dataiku.com/dss/latest/python-api/datasets-data.html#chunked-reading-and-writing-with-pandas

    Thanks,


  • deguodedongxi
    deguodedongxi Registered Posts: 4

    That worked perfectly! The dummy dataset seemed a bit weird, but as long as it works, fine with me Maybe in future versions, it would be a nice feature to have some kind of macro to automize this step.

    Thanks for your quick help!

Setup Info
    Tags
      Help me…