Submit your innovative use case or inspiring success story to the 2023 Dataiku Frontrunner Awards! LET'S GO

Deleting oldest entries in dataset with "append instead of overwrite"

Solved!
deguodedongxi
Level 1
Deleting oldest entries in dataset with "append instead of overwrite"

I am collecting data from a datasource over a long period of time. The dataset is pretty big (10+ million rows).

Because I dont want to recreate the dataset every time, I am using "append instead of overwrite" to only add the last day of data.

At the same time I dont want my dataset to get even bigger, I want to remove the data of the oldest date. I cannot find a way to do this with visual recipes. I also tried overwriting the data with a python recipe, but I am not allowed to save data into the same dataset as I am reading from.

Thanks for your help!

1 Solution
AlexT
Dataiker

Hi @deguodedongxi,
You should be able to use a Python recipe with a dummy output, and you need to use ignore_flow=True when defining the dataset for writing

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
crm_history = dataiku.Dataset("crm_history")
df = crm_history.get_dataframe()
#replace 1000000 with the number of rows you want to keep
truncated_df = df.tail(1000000)
# Write recipe outputs
crm_history_out = dataiku.Dataset("crm_history", ignore_flow=True)
crm_history_out.write_with_schema(truncated_df)

View solution in original post

4 Replies
AlexT
Dataiker

Hi @deguodedongxi,
You should be able to use a Python recipe with a dummy output, and you need to use ignore_flow=True when defining the dataset for writing

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
crm_history = dataiku.Dataset("crm_history")
df = crm_history.get_dataframe()
#replace 1000000 with the number of rows you want to keep
truncated_df = df.tail(1000000)
# Write recipe outputs
crm_history_out = dataiku.Dataset("crm_history", ignore_flow=True)
crm_history_out.write_with_schema(truncated_df)

@AlexT 

i have a similar use case.  My dataset is being stored in an appended .CSV file based managed partitioned data set.  Every hour or so I add a partition of new or changed records. My data source is very slow. So what you are showing here might be useful. 

however this leaves me with a few questions:

  • It appears from the code that you are loading the entire dataset into an pandas data frame.  I’m concerned about the needed memory. I have 150 thousand records. Next year I’ll 200,000 additional records (total 350 thousand record). At some point I’m concerned about running out of memory
  • you pull a tail of the records. I’m not seeing in this example how you would control which records are being trimmed. Particularly if the data is stored in a SQL database that does not  guarantee order usually?  Or is there some python pandas or Dataiku library magic I’m missing in this example. 
  • is there a way to trim the oldest partitions from a time partitioned managed dataset?  Particularly from a scenario. 

thanks for any insights you can share.  

--Tom
0 Kudos
AlexT
Dataiker

Hi @tgb417 ,

The tail was a quick example with the assumption that the logic on the rows to keep would-be customized, and the dataset would be sorted with ( sort_values()) 

Another example, if you want to keep rows with a date column > 1 year

 

import pandas as pd
from datetime import datetime, timedelta
#convert date if needed
df['date'] = pd.to_datetime(df['date'])
one_year_ago = datetime.now() - timedelta(days=365)
df = df[df['date'] >= one_year_ago]

 

If you write to CSV from DSS, you can preserve the order by selecting the option:
Screenshot 2023-05-17 at 7.31.53 AM.png

When reading from a SQL dataset, the order is not guaranteed if ORDER BY statement is not included.
In that case, you may want to use SQLExecutor2 and use ORDER BY.
https://doc.dataiku.com/dss/latest/python-api/sql.html#executing-queries-dataikuapi-variant

But if you are already dealing with SQL datasets, it's probably easier to use the SQL Script recipe to truncate the tables : 

TRUNCATE TABLE table_name WHERE date_column <= '2023-05-17';

If memory usage is a concern, you would need to stream the data using chunked reading and only write back the rows to keep. 
https://doc.dataiku.com/dss/latest/python-api/datasets-data.html#chunked-reading-and-writing-with-pa...

Thanks,

 


 

deguodedongxi
Level 1
Author

That worked perfectly! The dummy dataset seemed a bit weird, but as long as it works, fine with me 🙂 Maybe in future versions, it would be a nice feature to have some kind of macro to automize this step.

Thanks for your quick help! 

0 Kudos