Fast appending a dataset into a historical identical schema

voryzover
Level 1

Hello,

I am trying to append a dataset that is built weekly into a historical dataset with an identical schema that contains years' worth of data. Let's call the first one data1w and the historical one datahist.

Dataiku will not accept two input datasets (data1w and datahist) when the output is datahist (i.e. the same as one of the inputs). I know about the sync recipe solution in append mode, but that is not fast enough to regenerate the history. I need the ability to generate datahist quickly.

This can be with an existing visual recipe or a Python recipe.

I appreciate any help in advance. Thanks!

5 Replies
AlexT
Dataiker

You could use a partitioned dataset. This would allow you to execute multiple partitions in parallel to build out your initial dataset, and then you can run just the last week of new data each time in your sync recipe.
Where is your historical data stored currently? Is it on cloud storage or in a database? Do you have a partitioning column (e.g. a date) you can use?

If you want to use Python, you can use ignore_flow=True to essentially read and write the same dataset in the Python recipe, but if your datahist dataset is huge it may not fit into memory and you may need to use chunked reading and writing. This may not prove very fast:
https://developer.dataiku.com/latest/concepts-and-examples/datasets/datasets-data.html#chunked-readi...
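
For illustration, here is a minimal sketch of that approach using the dataset names from the question. It assumes data1w is the only declared recipe input, that datahist already has its schema defined, and that datahist is configured to append rather than overwrite:

import dataiku

# Weekly dataset is a normal flow input
new_data = dataiku.Dataset("data1w")

# ignore_flow=True lets the recipe write datahist even though it cannot be
# declared as both input and output of the recipe in the flow
hist = dataiku.Dataset("datahist", ignore_flow=True)

# Append the weekly rows in chunks so nothing large has to fit into memory
with hist.get_writer() as writer:
    for chunk in new_data.iter_dataframes(chunksize=100000):
        writer.write_dataframe(chunk)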

If you can leverage partitioning instead, https://doc.dataiku.com/dss/latest/partitions/index.html, it may be a better option.

Thanks,

tgb417

@AlexT ,

I have been using partitioned datasets for the past 6 months. At this point I have several hundred partitions, stored on the local file system, partitioned hourly, in append mode. What I've discovered is that it can take a while to spin up a recipe that looks at all of the data across the partitions (total record count ~5 million). I'm wondering if what I'm doing is somehow unexpected or uninformed, and if there is a better way to work with partitioned datasets. For example, using a SQL server? (If one can prune the dataset; old copies of data in my partitioned dataset become obsolete over time.)

--Tom
AlexT
Dataiker

Hi @tgb417,
Can you elaborate on what you mean by taking a "while"? Partitioning hourly on a 5-million-row dataset may be a bit too expensive; daily partitions instead would mean fewer files to list, which would improve the performance of the listing step.


A partitioned SQL dataset can be an option. With SQL partitioning you still have a single table, but when DSS interacts with this table it simply limits the query, e.g. SELECT * FROM dataset WHERE partition = ...
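
As a small illustration (dataset name and partition identifier are placeholders), reading a single partition from Python outside the flow looks roughly like this; as a recipe input, DSS selects the partitions for you:

import dataiku

# Read only one partition of the partitioned dataset
hist = dataiku.Dataset("datahist", ignore_flow=True)
hist.add_read_partitions("2024-01-15")  # placeholder partition identifier
df = hist.get_dataframe()               # DSS limits the query to that partition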

You can definitely prune your input using SQL to delete old rows, e.g. DELETE FROM table_name WHERE condition ..., or prune an intermediate partitioned dataset by clearing old partitions: https://developer.dataiku.com/latest/concepts-and-examples/datasets/datasets-other.html#clearing-dat...
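
For example, a rough sketch following the clearing example in the linked page (partition identifiers are placeholders):

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
ds = project.get_dataset("datahist")

# Inspect existing partitions, then drop the obsolete ones
print(ds.list_partitions())
ds.clear(partitions="2023-01-01,2023-01-02")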




tgb417

@AlexT 

I have a very slow data source that will give me only 1 record per second.

I can have tens of thousands of records changed per day.

So sometimes I need to be pulling records around the clock to keep up with the changes in the data source.

Data retrieval from this data source can go wrong from time to time, so part of the reason for the partitions is to avoid losing lots of data when there are data retrieval challenges. That is why I've been using these smaller partitions.

I'm not at my DSS computer at the moment, so I'm still going to be a bit hand-wavy about "a while": a scenario with a file-system-based partitioned dataset starts and does not actually kick off any flow steps for ~2 to 4 minutes.

--Tom
Turribeach

@tgb417 Tom, one solution to your problem could be to split your data into two datasets: a historical dataset partitioned daily and an intraday dataset partitioned hourly. This would allow you to keep loading the data hourly efficiently, as you do now, while reducing the cost of historical queries. All you might need to add to your flow is a reload of the historical dataset's "today" partition, triggered every time the intraday dataset has new rows added to it.
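
If it helps, a Python scenario step doing that reload could look roughly like the sketch below (dataset name and partition format are assumptions, and it relies on DSSDataset.build() being available in your version of the API); the same thing can also be done with a visual "Build" scenario step using the CURRENT_DAY partition keyword:

import dataiku
from datetime import date

client = dataiku.api_client()
project = client.get_default_project()

# Rebuild only today's daily partition of the historical dataset
today = date.today().strftime("%Y-%m-%d")
project.get_dataset("datahist").build(partitions=today)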
