How to perform analysis in a Python notebook on part of a dataset without loading the whole dataset in memory

UserBird Dataiker
Hi,

I have a dataset representing transactions for all the products.

I would like to loop over the products in a Python notebook: for each product, load its transactions, perform the analysis, and write the results to a dataset.

How can I load only a partition of my dataset from a Python notebook?

Thanks in advance

Accepted Solutions
jereze Dataiker

Hi,

I assume you use the get_dataframe() method and then work with a pandas dataframe. (Let me know if you do something different.)

Here is what you can do:

1) Get only a sample of the dataset with my_dataset.get_dataframe(sampling='head', limit=10000)

2) Load the dataset in chunks with my_dataset.iter_dataframes(chunksize=10000)

import dataiku

my_dataset = dataiku.Dataset("name_dataset")
for partial_dataframe in my_dataset.iter_dataframes(chunksize=10000):
    # Insert here the application logic for each partial dataframe.
    pass
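
To make the chunked approach more concrete for the per-product use case, here is a minimal sketch of what the logic inside the loop could look like. It is only an illustration under assumptions: the column names product_id and amount, the dataset names transactions and product_totals, and the helper functions themselves are hypothetical and not taken from your project. The helpers work on any iterable of pandas dataframes, so they can be tried outside DSS with small in-memory chunks.

```python
import pandas as pd


def rows_for_product(chunks, product, product_col="product_id"):
    """Collect the rows of one product from an iterable of dataframe chunks."""
    parts = [chunk[chunk[product_col] == product] for chunk in chunks]
    if not parts:
        return pd.DataFrame()
    return pd.concat(parts, ignore_index=True)


def totals_per_product(chunks, product_col="product_id", amount_col="amount"):
    """Sum amount_col per product in a single pass over the chunks."""
    totals = None
    for chunk in chunks:
        partial = chunk.groupby(product_col)[amount_col].sum()
        # Align on product and treat products missing on one side as 0.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    if totals is None:
        return pd.DataFrame(columns=[product_col, amount_col])
    return totals.rename(amount_col).rename_axis(product_col).reset_index()


def run_in_dataiku():
    """Hypothetical wiring inside DSS; the dataset names are placeholders."""
    import dataiku  # only importable inside a Dataiku environment

    source = dataiku.Dataset("transactions")
    result = totals_per_product(source.iter_dataframes(chunksize=10000))
    dataiku.Dataset("product_totals").write_with_schema(result)


# Small in-memory stand-in for the chunks, to try the helpers outside DSS.
demo_chunks = [
    pd.DataFrame({"product_id": ["A", "B"], "amount": [10.0, 5.0]}),
    pd.DataFrame({"product_id": ["A", "C"], "amount": [2.5, 1.0]}),
]
demo_totals = totals_per_product(demo_chunks)
demo_rows_a = rows_for_product(demo_chunks, "A")
```

Note that looping over products and calling rows_for_product for each one reads the whole dataset once per product. If the analysis can be expressed as an aggregation, a single pass such as totals_per_product keeps only one chunk plus the running totals in memory.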


You can read more in the documentation.

Jeremy, Product Manager at Dataiku
