Is it possible to run a Python recipe in a loop?
- medinfo_dataset has 26,131 records, which is not that many.
- Is there a way to create a loop between my Python recipe and medinfo_dataset so that the Python code runs on pieces of the dataset? For instance: first run the Python code on the first 5,000 records of medinfo_dataset and store the output data frame, then run it on the next 5,000 records and concatenate the result to the same output data frame, and so forth.
Another important point: I'm not having any problem reading the data frame. My problem is with the transformation, because I'm working with matrices and it generates memory errors. So what I need is to execute all of my code, but on small pieces of the data, as if the recipe were run in a loop over pieces of the dataset.
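A minimal sketch of the pattern I'm describing, using plain pandas slicing (big_processing_function just stands in for my actual transformation, and the dataset name is mine):

import pandas as pd
import dataiku

def big_processing_function(chunk_df):
    # Placeholder for the matrix-heavy transformation applied to one piece
    return chunk_df

# Reading the whole dataframe is not the problem, only the transformation is
df = dataiku.Dataset("medinfo_dataset").get_dataframe()

# Process 5,000 records at a time and concatenate the results
pieces = []
for start in range(0, len(df), 5000):
    chunk_df = df.iloc[start:start + 5000]
    pieces.append(big_processing_function(chunk_df))

output_df = pd.concat(pieces, ignore_index=True)

(Note that concatenating everything back into one data frame still keeps the full result in memory; the accepted answer below writes each processed chunk directly to the output dataset instead.)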
Tks
Best Answer
Hi,
You need to use the chunked reading/writing capability of the Dataiku API, described at https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
That of course supposes that you can indeed apply your transformation independently to small pieces of data, which is not always possible.
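For example (a small illustration with made-up column names, not from your dataset): a row-wise transformation gives the same result whether it runs on the whole dataset or chunk by chunk, while anything that depends on a global statistic does not:

import pandas as pd

# Hypothetical chunk with made-up columns, just for illustration
chunk_df = pd.DataFrame({"question": ["What is aspirin?", "Ibuprofen dosage?"],
                         "score": [3.0, 5.0]})

# Chunk-safe: each row is transformed independently, so processing the
# dataset in pieces gives the same result as processing it whole
chunk_df["question_length"] = chunk_df["question"].str.len()

# Not chunk-safe: this mean only covers the current chunk, so the result
# differs from what you would get on the full dataset
chunk_df["score_minus_mean"] = chunk_df["score"] - chunk_df["score"].mean()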
The basic usage would be:
import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

with output_dataset.get_writer() as writer:
    for input_chunk_df in input_dataset.iter_dataframes(5000):
        # input_chunk_df is a dataframe containing just a chunk of data
        # from input_dataset, with at most 5000 records

        # Process the data here
        output_chunk_df = big_processing_function(input_chunk_df)

        # Append the processed chunk to the output
        writer.write_dataframe(output_chunk_df)
However, this usage supposes that you have already set the schema on output_dataset. If you need to set the output_dataset schema, it's a bit more complex. An important thing to note is that you need to set the schema before opening the writer, so the code looks like:
import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

first_chunk = True
writer = None

for input_chunk_df in input_dataset.iter_dataframes(5000):
    # input_chunk_df is a dataframe containing just a chunk of data
    # from input_dataset, with at most 5000 records

    # Process the data here
    output_chunk_df = big_processing_function(input_chunk_df)

    if first_chunk:
        # This is the first chunk, so first set the schema and open the writer
        output_dataset.write_schema_from_dataframe(output_chunk_df)
        writer = output_dataset.get_writer()
        first_chunk = False

    # Append the processed chunk to the output
    writer.write_dataframe(output_chunk_df)

# Very important: when not using "with", you must explicitly close the writer
writer.close()
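As a defensive variant (not required, just a sketch using the same names and placeholder processing function as above), you can wrap the loop in try/finally so the writer is closed even if the processing raises an exception partway through:

import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

writer = None
try:
    for input_chunk_df in input_dataset.iter_dataframes(5000):
        # big_processing_function is your own per-chunk transformation
        output_chunk_df = big_processing_function(input_chunk_df)
        if writer is None:
            # First chunk: set the schema, then open the writer
            output_dataset.write_schema_from_dataframe(output_chunk_df)
            writer = output_dataset.get_writer()
        writer.write_dataframe(output_chunk_df)
finally:
    if writer is not None:
        # Closed even if big_processing_function raised an exception
        writer.close()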
Answers
Many thanks! It worked!