Is it possible to run a Python recipe in a loop?

rrodr244 · April 2020

Hi folks,

I have the following flow:

medinfo_dataset has 26131 records, which is not that many.

I'm writing Python code for feature extraction from text using an NLP approach, but I'm getting memory errors at some points due to the matrix size. So my question is:

Is there a way to create a loop between my Python recipe and medinfo_dataset so that the Python code runs on pieces of the dataset? For instance: first, run the Python code on the first 5000 records of medinfo_dataset and store the output data frame. Then run it on the next 5000 records, concatenate the result to the same output data frame, and so forth?

Another important point: I have no problem reading the data frame. My problem is with the transformation, because I'm using matrices and that is what generates the memory errors. So what I need is to execute all of my code, but on small pieces of data. It's like putting the recipe in a loop and executing my code once per piece of the dataset.

Tks

Clément_Stenac · April 2020

Hi,

You need to use the chunked reading/writing capability of the Dataiku API described in https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas

That of course supposes that you can indeed apply your transformation independently to small pieces of data, which is not always possible.

The basic usage would be:

import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

with output_dataset.get_writer() as writer:
    for input_chunk_df in input_dataset.iter_dataframes(chunksize=5000):
        # input_chunk_df is a dataframe containing just a
        # chunk of data from input_dataset, with at most 5000 records

        # Process the data here ...
        # (big_processing_function stands for your own transformation)
        output_chunk_df = big_processing_function(input_chunk_df)

        # Append the processed chunk to the output
        writer.write_dataframe(output_chunk_df)

However, this usage assumes that the schema is already set on output_dataset. If you need to set the output_dataset schema from your code, it's a bit more complex. An important thing to note is that you need to set the schema before opening the writer, so the code looks like:

import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

writer = None

try:
    for input_chunk_df in input_dataset.iter_dataframes(chunksize=5000):
        # input_chunk_df is a dataframe containing just a
        # chunk of data from input_dataset, with at most 5000 records

        # Process the data here ...
        output_chunk_df = big_processing_function(input_chunk_df)

        if writer is None:
            # This is the first chunk, so first set the schema and open
            # the writer (only once, not for every chunk)
            output_dataset.write_schema_from_dataframe(output_chunk_df)
            writer = output_dataset.get_writer()

        # Append the processed chunk to the output
        writer.write_dataframe(output_chunk_df)
finally:
    # Very important, when not using "with", you must explicitly close
    if writer is not None:
        writer.close()
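
On the caveat above that the transformation must apply independently to small pieces of data: one way to check this before moving to chunked reading is to compare the chunked result with the all-at-once result on a small sample. The following is only a sketch using the Python standard library, not the Dataiku API. The names hashed_bag_of_words, process_chunk, iter_chunks, N_BUCKETS and the sample texts are all hypothetical, and feature hashing is used purely as an example of a text transformation that needs no vocabulary fitted on the whole dataset.

```python
import hashlib

N_BUCKETS = 16  # hypothetical feature-vector width, chosen for illustration


def hashed_bag_of_words(text, n_buckets=N_BUCKETS):
    """Turn one text into a fixed-width token-count vector.

    Each token is hashed into a bucket, so no vocabulary has to be
    learned from the full dataset: every row is processed on its own.
    """
    vector = [0] * n_buckets
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector


def process_chunk(texts):
    """Stand-in for the per-chunk processing function."""
    return [hashed_bag_of_words(text) for text in texts]


def iter_chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up sample rows, only for the check below
texts = [
    "aspirin dosage for adults",
    "side effects of aspirin",
    "ibuprofen dosage",
    "is ibuprofen safe",
]

# Process everything at once, then again in chunks of 2 rows
all_at_once = process_chunk(texts)
chunked = []
for chunk in iter_chunks(texts, 2):
    chunked.extend(process_chunk(chunk))

# If the transformation is chunk-independent, both results match
print(chunked == all_at_once)
```

A transformation that depends on statistics of the whole dataset (for example, TF-IDF weights, which use document frequencies over all rows) would generally not pass this check unchanged, and would need those statistics computed in a separate pass first.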

rrodr244 · April 2020

Many thanks! It worked!
