Is it possible to run a Python recipe in a loop?

rrodr244 · Registered Posts: 2
Hi folks,
I have the following flow:
[screenshot of the Flow]

  • medinfo_dataset has 26,131 records, which is not that many.
I'm writing Python code for feature extraction from text using an NLP approach, but I'm hitting memory errors at some points because of the matrix sizes. So my question is:
  • Is there a way to create a loop between my Python recipe and medinfo_dataset so that the Python code runs on pieces of the dataset? For instance: first run the Python code on the first 5,000 records of medinfo_dataset and store the output data frame, then run it on the next 5,000 records and concatenate the result to the same output data frame, and so forth?

Another important point: I'm not having any problem reading the data frame. My problem is with the transformation, since I'm working with matrices and that is what generates the memory errors. What I need is to execute all of my code, but on small pieces of the data, i.e. to run the recipe in a loop and execute my code on one piece of the dataset at a time.
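Conceptually, what I have in mind is roughly this (a plain pandas sketch; extract_features and medinfo_df are just placeholder names for my transformation and the input data frame):

    import pandas as pd

    chunk_size = 5000
    output_chunks = []

    for start in range(0, len(medinfo_df), chunk_size):
        # Take one slice of at most 5000 records and transform only that slice
        chunk_df = medinfo_df.iloc[start:start + chunk_size]
        output_chunks.append(extract_features(chunk_df))

    # Concatenate the per-chunk results into a single output data frame
    output_df = pd.concat(output_chunks, ignore_index=True)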

Thanks

Best Answer

  • Clément_Stenac · Dataiker, Dataiku DSS Core Designer, Registered Posts: 753
    edited July 17 · Answer ✓

    Hi,

    You need to use the chunked reading/writing capability of the Dataiku API, described in https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas

    That of course supposes that you can indeed apply your transformation independently to small pieces of data, which is not always possible.
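    For illustration (assuming a hypothetical text column), a transformation that only looks at each row separately can safely be applied chunk by chunk, while one that needs the whole corpus at once, such as fitting a TF-IDF vocabulary, cannot:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def row_wise(chunk_df):
        # Only uses values inside each row, so it gives the same result
        # whether it sees the full dataset or a 5000-record chunk
        return chunk_df.assign(text_length=chunk_df["text"].str.len())

    def corpus_wide(full_df):
        # Builds a vocabulary from the whole corpus, so applying it
        # chunk by chunk would give inconsistent feature matrices
        return TfidfVectorizer().fit_transform(full_df["text"])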

    The basic usage would be:

    import dataiku

    input_dataset = dataiku.Dataset("input")
    output_dataset = dataiku.Dataset("output")

    with output_dataset.get_writer() as writer:
        for input_chunk_df in input_dataset.iter_dataframes(5000):
            # input_chunk_df is a dataframe containing just a
            # chunk of data from input_dataset, with at most 5000 records

            # Process the data here ...
            output_chunk_df = big_processing_function(input_chunk_df)

            # Append the processed chunk to the output
            writer.write_dataframe(output_chunk_df)

    However, this usage supposes that you have already set the schema on output_dataset. If you need to set the output_dataset schema, it's a bit more complex. An important thing to note is that you need to set the schema before opening the writer, so the code looks like:

    import dataiku

    input_dataset = dataiku.Dataset("input")
    output_dataset = dataiku.Dataset("output")

    first_chunk = True
    writer = None

    for input_chunk_df in input_dataset.iter_dataframes(5000):
        # input_chunk_df is a dataframe containing just a
        # chunk of data from input_dataset, with at most 5000 records

        # Process the data here ...
        output_chunk_df = big_processing_function(input_chunk_df)

        if first_chunk:
            # This is the first chunk, so first set the schema and open
            # the writer
            output_dataset.write_schema_from_dataframe(output_chunk_df)
            writer = output_dataset.get_writer()
            first_chunk = False

        # Append the processed chunk to the output
        writer.write_dataframe(output_chunk_df)

    # Very important: when not using "with", you must explicitly close the writer
    writer.close()
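
    A minimal variant of the same loop, using the same API calls as above, in case big_processing_function can fail partway through: a try/finally block makes sure the writer is always closed:

    import dataiku

    input_dataset = dataiku.Dataset("input")
    output_dataset = dataiku.Dataset("output")

    writer = None
    try:
        for input_chunk_df in input_dataset.iter_dataframes(5000):
            output_chunk_df = big_processing_function(input_chunk_df)

            if writer is None:
                # First chunk: set the schema, then open the writer
                output_dataset.write_schema_from_dataframe(output_chunk_df)
                writer = output_dataset.get_writer()

            # Append the processed chunk to the output
            writer.write_dataframe(output_chunk_df)
    finally:
        # Close the writer even if the processing raised
        if writer is not None:
            writer.close()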
