Error while running a script step When processing 'ArrayFold', the first 1 rows already used 1024 MB

chi_wong

I'm getting this error when parsing the output of an API call, which comes back as one long JSON string.

Changing the sample size doesn't help in this case, because the dataset is just that one JSON string.

Is there any way to get around this error/warning?

The output is expected to be around 10k rows and 5 columns.


Operating system used: Windows


Best Answer

  • chi_wong
    edited July 2024

    Just a quick follow-up on how I was able to move forward. The 1024 MB error seems to be a limitation of the way the ArrayFold and Split and Fold functions are implemented.

    I found that changing my input dataset from a single row with 10k embedded rows to 10 rows with 1k embedded rows each produced the same error, because it still hit the memory limit.

    To move forward, I sent the single JSON column to a Python recipe and copied a function called "tidy_split" from Stack Overflow to break out the rows.

    Since I have no Python experience, I wanted to minimize coding as much as possible, so this snippet is mostly what Dataiku generates in the initial coding window:

    import dataiku

    # NOTE: tidy_split (the function copied from Stack Overflow) must be
    # defined above this point in the recipe; it is not part of Dataiku.

    # Read recipe inputs
    EQL_Results_testParse = dataiku.Dataset("EQL_Results_testParse")
    EQL_Results_testParse_df = EQL_Results_testParse.get_dataframe()


    # Compute recipe outputs from inputs
    # Split the "testOut" column on '[' so that each embedded row gets its own row
    ParsedValues_df = tidy_split(EQL_Results_testParse_df, "testOut", sep='[')


    # Write recipe outputs
    ParsedValues = dataiku.Dataset("ParsedValues")
    ParsedValues.write_with_schema(ParsedValues_df)
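
For readers who don't have the Stack Overflow snippet to hand, here is a minimal sketch of the idea behind a function like "tidy_split": take each input row, split one column on a separator, and emit one output row per piece. This is not the original Stack Overflow code. It is a plain-Python illustration that works on a list of dicts instead of a Pandas dataframe, and the name `tidy_split_records`, the `id` column, and the sample JSON text are all made up for the example.

```python
def tidy_split_records(records, column, sep="|", keep=False):
    """Return one record per piece of record[column] after splitting on sep.

    records: list of dicts, one dict per input row (hypothetical stand-in
             for a dataframe)
    column:  key whose value is split on sep
    keep:    if True, also keep the unsplit value as its own row
    """
    out = []
    for record in records:
        value = record.get(column)
        if value is None:
            continue  # nothing to split in this row
        pieces = str(value).split(sep)
        if keep and len(pieces) > 1:
            out.append(dict(record))
        for piece in pieces:
            new_record = dict(record)   # copy the other columns unchanged
            new_record[column] = piece  # replace the split column with one piece
            out.append(new_record)
    return out


# Hypothetical input: a single row holding a JSON array of arrays as text.
rows = [{"id": 1, "testOut": '[[1,"a"],[2,"b"],[3,"c"]]'}]
exploded = tidy_split_records(rows, "testOut", sep="[")

# Splitting on '[' leaves empty strings for the leading brackets; drop them.
fragments = [r["testOut"] for r in exploded if r["testOut"]]
print(fragments)
```

Each fragment still carries trailing brackets and commas, which is why a later cleanup step is needed. In a recipe, something like `df.to_dict("records")` on the way in and `pd.DataFrame(...)` on the way out should bridge this to a dataframe, though that wiring is untested inside DSS.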

    The resulting output data is passed to a Prepare step, where the rows are split into columns using standard processors. I could have split out the columns in Python, but I wanted to keep the code minimal and readable for the next person.
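
For anyone who would rather do the column split in Python as well, the sketch below shows one possible approach. It is a different technique from the string splitting above: it parses the text with Python's standard `json` module. It assumes the API payload is a JSON array of arrays, which is only a guess based on the `'['` separator. The function name `rows_from_json`, the column names, and the sample payload are all hypothetical.

```python
import json


def rows_from_json(text, column_names):
    """Parse a JSON array of arrays into a list of row dicts.

    ASSUMPTION: text looks like '[[v1, v2, ...], [v1, v2, ...], ...]'.
    The real API payload may be shaped differently.
    """
    parsed = json.loads(text)
    rows = []
    for values in parsed:
        if len(values) != len(column_names):
            raise ValueError(
                "expected %d values, got %d" % (len(column_names), len(values))
            )
        # Pair each value with its column name to build one output row
        rows.append(dict(zip(column_names, values)))
    return rows


# Hypothetical payload and column names, for illustration only.
payload = '[[1, "a", 2.5], [2, "b", 3.5]]'
rows = rows_from_json(payload, ["id", "label", "score"])
print(rows)
```

A list of dicts like this can be passed to `pd.DataFrame(rows)` before writing the output dataset. The trade-off is the one mentioned above: more of the logic lives in code and less in visual Prepare steps.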
