Another important point: I'm not having any problem reading the data frame. My problem is with the transformation, since I'm working with matrices and it generates memory errors. So what I need is to execute all my code, but on small pieces of data, as if the recipe were running in a loop over pieces of the dataset.
Thanks
Hi,
You need to use the chunked reading/writing capability of the Dataiku API described in https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
That of course supposes that you can indeed apply your transformation independently to small pieces of data, which is not always possible (a row-by-row computation chunks naturally, whereas anything that needs to see the whole dataset at once, such as a global sort or a groupby over all rows, does not).
The basic usage would be:
import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

with output_dataset.get_writer() as writer:
    for input_chunk_df in input_dataset.iter_dataframes(5000):
        # input_chunk_df is a dataframe containing just a chunk of data
        # from input_dataset, with at most 5000 records

        # Process the data here ...
        output_chunk_df = big_processing_function(input_chunk_df)

        # Append the processed chunk to the output
        writer.write_dataframe(output_chunk_df)
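The 5000 chunk size is just an example: depending on how wide your dataframe is and how much memory your transformation needs, you may want to lower it (or raise it to reduce per-chunk overhead).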
However, this usage supposes that the schema is already set on output_dataset. If you need to set the output_dataset schema from the code itself, it's a bit more complex: the schema must be set before opening the writer, so the code looks like this:
import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

first_chunk = True
writer = None

for input_chunk_df in input_dataset.iter_dataframes(5000):
    # input_chunk_df is a dataframe containing just a chunk of data
    # from input_dataset, with at most 5000 records

    # Process the data here ...
    output_chunk_df = big_processing_function(input_chunk_df)

    if first_chunk:
        # This is the first chunk, so set the schema and open the writer
        output_dataset.write_schema_from_dataframe(output_chunk_df)
        writer = output_dataset.get_writer()
        first_chunk = False

    # Append the processed chunk to the output
    writer.write_dataframe(output_chunk_df)

# Very important: when not using "with", you must explicitly close the writer
writer.close()
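As a side note, if you want the explicit-close variant to also close the writer when a chunk fails (which is what the "with" form gives you), one option is to wrap the loop in try/finally. This is just a sketch built on the same pattern, reusing the same hypothetical big_processing_function:

import dataiku

input_dataset = dataiku.Dataset("input")
output_dataset = dataiku.Dataset("output")

writer = None
try:
    for input_chunk_df in input_dataset.iter_dataframes(5000):
        output_chunk_df = big_processing_function(input_chunk_df)
        if writer is None:
            # First chunk: set the schema before opening the writer
            output_dataset.write_schema_from_dataframe(output_chunk_df)
            writer = output_dataset.get_writer()
        writer.write_dataframe(output_chunk_df)
finally:
    # Close the writer even if a chunk raised an error
    if writer is not None:
        writer.close()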
Many thanks! It worked!