Writing a DataFrame in chunks with built-in Dataiku functionality

PapaA · 01-13-2021

Hi team,

I am trying to write a data frame with 1000 columns in chunks because it does not fit in memory. I am writing it to a SQL database table, but I am receiving a schema error. The target table is empty since I just created it.

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
inp = dataiku.Dataset("dto_1")
out = dataiku.Dataset("dto_features_unswifted_1")


with out.get_writer() as writer:
    for df in inp.iter_dataframes(chunksize=10500):
        # Write the processed dataframe
        writer.write_dataframe(df)

HenriC · 01-13-2021

Hi @PapaA !

Welcome to the community!

I think you did not attach the right file, but my guess is that the error said the output schema had 0 columns while the input had 1000.

To fix this error, you need to proceed in two steps: first replicate the schema, then load your data.

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
inp = dataiku.Dataset("dto_1")
out = dataiku.Dataset("dto_features_unswifted_1")

# Set the output schema first (note: get_dataframe() loads the whole input into memory)
out.write_schema_from_dataframe(inp.get_dataframe())
with out.get_writer() as writer:
    for df in inp.iter_dataframes(chunksize=10500):
        # Write each chunk to the output dataset
        writer.write_dataframe(df)

If that is not the error you were receiving, could you please check that you sent the right file?

Have a great day,

Henri
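
One caveat with the snippet in this answer: inp.get_dataframe() reads the entire input into memory just to derive the schema, which works against the reason for chunking. A minimal sketch of one possible workaround is to take the schema from the first chunk only. The helper name copy_in_chunks is hypothetical (it is not part of the Dataiku API); it relies only on the iter_dataframes, write_schema_from_dataframe, get_writer and write_dataframe calls already used in the thread. It also assumes the column types inferred from the first chunk hold for the later chunks, which may not be true for every dataset.

```python
def copy_in_chunks(inp, out, chunksize=10500):
    """Stream `inp` into `out` chunk by chunk and return the number of rows written.

    `inp` and `out` are expected to behave like dataiku.Dataset objects.
    Hypothetical helper, not part of the Dataiku API.
    """
    chunks = iter(inp.iter_dataframes(chunksize=chunksize))
    first = next(chunks, None)
    if first is None:
        return 0  # empty input: nothing is written and the schema is left untouched

    # Derive the output schema from the first chunk only, not the full dataset.
    # Assumption: later chunks have the same columns and compatible types.
    out.write_schema_from_dataframe(first)

    rows = 0
    with out.get_writer() as writer:
        writer.write_dataframe(first)
        rows += len(first)
        for df in chunks:
            writer.write_dataframe(df)
            rows += len(df)
    return rows


# Usage inside a Dataiku Python recipe (dataset names taken from the thread):
# import dataiku
# copy_in_chunks(dataiku.Dataset("dto_1"), dataiku.Dataset("dto_features_unswifted_1"))
```

If the input schema is already correct, copying it directly with out.write_schema(inp.read_schema()) may be an alternative that avoids reading any data first.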

PapaA · 01-18-2021

Hi Henri,

After implementing this solution, we are still running into memory issues even though we are using really small chunks. The data frame that we are trying to write to SQL has 900 columns and 70K rows. The machine we are using for Dataiku has 128 GB of RAM and 16 cores.

Can this behaviour be attributed to something else?

Kr,
Al
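
For scale, here is a rough back-of-the-envelope estimate of how much memory the figures quoted above imply. It assumes every value is an 8-byte float64, which the thread does not state; string or object columns can take considerably more. The function name frame_bytes is made up for this illustration.

```python
BYTES_PER_VALUE = 8  # assumption: every column is float64


def frame_bytes(rows, cols, bytes_per_value=BYTES_PER_VALUE):
    """Approximate in-memory size of a dense numeric data frame, in bytes."""
    return rows * cols * bytes_per_value


full = frame_bytes(70_000, 900)   # whole data frame from the post
chunk = frame_bytes(10_500, 900)  # one chunk at chunksize=10500

print(f"full frame: {full / 1e6:.1f} MB")   # full frame: 504.0 MB
print(f"one chunk:  {chunk / 1e6:.1f} MB")  # one chunk:  75.6 MB
```

Under that assumption, even the full data frame is well below 128 GB, which suggests the failure may come from somewhere other than the raw size of the data.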

HenriC · 01-19-2021

Hey @PapaA,

What kind of SQL dataset are you using?

After searching the web for this error, I think this page may help you solve it: https://mariadb.com/kb/en/troubleshooting-row-size-too-large-errors-with-innodb/

If you do not find the solution, I'd be happy to help 🙂

Labels

Datasets

File formats

SQL databases