
Writing a DataFrame in chunks with built-in Dataiku functionality

Solved!
PapaA
Level 3

Hi team,

I am trying to write a data frame with 1000 columns in chunks, because it does not fit in memory. I am writing it to a SQL database table, but I am receiving a schema error. The target table is empty since I have just created it.

 

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
inp = dataiku.Dataset("dto_1")
out = dataiku.Dataset("dto_features_unswifted_1")


with out.get_writer() as writer:

    for df in inp.iter_dataframes( chunksize=10500):
        # Write the processed dataframe
        writer.write_dataframe(df)

 

 

HenriC
Dataiker

Hi @PapaA !

Welcome to the community!

I think you did not attach the right file, but I guess the error said that the output schema had 0 columns while the input had 1000.

To fix this error, you must proceed in two steps: first replicate the schema on the output dataset, then load your data.

 

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
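# Read recipe inputs
inp = dataiku.Dataset("dto_1")
out = dataiku.Dataset("dto_features_unswifted_1")

# Step 1: replicate the input schema on the (empty) output dataset.
# Note: get_dataframe() loads the whole input into memory to do this.
out.write_schema_from_dataframe(inp.get_dataframe())

# Step 2: stream the data to the output chunk by chunk.
with out.get_writer() as writer:
    for df in inp.iter_dataframes(chunksize=10500):
        # Write the processed dataframe
        writer.write_dataframe(df)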
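
A possible variant, offered as a sketch rather than a confirmed fix: the schema can be copied from the input dataset directly, so nothing has to be loaded in full before the chunked write starts. The helper name `copy_in_chunks` is hypothetical. `read_schema`, `write_schema`, `get_writer`, `iter_dataframes` and `write_dataframe` are methods of the Dataiku `Dataset` and writer objects, but please check them against your DSS version.

```python
def copy_in_chunks(inp, out, chunksize=10500):
    """Copy `inp` to `out` chunk by chunk and return the number of rows written.

    `inp` and `out` are expected to be dataiku.Dataset objects
    (assumption: the output dataset is empty and has no schema yet).
    """
    # Step 1: replicate the schema without reading any data.
    out.write_schema(inp.read_schema())

    # Step 2: stream the rows, one chunk at a time.
    rows_written = 0
    with out.get_writer() as writer:
        for df in inp.iter_dataframes(chunksize=chunksize):
            writer.write_dataframe(df)
            rows_written += len(df)
    return rows_written


# Usage inside a DSS Python recipe (dataset names taken from the question):
#   import dataiku
#   copy_in_chunks(dataiku.Dataset("dto_1"),
#                  dataiku.Dataset("dto_features_unswifted_1"))
```

This only works as-is when the chunks are written unchanged. If the processing step adds or renames columns, the output schema has to reflect the processed columns instead.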

If this is not the error you were receiving, could you please check that you sent the right file?

Have a great day,

Henri



PapaA
Level 3
Author

Hi Henri,

 

After implementing this solution, we are still running into memory issues even though we are using really small chunks. The data frame that we are trying to write to SQL has 900 columns and 70K rows. The machine we are using for Dataiku has 128 GB of RAM and 16 cores.

Can this behaviour be attributed to something else?

Kr,
Al 

HenriC
Dataiker

Hey @PapaA,

What kind of SQL dataset are you using?

Searching the web for this error, I think this page could give you some information on how to solve it: https://mariadb.com/kb/en/troubleshooting-row-size-too-large-errors-with-innodb/

If you do not find the solution, I'd be happy to help 🙂
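
As a rough sanity check on the numbers quoted above (900 columns, 70K rows), here is a back-of-the-envelope size estimate. It assumes every value is an 8-byte numeric such as float64, which the thread does not confirm. String or object columns would take more. The function name `estimate_frame_bytes` is hypothetical.

```python
def estimate_frame_bytes(rows, cols, bytes_per_value=8):
    """Rough in-memory size of a purely numeric table, ignoring index overhead."""
    return rows * cols * bytes_per_value


size = estimate_frame_bytes(rows=70_000, cols=900)
print(f"{size / 1e6:.0f} MB")  # 504 MB
```

Under that assumption the raw data is about half a gigabyte, far below the 128 GB of RAM on the machine, so it seems reasonable to look at the database side as well.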
