Zip error when trying to write_with_schema using Python

Solved!
info-rchitect
Level 6

I have 2 dataset outputs (filesystem) and get this error when writing either one:

[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWriter]  - Done initializing output writer
[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWritingService]  - Pushed data to write session Vid70erxcJ : 0 rows
[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWritingService]  - Finished write session: Vid70erxcJ (current count=0)
[16:12:33] [DEBUG] [dku.jobs]  - Command /tintercom/datasets/push-data processed in 30ms
[16:12:33] [DEBUG] [dku.jobs]  - Command /tintercom/datasets/wait-write-session processed in 91ms
[16:12:33] [INFO] [dku.utils]  - 0 rows successfully written (Vid70erxcJ)
[16:12:33] [INFO] [dku.utils]  - *************** Recipe code failed **************
[16:12:33] [INFO] [dku.utils]  - Begin Python stack
[16:12:33] [INFO] [dku.utils]  - Traceback (most recent call last):
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/data_dir/jobs/TESTEXPLORER/Build_test_explorer_data__NP__2022-02-14T21-10-10.951/compute_test_explorer_data_NP/cpython-recipe/pyoutEYeL0KruNePt/python-exec-wrapper.py", line 208, in <module>
[16:12:33] [INFO] [dku.utils]  -     exec(f.read())
[16:12:33] [INFO] [dku.utils]  -   File "<string>", line 405, in <module>
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset.py", line 631, in write_with_schema
[16:12:33] [INFO] [dku.utils]  -     self.write_dataframe(df, True, dropAndCreate)
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset.py", line 660, in write_dataframe
[16:12:33] [INFO] [dku.utils]  -     writer.write_dataframe(df)
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset_write.py", line 397, in write_dataframe
[16:12:33] [INFO] [dku.utils]  -     quoting=csv.QUOTE_ALL,).save()
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 197, in save
[16:12:33] [INFO] [dku.utils]  -     self._save()
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 296, in _save
[16:12:33] [INFO] [dku.utils]  -     self._save_chunk(start_i, end_i)
[16:12:33] [INFO] [dku.utils]  -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 318, in _save_chunk
[16:12:33] [INFO] [dku.utils]  -     for col_loc, col in zip(b.mgr_locs, d):
[16:12:33] [INFO] [dku.utils]  - TypeError: zip argument #2 must support iteration

The dataframes are valid with the following shapes:

(14295, 9)
(3623976, 17)

Here is the code where I create the datasets and then write with schema:

    test_explorer_data = dataiku.Dataset("test_explorer_data")
    test_explorer_data.write_with_schema(test_explorer_data_df)
    test_explorer_event_data = dataiku.Dataset("test_explorer_event_data")
    test_explorer_event_data.write_with_schema(test_explorer_event_data_df)
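
A quick way to confirm which pandas version the recipe's code environment actually resolves to (plain pandas, nothing Dataiku-specific) is to print it right before these calls:

    import pandas as pd

    # Print the pandas version active in this recipe's code environment.
    print("pandas version:", pd.__version__)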


BTW, I am using pandas 1.2.3. I have verified this has nothing to do with my datasets; rather, it is because Dataiku (at least version 9) is incompatible with pandas 1.2.3. I verified this with a simple Python recipe that just copies the input dataframe to the output dataframe, and it fails with the exact same error:

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
artifactory_ids = dataiku.Dataset("artifactory_ids")
artifactory_ids_df = artifactory_ids.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
pandas_1p2p3_bug_df = artifactory_ids_df # For this sample code, simply copy input to output

# Write recipe outputs
pandas_1p2p3_bug = dataiku.Dataset("pandas_1p2p3_bug")
pandas_1p2p3_bug.write_with_schema(pandas_1p2p3_bug_df)

 

 Reverting to pandas 1.0.0 makes the recipe complete successfully.
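
For completeness: if downgrading were not an option, one possible (untested) workaround would be to sidestep the DataFrame-to-CSV path that the traceback points at (dku_pandas_csv) and stream rows through the dataset writer instead. This is only a sketch assuming the standard dataiku writer API (write_schema_from_dataframe, get_writer, write_row_dict); I have not verified it against pandas 1.2.3:

    import dataiku

    # Untested sketch: set the schema from the DataFrame, then write rows
    # one by one instead of going through write_with_schema's CSV chunking.
    output = dataiku.Dataset("test_explorer_data")
    output.write_schema_from_dataframe(test_explorer_data_df)

    with output.get_writer() as writer:
        # iterrows() is slow for millions of rows; this is illustrative only.
        for _, row in test_explorer_data_df.iterrows():
            writer.write_row_dict(row.to_dict())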


Operating system used: Windows 10

5 Replies
HarizoR
Developer Advocate

Hi,

As you have correctly noticed, the issue comes from a version of pandas (1.2.3) that is not currently supported by Dataiku (as of version 10), so reverting to a supported pandas version (1.0.0) is the solution.

Best,

Harizo

info-rchitect
Level 6
Author

Hi HarizoR,

 

Pandas 1.0.0 was released on January 29, 2020. Should users expect a two-year lag in the pandas version that works with Dataiku? If not, what is the plan to close this gap?

 

thx

HarizoR
Developer Advocate

Hi,

Rest assured that our Engineering team is working hard on that topic 🙂

To give you more details, the case of pandas is special: compared to other libraries, a lot of Dataiku's internal tooling relies on pandas as a dependency. For example, when you execute the following code:

df = dataiku.Dataset("my_dataset").get_dataframe()

 

...Dataiku's Python library relies on pandas to load the data from your Dataset into the df DataFrame. 
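
The write path is symmetric: for example ("my_output_dataset" being just a placeholder name), write_with_schema serializes the DataFrame back out through pandas-based CSV internals, which is where the zip error above originates:

    out = dataiku.Dataset("my_output_dataset")
    out.write_with_schema(df)  # serializes df through pandas-based CSV code internally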

Historically, every time we decided to bump the supported version, we had to run thorough tests to make sure it wouldn't break anything in the product's internal components; overall stability is key to delivering a good user experience.

The good news is that support for much more recent versions of pandas (and Python) is planned to roll out soon! I will keep you posted in this thread once it becomes available.

Thanks again for your patience!

Best,

Harizo

HarizoR
Developer Advocate

Hi,

Upgrading your DSS instance to version 10.0.4 should allow you to bump the Python/pandas versions to more recent ones (up to 3.10 for Python, up to 1.3 for Pandas). You can read the release notes for more details.
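
After upgrading, a quick sanity check from a notebook or recipe in the new code environment (standard library plus pandas, nothing DSS-specific) confirms what was actually picked up:

    import sys
    import pandas as pd

    # Check the interpreter and pandas version the code environment resolved to.
    print("Python:", sys.version.split()[0])  # up to 3.10 on DSS 10.0.4+
    print("pandas:", pd.__version__)          # up to 1.3.x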

Best,

Harizo

CoreyS
Dataiker Alumni

Hi @info-rchitect, we can confirm that as of yesterday, we have added support for pandas 1.1, 1.2, and 1.3. More information can be found in our release notes.
