Zip error when trying to write_with_schema using Python
I have 2 dataset outputs (filesystem) and get this error when writing either one:
[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWriter] - Done initializing output writer
[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWritingService] - Pushed data to write session Vid70erxcJ : 0 rows
[16:12:33] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWritingService] - Finished write session: Vid70erxcJ (current count=0)
[16:12:33] [DEBUG] [dku.jobs] - Command /tintercom/datasets/push-data processed in 30ms
[16:12:33] [DEBUG] [dku.jobs] - Command /tintercom/datasets/wait-write-session processed in 91ms
[16:12:33] [INFO] [dku.utils] - 0 rows successfully written (Vid70erxcJ)
[16:12:33] [INFO] [dku.utils] - *************** Recipe code failed **************
[16:12:33] [INFO] [dku.utils] - Begin Python stack
[16:12:33] [INFO] [dku.utils] - Traceback (most recent call last):
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/data_dir/jobs/TESTEXPLORER/Build_test_explorer_data__NP__2022-02-14T21-10-10.951/compute_test_explorer_data_NP/cpython-recipe/pyoutEYeL0KruNePt/python-exec-wrapper.py", line 208, in <module>
[16:12:33] [INFO] [dku.utils] -     exec(f.read())
[16:12:33] [INFO] [dku.utils] -   File "<string>", line 405, in <module>
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset.py", line 631, in write_with_schema
[16:12:33] [INFO] [dku.utils] -     self.write_dataframe(df, True, dropAndCreate)
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset.py", line 660, in write_dataframe
[16:12:33] [INFO] [dku.utils] -     writer.write_dataframe(df)
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dataset_write.py", line 397, in write_dataframe
[16:12:33] [INFO] [dku.utils] -     quoting=csv.QUOTE_ALL,).save()
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 197, in save
[16:12:33] [INFO] [dku.utils] -     self._save()
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 296, in _save
[16:12:33] [INFO] [dku.utils] -     self._save_chunk(start_i, end_i)
[16:12:33] [INFO] [dku.utils] -   File "/data/dataiku/dataiku-dss-9.0.5/python/dataiku/core/dku_pandas_csv.py", line 318, in _save_chunk
[16:12:33] [INFO] [dku.utils] -     for col_loc, col in zip(b.mgr_locs, d):
[16:12:33] [INFO] [dku.utils] - TypeError: zip argument #2 must support iteration
The dataframes are valid with the following shapes:
(14295, 9)
(3623976, 17)
Here is the code where I create the datasets and then write with schema:
test_explorer_data = dataiku.Dataset("test_explorer_data")
test_explorer_data.write_with_schema(test_explorer_data_df)
test_explorer_event_data = dataiku.Dataset("test_explorer_event_data")
test_explorer_event_data.write_with_schema(test_explorer_event_data_df)
BTW, I am using pandas 1.2.3. I have verified this has nothing to do with my datasets; rather, it is because Dataiku (at least version 9) is incompatible with pandas 1.2.3. I confirmed this with a simple Python recipe that just copies the input dataframe to the output dataframe, and it fails with the exact same error.
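Until the environment is fixed, a recipe can fail fast with a readable message instead of the opaque zip error. This is a minimal sketch; the (1, 0) version ceiling is an assumption based on the behavior reported in this thread, not a documented support matrix:

```python
import pandas as pd

def pandas_too_new(version, ceiling=(1, 0)):
    """Return True if a pandas version string is newer than the assumed supported line."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) > ceiling

# Warn early, before write_with_schema is reached.
if pandas_too_new(pd.__version__):
    print(f"Warning: pandas {pd.__version__} is newer than the 1.0.x line this "
          "recipe was tested with on DSS 9; write_with_schema may fail.")
```

The check only compares the first two version components, so patch releases within 1.0.x pass.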
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
artifactory_ids = dataiku.Dataset("artifactory_ids")
artifactory_ids_df = artifactory_ids.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
pandas_1p2p3_bug_df = artifactory_ids_df  # For this sample code, simply copy input to output

# Write recipe outputs
pandas_1p2p3_bug = dataiku.Dataset("pandas_1p2p3_bug")
pandas_1p2p3_bug.write_with_schema(pandas_1p2p3_bug_df)
Reverting to pandas 1.0.0 makes the recipe complete successfully.
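For anyone hitting the same error: the revert happens in the code environment the recipe runs in. Assuming a pip-based DSS code environment, the "Requested packages" list would contain a pin like the following, after which the environment needs to be rebuilt/updated:

```
pandas==1.0.0
```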
Operating system used: Windows 10
Best Answers
-
Hi,
Rest assured that our Engineering team is working hard on this topic.
To give you more details, the case of pandas is special: compared to other libraries, a lot of Dataiku's internal tooling relies on pandas as a dependency. For example, when you execute the following code:
df = dataiku.Dataset("my_dataset").get_dataframe()
...Dataiku's Python library relies on pandas to load the data from your Dataset into the df DataFrame.
Historically speaking, every time we decided to bump the supported version, we had to run thorough tests to make sure that it wouldn't break anything in all of the product's internal components, overall stability being key in delivering a good user experience.
The good news is that the support of much more recent versions of pandas (and Python) is planned to roll out soon! I will keep you posted in this thread once it becomes available.
Thanks again for your patience!
Best,
Harizo
-
Hi,
Upgrading your DSS instance to version 10.0.4 should allow you to bump the Python/pandas versions to more recent ones (up to 3.10 for Python, up to 1.3 for Pandas). You can read the release notes for more details.
Best,
Harizo
Answers
-
Hi,
As you have correctly noticed, the issue comes from a version of pandas (1.2.3) that is not currently supported by Dataiku (as of version 10), so reverting to a supported pandas version (1.0.0) is the solution.
Best,
Harizo
-
Hi HarizoR,
Pandas 1.0.0 was released on January 29, 2020. Should users expect a two-year lag in the pandas version that works with Dataiku? If not, what is the plan to close this gap?
thx
-
CoreyS (Dataiker Alumni)
Hi @info-rchitect we can confirm that as of yesterday, we have added support for Pandas 1.1, Pandas 1.2 and Pandas 1.3. More information can be found in our release notes.