String getting truncated using python recipe

pipscity
Level 1

Hi all,

I'm just getting started with Dataiku.

I am:

1) reading an image from a managed folder 

2) converting that image to base64 (5k characters)

3) when I export it into my recipe output dataframe, it gets truncated to 1k characters (string limit)

How do I ensure that my string output doesn't get truncated? I tried to change the string max length manually, but it seems to get reset every time I run my script.

My script is as below:

import dataiku
import pandas as pd, numpy as np
import base64
from dataiku import pandasutils as pdu


#read Image Folder
images_folder = dataiku.Folder("Pictures")
folder_info=images_folder.get_info()
print(folder_info)


#Read Image
with images_folder.get_download_stream("template.JPG") as f:
    data = f.read()

#convert Image to base64
base64_encoded_data = base64.b64encode(data)
print(base64_encoded_data)

#convert to dataframe
base64_df = pd.DataFrame.from_dict({'template':base64_encoded_data}, orient='index') # Compute a Pandas dataframe to write into Base64


# Write recipe outputs
output_base64v2 = dataiku.Dataset("output_base64v2")
output_base64v2.write_with_schema(base64_df)

AlexT
Dataiker

Hi,

When you use write_with_schema, the schema is rewritten on every run, which resets the string limit to its default of 1000 characters. One way to raise the limit is to define the schema manually with write_schema, setting maxLength to the value you choose, and then write the data with write_dataframe instead. Here is a sample based on your current code:

 

import dataiku
import pandas as pd, numpy as np
import base64
from dataiku import pandasutils as pdu


#read Image Folder
images_folder = dataiku.Folder("kHGFYqt4")
folder_info=images_folder.get_info()
print(folder_info)


#Read Image
with images_folder.get_download_stream("image.png") as f:
    data = f.read()

#convert Image to base64
base64_encoded_data = base64.b64encode(data)
print(base64_encoded_data)

#convert to dataframe
base64_df = pd.DataFrame.from_dict({'template':base64_encoded_data}, orient='index') # Compute a Pandas dataframe to write into Base64


# Write recipe outputs
output_base64v2 = dataiku.Dataset("base64")
output_base64v2.write_schema([
    {
        # "name" must match the column name of the dataframe written below
        "name": "base64_encoded_data",
        "type": "string",
        "maxLength": 65000
    }
])

output_base64v2.write_dataframe(base64_df)
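
If you would rather not hard-code the column name and maxLength, a rough sketch along the following lines can derive both from the data. The helper name build_string_schema and its margin parameter are hypothetical (they are not part of the dataiku API), and it takes a plain dict of column name to list of strings so it runs without DSS. The result is only a list of dicts in the same name/type/maxLength shape as the schema above.

```python
def build_string_schema(columns, margin=0):
    """Build string column definitions sized to the longest value per column.

    `columns` maps each column name to the list of string values it holds.
    `margin` is extra headroom added to the longest length found.
    """
    schema = []
    for name, values in columns.items():
        # Longest string in this column (0 if the column is empty)
        longest = max((len(v) for v in values), default=0)
        schema.append({
            "name": name,
            "type": "string",
            "maxLength": longest + margin,
        })
    return schema


# Example with two short base64 strings in a column named "Value"
example = build_string_schema({"Value": ["QUJD", "QUJDREVGRw=="]}, margin=1)
print(example)
```

The list it returns could then be passed to write_schema in place of the literal list above, provided the dict keys are the same as the dataframe column names.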

 

pipscity
Level 1
Author

Thanks that really helped.

There was only one issue left: I had to update the column names in the DataFrame so that they match the schema, but it's all good now! The final script is below.

 

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
import base64
from dataiku import pandasutils as pdu


#read Image Folder
images_folder = dataiku.Folder("Pictures")
folder_info=images_folder.get_info()
print(folder_info)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
#Read Image
with images_folder.get_download_stream("template.JPG") as f:
    data = f.read()

#convert Image to base64
base64_encoded_data = base64.b64encode(data).decode('utf-8')  # b64encode returns bytes; decode to get text
strLength=len(base64_encoded_data)+1

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
#convert out base64 to a dataframe
base64_df = pd.DataFrame.from_dict({'template':base64_encoded_data}, orient='index', columns=['Value'])
print(base64_df)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
output_base64v2 = dataiku.Dataset("output_base64v2")
output_base64v2.write_schema([
{
"name": "Value",
"type": "string",
"maxLength": strLength
}
])

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
output_base64v2.write_dataframe(base64_df)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
print(output_base64v2.read_schema())
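
For anyone sizing maxLength ahead of time, here is a small sketch of how the encoded length relates to the file size. Standard padded base64 produces 4 output characters for every 3 input bytes, rounded up. The function name expected_base64_length and the 3750-byte dummy payload are my own illustrations, not part of the script above.

```python
import base64


def expected_base64_length(num_bytes):
    """Length of the padded base64 encoding of num_bytes bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters
    return 4 * ((num_bytes + 2) // 3)


# Dummy payload standing in for the image bytes read from the folder
raw = b"\x00" * 3750

# b64encode returns bytes; decode to get a text string for the dataframe
encoded = base64.b64encode(raw).decode("ascii")

print(len(encoded), expected_base64_length(len(raw)))
```

So an image of about 3.7 KB already yields a string of about 5k characters, well past a 1000-character limit.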
