How can I read a gzip-compressed JSON file from a Dataiku managed folder (S3) into a dictionary? The S3 path is partitioned into 5-minute intervals as shown below, and in a Python recipe I want to decompress and read the data for a given 5-minute partition.
YYYY / MM / DD / HH / 00 / xxxxxx00.json.gz
YYYY / MM / DD / HH / 05 / xxxxxx05.json.gz
Can you give me some sample Python code?
My current recipe (code below) fails with the following error, so any advice would be appreciated.
[2022/04/02-02:24:42.513] [FRT-35-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_aka4jcxL_2022-03-01-00__00
com.dataiku.common.server.APIError$SerializedErrorException: Error in python process: At line 37: <class 'AttributeError'>: 'bytes' object has no attribute 'read'
    at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleErrorFile(JobExecutionResultHandler.java:65)
    at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResultNoProcessDiedException(JobExecutionResultHandler.java:32)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read recipe inputs
predict_input_folder = dataiku.Folder("predict_input_folder")
predict_input_folder_info = predict_input_folder.get_info()
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read the input JSON
current_partition = dataiku.dku_flow_variables["DKU_DST_MINUTE"]
dir_path = f'{dataiku.dku_flow_variables["DKU_DST_YYYYMMDDHH"].replace("-", "/")}/{dataiku.dku_flow_variables["DKU_DST_MINUTE"]}'
filename = f'waf_logs_{dataiku.dku_flow_variables["DKU_DST_YYYYMMDDHH"].replace("-", "")}{dataiku.dku_flow_variables["DKU_DST_MINUTE"]}.json.gz'
s3_path = f'{dir_path}/{filename}'
with predict_input_folder.get_download_stream(s3_path) as f:
    data = f.read()
jsondict_list = json.load(gzip.decompress(data))
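
The AttributeError most likely comes from the last line: `json.load` expects a file-like object with a `.read()` method, while `gzip.decompress` returns plain `bytes`. Passing those bytes to `json.loads` should avoid it. Below is a minimal sketch of that fix. The helper names `read_gzipped_json` and `read_gzipped_json_lines` are hypothetical (not part of the Dataiku API), and the second helper assumes the file might hold one JSON object per line, which the post does not confirm.

```python
import gzip
import io
import json


def read_gzipped_json(stream):
    """Read a gzip-compressed byte stream and parse it as one JSON document."""
    raw = stream.read()  # compressed bytes
    # json.loads accepts bytes (Python 3.6+); json.load would need a file object.
    return json.loads(gzip.decompress(raw))


def read_gzipped_json_lines(stream):
    """Assumption: newline-delimited JSON, one object per non-empty line."""
    text = gzip.decompress(stream.read()).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Local stand-in for the folder stream, so the sketch runs without Dataiku.
sample = io.BytesIO(gzip.compress(json.dumps({"status": "ok"}).encode("utf-8")))
parsed = read_gzipped_json(sample)

# In the recipe itself (requires the dataiku package, not run here):
# with predict_input_folder.get_download_stream(s3_path) as f:
#     jsondict_list = read_gzipped_json(f)
```

If `json.loads` then raises a `JSONDecodeError` mentioning extra data, the file is probably newline-delimited, and `read_gzipped_json_lines` may be the better fit.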