
Read gzip compressed json file in partitioned folder of s3

Solved!
Toshi
Level 2

Please tell me how to read the gzip-compressed json file in the Dataiku folder (S3) into the dictionary. The S3 path consists of 5 minute data partitions as shown below, and in the Python recipe I want to ungzip and read the data in the 5 minute partition.

YYYY / MM / DD / HH / 00 / xxxxxx00.json.gz
YYYY / MM / DD / HH / 05 / xxxxxx05.json.gz

Can you give me some sample Python code?
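As a side note, the 5-minute partition layout above can be expressed as a small path-builder. This is a sketch with a hypothetical helper name (`partition_path`) and assumes the `xxxxxx` file prefix is a fixed basename, which is not specified in the post:

```python
from datetime import datetime

def partition_path(ts, basename="xxxxxx"):
    """Build the 5-minute partition path YYYY/MM/DD/HH/MM/<basename><MM>.json.gz
    (hypothetical layout inferred from the examples above)."""
    minute = ts.minute - ts.minute % 5  # floor to the 5-minute bucket
    return f"{ts:%Y/%m/%d/%H}/{minute:02d}/{basename}{minute:02d}.json.gz"

# A timestamp of 00:07 falls into the 05 bucket
print(partition_path(datetime(2022, 3, 1, 0, 7)))  # → 2022/03/01/00/05/xxxxxx05.json.gz
```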

3 Replies
Toshi
Level 2
Author

I am getting the following error; could you please give me some advice?

[2022/04/02-02:24:42.513] [FRT-35-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_aka4jcxL_2022-03-01-00__00
com.dataiku.common.server.APIError$SerializedErrorException: Error in python process: At line 37: <class 'AttributeError'>: 'bytes' object has no attribute 'read'
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleErrorFile(JobExecutionResultHandler.java:65)
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResultNoProcessDiedException(JobExecutionResultHandler.java:32)

 

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
import dataiku
import gzip
import json

# Read recipe inputs
predict_input_folder = dataiku.Folder("predict_input_folder")
predict_input_folder_info = predict_input_folder.get_info()

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Read the input json for the current 5-minute partition
current_partition = dataiku.dku_flow_variables["DKU_DST_MINUTE"]
dir_path = f'{dataiku.dku_flow_variables["DKU_DST_YYYYMMDDHH"].replace("-", "/")}/{dataiku.dku_flow_variables["DKU_DST_MINUTE"]}'
filename = f'waf_logs_{dataiku.dku_flow_variables["DKU_DST_YYYYMMDDHH"].replace("-", "")}{dataiku.dku_flow_variables["DKU_DST_MINUTE"]}.json.gz'
s3_path = f'{dir_path}/{filename}'

with predict_input_folder.get_download_stream(s3_path) as f:
    data = f.read()
    jsondict_list = json.load(gzip.decompress(data))
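For reference, the traceback points at the last line: gzip.decompress returns bytes, while json.load expects a file-like object with a .read() method. A minimal stdlib-only reproduction (using hypothetical in-memory sample data):

```python
import gzip
import json

# Hypothetical sample: a gzip-compressed JSON document built in memory
payload = gzip.compress(json.dumps([{"id": 1}, {"id": 2}]).encode("utf-8"))

data = gzip.decompress(payload)   # returns bytes, not a file-like object
# json.load(data) would raise: 'bytes' object has no attribute 'read'
jsondict_list = json.loads(data)  # json.loads accepts bytes/str directly
print(jsondict_list)              # → [{'id': 1}, {'id': 2}]
```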

Young-Sang_Lee
Dataiker
Hi Toshi,

As mentioned in the documentation, the get_download_stream method returns a Python "file-like" object, which can be passed to other Python functions that accept non-seekable file-like objects.

Since gzip.open supports this, you will need to modify the code as below:

import json
import gzip
with predict_input_folder.get_download_stream(s3_path) as stream:
    with gzip.open(stream) as f:
        jsondict_list = json.loads(f.read())
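The same pattern works with any non-seekable file-like object, so it can be wrapped in a small helper and exercised without a Dataiku instance. A sketch, with a hypothetical helper name (`read_gzipped_json`) and an in-memory stream standing in for get_download_stream:

```python
import gzip
import io
import json

def read_gzipped_json(stream):
    """Decode a gzip-compressed JSON document from a (possibly non-seekable) stream."""
    with gzip.open(stream) as f:
        return json.loads(f.read())

# Usage: io.BytesIO plays the role of the folder's download stream here
buf = io.BytesIO(gzip.compress(json.dumps({"ok": True}).encode("utf-8")))
print(read_gzipped_json(buf))  # → {'ok': True}
```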
Toshi
Level 2
Author
Young-Sang san

I was able to read the file normally. Thank you. This ticket will be closed.

Best Regards,
Toshiaki