How to save the JSON output of a Python code recipe without converting it into a DataFrame

Tsurapaneni
Level 3

Hi All,

I have a use case where the output dataset of the code recipe should be JSON data. I am making an API call where the extracted data is in JSON format. I don't want to convert that JSON data into a pandas dataframe; I want to keep the JSON data exactly as it is. Can you please help me understand how to achieve this?

6 Replies
adamnieto

Hi @Tsurapaneni , 

Great question, so the way I would approach this would be to add a DSS managed folder as the output of your python recipe. 

You can do this by going into your recipe, and clicking on the Inputs/Outputs tab and then selecting "New Folder". 

Take a look at the photos below to give you some guidance.

inputs_outputs.png

 

managed_folder.png

 

Once you add the details for the managed folder's storage you can go ahead and press the "Save" button. 

As far as the Python recipe goes, you can use the managed folder documentation to help you write to the managed folder.

Here is a code snippet on how to write to the managed folder using the dataiku library package in python. 

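import dataiku
handle = dataiku.Folder("folder_name")
# The writer expects bytes in Python 3 (a str raises the TypeError reported later in this thread)
with handle.get_writer("myoutputfile.txt") as w:
    w.write("some data".encode("utf-8"))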

 

Here is also some documentation for some more code snippets on how to use managed folders in DSS.

Let me know if this helps and if you have any more questions. 

 

Adam

Tsurapaneni
Level 3
Author

Hi,

I think this solution does not quite answer my question. Let me be a bit clearer about my problem.

I have data extracted from an API call, and the data (call it X) is in JSON format. In the normal Python recipe template, the output dataset has to be converted into a pandas dataframe, which is a structured format. But in my use case, I don't want to convert the extracted data X; I want the output to stay as the same X in JSON.

So the job of the recipe is just to extract the data from the API, and the output dataset should definitely be JSON data rather than a pandas dataframe.

In short: API -> Json data (X) -> output dataset (X) in json only.

I cannot select my repo (store into) for the dataset, as it is not showing up in the drop-down (maybe org restrictions).

Please let me know if there is any solution where I can save the extracted data (JSON) as JSON only, rather than converting it into a pandas dataframe (with the limitation that I can't create a folder in the store).

adamnieto

Hi @Tsurapaneni , 

I believe I understand this better now. However, could you provide an example of "API -> Json data (X) -> output dataset (X) in json only" so I can picture it and make sure?

If I am understanding this correctly, I don't believe the Flow can show JSON data as an output type of the Python code recipe unless the JSON data is stored inside a managed folder. In other words, a DSS dataset object only represents data in tabular form. To show what I mean visually:

dss_current_capabilities.PNG

The third example in the picture above doesn't exist currently in DSS. @CoreyS , can you confirm that this is all correct from the Dataiku side please? 

If you cannot create and store data inside of a managed folder in your flow, I would try two different things:

 

1. Contact a DSS administrator and let them know your issue or limitations. They may be able to help you create and store the data inside a managed folder by changing some of the DSS permissions, your project settings, or your group settings.

2. You can also represent the data in a tabular format and then when you want to work in JSON you can always convert it. 

To go from JSON to a pandas dataframe here is a code snippet to help you out:

 

 

# -*- coding: utf-8 -*-
from io import StringIO
import dataiku
import pandas as pd

# Execute API call and save JSON data as a string and assign it to the variable 'api_response_data'

...

# Convert api_response_data to pandas dataframe
# (wrapping the string in StringIO avoids the deprecation warning newer pandas versions emit for literal JSON strings)
api_response_df = pd.read_json(StringIO(api_response_data), orient='records')

# Write recipe outputs
api_response = dataiku.Dataset("api_response")
api_response.write_with_schema(api_response_df)

 

 

To go from a pandas dataframe to JSON you can do the following: 

 

 

import dataiku
import pandas as pd
# Some helpful links: https://pythonbasics.org/pandas-json/

# Read recipe inputs
api_response = dataiku.Dataset("api_response")
api_response_df = api_response.get_dataframe()

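# to_json returns a JSON string; orient='records' matches the layout read in the previous snippet
json_obj = api_response_df.to_json(orient='records')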
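
A variation on option 2, for cases where the JSON should stay untouched inside a tabular dataset, is to keep each record's JSON text in a single string column. The following is only a minimal sketch using the standard library; the sample records and the column name `raw_json` are hypothetical, and in a real recipe the rows would presumably be turned into a dataframe and written with `write_with_schema` as in the snippet above.

```python
import json

# Hypothetical API response: a list of nested records
api_records = [
    {"id": 1, "tags": ["a", "b"], "meta": {"source": "api"}},
    {"id": 2, "tags": [], "meta": {"source": "api"}},
]

# One row per record, with the record's JSON text stored in a single string column
rows = [{"id": rec["id"], "raw_json": json.dumps(rec)} for rec in api_records]

# Downstream, parsing the column recovers the original nested records
restored = [json.loads(row["raw_json"]) for row in rows]
```

The nested structure is not flattened this way, at the cost of consumers having to parse the column themselves.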

 

Let me know if this helps!

-Adam

CoreyS
Dataiker Alumni

Hey @adamnieto, sorry for the delayed response. We do not natively write to JSON formats, but the user could develop a plugin for a custom exporter that would produce JSON.

You are correct that using a folder to store the json objects on the flow is the best path, or potentially a bucket in S3.

info-rchitect
Level 6

Hi @CoreyS,

 

So, any update on writing a JSON file in a managed folder? I would like to serve this JSON file through a public REST API so people can get to it.

 

 

# json_data is a Python dict
json_object = json.dumps(json_data, indent = 4)    

handle = dataiku.Folder("dash_public_table_descriptions")
with handle.get_writer("dash_public_table_descriptions.json") as w:
    w.write(json_object)

 

 

The code above results in:

TypeError: a bytes-like object is required, not 'str'
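
The error message suggests that the writer is a binary stream, so the JSON string would need to be encoded to bytes before writing. Here is a minimal sketch of that idea which runs outside DSS: `io.BytesIO` is only a stand-in for the folder writer (an assumption), and the sample dict and the helper `to_json_bytes` are hypothetical.

```python
import io
import json

# Hypothetical sample data standing in for json_data
json_data = {"table": "orders", "description": "One row per order"}

def to_json_bytes(obj):
    """Serialize a Python object to UTF-8 encoded JSON bytes."""
    return json.dumps(obj, indent=4).encode("utf-8")

# io.BytesIO stands in for the binary writer returned by handle.get_writer(...)
with io.BytesIO() as w:
    try:
        w.write(json.dumps(json_data))  # str: rejected by a binary stream
    except TypeError as err:
        error_message = str(err)
    w.write(to_json_bytes(json_data))   # bytes: accepted
    written = w.getvalue()
```

With a real managed folder, the same `to_json_bytes(json_data)` value would be passed to `w.write(...)` inside the `get_writer` block.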

 
I really don't want everyone who wants to use this data to have to convert it to JSON themselves.

 

thx

info-rchitect
Level 6

OK, maybe this thread just needed updating, this worked:

 

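import json
import dataiku

# json_data is a Python dict
json_str = json.dumps(json_data, indent = 4)
json_obj = json.loads(json_str)  # round trip yields plain JSON-compatible types
handle = dataiku.Folder("dash_public_table_descriptions")
handle.write_json("dash_public_table_descriptions.json", json_obj)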
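
For readers wondering what the `dumps`/`loads` round trip changes, here is a small sketch using only the standard library; the sample dict is hypothetical. If `json_data` already contains only JSON-compatible types, the round trip is likely a no-op and passing `json_data` straight to `write_json` should behave the same, though that is an assumption not confirmed in this thread.

```python
import json

# Hypothetical dict with values that are valid Python but not native JSON types
json_data = {1: ("a", "b"), "nested": {"ok": True}}

json_str = json.dumps(json_data, indent=4)
json_obj = json.loads(json_str)

# Integer keys come back as strings and tuples come back as lists
print(json_obj)
```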