I have a use case where the output dataset of the code recipe should be JSON data. I am making an API call where the extracted data is in JSON format. I don't want to convert that JSON data into a pandas dataframe; I want to keep the JSON data exactly as it is. Can you please help me understand how we can achieve this?
Hi @Tsurapaneni ,
Great question. The way I would approach this is to add a DSS managed folder as the output of your Python recipe.
You can do this by going into your recipe, clicking on the Inputs/Outputs tab, and then selecting "New Folder".
Take a look at the photos below to give you some guidance.
Once you add the details for the managed folder's storage you can go ahead and press the "Save" button.
As far as the python recipe goes you can use these docs here to help you write to the managed folder.
Here is a code snippet on how to write to the managed folder using the dataiku library package in python.
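    import dataiku

    # "folder_name" is the managed folder added as the recipe output
    handle = dataiku.Folder("folder_name")

    # Encode to bytes first; the writer may reject a plain str under Python 3
    with handle.get_writer("myoutputfile.txt") as w:
        w.write("some data".encode("utf-8"))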
Here is also some documentation for some more code snippets on how to use managed folders in DSS.
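For the JSON case specifically, a minimal sketch might look like the following. It assumes the API response has already been parsed into a Python object. The helper name write_json_payload and the sample payload are hypothetical, and an in-memory buffer stands in for the folder writer so the snippet can run outside DSS.

```python
import io
import json


def write_json_payload(writer, payload):
    """Serialize payload as UTF-8 JSON and write it to a binary writer."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    writer.write(data)
    return len(data)


# Hypothetical API response, used only for illustration
api_payload = {"items": [{"id": 1, "tags": ["a", "b"]}], "next": None}

# io.BytesIO stands in for the managed folder writer here (assumption)
buffer = io.BytesIO()
bytes_written = write_json_payload(buffer, api_payload)

# The stored bytes parse back to the same structure, with no dataframe involved
restored = json.loads(buffer.getvalue().decode("utf-8"))
```

Inside a recipe, the same helper could presumably be used with the writer from the snippet above, for example: with handle.get_writer("response.json") as w: write_json_payload(w, api_payload). The file name "response.json" is only a placeholder.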
Let me know if this helps and if you have any more questions.
I think this solution does not quite answer my question. Let me be a bit clearer about my problem.
I have data extracted from an API call, and the data (call it X) is in JSON format. In the normal Python recipe template, the output dataset has to be converted into a pandas dataframe, which is a structured format. But in my use case, I don't want to convert the extracted data X (in JSON); I want the output to be the same X, still in JSON.
So the job of the recipe is just to extract the data from the API, and the output dataset should definitely be JSON data rather than a pandas dataframe.
In short, API -> Json data (X) -> output dataset (X) in json only.
I cannot select my repo (store into) for the dataset, as it is not showing up in the drop-down (maybe org restrictions).
Please let me know if there is any solution where I can save the extracted data (JSON) as JSON only, rather than converting it into a pandas dataframe. (Limitation: I can't create a folder in the store.)
Hi @Tsurapaneni ,
I believe I understand this better now. However, is it possible for you to provide an example of "API -> Json data (X) -> output dataset (X) in json only" so I can understand more visually just to make sure?
If I am understanding this correctly, I don't believe the Flow has the ability to just show the JSON data as an output type of the python code recipe unless the JSON data is stored inside of a managed folder. In other words, DSS only allows a dataset object to represent data tabularly. To represent what I mean in a visual way:
The third example in the picture above doesn't exist currently in DSS. @CoreyS , can you confirm that this is all correct from the Dataiku side please?
If you cannot create and store data inside of a managed folder in your flow, I would try two different things:
1. Contact a DSS administrator and let them know your issue or limitations. They may be able to help you create and store the data inside of a managed folder by changing some of the DSS permissions and/or your project or group settings.
2. You can also represent the data in a tabular format and then when you want to work in JSON you can always convert it.
To go from JSON to a pandas dataframe here is a code snippet to help you out:
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    # Execute API call and save JSON data as a string and assign it to the variable 'api_response_data'
    ...

    # Convert api_response_data to pandas dataframe
    api_response_df = pd.read_json(api_response_data, orient='records')

    # Write recipe outputs
    api_response = dataiku.Dataset("api_response")
    api_response.write_with_schema(api_response_df)
To go from a pandas dataframe to JSON you can do the following:
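    import dataiku
    import pandas as pd

    # Some helpful links: https://pythonbasics.org/pandas-json/

    # Read recipe inputs
    api_response = dataiku.Dataset("api_response")
    api_response_df = api_response.get_dataframe()

    # orient='records' matches the orient used when the JSON was read in;
    # the default orient would return a column-keyed layout instead
    json_obj = api_response_df.to_json(orient='records')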
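One possible variation on option 2, sketched here as an assumption rather than a documented DSS pattern, is to keep each JSON record as a string in a single column so that nested structures are not flattened by the tabular step. The function names, the "raw_json" column name, and the sample records below are all hypothetical.

```python
import json


def records_to_rows(records):
    """Wrap each JSON record as one row holding its serialized text."""
    return [{"raw_json": json.dumps(rec, sort_keys=True)} for rec in records]


def rows_to_records(rows):
    """Parse the serialized text in each row back into a Python object."""
    return [json.loads(row["raw_json"]) for row in rows]


# Hypothetical nested API records, used only for illustration
api_records = [
    {"id": 1, "user": {"name": "a", "roles": ["x", "y"]}},
    {"id": 2, "user": {"name": "b", "roles": []}},
]

rows = records_to_rows(api_records)       # one string column, one row per record
round_tripped = rows_to_records(rows)     # nested structure recovered unchanged
```

The rows list can be passed to pd.DataFrame(rows) and written with write_with_schema as in the first snippet. When reading the dataset back, each value in the "raw_json" column can be handed to json.loads to get the original record.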
Let me know if this helps!
Hey @adamnieto sorry for the delayed response. We do not natively write to JSON formats, but they could develop a plugin for a custom exporter that would produce a JSON.
You are correct that using a folder to store the json objects on the flow is the best path, or potentially a bucket in S3.