How to save the output of a Python code recipe as JSON data without converting it into a DataFrame

Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi All,

I have a use case where the output dataset of a code recipe should be JSON data. I am making an API call, and the data extracted is in JSON format. I don't want to convert that JSON data into a pandas DataFrame; I want to keep the JSON data exactly as it is. Can you please help me understand how to achieve this?

Answers

  • adamnieto
    adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron
    edited July 17

    Hi @Tsurapaneni,

    Great question. The way I would approach this is to add a DSS managed folder as the output of your Python recipe.

    You can do this by going into your recipe, clicking on the Inputs/Outputs tab, and then selecting "New Folder".

    Take a look at the photos below to give you some guidance.

    [Screenshot: inputs_outputs.png]

    [Screenshot: managed_folder.png]

    Once you add the details for the managed folder's storage you can go ahead and press the "Save" button.

    As far as the Python recipe goes, you can use the managed folder documentation to help you write to the folder.

    Here is a code snippet showing how to write to the managed folder using the dataiku Python package.

    import dataiku

    handle = dataiku.Folder("folder_name")
    with handle.get_writer("myoutputfile.txt") as w:
        # The writer streams bytes in Python 3, hence the bytes literal
        w.write(b"some data")

    The documentation also has more code snippets on how to use managed folders in DSS.

    Let me know if this helps and if you have any more questions.

    Adam

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi,

    I think this solution is not quite answering my question. Let me be a bit clearer about my problem.

    I have data extracted from an API call, and the data (call it X) is in JSON format. In the normal Python recipe template, the output dataset has to be converted into a pandas DataFrame, which is a structured format. But in my use case, I don't want to convert the extracted data X (in JSON); I want the output to be the same X, still in JSON.

    So the job of the recipe is just the extraction of the data from the API, and the output dataset should definitely be JSON data rather than a pandas DataFrame.

    In short: API -> JSON data (X) -> output dataset (X) in JSON only.

    I cannot select my repo (store into) for the dataset, as it is not showing up in the dropdown (maybe org restrictions).

    Please let me know if there is any solution where I can save the extracted data as JSON only rather than converting it into a pandas DataFrame. (The limitation is that I can't create a folder in the store.)

  • adamnieto
    adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron
    edited July 17

    Hi @Tsurapaneni,

    I believe I understand this better now. However, is it possible for you to provide an example of "API -> JSON data (X) -> output dataset (X) in JSON only", just so I can make sure I'm understanding it correctly?

    If I am understanding this correctly, I don't believe the Flow has the ability to show JSON data as an output type of a Python code recipe unless the JSON data is stored inside of a managed folder. In other words, DSS only allows a dataset object to represent data tabularly. To represent what I mean visually:

    [Screenshot: dss_current_capabilities.PNG]

    The third example in the picture above doesn't exist currently in DSS. @CoreyS, can you confirm that this is all correct from the Dataiku side, please?

    If you cannot create and store data inside of a managed folder in your flow, I would try two different things:

    1. Contact a DSS administrator and let them know your issue or limitations. They may be able to help you successfully create and store the data inside of a managed folder by changing some permissions in DSS and/or your project or group settings.

    2. You can also represent the data in a tabular format, and then whenever you need JSON you can convert it.

    To go from JSON to a pandas DataFrame, here is a code snippet to help you out:

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Execute API call and save JSON data as a string and assign it to the variable 'api_response_data'
    
    ...
    
    # Convert api_response_data to pandas dataframe
    api_response_df = pd.read_json(api_response_data, orient='records')  
    
    # Write recipe outputs
    api_response = dataiku.Dataset("api_response")
    api_response.write_with_schema(api_response_df)
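
    If the API returns nested JSON, flattening it first may help; here is a small sketch using pandas.json_normalize (available in pandas >= 0.25), with a made-up payload purely for illustration:

    import json
    import pandas as pd

    # Hypothetical nested payload, just for illustration
    api_response_data = '[{"id": 1, "user": {"name": "a", "email": "a@x.com"}}]'

    records = json.loads(api_response_data)
    # json_normalize flattens nested objects into dotted column names
    flat_df = pd.json_normalize(records)  # columns: id, user.name, user.email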

    To go from a pandas DataFrame to JSON, you can do the following:

    import dataiku
    import pandas as pd
    # Some helpful links: https://pythonbasics.org/pandas-json/
    
    # Read recipe inputs
    api_response = dataiku.Dataset("api_response")
    api_response_df = api_response.get_dataframe()
    
    json_obj = api_response_df.to_json()
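
    Note that to_json() defaults to orient='columns' for a DataFrame; if you want the usual list-of-records shape, pass orient='records':

    json_obj = api_response_df.to_json(orient='records')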

    Let me know if this helps!

    -Adam

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hey @adamnieto, sorry for the delayed response. We do not natively write to JSON formats, but they could develop a plugin with a custom exporter that would produce a JSON file.

    You are correct that using a folder to store the JSON objects on the Flow is the best path, or potentially a bucket in S3.
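
    For reference, here is a minimal sketch of what such a custom exporter component might look like inside a plugin (under python-exporters/). The class layout follows the dataiku.exporter.Exporter interface from the plugin docs; treat the method signatures as assumptions and check the documentation for the exact contract:

    import json
    from dataiku.exporter import Exporter

    class JSONExporter(Exporter):
        def __init__(self, config, plugin_config):
            self.config = config
            self.plugin_config = plugin_config
            self.columns = None
            self.rows = []
            self.path = None

        def open_to_file(self, schema, destination_file_path):
            # DSS calls this with the dataset schema and the target file path
            self.columns = [c["name"] for c in schema["columns"]]
            self.path = destination_file_path

        def write_row(self, row):
            # Each row arrives as a tuple in schema order
            self.rows.append(dict(zip(self.columns, row)))

        def close(self):
            # Dump the accumulated records as a JSON array
            with open(self.path, "w") as f:
                json.dump(self.rows, f, indent=4)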

  • info-rchitect
    info-rchitect Registered Posts: 184 ✭✭✭✭✭✭
    edited July 17

    Hi @CoreyS,

    So, any update on writing a JSON file to a managed folder? I would like to serve this JSON file via a public REST API so people can get to it.

    # json_data is a Python dict
    json_object = json.dumps(json_data, indent = 4)

    handle = dataiku.Folder("dash_public_table_descriptions")
    with handle.get_writer("dash_public_table_descriptions.json") as w:
        w.write(json_object)

    The code above results in:

    TypeError: a bytes-like object is required, not 'str'


    I really don't want everyone who wants to use this data to have to convert it to JSON.

    thx

  • info-rchitect
    info-rchitect Registered Posts: 184 ✭✭✭✭✭✭
    edited July 17

    OK, maybe this thread just needed updating; this worked:

    json_str = json.dumps(json_data, indent=4)
    json_obj = json.loads(json_str)  # round-trips the string back to a Python object
    handle = dataiku.Folder("dash_public_table_descriptions")
    # write_json serializes the object itself, so passing json_data directly also works
    handle.write_json("dash_public_table_descriptions.json", json_obj)
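
    For what it's worth, the earlier TypeError comes from the managed folder writer expecting bytes in Python 3, so encoding the serialized string should also work:

    import json
    import dataiku

    json_str = json.dumps(json_data, indent=4)  # json_data is a Python dict
    handle = dataiku.Folder("dash_public_table_descriptions")
    with handle.get_writer("dash_public_table_descriptions.json") as w:
        w.write(json_str.encode("utf-8"))  # bytes, so no TypeError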
