Scenario

anaanike
anaanike Registered Posts: 7

I want to build a dataset on a monthly basis, but I want to save each result (the built dataset) separately in a folder. How do I do it? I do not want to append the data. Each built dataset month-on-month should be saved in separate datasets.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    The Export to folder visual recipe (under Other Recipes) can export a dataset to a Dataiku Managed Folder in a number of file formats, including CSV. However, I don't believe you can customise the file name: it is based on the name of the input dataset, so each run will overwrite the previous output file. For a more complete solution you will need a Python recipe that outputs to a managed folder.

  • anaanike
    anaanike Registered Posts: 7

    Thank you for suggesting this. Let me try it and get back to you. As I am new to building scenarios, I have a follow-up question. If the scenario step builds only the final dataset, will Dataiku automatically run the upstream recipes that feed that dataset first? If not, how do I ensure that they run?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    When you add datasets to be built in the scenario step, you can select whether you want a recursive build or a build of just that dataset.

  • anaanike
    anaanike Registered Posts: 7

    When I run the Python script to export the data, I get the following error: ImportError: cannot import name 'DSSClient'

    The Python script I am using is:

    import datetime
    from dataikuapi.dss import DSSClient

    # Connect to Dataiku DSS
    client = DSSClient("https://***********.merck.com", "*****************8xrKg3tuejZwdQISmj")

    # Get the project
    project = client.get_project("DIAGNOSTIC_STATS")

    # Get the dataset
    dataset = project.get_dataset("COHORT_PTNTS_Scenario")

    # Get the Hadoop connection
    hadoop_connection = project.get_connection("hhdee-prod")

    # Get the destination schema
    destination_schema = hadoop_connection.get_schema("pah")

    # Generate the export filename
    now = datetime.datetime.now()
    export_filename = "COHORT_PTNTS_Scenario_{}.csv".format(now.strftime("%Y%m%d_%H%M%S"))

    # Export the dataset to the destination schema
    dataset.export_to_hive_table(destination_schema, export_filename)

    Is there any alternative to this? If you can help me with the python script, that will be great.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    Where is this piece of Python code running? Inside DSS (i.e. as a Python recipe) or outside DSS (from a Python script running on a machine that needs to connect remotely to the DSS server)?

  • anaanike
    anaanike Registered Posts: 7

    This is inside DSS as a python recipe.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
    edited July 17

    OK. This is the correct way to create an API client inside DSS:

    import dataiku
    client = dataiku.api_client()

    Did you use GenAI (e.g. ChatGPT) to generate your code?

  • anaanike
    anaanike Registered Posts: 7

    Yes, I used GPT to generate it. However, the corrected version is not working either. Is there an alternative to this Python script? Any help would be appreciated.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
    edited July 17

    Please do not paste GenAI code without explicitly warning people that it was generated by a GenAI bot. Trying to understand and fix code that a person wrote is one thing; doing the same with GenAI code is a completely different thing. Your GenAI code suffers from an AI hallucination: it uses an API method that doesn't exist, dataset.export_to_hive_table(). If you post GenAI code, please also include the prompt you used to generate it, so that people looking at the code can see what the GenAI was asked to do and better understand your true requirement.

    You should drop all the code your GenAI produced as it is pure junk. In your flow, select the dataset from which you want to produce the file output. Click on the Python icon in the right pane to add a new Python code recipe. In the outputs section click on Add, but rather than adding a Dataset click on New Folder at the bottom. Place your folder in your desired Dataiku file system connection and click on Create Folder. Finally, click on Create Recipe to create the Python recipe. Dataiku will now create a working code sample that reads your input dataset into a Pandas DataFrame and then gets you a handle to the output folder you created. By adding one extra line at the end you can write the contents of the Pandas DataFrame to a CSV file in the output folder. Here is a working code sample:

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    Sample_Data_prepared_windows_copy = dataiku.Dataset("Sample_Data_prepared_windows_copy")
    Sample_Data_prepared_windows_copy_df = Sample_Data_prepared_windows_copy.get_dataframe()
    
    # Write recipe outputs
    Output_Folder = dataiku.Folder("cLNxPEjq")
    Output_Folder_info = Output_Folder.get_info()
    
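    # Serialise the DataFrame to CSV and upload it to the folder.
    # index=False leaves the Pandas row index out of the file.
    Output_Folder.upload_stream('/some_file.csv', Sample_Data_prepared_windows_copy_df.to_csv(index=False).encode("utf-8"))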

    You can easily modify this code to generate a dynamic file name that contains the year and month (YYYY-MM), so that each month's file is unique.
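    As a rough sketch of that dynamic file name, something like the following could work. The helper name monthly_file_name and the "cohort_ptnts" prefix are made up for illustration and are not part of the Dataiku API; only the standard datetime module is used, so the naming logic can be tried outside DSS.

```python
import datetime


def monthly_file_name(prefix, when=None):
    """Return a folder path such as '/cohort_ptnts_2024-07.csv' for the given date."""
    # Hypothetical helper: default to today's date when no date is supplied.
    when = when or datetime.date.today()
    # %Y-%m produces the YYYY-MM stamp, so each month gets its own file.
    return "/{}_{}.csv".format(prefix, when.strftime("%Y-%m"))


# Inside the recipe, the hard-coded '/some_file.csv' would then become:
# Output_Folder.upload_stream(
#     monthly_file_name("cohort_ptnts"),
#     Sample_Data_prepared_windows_copy_df.to_csv().encode("utf-8"),
# )

print(monthly_file_name("cohort_ptnts", datetime.date(2024, 7, 17)))
```

    Note that with a YYYY-MM stamp, running the recipe twice in the same month would write to the same path again. If every run should be kept, a finer-grained stamp such as "%Y%m%d_%H%M%S" could be used instead.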

  • anaanike
    anaanike Registered Posts: 7

    Thank you so much for this. I was able to execute this. I believed the solution should not be this easy because we are doing this over Dataiku. But now I know better. All thanks to you.
