Scenario
I want to build a dataset on a monthly basis, but I want to save each result (the built dataset) separately in a folder. How do I do it? I do not want to append the data. Each built dataset month-on-month should be saved in separate datasets.
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
No worries. Please mark the thread as Accepted Solution.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
The Export to folder visual recipe (under Other Recipes) can export a dataset to a Dataiku Managed Folder in a number of file formats, including CSV. However, I don't believe you can customise the file name: it is based on the input dataset being exported, which means each run will overwrite the same output file. For a more complete solution you will need to use a Python recipe which outputs to a managed folder.
-
Thank you for suggesting this. Let me try it and get back to you. As I am new to Scenario building, I have a follow-up question. If I include only the final dataset in the build step of the Scenario, will Dataiku automatically run the upstream recipes that dataset depends on before building it? If not, how do I ensure that they run?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
When you add datasets to be built in the scenario step, you can select whether you want a recursive build or a build of just that dataset.
-
When I am writing the python script so that data is exported, I am getting the following error: ImportError: cannot import name 'DSSClient'
Python Script being used is:
import datetime
from dataikuapi.dss import DSSClient
# Connect to Dataiku DSS
client = DSSClient("https://***********.merck.com", "*****************8xrKg3tuejZwdQISmj")
# Get the project
project = client.get_project("DIAGNOSTIC_STATS")
# Get the dataset
dataset = project.get_dataset("COHORT_PTNTS_Scenario")
# Get the Hadoop connection
hadoop_connection = project.get_connection("hhdee-prod")
# Get the destination schema
destination_schema = hadoop_connection.get_schema("pah")
# Generate the export filename
now = datetime.datetime.now()
export_filename = "COHORT_PTNTS_Scenario_{}.csv".format(now.strftime("%Y%m%d_%H%M%S"))
# Export the dataset to the destination schema
dataset.export_to_hive_table(destination_schema, export_filename)
Is there any alternative to this? If you can help me with the python script, that will be great.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
Where is this piece of Python code running? Inside DSS (ie as a Python recipe) or outside DSS (from a Python script running on a machine which needs to connect remotely to the DSS server)?
-
This is inside DSS as a python recipe.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
OK. This is the correct way to create an API client inside DSS:
import dataiku
client = dataiku.api_client()
Did you use GenAI (ie ChatGPT) to generate your code?
-
Yes, I did use GPT to generate the same. However, the corrected version is also not working. Is there any alternative to this Python Script? It would be very helpful.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,908 Neuron
Please do not paste GenAI code without explicitly warning people that it was generated by a GenAI bot. Trying to understand and fix someone else's code is one thing; doing the same with GenAI code is a completely different thing. Your GenAI code suffers from an AI hallucination: it uses an API method that doesn't exist, dataset.export_to_hive_table(). If you post GenAI code, please also include the prompt you used to generate it, so that people looking at the code can see what the GenAI was asked to do and understand your true requirement better.
You should drop all the code your GenAI produced as it is pure junk. In your flow, select the last dataset, the one you want to produce the file output from. Click on the Python icon in the right pane to add a new Python code recipe. In the outputs section click on Add, but rather than adding a Dataset click on New Folder at the bottom. Place your folder in your desired Dataiku file system connection and click on Create Folder. Finally, click on Create Recipe to create the Python recipe. Dataiku will now create a working code sample which reads your input dataset into a Pandas DataFrame and then gets you a handle to the output Folder you created. By adding an extra line at the end you can write the contents of the Pandas DataFrame to a CSV file in the output folder. Here is a working code sample:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
Sample_Data_prepared_windows_copy = dataiku.Dataset("Sample_Data_prepared_windows_copy")
Sample_Data_prepared_windows_copy_df = Sample_Data_prepared_windows_copy.get_dataframe()

# Write recipe outputs
Output_Folder = dataiku.Folder("cLNxPEjq")
Output_Folder_info = Output_Folder.get_info()
Output_Folder.upload_stream('/some_file.csv', Sample_Data_prepared_windows_copy_df.to_csv().encode("utf-8"))
You can easily modify this code to generate a dynamic file name which can contain the YYYY-MM to make the files unique for every month.
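As a rough illustration of that modification, here is a minimal sketch. The helper names monthly_filename and export_monthly and the file prefix are hypothetical, not part of Dataiku. The folder argument is assumed to be the dataiku.Folder handle from the sample above, and the only Dataiku call used is upload_stream, exactly as it appears in that sample.

```python
import datetime


def monthly_filename(prefix, when=None):
    """Build a path such as '/cohort_2024-03.csv' (hypothetical helper)."""
    # Default to today's date so each calendar month gets its own file name
    when = when or datetime.date.today()
    return "/{}_{}.csv".format(prefix, when.strftime("%Y-%m"))


def export_monthly(df, folder, prefix, when=None):
    """Write df as CSV to a month-stamped file in the managed folder."""
    path = monthly_filename(prefix, when)
    # index=False keeps the DataFrame index out of the CSV; drop it if you want the index
    folder.upload_stream(path, df.to_csv(index=False).encode("utf-8"))
    return path
```

Inside the recipe, the last line of the sample would then become something like export_monthly(Sample_Data_prepared_windows_copy_df, Output_Folder, "cohort_ptnts"). Running the scenario twice in the same month should target the same path, so add the day or a timestamp to the format string if you need one file per run.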
-
Thank you so much for this. I was able to execute this. I believed the solution should not be this easy because we are doing this over Dataiku. But now I know better. All thanks to you.