Scenario

anaanike · February 2024

I want to build a dataset on a monthly basis, but I want to save each result (the built dataset) separately in a folder. How do I do it? I do not want to append the data. Each built dataset month-on-month should be saved in separate datasets.

Turribeach · February 2024

No worries. Please mark the thread as Accepted Solution.

Turribeach · February 2024

The Export to folder visual recipe (under Other Recipes) can export a dataset to a Dataiku Managed Folder in a number of file formats, including CSV. However, I don't believe you can customise the name of the output file: it is based on the name of the input dataset, so each run will overwrite the previous file. For a more complete solution you will need a Python recipe that outputs to a managed folder.

anaanike · February 2024

Thank you for suggesting this. Let me try and get back to you. As I am new to Scenario building, I have a follow-up question. If I build the final dataset and include only that dataset in the build step of the Scenario, will Dataiku automatically run the upstream recipes that lead to that dataset? If not, how do I ensure that they run?

Turribeach · February 2024

When you add datasets to be built in the scenario step, you can select whether you want a recursive build or a build of just that dataset.

anaanike · February 2024

When I am writing the python script so that data is exported, I am getting the following error: ImportError: cannot import name 'DSSClient'

Python Script being used is:

import datetime
from dataikuapi.dss import DSSClient

# Connect to Dataiku DSS
client = DSSClient("https://***********.merck.com", "*****************8xrKg3tuejZwdQISmj")

# Get the project
project = client.get_project("DIAGNOSTIC_STATS")

# Get the dataset
dataset = project.get_dataset("COHORT_PTNTS_Scenario")

# Get the Hadoop connection
hadoop_connection = project.get_connection("hhdee-prod")

# Get the destination schema
destination_schema = hadoop_connection.get_schema("pah")

# Generate the export filename
now = datetime.datetime.now()
export_filename = "COHORT_PTNTS_Scenario_{}.csv".format(now.strftime("%Y%m%d_%H%M%S"))

# Export the dataset to the destination schema
dataset.export_to_hive_table(destination_schema, export_filename)

Is there any alternative to this? If you can help me with the python script, that will be great.

Turribeach · February 2024

Where is this piece of Python code running? Inside DSS (i.e. as a Python recipe) or outside DSS (i.e. from a Python script running on a machine that needs to connect remotely to the DSS server)?

anaanike · February 2024

This is inside DSS as a python recipe.

Turribeach · February 2024

OK. This is the correct way to create an API client inside DSS:

import dataiku
client = dataiku.api_client()

Did you use GenAI (e.g. ChatGPT) to generate your code?

anaanike · February 2024

Yes, I did use GPT to generate the same. However, the corrected version is also not working. Is there any alternative to this Python Script? It would be very helpful.

Turribeach · February 2024

Please do not paste GenAI code without explicitly warning people that your code was generated by a GenAI bot. It's one thing to try to understand and fix someone else's code, and a completely different thing to do the same with GenAI code. Your GenAI code suffers from an AI hallucination, since it uses an API method that doesn't exist: dataset.export_to_hive_table(). And if you are posting GenAI code, please include the prompt you used to generate it, so that people looking at the code can see what the GenAI was asked to do and therefore understand your true requirement better.

You should drop all the code your GenAI produced as it is pure junk. In your flow, select the last dataset, the one you want to produce the file output from. Click on the Python icon in the right pane to add a new Python code recipe. In the outputs section click on Add, but rather than adding a Dataset click on New Folder at the bottom. Place your folder in your desired Dataiku file system connection and click on Create Folder. Finally, click on Create Recipe to create the Python recipe. Dataiku will now create a working code sample which reads your input dataset into a Pandas DataFrame and then gets you a handle to the output folder you created. By adding an extra line at the end you can write the contents of the Pandas DataFrame to a CSV file in the output folder. Here is a working code sample:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
Sample_Data_prepared_windows_copy = dataiku.Dataset("Sample_Data_prepared_windows_copy")
Sample_Data_prepared_windows_copy_df = Sample_Data_prepared_windows_copy.get_dataframe()

# Write recipe outputs
Output_Folder = dataiku.Folder("cLNxPEjq")
Output_Folder_info = Output_Folder.get_info()

Output_Folder.upload_stream('/some_file.csv', Sample_Data_prepared_windows_copy_df.to_csv().encode("utf-8"))

You can easily modify this code to generate a dynamic file name which can contain the YYYY-MM to make the files unique for every month.
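As a rough illustration of that modification, here is a minimal sketch of a helper that builds a per-month file name. The function name monthly_filename and the "Sample_Data" base name are made up for this example and are not part of the Dataiku API; only the standard library is used, and the upload_stream call from the sample above is shown as a comment so the snippet also runs outside DSS.

```python
import datetime


def monthly_filename(base_name, when=None):
    """Return a CSV file name tagged with the year and month of `when`.

    `base_name` is a hypothetical prefix chosen for this example;
    `when` defaults to today's date.
    """
    when = when or datetime.date.today()
    return "{}_{}.csv".format(base_name, when.strftime("%Y-%m"))


# Inside the recipe, the hard-coded '/some_file.csv' could then be replaced
# along these lines (assuming the same variable names as the sample above):
# Output_Folder.upload_stream(
#     "/" + monthly_filename("Sample_Data"),
#     Sample_Data_prepared_windows_copy_df.to_csv().encode("utf-8"),
# )
```

Because the name only changes once a month, a second run in the same month would target the same path; if every run needs its own file, adding the day or a timestamp to the strftime pattern is one option.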

anaanike · February 2024

Thank you so much for this. I was able to execute this. I believed the solution should not be this easy because we are doing this over Dataiku. But now I know better. All thanks to you.
