-
Recipe to create multiple datasets from a single dataset
I have a dataset called "MasterData" in the flow. I want to subset this data based on the "Country" column and save each subset under a name that includes the country (e.g. "MasterDataAustralia"), in a zone built exclusively for that country. I have around 60+ countries in the master data, and new countries may be added…
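A minimal sketch of the splitting step in plain Python, assuming rows come in as dicts; the "MasterData" + country naming follows the question. Inside Dataiku, each subset would then be written to its own output dataset (e.g. with `dataiku.Dataset(name).write_with_schema(...)`), and creating the per-country datasets and zones for newly appearing countries would go through the `dataikuapi` project API — those calls are omitted here.

```python
from collections import defaultdict

def split_by_country(rows, country_key="Country"):
    """Group rows of the master data by country, keyed by the
    per-country dataset name (e.g. "MasterDataAustralia")."""
    subsets = defaultdict(list)
    for row in rows:
        subsets["MasterData" + row[country_key]].append(row)
    return dict(subsets)

rows = [
    {"Country": "Australia", "value": 1},
    {"Country": "France", "value": 2},
    {"Country": "Australia", "value": 3},
]
subsets = split_by_country(rows)
print(sorted(subsets))  # ['MasterDataAustralia', 'MasterDataFrance']
```

Because the grouping is driven by the values actually present in the "Country" column, new countries are picked up automatically; only the creation of the matching output dataset/zone needs to be scripted.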
-
How to create a dynamic contains function
I have 2 files/sheets: one that lists file names and one that lists colors. What I'm trying to do is somehow link these two datasets: if a filename contains one of the colors in my colors sheet, then list that color in a new column. Is that possible to do in Dataiku? I thought…
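The "dynamic contains" lookup can be sketched as a small function, e.g. in a Dataiku Python recipe; the sheet contents below are made-up illustrations. Note that plain substring matching has pitfalls (e.g. "red" would match "hundred"), so a word-boundary regex may be a safer refinement.

```python
def match_color(filename, colors):
    """Return the first color whose name appears in the filename
    (case-insensitive), or None if no color matches."""
    lowered = filename.lower()
    for color in colors:
        if color.lower() in lowered:
            return color
    return None

colors = ["Red", "Blue", "Green"]          # values from the colors sheet
filenames = ["report_blue_2024.xlsx", "notes.txt"]
matched = [(f, match_color(f, colors)) for f in filenames]
print(matched)  # [('report_blue_2024.xlsx', 'Blue'), ('notes.txt', None)]
```

Applied row-by-row to the filename dataset, this fills the desired new column with the matched color (or empty when none matches).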
-
Github configuration
Hi everyone, I'm trying to connect to my private GitHub repo via SSH key, but I got this response: An error happened while adding the following remote: origin (git@github.com:user-repo.git), and then fetching it, caused by: IOException: Process failure, caused by: IOException: Process execution failed (return code 1)…
-
Not able to read text files using Pyspark in Dataiku
Hi, I'm trying to read text files from my managed folder using PySpark in Dataiku. I created an RDD, but when I call collect() on it, it throws an error that the path doesn't exist. Below is the code: # -*- coding: utf-8 -*- import dataiku from dataiku import spark as dkuspark from pyspark import SparkContext from…
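A "path does not exist" error from collect() often comes from handing Spark a bare relative path. A sketch of building the absolute file:// URI that sc.textFile() expects, assuming the managed folder lives on the local filesystem (the folder id and filename are hypothetical; in Dataiku the root would come from `dataiku.Folder("folder_id").get_path()`, which only works for locally hosted folders):

```python
import pathlib

def to_spark_uri(folder_root, filename):
    """Build the absolute file:// URI Spark workers can resolve,
    instead of a bare relative path (which Spark reports as missing)."""
    return pathlib.PurePosixPath(folder_root, filename).as_uri()

# In a Dataiku recipe the root would come from the managed folder, e.g.
#   folder_root = dataiku.Folder("my_folder_id").get_path()   # hypothetical id
folder_root = "/data/dss/managed_folders/MYPROJECT/abc123"    # illustrative path
uri = to_spark_uri(folder_root, "reviews.txt")
print(uri)  # file:///data/dss/managed_folders/MYPROJECT/abc123/reviews.txt
# rdd = sc.textFile(uri); rdd.collect()
```

For folders hosted on HDFS or cloud storage, the scheme and access pattern differ, so this sketch applies only to the local-filesystem case.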
-
Hide API Keys in Project Library Editor
Hi, so I have the following code in my project's Library Editor; however, it contains manually defined keys, which I do not want to be shared. Let me give more context. I have a Python script that calls the following function and extracts my files from Confluence using from langchain.document_loaders import ConfluenceLoader…
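One way to keep secrets out of shared library code is to resolve them at call time from the environment rather than hardcoding them; the variable name below is an assumption. Dataiku also offers project variables (`dataiku.get_custom_variables()`) and per-user secrets as alternatives to plain environment variables.

```python
import os

def get_confluence_api_key():
    """Fetch the Confluence API key from the environment at call time,
    so no secret literal is stored in the shared project library."""
    api_key = os.environ.get("CONFLUENCE_API_KEY")  # variable name is an assumption
    if not api_key:
        raise RuntimeError("CONFLUENCE_API_KEY is not set in this environment")
    return api_key
```

The ConfluenceLoader call would then receive `api_key=get_confluence_api_key()` instead of a literal string, and each user or environment supplies its own value.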
-
How to read the text file using pyspark in Dataiku
I'm new to Dataiku and trying to read a text file using PySpark. I tried creating a dataframe using spark.read.text() and used a Spark context to create an RDD, but both methods throw errors. When I create the Spark context, it throws an error like "RuntimeError: Java gateway process exited before sending its port…
-
Assistance Needed with Custom Python Triggers in Dataiku
Hello Folks, I recently created a project in Dataiku aimed at collecting metric data at the beginning and end of each month. Here is a quick summary of my project: I used a scenario to execute an SQL query and set up triggers for the beginning and end of the month, with specific parameters to launch only on working days.…
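The beginning/end-of-month working-day condition itself can be computed with the standard library. A sketch of the check a custom Python trigger could run (holidays are not considered; inside Dataiku, the trigger would fire the scenario when `should_fire` is true — the trigger wiring itself is omitted here):

```python
import datetime

def first_working_day(year, month):
    """First Mon-Fri day of the month (public holidays not considered)."""
    d = datetime.date(year, month, 1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += datetime.timedelta(days=1)
    return d

def last_working_day(year, month):
    """Last Mon-Fri day of the month (public holidays not considered)."""
    if month == 12:
        d = datetime.date(year, 12, 31)
    else:
        d = datetime.date(year, month + 1, 1) - datetime.timedelta(days=1)
    while d.weekday() >= 5:
        d -= datetime.timedelta(days=1)
    return d

today = datetime.date.today()
should_fire = today in (first_working_day(today.year, today.month),
                        last_working_day(today.year, today.month))
```

Running the trigger's evaluation daily and firing only when `should_fire` holds avoids encoding month-boundary logic in separate time-based triggers.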
-
Weighting method documented in ML Model Results?
Hi all, I have been unable to find documentation of the weighting method setting in the model results summary. Is it not there, or am I just somehow missing it? I typically compare the performance of weighting (class weights) and no weighting when tuning my models. I'd like to be able to look at the results summary…
-
How to train a model on a partitioned dataset using API
How do I build a model on a partitioned dataset in Dataiku using the API? I'm using the code below to develop the model. How do I modify this code to build a partitioned model? "trainset" is partitioned on the column "Market". # client is a DSS API client p = client.get_project("MYPROJECT") # Create a new ML Task to…
-
Using large context for a Gen AI prompt
Hi, I'm trying to create a prompt to ask questions to an LLM and get an answer based on 5,000 reviews of a product. I know there are ways to classify or perform sentiment analysis, but what I want to do is ask an LLM a question about the whole set of reviews. I tried using RAG, but it is my understanding that this…
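When the full corpus exceeds the model's context window, a common alternative to RAG is a map-reduce pattern: summarize batches of reviews separately, then ask the final question over the combined summaries. A sketch of the batching step (the character budget is a crude stand-in for real token counting, and the LLM calls themselves are omitted):

```python
def batch_reviews(reviews, max_chars=4000):
    """Greedily pack reviews into batches under a context budget.
    Each batch would be summarized by the LLM, then the partial
    summaries combined in a final prompt (map-reduce)."""
    batches, current, size = [], [], 0
    for review in reviews:
        if current and size + len(review) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(review)
        size += len(review)
    if current:
        batches.append(current)
    return batches

reviews = ["a" * 3000, "b" * 3000, "c" * 500]
print([len(b) for b in batch_reviews(reviews)])  # [1, 2]
```

The trade-off versus RAG: map-reduce sees every review (good for aggregate questions like "what do customers complain about most"), at the cost of many more LLM calls.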