
Split and output single dataset to multiple datasets based on dynamic values

Jtbonner86
Level 2

Hi There, 

 

I have a single file with multiple suppliers and want to split the file into individual files for individual suppliers. 

 

However, the suppliers can change from week to week (when the file is refreshed).

 

Is there a way to split the file based on the dynamic values in the supplier column?


Operating system used: Windows

14 Replies
Turribeach

A recipe, even a Python code recipe, cannot change its outputs dynamically. Why do you need to split the file into multiple datasets? There might be a better way of doing this; please explain your requirement in full.

Jtbonner86
Level 2
Author

I have a single file containing all my open orders including the supplier name. I want to separate these open orders into supplier files and then send them directly to the supplier. So I want 1 file per supplier. 

 

At the moment the Python library is also not fully functional, so a recipe / solution outside of Python would be appreciated.


OK, so your requirement is to load a file, then write the output by supplier into separate files. That's relatively easy to do since you are not going to use Dataiku datasets; you can write the files to a Dataiku managed folder instead. Here is some sample code to write an input dataset to a Dataiku managed folder as a CSV file:

 

import dataiku
import pandas as pd

folder_name = "some_managed_folder"
path_output_file = "output.csv"
input_dataset = dataiku.Dataset("dataset_name")

# Handle to the managed folder that will receive the file
handle = dataiku.Folder(folder_name)

df = input_dataset.get_dataframe()

# Stream the dataframe into the folder as a UTF-8 encoded CSV
with handle.get_writer(path_output_file) as w:
    w.write(df.to_csv().encode('utf-8'))

You can customise this code to write the files by supplier in a for loop. You can also use Python code to send the files to your suppliers, depending on whatever transfer method you want to use.
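For illustration, here is a minimal sketch of the per-supplier loop. The column name "supplier" and the sample data are assumptions; in the actual recipe the dataframe would come from the input dataset, and each payload would be written via the managed folder handle as shown above.

```python
import pandas as pd

# Hypothetical stand-in for the open-orders dataset; in DSS this would be
# df = dataiku.Dataset("dataset_name").get_dataframe()
df = pd.DataFrame({
    "supplier": ["Acme", "Beta", "Acme"],
    "order_id": [101, 102, 103],
})

# Build one CSV payload per supplier; the suppliers are discovered
# dynamically from the data, so new suppliers are picked up each week.
csv_by_supplier = {
    str(supplier): group.to_csv(index=False).encode("utf-8")
    for supplier, group in df.groupby("supplier")
}

# In the actual recipe, each payload would go to the managed folder, e.g.:
#     with handle.get_writer("%s.csv" % supplier) as w:
#         w.write(payload)
for supplier, payload in sorted(csv_by_supplier.items()):
    print(supplier, len(payload))
```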

 

 

 

Jtbonner86
Level 2
Author

Yes, so a single input file.

Is it possible to do this without Python, with one of the standard recipes? As I said, our library isn't working yet.


No, it's not possible. To have dynamic outputs you need to use a Python code recipe and output to files. I am not sure what you mean by "our library isn't working yet". You don't need any libraries to do this. Pure Python code, which Dataiku supports out of the box, is all you need.

Jtbonner86
Level 2
Author

When it was installed, the config seems to have been messed up by our IM team; for example, when trying to import the dataiku library...

 


The screenshot does not show the error in the log. We would need to see the error in the log to be able to help you.

Jtbonner86
Level 2
Author

I have sent it

 


It looks like your system has not been set up properly. The first error I see is this:

Failed to apply rule {"key":"memory.limit_in_bytes","value":"min(0.75*total_memory_of_machine, total_memory_of_machine - 2*backend.xmx)"} to cgroup /sys/fs/cgroup/memory/DSS
java.io.IOException: Invalid argument

It looks like your administrator left the total_memory_of_machine and backend.xmx names in there, but they should have been replaced with the actual numbers. On top of that, your Dataiku install directory seems to be missing /data/app/dataiku-dss-11.2.0/python/dataiku, where the dataiku package lives. It's also odd to use a DATA_DIR with /data/data/ in the path; that looks like an unwanted duplication. Finally, if this is a new install, why did the administrator use v11.2.0, which is from Dec 2022, when much newer versions have been available, like v12.6.1 released in April 2024?
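For illustration, here is how that memory rule would resolve once the symbolic names are replaced with real numbers. The 64 GiB machine total and the 8 GiB backend.xmx are assumed figures, not values from your install:

```python
# Hypothetical figures: a 64 GiB machine with backend.xmx set to 8 GiB.
GIB = 1024 ** 3
total_memory_of_machine = 64 * GIB
backend_xmx = 8 * GIB

# The rule from the log, with the symbolic names substituted:
# min(0.75*total_memory_of_machine, total_memory_of_machine - 2*backend.xmx)
memory_limit_in_bytes = int(min(0.75 * total_memory_of_machine,
                                total_memory_of_machine - 2 * backend_xmx))
print(memory_limit_in_bytes)
```

The cgroup rule needs a concrete byte count like this, not the unexpanded expression, which is why applying the rule fails with "Invalid argument".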

Jtbonner86
Level 2
Author

Hello, 

So I finally have the install sorted and running now, but when running the code above (with changed parameters) I get the below error.

Apologies in advance for being such a noob

 

ERROR:root:Pipe to generator thread failed
Traceback (most recent call last):
  File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/dkuio.py", line 264, in run
    self.consumer(self._generate())
  File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/managed_folder.py", line 59, in upload_call
    jek_or_backend_void_call("managed-folders/upload-path", params={
  File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 601, in jek_or_backend_void_call
    return backend_void_call(path, data, err_msg, **kwargs)
  File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 592, in backend_void_call
    return _handle_void_resp(backend_api_post_call(path, data, **kwargs), err_msg = err_msg)
  File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 659, in _handle_void_resp
    raise Exception("%s: %s" % (err_msg, _get_error_message(err_data).encode("utf8")))
Exception: None: b'Mkdirs failed to create /user/dataiku/dss_managed_datasets/OOBSUPPLIER/usm6XSWN (exists=false, cwd=file:/data/data/run)'

 


Please post your Python code in a code block (the </> icon in the toolbar). 


Your managed folder is using an incorrect path. If Dataiku is installed under /data/data/ then the managed folder path should usually be /data/data/mf; however, the error log shows Dataiku trying to create a folder under /user/dataiku/dss_managed_datasets/. So please check the root path of the file system connection you are using for your managed folder, and make sure it's a valid path and that the Dataiku user has full rights there.
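As a quick sanity check, a sketch like the following verifies that the account running the check can create subdirectories under a given root. It uses a temporary directory as a stand-in; replace it with the real connection root shown in your connection settings (e.g. /data/data/mf), and run it as the Dataiku service account:

```python
import os
import tempfile

# Stand-in for the filesystem connection root; replace with the real
# root path from the connection settings, e.g. "/data/data/mf".
connection_root = tempfile.mkdtemp()

# DSS must be able to create subfolders here for managed folders to work.
probe = os.path.join(connection_root, "dss_write_probe")
os.makedirs(probe, exist_ok=True)
writable = os.access(probe, os.W_OK)
os.rmdir(probe)
print(writable)
```

If the mkdir fails or writable is False for the real path, that matches the "Mkdirs failed to create" error above, and the connection root or its permissions need fixing.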


I'll be honest, I'm not entirely sure how. I just created the folder inside the Dataiku flow, and I guess that is the default storage within the network set up by my employer.

 

Jtbonner86
Level 2
Author

import dataiku
import pandas as pd

folder_name = "Managed_Folder_OOB"
path_output_file = "output.csv"
input_dataset = dataiku.Dataset("OOB_2301")

handle = dataiku.Folder(folder_name)

df = input_dataset.get_dataframe()

with handle.get_writer(path_output_file) as w:
    w.write(df.to_csv().encode('utf-8'))
