Split and output single dataset to multiple datasets based on dynamic values

Jtbonner86
Jtbonner86 Registered Posts: 10

Hi There,

I have a single file with multiple suppliers and want to split the file into individual files for individual suppliers.

However, the suppliers can change from week to week (when the file is refreshed).

Is there a way to split the file based on the dynamic values in the supplier column?


Operating system used: Windows

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    A recipe, even a Python code recipe, cannot change its outputs dynamically. Why do you need to split the file into multiple datasets? There might be a better way of doing this, so please explain your requirement in full.

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    I have a single file containing all my open orders including the supplier name. I want to separate these open orders into supplier files and then send them directly to the supplier. So I want 1 file per supplier.

    ATM the Python library is also not fully functional, so a recipe or solution outside of this would be appreciated.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron
    edited July 17

    OK, so your requirement is to load a file and then write its contents into separate files, one per supplier. That's relatively easy to do: since you are not going to use Dataiku datasets for the output, you can write the files to a Dataiku managed folder. Here is some sample code to write an input dataset to a Dataiku managed folder as a CSV file:

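    import dataiku
    import pandas as pd

    # Managed folder to write to, and the file to create inside it
    folder_name = "some_managed_folder"
    path_output_file = "output.csv"
    input_dataset = dataiku.Dataset("dataset_name")

    handle = dataiku.Folder(folder_name)

    # Load the whole input dataset into a pandas DataFrame
    df = input_dataset.get_dataframe()

    # Write the CSV bytes into the managed folder;
    # index=False leaves out the DataFrame's row index column
    with handle.get_writer(path_output_file) as w:
        w.write(df.to_csv(index=False).encode('utf-8'))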

    You can customise this code to write the files by supplier in a for loop. You can also use Python code to send the files to your suppliers, depending on whatever method to transfer them you want to use.
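
    The for loop mentioned above could look something like the sketch below. It assumes the supplier column is called "supplier" (a hypothetical name, adjust it to your dataset) and reuses the placeholder dataset and folder names from the sample. The safe_filename and write_supplier_files helpers are made up for illustration and are not part of the Dataiku API; only dataiku.Dataset, get_dataframe, dataiku.Folder and get_writer are taken from the sample code.

```python
import re


def safe_filename(supplier):
    """Turn a supplier name into a file-system-friendly file name stem."""
    # Replace anything other than letters, digits, dot, underscore or dash.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", str(supplier)).strip("_")
    return stem or "unknown"


def write_supplier_files(df, folder, supplier_column="supplier"):
    """Write one CSV per distinct value of supplier_column; return the paths."""
    written = []
    # groupby picks up whatever suppliers are present in this week's data.
    # Rows with a missing supplier are skipped by pandas' default dropna=True.
    for supplier, supplier_df in df.groupby(supplier_column):
        path = safe_filename(supplier) + ".csv"
        with folder.get_writer(path) as w:
            w.write(supplier_df.to_csv(index=False).encode("utf-8"))
        written.append(path)
    return written


def main():
    # Imported here so the helpers above can be tried outside Dataiku.
    import dataiku

    # Placeholder names from the sample; "supplier" is an assumed column name.
    df = dataiku.Dataset("dataset_name").get_dataframe()
    folder = dataiku.Folder("some_managed_folder")
    write_supplier_files(df, folder, supplier_column="supplier")
```

    In the recipe you would call main() at the end. Note that two suppliers whose names clean up to the same string would overwrite each other's file, so it is worth checking for that if your supplier names are very similar.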

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    Yes, so a single input file.

    Is it possible to do this without Python, with one of the standard recipes? As I said, our library isn't working yet.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    No, it's not possible. To have dynamic outputs you need to use a Python code recipe and output to files. I am not sure what you mean by "our library isn't working yet". You don't need any extra libraries to do this. Pure Python code, which Dataiku supports out of the box, is all you need.

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    When it was installed, the config seems to have been messed up by our IM team. For example, when trying to import the dataiku library...

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    The screenshot does not show the error in the log. We would need to see the error in the log to be able to help you.

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    I have sent it

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    It looks like your system has not been set up properly. The first error I see is this:

    Failed to apply rule {"key":"memory.limit_in_bytes","value":"min(0.75*total_memory_of_machine, total_memory_of_machine - 2*backend.xmx)"} to cgroup /sys/fs/cgroup/memory/DSS
    java.io.IOException: Invalid argument

    It looks like your administrator left the total_memory_of_machine and backend.xmx names in there, but they should have been replaced with the actual numbers. On top of that, it seems your Dataiku install directory is missing /data/app/dataiku-dss-11.2.0/python/dataiku, which is where the dataiku package lives. It's also weird to use a DATA_DIR with /data/data/ in the path; it looks like an unwanted duplication. Finally, if this is a new install, why did the administrator use v11.2.0, which is from Dec 2022, when much newer versions are available, like v12.6.1 released in April 2024?
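
    For illustration, the expression in that rule can be worked out by hand. The sketch below only does the arithmetic; it assumes a machine with 32 GiB of RAM and a backend.xmx of 6 GiB, both of which are made-up figures, so substitute your own.

```python
GIB = 1024 ** 3


def memory_limit_in_bytes(total_memory_of_machine, backend_xmx):
    """Evaluate min(0.75*total, total - 2*xmx) from the rule in the log."""
    return int(min(0.75 * total_memory_of_machine,
                   total_memory_of_machine - 2 * backend_xmx))


# Hypothetical machine: 32 GiB of RAM and a 6 GiB backend heap.
limit = memory_limit_in_bytes(32 * GIB, 6 * GIB)
print(limit)  # 21474836480, i.e. 20 GiB
```

    With these example figures the second term (32 GiB minus twice 6 GiB) is the smaller one, so 21474836480 is the literal number that would go in place of the expression.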

  • Jtbonner86
    Jtbonner86 Registered Posts: 10
    edited July 17

    Hello,

    I finally have the install sorted and running now, but when running the code above (with changed parameters) I get the error below.

    Apologies in advance for being such a noob

    ERROR:root:Pipe to generator thread failed
    Traceback (most recent call last):
      File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/dkuio.py", line 264, in run
        self.consumer(self._generate())
      File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/managed_folder.py", line 59, in upload_call
        jek_or_backend_void_call("managed-folders/upload-path", params={
      File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 601, in jek_or_backend_void_call
        return backend_void_call(path, data, err_msg, **kwargs)
      File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 592, in backend_void_call
        return _handle_void_resp(backend_api_post_call(path, data, **kwargs), err_msg = err_msg)
      File "/data/app/dataiku-dss-12.6.1/python/dataiku/core/intercom.py", line 659, in _handle_void_resp
        raise Exception("%s: %s" % (err_msg, _get_error_message(err_data).encode("utf8")))
    Exception: None: b'Mkdirs failed to create /user/dataiku/dss_managed_datasets/OOBSUPPLIER/usm6XSWN (exists=false, cwd=file:/data/data/run)'

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    Please post your Python code in a code block (the </> icon in the toolbar).

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    import dataiku
    import pandas as pd

    folder_name = "Managed_Folder_OOB"
    path_output_file = "output.csv"
    input_dataset = dataiku.Dataset("OOB_2301")

    handle = dataiku.Folder(folder_name)

    df = input_dataset.get_dataframe()

    with handle.get_writer(path_output_file) as w:
        w.write(df.to_csv().encode('utf-8'))

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron

    Your managed folder is using an incorrect path. If Dataiku is installed under /data/data/ then the managed folder path should usually be /data/data/mf. However, the error log shows Dataiku is trying to create a folder under /user/dataiku/dss_managed_datasets/. So please check the root path of the file system connection you are using for your managed folder, and make sure it's a valid path and that the Dataiku user has full rights there.

  • Jtbonner86
    Jtbonner86 Registered Posts: 10

    I'll be honest, I'm not entirely sure how. I just created the folder inside the Dataiku flow, and I guess that is the default storage within the network set up by my employer.
