Exporting a Dataset into Multiple .CSV files

Options
tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

I have a dataset that will have to be imported into another system as a series of .CSV files.

Is there a way to have DSS write several files into a folder based on selection from several columns of the dataset.

For example if I had a vehicle dataset.

With Cars and Trucks, with 4, 6 and 8 Cylinder Engines.

I'd like all of the

4 Cylinder Cars in an appropriately named .csv file.

Like wise the 6 Cylinder Cars in a seperate .csv file

The 6 Cylinder Trucks in a seperate .csv file and So forth.

I could do this kind of thing with the Split Recipe, However I'd have to write out a formula for each of the datasets. I've got about 22 data sets. Although I could do this with visual split recipe. It would be tedious at best. Thoughts for a #Easy Button.

For this project I'm using the community edition with a Nonprofit, so methods using partitions will not work out for this project.


Operating system used: Ubuntu 18.4 LTE (Running under WLS2 on Windows 11)

Tagged:

Best Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Answer ✓
    Options

    Hi @tgb417
    ,

    I guess you'd like to avoid using a python recipe? That is what I'd do. I can't think about a way of doing this using no-code recipes, without being tedious, as you say.

    Cheers!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    edited July 17 Answer ✓
    Options

    Here is what I ended up with. Not elegant. But quick, and it seems to work ok.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    # Get the Output Folder
    output_folder = dataiku.Folder("<the identifier of the managed folder>")

    # Read recipe inputs
    Book_Data_For_Import = dataiku.Dataset("Book_Data_For_Import")
    Book_Data_For_Import_df = Book_Data_For_Import.get_dataframe()

    # Read the Unique Books List
    Books = Book_Data_For_Import_df['sp_act_desc'].unique()

    # Loop through the Books and get the Eligible and Not Eligible Books
    for Book in Books:
    Book_df = Book_Data_For_Import_df.loc[Book_Data_For_Import_df['sp_act_desc'] == Book]

    file_name = Book + ' True.csv'
    print('Writing :', file_name)
    Book_Eligible_df = Book_df.loc[Book_df['Eligible']]
    output_folder.upload_data(file_name, Book_Eligible_df.to_csv(index=False).encode("utf-8"))
    print()

    file_name = Book + ' False.csv'
    print('Writing :', file_name)
    Book_Not_Eligible_df = Book_df.loc[~Book_df['Eligible']]
    output_folder.upload_data(file_name, Book_Not_Eligible_df.to_csv(index=False).encode("utf-8"))
    print()

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    edited July 17 Answer ✓
    Options

    Here is my latest version, and in this case I'm exporting Excel Datasets.
    This seems to work Ok with a variety of Dataiku Managed Folders. I've tested with Local Folders, and Google Drive Plugin. Although I've not tested this with SFTP and other managed file folders, I would expect this to work the same.

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Fill in these values
    DSS_input_data_set = "<Input Data Set Name>"
    DSS_output_folder = "<Dataiku Managed Output Folder>"

    n = 2000 # specify the chunk size here in rows of data from input

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Read recipe inputs
    Stage1_Re_invite_SHL = dataiku.Dataset(DSS_input_data_set)
    df = Stage1_Re_invite_SHL.get_dataframe()

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write recipe outputs
    outfolder = dataiku.Folder(DSS_output_folder)

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Divide the DataFrame into chunks of n rows each
    chunk_list = [df[i:i+n] for i in range(0, df.shape[0], n)]

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write each chunk to a separate Excel file
    for i, chunk in enumerate(chunk_list):
    file_path = DSS_input_data_set +"_{}.xlsx".format(i+1)
    chunk.to_excel(file_path, index=False)
    with open(file_path, 'rb') as file:
    outfolder.upload_data(file_path, file.read())

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Ignacio_Toledo
    ,

    Ok, I'm writing up something right now.

    Thanks.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    It looks good! One could perhaps save some lines at the end, but that is just being pedantic

    Have a great weekend Tom!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Ignacio_Toledo

    Agreed, The reason for the lack of loop through the true and false variants is; in the moment, I could not see a quick way to loop through the true and false variants of the second column. Because of the cool ~ negation operator I was using to select from the Boolean column in the data.

    I guess I could of nested some for loops. The first time iterate over the book type, and the second pass iterating through the items in true and false column. That could cut some redundant code and be a tad more elegant. … Maybe …

    However, that is a project for some other day. This was about getting something done.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    At some point this might make a very good simple recipe plugin.

Setup Info
    Tags
      Help me…