Column mismatch when stacking CSV files from a folder

mangeshp23 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3

I have an input folder containing more than 20 CSV files. When I try to read those files one after another and stack them, I get incorrect columns.

I tried creating a dataset from the folder (Create dataset), which gives me fewer columns than expected; the record count is exactly right, but the columns are not.

When I tried Python code instead, I got more columns than expected.

The CSV files we are uploading have different column names and different schemas.


Operating system used: Windows

Answers

  • Ioannis
    Ioannis Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 28 ✭✭✭✭✭
    edited July 17

    Try the following snippet:

    import io

    import dataiku
    import pandas as pd

    # Read recipe inputs: handle on the managed folder holding the CSV files
    data = dataiku.Folder("FOLDER_ID")

    # List every file path in the folder
    paths = data.list_paths_in_partition()

    dataframes = []

    for path in paths:
        if path.endswith('.csv'):
            # Read the file fully into memory, then parse it as CSV
            with data.get_download_stream(path) as file_stream:
                df = pd.read_csv(io.BytesIO(file_stream.read()))
            dataframes.append(df)

    # Stack all files; a column missing from a file is filled with NaN
    combined_df = pd.concat(dataframes, ignore_index=True)

    # Write recipe outputs
    output = dataiku.Dataset("output_dataset_name")
    output.write_with_schema(combined_df)
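
A note on why this approach can return more columns than expected: by default `pd.concat` keeps the union of all column names across the input frames and fills the gaps with NaN. The sketch below illustrates this with two small in-memory CSVs; the file contents and column names are made up for illustration and are not taken from the original poster's data.

```python
import io

import pandas as pd

# Two hypothetical CSV files with different headers (illustrative data only).
csv_a = "id,name\n1,alice\n2,bob\n"
csv_b = "id,amount\n3,9.5\n"

df_a = pd.read_csv(io.StringIO(csv_a))
df_b = pd.read_csv(io.StringIO(csv_b))

# Default behaviour (join="outer"): the result has the union of all columns,
# and cells that a file did not provide are NaN.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)

# join="inner" keeps only the columns shared by every file.
shared = pd.concat([df_a, df_b], ignore_index=True, join="inner")
print(shared)
```

If the extra columns are unwanted, `join="inner"` is one option, but it silently drops every column that is not present in all files, so it only makes sense when the shared columns are the ones that matter.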
  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,740 Neuron

    You should use the Files in Folder dataset, as I explain in the following post. You can't stack data with different schemas into a single dataset, so create one Files in Folder dataset per distinct schema. In each one, set a file pattern that selects only the CSV files with that schema from the folder.

    https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214

    The Files in Folder dataset lets you quickly load all files having the same schema into a single dataset. But you must make sure you only load files with the same schema. You can have all 20 files in a single folder and then feed the different schema files to different Files in Folder datasets.
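
To work out how many Files in Folder datasets are needed and which files belong to each, it may help to group the files by their header row first. The following is a minimal sketch, assuming the CSV contents are already available as text; `group_by_header` and the sample file names are hypothetical, and in DSS the text would come from the managed folder rather than from an in-memory dictionary.

```python
import csv
import io
from collections import defaultdict


def group_by_header(files):
    """Group file names by their header row.

    `files` maps a file name to its CSV text (hypothetical input shape).
    Returns a dict mapping each header, as a tuple of column names,
    to the sorted list of file names that use it.
    """
    groups = defaultdict(list)
    for name, text in sorted(files.items()):
        # Only the first row is needed to identify the schema.
        header = next(csv.reader(io.StringIO(text)), [])
        groups[tuple(col.strip() for col in header)].append(name)
    return dict(groups)


# Illustrative data only: three files, two distinct schemas.
sample_files = {
    "a.csv": "id,name\n1,alice\n",
    "b.csv": "id,amount\n2,9.5\n",
    "c.csv": "id,name\n3,carol\n",
}

groups = group_by_header(sample_files)
for header, names in groups.items():
    print(header, names)
```

Each group then corresponds to one Files in Folder dataset. Note that this compares column names and their order only; files with the same header but different column types would still land in the same group.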
