This is regarding a data mismatch.
I have an input folder containing more than 20 CSV files. When I try to read those CSV files one after another (stacking the files), I get incorrect columns.
I tried creating a dataset from the folder, and I get fewer columns than expected; the record count is exactly the same, but the columns are not.
And when I tried using Python code, I get more columns than expected.
The CSV files we are uploading have different column names and different schemas.
Operating system used: Windows
Answers
-
Ioannis Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 28 ✭✭✭✭✭
Try the following snippet:
import dataiku
import pandas as pd
import tempfile

# Read recipe inputs: the managed folder containing the CSV files
data = dataiku.Folder("FOLDER_ID")
paths = data.list_paths_in_partition()

dataframes = []
for path in paths:
    if path.endswith('.csv'):
        # Copy each CSV from the folder to a local temp file, then load it with pandas
        with data.get_download_stream(path) as file_stream:
            with tempfile.NamedTemporaryFile(mode='wb', delete=False) as temp:
                temp.write(file_stream.read())
            df = pd.read_csv(temp.name)
            dataframes.append(df)

# Stack all the files into a single dataframe
combined_df = pd.concat(dataframes, ignore_index=True)

# Write recipe outputs (replace with your output dataset name)
output = dataiku.Dataset("output_dataset_name")
output.write_with_schema(combined_df)
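Note that because your files have different schemas, pd.concat takes the union of all column names and fills the gaps with NaN, which is why the Python approach shows more columns than expected. If what you actually want is only the columns shared by every file, here is a minimal sketch under that assumption ("FOLDER_ID" and "output_dataset_name" are placeholders to replace with your own folder ID and output dataset):

import dataiku
import pandas as pd

folder = dataiku.Folder("FOLDER_ID")
dataframes = []
for path in folder.list_paths_in_partition():
    if path.endswith(".csv"):
        with folder.get_download_stream(path) as stream:
            # pandas can read the download stream directly
            dataframes.append(pd.read_csv(stream))

# Keep only the columns present in every file, so the concatenation
# does not create extra NaN-filled columns
common_cols = set.intersection(*(set(df.columns) for df in dataframes))
aligned = [df[sorted(common_cols)] for df in dataframes]
combined_df = pd.concat(aligned, ignore_index=True)

output = dataiku.Dataset("output_dataset_name")
output.write_with_schema(combined_df)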
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
You should use the Files in Folder dataset, as I explain in the post below. Create as many Files in Folder datasets as you have distinct schemas; you can't aggregate data that has different schemas into one dataset. Then set a path pattern on each dataset to select the relevant CSV files to read from the folder.
https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214
The Files in Folder dataset lets you quickly load all files that share the same schema into a single dataset, but you must make sure you only load files with the same schema. You can keep all 20+ files in a single folder and feed the different schema files to different Files in Folder datasets.
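If you are not sure which of the 20+ files share a schema (and therefore how many Files in Folder datasets and path patterns you need), a quick look at the header rows can tell you. A minimal sketch, assuming the files sit in a managed folder whose ID you substitute for "FOLDER_ID":

import dataiku
import pandas as pd
from collections import defaultdict

folder = dataiku.Folder("FOLDER_ID")  # hypothetical folder ID, replace with yours
groups = defaultdict(list)
for path in folder.list_paths_in_partition():
    if path.endswith(".csv"):
        with folder.get_download_stream(path) as stream:
            # Read only the header row to identify the schema of each file
            header = tuple(pd.read_csv(stream, nrows=0).columns)
        groups[header].append(path)

for header, files in groups.items():
    print(f"{len(files)} file(s) with columns {list(header)}: {files}")

Each group of paths printed here can then become its own Files in Folder dataset with a matching path pattern.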