I have an input folder containing more than 20 CSV files. When I try to read those CSV files one after another (stacking the files), I get incorrect columns.
I tried creating a dataset from the folder, but I get fewer columns than expected, even though the record count is exactly right.
And when I try Python code, I get more columns than expected.
The CSV files we are uploading have different column names and different schemas.
Operating system used: Windows
Try the following snippet:
import tempfile

import dataiku
import pandas as pd

# Read recipe inputs
data = dataiku.Folder("FOLDER_ID")
paths = data.list_paths_in_partition()

dataframes = []  # note: this list initializer was missing in the original snippet
for path in paths:
    if path.endswith('.csv'):
        with data.get_download_stream(path) as file_stream:
            # Buffer the download stream into a temp file so pandas can read it
            with tempfile.NamedTemporaryFile(mode='wb', delete=False) as temp:
                temp.write(file_stream.read())
            df = pd.read_csv(temp.name)
            dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)

# Write recipe outputs
output = dataiku.Dataset("output_dataset_name")
output.write_with_schema(combined_df)
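One caveat with this approach: pd.concat takes the union of columns across the frames and fills missing values with NaN, which is exactly why you see more columns than expected when the CSV files have different schemas. A minimal illustration (with made-up column names):

```python
import pandas as pd

# Two CSVs with different schemas, simulated as DataFrames
df_a = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})
df_b = pd.DataFrame({"id": [3], "amount": [9.5]})

# concat keeps the union of columns; values absent from a file become NaN
combined = pd.concat([df_a, df_b], ignore_index=True)
print(list(combined.columns))  # ['id', 'name', 'amount']
print(len(combined))           # 3
```

So the snippet above only produces a clean result if every CSV it picks up shares the same header.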
You should use the Files in Folder dataset, as I explain in the following post. Create one Files in Folder dataset per distinct schema, since you can't aggregate data that has different schemas. Then use a pattern to select the relevant CSV files to read from the folder.
The Files in Folder dataset lets you quickly load all files sharing the same schema into a single dataset, but you must make sure you only load files with that schema. You can keep all 20 files in a single folder and feed the files of each schema into its own Files in Folder dataset.
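If you want to do the equivalent in Python instead, you can group the files by their header line before concatenating, so each group shares one schema and each concatenation stays clean. A sketch with simulated folder contents (the file names and contents are placeholders, not your real data):

```python
import io
from collections import defaultdict

import pandas as pd

# Simulated folder contents: file name -> raw CSV bytes (placeholders)
files = {
    "a.csv": b"id,name\n1,x\n2,y\n",
    "b.csv": b"id,amount\n3,9.5\n",
    "c.csv": b"id,name\n4,z\n",
}

# Group the files by their first line, i.e. by schema
groups = defaultdict(list)
for path, raw in files.items():
    header = raw.split(b"\n", 1)[0]
    groups[header].append(pd.read_csv(io.BytesIO(raw)))

# One concatenated DataFrame per schema, mirroring one
# Files in Folder dataset per schema
datasets = {h.decode(): pd.concat(dfs, ignore_index=True)
            for h, dfs in groups.items()}
print(sorted(datasets))          # ['id,amount', 'id,name']
print(len(datasets["id,name"]))  # 3
```

Each resulting DataFrame then has exactly the columns of its schema, with no NaN-filled extras.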