It is regarding data mismatch

mangeshp23 · ‎08-18-2023

I have an input folder containing more than 20 CSV files. When I try to read those files one after another (stacking them), I get incorrect columns.

I tried creating a dataset from the folder. The record count is exactly right, but I get fewer columns than expected.

When I tried using Python code instead, I got more columns than expected.

The CSV files we are uploading have different column names and different schemas.

Operating system used: Windows

imanousar · ‎08-18-2023

Try the following snippet:

import dataiku
import pandas as pd
import os
import tempfile

# Read recipe inputs
data = dataiku.Folder("FOLDER_ID")
paths = data.list_paths_in_partition()

dataframes = []

for path in paths:
    if path.endswith('.csv'):
        # Copy the file from the managed folder into a local temporary file
        with data.get_download_stream(path) as file_stream:
            with tempfile.NamedTemporaryFile(mode='wb', delete=False) as temp:
                temp.write(file_stream.read())
        # Read only after the temp file is closed, so all bytes are flushed to disk
        df = pd.read_csv(temp.name)
        os.remove(temp.name)  # clean up the temporary file
        dataframes.append(df)

# Stack all files; columns are matched by name
combined_df = pd.concat(dataframes, ignore_index=True)

# Write recipe outputs
output = dataiku.Dataset("output_dataset_name")
output.write_with_schema(combined_df)
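
Keep in mind that pd.concat matches columns by name and keeps the union of all of them, which would explain seeing more columns than expected when the files have different headers. A minimal sketch with two made-up DataFrames standing in for two of the CSV files (the column names and values here are hypothetical, not taken from the actual data):

```python
import pandas as pd

# Two hypothetical files whose headers only partly overlap
a = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})
b = pd.DataFrame({"id": [3], "city": ["Pune"]})

# concat keeps every column seen in any input (outer join on column names)
combined = pd.concat([a, b], ignore_index=True)

print(list(combined.columns))           # union of both headers
print(combined.isna().sum().to_dict())  # missing cells per column
```

Rows coming from a file that lacks a given column get NaN in that column, so the row count stays correct while the column count grows.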

Turribeach · ‎08-18-2023

You should use the Files in Folder dataset, as I explain in the following post. You can't aggregate data that has different schemas, so create one Files in Folder dataset for each distinct schema you have, and in each one set a pattern that selects only the relevant CSV files from the folder.

https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214

The Files in Folder dataset lets you quickly load all files having the same schema into a single dataset. But you must make sure you only load files with the same schema. You can have all 20 files in a single folder and then feed the different schema files to different Files in Folder datasets.
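
If it is not obvious which files share a schema, one way to find out is to group them by their header row. This is only a sketch: group_by_header is a hypothetical helper, and the in-memory files below are made-up stand-ins for the real CSVs. In a Dataiku recipe you would presumably pass the streams from the managed folder instead.

```python
import io
from collections import defaultdict

import pandas as pd


def group_by_header(files):
    """Group file names by their exact column list (only the header row is read)."""
    groups = defaultdict(list)
    for name, stream in files.items():
        # nrows=0 parses just the header, so large files stay cheap to inspect
        header = tuple(pd.read_csv(stream, nrows=0).columns)
        groups[header].append(name)
    return dict(groups)


# Hypothetical stand-ins for CSV files in the folder
files = {
    "sales_1.csv": io.StringIO("id,amount\n1,10\n"),
    "sales_2.csv": io.StringIO("id,amount\n2,20\n"),
    "customers.csv": io.StringIO("id,name,city\n1,A,Pune\n"),
}

groups = group_by_header(files)
for header, names in groups.items():
    print(header, names)
```

Each group in the result corresponds to one Files in Folder dataset, and the file names in it tell you what the selection pattern needs to match.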

Labels

Data types

File formats

Python