It is regarding data mismatch


I have an input folder containing more than 20 CSV files. When I try to read those files one after another and stack them, I get incorrect columns.

I tried creating a dataset from the folder, which gives me fewer columns than expected. The record count is exactly right, but the columns are not.

When I tried Python code instead, I got more columns than expected.

The CSV files we are uploading have different column names and different schemas.


Operating system used: Windows

2 Replies

Try the following snippet:


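import dataiku
import pandas as pd
import os
import shutil
import tempfile

# Read recipe inputs
data = dataiku.Folder("FOLDER_ID")

paths = data.list_paths_in_partition()

dataframes = []

for path in paths:
    if path.endswith('.csv'):
        # Assumed steps: copy the folder stream to a local temp file,
        # parse it with pandas, then delete the temp file.
        with data.get_download_stream(path) as file_stream:
            with tempfile.NamedTemporaryFile(mode='wb', delete=False) as temp:
                shutil.copyfileobj(file_stream, temp)
        df = pd.read_csv(temp.name)
        os.remove(temp.name)
        dataframes.append(df)

# Stack all files; a column missing from a file is filled with NaN
combined_df = pd.concat(dataframes, ignore_index=True)

# Write recipe outputs
output = dataiku.Dataset("output_dataset_nam")
output.write_with_schema(combined_df)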
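
One thing to keep in mind with this approach: pd.concat keeps the union of all column names by default, which is probably why the Python route gives more columns than expected when the files have different headers. Here is a minimal sketch of that behaviour with two made-up frames (the column names are only for illustration, not taken from your files):

```python
import pandas as pd

# Two hypothetical files with partly different columns
a = pd.DataFrame({"id": [1, 2], "amount": [10, 20]})
b = pd.DataFrame({"id": [3], "name": ["Ann"]})

# Default join="outer": every column from every file is kept,
# and cells a file did not provide become NaN
combined = pd.concat([a, b], ignore_index=True)
print(list(combined.columns))

# join="inner": only the columns present in all files are kept
common = pd.concat([a, b], join="inner", ignore_index=True)
print(list(common.columns))
```

Neither result is wrong as such. Which one you want depends on whether the extra columns are meaningful for your output.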

You should use the Files in Folder dataset, as I explain in the following post. You can't aggregate data that has a different schema, so create one Files in Folder dataset for each distinct schema you have, and give each one a pattern that selects only the relevant CSV files from the folder.

The Files in Folder dataset lets you quickly load all files that share a schema into a single dataset, but you must make sure it only loads files with that schema. You can keep all 20 files in a single folder and feed each group of same-schema files to its own Files in Folder dataset.
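
To work out how many Files in Folder datasets you need, it can help to group the files by their header row first. The following is only a rough sketch using the Python standard library. The file names and contents are invented, and it assumes that two files share a schema exactly when their header rows match:

```python
import csv
import io
from collections import defaultdict

def header_of(stream):
    """Return the first CSV row of a text stream as a tuple of column names."""
    reader = csv.reader(stream)
    return tuple(name.strip() for name in next(reader, []))

def group_by_schema(files):
    """Group file names by header row. `files` maps a name to an open text stream."""
    groups = defaultdict(list)
    for name, stream in files.items():
        groups[header_of(stream)].append(name)
    return dict(groups)

# Hypothetical sample files standing in for the CSVs in the folder
samples = {
    "sales_jan.csv": io.StringIO("id,amount\n1,10\n"),
    "sales_feb.csv": io.StringIO("id,amount\n2,20\n"),
    "customers.csv": io.StringIO("id,name,city\n1,Ann,Oslo\n"),
}

groups = group_by_schema(samples)
for columns, names in groups.items():
    print(columns, "->", names)
```

Each group then maps to one Files in Folder dataset, with a file pattern that matches only the names in that group.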
