Append CSV

Solved!
mihnea
Level 2

Hello all,

I am new to Dataiku and would appreciate any support I can get :).

I have 3 .csv files that I loaded into a Dataiku folder (+Dataset -> Folder). The files have the same schema.

I want to append them in Dataiku to get one consolidated final output. The file names share the same prefix (first 3 characters). In the Python recipe, I don't know how to loop over each .csv in the folder and append it to the final output.

This is what I tried, but it throws an error:

Thank you!

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import glob

# Read recipe inputs
test_Mih = dataiku.Folder("V82hHFhe")
test_Mih_info = test_Mih.get_info()


# Compute recipe outputs
# TODO: Write here your actual code that computes the outputs
# all_files = glob.glob(test_Mih + "/*.csv")
#all_files = sorted(glob('test_Mih/30P*.csv'))

all_files = test_Mih.list_paths_in_partition()
 
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

# NB: DSS supports several kinds of APIs for reading and writing data. Please see doc.

final_test_Mih_df = frame


# Write recipe outputs
final_test_Mih = dataiku.Dataset("Final_test_Mih")
final_test_Mih.write_with_schema(final_test_Mih_df)

 

1 Solution
fchataigner2
Dataiker

If it's CSV, you can add the column names in the dataset's Schema. DSS reads the CSV by position, so as long as the field order is consistent across the files, it should be fine. Then you can export the dataset to CSV to get all the data in one chunk (with headers if you want).


4 Replies
fchataigner2
Dataiker

Hi,

If the 3 files have the same schema, you can create a FilesInFolder dataset on your folder (from the folder's actions tab). In the dataset's Settings > Connection, click Show Advanced options and add a filter that selects only the 3 files. You can then read the dataset as a single dataframe with the usual dataiku.Dataset(...).get_dataframe().
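If you would rather keep the loop in the Python recipe, here is a minimal sketch of one way to do it. The likely cause of the error in the question is that list_paths_in_partition() returns paths relative to the managed folder, not local file paths, so pd.read_csv(filename) cannot open them. The sketch streams each file through get_download_stream() instead. The helper name concat_folder_csvs and its prefix parameter are made up for illustration; the "30P" prefix, the folder id, and the dataset name are taken from the question.

```python
import pandas as pd


def concat_folder_csvs(folder, prefix="", **read_csv_kwargs):
    """Read every matching CSV in a managed folder into one DataFrame.

    `folder` is expected to behave like dataiku.Folder:
    list_paths_in_partition() returns paths relative to the folder root
    (for example "/30P_a.csv") and get_download_stream(path) returns a
    file-like object.
    """
    frames = []
    for path in sorted(folder.list_paths_in_partition()):
        name = path.rsplit("/", 1)[-1]
        # Keep only CSV files whose name starts with the requested prefix.
        if not (name.startswith(prefix) and name.lower().endswith(".csv")):
            continue
        # Stream the file through the folder API rather than opening the path.
        with folder.get_download_stream(path) as stream:
            frames.append(pd.read_csv(stream, **read_csv_kwargs))
    # Note: pd.concat raises ValueError if no file matched.
    return pd.concat(frames, axis=0, ignore_index=True)


# Inside a DSS Python recipe (ids and names as in the question):
#   import dataiku
#   folder = dataiku.Folder("V82hHFhe")
#   df = concat_folder_csvs(folder, prefix="30P")
#   dataiku.Dataset("Final_test_Mih").write_with_schema(df)
```

Extra keyword arguments are passed straight to pd.read_csv, so options such as the separator or encoding can be set per call.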

mihnea
Level 2
Author

By schema I mean the same structure, but in fact the files come without headers; that's something I would set up on the output... :(

fchataigner2
Dataiker

If it's CSV, you can add the column names in the dataset's Schema. DSS reads the CSV by position, so as long as the field order is consistent across the files, it should be fine. Then you can export the dataset to CSV to get all the data in one chunk (with headers if you want).
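For anyone doing the same thing in pandas rather than through the dataset Schema, the equivalent idea is to read each file with no header row and assign the column names by position. This is only a sketch: the column names and the sample rows below are invented for illustration, and the in-memory strings stand in for the real files.

```python
import io

import pandas as pd

# Hypothetical column names; replace with the real ones, in file order.
COLUMNS = ["id", "name", "amount"]


def read_headerless_csv(stream, columns):
    # header=None keeps pandas from treating the first data row as a header;
    # names= assigns the column names by position.
    return pd.read_csv(stream, header=None, names=columns)


# Two in-memory "files" with the same field order and no header row.
parts = [
    read_headerless_csv(io.StringIO("1,alpha,10.5\n2,beta,20.0\n"), COLUMNS),
    read_headerless_csv(io.StringIO("3,gamma,30.25\n"), COLUMNS),
]

# Stack the pieces; ignore_index gives one continuous row index.
combined = pd.concat(parts, ignore_index=True)
print(combined)
```

As with the Schema approach, this relies on the field order being the same in every file, since names are matched by position only.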

mihnea
Level 2
Author

Thanks a lot, it worked!
