Append CSV

mihnea Registered Posts: 3 ✭✭✭
edited July 16 in Using Dataiku

Hello all,

I am new to Dataiku and would appreciate any support I can get :).

I have 3 .csv files that I loaded into a Dataiku folder (+Dataset -> Folder). The files have the same schema.

I want to append them in Dataiku to get a consolidated final output. The file names share the same prefix (first 3 chars). When using the Python recipe, I don't know how to make it loop over each .csv in the folder and append it to the final output.

This is what I tried, but it throws an error:

Thank you!

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import glob

# Read recipe inputs
test_Mih = dataiku.Folder("V82hHFhe")
test_Mih_info = test_Mih.get_info()


# Compute recipe outputs
# TODO: Write here your actual code that computes the outputs
# all_files = glob.glob(test_Mih + "/*.csv")
#all_files = sorted(glob('test_Mih/30P*.csv'))

all_files = test_Mih.list_paths_in_partition()
 
li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

# NB: DSS supports several kinds of APIs for reading and writing data. Please see doc.

final_test_Mih_df = frame


# Write recipe outputs
final_test_Mih = dataiku.Dataset("Final_test_Mih")
final_test_Mih.write_with_schema(final_test_Mih_df)

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    If the files are CSV, you can add the column names in the dataset's Schema. DSS reads CSV by position, so as long as the field order is consistent across the files, it should be fine. You can then export the dataset to CSV to get all the data in one chunk (with headers if you want).
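
    For readers who want to see what "reading by position" looks like outside the DSS schema editor, here is a minimal pandas sketch. It is only an illustration, not what DSS does internally: the column names and the sample rows are made up, and the in-memory buffers stand in for the uploaded files.

```python
import io

import pandas as pd

# Hypothetical column names; replace them with the real field names.
COLUMN_NAMES = ["id", "name", "amount"]


def read_headerless_csv(source, column_names=COLUMN_NAMES):
    """Read a CSV that has no header row, naming the columns by position."""
    # header=None stops pandas from treating the first data row as a header.
    return pd.read_csv(source, header=None, names=column_names)


# Two in-memory stand-ins for the uploaded files (made-up sample rows).
file_a = io.StringIO("1,alice,10.5\n2,bob,20.0\n")
file_b = io.StringIO("3,carol,7.25\n")

# Stack the files on top of each other and renumber the rows.
combined = pd.concat(
    [read_headerless_csv(f) for f in (file_a, file_b)],
    ignore_index=True,
)
print(combined)
```

    As with the schema approach, this only gives sensible results if every file lists its fields in the same order.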

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    If the 3 files have the same schema, you can create a FilesInFolder dataset on your folder (from the folder's actions tab), then use Show Advanced options in the dataset's Settings > Connection and add a filter that selects only the 3 files. You can then read the dataset as a single dataframe with the usual dataiku.Dataset(...).get_dataframe().
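
    If you would rather keep the Python recipe from the question, the likely cause of the error is that list_paths_in_partition() returns paths relative to the managed folder, which pd.read_csv cannot open as local files. A possible workaround, sketched below, is to read each file through the folder's get_download_stream(). The helper name concat_folder_csvs and the column names in the usage comment are hypothetical; the function accepts any object exposing those two folder methods.

```python
import io
import posixpath

import pandas as pd


def concat_folder_csvs(folder, prefix="", **read_csv_kwargs):
    """Concatenate every CSV in a managed folder whose file name starts with prefix.

    `folder` is expected to behave like dataiku.Folder: it must provide
    list_paths_in_partition() and get_download_stream(path).
    """
    frames = []
    for path in sorted(folder.list_paths_in_partition()):
        name = posixpath.basename(path)
        if not (name.startswith(prefix) and name.lower().endswith(".csv")):
            continue
        # Paths are relative to the folder, so read through the folder API
        # rather than handing the path to pd.read_csv directly.
        with folder.get_download_stream(path) as stream:
            data = stream.read()
        frames.append(pd.read_csv(io.BytesIO(data), **read_csv_kwargs))
    if not frames:
        raise ValueError("no matching CSV files found in the folder")
    return pd.concat(frames, axis=0, ignore_index=True)


# Inside a DSS Python recipe this might be used as follows. The folder id,
# the "30P" prefix and the dataset name are taken from the question; the
# column names are hypothetical:
#
#   import dataiku
#   folder = dataiku.Folder("V82hHFhe")
#   frame = concat_folder_csvs(folder, prefix="30P",
#                              header=None, names=["col_a", "col_b"])
#   dataiku.Dataset("Final_test_Mih").write_with_schema(frame)
```

    Passing header=None together with names is only needed when the files have no header row; drop both arguments if they do.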

  • mihnea
    mihnea Registered Posts: 3 ✭✭✭

    By schema I mean the same structure, but in fact the files come without headers; that's something I would set up on the output... :(

  • mihnea
    mihnea Registered Posts: 3 ✭✭✭

    Thanks a lot, it worked!
