Creating a column from header data

SimonM · November 2020

Hi, I need to load a large number of CSV files from the server filesystem where the second row in each file contains a date field (one CSV = 1 date). I would like to copy the date to a new column in the data set. Appreciate any advice on how to do this. Thankyou

tim-wright · November 2020

@SimonM
Are the files you are referencing on the DSS Design Node? Are they stored on some other server? Are you trying to load them into a particular DSS project as managed datasets?

Is this a one-time process? Or are you needing this to occur as part of a flow?

I don't have first hand experience, but I'm happy to try to recreate your use locally to figure it out (if I can)

SimonM · November 2020

Thanks Tim. Where the files are stored is not really important. The CSVs are coming in daily but I also have a backlog of them. Each CSV has a 3 row header. The middle row contains some text including the date. I'd like to extract this date and place it in a new column as each CSV is loaded. I then merge all the CSVs into a single table for analysis. Initially I loaded the data using files in folder.

I did solve this pretty easily using MS PowerQuery but I can't figure out how to do it in DSS. I'm trying to use Visual Recipes.

Ignacio_Toledo · November 2020

Hi @SimonM
. Can you share a sample of your CSV files? That could be useful to help you.

Do you have something like this?

# First line of header bla bla bla
# Second line 2020-10-02 more info blablabal
# Third header line
column1  column2 column3 ... columnN

Could you confirm? (and if possible attach a small csv example... )

Cheers

I.

SimonM · November 2020

I don't have a sample on me at the moment but it is very simple: described below.

Title,
Interval Detail For: 01/06/2020,
"Customer Details",
Col1,Col2, Col3
Time, Decimal, Alpha
Time, Decimal, Alpha

Cheers,

tim-wright · November 2020

@SimonM, I am honestly not sure how to do this in a non-hacky way using visual recipes - though I am not saying it can't be done (I just don't know).

In the meantime I have written some logic that can combine a bunch of files (of the format you specified) from a single folder data source and concatenate those in a single step.

Flow:

Data inside sample_csvs:

recipe output (processing +stacking the 2 csvs):

Here is the code inside the recipe that I used to actually perform the logic. If this looks like it may work for you, I'm happy to help you tweak the python code for your specific need.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

#------------------Two helper functions ----------------------#
#-------------------------------------------------------------#

import re
import os
from datetime import datetime

def get_date(string, date_pattern):
""" define a regex pattern to extract first date match in a string"""
match = re.search(date_pattern, string)
date = datetime.strptime(match.group(1), '%m/%d/%Y').date()
return date


def process_file(file, header_rows, date_row, date_pattern):
"""
process a single csv. This function will skip `header_rows` rows and create a dataframe
from the remaining data. The date_pattern argument will be used to extract the first date
from date_row and insert it as a new column in the data. The modified dataframe will be
returned
"""
with open(file,'r') as f:
for i, line in enumerate(f):
if i == date_row-1:
date = get_date(line, date_pattern)

df = pd.read_csv(file, skiprows=header_rows)
df['date'] = date
return df

#-------------------------------------------------------------#
#-------------------------------------------------------------#

# Read recipe inputs
sample_csvs = dataiku.Folder("b5vBISdg")
folder_path = sample_csvs.get_path() # path to this data folder
files = sample_csvs.list_paths_in_partition() # file names in this folder
file_paths = [os.path.join(folder_path, file[1::]) for file in files] ## create paths to the actual csvs (will be a list of 2 here)
dfs = [] # empty list to store the results of each processed file

for path in file_paths:
dfs.append(process_file(path, header_rows=3, date_row=2, date_pattern='(\d{2}/\d{2}/\d{4})')) # Process each csv and append to list
stacked_dataset_df = pd.concat(dfs) # stack the processed csvs


# Write recipe outputs
stacked_dataset = dataiku.Dataset("stacked_dataset")
stacked_dataset.write_with_schema(stacked_dataset_df)

Ignacio_Toledo · November 2020

Nice solution @tim-wright
! I'm not sure neither if this would be possible with a visual recipe, but for this case I don't think we will find something that matches the efficiency of the python solution.

Unless there is some plugin that provides this functionality as a visual step!

SimonM · November 2020

Thanks @tim-wright
. It's been useful to understand the limitations of the Visual Recipes. I'm specifically trying to do this without code but appreciate your efforts.

I believe one way to do this with Visual Recipes might be to:

load the data twice
Input the filename into a new column using the Enrich processor in both copies
In the second copy, keep only the rows containing the "Interval Detail For" prefix
Use this second copy as a mapping table to create a new column in the original data set

I'm pretty sure this will work but it's a painful way of solving what should be a simple requirement.

Cheers

tim-wright · November 2020

@SimonM
Ah, I missed that part of your post. If you absolutely need to use the visual recipes, then yes, I think the approach you laid out would probably work.

If this is something you plan to do frequently or with a bunch of datasets, I'd propose that the investment time to develop a plugin (which would be minimal) is worth it so that users can simply leverage a UI to accomplish the task without knowing code or getting wrapped up with the logic you mention above.

Creating a column from header data

Answers

Categories

Setup Info

Tags