Creating a column from header data
Hi, I need to load a large number of CSV files from the server filesystem. The second row in each file contains a date field (one CSV = one date). I would like to copy the date to a new column in the dataset. Appreciate any advice on how to do this. Thank you
Answers
-
tim-wright
@SimonM
Are the files you are referencing on the DSS Design Node? Are they stored on some other server? Are you trying to load them into a particular DSS project as managed datasets? Is this a one-time process, or do you need this to occur as part of a flow?
I don't have first-hand experience with this, but I'm happy to try to recreate your use case locally and figure it out (if I can).
-
Thanks Tim. Where the files are stored is not really important. The CSVs are coming in daily, but I also have a backlog of them. Each CSV has a 3-row header. The middle row contains some text including the date. I'd like to extract this date and place it in a new column as each CSV is loaded. I then merge all the CSVs into a single table for analysis. Initially I loaded the data using a Files in Folder dataset.
I did solve this pretty easily using MS PowerQuery, but I can't figure out how to do it in DSS. I'm trying to use Visual Recipes.
-
Ignacio_Toledo
Hi @SimonM. Can you share a sample of your CSV files? That would make it easier to help you. Do you have something like this?
# First line of header bla bla bla
# Second line 2020-10-02 more info blablabal
# Third header line
column1 column2 column3 ... columnN
Could you confirm? (And if possible, attach a small CSV example...)
Cheers,
I.
-
I don't have a sample on me at the moment, but the structure is very simple, as described below.
Title,
Interval Detail For: 01/06/2020,
"Customer Details",
Col1,Col2, Col3
Time, Decimal, Alpha
Time, Decimal, Alpha
Cheers,
-
tim-wright
@SimonM, I am honestly not sure how to do this in a non-hacky way using visual recipes - though I am not saying it can't be done (I just don't know).
In the meantime I have written some logic that can combine a bunch of files (of the format you specified) from a single folder data source and concatenate those in a single step.
Flow (screenshot):
Data inside sample_csvs (screenshot):
Recipe output, processing + stacking the 2 CSVs (screenshot):
Here is the code inside the recipe that I used to actually perform the logic. If this looks like it may work for you, I'm happy to help you tweak the python code for your specific need.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
#------------------Two helper functions ----------------------#
#-------------------------------------------------------------#
import re
import os
from datetime import datetime
def get_date(string, date_pattern):
    """Extract the first date match in a string using the given regex pattern."""
    match = re.search(date_pattern, string)
    date = datetime.strptime(match.group(1), '%m/%d/%Y').date()
    return date

def process_file(file, header_rows, date_row, date_pattern):
    """
    Process a single CSV. This function skips `header_rows` rows and creates a dataframe
    from the remaining data. The `date_pattern` argument is used to extract the first date
    from `date_row` and insert it as a new column in the data. The modified dataframe is
    returned.
    """
    with open(file, 'r') as f:
        for i, line in enumerate(f):
            if i == date_row - 1:
                date = get_date(line, date_pattern)
                break
    df = pd.read_csv(file, skiprows=header_rows)
    df['date'] = date
    return df
#-------------------------------------------------------------#
#-------------------------------------------------------------#
# Read recipe inputs
sample_csvs = dataiku.Folder("b5vBISdg")
folder_path = sample_csvs.get_path() # path to this data folder
files = sample_csvs.list_paths_in_partition() # file names in this folder
file_paths = [os.path.join(folder_path, file[1:]) for file in files]  # full paths to the CSVs (a list of 2 in this example)
dfs = [] # empty list to store the results of each processed file
for path in file_paths:
    dfs.append(process_file(path, header_rows=3, date_row=2, date_pattern=r'(\d{2}/\d{2}/\d{4})'))  # process each CSV and append to the list
stacked_dataset_df = pd.concat(dfs) # stack the processed csvs
# Write recipe outputs
stacked_dataset = dataiku.Dataset("stacked_dataset")
stacked_dataset.write_with_schema(stacked_dataset_df)
-
Ignacio_Toledo
Nice solution @tim-wright! I'm not sure either whether this would be possible with a visual recipe, but in this case I don't think we will find anything that matches the efficiency of the Python solution. Unless there is some plugin that provides this functionality as a visual step!
-
Thanks @tim-wright. It's been useful to understand the limitations of the Visual Recipes. I'm specifically trying to do this without code, but I appreciate your efforts. I believe one way to do this with Visual Recipes might be to:
- load the data twice
- Input the filename into a new column using the Enrich processor in both copies
- In the second copy, keep only the rows containing the "Interval Detail For" prefix
- Use this second copy as a mapping table to create a new column in the original data set
I'm pretty sure this will work but it's a painful way of solving what should be a simple requirement.
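(For anyone who doesn't mind a little code, the mapping-table join in the steps above corresponds roughly to the pandas sketch below; the dataset and column names are made up purely for illustration.)
import pandas as pd

# Purely illustrative data: "full" stands in for the stacked CSV rows with a
# "filename" column added by the Enrich step; "headers" is the second copy,
# filtered down to the "Interval Detail For" rows. All names are hypothetical.
full = pd.DataFrame({
    'filename': ['a.csv', 'a.csv', 'b.csv'],
    'value': [1.2, 3.4, 5.6],
})
headers = pd.DataFrame({
    'filename': ['a.csv', 'b.csv'],
    'raw_row': ['Interval Detail For: 01/06/2020', 'Interval Detail For: 02/06/2020'],
})

# Extract the date from the header text, then join it back onto the data by filename
headers['date'] = headers['raw_row'].str.extract(r'(\d{2}/\d{2}/\d{4})', expand=False)
result = full.merge(headers[['filename', 'date']], on='filename', how='left')
print(result)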
Cheers
-
tim-wright
@SimonM
Ah, I missed that part of your post. If you absolutely need to use the visual recipes, then yes, I think the approach you laid out would probably work. If this is something you plan to do frequently or with a bunch of datasets, I'd propose that the time investment to develop a plugin (which would be minimal) is worth it, so that users can simply leverage a UI to accomplish the task without knowing code or getting wrapped up in the logic you mention above.
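For what it's worth, a custom plugin recipe would mostly be a thin wrapper around the same logic as the recipe above. A minimal sketch of what the plugin's recipe.py might contain is below; the role names ('input_folder', 'output_dataset') and parameter names ('header_rows', 'date_row', 'date_pattern') are placeholders that would have to match whatever the plugin's recipe.json declares.
# -*- coding: utf-8 -*-
import os
import re
from datetime import datetime
import dataiku
import pandas as pd
from dataiku.customrecipe import (get_input_names_for_role,
                                  get_output_names_for_role,
                                  get_recipe_config)

# Role and parameter names are placeholders; they must match the plugin's recipe.json
input_folder = dataiku.Folder(get_input_names_for_role('input_folder')[0])
output_dataset = dataiku.Dataset(get_output_names_for_role('output_dataset')[0])
config = get_recipe_config()

header_rows = int(config.get('header_rows', 3))
date_row = int(config.get('date_row', 2))
date_pattern = config.get('date_pattern', r'(\d{2}/\d{2}/\d{4})')

folder_path = input_folder.get_path()
dfs = []
for rel_path in input_folder.list_paths_in_partition():
    path = os.path.join(folder_path, rel_path[1:])
    # Pull the date out of the configured header row, then load the data rows
    with open(path, 'r') as f:
        for i, line in enumerate(f):
            if i == date_row - 1:
                date = datetime.strptime(
                    re.search(date_pattern, line).group(1), '%m/%d/%Y').date()
                break
    df = pd.read_csv(path, skiprows=header_rows)
    df['date'] = date
    dfs.append(df)

output_dataset.write_with_schema(pd.concat(dfs))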