Creating a column from header data

Options
SimonM
SimonM Registered Posts: 7 ✭✭✭✭

Hi, I need to load a large number of CSV files from the server filesystem where the second row in each file contains a date field (one CSV = 1 date). I would like to copy the date to a new column in the data set. Appreciate any advice on how to do this. Thankyou

Answers

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
    Options

    @SimonM
    Are the files you are referencing on the DSS Design Node? Are they stored on some other server? Are you trying to load them into a particular DSS project as managed datasets?

    Is this a one-time process? Or are you needing this to occur as part of a flow?

    I don't have first hand experience, but I'm happy to try to recreate your use locally to figure it out (if I can)

  • SimonM
    SimonM Registered Posts: 7 ✭✭✭✭
    Options

    Thanks Tim. Where the files are stored is not really important. The CSVs are coming in daily but I also have a backlog of them. Each CSV has a 3 row header. The middle row contains some text including the date. I'd like to extract this date and place it in a new column as each CSV is loaded. I then merge all the CSVs into a single table for analysis. Initially I loaded the data using files in folder.

    I did solve this pretty easily using MS PowerQuery but I can't figure out how to do it in DSS. I'm trying to use Visual Recipes.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hi @SimonM
    . Can you share a sample of your CSV files? That could be useful to help you.

    Do you have something like this?

    # First line of header bla bla bla
    # Second line 2020-10-02 more info blablabal
    # Third header line
    column1 column2 column3 ... columnN

    Could you confirm? (and if possible attach a small csv example... )

    Cheers

    I.

  • SimonM
    SimonM Registered Posts: 7 ✭✭✭✭
    Options

    I don't have a sample on me at the moment but it is very simple: described below.

    Title,
    Interval Detail For: 01/06/2020,
    "Customer Details",
    Col1,Col2, Col3
    Time, Decimal, Alpha
    Time, Decimal, Alpha

    Cheers,

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
    Options

    @SimonM, I am honestly not sure how to do this in a non-hacky way using visual recipes - though I am not saying it can't be done (I just don't know).

    In the meantime I have written some logic that can combine a bunch of files (of the format you specified) from a single folder data source and concatenate those in a single step.

    Flow:

    flow.PNG

     Data inside sample_csvs:

    csvs_folder.PNG

     recipe output (processing +stacking the 2 csvs):

    stacked_dataset.PNG

    Here is the code inside the recipe that I used to actually perform the logic. If this looks like it may work for you, I'm happy to help you tweak the python code for your specific need.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    #------------------Two helper functions ----------------------#
    #-------------------------------------------------------------#

    import re
    import os
    from datetime import datetime

    def get_date(string, date_pattern):
    """ define a regex pattern to extract first date match in a string"""
    match = re.search(date_pattern, string)
    date = datetime.strptime(match.group(1), '%m/%d/%Y').date()
    return date


    def process_file(file, header_rows, date_row, date_pattern):
    """
    process a single csv. This function will skip `header_rows` rows and create a dataframe
    from the remaining data. The date_pattern argument will be used to extract the first date
    from date_row and insert it as a new column in the data. The modified dataframe will be
    returned
    """
    with open(file,'r') as f:
    for i, line in enumerate(f):
    if i == date_row-1:
    date = get_date(line, date_pattern)

    df = pd.read_csv(file, skiprows=header_rows)
    df['date'] = date
    return df

    #-------------------------------------------------------------#
    #-------------------------------------------------------------#

    # Read recipe inputs
    sample_csvs = dataiku.Folder("b5vBISdg")
    folder_path = sample_csvs.get_path() # path to this data folder
    files = sample_csvs.list_paths_in_partition() # file names in this folder
    file_paths = [os.path.join(folder_path, file[1::]) for file in files] ## create paths to the actual csvs (will be a list of 2 here)
    dfs = [] # empty list to store the results of each processed file

    for path in file_paths:
    dfs.append(process_file(path, header_rows=3, date_row=2, date_pattern='(\d{2}/\d{2}/\d{4})')) # Process each csv and append to list
    stacked_dataset_df = pd.concat(dfs) # stack the processed csvs


    # Write recipe outputs
    stacked_dataset = dataiku.Dataset("stacked_dataset")
    stacked_dataset.write_with_schema(stacked_dataset_df)

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Nice solution @tim-wright
    ! I'm not sure neither if this would be possible with a visual recipe, but for this case I don't think we will find something that matches the efficiency of the python solution.

    Unless there is some plugin that provides this functionality as a visual step!

  • SimonM
    SimonM Registered Posts: 7 ✭✭✭✭
    Options

    Thanks @tim-wright
    . It's been useful to understand the limitations of the Visual Recipes. I'm specifically trying to do this without code but appreciate your efforts.

    I believe one way to do this with Visual Recipes might be to:

    • load the data twice
    • Input the filename into a new column using the Enrich processor in both copies
    • In the second copy, keep only the rows containing the "Interval Detail For" prefix
    • Use this second copy as a mapping table to create a new column in the original data set

    I'm pretty sure this will work but it's a painful way of solving what should be a simple requirement.

    Cheers

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
    Options

    @SimonM
    Ah, I missed that part of your post. If you absolutely need to use the visual recipes, then yes, I think the approach you laid out would probably work.

    If this is something you plan to do frequently or with a bunch of datasets, I'd propose that the investment time to develop a plugin (which would be minimal) is worth it so that users can simply leverage a UI to accomplish the task without knowing code or getting wrapped up with the logic you mention above.

Setup Info
    Tags
      Help me…