Importing data with slightly differing schema (S3 connection)

Noah
Noah Registered Posts: 43 ✭✭✭✭

Is it possible to import data into one database if there are two versions of the spreadsheet with slightly different schema?

For instance I am importing a spreadsheet called Sales (a legacy version and a new one).

The legacy version has columns A, B, C, D and the new one has A, B, C, D, E. Can they feed into the same dataset even though the new version has 5 columns? How can I have Dataiku handle the slight variation in format dynamically?

Best Answer

  • tgb417
    tgb417 Neuron, Registered Posts: 1,601 Neuron
    Answer ✓

    @nshapir2,

    Header parsing is controlled on the same screen.

    I don't have a good answer about in-between columns being added. The first thing I'd likely do is import the two datasets separately, align the columns, and then stack the results with a Stack visual recipe (divide and conquer).

    If I needed to do this in one go, I'd either treat every column as text and build recipes to line up the columns, or work on cleaning up the upstream source of the data.

    If none of those were possible, I might move to something in Python, parsing the files one column at a time and building up the schema. Depending on the size of the dataset, your Python skills (or your willingness to pick up a few), and the resources you have on the server, I still believe this would be doable.
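
    To make that concrete, here's a minimal pandas sketch of the "treat everything as text, line up the columns, then stack" idea. The file names and the target column list are just placeholders for your legacy and new exports:

    import pandas as pd
    
    # hypothetical exports of the two spreadsheet versions
    legacy_df = pd.read_csv("sales_legacy.csv", dtype=str)  # columns A,B,C,D
    new_df = pd.read_csv("sales_new.csv", dtype=str)        # columns A,B,C,D,E
    
    # target schema is the "new" layout; reindex aligns columns by name
    # and fills anything the legacy file lacks (column E) with NaN
    target_columns = ["A", "B", "C", "D", "E"]
    legacy_df = legacy_df.reindex(columns=target_columns)
    new_df = new_df.reindex(columns=target_columns)
    
    # stack the two aligned frames into one dataset
    combined = pd.concat([legacy_df, new_df], ignore_index=True)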

    One possibility regarding the extra columns to the right of your data: I've seen situations where a spreadsheet has a single cell filled in way off to the right, and sometimes many lines down below. (This can even show up after the actual cell was deleted, because MS Excel has already allocated the space for the extra columns.) Again, I'd look upstream of Dataiku DSS to see if I can clean up the file at the source of the data.
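
    If the culprit really is Excel's leftover allocation, a quick way to check (and drop) those phantom columns in Python is to remove any column that is entirely empty. A rough sketch; the file name is just a placeholder:

    import pandas as pd
    
    # hypothetical export of the spreadsheet
    df = pd.read_csv("sales_export.csv", dtype=str)
    
    # columns Excel allocated but never filled come through entirely empty;
    # dropping them leaves only the columns that actually contain data
    df = df.dropna(axis=1, how="all")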

    Another possibility is the line terminator; this may be fixable by changing the Quoting style option.
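
    If you want to test that theory outside of DSS, pandas exposes an equivalent knob when reading the file. This is only an illustrative sketch (the file name is a placeholder), not the DSS setting itself:

    import csv
    import pandas as pd
    
    # ignore quote characters entirely while parsing, which is one way to see
    # whether stray quotes or embedded line terminators are splitting your rows
    df = pd.read_csv("sales_export.csv", dtype=str, quoting=csv.QUOTE_NONE)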

    Without seeing the files in question, there's not much more I can do to help.

    Good luck, have a great day.

Answers

  • tgb417
    tgb417 Neuron, Registered Posts: 1,601 Neuron

    @nshapir2

    Yes, I would hope that this would be doable.

    I had a similar problem a while back with .CSV files. I found that the "Add. columns behavior (read)" setting was helpful. I've not tried this on S3; see if there is a similar feature there.

    Open datafiles with different column counts.jpg

  • Noah
    Noah Registered Posts: 43 ✭✭✭✭

    I found that setting. The problem is that the column being added is the 4th column in the dataset, and Dataiku cannot seem to handle it. There are 12 columns originally, but it seems to be generating hundreds of blank columns.

    Also, it's ignoring the headers...

  • Noah
    Noah Registered Posts: 43 ✭✭✭✭

    Thanks for your help.

    You as well.

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @nshapir2,

    To follow up on the Python option mentioned, it would be fairly straightforward to combine the legacy data with the new data in a Python recipe and sync that table to S3. Provided that the final data now fits your "new" schema going forward, you should be set after that.

    For just a brief example of how you might quickly do this in Python:

    import dataiku
    import pandas
    
    # a folder that contains files of both schemas
    folder = dataiku.Folder("YOUR_FOLDER_ID")
    base_path = folder.get_path()
    paths = folder.list_paths_in_partition()
    
    # now, concat the different datasets together into one final schema
    
    # read in the first file
    out_df = pandas.read_csv(base_path + paths[0])
    
    # for each remaining file, concat with the existing dataframe
    for path in paths[1:]:
        df = pandas.read_csv(base_path + path)
        # this is where the nice schema concatenation happens:
        # columns are aligned by name and missing ones are filled with NaN
        out_df = pandas.concat([out_df, df], axis=0, ignore_index=True)
    
    # write the combined dataframe to the output dataset
    output = dataiku.Dataset("output")
    output.write_with_schema(out_df)
    

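    Note that pandas.concat aligns columns by name, so rows coming from the legacy file simply end up with empty values in column E, and the output schema is the union of the columns seen across all of the files.
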
    To walk through an example, I have a managed folder with both a "legacy" file and a "new data" file:

    Screen Shot 2021-04-20 at 4.49.42 PM.png

    Screen Shot 2021-04-20 at 4.49.48 PM.png

    I create a Python recipe with the code above, with this folder as input and an S3 dataset as output:

    Screen Shot 2021-04-20 at 4.55.10 PM.png

    And then here's my output dataset, with the final schema of "A, B, C, D, E" and data from both datasets:

    Screen Shot 2021-04-20 at 4.53.54 PM.png

    Thanks,
    Sarina 
