Importing data with slightly differing schema (S3 connection)

Noah
Noah Registered Posts: 43 ✭✭✭✭

Is it possible to import data into one database if there are two versions of the spreadsheet with slightly different schema?

For instance I am importing a spreadsheet called Sales (a legacy version and a new one).

The legacy version has columns A, B, C, D and the new one has A, B, C, D, E. Can they feed into the same dataset even though the new version has 5 columns? How can I have Dataiku handle the slight variation in format dynamically?

Best Answer

  • tgb417
    tgb417 Neuron, Registered Posts: 1,601 Neuron
    Answer ✓

    @nshapir2,

    Header parsing is controlled on the same screen.

    I don't have a good answer about in-between columns being added. The first thing I'd likely do is import the two datasets separately, align the columns, and then stack the results with a Stack visual recipe (divide and conquer).

    If I needed to do this in one go, I'd either treat every column as text and build recipes to line up the columns, or work on cleaning up the upstream source of the data.

    If none of those were possible, I might move to something in Python, parsing the files one column at a time and building up the schema. Depending on the size of the dataset, your Python skills (or your willingness to pick up a few), and the resources you have on the server, I still believe this would be doable.
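
    To make that concrete, here's a minimal pandas sketch of the "treat everything as text, line up the columns, then stack" idea. The file names and the target column list are just placeholders for your legacy and new exports:

    import pandas as pd
    
    # hypothetical exports of the two spreadsheet versions
    legacy_df = pd.read_csv("sales_legacy.csv", dtype=str)  # columns A,B,C,D
    new_df = pd.read_csv("sales_new.csv", dtype=str)        # columns A,B,C,D,E
    
    # target schema is the "new" layout; reindex aligns columns by name
    # and fills anything the legacy file lacks (column E) with NaN
    target_columns = ["A", "B", "C", "D", "E"]
    legacy_df = legacy_df.reindex(columns=target_columns)
    new_df = new_df.reindex(columns=target_columns)
    
    # stack the two aligned frames into one dataset
    combined = pd.concat([legacy_df, new_df], ignore_index=True)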

    One possibility regarding the extra columns to the right of your data: I've seen situations where a spreadsheet has a single cell filled in way off to the right, and sometimes many lines down below. (This can even show up after the actual cell was deleted, because MS Excel has already allocated the space for the extra columns.) Again, I'd look upstream of Dataiku DSS to see if I can clean up the file at the source of the data.
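
    If the culprit really is Excel's leftover allocation, a quick way to check (and drop) those phantom columns in Python is to remove any column that is entirely empty. A rough sketch; the file name is just a placeholder:

    import pandas as pd
    
    # hypothetical export of the spreadsheet
    df = pd.read_csv("sales_export.csv", dtype=str)
    
    # columns Excel allocated but never filled come through entirely empty;
    # dropping them leaves only the columns that actually contain data
    df = df.dropna(axis=1, how="all")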

    Another possibility is the line terminator; this may be fixable by changing the Quoting style option.
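
    If you want to test that theory outside of DSS, pandas exposes an equivalent knob when reading the file. This is only an illustrative sketch (the file name is a placeholder), not the DSS setting itself:

    import csv
    import pandas as pd
    
    # ignore quote characters entirely while parsing, which is one way to see
    # whether stray quotes or embedded line terminators are splitting your rows
    df = pd.read_csv("sales_export.csv", dtype=str, quoting=csv.QUOTE_NONE)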

    Without seeing the files in question, there's not much more I can do to help.

    Good luck, have a great day.

Answers

  • tgb417
    tgb417 Neuron, Registered Posts: 1,601 Neuron

    @nshapir2

    Yes, I would hope that this would be doable.

    I had a similar problem a while back with .CSV files. I found that the "Add. columns behavior (read)" setting was helpful. I've not tried this on S3; see if there is a similar feature there.

    Open datafiles with different column counts.jpg

  • Noah
    Noah Registered Posts: 43 ✭✭✭✭

    I found that setting. The problem is that the column being added is the 4th column in the dataset, and Dataiku cannot seem to handle it. There are 12 columns originally, but it seems to be generating hundreds of blank columns.

    Also, it's ignoring the headers...

  • Noah
    Noah Registered Posts: 43 ✭✭✭✭

    Thanks for your help.

    You as well.

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @nshapir2,

    To follow up on the Python option mentioned, it would be fairly straightforward to combine the legacy data with the new data in a Python recipe and sync that table to S3. Provided that the final data now fits your "new" schema going forward, you should be set after that.

    For just a brief example of how you might quickly do this in Python:

    import dataiku
    import pandas
    
    # a folder that contains files of both schemas
    folder = dataiku.Folder("YOUR_FOLDER_ID")
    base_path = folder.get_path()
    paths = folder.list_paths_in_partition()
    
    # now, concat the different datasets together into one final schema
    
    # read in the first file
    out_df = pandas.read_csv(base_path + paths[0])
    
    # for each remaining file, concat with the existing dataframe
    for path in paths[1:]:
        df = pandas.read_csv(base_path + path)
        # this is where the nice schema concatenation happens:
        # columns are aligned by name and missing ones are filled with NaN
        out_df = pandas.concat([out_df, df], axis=0, ignore_index=True)
    
    # write the combined dataframe to the output dataset
    output = dataiku.Dataset("output")
    output.write_with_schema(out_df)
    

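    Note that pandas.concat aligns columns by name, so rows coming from the legacy file simply end up with empty values in column E, and the output schema is the union of the columns seen across all of the files.
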
    To walk through an example, I have a managed folder with both a "legacy" file and a "new data" file:

    Screen Shot 2021-04-20 at 4.49.42 PM.png

    Screen Shot 2021-04-20 at 4.49.48 PM.png

    I create a Python recipe with the code above, with this folder as input and an S3 dataset as output:

    Screen Shot 2021-04-20 at 4.55.10 PM.png

    And then here's my output dataset, with the final schema of "A, B, C, D, E" and data from both datasets:

    Screen Shot 2021-04-20 at 4.53.54 PM.png

    Thanks,
    Sarina 
