Importing data with slightly differing schema (S3 connection)

Solved!
nshapir2
Level 1

Is it possible to import data into one database if there are two versions of the spreadsheet with slightly different schema?

 

For instance, I am importing a spreadsheet called Sales (a legacy version and a new one).

The legacy version has columns A, B, C, D and the new one has A, B, C, D, E. Can they feed into the same dataset even though the new version has 5 columns? How can I have Dataiku handle the slight variation in format dynamically?

5 Replies
tgb417

@nshapir2 

Yes, I would hope that this would be doable.  

I had a similar problem a while back with .CSV files. I found that the "Add. columns behavior (read)" setting was helpful. I've not tried this on S3; see if there is a similar option there.

[Screenshot: opening data files with different column counts using the "Add. columns behavior (read)" setting]

 

--Tom
nshapir2
Level 1
Author

I found that setting. The problem is that the column being added is the 4th column in the dataset, and Dataiku cannot seem to handle it. There are 12 columns originally, but it seems to be generating hundreds of blank columns.

 

Also, it's ignoring the headers...

tgb417

@nshapir2 ,

Parsing Header is controlled on the same screen.

I don't have a good answer about in-between columns being added. The first thing I'd likely do is import the two datasets separately, align the columns, and then stack the results with a Stack visual recipe. (Divide and conquer.)

If I needed to do this in one go, I'd either treat every column as text and build recipes to line up the columns, or work on cleaning up the upstream source of the data.

If none of those were possible, at that point I might move to something in Python, parsing the files one column at a time and building up the schema. Depending on the size of the dataset, your Python skills (or willingness to gain a few), and maybe the resources you have on the server, I still believe this would be doable.
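Just to make that concrete, here's a rough sketch of the "everything as text, then line the columns up" idea using pandas. The folder path is a placeholder, and the A-E column names just follow your example:

import glob
import pandas as pd

# placeholder path: wherever both versions of the Sales export live as CSVs
files = glob.glob("/path/to/sales_exports/*.csv")

# the full schema you want to end up with (the new version's columns)
target_columns = ["A", "B", "C", "D", "E"]

frames = []
for f in files:
    # read every column as text so type guessing can't fight the schema
    df = pd.read_csv(f, dtype=str)
    # reindex lines the columns up against the target schema;
    # columns missing from a legacy file come back empty
    frames.append(df.reindex(columns=target_columns))

combined = pd.concat(frames, ignore_index=True)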

One possibility regarding the extra columns to the right of your data: I've seen situations where a spreadsheet has a single cell filled in way off to the right, and sometimes many lines down below. (This can even show up after the cell itself has been deleted, because MS Excel has already allocated the space for the extra columns.) Again, I'd look upstream of Dataiku DSS to see if I can clean up the file at the source of the data.
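If you do end up going the Python route and some of those phantom columns make it into a dataframe, pandas can drop any column that is completely empty:

# drop columns that contain no values at all
df = df.dropna(axis=1, how="all")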

Another possibility is the line terminator; this may be fixable by changing the Quoting style option.

Without seeing the files in question, there's not much more I can do to help.

Good luck, have a great day.

--Tom
nshapir2
Level 1
Author

Thanks for your help. 

 

You as well. 

SarinaS
Dataiker

Hi @nshapir2 ,

To follow up on the Python option mentioned, it would be fairly straightforward to combine the legacy data with the new data in a Python recipe, and sync that table to S3. Provided that the final data fits your "new" schema going forward, you should be set after that.

For just a brief example of how you might quickly do this in Python:

import dataiku
import pandas
import numpy as np
from dataiku import pandasutils as pdu

# a folder that contains files of both schemas
folder = dataiku.Folder("YOUR_FOLDER_ID")
base_path = folder.get_path()
paths = folder.list_paths_in_partition()

# now, concat the different datasets together into one final schema 

# read in the first file
out_df = pandas.read_csv(base_path + paths[0])

# for each remaining file, concat it with the existing dataframe
for path in paths[1:]:
    df = pandas.read_csv(base_path + path)
    # this is where the nice schema concatenation happens
    out_df = pandas.concat([out_df, df], axis=0, ignore_index=True)

output = dataiku.Dataset("output")
output.write_with_schema(out_df)
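Since pandas.concat aligns the frames on column names, the rows that come from the legacy file simply end up with empty values in column E, and the output dataset picks up the full five-column schema automatically.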

 

To walk through an example, I have a managed folder with both a "legacy" file and a "new data" file: 

[Screenshots: the managed folder containing the legacy file and the new data file]

I create a Python recipe with my code above, with this folder as input and an S3 dataset as output:

[Screenshot: the Python recipe flow with the managed folder as input and the S3 dataset as output]

And then here's my output dataset, with the final schema of "A, B, C, D, E" and data from both datasets:

[Screenshot: the output dataset with the final schema A, B, C, D, E and rows from both files]


Thanks,
Sarina
