Importing data with slightly differing schema (S3 connection)
Is it possible to import data into one database if there are two versions of the spreadsheet with slightly different schema?
For instance, I am importing a spreadsheet called Sales (a legacy version and a new one).
The legacy version has columns A, B, C, D and the new one has A, B, C, D, E. Can they feed into the same dataset even though the new version has 5 columns? How can I have Dataiku handle the slight variation in format dynamically?
Best Answer
-
tgb417
Header parsing is controlled on the same screen.
I don't have a good answer about in-between columns being added. The first thing I'd likely do is import the two datasets separately, align the columns, and then stack the results with a Stack visual recipe. (Divide and conquer.)
If I needed to do this in one go, I'd either treat every column as text and build recipes to line up the columns, or work on cleaning up the upstream source of the data.
If none of those were possible, I might move to something in Python, parsing the files one column at a time and building up the schema. Depending on the size of the dataset, your Python skills (or willingness to gain a few), and the resources you have on the server, I still believe this would be doable.
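For illustration, here is a minimal sketch of the "treat every column as text and line them up" idea in pandas. The file names and the target column list are hypothetical; in DSS this would typically run inside a Python recipe reading from a managed folder.

import pandas as pd

# Hypothetical exports of the two spreadsheet versions, read with every column as text
legacy = pd.read_csv("sales_legacy.csv", dtype=str)   # columns A, B, C, D
new = pd.read_csv("sales_new.csv", dtype=str)         # columns A, B, C, D, E

# Align both frames to the union of columns; columns missing from a file are filled with NaN
all_columns = ["A", "B", "C", "D", "E"]
legacy = legacy.reindex(columns=all_columns)
new = new.reindex(columns=all_columns)

# Stack the aligned frames into one combined dataset
combined = pd.concat([legacy, new], ignore_index=True)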
One possibility regarding the extra columns to the right of your data: I've seen situations where a spreadsheet has a single cell filled in way off to the right, and sometimes many lines down below. (This can even show up after the actual cell has been deleted, because MS Excel has already allocated space for the extra columns.) Again, I'd look upstream of Dataiku DSS to see if the file can be cleaned up at the source of the data.
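If cleaning the file at the source isn't possible, a small sketch of trimming those phantom columns in a Python recipe could look like the following; the "Unnamed:" prefix is the label pandas gives to header-less columns on read, and the file name is an assumption.

import pandas as pd

# Hypothetical export with trailing phantom columns left over from Excel
df = pd.read_csv("sales_export.csv", dtype=str)

# Drop columns that pandas auto-named ("Unnamed: 12", ...) and that are entirely empty
phantom = [c for c in df.columns if c.startswith("Unnamed:") and df[c].isna().all()]
df = df.drop(columns=phantom)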
Another possibility is the line terminator; this may be fixable by changing the Quoting style option.
Without seeing the files in question, there is not much more I can do to help.
Good luck, have a great day.
Answers
-
tgb417
Yes, I would hope that this would be doable.
I had a similar problem a while back with .CSV files. I found that the "Add. columns behavior (read)" setting was helpful. I've not tried this on S3; see if there is a similar feature.
-
I found that setting. The problem is that the column being added is the 4th column in the dataset, and Dataiku cannot seem to handle it. There are 12 columns originally, but it seems to be generating hundreds of blank columns.
Also, it's ignoring the headers...
-
Thanks for your help.
You as well.
-
Sarina (Dataiker)
Hi @nshapir2,
To follow up on the Python option mentioned, it would be fairly straightforward to combine the legacy data with the new data in a Python recipe and sync that table to S3. Provided that the final data fits your "new" schema going forward, you should be set after that.
For just a brief example of how you might quickly do this in Python:
import dataiku
import pandas
import numpy as np
from dataiku import pandasutils as pdu

# a folder that contains files of both schemas
folder = dataiku.Folder("YOUR_FOLDER_ID")
base_path = folder.get_path()
paths = folder.list_paths_in_partition()

# now, concat the different datasets together into one final schema
# read in the first file
out_df = pandas.read_csv(base_path + paths[0])

# for each remaining file, concat with the existing dataframe
for path in paths[1:]:
    df = pandas.read_csv(base_path + path)
    # this is where the nice schema concatenation happens
    out_df = pandas.concat([out_df, df], axis=0, ignore_index=True)

output = dataiku.Dataset("output")
output.write_with_schema(out_df)
To walk through an example, I have a managed folder with both a "legacy" file and a "new data" file:
I create a Python recipe with my code above, with this folder as input and an S3 dataset as output:
And then here's my output dataset, with the final schema of "A, B, C, D, E" and data from both datasets:
Thanks,
Sarina