Identify new/changed columns in a new file before dataset build

irinasi · August 2023

Hi,

I'm trying to find a way to identify the new columns that are added to a file, but with no succes.

My use case is like this:

- we have a SFTP, where user save excel files

- we use an SFTP connection in dataiku to connect to this files and load in datasets and afterwards in a database

- it may be possible that the user modify the column names or add columns to the excel files

I need to be able to identify the changes in the files and which are the changes bbefore the dataset in build and to be done automatically.

The column count metric doesn't work, because I need more details.

Is there a way to achieve this in dataiku?

Thanks,

Irina

tgb417 · August 2023

@irinasi

Welcome to the Dataiku Community, we are so glad to have you join us.

I like your question.

Dataiku tends to be picky about the schema of the data. In many ways this is a good thing. However, if you have a large number of humans hand entering data is spreadsheets. They are very likely to make changes.

If I was taking on this specific challenge, I would start by thinking about how much data cleanup / guessing about the data I'd be willing to do. If the data providers have changed the format of the data. Do I trust that I can automatically figure out what each column or cell of data actually means. Or will I through out the data. Or will I return it to the source for cleanup. (That is a human question, not so much of a technology question.)

Once I've thought through that a bit the questions comes to tools. Because dataiku is fairly strickt about it's schema's, I'd likely use a Python Code recipe to ingest the data and then sort out the details there.

Other here please feel free to jump in here if you deal a lot with this. (I only do this infrequently and tend to engage with the data by hand to understand the meaning of the descrepancies.) I tend to like the fact that dataiku throws errors when the data does not meet original agreements on what should be where.

Just my $0.02. Hope it helps a little bit.

irinasi · August 2023

Hi,

the next step would be to figure it out how to send emails(automatically) to users if columns have changed in the source file, with the columns that are missing or changed, so that they can modify the source file and reupload it to sftp.

Sending email from Dataiku seams to be easy, the tricky part is identifying the changes of columns and put them in a message.

Maybe I should explore Python ways ..

I'm used to work with an integration tool from Microsoft where this kind of details(columns missing, name changes) can be stored in variables and used in body emails.

Thanks!

Turribeach · August 2023

The column count metric doesn't work, because I need more details.

>> I will need more details on this statetment to understand why you think it doesn't work. If you all you want is to send an email of the number of columns in your spreadsheet changed then this is certainly something that can be done with a metric and reported via a scenario mail reporter. So can you give the full requirement if what you are trying to achieve? Thanks

irinasi · August 2023

Hi,

The reason is that I'll get only the number of columns, but I need also details about those columns that are missing : like the name(maybe the name of a column has changed and the user needs to check it in teh source file).

thanks

Turribeach · August 2023

Whenever posting technical questions please always state all your assumptions clearly because if they are wrong then maybe the whole path to achieve that solution may be wrong too.

I certainly believe your requirement is posible to do with metrics. Crucially you can create custom metrics using Python. Here is one I just created:

# Define here a function that returns the metric.
def process(dataset, partition_id):
     
    df = dataset.get_dataframe()
    columns_list = '/'.join(list(df.columns.values))
    return {'metric_name1' : 42, 'metric_name2' : True, 'metric_name3' : columns_list}

This metric is merely concatenating all the column names into a metric you can then access from a scenario. Of course this is not the code you need but proves you can calculate what you want. So how would this work? You will need to calculate the number of columns metric and the one that gives you the different columns that were added or removed. Then you can use this technique I wrote about to retrieve the values of the metrics in your scenario putting them in scenario variables giving you the ability to execute conditional scenario steps and even include those variables in an email. That's the easy part!

The challenge you have is that in order to be able to know if the number of columns changed you need two different columns counts, the before and after. The issues around this are explained in this product idea I raised which I encourage you to vote for. So to cut the story short you need to have the previous value somehwere in order to compare against. In the idea I suggested using two datasets of the same XLS file, one syncing the next one, so you can "build" the first dataset, calculate metrics, retrieve the metric values and then calculate the metrics of the second dataset without building it (which still has yesterday's file hence old columns) and then retrieve the metrics and compare them so that you can check whether the columns have changed and trigger your conditional email.

It will certainly need some Python coding for the custom metric but most of the rest I mentioned is via visual recipes and formulas.

Identify new/changed columns in a new file before dataset build

Answers

Categories

Setup Info

Tags