Count columns on a file-based dataset (from the file, not from the schema)

PM
Level 2

Hi all,

I noticed that the 'Column Count' metric on the dataset Status tab is based on schema information, unlike the rest of the metrics, which analyze the file.

I need to compute the column count from the file's data. This can be achieved on the Settings > Files tab by pressing the TEST button:

["Mismatch in number of columns: 2 in schema, 3 detected in data"]

But I need to do it automatically, without user interaction. Can it be computed with a check or with Python code?


Thanks!

EstelleB
Dataiker

Hello,

You can get the column count with a Python probe this way:

import dataiku

def process(dataset, partition_id):
    # Load the dataset into a pandas DataFrame and count its columns
    df = dataset.get_dataframe()
    columns_count = len(df.columns)
    return {'columns_count': columns_count}
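
(A probe like this can be added from the dataset's Status > Edit screen as a custom Python probe; the exact menu labels may vary across DSS versions.)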

Though I am not sure it will help you. For that, I would need more context about what you want/need to do with this metric that you can't do with the one already provided 🙂

Have a nice day,

Estelle

PM
Level 2
Author

What I need is to identify changes in the columns of a file. I mean, I defined a dataset from a file with 2 columns, but next week the file is modified with a new column (I'm not the file owner, and I don't want to refresh the file schema every week), so I need to identify this change to know whether my jobs are still valid.

The Column Count metric always shows 2 columns (because that is the number of columns the dataset had when it was defined: the SCHEMA), and the code you provided does the same thing: it counts the number of columns from the schema.
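
For illustration, here is a minimal sketch of that comparison: counting the columns from the raw file itself and checking them against the saved schema. The dataset name and file path below are hypothetical, and the sketch assumes the CSV has a header row and is directly readable from where the code runs:

import dataiku
import pandas as pd

# Hypothetical names: adjust the dataset name and file path to your setup
DATASET_NAME = "my_csv_dataset"
FILE_PATH = "/path/to/source/file.csv"

# Column count according to the saved schema (what the built-in metric reports)
schema_columns = len(dataiku.Dataset(DATASET_NAME).read_schema())

# Column count according to the file itself: read only the header row
file_columns = len(pd.read_csv(FILE_PATH, nrows=0).columns)

if file_columns != schema_columns:
    print("Mismatch: %d in schema, %d detected in data" % (schema_columns, file_columns))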

EstelleB
Dataiker

From what I understand, your flow takes as input a file whose number of columns will often change, but you only need 2 of its columns, which remain the same over time.

In DSS, some recipes select all columns of a dataset by default, like the "Sync" or "Sample/Filter" recipes. This can require you to update the schema when it changes.

To work with your flow input, you can instead choose a recipe where you can specify the columns you want to work with, like a "Prepare" recipe (with a first step "Keep only these columns"), a "Group by" or a "Join" recipe, or a SQL recipe if relevant. Then the number of columns in your output dataset will always stay the same, and you won't need to update the schema.

Does that meet your needs?

Estelle
PM
Level 2
Author

No, that's not my problem.

Excuse me if my explanations are not clear. I will try to explain with an example, step by step:

1. Define a dataset from a file (CSV). The file has 2 columns.

2. Somebody modifies the file, adding a new column.

3. My definition of the dataset is still the same as in point 1: a dataset with 2 columns. While exploring the dataset, if I go to Settings > Files > TEST, I can see there is a problem:

[Screenshot: TEST result showing the warning "Mismatch in number of columns: 2 in schema, 3 detected in data"]


4. I need to detect this problem without manual interaction (without manually clicking the TEST button).

The checks/metrics on the Status tab analyze the file, except for the Column Count metric, which analyzes the schema.

[Screenshot: the Status tab metrics, with Column Count based on the schema]


I hope it is clearer now.


Thanks again.

EstelleB
Dataiker

Hello,

Thank you for the details; that helps me understand better.

If you need to detect that a column was added to the file, you can go to Settings > Format/Preview.

Then at the bottom, you'll find the option "Add. columns behavior (read)":

[Screenshot: the "Add. columns behavior (read)" option in Settings > Format/Preview]

I personally don't know how to easily retrieve a file's "Warnings" message, but here is what you can do while waiting for a better solution:

1/ Select "Error". It means that your dataset won't be built when a new column is added, but you may be able to find a workaround.

2/ Then, in a scenario, you can add a reporter which, thanks to a scenario variable, sends you the 'message' computed by this Python step:

import dataiku

# Look through the dataset's metric runs for the "unexpected column" error
projectKey = dataiku.default_project_key()
client = dataiku.api_client()
project = client.get_project(projectKey)
dataset = project.get_dataset("test_csv")
metrics = dataset.compute_metrics()
runs = metrics['result']['runs']

def test_error_key(run):
    # A run only carries an 'error' entry when its computation failed
    try:
        error = run['error']['message']
        return 'Unexpected column found' in error
    except (KeyError, TypeError):
        return False

# list() is needed in Python 3, where filter() returns an iterator
r = list(filter(test_error_key, runs))
if len(r) > 0:
    message = 'error: new columns detected in data'
else:
    message = 'ok'


It worked for me in a Jupyter notebook. You'll need to adjust it depending on your needs.
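
To make that message usable by a reporter, one possible sketch is a custom Python scenario step that stores it in a scenario variable (the variable name new_columns_status is an assumption, not a DSS built-in):

from dataiku.scenario import Scenario

# Expose the computed message to later steps and reporters;
# the variable name is arbitrary
Scenario().set_scenario_variables(new_columns_status=message)

A reporter's message template can then reference it as ${new_columns_status}.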

I hope it helps you, even if there is probably a better solution 😉

Best

Estelle
PM
Level 2
Author

Hi Estelle,

Thanks for this workaround.

May I suggest:

- Marking the 'Column Count' metric as a bug, because it analyzes the schema (saved metadata) instead of the file (like all the other metrics do)

or

- A feature request for a 'Physical Column Count' metric among the Status metrics


Regards!