Count columns on a file-based dataset (from the file, not from the schema)

PM Registered Posts: 10 ✭✭✭✭

Hi all,

I noticed that the 'Column Count' metric on the dataset Status tab is based on schema information, unlike the rest of the metrics, which analyze the file.

I need to compute the column count based on the file's data. This can be achieved in the Settings > Files tab by pressing the TEST button:

["Mismatch in number of columns: 2 in schema, 3 detected in data"]

But I need to do it automatically, without user interaction. Can it be computed with a check or Python code?

Thanks!

Answers

  • EstelleB Dataiker, Registered Posts: 6
    edited July 17

    Hello,

    You can get the column count with a Python probe this way:

    import dataiku

    def process(dataset, partition_id):
        # Load the dataset as a pandas DataFrame and count its columns
        df = dataset.get_dataframe()
        columns_count = len(df.columns)
        return {'columns_count': columns_count}

    Though I am not sure it will help you. To say more, I would need more context about what you want or need to do with this metric that you can't do with the one already provided.

    Have a nice day,

    Estelle

  • PM Registered Posts: 10 ✭✭✭✭

    What I need is to identify changes in the columns of a file. I mean, I defined a dataset from a file with 2 columns, but next week the file is modified with a new column (I'm not the file owner and I don't want to refresh the file schema every week), so I need to identify this change to know whether my jobs are still valid.

    The Column Count metric always shows 2 columns (because this is the number of columns when the dataset was defined, i.e. the SCHEMA). The code you provided does the same thing, counting the number of columns from the schema.
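
    One way to count the physical columns, independent of the saved schema, is to read the file's header line directly. Here is a minimal sketch in plain Python, assuming the dataset is backed by a single CSV file with a header row and that its path is known (the path and separator below are hypothetical):

    import csv

    def count_physical_columns(path, separator=","):
        # Parse only the header line of the raw file, bypassing
        # the dataset schema entirely
        with open(path, newline="") as f:
            header = next(csv.reader(f, delimiter=separator))
        return len(header)

    # Hypothetical path to the file backing the dataset
    print(count_physical_columns("/data/input/my_file.csv"))

    Comparing this value against the schema's column count would flag the kind of change described above.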

  • EstelleB Dataiker, Registered Posts: 6

    From what I understand, your flow takes as input a file whose number of columns will often change. But you only need 2 columns in it, which remain the same over time.

    In DSS, some recipes select all columns in a dataset by default, like the "Sync" or "Sample/Filter" recipes. This can require you to update the schema when it changes.

    To work with your flow input, you can instead choose a recipe where you can specify the columns you want to work with, like a "Prepare" recipe (with "Keep only these columns" as the first step), a "Group by" or "Join" recipe, or a SQL recipe if relevant. Then the number of columns in the output dataset will always stay the same and you won't need to update the schema.

    Does that meet your needs?

  • PM Registered Posts: 10 ✭✭✭✭

    No, that's not my problem.

    Excuse me if my explanations are not clear. I will try to explain it with an example, step by step:

    1. Define a dataset from a file (CSV). The file has 2 columns.

    2. Somebody modifies the file, adding a new column.

    3. My definition of the dataset is still the same as in step 1: a dataset with 2 columns. While exploring the dataset, if I go to Settings > Files > TEST, I can see there is a problem:

    [screenshot: the TEST result showing the column mismatch warning]

    4. I need to check this problem without manual interaction (without manually clicking the TEST button).

    The checks/metrics on the Status tab analyze the file, except for the Column Count metric, which analyzes the schema.

    [screenshot: the Status tab metrics, with Column Count based on the schema]

    I hope it is more clear now.

    Thanks again.

  • EstelleB Dataiker, Registered Posts: 6
    edited July 17

    Hello,

    Thank you for this detail; it helps me understand better.

    If you need to detect that a column was added to this file, you can go to Settings > Format/Preview.

    Then at the bottom, you'll find this option: "Add. columns behavior (read)".

    [screenshot: the "Add. columns behavior (read)" option]

    I personally don't know how to easily get a file's "Warnings" message, but here is what you can do while waiting for a better solution:

    1/ Select "Error". It means that your dataset won't be built when a new column is added, but you may be able to find a workaround.

    2/ Then, in a scenario, thanks to a scenario variable, you can add a reporter which sends you the 'message' computed by this Python step:

    import dataiku

    # Connect to the current project and dataset through the public API
    projectKey = dataiku.default_project_key()
    client = dataiku.api_client()
    project = client.get_project(projectKey)
    dataset = project.get_dataset("test_csv")

    # Recompute the dataset's metrics; failed runs carry an error message
    metrics = dataset.compute_metrics()
    runs = metrics['result']['runs']

    def test_error_key(run):
        # True if this run failed because an extra column was found in the data
        try:
            error = run['error']['message']
            result = 'Unexpected column found' in error
        except KeyError:
            result = False
        return result

    # list() is needed in Python 3, where filter() returns an iterator
    r = list(filter(test_error_key, runs))
    if len(r) > 0:
        message = 'error: new columns detected in data'
    else:
        message = 'ok'

    It worked for me in a Jupyter notebook. You'll need to adjust it depending on your needs.
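
    As mentioned in step 2/, the reporter needs the computed 'message' as a scenario variable. A minimal sketch of how to expose it, assuming the code above runs in a scenario "Execute Python code" step (the variable name column_check_message is an arbitrary choice):

    from dataiku.scenario import Scenario

    # Store the message as a scenario variable so a reporter can
    # reference it, e.g. as ${column_check_message}
    Scenario().set_scenario_variables(column_check_message=message)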

    I hope it helps you, even if there is probably a better solution.

    Best

  • PM Registered Posts: 10 ✭✭✭✭

    Hi Estelle,

    Thanks for this workaround.

    Can I suggest:

    - Marking the 'Column Count' metric as a bug, because it analyzes the schema (saved metadata) instead of the file (like all the other metrics)

    or

    - A feature request for a 'Physical Column Count' metric among the Status metrics (a sketch of what such a probe might look like follows below)

    Regards!
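
    While waiting for such a metric, a custom Python probe could approximate a physical column count by comparing the file's header against the saved schema. A sketch, under the assumptions that the dataset is stored as CSV files with a header row on a local filesystem connection and that get_location_info() exposes their directory path (adjust to your setup):

    import csv
    import os

    def process(dataset, partition_id):
        # Columns as recorded in the saved schema
        schema_count = len(dataset.read_schema())

        # Columns as present in the raw data: parse the header of the
        # first CSV file found under the dataset's path (assumed layout)
        path = dataset.get_location_info()['info']['path']
        csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
        with open(os.path.join(path, csv_files[0]), newline='') as f:
            physical_count = len(next(csv.reader(f)))

        return {
            'schema_columns': schema_count,
            'physical_columns': physical_count,
            'columns_match': int(schema_count == physical_count)
        }

    A check on the columns_match value could then fail the dataset build, or trigger a scenario reporter, whenever the file drifts from the schema.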
